Title page for etd-0814103-140550
Title
個人化文件分群:技術發展與實證評估
Personalized Document Clustering: Technique Development and Empirical Evaluation
Department
Year, semester
Language
Degree
Number of pages
51
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2003-07-22
Date of Submission
2003-08-14
Keywords
Supervised Document Clustering, Personalized Document Clustering, Document Clustering, Hierarchical Agglomerative Clustering
Statistics
This thesis has been viewed 5757 times and downloaded 2519 times.
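Abstract (translated from Chinese)
Advances in information technology and the Internet make it easy for people to obtain large amounts of the information they need online (usually as text documents), so the number of electronic documents that must be managed grows daily. For document management, people have traditionally organized their files and documents by category. However, as electronic documents multiply, managing document categories by hand costs users considerable time and effort. Many users therefore need a tool that clusters and manages documents automatically. Moreover, classification criteria and preferences differ from person to person. Given the need for automated document clustering and the importance of personalization, we propose personalized document clustering techniques to meet individuals' needs for automated document management. To achieve personalized document clustering, this thesis uses an individual's partial clustering as the source from which the user's categorization preferences are captured, where a partial clustering is the grouping that the user provides for a subset of the documents. For document representation we propose two methods, feature refinement and feature weighting, and for the clustering process we propose two methods, pre-cluster-based HAC and atomic-based HAC. With a traditional document clustering technique as the benchmark, the empirical results show that the four proposed personalized document clustering techniques come closer to individuals' own clusterings than the traditional technique does. Among the four proposed techniques, pre-cluster-based HAC clusters better than atomic-based HAC, and feature weighting yields better clustering results than feature refinement.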
Abstract
With the spread of electronic commerce and the knowledge economy, both organizations and individuals generate and consume large amounts of online information, typically available as textual documents. To manage the ever-increasing volume of documents, organizations and individuals typically organize them into categories to facilitate document management and subsequent information access and browsing. However, document grouping behaviors are intentional acts that reflect an individual’s (or organization’s) preferential perspective on semantic coherency or relevant groupings between subjects. Effective document clustering therefore needs to accommodate this preferential perspective and support personalized document clustering. In this thesis, we designed and implemented a personalized document clustering approach that incorporates an individual’s partial clustering into the document clustering process. Combining two document representation methods (feature refinement and feature weighting) with two clustering processes (pre-cluster-based and atomic-based) yields four personalized document clustering techniques. Using the clustering effectiveness of a traditional content-based document clustering technique as the performance benchmark, our evaluation results suggest that using partial clusterings improves document clustering effectiveness. Moreover, the pre-cluster-based technique outperforms the atomic-based one, and the feature weighting method for document representation achieves higher clustering effectiveness than the feature refinement method.
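The abstract does not spell out how a partial clustering enters the clustering process. As a rough illustration only, the sketch below shows one plausible reading: hierarchical agglomerative clustering (HAC) that starts from the user's groups instead of from singletons. The function names (`seeded_hac`, `group_average`, `cosine`), the cosine similarity, and the group-average linkage are assumptions made for this example, not the thesis's actual pre-cluster-based or atomic-based methods.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_average(c1, c2, vectors):
    """Average pairwise cosine similarity between two clusters of indices."""
    total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def seeded_hac(vectors, partial_clusters, k):
    """Hypothetical HAC that is seeded with a user's partial clustering.

    vectors: list of document term vectors
    partial_clusters: lists of document indices the user grouped together
    k: number of clusters at which merging stops
    """
    seeded = {i for group in partial_clusters for i in group}
    # User-provided groups start as ready-made clusters; every other
    # document starts as a singleton. Merging never splits a seed group.
    clusters = [list(group) for group in partial_clusters]
    clusters += [[i] for i in range(len(vectors)) if i not in seeded]
    while len(clusters) > max(k, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = group_average(clusters[a], clusters[b], vectors)
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy data: documents 0-1 share one topic, 2-3 another; document 4 leans
# toward the second topic, but the user has filed it with document 0.
docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.1], [0.4, 0.6, 0]]
personalized = seeded_hac(docs, [[0, 4]], 2)
content_only = seeded_hac(docs, [], 2)
```

On this toy data, the content-only run places document 4 with documents 2 and 3, while the seeded run keeps it with document 0, as the user's grouping requires.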
Table of Contents
CHAPTER 1 INTRODUCTION 1
1.1 BACKGROUND 1
1.2 RESEARCH MOTIVATION AND OBJECTIVES 2
1.3 ORGANIZATION OF THE THESIS 4
CHAPTER 2 LITERATURE REVIEW 5
2.1 GENERAL PROCESS OF CONTENT-BASED DOCUMENT CLUSTERING 5
2.2 MAJOR CLUSTERING APPROACHES 8
2.3 SUPERVISED DOCUMENT CLUSTERING TECHNIQUE 11
CHAPTER 3 DESIGN OF PERSONALIZED DOCUMENT CLUSTERING 15
3.1 FEATURE EXTRACTION, SELECTION, AND CONSOLIDATION PHASE 16
3.1.1 Feature Extraction 16
3.1.2 Feature Selection 16
3.1.3 Feature Consolidation 20
3.2 DOCUMENT REPRESENTATION PHASE 21
3.3 CLUSTERING PHASE 22
3.3.1 Atomic-based HAC Method 23
3.3.2 Pre-cluster-based HAC Method 25
CHAPTER 4 EMPIRICAL EVALUATION FOR PERSONALIZED DOCUMENT CLUSTERING 28
4.1 EVALUATION DESIGN 28
4.1.1 Collection of Data Sets 28
4.1.2 Personal Categorization Data Collection 29
4.1.3 Evaluation Criteria 31
4.1.4 Evaluation Procedure 32
4.2 EVALUATION RESULT 33
4.2.1 Effect of Number of Features 33
4.2.2 Comparative Evaluation Results 38
4.2.3 Sensitivity of Size of Partial Clustering on Clustering Effectiveness 41
CHAPTER 5 CONCLUSION AND FUTURE RESEARCH DIRECTIONS 45
REFERENCES 48
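Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.
Thesis access permission: available on campus immediately; available off campus after one year
Available:
Campus: available
Off-campus: available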


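Printed copies
Public-access information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up public-access information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available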
