National Sun Yat-sen University, thesis/dissertation, An Ontology-Based Personalized Document Clustering Approach

Title	An Ontology-Based Personalized Document Clustering Approach
Department	Department of Information Management
Year, semester	Spring semester, Academic Year 92	Language	English
Degree	Master	Number of pages	72
Author	Tse-hsiu Huang (黃哲修)
Advisor	Chih-Ping Wei (魏志平)
Convenor	鄭興
Advisory Committee	胡仁華
Date of Exam	2004-07-27	Date of Submission	2004-08-05
Keywords	Document clustering, Hierarchical agglomerative clustering, Ontology learning, Ontology, Personalized document clustering, Ontology-based document clustering
Statistics	This thesis has been browsed 5811 times and downloaded 24 times.

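Chinese Abstract
With the growth of the Internet and the knowledge economy, individuals and organizations can rapidly produce and retrieve large amounts of information online, most of it as textual documents, and the need for electronic document management has grown accordingly. To cope with this volume, people habitually organize their files or documents using categories or folders. Moreover, individuals often apply different criteria and preferences when categorizing, which results in different document clusters. Given the need for automated document management and the importance of personalization, this study proposes an ontology-based personalized document clustering technique (OnPEC) that uses an individual's partial clustering and an ontology to achieve personalized document clustering. The partial clustering serves as the source from which the individual's categorization preferences are captured, while the ontology lifts document clustering from a feature-based to a concept-based approach. The clustering step adopts two methods: atomic-based HAC and pre-cluster-based HAC. Using a traditional document clustering technique and a previously proposed feature-based personalized document clustering technique (PEC) as performance benchmarks, the empirical results show that document clustering techniques that use an individual's partial clustering produce results closer to that individual's own clustering. In addition, in both OnPEC and PEC, pre-cluster-based HAC achieves better clustering performance than atomic-based HAC.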
Abstract
With the proliferation of electronic commerce and knowledge economy environments, individuals and organizations increasingly generate and consume large amounts of online information, typically available as textual documents. To manage this rapid growth in the number of textual documents, people often use categories or folders to organize them. These document grouping behaviors are intentional acts that reflect a person’s (or an organization’s) preferences with regard to semantic coherency, or relevant groupings between subjects. In this thesis, we design and implement an ontology-based personalized document clustering (OnPEC) technique that incorporates both an individual user’s partial clustering and an ontology into the document clustering process. The target user’s partial clustering supports the personalization of document categorization, whereas the ontology turns document clustering from a feature-based into a concept-based approach. In addition, we adopt two hierarchical agglomerative clustering (HAC) approaches (pre-cluster-based and atomic-based) in the proposed OnPEC technique. Using the clustering effectiveness achieved by a traditional content-based document clustering technique and previously proposed feature-based personalized document clustering (PEC) techniques as performance benchmarks, we find that the use of partial clusters improves document clustering effectiveness, as measured by cluster precision and cluster recall. Moreover, for both the OnPEC and PEC techniques, pre-cluster-based HAC methods greatly outperform atomic-based HAC methods in clustering effectiveness.
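The abstract reports effectiveness as cluster precision and cluster recall without defining them in this record. As a minimal sketch, assuming the pairwise (association-based) definitions that are common in document clustering evaluation, the two measures could be computed as below. The function names, the toy document identifiers, and the choice of pairwise definitions are illustrative assumptions, not taken from the thesis.

```python
from itertools import combinations


def associations(clusters):
    """Return the set of unordered document pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical (a, b) order.
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def cluster_precision_recall(generated, reference):
    """Pairwise cluster precision and recall (assumed definition).

    precision = correct pairs / pairs in the generated clusters
    recall    = correct pairs / pairs in the reference (user's) clusters
    """
    gen = associations(generated)
    ref = associations(reference)
    correct = gen & ref
    precision = len(correct) / len(gen) if gen else 0.0
    recall = len(correct) / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical example: a user's own folders vs. an automatic clustering.
reference = [{"d1", "d2", "d3"}, {"d4", "d5"}]
generated = [{"d1", "d2"}, {"d3", "d4", "d5"}]
precision, recall = cluster_precision_recall(generated, reference)
print(precision, recall)  # 0.5 0.5
```

In this toy case the generated clustering places d3 with the wrong group, so only two of its four document pairs agree with the reference, and only two of the four reference pairs are recovered.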

Table of Contents
CHAPTER 1 INTRODUCTION
  1.1 BACKGROUND
  1.2 RESEARCH MOTIVATIONS AND OBJECTIVE
  1.3 ORGANIZATION OF THE THESIS
CHAPTER 2 LITERATURE REVIEW
  2.1 GENERAL PROCESS OF CONTENT-BASED DOCUMENT CLUSTERING
  2.2 PERSONALIZED DOCUMENT CLUSTERING
  2.4 OVERVIEW OF ONTOLOGY
  2.5 ONTOLOGY-BASED TEXT CLUSTERING
CHAPTER 3 DESIGN OF ONTOLOGY-BASED PERSONALIZED DOCUMENT CLUSTERING (OnPEC) TECHNIQUE
  3.1 ONTOLOGY LEARNING
    3.1.1 Feature Extraction
    3.1.2 Concept Feature Selection
  3.2 ONTOLOGY-BASED PERSONALIZED DOCUMENT CLUSTERING PROCESS
    3.2.1 Feature Extraction
    3.2.2 Concept Mapping
    3.2.3 Concept Selection
    3.2.4 Concept-Based Document Representation
    3.2.5 Clustering
CHAPTER 4 EMPIRICAL EVALUATION OF THE ONTOLOGY-BASED PERSONALIZED DOCUMENT CLUSTERING (OnPEC) TECHNIQUE
  4.1 EVALUATION DESIGN
    4.1.1 Document Corpus and Personal Categorization Collection
    4.1.2 Document Corpus and Concept Hierarchy for Ontology Learning
    4.1.3 Evaluation Criteria
    4.1.4 Evaluation Procedure
  4.2 TUNING EXPERIMENTS
    4.2.1 Effects of Number of Features in Benchmark Techniques
    4.2.2 Parameter Tuning for the OnPEC Technique
  4.3 COMPARATIVE EVALUATION RESULTS
  4.4 SENSITIVITY OF THE SIZE OF PARTIAL CLUSTERS FOR CLUSTERING EFFECTIVENESS
CHAPTER 5 CONCLUSIONS AND FUTURE RESEARCH DIRECTION
REFERENCES
APPENDIX A: ACM COMPUTING CLASSIFICATION SYSTEM (CCS), 1998
APPENDIX B: ACM CCS CONCEPT HIERARCHY IN EMPIRICAL EVALUATION


Full Text
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: available on campus after one year; never available off campus. Availability: Campus: available; Off-campus: not available.
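Printed Copies
Public availability information for printed theses is relatively complete from Academic Year 102 onward. For information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. Availability: available.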


Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452, Thesis Format Review Team │ Development and operations: Knowledge Innovation Division, LIS