National Sun Yat-sen University, thesis/dissertation, Development of Personalized Document Clustering Technique for Accommodating Hierarchical Categorization Preferences

Title	Development of Personalized Document Clustering Technique for Accommodating Hierarchical Categorization Preferences (階層式個人化文件分群技術之研究)
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 94 (ROC calendar; 2005-2006)	Language	English
Degree	Master	Number of pages	53
Author	Kuan-yi Lee (李冠儀)
Advisor	Chih-ping Wei (魏志平)
Convenor	楊傳智
Advisory Committee	Wen-hsiang Lu (盧文祥)
Date of Exam	2006-07-17	Date of Submission	2006-07-27
Keywords	Hierarchical document management, Personalized document clustering, Text mining, Personalization, Document clustering
Statistics	This thesis has been browsed 5729 times and downloaded 6 times.

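Chinese Abstract (translated)
As information technology and the Internet advance, applications of e-commerce and knowledge management are increasing rapidly, and the volume of information that individuals and enterprises must deal with has grown enormously, most of it in the form of textual documents. To manage these large numbers of documents effectively, individuals and enterprises often classify them into single-level or multi-level categories to ease later retrieval and browsing; document clustering is one technique that helps with such document management. Document clustering is an act that embodies personal clustering preferences: each person groups documents according to his or her semantic understanding of a document and judgment about its category. An effective document-clustering technique must therefore take each person's clustering preferences into account so that the results meet individual needs, and must also be able to produce hierarchical clusters. Traditional document-clustering techniques, however, mainly analyze document content and so cannot produce clusters that match personal preferences. Moreover, most existing techniques produce a single-level clustering rather than a multi-level hierarchy. For these reasons, this study develops a hierarchical personalized document-clustering technique, abbreviated HPEC. The technique not only produces the clusters an individual needs according to his or her clustering preferences, but also produces them in hierarchical form. In the experimental evaluation, HPEC outperformed its benchmark technique (HAC+P) in cluster recall and achieved a similar level of cluster precision and location discrepancy.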
Abstract
With advances in information and networking technologies and the proliferation of e-commerce and knowledge management applications, individuals and organizations generate and acquire a tremendous amount of online information, typically in the form of textual documents. To manage this ever-increasing volume, an individual or organization frequently organizes documents into a set or hierarchy of categories to facilitate document management and subsequent information access and browsing. Document clustering is an intentional act that reflects individual preferences with regard to the semantic coherency and relevant categorization of documents. Hence, effective document clustering must consider individual preferences to support personalization in document categorization, and should be capable of organizing documents into a category hierarchy. However, document-clustering research has traditionally been anchored in analyses of document content. As a consequence, most existing document-clustering techniques are not tailored to individuals’ preferences and therefore cannot support personalization. In addition, existing techniques are generally designed to generate a flat set of document clusters from a document collection rather than a hierarchy of clusters. In response, this study develops a hierarchical personalized document-clustering (HPEC) technique that takes into account an individual’s folder hierarchy, which represents that individual’s categorization preferences, and produces document clusters in a hierarchical structure for the target individual. Our empirical evaluation results suggest that the proposed HPEC technique outperforms its benchmark technique (HAC+P) in cluster recall while maintaining the same level of cluster precision and location discrepancy.

Table of Contents
Chapter 1 Introduction
Chapter 2 Literature Review
  2.1 Content-based Document-Clustering
  2.2 Non–Content-Based and Hybrid Document-Clustering
  2.3 Partial-Clustering-Based Personalized Document-Clustering (PEC) Technique
Chapter 3 Design of Hierarchical Personalized Document-Clustering (HPEC) Technique
  3.1 Feature Extraction, Selection, and Consolidation
  3.2 Document Representation
  3.3 Clustering
Chapter 4 Empirical Evaluation
  4.1 Data Collection
  4.2 Evaluation Criteria
  4.3 Tuning the Representation Scheme and the Number of Features
  4.4 Comparative Evaluation Results
  4.5 Sensitivity of the Size of Partial Clustering
  4.6 Sensitivity of Cluster Size and Depth
Chapter 5 Conclusions
References

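Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: on campus, available one year after submission; off campus, never available.
Availability: Campus: available. Off-campus: not available.
Printed copies
Public-access information for printed theses is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services.
Availability: available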
