國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,以演進技術支援階層式文件類別管理,Evolutionary Approach for Supporting Document Category Hierarchy Management

論文名稱 Title	以演進技術支援階層式文件類別管理 Evolutionary Approach for Supporting Document Category Hierarchy Management
系所名稱 Department	資訊管理學系 Department of Information Management
畢業學年期 Year, semester	92 學年度第 1 學期 The fall semester of Academic Year 92	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	53
研究生 Author	吳明融 Ming-jung Wu
指導教授 Advisor	魏志平 Chih-ping Wei
召集委員 Convenor	黃三益 San-yi Huang
口試委員 Advisory Committee	邱兆民 Chao-min Chiu
口試日期 Date of Exam	2003-07-22	繳交日期 Date of Submission	2004-02-02
關鍵字 Keywords	類別演進、文件分群、資料探勘、階層式分類架構 Data Mining, Category Evolution, Document Clustering, Hierarchical Categorization

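中文摘要 Chinese Abstract
With the rise of the Internet and the increasingly frequent use of online applications, disseminating and obtaining information has become ever simpler and faster. As large volumes of documents and information circulate online, managing and using that information grows more important. Automatic document categorization (Document Clustering and Classification) is one of the most basic and effective management approaches and is already widely used on news sites, search engines, and similar websites. Past research on document categorization has mostly focused on improving algorithm efficiency and classification accuracy, overlooking the fact that as documents accumulate, the categories themselves shift, so that the original categories no longer fit. In terms of structure, a hierarchical category scheme is commonly used, especially for large document collections: a hierarchy lowers the time users spend searching and makes document management more efficient. Clearly, a suitable document category hierarchy embodies its builder's knowledge of the document domain and personal categorization preferences, and this information is highly useful to automatic document categorization techniques. The purpose of this study is to develop a data-mining-based technique, Category Hierarchy Evolution (CHE), to improve the quality of categories. Unlike the category discovery technique proposed by Arawal et al. (1999), CHE uses the categorization knowledge already present in the document repository, combined with the characteristics of the documents in each category, to reorganize the categories in an evolutionary manner, so that the category structure can be adjusted dynamically as documents accumulate and remains applicable. The empirical results show that the proposed CHE technique can improve parts of an existing category structure, is applicable to evolving document categories of varying quality, and improves the accuracy of document categorization.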
Abstract
Observations of how individuals and organizations manage textual documents suggest that categories (e.g., folders) are widely used to organize, archive, and access documents. Document grouping is an intentional act that reflects a user's preferential perspective on semantic coherency or relevant groupings between subjects. Although an existing category set or hierarchy becomes less adequate as new documents accumulate, it still preserves to some extent the user's preferential perspective on document grouping. Thus, when a new category set or hierarchy is derived, the one previously established by the user (i.e., the semantic coherency of the documents embedded in the existing category set or category hierarchy) should be taken into consideration. In this study, we propose an evolution-based technique, Category Hierarchy Evolution (CHE), for managing a category hierarchy rather than a flat category set. Specifically, in CHE, the overall similarity between two documents is measured not only by their content similarity but also by their location similarity in the existing category hierarchy. Our empirical evaluation results suggest that the proposed CHE technique outperformed the discovery-based technique (i.e., the traditional content-based document-clustering technique).
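The idea of combining content similarity with location similarity can be illustrated with a minimal Python sketch. This is not the thesis's actual formulation, which the abstract does not give. The cosine measure over raw term frequencies, the shared-prefix location measure, the weighted-sum combination, and the names content_similarity, location_similarity, overall_similarity, and alpha are all assumptions made for illustration only.

```python
import math
from collections import Counter


def content_similarity(tokens_a, tokens_b):
    """Cosine similarity between raw term-frequency vectors (assumed measure)."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares no content with anything
    return dot / (norm_a * norm_b)


def location_similarity(path_a, path_b):
    """Assumed measure: length of the shared path prefix in the existing
    hierarchy, divided by the length of the longer path."""
    shared = 0
    for node_a, node_b in zip(path_a, path_b):
        if node_a != node_b:
            break
        shared += 1
    longest = max(len(path_a), len(path_b))
    return shared / longest if longest else 1.0  # two empty paths: same location


def overall_similarity(tokens_a, path_a, tokens_b, path_b, alpha=0.5):
    """Weighted sum of the two measures; alpha is a hypothetical weight."""
    return (alpha * content_similarity(tokens_a, tokens_b)
            + (1 - alpha) * location_similarity(path_a, path_b))


if __name__ == "__main__":
    doc_a = ["tennis", "match", "score"]
    doc_b = ["golf", "match", "score"]
    path_a = ("Root", "Sports", "Tennis")
    path_b = ("Root", "Sports", "Golf")
    print(overall_similarity(doc_a, path_a, doc_b, path_b, alpha=0.5))
```

With alpha set to 1.0 the sketch reduces to purely content-based similarity, which corresponds to the discovery-based baseline the abstract compares against. Lower values of alpha give more weight to where the user originally filed the documents.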

目次 Table of Contents
CHAPTER 1. INTRODUCTION
  1.1 Research Background
  1.2 Research Motivation and Objective
  1.3 Organization of the Thesis
CHAPTER 2. LITERATURE REVIEW
  2.1 Document Clustering
    2.1.1 Feature Extraction and Selection Phase
    2.1.2 Document Representation Phase
    2.1.3 Clustering Phase
  2.2 Category Evolution (CE) Technique
    2.2.1 Category Decomposition
    2.2.2 Category Amalgamation
CHAPTER 3. DESIGN OF CATEGORY HIERARCHY EVOLUTION (CHE) TECHNIQUE
  3.1 Feature Extraction and Selection
  3.2 Document Representation
  3.3 Document Similarity Assessment
    3.3.1 Content Similarity
    3.3.2 Location Similarity
    3.3.3 Overall Similarity
  3.4 Category Hierarchy Formation
    3.4.1 Analysis of Scenarios of Using Category Hierarchy
    3.4.2 Algorithm of Category Hierarchy Formation
CHAPTER 4. EMPIRICAL EVALUATION
  4.1 Evaluation Design
    4.1.1 Data Collection
    4.1.2 Evaluation Criteria
    4.1.3 Evaluation Procedure
  4.2 Evaluation Results
    4.2.1 Determining Appropriate Values for Parameters
    4.2.2 Comparative Evaluation of CHE Technique
    4.2.3 Effects of Location Similarity Measures
CHAPTER 5. CONCLUSION AND FUTURE RESEARCH DIRECTIONS
REFERENCES


電子全文 Fulltext
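This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: unrestricted (fully open on and off campus). Available: Campus: available; Off-campus: available. File: etd-0202104-190002.pdf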
紙本論文 Printed copies
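Public-access information for printed theses is relatively complete from Academic Year 102 onward. To look up public-access information for printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. Available: available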

