National Sun Yat-sen University (國立中山大學), thesis/dissertation: A Clustering-based Approach to Document-Category Integration (以群集技術支援文件類別整合之研究)

Title	A Clustering-based Approach to Document-Category Integration (以群集技術支援文件類別整合之研究)
Department	Department of Information Management
Year, semester	Spring semester, Academic Year 91 (ROC calendar, 2002-2003)	Language	English
Degree	Ph.D.	Number of pages	105
Author	Tsang-Hsiang Cheng (鄭滄祥)
Advisor	Chih-Ping Wei (魏志平)
Convenor	Hsin-Hui Lin (林信惠)
Advisory Committee	Shin-Mu Tseng (曾新穆); Fu-Ren Lin (林福仁); Hsing Cheng (鄭興); Lee-Feng Chien (簡立峰); San-Yi Huang (黃三益)
Date of Exam	2003-07-30	Date of Submission	2003-09-04
Keywords	Catalog Integration, Document Category Integration, Naïve Bayes Classifier, Hierarchical Category Integration, Hierarchical Clustering, Document Clustering
Statistics	This thesis/dissertation has been browsed 5728 times and downloaded 2560 times.

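Chinese Abstract (English translation)
E-commerce applications generate large volumes of online textual information, and organizations and individuals continually acquire new information from a variety of channels. For ease of management and access, they commonly establish document categories to organize and manage their existing documents. Because the Internet is fast and convenient, organizations and individuals can easily obtain large numbers of documents online. When a collected document set carries the categorization that its source assigned to the documents, the question of how to exploit that existing categorization to integrate the documents quickly and accurately into the organization's or individual's existing catalog becomes the focus of this research. In the literature, categorization-based integration methods (for example, the Enhanced Naïve Bayes classifier, ENB) have been developed to make this integration task more efficient. However, such methods work effectively only under specific conditions: the new document set and the existing catalog must use homogeneous categorization schemes, and the existing catalog must already contain a substantial number of categorized documents. These conditions restrict where category integration can be applied. This study therefore uses clustering techniques to develop a category integration method, the Clustering-based Category Integration (CCI) technique, which relaxes these conditions and broadens the range of application. The empirical evaluation shows that CCI achieves higher integration accuracy than ENB whether the categorization schemes of the new document set and the existing catalog are identical, similar, or entirely different, and that CCI requires far fewer documents in the existing catalog than ENB does. As the number of documents grows, so does the number of categories used to manage them, and organizations and individuals usually arrange document categories in a tree-like hierarchy. Because CCI does not consider the hierarchical relationships among categories in the catalog or the document set, this study also develops a Clustering-based category-Hierarchy Integration (CHI) technique to integrate hierarchical document categories. The experimental results show that, when two category hierarchies with homogeneous or similar categorization schemes are integrated, CHI effectively improves integration accuracy.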
Abstract
E-commerce applications generate and consume a tremendous amount of online information, typically in the form of textual documents. Observations of how organizations and individuals manage textual documents suggest that categories (or category hierarchies) are widely used to organize, archive, and access documents. At the same time, an organization (or individual) constantly acquires new documents from various Internet sources. Consequently, integrating relevant categorized documents into the existing categories of the organization (or individual) becomes an important issue in the e-commerce era. The existing categorization-based approach to document-category integration (specifically, the Enhanced Naïve Bayes classifier) has several limitations, including the assumption that the master and source catalogs use homogeneous categorization schemes and the requirement for large master categories as training data. In this study, we developed a Clustering-based Category Integration (CCI) technique for integrating two document catalogs, each of which is organized non-hierarchically (i.e., as a flat set). With the Enhanced Naïve Bayes classifier as the benchmark, the empirical evaluation results showed that the proposed CCI technique appeared to improve document-category integration accuracy in different integration scenarios and seemed to be less sensitive to the size of the master categories than the categorization-based approach. Furthermore, to integrate document categories that are organized hierarchically, we proposed a Clustering-based category-Hierarchy Integration (CHI) technique, which extends the CCI technique to category-hierarchy integration. The empirical evaluation results showed that the CHI technique appeared to achieve higher hierarchical document-category integration accuracy than CCI under homogeneous and comparable scenarios.

Table of Contents
1. INTRODUCTION
    1.1 Background
    1.2 Analysis of Category Integration Problem
    1.3 Research Motivation and Objectives
    1.4 Organization of the Dissertation
2. LITERATURE REVIEW
    2.1 Naïve Bayes Classification for Category Integration
    2.2 Document Clustering
        2.2.1 Feature Extraction and Selection Phase
        2.2.2 Document Representation Phase
        2.2.3 Clustering Phase
3. DESIGN OF CLUSTERING-BASED CATEGORY INTEGRATION TECHNIQUE
    3.1 Overall Process of CCI
    3.2 Extraction of Categorization Scheme
    3.3 Source Category Decomposition
    3.4 Category Merging
    3.5 New Category Generation
4. EVALUATION OF CLUSTERING-BASED CATEGORY INTEGRATION (CCI) TECHNIQUE
    4.1 Collection of Document Corpora
    4.2 Types of Evaluation
    4.3 Evaluation Design and Results: Source-document Assignment Task
        4.3.1 Creation of Synthetic Catalogs
        4.3.2 Evaluation Procedure and Criteria
        4.3.3 Evaluation Results for Complete Coverage Integration
            4.3.3.1 Parameter Tuning for ENB
            4.3.3.2 Parameter Tuning Experiments for CCI
            4.3.3.3 Comparative Evaluation Results
            4.3.3.4 Effect of Data Size on Classification Accuracy
        4.3.4 Evaluation Results for Partial Coverage Integration
            4.3.4.1 Parameter Tuning for ENB
            4.3.4.2 Parameter Tuning for CCI
            4.3.4.3 Comparative Evaluation Results
            4.3.4.4 Effect of Data Size on Classification Accuracy
    4.4 Evaluation Design and Results: New Category Generation Task
        4.4.1 Creation of Synthetic Catalogs
        4.4.2 Evaluation Criteria
        4.4.3 Evaluation Results for New Category Generation
5. DESIGN OF CLUSTERING-BASED CATEGORY-HIERARCHY INTEGRATION TECHNIQUE
    5.1 Overall Process of CHI
    5.1 Extraction of Hierarchical Categorization Schemes
    5.2 Selection of Merging Target
    5.3 Category-hierarchy Merging
6. EVALUATION OF CLUSTERING-BASED CATEGORY-HIERARCHY INTEGRATION (CHI) TECHNIQUE
    6.1 Collection of Category Hierarchy and Document Corpus
    6.2 Evaluation Design
        6.2.1 Creation of Synthetic Category Hierarchies
        6.2.2 Evaluation Procedure and Criteria
    6.3 Evaluation Results
7. CONTRIBUTIONS AND FUTURE RESEARCH
    7.1 Contributions
    7.2 Future Research Directions
REFERENCES

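Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.
Thesis access permission: available on campus immediately; available off campus after one year.
Availability: on campus: available; off campus: available.
File: etd-0904103-201802.pdf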
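Printed copies
Public-availability information for printed theses is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Availability: available.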

Office of Library and Information Services, National Sun Yat-sen University │ Contact us: 2452, Thesis Format Review Team │ Development and operations: Knowledge Innovation Division, LIS