National Sun Yat-sen University, thesis/dissertation, Mining-Based Category Evolution for Text Databases

Title	Mining-Based Category Evolution for Text Databases (以資料探勘為基礎之文件類別演進技術)
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 88 (ROC calendar, 1999-2000)	Language	English
Degree	Master	Number of pages	68
Author	Yuan-Xin Dong (董元昕)
Advisor	Chih-Ping Wei (魏志平); Nian-Shing Chen (陳年興)
Convenor	Fu-Ren Lin (林福仁)
Advisory Committee
Date of Exam	2000-07-14	Date of Submission	2000-07-18
Keywords	Text Categorization, Clustering, Category Management, Category Evolution
Statistics	This thesis/dissertation has been browsed 5756 times and downloaded 4362 times.

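Chinese Abstract
With the rise of the Internet and the frequent use of online applications, disseminating and obtaining information has become ever simpler and faster. As large volumes of documents and information circulate online, managing and applying that information is increasingly important, and automatic text categorization is already widely used on news sites, search engines, and similar websites. Past research on text categorization has mostly emphasized improving algorithmic efficiency and classification accuracy, while overlooking that, as documents keep accumulating, the categories themselves shift and the original categories become unsuitable. The purpose of this study is to develop a mining-based category evolution (MiCE) technique that improves the quality of the categories. Unlike the category discovery technique proposed by Agrawal et al. (1999), MiCE uses the categorization knowledge already present in the document repository and, based on the characteristics of the documents in each category, evolves the categories by decomposing and merging them. The empirical results show that MiCE produces better categories than the traditional category discovery technique, is applicable to evolving categories of varying quality, and improves text categorization accuracy.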
Abstract
As text repositories grow in number and size and global connectivity improves, the amount of online information in the form of free-format text is growing extremely rapidly. In many large organizations, huge volumes of textual information are created and maintained, and there is a pressing need to support efficient and effective information retrieval, filtering, and management. Text categorization is essential to the efficient management and retrieval of documents. Past research on text categorization has mainly focused on developing or adopting statistical classification or inductive learning methods that automatically discover text categorization patterns from a training set of manually categorized documents. However, as documents accumulate, the pre-defined categories may no longer capture the characteristics of the documents. In this study, we propose a mining-based category evolution (MiCE) technique that adjusts the categories based on the existing categories and their associated documents. According to the empirical evaluation results, MiCE was more effective than the discovery-based category management approach, was insensitive to the quality of the original categories, and was capable of improving classification accuracy.

Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background
  1.2 Research Motivation and Objective
  1.3 Organization of the Thesis
Chapter 2 Literature Review
  2.1 Text Categorization
    2.1.1 Preprocessing Step
    2.1.2 Representation Step
    2.1.3 Induction Step
  2.2 Discovery-based Category Management Approach
Chapter 3 Mining-based Category Evolution (MiCE) Technique
  3.1 High-level Architecture of Text Categorization with Categorization Evolution
  3.2 Algorithm of Mining-based Category Evolution (MiCE) Technique
    3.2.1 Category Decomposition
    3.2.2 Category Merging
  3.3 Complete MiCE Algorithm
Chapter 4 Evaluations of Mining-based Category Evolution
  4.1 Test Data Set
  4.2 Effectiveness of Category Evolution
    4.2.1 Evaluation Procedure
    4.2.2 Evaluation Criteria
    4.2.3 Result
  4.3 Sensitivity to the Quality of Categories
  4.4 Effect of Category Evolution on Categorization Accuracy
    4.4.1 Evaluation Procedure
    4.4.2 Results
Chapter 5 Conclusions and Future Research Directions
References

References
[A73] Anderberg, M. R., Cluster Analysis for Applications, Academic Press, Inc., 1973.
[ABN92] Anwar, T. M., Beck, H. W., and Navathe, S. B., “Knowledge Mining by Imprecise Querying: A Classification-Based Approach,” Proceedings of the Eighth International Conference on Data Engineering, Feb. 1992, pp. 622-630.
[ABS99] Agrawal, R., Bayardo, R., and Srikant, R., “Athena: Mining-based Interactive Management of Text Databases,” Proceedings of the Seventh Conference on Extending Database Technology, July 1999.
[ADW94] Apté, C., Damerau, F., and Weiss, S. M., “Automated Learning of Decision Rules for Text Categorization,” ACM Transactions on Information Systems, Vol. 12, No. 3, July 1994, pp. 233-251.
[AGIIS92] Agrawal, R., Ghosh, S., Imielinski, T., Iyer, B., and Swami, A., “An Interval Classifier for Database Mining Applications,” Proceedings of the 18th International Conference on Very Large Data Bases, Aug. 1992, pp. 560-573.
[B92] Brill, E., “A Simple Rule-Based Part of Speech Tagger,” Proceedings of the Third Conference on Applied Natural Language Processing, ACL, Trento, Italy, 1992.
[B94] Brill, E., “Some Advances in Rule-Based Part of Speech Tagging,” Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA, 1994.
[B95] Boll, E. M., “Analysis of Rule Sets Generated by the CN2, ID3, and Multiple Convergence Symbolic Learning Methods,” Proceedings of the 1995 ACM 23rd Annual Conference on Computer Science Conference, 1995, pp. 48-55.
[BFOS84] Breiman, L., Friedman, J., Olshen, R., and Stone, C., Classification and Regression Trees, Wadsworth, 1984.
[BL97] Berry, M. J. A., and Linoff, G., Data Mining Techniques: For Marketing, Sales and Customer Support, Wiley, 1997.
[BP91] Brunk, C., and Pazzani, M., “Noise-tolerant Relational Concept Learning Algorithms,” Proceedings of the 8th International Workshop on Machine Learning, Ithaca, NY, 1991.
[C93] Cohen, W. W., “Efficient Pruning Methods for Separate-and-Conquer Rule Learning Systems,” Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambery, France, 1993.
[CFS94] Apté, C., Damerau, F., and Weiss, S. M., “Automated Learning of Decision Rules for Text Categorization,” ACM Transactions on Information Systems, Vol. 12, No. 3, July 1994, pp. 233-251.
[CH89] Church, K. W., and Hanks, P., “Word Association Norms, Mutual Information, and Lexicography,” Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 1989, pp. 76-83.
[CHY96] Chen, M. S., Han, J., and Yu, P. S., “Data Mining: An Overview from A Database Perspective,” IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996, pp. 866-883.
[CN89] Clark, P., and Niblett, T., “The CN2 Induction Algorithm,” Machine Learning, Vol. 3, 1989, pp. 261-283.
[CS96a] Cheeseman, P., and Stutz, J., “Bayesian Classification (AutoClass): Theory and Results,” Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), AAAI/MIT Press, 1996, pp. 153-180.
[CS96b] Cohen, W. W., and Singer, Y., “Learning to Query the Web,” Proceedings of the 13th National Conference on Artificial Intelligence, Portland, OR, 1996.
[CS99] Cohen, W. W., and Singer, Y., “Context-sensitive Learning Methods for Text Categorization,” ACM Transactions on Information Systems, Vol. 17, No. 2, April 1999, pp. 141-173.
[D76] Dudani, S. A., “The Distance-Weighted k-Nearest-Neighbor Rule,” IEEE Transactions on Systems, Man and Cybernetics, Vol. SMC-6, No. 4, April 1976, pp. 325-327.
[DH73] Duda, R. O., and Hart, P. E., Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[DPH98] Dumais, S., Platt, J., Heckerman, D., and Sahami, M., “Inductive Learning Algorithms and Representations for Text Categorization,” Proceedings of the 1998 ACM 7th International Conference on Information and Knowledge Management (CIKM '98), 1998, pp. 148-155.
[EP96] Elder IV, J., and Pregibon, D., “A Statistical Perspective on Knowledge Discovery in Databases,” Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), AAAI/MIT Press, 1996, pp. 83-115.
[FW94] Fürnkranz, J., and Widmer, G., “Incremental Reduced Error Pruning,” Proceedings of the 11th Annual Conference on Machine Learning, New Brunswick, NJ, 1994.
[G96] Gaines, B. R., “Transforming Rules and Trees into Comprehensible Knowledge Structures,” Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), AAAI/MIT Press, 1996, pp. 205-228.
[K96] Klösgen, W., “Explora: A Multipattern and Multistrategy Discovery Assistant,” Advances in Knowledge Discovery and Data Mining, U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), AAAI/MIT Press, 1996, pp. 249-271.
[KR90] Kaufman, L., and Rousseeuw, P. J., Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, Inc., New York, NY, 1990.
[L92a] Lewis, D. D., “An Evaluation of Phrasal and Clustered Representations on A Text Categorization Task,” Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1992, pp. 37-50.
[L92b] Lewis, D. D., “Feature Selection and Feature Extraction for Text Categorization,” Proceedings of the Speech and Natural Language Workshop, 1992, pp. 212-217.
[LH98] Lam, W., and Ho, C. Y., “Using A Generalized Instance Set for Automatic Text Categorization,” Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 81-89.
[LSL95] Lu, H., Setiono, R., and Liu, H., “NeuroRule: A Connectionist Approach to Data Mining,” Proceedings of the 21st International Conference on Very Large Data Bases, Sept. 1995, pp. 478-489.
[MAR96] Mehta, M., Agrawal, R., and Rissanen, J., “SLIQ: A Fast Scalable Classifier for Data Mining,” Proceedings of the International Conference on Extending Database Technology (EDBT’96), Avignon, France, Mar. 1996.
[MMHL86] Michalski, R., Mozetic, I., Hong, J., and Lavrac, N., “The Multi-purpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains,” Proceedings of the AAAI-86, Menlo Park, CA, 1986, pp. 1041-1045.
[NGL97] Ng, H. T., Goh, W. B., and Low, K. L., “Feature Selection, Perceptron Learning, and A Usability Case Study for Text Categorization,” Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1997, pp. 67-73.
[NH94] Ng, R., and Han, J., “Efficient and Effective Clustering Methods for Spatial Data Mining,” Proceedings of the International Conference on Very Large Data Bases, Santiago, Chile, Sept. 1994, pp. 144-155.
[P91] Piatetsky-Shapiro, G., “Discovery, Analysis, and Presentation of Strong Rules,” Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W. J. Frawley (Eds.), AAAI/MIT Press, 1991, pp. 229-238.
[PH90] Pagallo, M., and Haussler, D., “Boolean Feature Discovery in Empirical Learning,” Machine Learning, Vol. 5, No. 1, March 1990, pp. 71-99.
[Q86] Quinlan, J. R., “Induction of Decision Trees,” Machine Learning, Vol. 1, 1986, pp. 81-106.
[Q90] Quinlan, J. R., “MDL and Categorical Theories (continued),” Proceedings of the 12th International Conference on Machine Learning, Lake Tahoe, CA, 1995.
[Q93] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[RH96] Riloff, E., and Hollaar, L., “Text Databases and Information Retrieval,” ACM Computing Surveys, Vol. 28, No. 1, March 1996.
[RK98] Ragas, H., and Koster, C., “Four Text Classification Algorithms Compared in a Dutch Corpus,” Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 369-370.
[S80] Späth, H., Cluster Analysis Algorithms: For Data Reduction and Classification of Objects, John Wiley & Sons, Inc., New York, 1980.
[SHP95] Schütze, H., Hull, D. A., and Pedersen, J. O., “A Comparison of Classifiers and Document Representations for the Routing Problem,” Proceedings of the 18th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, pp. 229-237.
[T99] Tu, H. L., “Automatic Categorization of News Using Title Analysis,” Unpublished Master Thesis, National Tsing-Hua University, Taiwan, R.O.C., July 1999 (in Chinese).
[V93] Voutilainen, A., “NPtool, a detector of English noun phrases,” Proceedings of the Workshop on Very Large Corpora, Ohio, June 1993.
[WGT90] Weiss, S., Galen, R., and Tadepalli, P., “Maximizing the Predictive Value of Production Rules,” Artificial Intelligence, Vol. 45, 1990, pp. 47-71.
[WI93] Weiss, S., and Indurkhya, N., “Optimized Rule Induction,” IEEE Expert, Vol. 8, No. 6, 1993, pp. 61-69.
[WK91] Weiss, S. M., and Kulikowski, C. A., Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems, Morgan Kaufmann, 1991.
[Y94] Yang, Y., “Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval,” Proceedings of the 17th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp. 13-22.
[YC94] Yang, Y., and Chute, C. G., “An Example-Based Mapping Method for Text Categorization and Retrieval,” ACM Transactions on Information Systems, Vol. 12, No. 3, July 1994, pp. 252-277.
[YPC98] Yang, Y., Pierce, T., and Carbonell, J., “A Study on Retrospective and Online Event Detection,” Proceedings of SIGIR ’98: 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, New York, 1998, pp. 28-36.
[YCB99] Yang, Y., Carbonell, J. G., Brown, R. D., Pierce, T., Archibald, B. T., and Liu, X., “Learning Approaches for Detecting and Tracking News Events,” IEEE Intelligent Systems and Their Applications, 1999, pp. 32-43.
[Z94] Ziarko, W., Rough Sets, Fuzzy Sets and Knowledge Discovery, Springer-Verlag, 1994.

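Full Text
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid breaking the law. Thesis access permission: unrestricted (fully open on and off campus). Available: Campus: available; Off-campus: available. master_thesis_v11.pdf
Printed copies
Public information about printed copies is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available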



Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS