國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,跨語言文件自動分類之研究,Cross-Lingual Text Categorization

論文名稱 Title	跨語言文件自動分類之研究 Cross-Lingual Text Categorization
系所名稱 Department	資訊管理學系 Department of Information Management
畢業學年期 Year, semester	92 學年度第 2 學期 The spring semester of Academic Year 92	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	51
研究生 Author	林彥廷 Yen-Ting Lin
指導教授 Advisor	魏志平 Chih-Ping Wei
召集委員 Convenor	鄭興 Hsing Kenny Cheng
口試委員 Advisory Committee	胡仁華 Paul Jen-Hwa Hu
口試日期 Date of Exam	2004-07-27	繳交日期 Date of Submission	2004-07-29
關鍵字 Keywords	文字探勘、文件分類、文件管理、跨語言文件分類 Document management, Cross-lingual text categorization, Text categorization, Text mining

中文摘要
隨著網際網路服務與電子商務應用的快速發展與普及，產生了大量且能夠在網際網路上取得的資訊，而這些資訊通常為文字格式的文件。為了協助後續的存取和增加這些文件的效用，發展有效率與有效的技術來管理這些持續增加的文字文件，成為組織與個人的一項重要工作。在文件管理方面，傳統上人們習慣用類別的概念來整理其檔案或文件；然而，現存的文件分類技術主要著重在處理單語言文件。由於商業環境的全球化和網際網路技術的長足進步，組織與個人通常需要檢索與整理不同語言的文件，使得跨語言文件分類的需求與日俱增。基於上述跨語言文件分類技術的重要性與需求，本研究旨在設計兩種不同的文件類別指派方法，分別是individual-based方法及cluster-based方法。實驗結果顯示，本論文所提出的跨語言文件分類技術有優異的表現，同時，以cluster-based方法進行的跨語言文件分類結果優於individual-based的跨語言文件分類結果。
Abstract
With the emergence and proliferation of Internet services and e-commerce applications, a tremendous amount of information is accessible online, typically as textual documents. To facilitate subsequent access to and use of this information, the efficient and effective management of the ever-increasing volume of textual documents, specifically through text categorization, is essential to organizations and individuals. Existing text categorization techniques focus mainly on categorizing monolingual documents. However, with the globalization of business environments and advances in Internet technology, organizations and individuals often retrieve and archive documents in different languages, creating the need for cross-lingual text categorization. Motivated by the significance of and need for such a technique, this thesis designs a cross-lingual text categorization technique with two category assignment methods: individual-based and cluster-based. The empirical evaluation results show that the proposed technique performs well and that the cluster-based method outperforms the individual-based method.

目次 Table of Contents
CHAPTER 1. INTRODUCTION 1
1.1 BACKGROUND 1
1.2 RESEARCH MOTIVATION AND OBJECTIVES 2
1.3 ORGANIZATION OF THE THESIS 2
CHAPTER 2. LITERATURE REVIEW 4
2.1 CROSS-LINGUAL INFORMATION RETRIEVAL 4
2.2 THESAURUS CONSTRUCTION TECHNIQUES 6
2.3 TEXT CATEGORIZATION TECHNIQUES 9
2.4 EXISTING TECHNIQUES FOR CROSS-LINGUAL TEXT CATEGORIZATION 13
CHAPTER 3. DESIGN OF A CROSS-LINGUAL TEXT CATEGORIZATION TECHNIQUE 14
3.1 CROSS-LINGUAL THESAURUS CONSTRUCTION PHASE 15
3.2 TEXT CATEGORIZATION LEARNING PHASE 18
3.3 CATEGORY ASSIGNMENT PHASE OF TEXT CATEGORIZATION 20
3.3.1 Individual-based category assignment method 20
3.3.2 Cluster-based category assignment method 24
CHAPTER 4. EMPIRICAL EVALUATION OF CROSS-LINGUAL TEXT CATEGORIZATION 26
4.1 EVALUATION DESIGN 26
4.1.1 Data collection 26
4.1.2 Evaluation criteria 27
4.1.3 Evaluation procedure 28
4.2 EVALUATION RESULTS 28
4.2.1 Effects of monolingual text categorization 28
4.2.2 Parameter tuning experiments for cross-lingual text categorization 30
4.2.2.1 Parameter tuning for individual-based category assignment method 31
4.2.2.2 Parameter tuning for cluster-based category assignment method 33
4.2.3 Comparative evaluations 37
4.2.3.1 Classifying Chinese documents into English categorization 37
4.2.3.2 Classifying English documents into Chinese categorization 38
CHAPTER 5. CONCLUSION AND FUTURE RESEARCH DIRECTIONS 39
REFERENCES 41

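電子全文 Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.
論文使用權限 Thesis access permission: 校內一年後公開，校外永不公開 On campus: released after one year; off campus: never released
開放時間 Available: 校內 Campus: 已公開 available; 校外 Off-campus: 永不公開 not available
紙本論文 Printed copies
Public-availability information for printed copies is relatively complete from Academic Year 102 onward. For printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services.
開放時間 Available: 已公開 available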
