Title page for etd-0809106-221247
Title
多語言文件自動分類之研究
Poly-Lingual Text Categorization
Department
Year, semester
Language
Degree
Number of pages
64
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2006-07-17
Date of Submission
2006-08-09
Keywords
Text categorization, Document management, Text mining, Poly-lingual text categorization
Statistics
This thesis/dissertation has been browsed 5737 times and downloaded 1429 times.
Abstract
With the rapid emergence and proliferation of the Internet and the trend toward globalization, a tremendous number of textual documents written in different languages are electronically accessible online. Managing these documents efficiently and effectively is essential to organizations and individuals. Although poly-lingual text categorization (PLTC) can be approached as a set of independent monolingual classifiers, this naïve approach employs only the training documents of the same language to construct each monolingual classifier and fails to exploit the categorization information available in the training documents of the other languages. Moreover, existing poly-lingual text categorization methods consider the vocabulary of all languages at once, which introduces so much noise that their accuracy is even lower than that of monolingual text categorization. Motivated by the significance of and need for a poly-lingual text categorization technique, we propose a PLTC technique that takes into account all training documents of all languages when constructing a monolingual classifier for a specific language. Using the independent monolingual text categorization (MnTC) technique as our performance benchmark, our empirical evaluation shows that the proposed PLTC technique achieves higher classification accuracy than the benchmark on both the English and Chinese corpora. The results also suggest that the proposed PLTC technique is robust across the range of training sizes investigated.
Table of Contents
Chapter 1 Introduction 1
1.1 Background 1
1.2 Research Motivation and Objective 2
1.3 Organization of the Thesis 4
Chapter 2 Literature Review 5
2.1 Monolingual Text Categorization Techniques 5
2.2 Poly-lingual and Cross-lingual Text Categorization Techniques 7
Chapter 3 Design of Poly-lingual Text Categorization Technique 15
3.1 Bilingual Thesaurus Construction 16
3.2 Categorization Learning 19
3.3 Category Assignment 27
Chapter 4 Empirical Evaluation 29
4.1 Data Collection 29
4.2 Evaluation Procedure and Criteria 30
4.3 Performance Benchmark 31
4.4 Comparative Evaluations 31
4.5 Effects of ITF 40
4.6 Comparison with Monolingual Feature Reinforcement 42
4.7 Sensitivity to Size of Training Dataset 45
Chapter 5 Conclusion and Future Research Directions 50
References 52
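Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.
Thesis access permission: withheld for one year, then available both on and off campus
Available:
Campus: available
Off-campus: available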


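Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 onward. To look up this information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available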
