博碩士論文 etd-0721105-122705 詳細資訊
Title page for etd-0721105-122705
論文名稱
Title
跨語言文件自動分類之研究:以翻譯訓練文集建立跨語言分類之方法
Cross-Lingual Text Categorization: A Training-corpus Translation-based Approach
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
49
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2005-07-20
繳交日期
Date of Submission
2005-07-21
關鍵字
Keywords
文件管理、文件分類、文件探勘、跨語言文件分類
Text mining, Document management, Cross-lingual text categorization, Text categorization
統計
Statistics
本論文已被瀏覽 5746 次,被下載 7
The thesis/dissertation has been browsed 5746 times, has been downloaded 7 times.
中文摘要
文件分類技術可以自動化的從已經分類好的訓練文件中學習出分類模式,藉由所學出的分類基準,將未分類的文件歸類到正確的類別之中。現存的文件分類技術只能處理單語言的文件,也就是不論是訓練文件以及測試文件中的所有文件必須是以同一種語言撰寫而成。然而因為網際網路的發達,以及受到全球企業環境的影響,不論是個人與組織都會使用不同語言的文件,進而需要對其建檔與歸類,因此跨語言文件分類就有其需求存在。在現存跨語文件分類的研究中,都是採用翻譯預測端文件之策略,因此無法系統化的降低翻譯帶來的雜訊,也局限了跨語分類之效能。為了解決翻譯預測端文件之策略的侷限性,本研究提出翻譯訓練端文件的跨語言文件分類之方法,透過分類器具有的一般化能力以降低翻譯雜訊。實證結果顯示,本研究所提出的跨語文件分類技術有效的降低翻譯雜訊帶來的影響,也較現有的技術達到更高的分類準確度。
Abstract
Text categorization deals with automatically learning a categorization model from a training set of preclassified documents on the basis of their contents, and with assigning unclassified documents to appropriate categories. Most existing text categorization techniques deal with monolingual documents (i.e., all documents are written in one language) during both model learning and category assignment (or prediction). However, with the globalization of business environments and advances in Internet technology, an organization or individual often generates or acquires, and subsequently archives, documents in different languages, thus creating the need for cross-lingual text categorization (CLTC). Existing studies on CLTC focus on the prediction-corpus translation-based approach, which lacks a systematic mechanism for reducing translation noise and thus limits cross-lingual categorization effectiveness. Motivated by the need to provide more effective CLTC support, we design a training-corpus translation-based CLTC approach. Using the prediction-corpus translation-based approach as the performance benchmark, our empirical evaluation results show that our proposed CLTC approach achieves significantly better classification effectiveness than the benchmark approach does in both Chinese
目次 Table of Contents
Chapter 1 Introduction 1
1.1. Background 1
1.2. Research Motivation and Objective 3
1.3. Organization of the Thesis 4
Chapter 2 Literature Review 6
2.1. Monolingual Text Categorization Techniques 6
2.2. Prediction-Corpus Translation-based CLTC Techniques 8
Chapter 3 Design of Training-corpus Translation-based CLTC Approach 14
3.1. Bilingual Thesaurus Construction 16
3.2. Training-Corpus Translation (into L2) 18
3.3. Categorization Learning (in L2) 20
3.4. Category Assignment (in L2) 22
3.4.1. Individual-based Category Assignment Method 22
3.4.2. Cluster-based Category Assignment Method 23
Chapter 4 Empirical Evaluation 27
4.1. Data Collection 27
4.2. Evaluation Procedure and Criteria 28
4.3. Performance Benchmarks 29
4.4. Parameter Tuning Experiments and Results 29
4.5. Comparative Evaluations 35
Chapter 5 Conclusion and Future Research Directions 38
References 40
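The Chapter 3 outline above (thesaurus construction, training-corpus translation into L2, categorization learning in L2, category assignment in L2) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy thesaurus, the sample documents, the function names, and the choice of a multinomial Naive Bayes learner are hypothetical, since this record does not state which learner or thesaurus-construction method the thesis uses. The point is only to show the direction of translation: the preclassified L1 training documents are translated, and the L2 documents to be classified are left untouched.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy bilingual thesaurus (L1 = English, L2 = Chinese).
# Ambiguous entries such as "bank" stand in for translation noise.
THESAURUS = {
    "stock": ["股票"],
    "market": ["市場"],
    "bank": ["銀行", "河岸"],
    "match": ["比賽"],
    "team": ["球隊"],
    "goal": ["進球", "目標"],
}

def translate(tokens, thesaurus):
    """Map each L1 token to all of its candidate L2 terms; drop unknown tokens."""
    return [l2 for tok in tokens for l2 in thesaurus.get(tok, [])]

def train_naive_bayes(docs):
    """Learn per-category term counts from (tokens, label) pairs.

    Naive Bayes is an assumed stand-in learner, not the thesis's stated method.
    """
    term_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return term_counts, label_counts, vocab

def classify(tokens, model):
    """Return the category with the highest Laplace-smoothed log-probability."""
    term_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(term_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # ignore terms never seen in training
                score += math.log((term_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Preclassified training corpus in L1 (hypothetical sample data).
train_l1 = [
    (["stock", "market", "bank"], "finance"),
    (["bank", "stock"], "finance"),
    (["match", "team", "goal"], "sports"),
    (["team", "goal"], "sports"),
]

# Translate the training corpus into L2, then learn the model in L2.
train_l2 = [(translate(toks, THESAURUS), label) for toks, label in train_l1]
model = train_naive_bayes(train_l2)

# Category assignment works directly on untranslated L2 documents.
print(classify(["股票", "銀行"], model))  # finance
print(classify(["球隊", "比賽"], model))  # sports
```

In this toy setup the spurious translation "河岸" (riverbank) ends up in the finance training data, yet the learner still separates the two categories because the noisy term is outweighed by consistent ones. The sketch classifies each document on its own and does not attempt to reproduce the individual-based or cluster-based assignment methods listed in Section 3.4.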
電子全文 Fulltext
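This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission: available on campus after one year; never available off campus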
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus:永不公開 not available


紙本論文 Printed copies
開放時間 available 已公開 available
