National Sun Yat-sen University, thesis/dissertation, Cross-Lingual Category Integration Technique

Title	Cross-Lingual Category Integration Technique (跨語言文件類別整合技術)
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 94	Language	English
Degree	Master	Number of pages	63
Author	Guo-han Tzeng (曾國翰)
Advisor	Chih-Ping Wei (魏志平)
Convenor	Christopher C. Yang (楊傳智)
Advisory Committee	Paul J. H. Hu (胡仁華)
Date of Exam	2006-07-18	Date of Submission	2006-08-30
Keywords	Category integration, Enhanced Naive Bayes, Cross-lingual category integration, Multi-lingual document, Text mining
Statistics	This thesis/dissertation has been browsed 5727 times and downloaded 7 times.

Abstract
With the growth of the Internet, many innovative and interesting applications have emerged from different countries, and e-commerce has become increasingly pervasive. As a result, a tremendous amount of information expressed in different languages is exchanged and shared by both organizations and individuals in the modern global environment. A large proportion of this information takes the form of textual documents that are managed using categories. Developing a practical and effective technique for cross-lingual category integration (CLCI) has therefore become an important issue. Several category integration techniques have been proposed, but all of them address category integration involving only monolingual documents. In response, this study combines the cross-lingual mechanisms used in existing cross-lingual text categorization techniques with an existing monolingual category integration technique (specifically, Enhanced Naive Bayes) and proposes a CLCI technique. Our empirical evaluation results show that the proposed CLCI technique is feasible and achieves high effectiveness.

Table of Contents
Chapter 1 Introduction
  1.1 Background
  1.2 Research Motivation and Objective
  1.3 Organization of the Thesis
Chapter 2 Literature Review
  2.1 Monolingual Category Integration Techniques
    2.1.1 Enhanced Naive Bayes (ENB) Technique
    2.1.2 Cluster Shrinkage Approach
    2.1.3 Co-Bootstrapping Technique
    2.1.4 Clustering-based Category Integration (CCI) Technique
  2.2 Cross-Lingual Text Categorization
    2.2.1 Prediction-corpus Translation-based CLTC Approach
    2.2.2 Training-corpus Translation-based CLTC Approach
Chapter 3 Design of Cross-Lingual Category Integration (CLCI) Technique
  3.1 Cross-Lingual Thesaurus Construction
  3.2 Translation Approaches
    3.2.1 Source-catalog Translation
    3.2.2 Master-catalog Translation
  3.3 Category Integration
Chapter 4 Empirical Evaluation
  4.1 Data Collection
  4.2 Evaluation Design
    4.2.1 Dimension of Evaluations and Creation of Synthetic Catalogs
    4.2.2 Evaluation Procedure and Criteria
  4.3 Evaluation Results
    4.3.1 Parameter Tuning Experiments
    4.3.2 Comparative Evaluation
    4.3.3 Effects of Data Size on Integration Accuracy
Chapter 5 Conclusion and Future Research Directions
References

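Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: open on campus after one year; never open off campus.
Available: Campus: available. Off-campus: not available.
Printed copies
Public information about printed theses is relatively complete from academic year 102 onward. To look up public information about printed theses from academic year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available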

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS