Thesis/Dissertation etd-0830106-152316: Detailed Information
Title page for etd-0830106-152316
Title
跨語言文件類別整合技術
Cross-Lingual Category Integration Technique
Department
Year, semester
Language
Degree
Number of pages
63
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2006-07-18
Date of Submission
2006-08-30
Keywords
Category integration, Enhanced Naive Bayes, Cross-lingual category integration, Multilingual document management, Text mining
Statistics
This thesis/dissertation has been viewed 5727 times and downloaded 7 times.
Abstract
With the growth of the Internet, many innovative and interesting applications have emerged from different countries, and e-commerce has become increasingly pervasive. As a result, organizations and individuals in today's global environment exchange and share a tremendous amount of information expressed in different languages. A large proportion of this information takes the form of textual documents that are managed using categories. Developing a practical and effective technique for cross-lingual category integration (CLCI) has therefore become an important issue. Several category integration techniques have been proposed, but all of them handle only monolingual documents. In this study, we combine the cross-lingual mechanisms used in existing cross-lingual text categorization techniques with an existing monolingual category integration technique (specifically, Enhanced Naive Bayes) to propose a CLCI solution. Our empirical evaluation shows that the proposed CLCI technique is feasible and achieves excellent effectiveness.
Table of Contents
Chapter 1 Introduction
1.1 Background
1.2 Research Motivation and Objective
1.3 Organization of the Thesis
Chapter 2 Literature Review
2.1 Monolingual Category Integration Techniques
2.1.1 Enhanced Naive Bayes (ENB) Technique
2.1.2 Cluster Shrinkage Approach
2.1.3 Co-Bootstrapping Technique
2.1.4 Clustering-based Category Integration (CCI) Technique
2.2 Cross-Lingual Text Categorization
2.2.1 Prediction-corpus Translation-based CLTC Approach
2.2.2 Training-corpus Translation-based CLTC Approach
Chapter 3 Design of Cross-Lingual Category Integration (CLCI) Technique
3.1 Cross-Lingual Thesaurus Construction
3.2 Translation Approaches
3.2.1 Source-catalog Translation
3.2.2 Master-catalog Translation
3.3 Category Integration
Chapter 4 Empirical Evaluation
4.1 Data Collection
4.2 Evaluation Design
4.2.1 Dimension of Evaluations and Creation of Synthetic Catalogs
4.2.2 Evaluation Procedure and Criteria
4.3 Evaluation Results
4.3.1 Parameter Tuning Experiments
4.3.2 Comparative Evaluation
4.3.3 Effects of Data Size on Integration Accuracy
Chapter 5 Conclusion and Future Research Directions
References
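Fulltext
The electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast the text without authorization.
Thesis access permission: available on campus after one year; never available off campus
Available:
Campus: available
Off-campus: not available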

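Printed copies
Public availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the library's printed-thesis service desk. We apologize for any inconvenience.
Availability: available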
