Title page for etd-0808106-220124
Title
以字詞翻譯為基礎之多語言文件自動分群技術
Feature Translation-based Multilingual Document Clustering Technique
Department
Year, semester
Language
Degree
Number of pages
55
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2006-07-17
Date of Submission
2006-08-08
Keywords
multilingual document clustering, document clustering, text mining
Abstract
Document clustering automatically organizes a document collection into distinct groups of similar documents on the basis of their contents. Most existing document clustering techniques deal only with monolingual documents (i.e., collections in which every document is written in the same language). However, with the trend of globalization and advances in Internet technology, an organization or individual often generates or acquires, and subsequently archives, documents in different languages, which creates the need for multilingual document clustering (MLDC). Motivated by the significance of this need, this study designs a feature translation-based MLDC technique. Our empirical evaluation shows that the proposed technique achieves satisfactory clustering effectiveness, as measured by both cluster recall and cluster precision.
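The abstract names cluster recall and cluster precision but this record does not define them. The following is a minimal Python sketch assuming the pairwise "association" definition that is common in the document clustering literature: an association is a pair of documents placed in the same cluster, recall is the share of true associations that the generated clusters recover, and precision is the share of generated associations that are true. The function names and the toy document identifiers are hypothetical, and the thesis itself may define the measures differently.

```python
from itertools import combinations


def associations(clusters):
    """Return the set of unordered document pairs that share a cluster."""
    pairs = set()
    for members in clusters:
        # Sort so each pair has one canonical ordering.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def cluster_recall_precision(true_clusters, generated_clusters):
    """Pairwise cluster recall and precision (assumed definition)."""
    true_pairs = associations(true_clusters)
    gen_pairs = associations(generated_clusters)
    correct = true_pairs & gen_pairs
    recall = len(correct) / len(true_pairs) if true_pairs else 0.0
    precision = len(correct) / len(gen_pairs) if gen_pairs else 0.0
    return recall, precision


# Toy example with hypothetical English ("en") and Chinese ("zh") documents.
true_clusters = [{"en1", "zh1", "en2"}, {"zh2", "en3"}]
generated_clusters = [{"en1", "zh1"}, {"en2", "zh2", "en3"}]
recall, precision = cluster_recall_precision(true_clusters, generated_clusters)
print(recall, precision)
```

In this toy case the true and generated clusterings each contain four associations and share two of them, so both measures come out at 0.5.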
Table of Contents
Chapter 1 Introduction
1.1 Background
1.2 Research Motivation and Objectives
1.3 Organization of the Thesis
Chapter 2 Literature Review
2.1 Monolingual Document Clustering Technique
2.2 Multilingual Document Clustering (MLDC) Technique
Chapter 3 Design of Multilingual Thesaurus Translation-based MLDC Technique
3.1 Multilingual Thesaurus Construction
3.2 Document Translation
3.3 Document Clustering
Chapter 4 Empirical Evaluation
4.1 Data Collection
4.2 Evaluation Criteria
4.3 Evaluation Procedure
4.4 Evaluation Benchmark
4.5 Tuning Experiments and Results
4.5.1 Tuning Number of Features k
4.5.2 Tuning Translation Threshold δaw
4.6 Comparative Evaluation Results
Chapter 5 Conclusion and Future Research Directions
References
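Fulltext
The electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: available on campus after one year; never available off campus
Available:
Campus: available
Off-campus: not available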


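Printed copies
Public information on printed theses is relatively complete from academic year 102 onward. For information on printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services.
Available: available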
