博碩士論文 etd-0814103-153807 詳細資訊
Title page for etd-0814103-153807
論文名稱
Title
群集式字詞擴張研究
Cluster-based Query Expansion Technique
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
45
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2003-07-22
繳交日期
Date of Submission
2003-08-14
關鍵字
Keywords
文件探勘、文件分群、字詞關連性、群集式字詞擴展、字詞使用差異、字詞擴展、資訊擷取
Term Association, Query Expansion, Thesaurus, Text Mining, Information Retrieval, Word Mismatch, Cluster-based Query Expansion, Document Clustering
統計
Statistics
本論文已被瀏覽 5717 次,被下載 1850 次
This thesis has been viewed 5717 times and downloaded 1850 times.
中文摘要
隨著網路與資訊科技的高速發展,越來越多的資訊以文字文件的型態出現在網路上。為了協助使用者快速且準確的尋找到其所需的文件,資訊擷取系統(Information Retrieval Systems)所扮演的角色也就越來越重要。然而在資訊擷取的過程中,常常會遇到使用者使用和文件中不同的關鍵字詞來描述同一概念的情況,也就是所謂的字詞使用差異(Word Mismatch)。如果沒有適當的去處理字詞使用差異,則資訊擷取的效果將會大受影響。然而,在文件探勘的文獻中,這個問題卻極少被處理與解決。

因此在本篇論文中,我們提出一個群集式字詞擴展技術(Cluster-based Query Expansion Technique)來解決字詞使用差異,並利用傳統的字詞擴展技術 (也就是Global Analysis and Local Feedback), 當做我們的衡量基準。根據實證的結果,我們發現當使用者所下的查詢只包含一個關鍵字詞時,傳統的方法Global Analysis提供了較有效的查詢結果。但是,當使用者所下的查詢包含兩個關鍵字詞以上時,群集式字詞擴展技術則可以提供較精確的查詢結果。
Abstract
With advances in information and networking technologies, a huge amount of information, typically in the form of text documents, is available online. To facilitate efficient and effective access to documents relevant to users’ information needs, information retrieval systems play a more significant role than ever. One challenging issue in information retrieval is word mismatch, the phenomenon in which the same concept is described by different words in user queries and in documents. If not appropriately addressed, the word mismatch problem can critically degrade the retrieval effectiveness of an information retrieval system.

In this thesis, we develop a cluster-based query expansion technique to solve the word mismatch problem. Using the traditional query expansion techniques (i.e., global analysis and local feedback) as performance benchmarks, the empirical results suggest that when a user query only consists of one query term, the global analysis technique is more effective. However, if a user query consists of two or more query terms, the cluster-based query expansion technique can provide a more accurate query result, especially within the first few top-ranked documents retrieved.
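
The idea of expanding a query from a cluster-specific thesaurus can be illustrated with a minimal sketch. This record does not give the thesis's actual clustering algorithm, term-association measure, or cluster-selection rule, so everything below is an assumption made for illustration: the clusters are supplied by hand, term association is a raw document co-occurrence count, and the cluster is chosen by counting query-term occurrences. The function names (build_thesaurus, expand, cluster_based_expand) and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_thesaurus(docs):
    """Count how often each pair of distinct terms appears in the same document."""
    cooc = defaultdict(Counter)
    for doc in docs:
        for a, b in combinations(sorted(set(doc.split())), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def expand(query_terms, thesaurus, k=1):
    """Append the k terms with the highest summed co-occurrence with the query."""
    scores = Counter()
    for q in query_terms:
        for term, n in thesaurus.get(q, {}).items():
            if term not in query_terms:
                scores[term] += n
    # Sort by score, then alphabetically, so ties are broken deterministically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [t for t, _ in ranked[:k]]


def cluster_based_expand(query_terms, clusters, k=1):
    """Pick the cluster that mentions the query terms most often (an assumed
    selection rule), then expand with that cluster's local thesaurus."""
    def overlap(docs):
        return sum(doc.split().count(q) for doc in docs for q in query_terms)

    best = max(clusters, key=lambda name: overlap(clusters[name]))
    return best, expand(query_terms, build_thesaurus(clusters[best]), k)


# Hand-made clusters standing in for the output of a document-clustering step.
clusters = {
    "vehicles": ["car automobile engine", "automobile engine repair", "jaguar car automobile"],
    "animals": ["jaguar cat habitat", "cat habitat forest"],
}

print(cluster_based_expand(["car"], clusters))
print(cluster_based_expand(["jaguar", "engine"], clusters))
print(cluster_based_expand(["jaguar", "habitat"], clusters))
```

In this toy data the ambiguous term "jaguar" is expanded differently depending on the second query term, because that term steers the query to a different cluster and therefore to a different local thesaurus. This is only meant to make the abstract's multi-term finding easier to picture; it is not evidence for it.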
目次 Table of Contents
1. INTRODUCTION 1
1.1 BACKGROUND 1
1.2 RESEARCH MOTIVATION AND OBJECTIVES 2
1.3 ORGANIZATION OF THE THESIS 4
2. LITERATURE REVIEW 6
2.1 QUERY EXPANSION METHODS 6
2.1.1 Global Analysis 6
2.1.2 Local Feedback 7
2.2 THESAURUS CONSTRUCTION TECHNIQUES 9
2.3 DOCUMENT CLUSTERING 14
3. DEVELOPMENT OF CLUSTER-BASED QUERY EXPANSION TECHNIQUE 17
3.1 PROCESS OF CLUSTER-BASED QUERY EXPANSION TECHNIQUE 17
3.2 THESAURI CONSTRUCTION PROCESS 19
3.2.1 Document Clustering 19
3.2.2 Local Thesaurus Construction 22
3.3 QUERY PROCESS 23
3.3.1 Local Query Expansion 23
3.3.2 Document Retrieval 24
4. EMPIRICAL RESEARCH RESULTS 26
4.1 DATA COLLECTION 26
4.2 EVALUATION PROCEDURE AND CRITERIA 27
4.3 BENCHMARK TECHNIQUES 29
4.4 EVALUATION RESULTS 31
4.4.1 Comparative Evaluation 32
4.4.2 Effects of Number of Query Terms 34
4.4.3 Effects of Number of Document-clusters 36
5. CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS 40
6. REFERENCES 42
參考文獻 References
[A73] Anderberg, M. R., Cluster Analysis for Applications, Academic Press Inc., 1973.
[AF77] Attar, R. and Fraenkel, A. S., “Local Feedback in Full-Text Retrieval Systems,” Journal of the ACM, Vol. 24, No. 3, 1977, pp.397-417.
[BGG99] Boley, D., Gini, M., Gross, R., Han, E., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., and Moore, L., “Partitioning-based Clustering for Web Document Categorization,” Decision Support Systems, Vol. 27, No. 3, December 1999, pp.329-341.
[CH79] Croft, W. B. and Harper, D. J., “Using Probabilistic Models of Document Retrieval without Relevance Information,” Journal of Documentation, Vol. 35, 1979, pp.285-295.
[CCW95] Croft, W. B., Cook, R., and Wilder, D., “Providing Government Information on the Internet: Experiences with THOMAS,” Digital Libraries Conference, 1995, pp.19-24.
[CKP92] Cutting, D., Karger, D., Pedersen, J. and Tukey, J., “Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections,” Proceedings of 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1992, pp.318-329.
[EW86] El-Hamdouchi, A. and Willett, P., “Hierarchical Document Clustering Using Ward’s Method,” Proceedings of ACM Conference on Research and Development in Information Retrieval, 1986, pp.149-156.
[FLG87] Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T., “The Vocabulary Problem in Human-System Communication,” Communications of the ACM, Vol. 30, No. 11, November 1987, pp.964-971.
[J71] Jones, S. K. “Automatic Keyword Classification for Information Retrieval,” Butterworth, 1971.
[JC94] Jing, Y. and Croft, W. B., “An Association Thesaurus for Information Retrieval,” Technical Report, Department of Computer Science, University of Massachusetts at Amherst, 1994.
[JJ70] Jones, S. K., and Jackson, D. “The Use of Automatically-obtained Keyword Classifications for Information Retrieval,” Information Processing and Management, Vol. 5, 1970, pp.175-201.
[K89] Kohonen, T., Self-Organization and Associative Memory, Springer, 1989.
[K95] Kohonen, T., Self-Organizing Maps, Springer, 1995.
[KL00] Kim, H. J. and Lee, S. G., “A Semi-Supervised Document Clustering Techniques for Information Organization,” Proceedings of the 2000 ACM 9th International Conference on Information and Knowledge Management (CIKM '00), 2000, pp.30-37.
[KR90] Kaufman, L. and Rousseeuw, P. J., Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, Inc., New York, NY, 1990.
[L69] Lesk, M. E., “Word-word Association in Document Retrieval Systems,” American Documentation 20, 27, 1969
[L92] Lewis, D. D., “Representation and Learning in Information Retrieval,” PhD thesis, University of Massachusetts at Amherst, 1992.
[LA99] Larsen, B. and Aone, C., “Fast and Effective Text Mining Using Linear-time Document Clustering,” Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp.16-22.
[LHK96] Lagus, K., Honkela, T., Kaski, S., and Kohonen, T., “Self-organizing Maps of Document Collections: A New Approach to Interactive Exploration,” Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996.
[M80] McCarn, D., “MedLine: An Introduction to On-line Searching,” Journal of the American Society for Information Science, Vol. 31, No. 3, 1980, pp.181-192.
[M95] Miller, G. A., “WordNet: A Lexical Database for English,” Communications of the ACM, Vol. 38, No. 11, November 1995, pp.39-41.
[MBF93] Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K., Introduction to WordNet: An On-line Lexical Database, Revised Version 1993.
[MWZ72] Minker, J., Wilson, G., and Zimmerman, B., “An Evaluation of Query Expansion by the Addition of Clustered Terms for A Document Retrieval System,” Information Storage and Retrieval, Vol. 8, 1972, pp.329-348.
[N01] National Library of Medicine, UMLS Knowledge Sources, 12th Experimental Edition, January 2001.
[QF93] Qiu, Y. and Frei, H. P., “Concept Based Query Expansion,” Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1993, pp.160-169.
[R97] Ruge, G., Combining Corpus Linguistics and Human Memory Models for Automatic Term Association, AI Group, Institut fuer Informatik, TU Muenchen. Natural Language Information Retrieval, Kluwer Academic Publishers, 1997.
[RC99] Roussinov, D. and Chen, H., “Document Clustering for Electronic Meetings: An Experimental Comparison of Two Techniques,” Decision Support Systems, Vol. 27, No. 1-2, 1999, pp.67-79.
[RPC01] Rui Pedro Chaves, “WordNet and Automated Text Summarization,” Computation of Lexical and Grammatical Knowledge Research Group, Centro de Linguística da Universidade de Lisboa, 2001.
[SB88] Salton, G. and Buckley, C., “Term Weighting Approaches in Automatic Text Retrieval,” Information Processing and Management, Vol. 24, No. 5, 1988, pp.513-523.
[SB90] Salton, G. and Buckley, C., “Improving the Retrieval Performance by Relevance Feedback,” Journal of the American Society for Information Science, Vol. 41, 1990, pp.288-297.
[V86] Voorhees, E. M., “Implementing Agglomerative Hierarchical Clustering Algorithms for Use in Document Retrieval,” Information Processing and Management, Vol.22, 1986, pp.465-476.
[V93] Voutilainen, A., “Nptool: A Detector of English Noun Phrases,” Proceedings of Workshop on Very Large Corpora, Ohio, June 1993.
[V94] Voorhees, E. M., “Query Expansion Using Lexical-Semantic Relations,” Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp.61-69.
[WBO00] Wei, J., Bressan, S., and Ooi, B. C., “Mining Term Association Rules for Automatic Global Query Expansion: Methodology and Preliminary Results,” Proceedings of the First International Conference on Web Information Systems Engineering, 2000, pp.366-373.
[X97] Xu, J., “Solving the Word Mismatch Problem Through Automatic Text Analysis,” Unpublished Ph.D. Thesis, University of Massachusetts at Amherst, 1997.
[XC96] Xu, J. and Croft, W. B. “Query Expansion Using Local and Global Document Analysis,” Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp.4-11.
電子全文 Fulltext
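This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.
論文使用權限 Thesis access permission: available on campus immediately; off campus withheld for one year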
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
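Availability information for printed theses is relatively complete from academic year 102 onward. To inquire about the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.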
開放時間 available 已公開 available
