博碩士論文 etd-0802102-142205 詳細資訊
Title page for etd-0802102-142205
論文名稱
Title
文件探勘技術中字詞擴展之研究
Investigations of Term Expansion on Text Mining Techniques
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
64
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2001-07-25
繳交日期
Date of Submission
2002-08-02
關鍵字
Keywords
字詞關聯、文件探勘、文件分類、字詞使用差異、文件分群、事件偵測、字詞擴展
Term Association, Word Mismatch, Text Mining, Event Detection, Term Expansion, Document Clustering, Text Categorization
統計
Statistics
本論文已被瀏覽 5751 次,被下載 7426
The thesis/dissertation has been browsed 5751 times, has been downloaded 7426 times.
中文摘要
近來電腦及網路科技的快速發展促成了全球網路的連結,也使得線上文件快速地成長及累積。這些在網路上或組織內所累積下來的文件可能含有許多組織競爭所需的知識,有效的文件管理(Document Management)技術(包括資訊檢索(Information Retrieval)、資訊過濾(Information Filtering)、文字探勘(Text Mining)等)可協助組織有效的運用這些文件。然而,文件管理研究面臨一項挑戰性的議題,即所謂的字詞使用差異(Word Mismatch)。目前字詞使用差異的研究主要是在資訊檢索的研究領域,並以字詞擴展(Term Expansion)的技術來解決這個問題,然而,在文件探勘的文獻中,這個問題卻極少被處理與解決。因此,本論文旨在對文件探勘技術中字詞擴展之使用進行研究,並特別以文件分類(Text Categorization)、文件分群(Document Clustering)以及事件偵測(Event Detection)這三類文件探勘技術為研究對象,發展這三類技術所需之字詞擴展技術。根據實證評估的結果,當使用相關係數(Correlation Coefficient)作為特徵選擇(Feature Selection)方式時,字詞擴展技術增加了的文件分類之效能。在文件分群方面,使用字詞擴展之文件分群技術並未改善分群之效能,但在Specificity的衡量上,使用字詞擴展技術的結果普遍明顯地優於傳統文件分群技術。最後,使用字詞擴展來協助事件偵測則導致了偵測效果的降低。
Abstract
Recent advances in computer and network technologies have contributed significantly to global connectivity and have caused the volume of online textual documents to grow extremely rapidly. The rapid accumulation of textual documents on the Web or within an organization requires effective document management techniques, including information retrieval, information filtering, and text mining. The word mismatch problem is a challenging issue for document management research. Word mismatch has been extensively investigated in information retrieval (IR) research through the use of term expansion (or, more specifically, query expansion). However, a review of the text mining literature suggests that the word mismatch problem has seldom been addressed by text mining techniques. This thesis therefore investigates the use of term expansion in three text mining techniques: text categorization, document clustering, and event detection. Accordingly, we developed term expansion extensions to these three techniques. The empirical evaluation results showed that term expansion increased categorization effectiveness when correlation coefficient feature selection was employed. With respect to document clustering, techniques extended with term expansion achieved clustering effectiveness comparable to that of existing techniques and were superior on the clustering specificity measure. Finally, the use of term expansion to support event detection degraded detection effectiveness compared with the traditional event detection technique.
目次 Table of Contents
CHAPTER 1 INTRODUCTION
1.1 BACKGROUND
1.2 RESEARCH MOTIVATION AND OBJECTIVES
1.3 ORGANIZATION OF THE THESIS
CHAPTER 2 LITERATURE REVIEW
2.1 TERM ASSOCIATION CONSTRUCTION
2.2 TEXT CATEGORIZATION
2.3 DOCUMENT CLUSTERING
2.4 EVENT DETECTION
CHAPTER 3 TEXT MINING TECHNIQUES WITH TERM EXPANSION
3.1 TERM ASSOCIATION CONSTRUCTION PROCESS
3.2 TEXT CATEGORIZATION WITH TERM EXPANSION
3.3 DOCUMENT CLUSTERING WITH TERM EXPANSION
3.4 EVENT DETECTION WITH TERM EXPANSION
CHAPTER 4 EMPIRICAL EVALUATION FOR TEXT CATEGORIZATION WITH TERM EXPANSION
4.1 EVALUATION DESIGN
4.1.1 Data Collection
4.1.2 Evaluation Criteria
4.1.3 Evaluation Procedure
4.2 EVALUATION RESULT
4.2.1 Effects of Number of Features on Categorization Effectiveness
4.2.2 Effects of Induction Method, Feature Selection Method and Representation Scheme on Categorization Effectiveness
4.2.3 Comparative Evaluation of Text Categorization Techniques
CHAPTER 5 EMPIRICAL EVALUATION FOR TERM EXPANSION EMBEDDED DOCUMENT CLUSTERING
5.1 EVALUATION DESIGN
5.1.1 Evaluation Criteria
5.1.2 Evaluation Procedure
5.2 EVALUATION RESULT
5.2.1 Comparative Evaluation of Document Clustering Techniques
CHAPTER 6 EMPIRICAL EVALUATION FOR EVENT DETECTION WITH TERM EXPANSION
6.1 EVALUATION DESIGN
6.1.1 Data Collection
6.1.2 Evaluation Criteria
6.1.3 Performance Benchmark
6.1.4 Evaluation Procedure
6.2 EVALUATION RESULT
6.2.1 Parameter Tuning
6.2.2 Comparative Evaluation of Event Detection Techniques
CHAPTER 7 CONCLUSION AND FUTURE RESEARCH DIRECTIONS
REFERENCES

參考文獻 References
[A73] Anderberg, M. R., Cluster Analysis for Applications, Academic Press, Inc., 1973.
[ABS99] Agrawal, R., Bayardo, R., and Srikant, R., “Athena: Mining-based Interactive Management of Text Databases,” Proceedings of the 6th International Conference on Extending Databases Technology, July 1999, pp.365-379.
[ADW94] Apte, C., Damerau, F., and Weiss, S., “Automated Learning of Decision Rules for Text Categorization,” ACM Transactions on Information Systems, Vol. 12, No. 3, 1994, pp.233-251.
[AF77] Attar, R. and Fraenkel, A. S., “Local Feedback in Full-Text Retrieval Systems,” Journal of the ACM, Vol. 24, No. 3, 1977, pp.397-417.
[AIS93] Agrawal, R., Imielinski, T., and Swami, A., “Mining Association Rules between Sets of Items in Large Databases,” Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993, pp.207-216.
[APL98] Allan, J., Papka, R., and Lavrenko, V., “On-line New Event Detection and Tracking,” Proceedings of SIGIR ’98: 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM press, New York, 1998, pp.37-45.
[AS94] Agrawal, R. and Srikant, R., “Fast Algorithms for Mining Association Rules,” Proceedings of the 20th VLDB Conference, Santiago, Chile, September 1994, pp.487-499.
[B92] Brill, E., “A Simple Rule-Based Part of Speech Tagger,” Proceedings of the Third Conference on Applied Natural Language Processing, ACL, Trento, Italy, 1992.
[B94] Brill, E., “Some Advances in Rule-Based Part of Speech Tagging,” Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA, 1994.
[BM98] Baker, L. D. and McCallum, A. K., “Distributional Clustering of Words for Text Classification,” Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp.96-103.
[C99] Choo, C. W., “The Art of Scanning the Environment,” Bulletin of the American Society for Information Science, 1999, pp.21-24.
[CS96] Cohen, W. W. and Singer, Y., “Context-sensitive Learning Methods for Text Categorization,” Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland, August 1996, pp.307-315.
[DPH98] Dumais, S., Platt, J., Heckerman, D., and Sahami, M., “Inductive Learning Algorithms and Representations for Text Categorization,” Proceedings of the 1998 ACM 7th International Conference on Information and Knowledge Management (CIKM ’98), 1998, pp.148-155.
[EM97] Estivill-Castro, V. and Murray, A. T., “Spatial Clustering for Data Mining with Genetic Algorithms,” Technical Report FIT-TR-97-10, Queensland University of Technology, Faculty of Information Management, September 1997.
[EW86] El-Hamdouchi, A. and Willett, P., “Hierarchical Document Clustering Using Ward’s Method,” Proceedings of ACM Conference on Research and Development in Information Retrieval, 1986, pp.149-156.
[FLG87] Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T., “The Vocabulary Problem in Human-System Communication,” Communications of the ACM, Vol. 30, No. 11, November 1987, pp.964-971.
[H81] Hambrick, D. C., “Specialization of Environmental Scanning Activities Among Upper Level Executives,” Journal of Management Studies, Vol.18, 1981, pp.299-320.
[IT95] Iwayama, M. and Tokunaga, T., “Cluster-based Text Categorization: A Comparison of Category Search Strategies,” Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’95), Seattle, WA, July 1995, pp.273-281.
[JL92] Jennings, D. and Lumpkin, J., “Insights Between Environmental Scanning Activities and Porter’s Generic Strategies: An Empirical Analysis,” Journal of Management, Vol. 18, No. 4, 1992, pp.791-803.
[K89] Kohonen, T., Self-Organization and Associative Memory, Springer, 1989.
[K95] Kohonen, T., Self-Organizing Maps, Springer, 1995.
[KR90] Kaufman, L. and Rousseeuw, P. J., Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, Inc., New York, NY, 1990.
[LA99] Larsen, B. and Aone, C., “Fast and Effective Text Mining Using Linear-time Document Clustering,” Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp.16-22.
[LC96] Larkey, L. and Croft, W., “Combining Classifiers in Text Categorization,” Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’96), Zurich, Switzerland, August 1996, pp.289-297.
[LFM99] Letourneau, S., Famili, F., and Matwin, S., “Data Mining to Predict Aircraft Component Replacement,” IEEE Intelligent Systems, Vol. 14, No. 6, November/December 1999, pp.59-66.
[LH98] Lam, W. and Ho, C. Y., “Using a Generalized Instance Set for Automatic Text Categorization,” Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp.81-89.
[LR94] Lewis, D. and Ringuette, M., “A Comparison of Two Learning Algorithms for Text Categorization,” Proceedings of Symposium on Document Analysis and Information Retrieval, 1994.
[M80] McCarn, D., “Medline: An Introduction to On-line Searching,” Journal of the American Society for Information Science, Vol. 31, No. 3, 1980, pp.181-192.
[M95] Miller, G. A., “WordNet: A Lexical Database for English,” Communications of the ACM, Vol. 38, No. 11, November 1995, pp.39-41.
[MLW92] Masand, B., Linoff, G., and Waltz, D., “Classifying News Stories Using Memory Based Reasoning,” Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’92), 1992, pp.59-64.
[MN98] McCallum, A. K. and Nigam, K., “A Comparison of Event Models for Naïve Bayes Text Classification,” Proceedings of AAAI-98 Workshop on Learning for Text Categorization, 1998.
[N01] National Library of Medicine, “UMLS Knowledge Sources,” National Library of Medicine, 12th Experimental Edition, January 2001.
[NGL97] Ng, H. T., Goh, W. B., and Low, K. L., “Feature Selection, Perceptron Learning, and a Usability Case Study for Text Categorization,” Proceedings of Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’97), 1997, pp.67-73.
[NH94] Ng, R. and Han, J., “Efficient and Effective Clustering Methods for Spatial Data Mining,” Proceedings of International Conference on Very Large Data Bases, Santiago, Chile, September 1994, pp.144-155.
[QF93] Qiu, Y. and Frei, H. P., “Concept Based Query Expansion,” Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1993, pp.160-169.
[RC99] Roussinov, D. G. and Chen, H., “Document Clustering for Electronic Meetings: An Experimental Comparison of Two Techniques,” Decision Support Systems, Vol. 27, No.1-2, November 1999, pp.67-79.
[RHW86] Rumelhart, D. E., Hinton, G. E., and Williams, R. J., “Learning Internal Representations by Error Propagation,” Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Rumelhart, D. E. and McClelland, J. L. (Eds.), MIT Press, Cambridge, MA, 1986, pp.318-362.
[SB90] Salton, G. and Buckley, C., “Improving Retrieval Performance by Relevance Feedback,” Journal of the American Society for Information Science, Vol. 41, 1990, pp.288-297.
[SHP95] Schutze, H., Hull, D. A., and Pedersen, J. O., “A Comparison of Classifiers and Document Representations for the Routing Problem,” Proceedings of 18th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995, pp.229-237.
[v79] van Rijsbergen, C. J., Information Retrieval, Butterworths, 1979.
[V86] Voorhees, E. M., “Implementing Agglomerative Hierarchical Clustering Algorithms for Use in Document Retrieval,” Information Processing and Management, Vol.22, 1986, pp.465-476.
[V93] Voutilainen, A., “Nptool: A Detector of English Noun Phrases,” Proceedings of Workshop on Very Large Corpora, Ohio, June 1993.
[WAD99] Weiss, S. M., Apte, C., Damerau, F. J., Johnson, D. E., Oles, F. J., Goetz, T., and Hampp, T., “Maximizing Text-Mining Performance,” IEEE Intelligent Systems, Vol. 14, No. 4, July/August 1999, pp.63-69.
[WBO00] Wei, J., Bressan, S., and Ooi, B. C., “Mining Term Association Rules for Automatic Global Query Expansion: Methodology and Preliminary Results,” Proceedings of the First International Conference on Web Information Systems Engineering, 2000, pp.366-373.
[WHD02] Wei, C., Hu, P., and Dong, Y. X., “Managing Document Categories in E-Commerce Environments: An Evolution-Based Approach,” European Journal of Information Systems (forthcoming).
[WPS02] Wei, C., Piramuthu, S., and Shaw, M. J., “Knowledge Discovery and Data Mining,” To appear in Handbook of Knowledge Management (forthcoming).
[WPW95] Wiener, E., Pedersen, J. O., and Weigend, A. S., “A Neural Network Approach to Topic Spotting,” Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval (SDAIR ’95), Las Vegas, NV, 1995, pp.317-332.
[X97] Xu, J., “Solving the Word Mismatch Problem Through Automatic Text Analysis,” unpublished Ph.D. Thesis, University of Massachusetts at Amherst, 1997.
[XC96] Xu, J. and Croft, W. B., “Query Expansion Using Local and Global Document Analysis,” Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp.4-11.
[Y94] Yang, Y., “Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval,” Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’94), Dublin, Ireland, July 1994, pp.13-22.
[YC94] Yang, Y. and Chute, C. G., “An Example-based Mapping Method for Text Categorization and Retrieval,” ACM Transactions on Information Systems, Vol. 12, No. 3, 1994, pp.252-277.
[YCB99] Yang, Y., Carbonell, J. G., Brown, R. D., Pierce, T., Archibald, B. T., and Liu, X., “Learning Approaches for Detecting and Tracking News Events,” IEEE Intelligent Systems, Vol. 14, No. 4, July/August 1999, pp.32-43.
[YL99] Yang, Y. and Liu, X., “A Re-examination of Text Categorization Methods,” Proceedings of SIGIR ’99: 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp.42-49.
[YPC98] Yang, Y., Pierce, T., and Carbonell, J., “A Study on Retrospective and Online Event Detection,” Proceedings of SIGIR ’98: 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp.28-36.
電子全文 Fulltext
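This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.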
論文使用權限 Thesis access permission: available on campus immediately; available off campus after one year
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
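Availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services (圖資處). We apologize for any inconvenience.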
開放時間 Available: 已公開 available
