博碩士論文 etd-0214107-150013 詳細資訊
Title page for etd-0214107-150013
論文名稱
Title
以自動摘要與潛在語意索引分析輔助文件分類
Summary-based document categorization with LSI
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
54
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2006-07-20
繳交日期
Date of Submission
2007-02-14
關鍵字
Keywords
文件摘要、潛在語意索引、文件分類
Document Categorization, Latent Semantic Indexing, Text Summarization
統計
Statistics
本論文已被瀏覽 5881 次,被下載 22
The thesis/dissertation has been browsed 5881 times, has been downloaded 22 times.
中文摘要
現今在網際網路上傳遞的文件成指數倍數的成長,如何能有效正確地幫助使用者找到所需要的文件成為一大議題。將文件歸類至事先指定類別的文件分類技術可以加速文件搜尋的過程並增加搜尋結果的品質。但是文件分類本身亦面臨二大難題,亦即詞彙屬性選擇問題與詞彙語意問題。在詞彙屬性選擇方面,可以將文件先作摘要而縮減文件大小並進一步減低詞彙屬性數量;在詞彙語意問題,目前可以以潛在語意索引技術將詞彙的真實意義藉由於其它詞彙的關連而充分顯現。但是少有研究文獻指出此二種方法結合使用的效果影響。所以本研究的目的就在於提出SBDR,結合文件摘要與潛在語意索引,以解決上述文件分類問題,並進而驗證之。
本研究進行二個實驗來驗證SBDR的績效。實驗一針對文件摘要否對文件分類有幫助,以及用何種詞彙來決定文句的重要性作探討。結果顯示,文件摘要果如預期地能提昇文件分類的績效;此外,同時考量名詞-名詞與名詞-動詞比只考量名詞-名詞來決定文句的重要性較好。實驗二則是比較文件摘要否結合潛在語意索引的績效影響。雖然結果顯示潛在語意索引應用於全文(無摘要)之準確度比應用於摘要文件(SBDR)稍微要佳,但其運算時間亦隨詞彙屬性數量增加而增加。因此SBDR可以應用在以效率為主要考量並且容忍效能少量減損的情況下。此二實驗之結果因此驗證SBDR的實際適用性。
Abstract
Text categorization, which automatically assigns documents to one or more pre-defined categories, is essential for retrieving desired documents efficiently and effectively from a huge text repository such as the World Wide Web. Most techniques, however, suffer from the feature selection problem and the vocabulary mismatch problem. A few studies have addressed text categorization via text summarization, which reduces the size of documents and consequently the number of features to consider, while others have proposed using latent semantic indexing (LSI) to reveal the true meaning of a term via its association with other terms. Few works, however, have studied the joint effect of text summarization and this semantic dimension reduction technique. The objective of this research is thus to propose a practical approach, SBDR, that combines the two to deal with the above difficulties in text categorization tasks.
Two experiments were conducted to validate the proposed approach. In the first experiment, the results show that text summarization does improve categorization performance. In addition, when identifying important sentences, the term associations of both noun-noun and noun-verb pairs should be considered, rather than noun-noun pairs alone. Results of the second experiment indicate slightly better performance when LSI is applied to the full text (i.e., no summarization) than with SBDR (i.e., with summarization). Nonetheless, the minor reduction in accuracy is largely compensated for by the computation time saved when LSI is applied to summarized text. The feasibility of the SBDR approach is thus justified.
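The pipeline outlined in the abstract (summarize, reduce dimensions with LSI, then categorize) can be illustrated with a minimal sketch. The abstract does not specify the thesis's actual algorithms, so every choice here is an assumption made for illustration: sentences are scored by average term frequency instead of noun-noun and noun-verb associations, raw term counts stand in for any term weighting, the leading singular directions are approximated by power iteration rather than a full SVD, and categorization uses a nearest-centroid cosine rule. All function names and the toy corpus are hypothetical.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "as", "in", "of", "on", "the", "to"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop a few stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(text, keep=2):
    """Extractive summary: keep the `keep` sentences with the highest average
    in-document term frequency (a stand-in for association-based scoring)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    top = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))[:keep]
    return " ".join(sentences[i] for i in sorted(top))  # restore original order

def leading_term_directions(matrix, k, iterations=200):
    """Approximate the k leading left singular vectors of a terms x docs
    matrix A by power iteration on A A^T with Gram-Schmidt deflation."""
    n_terms, n_docs = len(matrix), len(matrix[0])
    basis = []
    for r in range(k):
        v = [1.0 + (i * (r + 1)) % 7 for i in range(n_terms)]  # deterministic start
        for _ in range(iterations):
            at_v = [sum(matrix[i][j] * v[i] for i in range(n_terms)) for j in range(n_docs)]
            v = [sum(row[j] * at_v[j] for j in range(n_docs)) for row in matrix]
            for u in basis:  # stay orthogonal to directions already found
                dot = sum(a * b for a, b in zip(v, u))
                v = [a - dot * b for a, b in zip(v, u)]
            norm = math.sqrt(sum(a * a for a in v))
            if norm < 1e-12:  # matrix rank is below k; stop early
                return basis
            v = [a / norm for a in v]
        basis.append(v)
    return basis

def vectorize(tokens, vocab):
    """Raw term counts over the training vocabulary; unseen terms are ignored."""
    counts = Counter(tokens)
    return [float(counts[t]) for t in vocab]

def project(vector, basis):
    """Map a term-space vector into the reduced LSI space."""
    return [sum(a * b for a, b in zip(vector, u)) for u in basis]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

def train(labelled_docs, k=2, keep=2):
    token_lists = [tokenize(summarize(text, keep)) for _, text in labelled_docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    columns = [vectorize(tokens, vocab) for tokens in token_lists]
    matrix = [[col[i] for col in columns] for i in range(len(vocab))]  # terms x docs
    basis = leading_term_directions(matrix, k)
    centroids = {}  # per-category sum of projected documents (direction only)
    for (label, _), col in zip(labelled_docs, columns):
        point = project(col, basis)
        total = centroids.get(label, [0.0] * len(basis))
        centroids[label] = [a + b for a, b in zip(total, point)]
    return {"vocab": vocab, "basis": basis, "centroids": centroids, "keep": keep}

def classify(model, text):
    tokens = tokenize(summarize(text, model["keep"]))
    point = project(vectorize(tokens, model["vocab"]), model["basis"])
    return max(model["centroids"], key=lambda c: cosine(point, model["centroids"][c]))

# Toy corpus, invented for this example only.
TRAIN = [
    ("sports", "The team won the match. The coach praised the team. "
               "Fans cheered the team after the match."),
    ("sports", "The striker scored a goal in the match. The team celebrated the goal."),
    ("finance", "The bank raised interest rates. Investors sold shares as rates rose. "
                "The bank expects rates to stay high."),
    ("finance", "Shares fell after the bank report. Investors fear higher rates."),
]
model = train(TRAIN)
print(classify(model, "The coach said the team played a great match."))
print(classify(model, "Investors worry that the bank will raise rates."))
```

Summarizing first shrinks the vocabulary that enters the term-document matrix, which is where the time saving described in the abstract would come from; the possible cost is that a discarded sentence may have carried a useful term.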
目次 Table of Contents
CHAPTER 1 Introduction 1
1.1 Overview 1
1.2 Objective of the research 2
1.3 Organization of the thesis 3
CHAPTER 2 Literature Review 4
2.1 Information Retrieval 4
2.2 Text Mining 9
2.3 Text Summarization 10
2.4 Text Categorization 11
CHAPTER 3 Proposed Approach 15
3.1 Document Preprocessing 15
3.2 Text Summarization 18
3.4 Dimension Reduction 23
3.5 Categorization 24
CHAPTER 4 Experiments and Results 26
4.1 Experimental Design 26
4.2 Experiment I 30
4.3 Experiment II 35
CHAPTER 5 Conclusion and Future Work 41
5.1 Concluding remarks 41
5.2 Future Work 42
REFERENCE 43
電子全文 Fulltext
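This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.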
論文使用權限 Thesis access permission:校內一年後公開,校外永不公開 campus withheld
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus:永不公開 not available


紙本論文 Printed copies
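Public-availability information for printed theses is relatively complete from academic year 102 onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.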
開放時間 available 已公開 available
