國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,以自動摘要與潛在語意索引分析輔助文件分類,Summary-based document categorization with LSI

論文名稱 Title	以自動摘要與潛在語意索引分析輔助文件分類 Summary-based document categorization with LSI
系所名稱 Department	資訊管理學系 Department of Information Management
畢業學年期 Year, semester	95 學年度第 1 學期 The fall semester of Academic Year 95	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	54
研究生 Author	劉曉雯 Hsiao-Wen Liu
指導教授 Advisor	張德民 Te-Min Chang
召集委員 Convenor	蕭文峰 Wen-Feng Hsiao
口試委員 Advisory Committee	孫培真 Pei-Chen Sun
口試日期 Date of Exam	2006-07-20	繳交日期 Date of Submission	2007-02-14
關鍵字 Keywords	文件摘要、潛在語意索引、文件分類 Document Categorization, Latent Semantic Indexing, Text Summarization

中文摘要
現今在網際網路上傳遞的文件成指數倍數的成長，如何能有效正確地幫助使用者找到所需要的文件成為一大議題。將文件歸類至事先指定類別的文件分類技術可以加速文件搜尋的過程並增加搜尋結果的品質。但是文件分類本身亦面臨二大難題，亦即詞彙屬性選擇問題與詞彙語意問題。在詞彙屬性選擇方面，可以將文件先作摘要而縮減文件大小並進一步減低詞彙屬性數量；在詞彙語意問題，目前可以以潛在語意索引技術將詞彙的真實意義藉由於其它詞彙的關連而充分顯現。但是少有研究文獻指出此二種方法結合使用的效果影響。所以本研究的目的就在於提出SBDR，結合文件摘要與潛在語意索引，以解決上述文件分類問題，並進而驗證之。本研究進行二個實驗來驗證SBDR的績效。實驗一針對文件摘要否對文件分類有幫助，以及用何種詞彙來決定文句的重要性作探討。結果顯示，文件摘要果如預期地能提昇文件分類的績效；此外，同時考量名詞-名詞與名詞-動詞比只考量名詞-名詞來決定文句的重要性較好。實驗二則是比較文件摘要否結合潛在語意索引的績效影響。雖然結果顯示潛在語意索引應用於全文(無摘要)之準確度比應用於摘要文件(SBDR)稍微要佳，但其運算時間亦隨詞彙屬性數量增加而增加。因此SBDR可以應用在以效率為主要考量並且容忍效能少量減損的情況下。此二實驗之結果因此驗證SBDR的實際適用性。
Abstract
Text categorization, which automatically assigns documents to one or more appropriate pre-defined categories, is essential for retrieving desired documents efficiently and effectively from a huge text repository such as the World Wide Web. Most techniques, however, suffer from the feature selection problem and the vocabulary mismatch problem. A few studies have addressed text categorization via text summarization, which reduces the size of documents and consequently the number of features to consider, while others have proposed using latent semantic indexing (LSI) to reveal the true meaning of a term via its association with other terms. Few works, however, have studied the joint effect of text summarization and this semantic dimension reduction technique. The objective of this research is thus to propose a practical approach, SBDR, that combines the two to deal with the above difficulties in text categorization tasks. Two experiments are conducted to validate the proposed approach. In the first experiment, the results show that text summarization does improve categorization performance. In addition, when determining the importance of sentences, association terms from both noun-noun and noun-verb pairs should be considered. Results of the second experiment indicate slightly better accuracy when LSI is applied alone (i.e., to the full text, without summarization) than with SBDR (i.e., with summarization). Nonetheless, the minor reduction in accuracy is largely compensated for by the computational time saved when LSI is applied to summarized text. The feasibility of the SBDR approach is thus justified.
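The feature-reduction effect of summarization described in the abstract can be illustrated with a small sketch. This is not the thesis's SBDR implementation: sentences are scored here by plain term frequency rather than by noun-noun and noun-verb association terms, and the names `summarize`, `vocabulary`, `ratio`, and the stopword list are assumptions introduced only for this example. LSI and the classifier itself are left out.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the thesis).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "for", "on", "it", "this", "that", "with", "as", "was"}

def tokenize(text):
    """Return lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def summarize(text, ratio=0.5):
    """Keep the highest-scoring sentences, preserving their original order.

    A sentence's score is the mean document-level frequency of its terms.
    This scoring rule is a stand-in chosen for the example.
    """
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    keep = max(1, int(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(ranked))

def vocabulary(docs):
    """Set of distinct terms (candidate features) across the documents."""
    return {t for d in docs for t in tokenize(d)}

doc = ("Stock prices rose as investors bought stock. "
       "Banks reported that stock trading lifted profits. "
       "The weather was mild. "
       "Analysts expect stock prices to rise further.")

summary = summarize(doc)
full_vocab = vocabulary([doc])
summary_vocab = vocabulary([summary])
print(len(full_vocab), "features before summarization,",
      len(summary_vocab), "after")
```

Because the summary contains only sentences taken from the original document, its vocabulary is always a subset of the full vocabulary, so any later step that builds a term-document matrix (for LSI, for example) works with fewer rows.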

目次 Table of Contents
CHAPTER 1 Introduction
	1.1 Overview
	1.2 Objective of the research
	1.3 Organization of the thesis
CHAPTER 2 Literature Review
	2.1 Information Retrieval
	2.2 Text Mining
	2.3 Text Summarization
	2.4 Text Categorization
CHAPTER 3 Proposed Approach
	3.1 Document Preprocessing
	3.2 Text Summarization
	3.4 Dimension Reduction
	3.5 Categorization
CHAPTER 4 Experiments and Results
	4.1 Experimental Design
	4.2 Experiment I
	4.3 Experiment II
CHAPTER 5 Conclusion and Future Work
	5.1 Concluding remarks
	5.2 Future Work
REFERENCE

電子全文 Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it, so as to avoid violating the law.
論文使用權限 Thesis access permission	On campus: available after one year; off campus: never available
開放時間 Available	校內 Campus: available; 校外 Off-campus: not available
紙本論文 Printed copies
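Availability information for printed theses is relatively complete from Academic Year 102 onward. For printed theses from Academic Year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available	已公開 available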

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS