National Sun Yat-sen University, thesis/dissertation, Concept Extraction With Change Detection From Navigated Information

Title	Concept Extraction With Change Detection From Navigated Information (以漸進式方法探究資訊涵義之研究)
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 93	Language	English
Degree	Master	Number of pages	67
Author	Tzu-hsiang Lin (林子翔)
Advisor	Te-min Chang (張德民)
Convenor	蕭文峰
Advisory Committee	楊婉秀
Date of Exam	2005-06-29	Date of Submission	2005-07-07
Keywords	Internet, Concept Extraction, Latent Semantic Indexing, Concept Change Detection
Statistics	This thesis has been browsed 5757 times and downloaded 18 times.

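Abstract (translated from the Chinese)
Information on the Internet is produced in a continual flood, and specific information usually has to be gathered through search engines. To understand the concepts covered by the gathered information, however, browsing all of it is time-consuming and does not necessarily yield a complete picture. Automatic concept extraction should be used instead to help users gain an overall understanding of what the gathered information means. In this study, we propose a method that extracts concepts from collected documents and detects how those concepts evolve over time. The method has two stages. The first stage uses latent semantic indexing to cluster paragraphs and identify key terms; each concept is expressed as a paragraph summary with related terms, so that users can easily understand what it means. The second stage modifies the concept structure incrementally to represent concept evolution: as new information keeps arriving, concepts may merge, split, or even give rise to new concepts. Three experiments are presented to validate the method. The results of the first two show high precision and recall against the predefined answers. The last is a real case (the South Asian tsunami) illustrating that the method can extract the overall concepts of the tsunami reports and their evolution. These experiments thus verify the applicability of the method.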
Abstract
To cope with the flood of information on the Internet, we usually look for specific information with search engines. Search engines are convenient, but their functions are limited. For example, it is impractical to browse through everything collected in order to gain an overall picture of what the navigated information stands for. We therefore need an approach that automatically extracts concepts from the navigated information, so that users can quickly and easily gain a basic understanding of a topic that interests them. In this research, we propose an approach that extracts concepts from navigated web information and detects concept changes over time. It has two stages. In the first stage, information is decomposed into paragraphs, the paragraphs are clustered, and key terms are identified, all with the aid of latent semantic indexing. Each concept is represented as a paragraph summary with associated key terms, which lets the user easily comprehend what it describes. The second stage adaptively modifies the concept structure to detect concept changes: as new information is added, concepts may merge, split, or newly emerge over time. Three experiments are conducted to verify the proposed approach. The results of the first and second experiments show both high recall and high precision against the predefined concept categories. The third is an illustrative real-case application to the tsunami event; it shows that with our approach we can easily grasp the different concepts in the tsunami reports and follow their changes. The feasibility of the approach is thus demonstrated.
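
The first stage described in the abstract (paragraphs compared and grouped with the aid of latent semantic indexing) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the toy term-by-paragraph matrix, the function names, the choice of k = 2, and the threshold-based grouping are all assumptions made for illustration, and the grouping step is a simple stand-in for whatever clustering method the thesis actually uses.

```python
import numpy as np


def lsi_paragraph_vectors(term_paragraph, k):
    """Project paragraphs into a k-dimensional latent space via truncated SVD."""
    # term_paragraph: rows are terms, columns are paragraphs.
    _, s, vt = np.linalg.svd(term_paragraph, full_matrices=False)
    # Paragraph coordinates are the columns of diag(s_k) @ Vt_k.
    return (s[:k, None] * vt[:k]).T


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


def group_by_threshold(sim, threshold):
    """Group items linked by similarity >= threshold (illustrative stand-in)."""
    n = sim.shape[0]
    labels = [-1] * n
    next_label = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = next_label
        stack = [i]
        while stack:
            cur = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[cur, j] >= threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Hypothetical toy data: 4 terms (rows) counted in 4 paragraphs (columns).
terms = ["wave", "earthquake", "aid", "donation"]
A = np.array(
    [
        [2, 1, 0, 0],
        [1, 2, 0, 0],
        [0, 0, 3, 1],
        [0, 0, 1, 3],
    ],
    dtype=float,
)

vectors = lsi_paragraph_vectors(A, k=2)
sim = cosine_similarity_matrix(vectors)
labels = group_by_threshold(sim, threshold=0.9)
```

In this toy case the first two paragraphs share one latent dimension and the last two share the other, so `labels` comes out as `[0, 0, 1, 1]`. With real text, the term weighting, the value of k, and the clustering method would all need to follow the choices made in the thesis itself.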

Table of Contents
CHAPTER 1 Introduction
  1.1 Overview
  1.2 Objective of the research
  1.3 Organization of the Thesis
CHAPTER 2 Literature Review
  2.1 Information retrieval
    2.1.1 Probabilistic model
    2.1.2 Vector space model
    2.1.3 Latent semantic indexing
  2.2 Text mining
    2.2.1 Document clustering
    2.2.2 Text summarization
    2.2.3 Concept Extraction
  2.3 Cluster analysis
    2.3.1 Partitioning clustering
    2.3.2 Hierarchical clustering
    2.3.3 Incremental clustering
CHAPTER 3 Proposed Approach
  3.1 Concept Extraction Stage
    3.1.1 Information Preprocessing
      (1) Part-of-speech tagging
      (2) Noun stemming
      (3) Term Frequency Calculating
    3.1.2 Similarity Calculating
    3.1.3 Paragraph Clustering
    3.1.4 Concept Extracting
  3.2 Concept Change Analysis Stage
    3.2.1 New Paragraphs Joining
    3.2.2 Intra Cluster Reorganizing
    3.2.3 Detecting concept changes
      (1) Slightly Changing
      (2) Merging
      (3) Splitting
    3.2.4 Structure Re-Building
CHAPTER 4 Experiments and Results
  4.1 Experiment I
  4.2 Experiment II
  4.3 An Illustrated Application
CHAPTER 5 Conclusions
REFERENCES
Appendix I The Complete Summaries Under Each Concept Category
Appendix II The Most Representative Paragraphs of Tsunami News


Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it. Thesis access permission: available on campus after one year, never available off campus. Availability: Campus: available. Off-campus: not available.
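Printed copies
Public-access information for printed theses is relatively complete from academic year 102 onward. For printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. Availability: available.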
