博碩士論文 etd-0707105-132553 詳細資訊
Title page for etd-0707105-132553
論文名稱
Title
以漸進式方法探究資訊涵義之研究
Concept Extraction With Change Detection From Navigated Information
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
67
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2005-06-29
繳交日期
Date of Submission
2005-07-07
關鍵字
Keywords
潛在語意索引、概念變化追蹤、網際網路、概念萃取
Internet, Concepts Extraction, Latent Semantic Indexing, Concept Change Detection
中文摘要
網路上資訊如洪流般不斷的產生,一般必須透過搜尋引擎來蒐集特定的資訊。然而若欲瞭解所蒐集資訊涵蓋的概念,將所有的資訊都瀏覽的方法既費時又不一定得到完整的概念。取而代之的應是使用自動概念萃取的方法來輔助使用者全盤瞭解的所蒐集資訊的涵義。
在本研究中,我們提出一套方法從收集的文件中萃取出概念並偵測隨著時間概念的演進變化。本方法分為兩階段,第一階段藉由潛在語意索引將段落群聚起來並且找出關鍵的字詞,以段落摘要和相關的字詞來充分表達概念,讓使用者可以容易地了解資訊概念的涵義;第二階段為演進式的概念結構修改來代表概念的演進,當新的資訊不斷的加入後,概念可能因此而結合、分裂,甚至產生新的概念。
本研究提出三個實驗來驗證所提出之方法,前兩個實驗的結果都和既定答案有相當高的精確度和回覆率,最後一個實驗是以實際個案(南亞大海嘯)來說明所提方法可以萃取出海嘯報導的整體概念與其演進。這些實驗因此驗證本研究方法的適用性。
Abstract
To cope with the flood of information on the Internet, we usually rely on search engines to gather information on a specific topic. Search engines are convenient, but their functions are limited. For example, browsing through everything a search returns is impractical, so it is hard to gain an overall picture of what the collected information is about. What is needed is an approach that automatically extracts concepts from the navigated information, so that users can quickly gain a basic understanding of a topic that interests them.
In this research, we propose an approach that extracts concepts from navigated web information and detects how those concepts change over time. The approach has two stages. In the first stage, the information is decomposed into paragraphs, the paragraphs are clustered, and key terms are identified with the aid of latent semantic indexing. Each concept is represented by a paragraph summary and its associated key terms, which lets the user easily comprehend what the concept describes. In the second stage, the concept structure is adaptively modified to reflect concept changes: as new information is added, concepts may merge, split, or newly emerge over time.
Three experiments are conducted to verify the proposed approach. The results of the first and second experiments show both high recall and high precision against the predefined concept categories. The third is an illustrative real-world application to news reports on the South Asian tsunami. It shows that, using our approach, we can readily grasp the different concepts in the tsunami reports and follow how they change. These experiments thus support the feasibility of the approach.
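As an illustration only, and not the thesis's actual implementation, the first stage described in the abstract (clustering paragraphs with the aid of latent semantic indexing) might be sketched as follows in Python with NumPy. The toy term-by-paragraph matrix, the function names, the choice of k = 2 latent dimensions, and the 0.8 similarity threshold are all assumptions made for this sketch. The simple threshold-based grouping is a stand-in for whatever clustering procedure the thesis uses.

```python
import numpy as np

def lsi_paragraph_vectors(term_paragraph, k):
    """Project paragraphs into a k-dimensional latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(term_paragraph, full_matrices=False)
    # Keep the k largest singular values; each row of the result is one paragraph.
    return (np.diag(s[:k]) @ vt[:k]).T

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

def threshold_clusters(sim, threshold):
    """Group paragraphs that are linked by similarity >= threshold (hypothetical rule)."""
    n = sim.shape[0]
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i, j] >= threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels

# Invented toy data: rows are noun terms, columns are four paragraphs.
terms = ["tsunami", "wave", "earthquake", "aid", "donation", "relief"]
term_paragraph = np.array([
    [1, 1, 0, 0],   # tsunami
    [1, 1, 0, 0],   # wave
    [1, 0, 0, 0],   # earthquake
    [0, 0, 2, 1],   # aid
    [0, 0, 1, 0],   # donation
    [0, 0, 1, 1],   # relief
], dtype=float)

vectors = lsi_paragraph_vectors(term_paragraph, k=2)
similarity = cosine_similarity_matrix(vectors)
labels = threshold_clusters(similarity, threshold=0.8)
print(labels)
```

On this toy input the two disaster-description paragraphs fall into one group and the two relief paragraphs into another. In a fuller sketch, each group would then be summarized and paired with its highest-weighted terms to form a concept.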
目次 Table of Contents
CHAPTER 1 Introduction 1
1.1 Overview 1
1.2 Objective of the research 2
1.3 Organization of the Thesis 2
CHAPTER 2 Literature Review 4
2.1 Information retrieval 4
2.1.1 Probabilistic model 4
2.1.2 Vector space model 5
2.1.3 Latent semantic indexing 5
2.2 Text mining 7
2.2.1 Document clustering 8
2.2.2 Text summarization 8
2.2.3 Concept Extraction 9
2.3 Cluster analysis 11
2.3.1 Partitioning clustering 11
2.3.2 Hierarchical clustering 12
2.3.3 Incremental clustering 13
CHAPTER 3 Proposed Approach 15
3.1 Concept Extraction Stage 16
3.1.1 Information Preprocessing 16
(1) Part-of-speech tagging 17
(2) Noun stemming 17
(3) Term Frequency Calculating 18
3.1.2 Similarity Calculating 19
3.1.3 Paragraph Clustering 21
3.1.4 Concept Extracting 22
3.2 Concept Change Analysis Stage 24
3.2.1 New Paragraphs Joining 24
3.2.2 Intra Cluster Reorganizing 26
3.2.3 Detecting concept changes 27
(1) Slightly Changing 28
(2) Merging 28
(3) Splitting 30
3.2.4 Structure Re-Building 30
CHAPTER 4 Experiments and Results 32
4.1 Experiment I 32
4.2 Experiment II 38
4.3 An Illustrated Application 41
CHAPTER 5 Conclusions 46
REFERENCES 48
Appendix I The Complete Summaries Under Each Concept Category 52
Appendix II The Most Representative Paragraphs of Tsunami News 56
參考文獻 References
賴志民,網際網路上資訊涵意探究與資訊變化追蹤之研究,中山大學資訊管理研究所碩士論文,民91
張家豪,網際網路搜尋資訊之涵意探究及其變化偵測,中山大學資訊管理研究所碩士論文,民93
Angheluta, R., Debusser, R. and Moens, M., “The Use of Topic Segmentation for Automatic Summarization,” In Proceedings of the ACL-2002 Post-Conference Workshop on Automatic Summarization, 2002
Boley, D., Gini, M., Gross, R., Han, E., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., and Moore, J., “Partitioning-based Clustering for Web Document Categorization,” Decision Support Systems, Vol. 27, No. 3, 1999, pp.329-341.
Cattell, R. B., “The Scree Test for the Number of Factors,” Multivariate Behavioral Research, No. 1, 1966, pp. 140-160
Chen, C. Y., Hwang, S. C., and Oyang, Y. J., “An Incremental Hierarchical Data Clustering Algorithm Based on Gravity Theory,” PAKDD 2002, 2002, pp. 237-250
Chen, H. and Lynch, K. J., “Automatic Construction of Networks of Concepts Characterizing Document Databases,” IEEE Transactions on Systems, Man and Cybernetics, Vol. 22, No. 5, 1992, pp. 885-902
Cowie, J. and Lehnert, W., “Information Extraction,” Communications of the ACM, Vol. 39, No. 1, January, 1996, pp. 80-91
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R., “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science, Vol. 41, No. 6, 1990
Dorre, J., Gerstl, P., and Seiffert, R., “Text Mining: Finding Nuggets in Mountains of Textual Data,” Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, California, August 15-18, 1999, pp. 398-401
Dumais, S., “Improving the Retrieval of Information from External Sources,” Behavior Research Methods, Instruments, & Computers, Vol. 23, No. 2, 1991, pp. 229-236
Everitt, B., Cluster Analysis, New York, Halsted Press, 1980
Gee, K., “Using Latent Semantic Indexing to Filter Spam,” In Proceedings of the 2003 ACM Symposium on Applied Computing, 2003, pp. 460-464
Golub, G. and Van Loan, C., Matrix Computations, Johns Hopkins University Press, Baltimore, Maryland, Second Edition, 1989

Grobelnik, M., Mladenic, D., and Milic-Frayling, N., “Text Mining as Integration of Several Related Research Areas: Report on KDD'2000 Workshop on Text Mining,” SIGKDD Explorations, Vol. 2, No. 2, 2000, pp. 99-102
Halliday, M. A. K. and Hasan, R., Cohesion in English, Longman, 1976
Huffman, S., “Learning Information Extraction Patterns from Examples,” In IJCAI 1995 Workshop on New Approaches to Learning for Natural Language Processing, 1995, pp. 127-142
Allan, J., Carbonell, J., Doddington, G., Yamron, J., and Yang, Y., “Topic detection and tracking pilot study: Final report,” In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, 1998
Jain, A.K., Murty, M.N., Flynn, P.J., “Data Clustering: A Review,” ACM Computing Surveys, Vol. 31, No. 3, 1999
Kobayashi, M. and Takeda K., “Information Retrieval on The Web,” ACM Computing Surveys, Vol. 32, No. 2, 2000, pp. 144-173
Kontostathis, A. and Pottenger, W. M., “Detecting Patterns in the LSI Term-Term Matrix,” Workshop on the Foundation of Data Mining and Discovery, The 2002 IEEE International Conference on Data Mining, 2002, pp. 243-248
Kupiec, J., Pedersen, J., and Chen, F., “A Trainable Document Summarizer,” Proceedings of the 18th ACM-SIGIR Conference, 1995, pp.68-73
Larsen, B. and Aone, C., “Fast and Effective Text Mining Using Linear-time Document Clustering,” Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp.16-22
Luhn, H. P., “A Statistical Approach to Mechanized Encoding and Searching of Literary Information,” IBM Journal of Research and Development, Vol. 1, No. 4, 1957, pp. 309-317
Luhn, H. P., “The Automatic Creation of Literature Abstracts,” IBM Journal of Research and Development, Vol. 2, No. 2, 1958
MacQueen, J., “Some Methods for Classification and Analysis of Multivariate Observations,” In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 1967, pp. 281-298
Mani, I., Automatic Summarization, John Benjamins, Amsterdam, Philadelphia, 2001
Merialdo, B., “Tagging Text With A Probabilistic Model,” In IEEE International Conference on Acoustics, Speech and Signal Processing, 1991

Morris, J. and Hirst, G., “Lexical Cohesion Computed by Thesaural Relations as Indicator of the Structure of Text,” Computational Linguistics, Vol. 17, No. 1, 1991, pp 21-48
Ohsawa, Y., “The Scope of Chance Discovery,” New Frontiers in Artificial Intelligence: Joint JSAI 2001 Workshop Post-Proceedings, 2001, pp 413
Rasmussen, E., “Clustering Algorithms,” In Information Retrieval: Data Structures and Algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, 1992, pp. 419-442
Cummins, R. and O'Riordan, C., “Evolving, Analysing and Improving Global Term-Weighting Schemes in Information Retrieval,” Technical Report, Dept. of Information Retrieval, NUI, Galway, 2004
Robertson, S. E. and Jones, K. S., “Relevance Weighting of Search Terms,” Journal of the American Society for Information Science, Vol. 27, No. 3, 1976, pp. 129-146
Roussinov, D. G. and Chen, H., “Document Clustering for Electronic Meetings: An Experimental Comparison of Two Techniques,” Decision Support Systems, Vol. 27, No. 1-2, 1999, pp. 67-80
Ruge, G., “Combining Corpus Linguistics and Human Memory Models for Automatic Term Association,” AI Group, Institut fuer Informatik, TU Muenchen. Natural Language Information Retrieval, Kluwer Academic Publishers, 1997
Salton, G. and Buckley, C., “Term Weighting Approaches in Automatic Text Retrieval,” Information Processing and Management, Vol. 24, No. 5, 1988, pp. 513-523
Salton, G., Wong, A., and Yang, C. S., “A Vector Space Model for Automatic Indexing,” Communications of the ACM, Vol. 18, 1975, pp. 613-620
Salton, G., Singhal, A., Mitra, M., and Buckley, C., “Automatic Text Structuring and Summarization,” Information Processing and Management: an International Journal, Vol. 33, No. 2, March 1997, pp. 193-207
Sp
電子全文 Fulltext
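This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.
論文使用權限 Thesis access permission:校內一年後公開,校外永不公開 (available on campus after one year; never available off campus)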
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus:永不公開 not available

紙本論文 Printed copies
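Public-access information for printed theses is relatively complete from academic year 102 onward. To inquire about printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.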
開放時間 available 已公開 available
