論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available
論文名稱 Title |
混合式自動文件摘要方法 A hybrid approach to automatic text summarization |
||
系所名稱 Department |
|||
畢業學年期 Year, semester |
語文別 Language |
||
學位類別 Degree |
頁數 Number of pages |
53 |
|
研究生 Author |
|||
指導教授 Advisor |
|||
召集委員 Convenor |
|||
口試委員 Advisory Committee |
|||
口試日期 Date of Exam |
2007-09-28 |
繳交日期 Date of Submission |
2007-10-18 |
關鍵字 Keywords |
文件自動摘要、統計方法、語言學方法、文件分類 automatic text summarization, statistical approach, linguistic approach, text classification |
||
統計 Statistics |
本論文已被瀏覽 5935 次,被下載 4688 次 The thesis/dissertation has been browsed 5935 times and downloaded 4688 times. |
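中文摘要 Chinese abstract (English translation) |
Automatic text summarization can effectively save the time users spend reading large volumes of documents. The goal of summarization is to extract the essential, important sentences of the original text without losing its meaning, so that users can grasp the concepts a document covers from the summary alone and understand its content more quickly. This research proposes an automatic text summarization method, KCS, which computes the weights of a document's key terms and analyzes the relationships between terms to help select the sentences that carry the most information, thereby improving summary quality. The method consists of two main steps: 1) identify the important keywords in the document, using a statistical probabilistic model (K-mixture) to compute term weights; 2) identify noun-noun and noun-verb term relationships, then use these relationships, adjusted by distance, to determine the connective strength of nouns and select sentences accordingly, making the resulting summary more representative. Three experiments were conducted to validate the proposed method. Summaries are judged by the accuracy they yield in document classification, with the classifier fixed as Naive Bayes. The results show that the K-mixture probabilistic model outperforms TFIDF, the commonly used statistical method, but is still no better than the selected linguistic method (CS). KCS, which combines K-mixture with CS, substantially improves document classification accuracy; that is, KCS extracts more representative sentences, and its applicability and feasibility are thereby verified. |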
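The K-mixture step named in the abstract can be illustrated with a minimal sketch. It follows the standard formulation of Katz's K-mixture model (reference 15 in the thesis), in which the probability that a term occurs k times in a document is estimated from its collection frequency (cf), document frequency (df), and the number of documents (N). The function name and argument names are illustrative assumptions, and the abstract does not say how the thesis converts these probabilities into a term weight, so that step is left out.

```python
def k_mixture_prob(k: int, cf: int, df: int, n_docs: int) -> float:
    """Probability that a term occurs exactly k times in a document under
    the K-mixture model, with parameters estimated from corpus counts.

    cf     -- total occurrences of the term in the collection
    df     -- number of documents containing the term
    n_docs -- number of documents in the collection
    (All names are assumptions made for this sketch.)
    """
    if k < 0 or not (0 < df <= cf) or df > n_docs:
        raise ValueError("need k >= 0 and 0 < df <= cf with df <= n_docs")

    lam = cf / n_docs        # average occurrences per document
    beta = (cf - df) / df    # extra occurrences per document that contains the term

    if beta == 0:
        # Limiting case: the term never repeats inside a document,
        # so it appears either once (probability lam) or not at all.
        return {0: 1.0 - lam, 1: lam}.get(k, 0.0)

    alpha = lam / beta
    prob = alpha / (beta + 1) * (beta / (beta + 1)) ** k
    if k == 0:
        prob += 1.0 - alpha  # extra mass for documents without the term
    return prob
```

With these estimates the probability of zero occurrences equals 1 - df/N, which is a quick sanity check on the parameters. A term whose occurrences cluster in few documents (large beta) gets a heavier tail than TFIDF's single-document view would suggest.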
Abstract |
Automatic text summarization can save users considerable time when reading text documents. Its objective is to extract the essential sentences that cover almost all of a document's concepts, so that users can grasp the ideas the document addresses simply by reading the corresponding summary. This research develops a hybrid automatic text summarization approach, KCS, to enhance the quality of summaries. The approach consists of two major components: first, it employs the K-mixture probabilistic model to calculate term weights statistically; it then identifies term relationships between nouns and nouns as well as between nouns and verbs, which yield the connective strength (CS) of nouns. With the connective strengths available, sentence scores can be calculated and ranked for extraction. We conduct three experiments to validate the proposed approach. Summary quality is examined by its ability to increase the accuracy of text classification, while the classifier employed, Naïve Bayes, is kept the same across all experiments. The results show that the K-mixture model contributes more to document classification than the traditional TFIDF weighting scheme. It is, however, still no better than CS, a more complex linguistics-based approach. More importantly, the proposed approach, KCS, performs best among all approaches considered. This implies that KCS can extract more representative sentences from a document, and its feasibility in text summarization applications is thus justified. |
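The final step described above, scoring sentences and ranking them for extraction, might look roughly like the sketch below. It is not the thesis's KCS scoring: the abstract gives no formula for combining K-mixture weights with connective strength, so `term_weights` is a hypothetical dictionary standing in for whatever per-term score that combination produces, and the simple regex tokenizer is likewise an assumption.

```python
import re


def rank_sentences(sentences: list[str],
                   term_weights: dict[str, float],
                   top_n: int) -> list[str]:
    """Return the top_n highest-scoring sentences in document order.

    A sentence's score is the sum of the weights of its distinct terms;
    terms missing from term_weights count as zero.
    """
    scored = []
    for idx, sentence in enumerate(sentences):
        # Lowercase alphabetic tokens only; a real system would use a
        # proper tokenizer and part-of-speech tagger.
        terms = set(re.findall(r"[a-z]+", sentence.lower()))
        score = sum(term_weights.get(term, 0.0) for term in terms)
        scored.append((score, idx))

    # Highest score first; ties go to the earlier sentence.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_n]

    # Restore original order so the extract reads like the source text.
    return [sentences[idx] for idx in sorted(idx for _, idx in best)]
```

Returning the selected sentences in their original order is a design choice for readability of the extract; ranking order could be returned instead if only the scores matter, as in a classification-based evaluation.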
目次 Table of Contents |
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background and Motivation
  1.2 Research Objectives
  1.3 Thesis Organization
Chapter 2 Literature Review
  2.1 Text Mining
  2.2 Text Summarization
  2.3 Text Summarization Methods
  2.4 Text Classification
Chapter 3 Hybrid Automatic Summarization Method
  3.1 Preprocessing
  3.2 Computing Keyword Weights
  3.3 Determining Sentence Importance
  3.4 Generating the Summary
Chapter 4 Experiments and Results
  4.1 Validation of Summary Results
  4.2 Experimental Design
  4.3 Experiment 1
  4.4 Experiment 2
  4.5 Experiment 3
  4.6 Sample Summaries
Chapter 5 Conclusion
  5.1 Concluding Remarks
  5.2 Future Work
References
Appendix 1 Results of Experiment 1
Appendix 2 Results of Experiment 2
Appendix 3 Results of Experiment 3 |
電子全文 Fulltext |
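This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid breaking the law. |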
紙本論文 Printed copies |
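Public-access information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up public-access information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. 開放時間 Available: 已公開 available |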