Thesis/dissertation etd-1018107-100507: detailed record
Title page for etd-1018107-100507
Title
A hybrid approach to automatic text summarization (混合式自動文件摘要方法)
Department
Year, semester
Language
Degree
Number of pages
53
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2007-09-28
Date of Submission
2007-10-18
Keywords
automatic text summarization, statistical approach, linguistic approach, text classification
Statistics
This thesis/dissertation has been browsed 5934 times and downloaded 4688 times.
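Abstract (translated from Chinese)
Automatic text summarization can effectively save the time users spend reading large numbers of documents. The purpose of a summary is to extract the essential, important sentences from a document without losing its original meaning, so that users can grasp the concepts the document covers from the summary alone and understand its content more quickly.
This research proposes an automatic text summarization method, KCS, which computes keyword term weights and analyzes the relationships between terms in order to select the sentences that carry more information, thereby improving summary quality. The method consists of two major steps: 1) identify the important keywords in the document, using a statistical probabilistic model (K-mixture) to compute term weights; 2) identify the noun-noun and noun-verb term relationships, use these relationships with distance adjustment to obtain the connective strength of nouns, and select sentences accordingly so that the summary is more representative.
Three experiments are conducted to validate the proposed method. Summaries are judged by the accuracy they yield in text classification, with the classifier fixed as Naïve Bayes. The results show that the K-mixture probabilistic model outperforms TFIDF, the commonly used statistical method, but is still no better than the selected linguistic method (CS). KCS, which combines K-mixture and CS, substantially improves classification accuracy. That is, KCS extracts more representative sentences, and its applicability and feasibility are thereby verified.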
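As a rough illustration of the statistical step, the sketch below evaluates the standard K-mixture distribution, which models the probability that a term occurs exactly k times in a document from its collection frequency and document frequency. The function name `k_mixture_prob` and its parameter names are illustrative assumptions; this record does not state how the thesis turns these probabilities into term weights, so only the distribution itself is shown.

```python
def k_mixture_prob(k, cf, df, n_docs):
    """Probability that a term occurs exactly k times in a document
    under the K-mixture model.

    cf     : total occurrences of the term in the collection
    df     : number of documents containing the term
    n_docs : number of documents in the collection
    (All names here are illustrative, not taken from the thesis.)
    """
    lam = cf / n_docs          # mean occurrences per document
    beta = (cf - df) / df      # extra occurrences per containing document
    if beta == 0:
        # Limiting case: the term never repeats within a document.
        return {0: 1.0 - lam, 1: lam}.get(k, 0.0)
    alpha = lam / beta
    p = alpha / (beta + 1) * (beta / (beta + 1)) ** k
    if k == 0:
        p += 1.0 - alpha       # point mass for "term absent"
    return p


if __name__ == "__main__":
    # A term with 30 occurrences spread over 10 of 100 documents.
    for k in range(4):
        print(k, round(k_mixture_prob(k, cf=30, df=10, n_docs=100), 4))
```

One property worth noting: under this parameterization the probability of zero occurrences always equals 1 - df/n_docs, so the model reproduces the observed document frequency exactly.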
Abstract
Automatic text summarization can save users considerable time when reading text documents. Its objective is to extract the essential sentences that cover almost all the concepts of a document, so that users can comprehend the ideas the document addresses simply by reading the corresponding summary. This research develops a hybrid automatic text summarization approach, KCS, to enhance the quality of summaries.
The approach consists of two major components. First, it employs the K-mixture probabilistic model to calculate term weights in a statistical sense. It then identifies the term relationships between nouns and nouns as well as between nouns and verbs, which yield the connective strength (CS) of nouns. With the connective strengths available, sentence scores can be calculated and ranked to select the sentences to extract.
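The second component can be pictured with a minimal sketch. The exact definition of connective strength is not given in this record, so the code below assumes one plausible form: each noun accumulates 1/distance for every other noun or verb in the same sentence, and a sentence's score is the weighted sum of the connective strengths of its nouns. The tag names `"N"` and `"V"`, the functions `connective_strength` and `summarize`, and the default weight of 1.0 are all hypothetical.

```python
from collections import defaultdict


def connective_strength(tagged_sentences):
    """Assumed form of CS: for each noun, sum 1/distance to every other
    noun or verb in the same sentence, accumulated over all sentences."""
    cs = defaultdict(float)
    for sent in tagged_sentences:
        for i, (word_i, tag_i) in enumerate(sent):
            if tag_i != "N":
                continue
            for j, (_, tag_j) in enumerate(sent):
                if i != j and tag_j in ("N", "V"):
                    cs[word_i] += 1.0 / abs(i - j)
    return cs


def summarize(tagged_sentences, term_weight, n_sentences=2):
    """Score each sentence by the weighted CS of its nouns and return the
    indices of the top-scoring sentences in document order."""
    cs = connective_strength(tagged_sentences)
    scores = [
        sum(term_weight.get(w, 1.0) * cs[w] for w, tag in sent if tag == "N")
        for sent in tagged_sentences
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_sentences])


if __name__ == "__main__":
    doc = [
        [("cat", "N"), ("chases", "V"), ("mouse", "N")],
        [("it", "O"), ("rains", "V")],
        [("mouse", "N"), ("eats", "V"), ("cheese", "N"), ("quickly", "O")],
    ]
    print(summarize(doc, term_weight={}, n_sentences=2))
```

In a hybrid setup such as the one described, `term_weight` would carry the statistically derived weights, so that sentences are favored when their nouns are both well connected and statistically important.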
We conduct three experiments to validate the proposed approach. The quality of a summary is examined by its ability to increase the accuracy of text classification, while the classifier employed, the Naïve Bayes classifier, is kept the same throughout all experiments. The results show that the K-mixture model contributes more to document classification than the traditional TFIDF weighting scheme. It is, however, still no better than CS, a more complex linguistic approach. More importantly, our proposed approach, KCS, performs best among all approaches considered. This implies that KCS can extract more representative sentences from the document, and its feasibility in text summarization applications is thus justified.
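The evaluation idea, judging summaries by the classification accuracy they produce with a fixed classifier, can be sketched as follows. This is a small multinomial Naïve Bayes with add-one smoothing written from scratch for illustration. The class name `NaiveBayes`, the helper `accuracy`, and the toy data are assumptions, and the thesis's actual classifier settings and corpus are not described in this record.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_logp = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            logp = math.log(count / n_docs)
            for tok in tokens:
                if tok in self.vocab:  # ignore words unseen in training
                    logp += math.log(
                        (self.word_counts[label][tok] + 1) / (total + vocab_size)
                    )
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    hits = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return hits / len(labels)


if __name__ == "__main__":
    # Toy stand-ins for tokenized summaries produced by one method.
    train = [["goal", "match", "team"], ["team", "score"],
             ["stock", "market"], ["market", "price", "stock"]]
    train_y = ["sport", "sport", "finance", "finance"]
    test = [["team", "goal"], ["stock", "price"]]
    test_y = ["sport", "finance"]
    model = NaiveBayes().fit(train, train_y)
    print(accuracy(model, test, test_y))
```

To compare summarization methods in this style, the same classifier would be trained and tested once per method on that method's summaries, and the resulting accuracies compared.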
Table of Contents
Table of Contents ..... 4
List of Figures ..... 5
List of Tables ..... 6
Chapter 1 Introduction ..... 7
1.1 Background and Motivation ..... 7
1.2 Research Objectives ..... 8
1.3 Thesis Organization ..... 8
Chapter 2 Literature Review ..... 9
2.1 Text Mining ..... 9
2.2 Text Summarization ..... 10
2.3 Text Summarization Methods ..... 11
2.4 Text Classification ..... 17
Chapter 3 Hybrid Automatic Summarization Approach ..... 19
3.1 Preprocessing ..... 19
3.2 Computing Keyword Term Weights ..... 21
3.3 Determining Sentence Importance ..... 23
3.4 Generating the Summary ..... 26
Chapter 4 Experiments and Results ..... 27
4.1 Validation of Summaries ..... 27
4.2 Experimental Design ..... 28
4.3 Experiment 1 ..... 32
4.4 Experiment 2 ..... 34
4.5 Experiment 3 ..... 35
4.6 Example Summaries ..... 37
Chapter 5 Conclusion ..... 44
5.1 Concluding Remarks ..... 44
5.2 Future Work ..... 45
References ..... 46
Appendix 1 Results of Experiment 1 ..... 50
Appendix 2 Results of Experiment 2 ..... 51
Appendix 3 Results of Experiment 3 ..... 52
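Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid breaking the law.
Thesis access permission: unrestricted (fully open on and off campus)
Available:
Campus: available
Off-campus: available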


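Printed copies
Public information on printed theses is relatively complete from academic year 102 onward. To inquire about the public availability of printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available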
