National Sun Yat-sen University, thesis/dissertation, LIEF: An Algorithm for Learning Information Extraction Rules from Unstructured Documents

Title	LIEF: An Algorithm for Learning Information Extraction Rules from Unstructured Documents (original Chinese title: 從非結構化文件中學習資訊粹取法則之技術)
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 89	Language	English
Degree	Master	Number of pages	49
Author	Chih-Jen Pen (潘智仁)
Advisor	Chih-Ping Wei (魏志平)
Convenor	Fu-Ren Lin (林福仁)
Advisory Committee	Sheng-Tun Li (李昇暾)
Date of Exam	2001-07-23	Date of Submission	2001-08-02
Keywords	Information Extraction Rules, Positive Searching Path, Regular Expression, Unstructured Documents, Negative Searching Path, Information Extraction
Statistics	The thesis/dissertation has been browsed 5682 times and downloaded 3497 times.

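Chinese Abstract (English translation)
With the arrival of the Internet era, more and more information is stored in digital form, including all kinds of digitized documents, and these documents often contain a great deal of valuable information. However, because most digital documents exist in unstructured form, quickly obtaining useful information from large volumes of such documents has become an important problem. The traditional approach is to formulate information extraction rules and then apply them through an information extraction system. Producing these rules manually has many drawbacks; for example, it is very time-consuming. It is therefore desirable to generate the rules automatically. The learning strategies adopted by existing automatic rule generation techniques have blind spots, however, and they rarely achieve good results, especially on unstructured documents. This research therefore proposes a new learning strategy, learning from mistakes, to address the problems that existing strategies encounter. It also builds a prototype system that adopts this learning strategy and compares its effectiveness with benchmark techniques. The evaluation results show that the proposed learning strategy is clearly effective.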
Abstract
In the past, information was mostly stored in well-structured form in databases. Nowadays, a great deal of information is presented in unstructured formats. Managing and retrieving from such vast amounts of textual information has been a challenging issue for organizations and individuals. Information extraction is the process of extracting relevant data from semi-structured or unstructured documents and transforming them into structured representations. Many information extraction learning techniques have been proposed. However, they are ineffective on unstructured documents. Thus, in this research, we propose a new information extraction learning algorithm, called LIEF, that enhances existing information extraction learning techniques. In empirical evaluations on unstructured news documents, the proposed LIEF algorithm demonstrated its capability in terms of accuracy rate.
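To make the definition of information extraction above concrete, the following is a minimal sketch of a single hand-written extraction rule, expressed as a regular expression, that turns a sentence of free text into a structured record. The rule pattern, the field names (`company`, `person`, `position`), and the sample sentence are all hypothetical illustrations; they are not taken from the thesis and do not reproduce LIEF's rule representation or its learning procedure.

```python
import re

# Hypothetical extraction rule (an assumption for illustration, not a LIEF rule):
# it looks for "<Company> appointed <Person> as <position>" in free text.
RULE = re.compile(
    r"(?P<company>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*) appointed "
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) as "
    r"(?P<position>[a-z ]+?)(?: on|\.|,)"
)


def extract(text):
    """Return one structured record (a dict) for each match of RULE in text."""
    return [match.groupdict() for match in RULE.finditer(text)]


# Hypothetical unstructured input, e.g. one sentence from a news story.
news = "On Monday, Acme Holdings appointed Mary Chen as chief executive officer."
records = extract(news)
print(records)
```

Writing such rules by hand for every field and every phrasing is the time-consuming step that rule-learning algorithms aim to automate.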

Table of Contents
Abstract
中文摘要 (Chinese Abstract)
Table of Contents
List of Figures
List of Tables
Chapter 1. Introduction
  1.1 Background
  1.2 Research Motivation and Objective
  1.3 Research Process and Tasks
Chapter 2. Literature Review
  2.1 Information Extraction
  2.2 Learning of Information Extraction Rule
    2.2.1 WHISK Algorithm
    2.2.2 CRYSTAL Algorithm
Chapter 3. Problem Analysis
  3.1 Problems Inherent to Existing Information Extraction Learning Algorithm
  3.2 Selection of Learning Strategy
Chapter 4. Development of LIEF Algorithm
  4.1 Rule Representation
  4.2 Architecture of the LIEF Algorithm
  4.3 Learning Subsystem of LIEF
    4.3.1 Document Tagging
    4.3.2 Word Parser
    4.3.3 Rule Inductions
      4.3.3.1 Creating a new rule from a seed instance
      4.3.3.2 Growing a rule
      4.3.3.3 Extending a rule
      4.3.3.4 Prune Rules
  4.4 Reasoning Subsystem
Chapter 5. Empirical Evaluation
  5.1 Evaluation Design
    5.1.1 Data Collection
    5.1.2 Evaluation Criteria
    5.1.3 Evaluation Procedure
    5.1.4 Performance Benchmarks
  5.2 Comparative Evaluation
Chapter 6. Conclusions and Future Research Directions
References
Appendix A

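Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law. Thesis access permission: unrestricted (fully open on and off campus). Available: Campus: available; Off-campus: available. etd-0802101-100356.pdf
Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. To inquire about public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available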

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS