Title page for thesis etd-0524112-180320
Title
文字語料庫描述長度比較
Comparison of Description Length for Text Corpus
Department
Year, semester
Language
Degree
Number of pages
37
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2012-04-26
Date of Submission
2012-05-24
Keywords
grammar learning, context-free grammar, description length, parse tree, Stanford parser
Statistics
This thesis has been viewed 5658 times and downloaded 614 times.
Abstract
In this thesis, we compare the description lengths of different grammars and extend our earlier work on automatic grammar learning to grammars produced by the Stanford parser. In previous research, we showed how to learn a grammar automatically from a text corpus and compute its description length, using the Academia Sinica Balanced Corpus (ASBC). An encoding method based on the concept of data compression effectively reduces the description length of a text corpus. We also studied the description length of a corpus under two special cases of context-free grammar (CFG): exhaustive and recursive. The exhaustive grammar derives every distinct sentence that appears in the corpus, whereas the recursive grammar covers all strings. In this thesis, we use the Stanford parser to parse sentences and generate grammar rules, and we compare the description length of the machine-generated grammar with that of a manually refined grammar. In one experiment, we parse the ASBC with the Stanford parser; the resulting description length is 53.0 million bits, the vast majority of which is spent on derivations, with only 52,683 bits needed for the rules. In the other experiment, we parse the Sinica Treebank with the Stanford parser and compare the generated grammar with the treebank's original grammar. The original grammar of the Sinica Treebank requires a description length of 2.76 million bits, while the grammar generated by the Stanford parser requires 4.02 million bits.
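The split between rule cost and derivation cost, and the contrast between exhaustive and recursive grammars, can be illustrated with a small sketch. The encoding below (fixed-length codes for symbols and for rule choices, plus an end-of-rule marker), the toy corpus, and all function and variable names are assumptions made for illustration; the thesis's actual encoding method is not given in this record.

```python
import math


def description_length(rules, derivations, n_symbols):
    """Return (rule_bits, derivation_bits) under a simple fixed-length code.

    rules: dict mapping a nonterminal to a list of right-hand sides (tuples).
    derivations: one list of (nonterminal, rule_index) choices per sentence.
    n_symbols: number of distinct terminals and nonterminals.
    """
    # Assumed code: every symbol costs the same; +1 for an end-of-rule marker.
    per_symbol = math.log2(n_symbols + 1)
    rule_bits = sum(
        (1 + len(rhs) + 1) * per_symbol  # left-hand side, right-hand side, marker
        for alternatives in rules.values()
        for rhs in alternatives
    )
    # Each derivation step names one alternative of the expanded nonterminal.
    derivation_bits = sum(
        math.log2(len(rules[lhs]))
        for choices in derivations
        for lhs, _ in choices
    )
    return rule_bits, derivation_bits


# Hypothetical toy corpus of tokenized sentences.
corpus = [("the", "dog", "runs"), ("the", "cat", "runs"), ("the", "dog", "runs")]
vocab = sorted({word for sentence in corpus for word in sentence})
n_symbols = len(vocab) + 1  # terminals plus the single nonterminal S

# Exhaustive grammar: one rule S -> sentence for each distinct sentence.
distinct = sorted(set(corpus))
exhaustive = {"S": distinct}
exhaustive_derivs = [[("S", distinct.index(s))] for s in corpus]

# Recursive grammar: S -> w S | w for every word w (covers all nonempty strings).
recursive = {"S": [(w, "S") for w in vocab] + [(w,) for w in vocab]}
recursive_derivs = [
    [
        ("S", vocab.index(w) + (len(vocab) if i == len(s) - 1 else 0))
        for i, w in enumerate(s)
    ]
    for s in corpus
]

ex_rule_bits, ex_deriv_bits = description_length(exhaustive, exhaustive_derivs, n_symbols)
rec_rule_bits, rec_deriv_bits = description_length(recursive, recursive_derivs, n_symbols)

print(f"exhaustive: rules={ex_rule_bits:.2f} bits, derivations={ex_deriv_bits:.2f} bits")
print(f"recursive:  rules={rec_rule_bits:.2f} bits, derivations={rec_deriv_bits:.2f} bits")
```

On a corpus this small the numbers say little about real data; the sketch only shows how the total description length decomposes into a grammar part and a derivation part, which is the decomposition the abstract reports for the ASBC experiment.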
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Motivation
1.2 Background
1.3 Thesis Organization
Chapter 2 Automatic Grammar Learning and Description Length Comparison
2.1 Grammar Induction
2.2 Introduction to Context-Free Grammars
2.3 Description Length Analysis
2.4 Stanford Parser
Chapter 3 Experiments
3.1 Experimental Data
3.1.1 Academia Sinica Balanced Corpus
3.1.2 Sinica Treebank
3.2 Data Preprocessing
3.3 Experimental Results
Chapter 4 Conclusions and Future Work
Appendix A Part-of-Speech Tags
Appendix B Factored Model
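Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as not to violate the law.
Thesis access permission: user-defined release date
Available:
Campus: available
Off-campus: available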


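Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 onward. To inquire about printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available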
