國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,改善條件隨機域模型於中文斷詞,An Enhanced Conditional Random Field Model for Chinese Word Segmentation

論文名稱 Title	改善條件隨機域模型於中文斷詞 An Enhanced Conditional Random Field Model for Chinese Word Segmentation
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	98 學年度第 1 學期 The fall semester of Academic Year 98	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	101
研究生 Author	黃昭銘 Jhao-ming Huang
指導教授 Advisor	林葭華 Chia-hua Lin
召集委員 Convenor	李宗南 Tsung-nan Li
口試委員 Advisory Committee	陳嘉平 Chia-ping Chen
口試日期 Date of Exam	2010-01-18	繳交日期 Date of Submission	2010-02-03
關鍵字 Keywords	條件隨機域模型、中文斷詞、特徵模版 Conditional Random Fields (CRF), Chinese Word Segmentation, Feature Template
統計 Statistics	本論文已被瀏覽 5765 次，被下載 1418 次 The thesis/dissertation has been browsed 5765 times, has been downloaded 1418 times.

中文摘要
在中文裡，詞是有意義的最小語意單位。一個中文句子由許多的詞所組成，而一個詞通常由一連串的字元所組成，且每個詞間並沒有使用分隔符號分開詞與詞。在資訊擷取或資料探勘領域裡，使用文件裡的中文詞之前，這些文件裡的句子都必須先被分割成正確的詞，稱為中文斷詞。近年來有許多關於中文斷詞的研究，儘管有些研究達成的效能已經非常高, 但對於未知詞的召回率仍然只有六成到七成。本論文中，我們使用一個線性的條件隨機域模型以達成更準確的中文斷詞，並提出兩種改善的特徵模版應用於條件隨機域模型以決定字元與字元間的邊界; 另外，我們更提出三種方法：疊字處理、日期處理、及斷詞精煉以改善初步分割的結果。在實驗中，我們使用三個機構提供的不同語料庫，並使用數種不同方法測試，然後分別和由Li et al. 及Lau and King 提出的方法相比較。實驗結果顯示利用前綴詞和後綴詞資訊的特徵模版可以提高召回率以及準確率，在MSR機構提供的語料庫裡，其F-measure值可達到0.964; 透過對連續單一字元的判斷，疊字也可在不需使用額外資源的情況下被重新正確斷詞; 在日期的斷詞方面，透過針對數字、日期、以及量詞的處理，錯誤的日期可被重新正確斷出; 如果分割出的詞與相對應標準語料庫的詞不同時，透過該詞和前後斷詞的重組，可求出更適當的斷詞。對於以條件隨機域模型做中文斷詞的研究與應用，我們提出了一個較佳的特徵模版以獲得較好的斷詞結果，並提出其它三種方法處理特定的斷詞問題。
Abstract
In Chinese, the word is the smallest meaningful semantic unit. A Chinese sentence consists of a sequence of words, and each word consists of a sequence of characters, with no delimiter between adjacent words. In information retrieval and data mining, a sentence must therefore be split into the correct words before those words can be used; this process is called Chinese word segmentation. Chinese word segmentation has been studied for many years. Although some recent work achieves very high overall performance, the recall on out-of-vocabulary (unknown) words is still only about 60 to 70 percent. This thesis uses linear-chain conditional random fields (CRFs) to achieve more accurate Chinese word segmentation. The discriminatively trained model uses two feature templates that we propose to decide the boundaries between characters. We also propose three further methods, duplicate word repartition, date representation repartition, and segment refinement, to improve the initial segmentation. In the experiments, we evaluate several configurations on three Chinese word corpora provided by different organizations and compare the results with the methods of Li et al. and of Lau and King. The results show that the improved feature template, which uses prefix and suffix information, increases both recall and precision; for example, the F-measure reaches 0.964 on the MSR dataset. By examining runs of consecutive single characters, duplicated characters can be resegmented correctly without extra resources. Wrongly segmented dates can be repartitioned correctly by a method that handles numbers, dates, and measure words. When a segment differs from the corresponding word in the standard segmentation corpus, a more appropriate segmentation can be obtained by joining that segment with its adjacent segments and repartitioning the result. For CRF-based Chinese word segmentation, we thus contribute an improved feature template that yields better segmentation results, together with three methods that address specific segmentation problems.
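The recall, precision, and F-measure quoted in the abstract are word-level scores. As a minimal sketch, assuming the common convention that a predicted word counts as correct only when both of its boundaries match the gold segmentation, the scores for one sentence could be computed as follows. The function names `word_spans` and `segmentation_scores` and the sample sentence are illustrative assumptions, not taken from the thesis, whose own scoring procedure is not shown in this record.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmented sentence to a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_scores(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for one segmented sentence.

    Assumption: a predicted word is correct only if both its start and end
    offsets match a word in the gold segmentation.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must cover the same characters")
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: the segmenter wrongly splits the word "斷詞" in two.
gold = ["中文", "斷詞", "很", "重要"]
predicted = ["中文", "斷", "詞", "很", "重要"]
p, r, f = segmentation_scores(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

In this example three of the five predicted words match the four gold words, so precision is 3/5 and recall is 3/4. Comparing character-offset spans rather than word strings keeps repeated words in a sentence from being matched against the wrong occurrence.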

目次 Table of Contents
Contents
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Motivation
  1.2 Challenge
  1.3 Chapter Summary
Chapter 2 Related Research
  2.1 Challenge
    2.1.1 Inconsist Word Segmentation Standards
    2.1.2 Word Ambiguity
    2.1.3 Unknown Word
  2.2 Recent Researches in Chinese Word Segmentation
Chapter 3 Background Concept
  3.1 Lexeme Tag
  3.2 Conditional Random Field
    3.2.1 Basic Principles
    3.2.2 Linear-Chain CRFs
      3.2.2.1 Basic Type of Features
      3.2.2.2 Feature Template
    3.2.3 Training
    3.2.4 Inference
  3.3 Measures
Chapter 4 The Proposed Method
  4.1 Feature Template Selection
  4.2 Duplicated Word Repartition
  4.3 Date Representation Repartition
  4.4 Segment Refinement
Chapter 5 Experiments
  5.1 Experimental setup
  5.2 Preprocess
  5.3 Results and Discussions
    5.3.1 Evaluation of Performance on Dataset MSR
    5.3.2 Evaluation of Performance on Dataset PKU
    5.3.3 Evaluation of Performance on Dataset CityU
Chapter 6 Conclusion and Future Works
  6.1 Conclusion
  6.2 Future Works
Appendix A Table of Notation

參考文獻 References
[1] Asahara, M., Fukuoka, K., Azuma, A., Goh, C., Watanabe, Y., Matsumoto, Y., and Tsuzuki, T. (2005). Combination of machine learning methods for optimum Chinese word segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 134–137. Jeju Island, Korea.
[2] Asahara, M., Goh, C., Wang, X., and Matsumoto, Y. (2003). Combining segmenter and chunker for Chinese word segmentation. In Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing, pages 144–147. Sapporo, Japan.
[3] Byrd, R., Nocedal, J., and Schnabel, R. (1994). Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63(1):129–156.
[4] Chen, K. and Bai, M. (1998). Unknown word detection for Chinese by a corpus-based learning method. Computational Linguistics, 3(1):27–44.
[5] Chen, K. and Ma, W. (2002). Unknown word extraction for Chinese documents. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1, pages 1–7. Association for Computational Linguistics, Morristown, NJ, USA.
[6] Gao, J., Li, M., Huang, C., and Wu, A. (2005). Chinese word segmentation and named entity recognition: A pragmatic approach. Computational Linguistics, 31(4):531–574.
[7] Goh, C., Asahara, M., and Matsumoto, Y. (2005). Chinese word segmentation by classification of characters. International Journal of Computational Linguistics and Chinese Language Processing, 10(3):381–396.
[8] Klinger, R. and Tomanek, K. (2007). Classical probabilistic models and conditional random fields. Technische Universität Dortmund, Dortmund. Electronic Publication.
[9] Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282–289.
[10] Lau, T. and King, I. (2005). Two-Phase LMR-RC Tagging for Chinese Word Segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 183–186. Jeju Island, Korea.
[11] Li, M., Gao, J., Huang, C., and Li, J. (2003). Unsupervised training for overlapping ambiguity resolution in Chinese word segmentation. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pages 11–12. Sapporo, Japan.
[12] Li, Y., Miao, C., Bontcheva, K., and Cunningham, H. (2005). Perceptron Learning for Chinese Word Segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing (SIGHAN-05), pages 154–157. Jeju Island, Korea.
[13] Low, J., Ng, H., and Guo, W. (2005). A maximum entropy approach to Chinese word segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 161–164. Jeju Island, Korea.
[14] Lu, X. (2005). Towards a Hybrid Model for Chinese Word Segmentation. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 189–192. Jeju Island, Korea.
[15] Luo, X., Sun, M., and Tsou, B. (2002). Covering ambiguity resolution in Chinese word segmentation based on contextual information. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1, page 7. Association for Computational Linguistics, Morristown, NJ, USA.
[16] Ma, W. and Chen, K. (2003). A bottom-up merging algorithm for Chinese unknown word extraction. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, Volume 17, page 38. Association for Computational Linguistics, Morristown, NJ, USA.
[17] Malouf, R. (2002). A comparison of algorithms for maximum entropy parameter estimation. In International Conference on Computational Linguistics, pages 1–7. Association for Computational Linguistics, Morristown, NJ, USA.
[18] Peng, F., Feng, F., and McCallum, A. (2004). Chinese segmentation and new word detection using conditional random fields. In Proceedings of the 20th International Conference on Computational Linguistics, page 562. Association for Computational Linguistics, Morristown, NJ, USA.
[19] Peng, F. and McCallum, A. (2006). Information extraction from research papers using conditional random fields. Information Processing and Management, 42(4):963–979.
[20] Pietra, S., Pietra, V., and Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393.
[21] Rabiner, L. (1990). A tutorial on hidden Markov models and selected applications in speech recognition. Readings in Speech Recognition, 53(3):267–296.
[22] Richard, S. and Yen, L. (1990). The n-best algorithm: An efficient and exact procedure for finding the n most likely sentence. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1:81–84.
[23] Sha, F. and Pereira, F. (2003). Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL, volume 1, pages 134–141. Association for Computational Linguistics, Morristown, NJ, USA.
[24] Sutton, C. and McCallum, A. (2007). An Introduction to Conditional Random Fields for Relational Learning. Introduction to Statistical Relational Learning, page 93.
[25] Tsai, R., Hung, H., Sung, C., Dai, H., and Hsu, W. (2006). On closed task of Chinese word segmentation: An improved CRF model coupled with character clustering and automatically generated template matching. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 108–117. Sydney, Australia.
[26] Tseng, H., Chang, P., Andrew, G., Jurafsky, D., and Manning, C. (2005). A conditional random field word segmenter for SIGHAN bakeoff 2005. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 168–171. Jeju Island, Korea.
[27] Xue, N. (2003). Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, 8(1):29–48.
[28] Xue, N. and Shen, L. (2003). Chinese word segmentation as LMR tagging. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, Volume 17, pages 176–179. Association for Computational Linguistics, Morristown, NJ, USA.
[29] Zhang, K., Liu, Q., Zhang, H., and Cheng, X. (2002). Automatic recognition of Chinese unknown words based on roles tagging. In Proceedings of the First SIGHAN Workshop on Chinese Language Processing, Volume 18, page 7. Association for Computational Linguistics, Morristown, NJ, USA.
[30] Zhang, R., Kikui, G., and Sumita, E. (2006). Subword-based tagging by conditional random fields for Chinese word segmentation. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 193–196. Association for Computational Linguistics.
[31] Zhang, Y. and Clark, S. (2007). Chinese segmentation with a word-based perceptron algorithm. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, page 840. Association for Computational Linguistics, Morristown, NJ, USA.
[32] Zhao, H., Huang, C., and Li, M. (2006). An improved Chinese word segmentation system with conditional random field. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 162–165. Sydney, Australia.
[33] Zheng, J. and Wu, F. (1999). Study on segmentation of ambiguous phrases with the combinatorial type. Collections of Papers on Computational Linguistics, pages 129–134.
[34] 国家标准化管理委员会 (1992). 信息处理用现代汉语分词规范. 中国标准出版社.
[35] 朱怡霖 (2001). 中文斷詞與專有名詞辨識之研究. 台大資工, 碩士論文.
[36] 詞庫小組 (1995). 研究院語料庫的內容及說明. Technical report, 中央研究院.

電子全文 Fulltext
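This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission	校內立即公開，校外一年後公開 Available on campus immediately; available off campus after one year
開放時間 Available	校內 Campus：已公開 available；校外 Off-campus：已公開 available
etd-0203110-093833.pdf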
紙本論文 Printed copies
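Availability information for printed theses is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available	已公開 available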
