博碩士論文 etd-0203110-093833 詳細資訊
Title page for etd-0203110-093833
論文名稱
Title
改善條件隨機域模型於中文斷詞
An Enhanced Conditional Random Field Model for Chinese Word Segmentation
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
101
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2010-01-18
繳交日期
Date of Submission
2010-02-03
關鍵字
Keywords
條件隨機域模型、中文斷詞、特徵模版
Conditional Random Fields (CRF), Chinese Word Segmentation, Feature Template
統計
Statistics
本論文已被瀏覽 5765 次,被下載 1418
The thesis/dissertation has been browsed 5765 times, has been downloaded 1418 times.
中文摘要
在中文裡,詞是有意義的最小語意單位。一個中文句子由許多的詞所
組成,而一個詞通常由一連串的字元所組成,且每個詞間並沒有使用
分隔符號分開詞與詞。在資訊擷取或資料探勘領域裡,使用文件裡的
中文詞之前,這些文件裡的句子都必須先被分割成正確的詞,稱為中
文斷詞。近年來有許多關於中文斷詞的研究,儘管有些研究達成的效
能已經非常高, 但對於未知詞的召回率仍然只有六成到七成。本論文
中,我們使用一個線性的條件隨機域模型以達成更準確的中文斷詞,
並提出兩種改善的特徵模版應用於條件隨機域模型以決定字元與字元
間的邊界; 另外,我們更提出三種方法:疊字處理、日期處理、及斷
詞精煉以改善初步分割的結果。在實驗中,我們使用三個機構提供的
不同語料庫,並使用數種不同方法測試,然後分別和由Li et al. 及Lau
and King 提出的方法相比較。實驗結果顯示利用前綴詞和後綴詞資訊的
特徵模版可以提高召回率以及準確率,在MSR機構提供的語料庫裡,
其F-measure值可達到0.964; 透過對連續單一字元的判斷,疊字也可在
不需使用額外資源的情況下被重新正確斷詞; 在日期的斷詞方面,透
過針對數字、日期、以及量詞的處理,錯誤的日期可被重新正確斷出;
如果分割出的詞與相對應標準語料庫的詞不同時,透過該詞和前後斷
詞的重組,可求出更適當的斷詞。對於以條件隨機域模型做中文斷詞
的研究與應用,我們提出了一個較佳的特徵模版以獲得較好的斷詞結
果,並提出其它三種方法處理特定的斷詞問題。
Abstract
In Chinese, the word is the smallest meaningful unit, and each word is composed of a sequence
of characters. A Chinese sentence is a sequence of words written without any delimiters
between them. In information retrieval and data mining, the sentences in a document must
therefore be split into the correct words before those words can be used; this process is
called Chinese word segmentation. Chinese word segmentation has been studied for many years.
Although some recent work achieves very high overall performance, the recall for unknown words
(words that are not in the dictionary) is still only about sixty to seventy percent. This
thesis uses linear-chain conditional random fields (CRFs) to obtain more accurate Chinese word
segmentation. The discriminatively trained model uses two proposed feature templates to decide
the boundaries between characters. We also propose three further methods, duplicate word
repartition, date representation repartition, and segment refinement, to improve the initial
segmentation. In the experiments, we test several different approaches on three different
Chinese word corpora and compare the results with those of the methods proposed by Li et al.
and by Lau and King. The results show that the improved feature template, which uses prefix
and suffix information, increases both recall and precision; for example, the F-measure
reaches 0.964 on the MSR dataset. By detecting repeated characters, duplicated characters can
be repartitioned correctly without extra resources. Wrongly segmented dates can be
repartitioned correctly by the proposed method, which handles numbers, dates, and measure
words. If a word is segmented differently from the corresponding standard segmentation corpus,
a more suitable segmentation can be obtained by reassembling the word with its adjacent
segments and repartitioning the result. For research on and applications of CRF-based Chinese
word segmentation, we propose a better feature template and three methods that address
specific segmentation problems.
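The abstract reports recall, precision, and F-measure without showing how they are computed. The following is a minimal sketch of the usual word-level definition, in which a predicted word counts as correct only if both of its boundaries match the reference segmentation. The function names and the example sentence are hypothetical and are not taken from the thesis; scoring is done per sentence for brevity, whereas a corpus-level score would pool the counts over all sentences.

```python
def word_spans(words):
    """Map a segmented sentence to a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def segmentation_prf(gold, predicted):
    """Return (precision, recall, F-measure) for one segmented sentence.

    Assumption: both segmentations cover the same character sequence.
    """
    gold_spans, pred_spans = word_spans(gold), word_spans(predicted)
    correct = len(gold_spans & pred_spans)  # words with both boundaries right
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical reference and system output for the same 11 characters.
gold = ["我們", "使用", "條件", "隨機域", "模型"]
pred = ["我們", "使用", "條件", "隨機", "域", "模型"]
p, r, f = segmentation_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Here four of the six predicted words match the five reference words, so precision is 4/6, recall is 4/5, and the F-measure is their harmonic mean, 8/11.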
目次 Table of Contents
Contents
List of Tables iii
List of Figures iv
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Challenge 2
1.3 Chapter Summary 3
Chapter 2 Related Research 4
2.1 Challenge 4
2.1.1 Inconsistent Word Segmentation Standards 4
2.1.2 Word Ambiguity 5
2.1.3 Unknown Word 7
2.2 Recent Researches in Chinese Word Segmentation 8
Chapter 3 Background Concept 13
3.1 Lexeme Tag 13
3.2 Conditional Random Field 15
3.2.1 Basic Principles 17
3.2.2 Linear-Chain CRFs 21
3.2.2.1 Basic Type of Features 23
3.2.2.2 Feature Template 24
3.2.3 Training 30
3.2.4 Inference 33
3.3 Measures 34
Chapter 4 The Proposed Method 36
4.1 Feature Template Selection 38
4.2 Duplicated Word Repartition 48
4.3 Date Representation Repartition 52
4.4 Segment Refinement 55
Chapter 5 Experiments 64
5.1 Experimental setup 64
5.2 Preprocess 65
5.3 Results and Discussions 66
5.3.1 Evaluation of Performance on Dataset MSR 67
5.3.2 Evaluation of Performance on Dataset PKU 72
5.3.3 Evaluation of Performance on Dataset CityU 77
Chapter 6 Conclusion and Future Works 82
6.1 Conclusion 82
6.2 Future Works 84
Appendix A Table of Notation 91
參考文獻 References
Bibliography
[1] Asahara, M., Fukuoka, K., Azuma, A., Goh, C., Watanabe, Y., Matsumoto,
Y., and Tsuzuki, T. (2005). Combination of machine learning
methods for optimum Chinese word segmentation. In Proceedings of the
Fourth SIGHAN Workshop on Chinese Language Processing, pages 134–137.
Jeju Island, Korea.
[2] Asahara, M., Goh, C., Wang, X., and Matsumoto, Y. (2003). Combining
segmenter and chunker for Chinese word segmentation. In Proceedings of
the 2nd SIGHAN Workshop on Chinese Language Processing, pages 144–
147. Sapporo, Japan.
[3] Byrd, R., Nocedal, J., and Schnabel, R. (1994). Representations of quasi-
Newton matrices and their use in limited memory methods. Mathematical
Programming, 63(1):129–156.
[4] Chen, K. and Bai, M. (1998). Unknown word detection for Chinese by a
corpus-based learning method. Computational Linguistics, 3(1):27–44.
[5] Chen, K. and Ma, W. (2002). Unknown word extraction for Chinese documents.
In Proceedings of the 19th international conference on Computational
linguistics-Volume 1, pages 1–7. Association for Computational
Linguistics Morristown, NJ, USA.
[6] Gao, J., Li, M., Huang, C., and Wu, A. (2005). Chinese word segmentation
and named entity recognition: A pragmatic approach. Computational
Linguistics, 31(4):531–574.
[7] Goh, C., Asahara, M., and Matsumoto, Y. (2005). Chinese word segmentation
by classification of characters. International Journal of Computational
Linguistics and Chinese Language Processing, 10(3):381–396.
[8] Klinger, R. and Tomanek, K. (2007). Classical probabilistic models and
conditional random fields. Technische Universität Dortmund, Dortmund,
Electronic Publication.
[9] Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random
fields: Probabilistic models for segmenting and labeling sequence
data. In Proceedings of the Eighteenth International Conference on
Machine Learning, pages 282–289.
[10] Lau, T. and King, I. (2005). Two-Phase LMR-RC Tagging for Chinese
Word Segmentation. In Proceedings of the Fourth SIGHAN Workshop on
Chinese Language Processing, pages 183–186. Jeju Island, Korea.
[11] Li, M., Gao, J., Huang, C., and Li, J. (2003). Unsupervised training
for overlapping ambiguity resolution in Chinese word segmentation. In
Proceedings of the second SIGHAN workshop on Chinese language processing,
pages 11–12. Sapporo, Japan.
[12] Li, Y., Miao, C., Bontcheva, K., and Cunningham, H. (2005). Perceptron
Learning for Chinese Word Segmentation. In Proceedings of Fourth
SIGHAN Workshop on Chinese Language processing (Sighan-05), pages
154–157. Jeju Island, Korea.
[13] Low, J., Ng, H., and Guo, W. (2005). A maximum entropy approach to
Chinese word segmentation. In Proceedings of the Fourth SIGHAN Workshop
on Chinese Language Processing, pages 161–164. Jeju Island, Korea.
[14] Lu, X. (2005). Towards a Hybrid Model for Chinese Word Segmentation.
In Proceedings of Fourth SIGHAN Workshop on Chinese Language
Processing, pages 189–192. Jeju Island, Korea.
[15] Luo, X., Sun, M., and Tsou, B. (2002). Covering ambiguity resolution
in Chinese word segmentation based on contextual information. In Proceedings
of the 19th international conference on Computational linguistics-
Volume 1, page 7. Association for Computational Linguistics Morristown,
NJ, USA.
[16] Ma, W. and Chen, K. (2003). A bottom-up merging algorithm for Chinese
unknown word extraction. In Proceedings of the second SIGHAN
workshop on Chinese language processing-Volume 17, page 38. Association
for Computational Linguistics Morristown, NJ, USA.
[17] Malouf, R. (2002). A comparison of algorithms for maximum entropy
parameter estimation. In International Conference On Computational Linguistics,
pages 1–7. Association for Computational Linguistics Morristown,
NJ, USA.
[18] Peng, F., Feng, F., and McCallum, A. (2004). Chinese segmentation and
new word detection using conditional random fields. In Proceedings of
the 20th international conference on Computational Linguistics, page 562.
Association for Computational Linguistics Morristown, NJ, USA.
[19] Peng, F. and McCallum, A. (2006). Information extraction from research
papers using conditional random fields. Information Processing and Management,
42(4):963–979.
[20] Pietra, S., Pietra, V., and Lafferty, J. (1997). Inducing features of random
fields. IEEE Transactions on Pattern Analysis and Machine Intelligence,
19(4):380–393.
[21] Rabiner, L. (1990). A tutorial on hidden Markov models and selected
applications in speech recognition. Readings in speech recognition,
53(3):267–296.
[22] Schwartz, R. and Chow, Y. (1990). The n-best algorithm: An efficient
and exact procedure for finding the n most likely sentence hypotheses. Proceedings
of International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 1:81–84.
[23] Sha, F. and Pereira, F. (2003). Shallow parsing with conditional random
fields. In Proceedings of HLT-NAACL, volume 1, pages 134–141. Association
for Computational Linguistics Morristown, NJ, USA.
[24] Sutton, C. and McCallum, A. (2007). An Introduction to Conditional
Random Fields for Relational Learning. Introduction to statistical relational
learning, page 93.
[25] Tsai, R., Hung, H., Sung, C., Dai, H., and Hsu, W. (2006). On closed
task of Chinese word segmentation: An improved CRF model coupled with
character clustering and automatically generated template matching. In
Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing,
pages 108–117. Sydney, Australia.
[26] Tseng, H., Chang, P., Andrew, G., Jurafsky, D., and Manning, C. (2005).
A conditional random field word segmenter for Sighan bakeoff 2005. In
Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing,
pages 168–171. Jeju Island, Korea.
[27] Xue, N. (2003). Chinese word segmentation as character tagging. Computational
Linguistics and Chinese Language Processing, 8(1):29–48.
[28] Xue, N. and Shen, L. (2003). Chinese word segmentation as LMR tagging.
In Proceedings of the second SIGHAN workshop on Chinese language
processing-Volume 17, pages 176–179. Association for Computational
Linguistics Morristown, NJ, USA.
[29] Zhang, K., Liu, Q., Zhang, H., and Cheng, X. (2002). Automatic recognition
of Chinese unknown words based on roles tagging. In Proceedings
of the first SIGHAN workshop on Chinese language processing-Volume 18,
page 7. Association for Computational Linguistics Morristown, NJ, USA.
[30] Zhang, R., Kikui, G., and Sumita, E. (2006). Subword-based tagging by
conditional random fields for Chinese word segmentation. In Proceedings
of the Human Language Technology Conference of the NAACL, Companion
Volume: Short Papers, pages 193–196. Association for Computational
Linguistics.
[31] Zhang, Y. and Clark, S. (2007). Chinese segmentation with a word-based
perceptron algorithm. In Proceedings of the 45th Annual Meeting of the
Association for Computational Linguistics, page 840. Association for
Computational Linguistics Morristown, NJ, USA.
[32] Zhao, H., Huang, C., and Li, M. (2006). An improved Chinese word
segmentation system with conditional random field. In Proceedings of the
Fifth SIGHAN Workshop on Chinese Language Processing, pages 162–165.
Sydney, Australia.
[33] Zheng, J. and Wu, F. (1999). Study on segmentation of ambiguous
phrases with the combinatorial type. Collections of Papers on Computational
Linguistics, pages 129–134.
[34] 国家标准化管理委员会(1992). 信息处理用现代汉语分词规范. 中国
标准出版社.
[35] 朱怡霖(2001). 中文斷詞與專有名詞辨識之研究. 台大資工,碩士論文.
[36] 詞庫小組(1995). 研究院語料庫的內容及說明. Technical report,中央
研究院.
電子全文 Fulltext
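This electronic full text is licensed to users solely for academic research, for personal, non-profit searching, reading, and printing. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid breaking the law.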
論文使用權限 Thesis access permission: available on campus immediately; withheld off campus for one year
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
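Public information on printed theses is relatively complete from academic year 102 onward. To look up public information on printed theses from academic year 101 or earlier, please contact the printed thesis service counter of the library and information office (圖資處). We apologize for any inconvenience.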
開放時間 Available: 已公開 available
