Title page for etd-0522112-103522
Title
簡化型中文注音輸入法
Chinese input method based on reduced phonetic transcription
Department
Year, semester
Language
Degree
Number of pages
45
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2012-04-26
Date of Submission
2012-05-22
Keywords
Chinese input method, smoothing, language model, dynamic programming
Statistics
This thesis has been viewed 5650 times and downloaded 424 times.
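Chinese abstract (English translation)
This thesis proposes a reduced Mandarin phonetic (Zhuyin) input method to improve input
efficiency. Whereas the traditional Zhuyin input method requires the complete phonetic spelling
of every character, the proposed method requires only the first phonetic symbol of each
character and converts this sequence of initial symbols into a character sequence. Given the
initial symbols, with spaces inserted between words, the system outputs the best candidate
word sequences for the user to choose from. The probabilistic dependence between words is
modeled with a basic bigram language model trained on a corpus, and candidate word sequences
are selected by dynamic programming. We evaluate the feasibility of the proposed input method
and also evaluate the performance of the Stanford segmenter on Simplified and Traditional
Chinese. In the segmenter experiments, the Stanford segmenter achieves precision and recall of
84.52% and 85.20% on Simplified Chinese, both higher than the 68.43% and 65.43% it achieves on
Traditional Chinese. For the decoding performance of the reduced input method, we evaluate
language models trained on two corpora with different characteristics, the Academia Sinica
Balanced Corpus (ASBC) and a Wikipedia corpus. Sentence accuracy and word accuracy are 39.8%
and 70.3% with the ASBC model, and 20.3% and 53.3% with the Wikipedia model, so the system
performs better on both measures with ASBC. For the number of candidate sequences, we compare
10 and 20 candidates; using 20 candidates improves sentence and word accuracy over 10
candidates by no more than two percentage points.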
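The segmenter figures above are precision and recall over words. The record does not say how they were scored, so the following is only a minimal sketch of one common word-level scoring scheme, in which a predicted word counts as correct when its character span matches a gold span exactly. The function names and the example sentence are hypothetical and not taken from the thesis.

```python
def spans(words):
    """Turn a list of words into a set of (start, end) character spans."""
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def segmentation_pr(gold, predicted):
    """Return (precision, recall) of a predicted segmentation against a gold one."""
    g, p = spans(gold), spans(predicted)
    correct = len(g & p)  # words whose boundaries match exactly
    return correct / len(p), correct / len(g)


# Hypothetical example: the segmenter wrongly splits one three-character word.
gold = ["中文", "輸入法", "很", "方便"]
pred = ["中文", "輸入", "法", "很", "方便"]
precision, recall = segmentation_pr(gold, pred)
print(precision, recall)  # 0.6 0.75
```

Here three of the five predicted words match gold words, giving precision 3/5, and three of the four gold words are recovered, giving recall 3/4.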
Abstract
This thesis investigates a more efficient Chinese input method. In the traditional Mandarin
phonetic input method, users must enter the complete phonetic spelling of every character.
The proposed method instead transforms a sequence of first phonetic symbols into a character
sequence, so users enter only the first Mandarin phonetic symbol of each character, following
the input rule that spaces are inserted between words. The system outputs candidate
character-sequence hypotheses. A bigram model describes the relation between words, and
dynamic programming is used for decoding. We evaluate the feasibility of the new input method
and evaluate the Stanford segmenter. In the experiments, we first evaluate the Stanford
segmenter on Simplified and Traditional Chinese: precision and recall on Simplified Chinese are
84.52% and 85.20%, better than the 68.43% and 63.43% on Traditional Chinese. We then evaluate
system performance with language models trained separately on the WIKI corpus and the ASBC
corpus. Sentence and word accuracy are 39.8% and 70.3% for the ASBC corpus, and 20.3% and
53.3% for the WIKI corpus. Finally, we evaluate the number of candidate hypotheses: sentence
accuracy with 10 hypotheses is close to that with 20 hypotheses.
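The decoding idea in the abstract (first phonetic symbols in, word sequence out, scored by a bigram model and searched with dynamic programming) can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the lexicon, the counts, and the names `LEXICON`, `BIGRAM`, `UNIGRAM`, and `decode` are all hypothetical, add-one smoothing is used for simplicity, and only the single best sequence is returned, whereas the thesis system outputs a list of candidates.

```python
import math

# Hypothetical toy lexicon: key of first Zhuyin symbols -> candidate words.
LEXICON = {
    "ㄓㄨ": ["中文", "主委"],
    "ㄕㄖ": ["輸入", "生日"],
}
VOCAB = {w for words in LEXICON.values() for w in words}

# Hypothetical toy counts (not taken from ASBC or Wikipedia).
BIGRAM = {("<s>", "中文"): 3, ("<s>", "主委"): 1,
          ("中文", "輸入"): 4, ("主委", "生日"): 1}
UNIGRAM = {"<s>": 4, "中文": 5, "主委": 2}


def bigram_logprob(prev, word):
    """Add-one smoothed log P(word | prev)."""
    count = BIGRAM.get((prev, word), 0) + 1
    total = UNIGRAM.get(prev, 0) + len(VOCAB)
    return math.log(count / total)


def decode(keys):
    """Viterbi-style search: best word sequence for space-separated keys."""
    best = {"<s>": (0.0, [])}  # last word -> (log score, path so far)
    for key in keys:
        nxt = {}
        for word in LEXICON.get(key, []):
            # Keep only the best-scoring predecessor for each candidate word.
            score, path = max(
                ((s + bigram_logprob(prev, word), p)
                 for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            nxt[word] = (score, path + [word])
        if not nxt:
            raise ValueError(f"no candidate word for key {key!r}")
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]


print(decode("ㄓㄨ ㄕㄖ".split()))  # ['中文', '輸入']
```

Because each step keeps one best path per candidate word, the cost grows with the number of keys times the square of the candidates per key, rather than with the number of possible word sequences.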
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Motivation
1.2 Introduction to Chinese Input Methods
1.3 Overview of Language Models and Corpora
1.4 Overview of the Decoding Mechanism
1.5 Thesis Organization
Chapter 2 System Architecture Overview
Chapter 3 Language Model and Decoding Mechanism
3.1 Feasibility Analysis of the Reduced Phonetic Input Method
3.2 Smoothing
3.2.1 Laplace Smoothing (Add-one smoothing)
3.2.2 Katz Smoothing
3.3 Dynamic Programming Decoding Mechanism
3.3.1 Decoding of Continuous Input Sequences
3.3.2 Decoding of User-Segmented Sequences
Chapter 4 Experiments
4.1 Experimental Data
4.1.1 Academia Sinica Balanced Corpus (ASBC)
4.1.2 Wikipedia Corpus
4.2 Xcin Pronunciation Dictionary
4.2.1 Phonetic Annotation of Polyphonic Characters
4.3 Feasibility Analysis and System Evaluation
4.4 Conclusion
Appendix A