National Sun Yat-sen University (國立中山大學), thesis/dissertation, Implementation of Embedded Mandarin Speech Recognition System in Travel Domain

Title	基於旅遊對話運用嵌入式中文語音辨識系統之實作 (Implementation of Embedded Mandarin Speech Recognition System in Travel Domain)
Department	Department of Computer Science and Engineering
Year, semester	Spring semester of Academic Year 97 (ROC calendar)	Language	English
Degree	Master	Number of pages	65
Author	陳柏含 Bo-han Chen
Advisor	陳嘉平 Chia-Ping Chen
Convenor	王新民 Hsin-Min Wang
Advisory Committee	吳宗憲 Chung-Hsien Wu; 洪志偉 Jeih-weih Hung; 張景新 Jing-Shin Chang
Date of Exam	2009-07-25	Date of Submission	2009-09-07
Keywords	Weighted Finite State Transducer, Hidden Markov Model, Automatic Speech Recognition

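Chinese Abstract (English translation)
In this thesis, we develop a two-pass Mandarin automatic speech recognizer on a mobile device. The first pass recognizes Mandarin syllables, using discrete hidden Markov models as the base models and a time-synchronous, tree-lexicon Viterbi search. In the second pass, weighted finite-state transducers represent the language model, the pronunciation model, and the N-best syllable hypotheses; composition and shortest-path operations over these transducers then yield the best word sequence. The system targets the travel domain and separates the application of the acoustic model and the language model into independent passes. The experiments report recognition results obtained on an actual device, the ASUS P565 (hardware: 800 MHz CPU, 128 MB RAM; operating system: Windows Mobile 6.1). We use the 26-hour TCC-300 microphone speech corpus as the training set for 151 acoustic models. To measure syllable and character accuracy on the PC and PDA platforms, we use 3 minutes of self-recorded travel-domain speech as the test set. The second-pass language model is a word bigram model covering 3,500 words, trained on the BTEC corpus. In the first pass, the best syllable accuracy obtained is 38.8% (30-best hypotheses); this result uses continuous hidden Markov models. With the same syllable results, the second pass reaches a character accuracy of 27.6%.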
Abstract
We build a two-pass Mandarin automatic speech recognition (ASR) decoder on a mobile device (PDA). The first pass, which recognizes base syllables, is implemented with discrete hidden Markov models (HMMs) and a time-synchronous, tree-lexicon Viterbi search. The second pass, which handles the language model, the pronunciation lexicon, and the N-best syllable hypotheses from the first pass, is implemented with weighted finite-state transducers (WFSTs). The best word sequence is obtained by running a shortest-path algorithm over the composition result. The system is restricted to the travel domain, and it decouples the application of the acoustic model from that of the language model by placing them in independent recognition passes. We report real-time recognition performance on an ASUS P565 with an 800 MHz processor and 128 MB of RAM, running the Microsoft Windows Mobile 6 operating system. The 26-hour TCC-300 speech corpus is used to train 151 acoustic models. A 3-minute set of speech, recorded by reading travel-domain transcriptions, serves as the test set for evaluating accuracy (syllable and character) and real-time factors on PC and PDA. A bigram model covering 3,500 words, trained on the BTEC corpus, is used in the second pass. In the first pass, the best syllable accuracy is 38.8%, given 30-best syllable hypotheses, using continuous HMMs and 26-dimensional features. With these syllable hypotheses and acoustic models, we obtain 27.6% character accuracy on PC after the second pass.
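The second pass described in the abstract can be illustrated with a small sketch. It is not the thesis implementation: instead of building explicit transducers, it runs a dynamic program that finds the same lowest-cost path (tropical semiring) that a shortest-path search over the composition of an N-best syllable acceptor, a lexicon, and a bigram language model would find. The function name `second_pass`, the per-slot N-best layout, the `<s>` start symbol, the `unseen_cost` fallback, and all toy costs below are assumptions made for illustration only.

```python
import math

def second_pass(nbest, lexicon, bigram, unseen_cost=10.0):
    """Return (cost, words) for the cheapest word sequence covering all slots.

    nbest:   list of slots; each slot is a list of (syllable, cost) pairs
             (hypothetical layout for first-pass N-best output).
    lexicon: dict mapping word -> tuple of syllables (pronunciation).
    bigram:  dict mapping (previous_word, word) -> cost (negative log prob).
    """
    slot_costs = [dict(slot) for slot in nbest]
    n = len(slot_costs)
    # best[i][w] = (cost, backpointer) for paths that end at slot i with word w.
    best = [{} for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, None)
    for i in range(n):
        for prev, (cost, _) in best[i].items():
            for word, sylls in lexicon.items():
                j = i + len(sylls)
                if not sylls or j > n:
                    continue
                # The word is usable only if each syllable is in the N-best list.
                if any(s not in slot_costs[i + k] for k, s in enumerate(sylls)):
                    continue
                acoustic = sum(slot_costs[i + k][s] for k, s in enumerate(sylls))
                total = cost + acoustic + bigram.get((prev, word), unseen_cost)
                if word not in best[j] or total < best[j][word][0]:
                    best[j][word] = (total, (i, prev))
    if not best[n]:
        return math.inf, []
    # Pick the cheapest final state and follow backpointers to the start.
    word = min(best[n], key=lambda w: best[n][w][0])
    total = best[n][word][0]
    words, i = [], n
    while word != "<s>":
        words.append(word)
        _, (i, word) = best[i][word]
    return total, words[::-1]

# Toy data (all values hypothetical). The 1-best syllable in slot 0 is "li",
# but the lexicon and bigram costs favor the word sequence built on "ni".
nbest = [[("ni", 0.5), ("li", 0.4)], [("hao", 0.3), ("gao", 0.9)]]
lexicon = {"你好": ("ni", "hao"), "你": ("ni",), "好": ("hao",),
           "李": ("li",), "高": ("gao",)}
bigram = {("<s>", "你好"): 0.2, ("<s>", "你"): 1.0, ("你", "好"): 1.5,
          ("<s>", "李"): 2.0, ("李", "好"): 3.0}

cost, words = second_pass(nbest, lexicon, bigram)
print(words)
```

In this toy case the first-pass 1-best syllable string would be "li hao", yet the second pass selects the single word 你好, which shows why keeping N-best hypotheses (30-best in the reported experiments) gives the language model room to correct first-pass errors.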

Table of Contents
List of Tables
List of Figures
Acknowledgments
Chapter 1 Introduction
  1.1 Motivation
  1.2 Background
  1.3 System Architecture
Chapter 2 Speech Recognition System
  2.1 Feature Extraction
  2.2 Decoding Problem
    2.2.1 Acoustic Model
    2.2.2 Output Probability
  2.3 Language Model
  2.4 Mandarin Pronunciation Model
  2.5 Search Approach
    2.5.1 Linearly-Structured Lexicon
    2.5.2 Prefix Tree-Structured Lexicon
Chapter 3 Weighted Finite State Machines
  3.1 Semi-ring
  3.2 Definitions
  3.3 Related Algorithms
    3.3.1 Union
    3.3.2 Kleene Closure
    3.3.3 Composition Algorithm
      3.3.3.1 Composition Filter
  3.4 Transducers in ASR Decoder
    3.4.1 N-best Transducer
    3.4.2 Lexicon Transducer
    3.4.3 Language Model Transducer
Chapter 4 Experiment
  4.1 Evaluation Platforms
  4.2 Performance Evaluation
  4.3 Description of Speech Corpus
  4.4 Description of Feature
  4.5 Evaluation of Syllable Accuracy (1-best)
    4.5.1 Mixture Number Reduction and Feature Dimension Reduction
    4.5.2 Comparisons of Tree Lexicon and Linear Lexicon
  4.6 Evaluation of Syllable Accuracy (Oracle)
  4.7 Evaluation of Character Accuracy
    4.7.1 BTEC Corpus
    4.7.2 Application of XCIN
Chapter 5 Conclusion and Future Works
  5.1 Conclusion
  5.2 Future Works


Fulltext
Thesis access permission: not available, on campus or off campus.
Printed copies
Availability: available (publicly released).
