博碩士論文 etd-0907109-153255 詳細資訊
Title page for etd-0907109-153255
論文名稱
Title
基於旅遊對話運用嵌入式中文語音辨識系統之實作
Implementation of Embedded Mandarin Speech Recognition System in Travel Domain
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
65
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2009-07-25
繳交日期
Date of Submission
2009-09-07
關鍵字
Keywords
加權有限狀態轉換機、隱藏式馬可夫模型、自動語音辨識
Weighted Finite State Transducer, Hidden Markov Model, Automatic Speech Recognition
中文摘要
在本論文中,我們在行動裝置上開發一套二階段式中文自動語音辨識器。第一個辨識階段主要是辨識中文音節,以離散隱藏式馬可夫模型作為基礎模型,搜尋方式則為時間同步詞彙樹維特比搜尋。在第二階段,我們則是運用加權有限狀態轉換機來表示語言模型、發音模型以及前N名音節假說結果,再經由加權有限狀態轉換機上之組合及最短路徑運算,得到最好的詞串結果。本系統主要應用於旅遊領域,並將聲學模型與語言模型的應用分離為獨立的辨識階段。實驗部份提供在實機ASUS P565(硬體配備:800MHz CPU、128MB RAM;作業系統:Windows Mobile 6.1)上獲得的辨識數據。我們採用26小時TCC-300麥克風語料作為151個聲學模型的訓練集。為了在PC及PDA平台測試音節及字的辨識率,我們採用3分鐘自行錄製的旅遊語料作為測試集。第二階段的語言模型則是選用BTEC語料庫中由3500個詞訓練得到的詞雙連模型。
在第一階段中,所獲得的最好的音節辨識結果為38.8%(前30個假說),此結果是使用連續隱藏式馬可夫模型得到的。以同樣的音節結果,在第二階段可達到27.6%的字辨識率。
Abstract
We build a two-pass Mandarin automatic speech recognition (ASR) decoder for a mobile device (PDA). The first pass recognizes base syllables using discrete hidden Markov models (HMMs) with a time-synchronous, tree-lexicon Viterbi search. The second pass combines the language model, the pronunciation lexicon, and the N-best syllable hypotheses from the first pass, each represented as a weighted finite-state transducer (WFST). The best word sequence is obtained by running a shortest-path algorithm over the composition of these transducers. The system is restricted to the travel domain, and it decouples the application of the acoustic model and of the language model into independent recognition passes. We report real-time recognition performance measured on an ASUS P565 with an 800 MHz processor and 128 MB of RAM, running the Microsoft Windows Mobile 6 operating system.
The 26-hour TCC-300 speech corpus is used to train 151 acoustic models. A 3-minute test set, recorded by reading travel-domain transcriptions, is used to evaluate syllable and character accuracy and real-time factors on both PC and PDA. A bigram language model with a 3500-word vocabulary, trained on the BTEC corpus, is used in the second pass.
In the first pass, the best syllable accuracy is 38.8%, obtained with 30-best syllable hypotheses using continuous HMMs and 26-dimensional features. With the same syllable hypotheses and acoustic models, the second pass reaches 27.6% character accuracy on PC.
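As a rough illustration of the second pass, the sketch below picks the cheapest word sequence given first-pass syllable hypotheses, a pronunciation lexicon, and a bigram language model. It is not the thesis implementation: instead of building and composing WFSTs, it runs a small dynamic program that plays the role of composition followed by shortest path in the tropical semiring (costs add along a path, and the minimum is kept across paths). The function name `second_pass`, the per-position hypothesis lists, the toy lexicon, every cost value, and the `unseen_cost` penalty are assumptions made for this example.

```python
import math


def second_pass(slots, lexicon, bigram, unseen_cost=10.0):
    """Return (cost, words) for the cheapest word sequence covering all slots.

    slots:   list of dicts {syllable: acoustic cost}, one dict per position
             (a simplified stand-in for the N-best syllable hypotheses)
    lexicon: dict {word: tuple of syllables}
    bigram:  dict {(previous word, word): cost}, with "<s>" as the start symbol
    """
    # best[(position, last word)] = (accumulated cost, word sequence so far)
    best = {(0, "<s>"): (0.0, [])}
    for pos in range(len(slots)):
        # Snapshot the states ending at this position before extending them.
        states = [(key, val) for key, val in best.items() if key[0] == pos]
        for (_, prev), (cost, words) in states:
            for word, sylls in lexicon.items():
                end = pos + len(sylls)
                if end > len(slots):
                    continue
                # The word must match one hypothesis at every position it spans.
                if any(s not in slots[pos + i] for i, s in enumerate(sylls)):
                    continue
                acoustic = sum(slots[pos + i][s] for i, s in enumerate(sylls))
                lm = bigram.get((prev, word), unseen_cost)
                cand = (cost + acoustic + lm, words + [word])
                key = (end, word)
                if key not in best or cand[0] < best[key][0]:
                    best[key] = cand
    finals = [val for (p, _), val in best.items() if p == len(slots)]
    return min(finals, key=lambda v: v[0]) if finals else (math.inf, [])


# Hypothetical toy data: costs are made-up negative log probabilities.
slots = [
    {"wo": 0.2, "wu": 1.5},
    {"yao": 0.3, "you": 1.2},
    {"ji": 0.4, "qi": 0.9},
    {"piao": 0.1},
]
lexicon = {
    "我": ("wo",),
    "要": ("yao",),
    "有": ("you",),
    "機票": ("ji", "piao"),
    "汽": ("qi",),
    "票": ("piao",),
}
bigram = {
    ("<s>", "我"): 0.5,
    ("我", "要"): 0.7,
    ("我", "有"): 1.0,
    ("要", "機票"): 0.8,
    ("有", "機票"): 0.9,
}

cost, words = second_pass(slots, lexicon, bigram)
print(" ".join(words))
```

On this toy input the sketch selects 我 要 機票: the language model cost steers the choice toward the two-syllable word 機票 even though other syllable hypotheses are available. A WFST toolkit would express the same computation as composing an N-best transducer, a lexicon transducer, and a language model transducer, then taking the shortest path.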
目次 Table of Contents
List of Tables
List of Figures
誌謝 Acknowledgements
Chapter 1 Introduction
1.1 Motivation
1.2 Background
1.3 System Architecture
Chapter 2 Speech Recognition System
2.1 Feature Extraction
2.2 Decoding Problem
2.2.1 Acoustic Model
2.2.2 Output Probability
2.3 Language Model
2.4 Mandarin Pronunciation Model
2.5 Search Approach
2.5.1 Linearly-Structured Lexicon
2.5.2 Prefix Tree-Structured Lexicon
Chapter 3 Weighted Finite State Machines
3.1 Semi-ring
3.2 Definitions
3.3 Related Algorithms
3.3.1 Union
3.3.2 Kleene Closure
3.3.3 Composition Algorithm
3.3.3.1 Composition Filter
3.4 Transducers in ASR Decoder
3.4.1 N-best Transducer
3.4.2 Lexicon Transducer
3.4.3 Language Model Transducer
Chapter 4 Experiment
4.1 Evaluation Platforms
4.2 Performance Evaluation
4.3 Description of Speech Corpus
4.4 Description of Feature
4.5 Evaluation of Syllable Accuracy (1-best)
4.5.1 Mixture Number Reduction and Feature Dimension Reduction
4.5.2 Comparisons of Tree Lexicon and Linear Lexicon
4.6 Evaluation of Syllable Accuracy (Oracle)
4.7 Evaluation of Character Accuracy
4.7.1 BTEC Corpus
4.7.2 Application of XCIN
Chapter 5 Conclusion and Future Works
5.1 Conclusion
5.2 Future Works
電子全文 Fulltext
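This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.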
論文使用權限 Thesis access permission:校內校外均不公開 not available
開放時間 Available:
校內 Campus:永不公開 not available
校外 Off-campus:永不公開 not available


紙本論文 Printed copies
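Availability information for printed theses is relatively complete from academic year 102 onward. To ask about the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the library and information office. We apologize for any inconvenience.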
開放時間 available 已公開 available
