博碩士論文 etd-0727113-135938 詳細資訊 (Thesis/Dissertation Record Details)
Title page for etd-0727113-135938
論文名稱 Title: 基於 Sphinx 可快速個人化行動語音辨識系統 (Quickly Personalizable Digit Mobile Speech Recognition System Based on Sphinx)
系所名稱 Department:
畢業學年期 Year, semester:
語文別 Language:
學位類別 Degree:
頁數 Number of pages: 49
研究生 Author:
指導教授 Advisor:
召集委員 Convenor:
口試委員 Advisory Committee:
口試日期 Date of Exam: 2013-07-25
繳交日期 Date of Submission: 2013-09-02
關鍵字 Keywords: speech recognition, mobile, adaptation, personalization, noise robustness, Sphinx
統計 Statistics: This thesis/dissertation has been browsed 5684 times and downloaded 2393 times.
中文摘要 Abstract (Chinese)
This thesis builds a system that provides digit speech recognition as a service. The system offers automatic speech recognition to users over the network; in addition to recognition, it provides an online personalized adaptation function to overcome noise robustness problems in different environments. For English digit recognition, only a small amount of adaptation is needed to build, in little time, a personalized English digit recognizer with accuracy as high as 80%, and the system can be used to develop applications and other related services. Sphinx-4 is a toolkit developed specifically for research, with an extensible, modular, and pluggable architecture; because of these properties, we chose Sphinx-4 as the core of our speech recognition system. Thanks to its pluggable design, the dictionary, grammar, or acoustic model can be replaced with only minor changes to the configuration file. To provide a basis for choosing the acoustic model, training corpus, and adaptation data, we report adaptation experiments across different environments and devices using the English digit corpus AURORA 2, the Taiwanese-accented English corpus EAT, and a corpus we recorded ourselves on Android devices.
Abstract
In this thesis we introduce a system that provides digit speech recognition services. The system is deployed on the Internet, so users can access it easily over the network. Besides the recognition service itself, the system provides an adaptation function that improves noise robustness across different environments. For English digit recognition, the system can reach 80% accuracy for a specific speaker after only a small amount of adaptation. The system can also be extended to build applications and other related services. We use Sphinx-4 as the speech recognition core of our system: Sphinx-4 was designed specifically for research, and it offers a flexible, modular, and pluggable framework. Thanks to this pluggability, the dictionary, grammar, and acoustic model can each be replaced simply by editing the configuration files. To provide a basis for choosing the acoustic model, training data, and adaptation data, we report experimental results on the AURORA2 and EAT corpora and on recordings made with Android devices.
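
To make the pluggability claim concrete, the sketch below shows what a minimal Sphinx-4 client can look like, modeled on the HelloDigits-style demo programs shipped with Sphinx-4 1.0beta. The file name digits.config.xml and the component names "recognizer" and "microphone" are illustrative assumptions; they must match whatever the actual configuration file defines, since that XML file, not the Java code, declares the dictionary, grammar, and acoustic model.

    // Minimal Sphinx-4 digit recognition sketch, in the style of the
    // Sphinx-4 1.0beta demos. digits.config.xml is a hypothetical
    // configuration file: the dictionary, grammar, and acoustic model
    // are all declared there, not in this code.
    import edu.cmu.sphinx.frontend.util.Microphone;
    import edu.cmu.sphinx.recognizer.Recognizer;
    import edu.cmu.sphinx.result.Result;
    import edu.cmu.sphinx.util.props.ConfigurationManager;

    public class DigitRecognizer {
        public static void main(String[] args) {
            // Components are looked up by the names used in the XML config.
            ConfigurationManager cm = new ConfigurationManager(
                    DigitRecognizer.class.getResource("digits.config.xml"));
            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
            recognizer.allocate();

            Microphone microphone = (Microphone) cm.lookup("microphone");
            if (microphone.startRecording()) {
                System.out.println("Say a digit string, e.g. \"one two three\"");
                Result result = recognizer.recognize();
                if (result != null) {
                    System.out.println("Recognized: "
                            + result.getBestFinalResultNoFiller());
                }
            }
            recognizer.deallocate();
        }
    }

Under this arrangement, switching to a different acoustic model, dictionary, or grammar only requires editing the configuration file; the client code is untouched, which is what makes per-user and per-environment personalization cheap.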
目次 Table of Contents
List of Tables ... vi
List of Figures ... viii
Chapter 1 Introduction ... 1
1.1 Research background ... 1
1.2 Research motivation ... 2
1.3 Thesis organization ... 2
Chapter 2 The Networked Speech Recognition System and Tools ... 3
2.1 System architecture ... 3
2.2 Distributed system ... 4
2.3 Tools: CMU Sphinx ... 5
2.3.1 Sphinx-4 ... 5
2.3.2 The Sphinx-4 architecture ... 6
2.3.3 Other tools ... 7
Chapter 3 Corpora ... 9
3.1 The AURORA 2.0 corpus ... 9
3.2 The Taiwanese-accented English corpus (English Across Taiwan, EAT) ... 10
3.2.1 The EAT digit corpus (EAT DIGIT) ... 11
3.3 The Android-device recording corpus (NOISE1) ... 12
3.3.1 Recording plan ... 12
3.3.2 Speakers and sentence counts ... 12
Chapter 4 Experiments ... 14
4.1 Experimental setup ... 15
4.2 Adapting the AURORA2 acoustic model with EAT DIGIT data ... 18
4.2.1 EAT DIGIT: English-major vs. non-English-major accents ... 18
4.2.2 EAT DIGIT: adaptation gain vs. number of sentences ... 20
4.2.3 EAT DIGIT: cross-environment adaptation ... 22
4.3 Adapting the AURORA2 acoustic model with NOISE1 data ... 24
4.3.1 NOISE1: adaptation gain vs. number of sentences ... 24
4.3.2 NOISE1: adaptation across noise environments ... 24
4.3.3 NOISE1: adaptation across devices ... 27
Chapter 5 Conclusions and Future Work ... 28
5.1 Conclusions ... 28
5.2 Future work ... 29
Appendix A CMU Sphinx Installation Steps and Demos ... 34
A.1 SphinxTrain ... 34
A.1.1 Installation steps ... 34
A.1.2 Demo ... 35
A.2 Sphinx-3 ... 35
A.2.1 Installation steps ... 35
A.2.2 Demo ... 36
A.3 Sphinx-4 ... 36
A.3.1 Installation steps ... 36
A.3.2 Demo ... 37
參考文獻 References
[1] M. Kamvar and S. Baluja, "A Large Scale Study of Wireless Search Behavior: Google Mobile Search," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '06, New York, NY, USA, pp. 701–709, ACM, 2006.
[2] I. VanDuyn, "Comparison of Voice Search Applications on iOS." [Online]. Available: http://www.isaacvanduyn.com/downloads/research-proposal.pdf
[3] T. X. He and J.-J. Liou, "Cyberon Voice Commander 多國語言語音命令系統 (Cyberon Voice Commander: a Multilingual Voice Command System) [in Chinese]," in ROCLING, Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Taiwan, 2007.
[4] Y. Lu, L. Liu, S. Chen, and Q. Huang, "Voice Based Control for Humanoid Teleoperation," in Proceedings of the 2010 International Conference on Intelligent System Design and Engineering Application, Volume 02, ISDEA '10, Washington, DC, USA, pp. 814–818, IEEE Computer Society, 2010.
[5] B.-K. Shim, Y.-K. Cho, J.-B. Won, and S.-H. Han, "A study on real-time control of mobile robot based on voice command," in 2011 11th International Conference on Control, Automation and Systems (ICCAS), pp. 1102–1103, 2011.
[6] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing," tech. rep., University of California at Berkeley, February 2009.
[7] J. Borges, J. Jimenez, and N. Rodriquez, "Speech browsing the World Wide Web," in IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC '99), vol. 4, pp. 80–86, 1999.
[8] D. Pearce and H.-G. Hirsch, "The Aurora Experimental Framework for the Performance Evaluation of Speech Recognition Systems under Noisy Conditions," in ISCA ITRW ASR2000, pp. 29–32, 2000.
[9] L. Rabiner and B.-H. Juang, "An introduction to hidden Markov models," IEEE ASSP Magazine, vol. 3, no. 1, pp. 4–16, 1986.
[10] Y. Zhao and B.-H. Juang, "Stranded Gaussian mixture hidden Markov models for robust speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4301–4304, 2012.
[11] Y. Zhao and B.-H. Juang, "Exploiting sparsity in stranded hidden Markov models for automatic speech recognition," in Conference Record of the Forty-Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp. 1623–1625, 2012.
[12] L. Burget, P. Schwarz, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. Goel, M. Karafiat, D. Povey, A. Rastrow, R. Rose, and S. Thomas, "Multilingual acoustic modeling for speech recognition based on subspace Gaussian Mixture Models," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4334–4337, 2010.
[13] D. Povey, L. Burget, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. Goel, M. Karafiat, A. Rastrow, R. Rose, P. Schwarz, and S. Thomas, "Subspace Gaussian Mixture Models for speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4330–4333, 2010.
[14] S. J. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, "The HTK Book Version 3.4," Cambridge University Press, 2006.
[15] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel, "Sphinx-4: a flexible open source framework for speech recognition," tech. rep., Sun Microsystems Laboratories, Mountain View, CA, USA, 2004.
[16] X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang, and R. Rosenfeld, "The SPHINX-II Speech Recognition System: An Overview," Computer Speech and Language, vol. 7, pp. 137–148, 1992.
[17] P. Placeway, S. Chen, M. Eskenazi, U. Jain, V. Parikh, B. Raj, M. Ravishankar, R. Rosenfeld, K. Seymore, M. Siegler, R. Stern, and E. Thayer, "The 1996 Hub-4 Sphinx-3 System," in Proc. ARPA Spoken Language Technology Workshop, Chantilly, VA, 1996.
[18] P. Lamere, P. Kwok, W. Walker, E. B. Gouvêa, R. Singh, B. Raj, and P. Wolf, "Design of the CMU Sphinx-4 decoder," in 8th European Conference on Speech Communication and Technology (EUROSPEECH 2003 / INTERSPEECH 2003), Geneva, Switzerland, September 1–4, 2003, ISCA, 2003.
[19] S. J. Young, "The HTK Hidden Markov Model Toolkit: Design and Philosophy," Entropic Cambridge Research Laboratory, Ltd., vol. 2, pp. 2–44, 1994.
[20] X. Liu, Y. Zhao, X. Pi, L. Liang, and A. V. Nefian, "Audio-visual continuous speech recognition using a coupled hidden Markov model," in INTERSPEECH, 2002.
[21] K.-F. Lee, H.-W. Hon, and R. Reddy, "An overview of the SPHINX speech recognition system," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38, no. 1, pp. 35–45, January 1990.
[22] M. K. Ravishankar, "Efficient algorithms for speech recognition," Ph.D. thesis (CMU Technical Report CS-96-143), Carnegie Mellon University, Pittsburgh, PA, 1996.
[23] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[24] H. Hermansky, "Perceptual linear prediction (PLP) analysis for speech," Journal of the Acoustical Society of America, vol. 87, pp. 1738–1752, 1990.
[25] C.-H. Lee and J.-L. Gauvain, "Speaker adaptation based on MAP estimation of HMM parameters," in Proceedings of the 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '93), vol. II, Washington, DC, USA, pp. 558–561, IEEE Computer Society, 1993.
[26] J.-L. Gauvain and C.-H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Transactions on Speech and Audio Processing, vol. 2, pp. 291–298, April 1994.
電子全文 Fulltext
The electronic full text is licensed only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please observe the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it, so as to avoid infringement.
論文使用權限 Thesis access permission: 校內校外完全公開 unrestricted (fully open both on and off campus)
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Availability information for printed copies is relatively complete for academic year 102 (ROC calendar, i.e., 2013–14) and later. To check the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
