博碩士論文 etd-0727113-135938 詳細資訊 (Thesis/Dissertation Record Details)
Title page for etd-0727113-135938
論文名稱 Title: 基於 Sphinx 可快速個人化行動語音辨識系統 (Quickly Personalizable Digit Mobile Speech Recognition System Based on Sphinx)
系所名稱 Department:
畢業學年期 Year, semester:
語文別 Language:
學位類別 Degree:
頁數 Number of pages: 49
研究生 Author:
指導教授 Advisor:
召集委員 Convenor:
口試委員 Advisory Committee:
口試日期 Date of Exam: 2013-07-25
繳交日期 Date of Submission: 2013-09-02
關鍵字 Keywords: speech recognition, mobile, adaptation, personalization, noise robustness, Sphinx
統計 Statistics: This thesis/dissertation has been browsed 5684 times and downloaded 2393 times.
中文摘要 Abstract (Chinese)
This thesis builds a system that provides digit speech recognition as a service. The system offers automatic speech recognition to users over the network; in addition to recognition, it provides an online personalized adaptation function to overcome noise robustness problems in different environments. For English digit recognition, only a small amount of adaptation is needed to build, in little time, a personalized English digit recognizer with accuracy as high as 80%, and the system can be used to develop applications and other related services. Sphinx-4 is a toolkit developed specifically for research, with an extensible, modular, and pluggable architecture; because of these properties, we chose Sphinx-4 as the core of our speech recognition system. Thanks to its pluggable design, the dictionary, grammar, or acoustic model can be replaced with only minor changes to the configuration file. To provide a basis for choosing the acoustic model, training corpus, and adaptation data, we report adaptation experiments across different environments and devices using the English digit corpus AURORA 2, the Taiwanese-accented English corpus EAT, and a corpus we recorded ourselves on Android devices.
Abstract
In this thesis we introduce a system that provides digit speech recognition services. The system is deployed on the Internet, so users can access it easily over the network. Besides the recognition service itself, the system provides an adaptation function that improves noise robustness across different environments. For English digit recognition, the system can reach 80% accuracy for a specific speaker after only a small amount of adaptation. The system can also be extended to build applications and other related services. We use Sphinx-4 as the speech recognition core of our system: Sphinx-4 was designed specifically for research, and it offers a flexible, modular, and pluggable framework. Thanks to this pluggability, the dictionary, grammar, and acoustic model can each be replaced simply by editing the configuration files. To provide a basis for choosing the acoustic model, training data, and adaptation data, we report experimental results on the AURORA2 and EAT corpora and on recordings made with Android devices.
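
To make the pluggability claim concrete, the sketch below shows what a minimal Sphinx-4 client can look like, modeled on the HelloDigits-style demo programs shipped with Sphinx-4 1.0beta. The file name digits.config.xml and the component names "recognizer" and "microphone" are illustrative assumptions; they must match whatever the actual configuration file defines, since that XML file, not the Java code, declares the dictionary, grammar, and acoustic model.

    // Minimal Sphinx-4 digit recognition sketch, in the style of the
    // Sphinx-4 1.0beta demos. digits.config.xml is a hypothetical
    // configuration file: the dictionary, grammar, and acoustic model
    // are all declared there, not in this code.
    import edu.cmu.sphinx.frontend.util.Microphone;
    import edu.cmu.sphinx.recognizer.Recognizer;
    import edu.cmu.sphinx.result.Result;
    import edu.cmu.sphinx.util.props.ConfigurationManager;

    public class DigitRecognizer {
        public static void main(String[] args) {
            // Components are looked up by the names used in the XML config.
            ConfigurationManager cm = new ConfigurationManager(
                    DigitRecognizer.class.getResource("digits.config.xml"));
            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
            recognizer.allocate();

            Microphone microphone = (Microphone) cm.lookup("microphone");
            if (microphone.startRecording()) {
                System.out.println("Say a digit string, e.g. \"one two three\"");
                Result result = recognizer.recognize();
                if (result != null) {
                    System.out.println("Recognized: "
                            + result.getBestFinalResultNoFiller());
                }
            }
            recognizer.deallocate();
        }
    }

Under this arrangement, switching to a different acoustic model, dictionary, or grammar only requires editing the configuration file; the client code is untouched, which is what makes per-user and per-environment personalization cheap.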
目次 Table of Contents
List of Tables ... vi
List of Figures ... viii
Chapter 1 Introduction ... 1
1.1 Research background ... 1
1.2 Research motivation ... 2
1.3 Thesis organization ... 2
Chapter 2 The Networked Speech Recognition System and Tools ... 3
2.1 System architecture ... 3
2.2 Distributed system ... 4
2.3 Tools: CMU Sphinx ... 5
2.3.1 Sphinx-4 ... 5
2.3.2 The Sphinx-4 architecture ... 6
2.3.3 Other tools ... 7
Chapter 3 Corpora ... 9
3.1 The AURORA 2.0 corpus ... 9
3.2 The Taiwanese-accented English corpus (English Across Taiwan, EAT) ... 10
3.2.1 The EAT digit corpus (EAT DIGIT) ... 11
3.3 The Android-device recording corpus (NOISE1) ... 12
3.3.1 Recording plan ... 12
3.3.2 Speakers and sentence counts ... 12
Chapter 4 Experiments ... 14
4.1 Experimental setup ... 15
4.2 Adapting the AURORA2 acoustic model with EAT DIGIT data ... 18
4.2.1 EAT DIGIT: English-major vs. non-English-major accents ... 18
4.2.2 EAT DIGIT: adaptation gain vs. number of sentences ... 20
4.2.3 EAT DIGIT: cross-environment adaptation ... 22
4.3 Adapting the AURORA2 acoustic model with NOISE1 data ... 24
4.3.1 NOISE1: adaptation gain vs. number of sentences ... 24
4.3.2 NOISE1: adaptation across noise environments ... 24
4.3.3 NOISE1: adaptation across devices ... 27
Chapter 5 Conclusions and Future Work ... 28
5.1 Conclusions ... 28
5.2 Future work ... 29
Appendix A CMU Sphinx Installation Steps and Demos ... 34
A.1 SphinxTrain ... 34
A.1.1 Installation steps ... 34
A.1.2 Demo ... 35
A.2 Sphinx-3 ... 35
A.2.1 Installation steps ... 35
A.2.2 Demo ... 36
A.3 Sphinx-4 ... 36
A.3.1 Installation steps ... 36
A.3.2 Demo ... 37
參考文獻 References
[1] M. Kamvar and S. Baluja, "A Large Scale Study of Wireless Search Behavior: Google Mobile Search," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '06, New York, NY, USA, pp. 701–709, ACM, 2006.
[2] I. VanDuyn, "Comparison of Voice Search Applications on iOS." [Online]. Available: http://www.isaacvanduyn.com/downloads/research-proposal.pdf
[3] T. X. He and J.-J. Liou, "Cyberon Voice Commander 多國語言語音命令系統 (Cyberon Voice Commander: a Multilingual Voice Command System) [in Chinese]," in ROCLING, Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Taiwan, 2007.
[4] Y. Lu, L. Liu, S. Chen, and Q. Huang, "Voice Based Control for Humanoid Teleoperation," in Proceedings of the 2010 International Conference on Intelligent System Design and Engineering Application, Volume 02, ISDEA '10, Washington, DC, USA, pp. 814–818, IEEE Computer Society, 2010.
[5] B.-K. Shim, Y.-K. Cho, J.-B. Won, and S.-H. Han, "A study on real-time control of mobile robot based on voice command," in 2011 11th International Conference on Control, Automation and Systems (ICCAS), pp. 1102–1103, 2011.
[6] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing," tech. rep., University of California at Berkeley, February 2009.
[7] J. Borges, J. Jimenez, and N. Rodriquez, "Speech browsing the World Wide Web," in IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC '99), vol. 4, pp. 80–86, 1999.
[8] D. Pearce and H.-G. Hirsch, "The Aurora Experimental Framework for the Performance Evaluation of Speech Recognition Systems under Noisy Conditions," in ISCA ITRW ASR2000, pp. 29–32, 2000.
[9] L. Rabiner and B.-H. Juang, "An introduction to hidden Markov models," IEEE ASSP Magazine, vol. 3, no. 1, pp. 4–16, 1986.
[10] Y. Zhao and B.-H. Juang, "Stranded Gaussian mixture hidden Markov models for robust speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4301–4304, 2012.
[11] Y. Zhao and B.-H. Juang, "Exploiting sparsity in stranded hidden Markov models for automatic speech recognition," in Conference Record of the Forty-Sixth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp. 1623–1625, 2012.
[12] L. Burget, P. Schwarz, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. Goel, M. Karafiat, D. Povey, A. Rastrow, R. Rose, and S. Thomas, "Multilingual acoustic modeling for speech recognition based on subspace Gaussian Mixture Models," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4334–4337, 2010.
[13] D. Povey, L. Burget, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. Goel, M. Karafiat, A. Rastrow, R. Rose, P. Schwarz, and S. Thomas, "Subspace Gaussian Mixture Models for speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4330–4333, 2010.
[14] S. J. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, "The HTK Book Version 3.4," Cambridge University Press, 2006.
[15] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel, "Sphinx-4: a flexible open source framework for speech recognition," tech. rep., Sun Microsystems Laboratories, Mountain View, CA, USA, 2004.
[16] X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang, and R. Rosenfeld, "The SPHINX-II Speech Recognition System: An Overview," Computer Speech and Language, vol. 7, pp. 137–148, 1992.
[17] P. Placeway, S. Chen, M. Eskenazi, U. Jain, V. Parikh, B. Raj, M. Ravishankar, R. Rosenfeld, K. Seymore, M. Siegler, R. Stern, and E. Thayer, "The 1996 Hub-4 Sphinx-3 System," in Proc. ARPA Spoken Language Technology Workshop, Chantilly, VA, 1996.
[18] P. Lamere, P. Kwok, W. Walker, E. B. Gouvêa, R. Singh, B. Raj, and P. Wolf, "Design of the CMU Sphinx-4 decoder," in 8th European Conference on Speech Communication and Technology (EUROSPEECH 2003 / INTERSPEECH 2003), Geneva, Switzerland, September 1–4, 2003, ISCA, 2003.
[19] S. J. Young, "The HTK Hidden Markov Model Toolkit: Design and Philosophy," Entropic Cambridge Research Laboratory, Ltd., vol. 2, pp. 2–44, 1994.
[20] X. Liu, Y. Zhao, X. Pi, L. Liang, and A. V. Nefian, "Audio-visual continuous speech recognition using a coupled hidden Markov model," in INTERSPEECH, 2002.
[21] K.-F. Lee, H.-W. Hon, and R. Reddy, "An overview of the SPHINX speech recognition system," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38, no. 1, pp. 35–45, January 1990.
[22] M. K. Ravishankar, "Efficient algorithms for speech recognition," Ph.D. thesis (CMU Technical Report CS-96-143), Carnegie Mellon University, Pittsburgh, PA, 1996.
[23] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[24] H. Hermansky, "Perceptual linear prediction (PLP) analysis for speech," Journal of the Acoustical Society of America, vol. 87, pp. 1738–1752, 1990.
[25] C.-H. Lee and J.-L. Gauvain, "Speaker adaptation based on MAP estimation of HMM parameters," in Proceedings of the 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '93), vol. II, Washington, DC, USA, pp. 558–561, IEEE Computer Society, 1993.
[26] J.-L. Gauvain and C.-H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Transactions on Speech and Audio Processing, vol. 2, pp. 291–298, April 1994.
電子全文 Fulltext
The electronic full text is licensed only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please observe the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it, so as to avoid infringement.
論文使用權限 Thesis access permission: 校內校外完全公開 unrestricted (fully open both on and off campus)
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Availability information for printed copies is relatively complete for academic year 102 (ROC calendar, i.e., 2013–14) and later. To check the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
