博碩士論文詳細資訊 Thesis/Dissertation Detail: etd-0901109-143419
論文名稱 Title
強健性自動語音辨識之基於聽覺模型的梅爾倒頻譜參數擷取調整
Auditory Based Modification of MFCC Feature Extraction for Robust Automatic Speech Recognition
系所名稱 Department
畢業學年期 Year, semester
語文別 Language
學位類別 Degree
頁數 Number of pages
80
研究生 Author
指導教授 Advisor
召集委員 Convenor
口試委員 Advisory Committee
口試日期 Date of Exam
2009-07-25
繳交日期 Date of Submission
2009-09-01
關鍵字 Keywords
噪音強健性、聽覺模型、後遮蔽、自動語音辨識
forward masking, auditory model, synaptic adaptation, temporal integration, noise robustness, ASR, automatic speech recognition
統計 Statistics
本論文已被瀏覽 5683 次,被下載 1012 次。
The thesis/dissertation has been browsed 5683 times, has been downloaded 1012 times.
中文摘要 Chinese Abstract
The human auditory perception system is more accurate and far less affected by noise than existing automatic speech recognition systems. By modeling human auditory perception within an automatic speech recognition system, we can improve the system's noise robustness.

Forward masking is an auditory-perception phenomenon in which a stronger sound masks the sounds that follow it. We model forward masking with two auditory mechanisms, synaptic adaptation and temporal integration, and implement both as filters. Incorporating this forward-masking model into Mel-frequency cepstral coefficient extraction improves the noise robustness of the speech features.

Experiments are conducted on the Aurora 3 corpus, with training and testing following the standard procedure provided by Aurora 3. The results show that synaptic adaptation improves recognition accuracy by a relative 16.6%, temporal integration by 21.6%, and a second variant of temporal integration by 22.5%. Combining synaptic adaptation with the two temporal integration methods yields further improvements of 26.3% and 25.5%. Optimizing the filter parameters then raises the relative improvement of the synaptic adaptation filter to 18.4%, the temporal integration filter to 25.2%, and the second temporal integration filter to 22.6%. For the two combined-filter methods, the relative improvements increase to 26.9% and 26.3%.
Abstract
The human auditory perception system is much more noise-robust than any state-of-the-art automatic speech recognition (ASR) system. It is expected that the noise robustness of speech feature vectors may be improved by employing more human auditory functions in the feature extraction procedure.

Forward masking is a phenomenon of human auditory perception in which a weaker sound is masked by a preceding, stronger masker. In this work, two human auditory mechanisms, synaptic adaptation and temporal integration, are implemented as filter functions and incorporated into MFCC feature extraction to model forward masking. A filter optimization algorithm is proposed to optimize the filter parameters.

The performance of the proposed method is evaluated on the Aurora 3 corpus, and the training/testing procedure follows the standard setting provided by the Aurora 3 task. The synaptic adaptation filter achieves a relative improvement of 16.6% over the baseline. The temporal integration and modified temporal integration filters achieve relative improvements of 21.6% and 22.5%, respectively. Combining synaptic adaptation with each of the temporal integration filters yields further improvements of 26.3% and 25.5%. Applying the filter optimization to the synaptic adaptation filter and the two temporal integration filters raises their improvements to 18.4%, 25.2%, and 22.6%, respectively. The performance of the combined-filter models is also improved, with relative improvements of 26.9% and 26.3%.
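Both mechanisms described in the abstract act as filters applied along the time axis of the feature trajectories. As a rough illustration of this style of temporal filtering — not the thesis's actual synaptic adaptation or temporal integration filter forms, and with illustrative rather than optimized coefficients `alpha` and `beta` — a first-order recursive filter applied per cepstral coefficient might look like:

```python
import numpy as np

def adaptation_filter(features, alpha=0.95, beta=0.9):
    """Apply a first-order temporal filter to each feature trajectory:

        y[t] = x[t] - alpha * x[t-1] + beta * y[t-1]

    features: (num_frames, num_coeffs) array of MFCC-like features.
    The filter attenuates slowly varying components and emphasizes
    onsets, loosely mimicking forward masking, where energy that
    follows a stronger sound is suppressed.
    """
    num_frames, num_coeffs = features.shape
    out = np.zeros((num_frames, num_coeffs), dtype=float)
    prev_x = np.zeros(num_coeffs)  # x[t-1], zero before the first frame
    prev_y = np.zeros(num_coeffs)  # y[t-1]
    for t in range(num_frames):
        out[t] = features[t] - alpha * prev_x + beta * prev_y
        prev_x = features[t].astype(float)
        prev_y = out[t]
    return out
```

With `alpha = 1.0` and `beta = 0.0` this reduces to a frame-to-frame difference, which zeroes out any steady-state (constant) trajectory while passing onsets through — the qualitative behavior a forward-masking-inspired filter aims for.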
目次 Table of Contents
List of Tables iii
List of Figures iv
Acknowledgments vi
Chapter 1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Chapter 2 Related Works 4
2.1 MFCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Common Methods of Noise Robustness . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Wiener Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Spectral Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Temporal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Cepstral Mean Subtraction . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 RelAtive SpecTrAl (RASTA) . . . . . . . . . . . . . . . . . . . . . 11
2.3.3 TempoRAl Patterns (TRAPs) . . . . . . . . . . . . . . . . . . . . . . 11
2.3.4 MVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.5 Temporal Structure Normalization . . . . . . . . . . . . . . . . . . . 13
2.3.6 Data Driven Temporal Filter . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Human Auditory Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1 Two-Tone Suppression . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.2 Zero-Crossings with Peak Amplitudes (ZCPA) . . . . . . . . . . . . 17
2.4.3 A Dynamic Forward Masking Model . . . . . . . . . . . . . . . . . 18
Chapter 3 The Proposed Algorithm 20
3.1 Forward Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.1 Synaptic Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Temporal Integration . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.2.1 A Modification of Temporal Integration Filter . . . . . . . 30
3.1.3 Synaptic Adaptation with Temporal Integration . . . . . . . . . . . . 33
3.1.3.1 Synaptic Adaptation with Modified Temporal Integration . 34
3.2 Filter Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.2 Synaptic Adaptation Filter . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.3 Temporal Integration Filter . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.3.1 Modified Temporal Integration Filter . . . . . . . . . . . . 43
3.2.4 Synaptic Adaptation with Temporal Integration . . . . . . . . . . . . 45
3.2.4.1 Synaptic adaptation with Modified Temporal Integration . . 47
3.2.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 4 Experimental Results 51
4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.1 Recognition System . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.2 Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Significance Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 5 Conclusion and Future Works 61
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Bibliography 63
Appendix A The Training Data for Filter Optimization 67
電子全文 Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.
論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Public-access information for printed theses is relatively complete from academic year 102 (2013–2014) onward. To inquire about the public-access status of printed theses from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
