論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available
論文名稱 Title: 強健性自動語音辨識之基於聽覺模型的梅爾倒頻譜參數擷取調整 Auditory Based Modification of MFCC Feature Extraction for Robust Automatic Speech Recognition
系所名稱 Department:
畢業學年期 Year, semester:
語文別 Language:
學位類別 Degree:
頁數 Number of pages: 80
研究生 Author:
指導教授 Advisor:
召集委員 Convenor:
口試委員 Advisory Committee:
口試日期 Date of Exam: 2009-07-25
繳交日期 Date of Submission: 2009-09-01
關鍵字 Keywords: forward masking, auditory model, synaptic adaptation, temporal integration, noise robustness, ASR, automatic speech recognition
統計 Statistics |
The thesis/dissertation has been browsed 5683 times and downloaded 1012 times.
中文摘要 Chinese Abstract
The human auditory perception system is more accurate and less affected by noise than existing automatic speech recognition (ASR) systems; modeling human auditory perception within an ASR system can therefore improve its noise robustness. Forward masking is an auditory phenomenon in which a stronger sound masks the sounds that follow it. We model forward masking with two auditory mechanisms, synaptic adaptation and temporal integration, implement them as filters, and incorporate the forward-masking model into Mel-frequency cepstral coefficient (MFCC) extraction to improve the noise robustness of the speech features. Experiments are conducted on the Aurora 3 corpus, with training and testing following the standard procedure provided by Aurora 3. The results show that synaptic adaptation improves recognition accuracy by a relative 16.6%, temporal integration by 21.6%, and an alternative temporal integration by 22.5%. Combining synaptic adaptation with each of the two temporal integration methods raises the improvement further, to 26.3% and 25.5%. After optimizing the filter parameters, the relative improvement of the synaptic adaptation filter rises to 18.4%, that of the temporal integration filter to 25.2%, and that of the second temporal integration filter to 22.6%; for the two combined-filter methods, the relative improvements increase to 26.9% and 26.3%.
Abstract
The human auditory perception system is far more noise-robust than any state-of-the-art automatic speech recognition (ASR) system, so incorporating more human auditory functions into the feature extraction procedure is expected to improve the noise robustness of speech feature vectors. Forward masking is a phenomenon of human auditory perception in which a weaker sound is masked by a preceding, stronger masker. In this work, two human auditory mechanisms, synaptic adaptation and temporal integration, are implemented as filter functions and incorporated into MFCC feature extraction to model forward masking. A filter optimization algorithm is also proposed to optimize the filter parameters. The proposed method is evaluated on the Aurora 3 corpus, with training and testing following the standard setting provided by the Aurora 3 task. The synaptic adaptation filter achieves a relative improvement of 16.6% over the baseline, while the temporal integration and modified temporal integration filters achieve relative improvements of 21.6% and 22.5%, respectively. Combining synaptic adaptation with each of the temporal integration filters yields further improvements of 26.3% and 25.5%. Applying the filter optimization raises the improvements of the synaptic adaptation filter and the two temporal integration filters to 18.4%, 25.2%, and 22.6%, respectively, and those of the combined-filter models to 26.9% and 26.3%.
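The abstract describes forward masking as filtering applied along the feature stream. As a loose illustration only (the thesis's actual filter forms and coefficients are not given here; the first-order filters below are assumptions), an "adaptation"-style highpass followed by an "integration"-style lowpass can be applied to each coefficient trajectory of an MFCC matrix:

```python
# Hypothetical sketch: forward masking modeled as temporal filtering of MFCC
# trajectories. Filter shapes/coefficients are illustrative assumptions, not
# the parameters used in the thesis.
import numpy as np

def temporal_filter(feats, b, a):
    """Apply a direct-form IIR filter along the time axis of a
    (frames x coefficients) feature matrix, one trajectory at a time."""
    feats = np.asarray(feats, dtype=float)
    out = np.zeros_like(feats)
    for k in range(feats.shape[1]):
        x = feats[:, k]
        y = np.zeros_like(x)
        for n in range(len(x)):
            # feedforward part: sum_i b[i] * x[n-i]
            acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
            # feedback part: - sum_j a[j] * y[n-j], j >= 1
            acc -= sum(a[j] * y[n - j] for j in range(1, len(a)) if n - j >= 0)
            y[n] = acc / a[0]
        out[:, k] = y
    return out

# Example: a highpass-like "adaptation" stage that suppresses slowly varying
# (masker-dominated) components, then a lowpass "integration" smoother.
mfcc = np.random.randn(100, 13)                                  # stand-in frames
adapted = temporal_filter(mfcc, b=[1.0, -0.95], a=[1.0])         # adaptation
integrated = temporal_filter(adapted, b=[0.25], a=[1.0, -0.75])  # integration
```

Cascading the two stages mirrors the combined synaptic-adaptation-plus-temporal-integration configuration the abstract reports as the best-performing variant.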
目次 Table of Contents |
List of Tables
List of Figures
Acknowledgments
Chapter 1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Organization of the Thesis
Chapter 2 Related Works
  2.1 MFCC
  2.2 Common Methods of Noise Robustness
    2.2.1 Wiener Filter
    2.2.2 Spectral Subtraction
  2.3 Temporal Processing
    2.3.1 Cepstral Mean Subtraction
    2.3.2 RelAtive SpecTrAl (RASTA)
    2.3.3 TempoRAl Patterns (TRAPs)
    2.3.4 MVA
    2.3.5 Temporal Structure Normalization
    2.3.6 Data Driven Temporal Filter
  2.4 Human Auditory Models
    2.4.1 Two-Tone Suppression
    2.4.2 Zero-Crossings with Peak Amplitudes (ZCPA)
    2.4.3 A Dynamic Forward Masking Model
Chapter 3 The Proposed Algorithm
  3.1 Forward Masking
    3.1.1 Synaptic Adaptation
    3.1.2 Temporal Integration
      3.1.2.1 A Modification of Temporal Integration Filter
    3.1.3 Synaptic Adaptation with Temporal Integration
      3.1.3.1 Synaptic Adaptation with Modified Temporal Integration
  3.2 Filter Optimization
    3.2.1 Training Data
    3.2.2 Synaptic Adaptation Filter
    3.2.3 Temporal Integration Filter
      3.2.3.1 Modified Temporal Integration Filter
    3.2.4 Synaptic Adaptation with Temporal Integration
      3.2.4.1 Synaptic Adaptation with Modified Temporal Integration
    3.2.5 Implementation
Chapter 4 Experimental Results
  4.1 Experimental Setup
    4.1.1 Recognition System
    4.1.2 Corpus
  4.2 Experimental Results
  4.3 Significance Test
Chapter 5 Conclusion and Future Works
  5.1 Conclusion
  5.2 Future Works
Bibliography
Appendix A The Training Data for Filter Optimization
電子全文 Fulltext
This electronic fulltext is licensed to users solely for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please observe the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
紙本論文 Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 (2013) onward. To inquire about printed theses from academic year 101 (2012) or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. 開放時間 Available: 已公開 available