Title page for etd-0825110-171640
Title
基於聽覺前端之噪音強健性自動語音辨識
Auditory Front-Ends for Noise-Robust Automatic Speech Recognition
Department
Year, semester
Language
Degree
Number of pages
52
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2010-07-28
Date of Submission
2010-08-25
Keywords
frequency masking, front-end processing, feature extraction, noise robustness, automatic speech recognition
Statistics
This thesis has been viewed 5626 times and downloaded 0 times.
Chinese Abstract
The human auditory perception system is more accurate and far less affected by noise than any existing automatic speech recognition system. It can therefore be expected that simulating a model of human auditory perception within an automatic speech recognition system will improve its noise robustness.

In this thesis, we study and modify the feature extraction front-end commonly used in automatic speech recognition (ASR) systems. A new frequency masking curve, based on modeling the basilar membrane as a cascade of damped simple harmonic oscillators, replaces the conventional critical-band masking curve in computing the masking threshold. We mathematically analyze the coupled motion of the oscillators when they are driven by short-time stationary speech signals. From this analysis we obtain the relation between the amplitudes of neighboring oscillators, and on that basis insert an ear model into the feature extraction procedure to modify the original speech spectrum.

We evaluate the method on the Aurora 2.0 corpus. When the proposed auditory front-end is combined with the commonly used cepstral mean subtraction (CMS) post-processing, it achieves a clear improvement: adding CMS to our correlation-based masking method yields a 25.9% relative improvement over the baseline, and applying the proposed method iteratively raises the relative improvement from 25.9% to 30.3%.
Abstract
The human auditory perception system is much more noise-robust than any state-of-the-art automatic speech recognition (ASR) system. It is therefore expected that the noise robustness of speech features can be improved by employing a feature extraction procedure based on human auditory perception.


In this thesis, we modify the commonly used feature extraction process for automatic speech recognition systems. A novel frequency masking curve, based on modeling the basilar membrane as a cascade of damped simple harmonic oscillators, replaces the critical-band masking curve in computing the masking threshold. We mathematically analyze the coupled motion of the oscillator system (the basilar membrane) when it is driven by short-time stationary (speech) signals. From this analysis we derive the relation between the amplitudes of neighboring oscillators, and accordingly insert a masking module into the front-end signal processing stage to modify the speech spectrum.
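As a rough physical illustration (a generic damped-driven oscillator sketch, not the thesis's actual derivation; the function name and quality factor are illustrative assumptions), the steady-state response of each oscillator falls off away from its resonant frequency, so a strong tone spreads excitation onto neighboring channels, which is the intuition behind a masking curve:

```python
import math

def sho_gain(f, f0, q):
    """Steady-state amplitude response of a damped simple harmonic
    oscillator with resonant frequency f0 and quality factor q,
    driven at frequency f (illustrative, not the thesis's exact model)."""
    w, w0 = 2 * math.pi * f, 2 * math.pi * f0
    return 1.0 / math.sqrt((w0**2 - w**2) ** 2 + (w0 * w / q) ** 2)

# A masker at 1 kHz excites oscillators tuned to neighboring
# frequencies; the spread of excitation acts like a masking curve.
masker_hz = 1000.0
for f0 in (500.0, 800.0, 1000.0, 1250.0, 2000.0):
    g = sho_gain(masker_hz, f0, q=4.0)
    print(f"oscillator at {f0:6.0f} Hz: relative response {g:.3e}")
```

The response is largest for the oscillator tuned to the masker frequency and decays on both sides, mimicking a frequency masking pattern.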

We evaluate the proposed method on the Aurora 2.0 noisy-digit speech database. When combined with the commonly used cepstral mean subtraction (CMS) post-processing, the proposed auditory front-end achieves a significant improvement: the correlational masking curve combined with CMS yields a 25.9% relative improvement over the baseline. Applying the method iteratively raises the relative improvement from 25.9% to 30.3%.
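Cepstral mean subtraction itself is a standard post-processing step. The sketch below shows per-utterance CMS and the relative-improvement arithmetic behind the figures above; the feature values and word error rates are hypothetical toy numbers, only the 25.9% target comes from the thesis:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean from each cepstral dimension
    (standard CMS; removes stationary convolutional channel effects)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def relative_improvement(baseline_err, new_err):
    """Relative error-rate reduction, as used for the 25.9% / 30.3% figures."""
    return (baseline_err - new_err) / baseline_err

# Toy example: 4 frames of 3-dimensional cepstral features.
c = np.array([[1.0, 2.0, 0.5],
              [1.2, 1.8, 0.7],
              [0.8, 2.2, 0.3],
              [1.0, 2.0, 0.5]])
c_cms = cepstral_mean_subtraction(c)
print(c_cms.mean(axis=0))  # each dimension now has (numerically) zero mean

# If a hypothetical baseline error rate of 20% dropped to 14.82%,
# the relative improvement would be 25.9%.
print(round(relative_improvement(0.20, 0.1482), 3))  # 0.259
```

Because CMS removes the long-term average of each cepstral coefficient, it discards stationary channel and microphone effects while leaving the short-term spectral dynamics that carry phonetic information.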
Table of Contents
List of Tables iii
List of Figures iv
Acknowledgments vi
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 3
1.3 Organization of the Thesis 4
Chapter 2 Related Works 5
2.1 Ear Structure 5
2.1.1 The Outer Ear 5
2.1.2 The Middle Ear 6
2.1.3 The Inner Ear 7
2.1.3.1 The Cochlea 7
2.1.3.2 The Basilar Membrane 8
2.2 Auditory Model 9
2.2.1 Frequency Scale 9
2.2.2 Two-Tone Inhibition 10
2.2.3 Synaptic Adaptation 12
2.2.4 Temporal Integration 14
2.2.5 Temporal Masking 14
2.3 Common Methods in Each Stage of ASR 15
2.3.1 Feature Extraction 16
2.3.2 Normalization 18
2.3.2.1 CMN 18
2.3.2.2 MVA 19
2.3.3 Inclusion of Temporal Information 19
2.3.3.1 The Differential Cepstrum 19
2.3.3.2 The Cepstral-Time Matrix 20
Chapter 3 The Proposed Algorithm 21
3.1 Frequency Masking 21
3.1.1 Threshold of Hearing 22
3.1.2 Critical-Band Masking Curve 24
3.2 Model Analysis 26
3.2.1 Model for the Basilar Membrane 27
3.2.2 Damped Simple Harmonic Oscillation 28
3.3 Coupled Effect 28
3.3.1 Coupled Masking Effect 29
3.3.2 Implementation of Masking Curve 30
Chapter 4 Simulation Results 32
4.1 Experimental Setup 32
4.1.1 Recognition System 32
4.1.2 Corpus 33
4.2 Results and Discussion 33
Chapter 5 Conclusion and Future Works 37
5.1 Conclusion 37
5.2 Future Works 37
Fulltext
This electronic fulltext is licensed only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: not available on campus or off campus
Available:
Campus: never (not available)
Off-campus: never (not available)


Printed copies
Availability information for printed copies is relatively complete for academic year 102 (2013) and later. To inquire about printed copies from academic year 101 (2012) and earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available (released to the public)
