Title page for etd-0825110-171640
Title
基於聽覺前端之噪音強健性自動語音辨識
Auditory Front-Ends for Noise-Robust Automatic Speech Recognition
Department
Year, semester
Language
Degree
Number of pages
52
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2010-07-28
Date of Submission
2010-08-25
Keywords
frequency masking, front-end processing, feature extraction, noise robustness, automatic speech recognition
Statistics
This thesis has been viewed 5626 times and downloaded 0 times.
Chinese Abstract
The human auditory perception system is more accurate and far less affected by noise than any existing automatic speech recognition system. It can therefore be expected that simulating a model of human auditory perception within an automatic speech recognition system will improve its noise robustness.

In this thesis, we study and modify the feature extraction front-end commonly used in automatic speech recognition (ASR) systems. A new frequency masking curve, based on modeling the basilar membrane as a cascade of damped simple harmonic oscillators, replaces the conventional critical-band masking curve in computing the masking threshold. We mathematically analyze the coupled motion of the oscillators when they are driven by short-time stationary speech signals. From this analysis we obtain the relation between the amplitudes of neighboring oscillators, and on that basis insert an ear model into the feature extraction procedure to modify the original speech spectrum.

We evaluate the method on the Aurora 2.0 corpus. When the proposed auditory front-end is combined with the commonly used cepstral mean subtraction (CMS) post-processing, it achieves a clear improvement: adding CMS to our correlation-based masking method yields a 25.9% relative improvement over the baseline, and applying the proposed method iteratively raises the relative improvement from 25.9% to 30.3%.
Abstract
The human auditory perception system is much more noise-robust than any state-of-the-art automatic speech recognition (ASR) system. It is therefore expected that the noise robustness of speech features can be improved by employing a feature extraction procedure based on human auditory perception.


In this thesis, we modify the commonly used feature extraction process for automatic speech recognition systems. A novel frequency masking curve, based on modeling the basilar membrane as a cascade of damped simple harmonic oscillators, replaces the critical-band masking curve in computing the masking threshold. We mathematically analyze the coupled motion of the oscillator system (the basilar membrane) when it is driven by short-time stationary (speech) signals. From this analysis we derive the relation between the amplitudes of neighboring oscillators, and accordingly insert a masking module into the front-end signal processing stage to modify the speech spectrum.
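As a rough physical illustration (a generic damped-driven oscillator sketch, not the thesis's actual derivation; the function name and quality factor are illustrative assumptions), the steady-state response of each oscillator falls off away from its resonant frequency, so a strong tone spreads excitation onto neighboring channels, which is the intuition behind a masking curve:

```python
import math

def sho_gain(f, f0, q):
    """Steady-state amplitude response of a damped simple harmonic
    oscillator with resonant frequency f0 and quality factor q,
    driven at frequency f (illustrative, not the thesis's exact model)."""
    w, w0 = 2 * math.pi * f, 2 * math.pi * f0
    return 1.0 / math.sqrt((w0**2 - w**2) ** 2 + (w0 * w / q) ** 2)

# A masker at 1 kHz excites oscillators tuned to neighboring
# frequencies; the spread of excitation acts like a masking curve.
masker_hz = 1000.0
for f0 in (500.0, 800.0, 1000.0, 1250.0, 2000.0):
    g = sho_gain(masker_hz, f0, q=4.0)
    print(f"oscillator at {f0:6.0f} Hz: relative response {g:.3e}")
```

The response is largest for the oscillator tuned to the masker frequency and decays on both sides, mimicking a frequency masking pattern.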

We evaluate the proposed method on the Aurora 2.0 noisy-digit speech database. When combined with the commonly used cepstral mean subtraction (CMS) post-processing, the proposed auditory front-end achieves a significant improvement: the correlational masking curve combined with CMS yields a 25.9% relative improvement over the baseline. Applying the method iteratively raises the relative improvement from 25.9% to 30.3%.
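Cepstral mean subtraction itself is a standard post-processing step. The sketch below shows per-utterance CMS and the relative-improvement arithmetic behind the figures above; the feature values and word error rates are hypothetical toy numbers, only the 25.9% target comes from the thesis:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean from each cepstral dimension
    (standard CMS; removes stationary convolutional channel effects)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def relative_improvement(baseline_err, new_err):
    """Relative error-rate reduction, as used for the 25.9% / 30.3% figures."""
    return (baseline_err - new_err) / baseline_err

# Toy example: 4 frames of 3-dimensional cepstral features.
c = np.array([[1.0, 2.0, 0.5],
              [1.2, 1.8, 0.7],
              [0.8, 2.2, 0.3],
              [1.0, 2.0, 0.5]])
c_cms = cepstral_mean_subtraction(c)
print(c_cms.mean(axis=0))  # each dimension now has (numerically) zero mean

# If a hypothetical baseline error rate of 20% dropped to 14.82%,
# the relative improvement would be 25.9%.
print(round(relative_improvement(0.20, 0.1482), 3))  # 0.259
```

Because CMS removes the long-term average of each cepstral coefficient, it discards stationary channel and microphone effects while leaving the short-term spectral dynamics that carry phonetic information.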
Table of Contents
List of Tables iii
List of Figures iv
Acknowledgments vi
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 3
1.3 Organization of the Thesis 4
Chapter 2 Related Works 5
2.1 Ear Structure 5
2.1.1 The Outer Ear 5
2.1.2 The Middle Ear 6
2.1.3 The Inner Ear 7
2.1.3.1 The Cochlea 7
2.1.3.2 The Basilar Membrane 8
2.2 Auditory Model 9
2.2.1 Frequency Scale 9
2.2.2 Two-Tone Inhibition 10
2.2.3 Synaptic Adaptation 12
2.2.4 Temporal Integration 14
2.2.5 Temporal Masking 14
2.3 Common Methods in Each Stage of ASR 15
2.3.1 Feature Extraction 16
2.3.2 Normalization 18
2.3.2.1 CMN 18
2.3.2.2 MVA 19
2.3.3 Inclusion of Temporal Information 19
2.3.3.1 The Differential Cepstrum 19
2.3.3.2 The Cepstral-Time Matrix 20
Chapter 3 The Proposed Algorithm 21
3.1 Frequency Masking 21
3.1.1 Threshold of Hearing 22
3.1.2 Critical-Band Masking Curve 24
3.2 Model Analysis 26
3.2.1 Model for the Basilar Membrane 27
3.2.2 Damped Simple Harmonic Oscillation 28
3.3 Coupled Effect 28
3.3.1 Coupled Masking Effect 29
3.3.2 Implementation of Masking Curve 30
Chapter 4 Simulation Results 32
4.1 Experimental Setup 32
4.1.1 Recognition System 32
4.1.2 Corpus 33
4.2 Results and Discussion 33
Chapter 5 Conclusion and Future Works 37
5.1 Conclusion 37
5.2 Future Works 37
Fulltext
This electronic fulltext is licensed only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: not available on campus or off campus
Available:
Campus: never (not available)
Off-campus: never (not available)


Printed copies
Availability information for printed copies is relatively complete for academic year 102 (2013) and later. To inquire about printed copies from academic year 101 (2012) and earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available (released to the public)
