博碩士論文詳細資訊 Thesis/Dissertation Detail: etd-0901109-143419
論文名稱 Title
強健性自動語音辨識之基於聽覺模型的梅爾倒頻譜參數擷取調整
Auditory Based Modification of MFCC Feature Extraction for Robust Automatic Speech Recognition
系所名稱 Department
畢業學年期 Year, semester
語文別 Language
學位類別 Degree
頁數 Number of pages
80
研究生 Author
指導教授 Advisor
召集委員 Convenor
口試委員 Advisory Committee
口試日期 Date of Exam
2009-07-25
繳交日期 Date of Submission
2009-09-01
關鍵字 Keywords
噪音強健性、聽覺模型、後遮蔽、自動語音辨識
forward masking, auditory model, synaptic adaptation, temporal integration, noise robustness, ASR, automatic speech recognition
統計 Statistics
本論文已被瀏覽 5683 次,被下載 1012 次。
The thesis/dissertation has been browsed 5683 times, has been downloaded 1012 times.
中文摘要 Chinese Abstract
The human auditory perception system is more accurate and far less affected by noise than existing automatic speech recognition systems. By modeling human auditory perception within an automatic speech recognition system, we can improve the system's noise robustness.

Forward masking is an auditory-perception phenomenon in which a stronger sound masks the sounds that follow it. We model forward masking with two auditory mechanisms, synaptic adaptation and temporal integration, and implement both as filters. Incorporating this forward-masking model into Mel-frequency cepstral coefficient extraction improves the noise robustness of the speech features.

Experiments are conducted on the Aurora 3 corpus, with training and testing following the standard procedure provided by Aurora 3. The results show that synaptic adaptation improves recognition accuracy by a relative 16.6%, temporal integration by 21.6%, and a second variant of temporal integration by 22.5%. Combining synaptic adaptation with the two temporal integration methods yields further improvements of 26.3% and 25.5%. Optimizing the filter parameters then raises the relative improvement of the synaptic adaptation filter to 18.4%, the temporal integration filter to 25.2%, and the second temporal integration filter to 22.6%. For the two combined-filter methods, the relative improvements increase to 26.9% and 26.3%.
Abstract
The human auditory perception system is much more noise-robust than any state-of-the-art automatic speech recognition (ASR) system. It is expected that the noise robustness of speech feature vectors may be improved by employing more human auditory functions in the feature extraction procedure.

Forward masking is a phenomenon of human auditory perception in which a weaker sound is masked by a preceding, stronger masker. In this work, two human auditory mechanisms, synaptic adaptation and temporal integration, are implemented as filter functions and incorporated into MFCC feature extraction to model forward masking. A filter optimization algorithm is proposed to optimize the filter parameters.

The performance of the proposed method is evaluated on the Aurora 3 corpus, and the training/testing procedure follows the standard setting provided by the Aurora 3 task. The synaptic adaptation filter achieves a relative improvement of 16.6% over the baseline. The temporal integration and modified temporal integration filters achieve relative improvements of 21.6% and 22.5%, respectively. Combining synaptic adaptation with each of the temporal integration filters yields further improvements of 26.3% and 25.5%. Applying the filter optimization to the synaptic adaptation filter and the two temporal integration filters raises their improvements to 18.4%, 25.2%, and 22.6%, respectively. The performance of the combined-filter models is also improved, with relative improvements of 26.9% and 26.3%.
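Both mechanisms described in the abstract act as filters applied along the time axis of the feature trajectories. As a rough illustration of this style of temporal filtering — not the thesis's actual synaptic adaptation or temporal integration filter forms, and with illustrative rather than optimized coefficients `alpha` and `beta` — a first-order recursive filter applied per cepstral coefficient might look like:

```python
import numpy as np

def adaptation_filter(features, alpha=0.95, beta=0.9):
    """Apply a first-order temporal filter to each feature trajectory:

        y[t] = x[t] - alpha * x[t-1] + beta * y[t-1]

    features: (num_frames, num_coeffs) array of MFCC-like features.
    The filter attenuates slowly varying components and emphasizes
    onsets, loosely mimicking forward masking, where energy that
    follows a stronger sound is suppressed.
    """
    num_frames, num_coeffs = features.shape
    out = np.zeros((num_frames, num_coeffs), dtype=float)
    prev_x = np.zeros(num_coeffs)  # x[t-1], zero before the first frame
    prev_y = np.zeros(num_coeffs)  # y[t-1]
    for t in range(num_frames):
        out[t] = features[t] - alpha * prev_x + beta * prev_y
        prev_x = features[t].astype(float)
        prev_y = out[t]
    return out
```

With `alpha = 1.0` and `beta = 0.0` this reduces to a frame-to-frame difference, which zeroes out any steady-state (constant) trajectory while passing onsets through — the qualitative behavior a forward-masking-inspired filter aims for.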
目次 Table of Contents
List of Tables iii
List of Figures iv
Acknowledgments vi
Chapter 1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Chapter 2 Related Works 4
2.1 MFCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Common Methods of Noise Robustness . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Wiener Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Spectral Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Temporal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Cepstral Mean Subtraction . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 RelAtive SpecTrAl (RASTA) . . . . . . . . . . . . . . . . . . . . . 11
2.3.3 TempoRAl Patterns (TRAPs) . . . . . . . . . . . . . . . . . . . . . . 11
2.3.4 MVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.5 Temporal Structure Normalization . . . . . . . . . . . . . . . . . . . 13
2.3.6 Data Driven Temporal Filter . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Human Auditory Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1 Two-Tone Suppression . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.2 Zero-Crossings with Peak Amplitudes (ZCPA) . . . . . . . . . . . . 17
2.4.3 A Dynamic Forward Masking Model . . . . . . . . . . . . . . . . . 18
Chapter 3 The Proposed Algorithm 20
3.1 Forward Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.1 Synaptic Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Temporal Integration . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.2.1 A Modification of Temporal Integration Filter . . . . . . . 30
3.1.3 Synaptic Adaptation with Temporal Integration . . . . . . . . . . . . 33
3.1.3.1 Synaptic Adaptation with Modified Temporal Integration . 34
3.2 Filter Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.2 Synaptic Adaptation Filter . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.3 Temporal Integration Filter . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.3.1 Modified Temporal Integration Filter . . . . . . . . . . . . 43
3.2.4 Synaptic Adaptation with Temporal Integration . . . . . . . . . . . . 45
3.2.4.1 Synaptic adaptation with Modified Temporal Integration . . 47
3.2.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 4 Experimental Results 51
4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.1 Recognition System . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.2 Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Significance Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 5 Conclusion and Future Works 61
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Bibliography 63
Appendix A The Training Data for Filter Optimization 67
電子全文 Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.
論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Public-access information for printed theses is relatively complete from academic year 102 (2013–2014) onward. To inquire about the public-access status of printed theses from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
