博碩士論文詳細資訊 Thesis Detail Record: etd-0716117-130558
論文名稱 Title
結合深層類神經網路除噪自動編碼器於噪音強健性數字語音辨識
Combining Deep Neural Network Denoising Autoencoders for Noise-Robust Continuous Digit Speech Recognition
系所名稱 Department
畢業學年期 Year, Semester
語文別 Language
學位類別 Degree
頁數 Number of Pages
65
研究生 Author
指導教授 Advisor
召集委員 Convenor
口試委員 Advisory Committee
口試日期 Date of Exam
2017-07-21
繳交日期 Date of Submission
2017-08-17
關鍵字 Keywords
denoising autoencoder, convolutional neural network, deep learning, automatic speech recognition, fully connected neural network
統計 Statistics
This thesis/dissertation has been viewed 5761 times and downloaded 28 times.
中文摘要 Chinese Abstract
This thesis combines deep neural network denoising autoencoders with Gaussian mixture models to implement an automatic speech recognition system on the Aurora 2.0 corpus. In the denoising stage, the system applies different kinds of denoising autoencoders to the input speech features; in the acoustic model, Gaussian mixture models are used to recognize the phonemes. Our denoising autoencoders improve on the recurrent denoising autoencoder proposed by A. Maas et al. and achieve better results. First, we reconstruct the features with a different activation function, replacing the original tanh with LeakyReLU to fully capture the nonlinear relationships among the features. Second, on the architectural side, we propose three structures: the Fully Connected Denoising Autoencoder (FCDAE), the Convolutional Denoising Autoencoder (CDAE), and a convolutional fully connected denoising autoencoder, and we use them to study how model architecture affects speech denoising performance. Third, for feature reconstruction we adopt a wide-context approach that reconstructs each feature from frames over windows of different sizes, capturing the correlation between preceding and following frames. With these improvements, recognition with MFCC features achieves relative improvements of 76.5% and 41% over the MFCC clean-condition and multi-condition training baselines, and reaches a recognition rate close to that of the Advanced Front End (AFE) feature extraction method standardized by the European Telecommunications Standards Institute in 2002. Finally, we combine this method with AFE, obtaining a word correct rate of 92.42%, a 7.8% relative improvement over the AFE multi-condition baseline on the Aurora 2.0 corpus.
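As a concrete illustration of the wide-context feature reconstruction described above, the following sketch stacks each MFCC frame with its neighbors; the 39-coefficient frames and the 5-frame context on each side are illustrative assumptions, not the window sizes evaluated in the thesis (see Section 4.6).

# A sketch (illustrative, not the thesis configuration) of stacking each
# MFCC frame with its neighboring frames so the denoiser can exploit the
# correlation between a frame and its context.
import numpy as np

def frame_windows(feats: np.ndarray, context: int = 5) -> np.ndarray:
    """feats: (n_frames, n_coef) -> (n_frames, (2*context + 1) * n_coef).

    Each output row is one frame plus `context` frames on each side,
    flattened; the edges are padded by repeating the first/last frame.
    """
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    width = 2 * context + 1
    return np.stack([padded[i:i + width].ravel() for i in range(len(feats))])

# Example: 120 frames of 39-dimensional MFCCs -> 120 windows of 429 dims each
mfcc = np.random.randn(120, 39).astype(np.float32)
windows = frame_windows(mfcc)  # shape: (120, 429)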
Abstract
In this thesis, we combine deep neural network denoising autoencoders with Gaussian mixture models to implement an automatic speech recognition system on the Aurora 2.0 database. The denoising stage of the system uses different types of denoising autoencoders to denoise the input speech features, while the acoustic model uses Gaussian mixture models to recognize the phonemes. Our experiments build on the DRDAE proposed by A. Maas et al. and achieve a better word correct rate. First, we reconstruct the features with a different activation function, replacing the original tanh with Leaky ReLU to fully describe their nonlinear relationships. Second, we propose three denoising models: the Fully Connected Denoising Autoencoder (FCDAE), the Convolutional Denoising Autoencoder (CDAE), and the Convolutional Fully Connected Denoising Autoencoder, and we compare the noise reduction ability of these architectures. In the feature reconstruction stage, we use temporal context windows of different sizes to capture the correlation between a frame and its neighbors. In summary, the system obtains a better average word correct rate than the MFCC baselines in the original Aurora 2.0 paper from 2000: our method achieves relative improvements of 76.5% and 41% over the MFCC clean-condition and multi-condition baselines, respectively. Finally, we combine our method with the AFE front end and obtain a word correct rate of 92.42%, a 7.8% relative improvement over the AFE multi-condition baseline on the Aurora 2.0 database.
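To make the FCDAE architecture concrete, here is a minimal Keras sketch along the lines described above; the layer widths, LeakyReLU slope, optimizer, and window size are illustrative assumptions rather than the configuration reported in the thesis. The network maps a flattened window of noisy MFCC frames to the clean center frame and is trained with mean squared error on parallel noisy/clean Aurora 2.0 utterances.

# A hedged sketch of a fully connected denoising autoencoder (FCDAE):
# noisy context window in, clean center frame out. Layer sizes are
# illustrative; the thesis's actual topology is given in Section 3.4.1.
from tensorflow import keras
from tensorflow.keras import layers

N_COEF = 39              # MFCC dimensionality per frame (assumed)
WINDOW = 11 * N_COEF     # 5 context frames on each side (assumed)

model = keras.Sequential([
    keras.Input(shape=(WINDOW,)),
    layers.Dense(1024), layers.LeakyReLU(0.01),  # LeakyReLU replaces tanh
    layers.Dense(1024), layers.LeakyReLU(0.01),
    layers.Dense(N_COEF),  # linear output: reconstructed clean frame
])
model.compile(optimizer="adam", loss="mse")

# Training pairs come from parallel corpora: the same utterance with and
# without added noise, e.g.
# model.fit(noisy_windows, clean_frames, epochs=20, batch_size=256)

The convolutional variants (CDAE) replace the dense layers with convolutions over the time-frequency plane, following the architectural comparison in Section 4.4.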
目次 Table of Contents
Thesis Approval Form i
Acknowledgments ii
Abstract (Chinese) iii
ABSTRACT v
Table of Contents vii
List of Tables ix
List of Figures xi

Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Thesis Organization 3

Chapter 2 Literature Review 4
2.1 Traditional Speech Recognition 4
2.2 Traditional Speech Recognition on Aurora 2.0 5
2.3 The Rise of Neural Networks 5
2.4 Combining Neural Networks with Speech Recognition 6

Chapter 3 Methodology 8
3.1 Corpus Overview 8
3.2 Feature Extraction 10
3.2.1 MFCC Feature Extraction 10
3.2.2 Advanced Front-End Feature Extraction for Distributed Speech Recognition 12
3.3 Feature Normalization 13
3.4 Denoising Autoencoders 13
3.4.1 Fully Connected Denoising Autoencoder 15
3.4.2 Convolutional Denoising Autoencoder 16
3.4.3 Deep Recurrent Denoising Autoencoder 19
3.5 Acoustic Model Architecture 19
3.6 Tools: TensorFlow / Keras 20

Chapter 4 Experiments 23
4.1 Experimental Setup 23
4.2 Baseline Experiments and Average Word Correct Rate of All Results 24
4.2.1 MFCC Baseline 27
4.2.2 Recurrent Denoising Autoencoder (DRDAE) 28
4.2.3 Advanced Front End 28
4.3 Effect of Activation Functions on the Denoising Model 29
4.4 Comparing the Denoising Ability of Different Architectures 33
4.5 Combining Convolutional and Fully Connected Networks 36
4.6 Narrow- and Wide-Context Feature Reconstruction 38
4.7 Combining AFE with Wide-Context Feature Reconstruction 41
4.8 On the Use of Clean Data and the Effect of Denoising 44

Chapter 5 Conclusions and Future Work 46

Bibliography 48
參考文獻 References
[1] ETSI, “Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Advanced Front-End Feature Extraction Algorithm; Compression Algorithms,” ETSI ES 202 050, v1, 2002.
[2] P. Lockwood and J. Boudy, “Experiments with a Nonlinear Spectral Subtractor (NSS), Hidden Markov Models and the Projection, for Robust Speech Recognition in Cars,” Speech Communication, vol. 11, no. 2-3, pp. 215–228, 1992.
[3] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall PTR, 2001.
[4] Y. Ephraim and D. Malah, “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
[5] P. J. Moreno, B. Raj, and R. M. Stern, “Data-Driven Environmental Compensation for Speech Recognition: A Unified Approach,” Speech Communication, vol. 24, no. 4, pp. 267–285, 1998.
[6] M. J. Gales and S. J. Young, “Cepstral Parameter Compensation for HMM Recognition in Noise,” Speech Communication, vol. 12, no. 3, pp. 231–239, 1993.
[7] H.-G. Hirsch and D. Pearce, “The Aurora Experimental Framework for the Performance Evaluation of Speech Recognition Systems under Noisy Conditions,” in ITRW, pp. 29–32, 2000.
[8] D. P. Ellis and M. J. R. Gomez, “Investigations into Tandem Acoustic Modeling for the Aurora Task,” in INTERSPEECH, pp. 189–192, 2001.
[9] J. Barker, M. Cooke, and P. D. Green, “Robust ASR Based on Clean Speech Models: An Evaluation of Missing Data Techniques for Connected Digit Recognition in noise,” in INTERSPEECH, pp. 213–217, 2001.
[10] J. de Veth, L. Mauuary, B. Noe, F. de Wet, J. Sienel, L. Boves, and D. Jouvet, “Feature Vector Selection to Improve ASR Robustness in Noisy Conditions,” in INTERSPEECH, pp. 201–204, 2001.
[11] L. Deng, A. Acero, L. Jiang, J. Droppo, and X. Huang, “High-Performance Robust Speech Recognition Using Stereo Training Data,” in ICASSP, vol. 1, pp. 301–304, 2001.
[12] M. Lieb and A. Fischer, “Experiments with the Philips Continuous ASR System on the AURORA Noisy Digits Database,” in INTERSPEECH, pp. 625–628, 2001.
[13] H. Hirsch and D. Pearce, “Applying the Advanced ETSI Frontend to the Aurora-2 Task,” vol. 1, 2006.
[14] W. S. McCulloch and W. Pitts, “A Logical Calculus of the Ideas Immanent in Nervous Activity,” The Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, 1943.
[15] D. O. Hebb, The Organization of Behavior: A Neuropsychological Theory. Psychology Press, 2005.
[16] G. E. Hinton and R. R. Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[17] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech Recognition with Deep Recurrent Neural Networks,” in ICASSP, pp. 6645–6649, IEEE, 2013.
[18] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Achieving Human Parity in Conversational Speech Recognition,” arXiv preprint arXiv:1610.05256, 2016.
[19] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, et al., “English Conversational Telephone Speech Recognition by Humans and Machines,” arXiv preprint arXiv:1703.02136, 2017.
[20] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy Layer-Wise Training of Deep Networks,” in NIPS, pp. 153–160, 2007.
[21] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech Enhancement Based on Deep Denoising Autoencoder,” in Interspeech, pp. 436–440, 2013.
[22] P. Lin, D.-C. Lyu, F. Chen, S.-S. Wang, and Y. Tsao, “Multi-style Learning with Denoising Autoencoders for Acoustic Modeling in the Internet of Things (IoT),” Computer Speech & Language, 2017.
[23] A. L. Maas, Q. V. Le, T. M. O’Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, “Recurrent Neural Networks for Noise Reduction in Robust ASR,” in INTERSPEECH, pp. 22–25, 2012.
[24] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A Regression Approach to Speech Enhancement Based on Deep Neural Networks,” IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2015.
[25] J. Li, Y. Huang, and Y. Gong, “Improved Cepstra Minimum-Mean-Square-Error Noise Reduction Algorithm for Robust Speech Recognition,” in ICASSP, pp. 4865–4869, IEEE, 2017.
[26] B. Li and K. C. Sim, “Noise Adaptive Front-End Normalization Based on Vector Taylor Series for Deep Neural Networks in Robust Speech Recognition,” in ICASSP, pp. 7408–7412, IEEE, 2013.
[27] B. Li and K. C. Sim, “Improving Robustness of Deep Neural Networks via Spectral Masking for Automatic Speech Recognition,” in ASRU, pp. 279–284, IEEE, 2013.
[28] O. Viikki and K. Laurila, “Cepstral Domain Segmental Feature Vector Normalization for Noise Robust Speech Recognition,” Speech Communication, vol. 25, no. 1, pp. 133–147, 1998.
[29] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and Composing Robust Features with Denoising Autoencoders,” in ICML, pp. 1096–1103, ACM, 2008.
[30] H. Lee, P. Pham, Y. Largman, and A. Y. Ng, “Unsupervised Feature Learning for Audio Classification Using Convolutional Deep Belief Networks,” in NIPS, pp. 1096–1104, 2009.
[31] A. Radford, L. Metz, and S. Chintala, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks,” arXiv preprint arXiv:1511.06434, 2015.
[32] ETSI, “Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Front-End Feature Extraction Algorithm; Compression Algorithms,” ETSI ES 201 108, v1, 2000.
電子全文 Fulltext
This electronic full text is licensed for retrieval, reading, and printing by individual users for non-profit academic research only. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission: 自定論文開放時間 user-defined release date
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Public-access information for printed theses is relatively complete from academic year 102 (2013/14) onward. To look up public-access information for printed theses from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
