Master's/Doctoral Thesis etd-0730118-100155: Detailed Record (Title Page)
Title
Deep Neural Networks and Ensemble Learning with Application to Speech Emotion Recognition
Department
Year, semester
Language
Degree
Number of pages
56
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2018-07-31
Date of Submission
2018-08-30
Keywords
attention mechanism, speech emotion recognition, ensemble learning, deep neural networks
Statistics
The thesis/dissertation has been browsed 5675 times and downloaded 1164 times.
Abstract
This study uses deep neural networks to construct static and dynamic speech emotion recognition systems and integrates the static and dynamic models by ensemble learning. The static models are based on the multi-layer perceptron (MLP) and the convolutional neural network (CNN); the dynamic model is based on the recurrent neural network (RNN). Our CNN recognizer learns to focus on salient parts of the signal through an attention mechanism and promotes competition among a set of multi-scale convolutional filters through a multi-scale convolution module. The RNN recognizer also incorporates the attention mechanism to learn to focus on the informative segments. We adopt a skew-robust training criterion to deal with class-imbalanced data, and we exploit a two-pass teacher-student training scheme to deal with noisy labels. The proposed speech emotion recognition systems are evaluated on the FAU-Aibo corpus, using the tasks defined in the Interspeech 2009 Emotion Challenge classifier sub-challenge, with the unweighted average recall rate (UA) as the performance measure. Our MLP and CNN models achieve 46.2% and 46.4% UA, respectively, and our dynamic model based on a deep RNN achieves 47.2% UA, surpassing the previous best result of 46.4%. Furthermore, combining the static and dynamic models by interpolation-based ensemble learning achieves 50.5% UA, breaking the 50.0% barrier on the FAU-Aibo tasks for the first time since the Challenge was posted.
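To make the evaluation metric and the final combination step concrete, the following is a minimal sketch, assuming a five-class task and an equal-weight interpolation; the variable names, toy data, and the weight `alpha` are illustrative assumptions, not the thesis implementation.

```python
# Sketch of two notions from the abstract: the unweighted average recall (UA)
# metric and the interpolation-based ensemble of a static and a dynamic model.
import numpy as np


def unweighted_average_recall(y_true, y_pred, num_classes):
    """UA: mean of per-class recalls, so rare classes weigh as much as frequent ones."""
    recalls = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.sum() == 0:
            continue  # skip classes absent from the reference labels
        recalls.append((y_pred[mask] == c).mean())
    return float(np.mean(recalls))


def interpolate_posteriors(p_static, p_dynamic, alpha=0.5):
    """Ensemble by interpolation: convex combination of the two models' class posteriors."""
    return alpha * p_static + (1.0 - alpha) * p_dynamic


# Toy usage with random posteriors over five classes, standing in for the
# Interspeech 2009 Emotion Challenge five-class task on FAU-Aibo.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 5, size=200)
p_static = rng.dirichlet(np.ones(5), size=200)   # stand-in for MLP/CNN outputs
p_dynamic = rng.dirichlet(np.ones(5), size=200)  # stand-in for LSTM-RNN outputs
p_combined = interpolate_posteriors(p_static, p_dynamic, alpha=0.5)
y_pred = p_combined.argmax(axis=1)
print("UA of the combined model: %.3f" % unweighted_average_recall(y_true, y_pred, 5))
```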
Table of Contents
List of Tables vii
List of Figures viii
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Literature Review 2
1.3 Thesis Organization 4
Chapter 2 Basic Architectures and Tools 5
2.1 Deep Neural Networks 5
2.1.1 Multi-Layer Perceptron 5
2.1.2 Recurrent Neural Networks 6
2.1.3 Convolutional Neural Networks 9
2.2 Bilinear Interpolation 10
2.3 Tools: TensorFlow/Keras 11
Chapter 3 Corpus and Methods 13
3.1 Corpus and Feature Extraction 13
3.2 Speaker Normalization and Data Balancing 16
3.3 Teacher-Student Training 18
3.4 Attention Mechanism 19
3.4.1 Application to Multi-Scale CNN 19
3.4.2 Application to RNN 20
3.5 Bagging (Bootstrap Aggregating) 20
Chapter 4 Experiments 22
4.1 Experimental Procedure and Settings 22
4.2 Baseline Experiments 23
4.3 Static Model Experiments 24
4.3.1 MLP 24
4.3.2 Ordinary CNN 25
4.3.3 Multi-Scale CNN 26
4.3.4 Multi-Scale CNN with Attention Mechanism 26
4.4 Dynamic Model Experiments 28
4.4.1 LSTM-RNN 28
4.4.1.1 Last-frame only 28
4.4.1.2 Mean-pooling over time 28
4.4.2 LSTM-RNN with Attention Mechanism 29
4.5 Complexity Analysis of Static and Dynamic Models 32
4.6 Ensemble Learning Experiments 34
4.6.1 Bagging 34
4.6.2 Combination via Max-out Units 36
4.6.3 Interpolation 37
Chapter 5 Conclusions and Future Work 39
Fulltext
This electronic fulltext is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: fully open on and off campus (unrestricted)
Available:
On campus: available
Off campus: available


Printed copies
Availability information for printed copies is relatively complete from academic year 102 (2013-2014) onward. To inquire about printed copies from academic year 101 (2012-2013) or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
