Title page for etd-0731116-160031
Title
Skewness-Robust Neural Networks with Application to Speech Emotion Recognition
Department
Year, semester
Language
Degree
Number of pages
65
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2016-07-29
Date of Submission
2016-08-31
Keywords
Neural Network, Skewed Data, Class Imbalance, Histogram Equalization, Importance Weighting, Speech Emotion Recognition
Statistics
The thesis/dissertation has been browsed 5677 times and downloaded 1021 times.
Chinese Abstract
This thesis implements a speech emotion recognition system and evaluates it on the FAU-Aibo emotion corpus; the back end adopts a neural-network architecture, and the results are compared with those of support vector machines. Following the Classifier Sub-Challenge of the INTERSPEECH 2009 Emotion Challenge, our goal is to obtain the best recognition rate. In view of the class-imbalance problem of the FAU-Aibo corpus, we propose a skewness-robust method for neural networks based on the idea of importance weighting: we compute the error with the cross-entropy function, assign a correction term to each training sample, and adjust the importance of each class during the back-propagation algorithm, thereby addressing the data imbalance during training. For features, we use the baseline feature set defined by the challenge and adopt histogram equalization for speaker normalization, removing inter-speaker differences while retaining emotional variation. In the experiments, we apply sampling techniques commonly used in statistics to handle the imbalanced classification problem at the data level and compare them with our proposed method. In the data-balancing experiments, we combine random under-sampling with the synthetic minority over-sampling technique (SMOTE) to balance the data across classes and evaluate the results. To remedy the excessive loss of majority-class data caused by random under-sampling, we follow the idea of model combination: the training set is divided into several balanced subsets, each used to train a sub-classifier, and an ensemble of these sub-classifiers improves the recognition rate, achieving a best result of 45.0%. We further use the posterior probabilities output by the ensemble for feature representation learning, reducing the feature dimension from 384 to 40 while maintaining good recognition results. Our proposed skewness-robust method, applied to the raw data without any sampling, achieves a best recognition rate of 45.8% and an average of 45.3%. These are the best results to date for static models on FAU-Aibo.
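The correction term described above is, in essence, an importance weight on each training sample's cross-entropy error. A minimal NumPy sketch of the idea follows, assuming softmax outputs and inverse-frequency class weights; the function names and the exact weighting scheme are illustrative, not the thesis's own code.

```python
import numpy as np

def class_weights(y, n_classes):
    # Importance weight per class; inverse class frequency is one
    # natural choice (illustrative, not necessarily the exact
    # correction term used in the thesis).
    counts = np.bincount(y, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * counts)   # rare classes get w > 1

def weighted_xent_grad(probs, y, w):
    # Gradient of the class-weighted cross-entropy w.r.t. the softmax
    # pre-activations: the usual (p - onehot) term per sample, scaled
    # by the importance weight of that sample's class. This scaled
    # gradient is what back-propagation then pushes through the net.
    grad = probs.copy()
    grad[np.arange(len(y)), y] -= 1.0            # p - onehot(y)
    grad *= w[y][:, None]                        # per-sample class weight
    return grad / len(y)
```

Because minority-class samples carry larger weights, their errors contribute proportionally more to every parameter update, which is how the network resists the skew of the training data.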
Abstract
In this thesis, we propose a speech emotion recognition system for the well-known FAU-Aibo database using neural networks, and compare the results with those of support vector machines. Following the Classifier Sub-Challenge of the INTERSPEECH 2009 Emotion Challenge, our goal is to achieve state-of-the-art performance on this database. Because FAU-Aibo is heavily class-imbalanced, we propose a skewness-robust neural network method based on the importance-weighting technique: the cross-entropy objective function is weighted per class, so that the gradients computed by the back-propagation algorithm give each class its proper importance. For features, we use the 384 features defined by the challenge and apply histogram equalization to reduce the part of the data variance caused by speaker variation. In our experiments, we apply sampling methods commonly used in statistics to the data and compare them with our proposed method. In the data-balancing experiments, we combine SMOTE with random under-sampling to balance the data across classes and evaluate the results. Because random under-sampling discards a large portion of the majority-class data, we also apply ensemble learning based on the idea of model combination, obtaining a best result of 45.0% by averaging the posteriors that the sub-classifiers assign to the test data. We further use the ensemble's output probabilities for feature representation learning, reducing the feature dimension from 384 to 40 while preserving good results. Our proposed method, trained on the raw imbalanced data without any sampling, achieves a best result of 45.8% and an average of 45.3%. This is the best result achieved by a static modeling framework on FAU-Aibo.
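The histogram-equalization step maps each feature's empirical distribution within one speaker's data onto a common reference distribution. A sketch follows, assuming a standard-normal reference (a common choice; the reference distribution used in the thesis may differ):

```python
import numpy as np
from scipy.stats import norm

def heq_speaker(X):
    # Histogram-equalize one speaker's feature matrix (utterances x
    # features): each value is replaced by the standard-normal
    # quantile of its empirical rank. Distribution shape specific to
    # the speaker is removed, while the ordering of values (which
    # ideally carries the emotional variation) is preserved.
    n, d = X.shape
    Z = np.empty((n, d))
    for j in range(d):
        ranks = np.argsort(np.argsort(X[:, j]))  # 0 .. n-1
        Z[:, j] = norm.ppf((ranks + 0.5) / n)    # empirical CDF -> quantile
    return Z
```

Applied per speaker, this removes speaker-dependent offsets, scalings, and skew beyond what plain mean/variance normalization can.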
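The data-balancing experiments combine SMOTE over-sampling of the minority classes with random under-sampling of the majority classes. Below is a sketch using the imbalanced-learn package as a convenient stand-in for whatever tooling the thesis used; `target_per_class` is an illustrative parameter:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def balance(X, y, target_per_class):
    # Bring every class to roughly target_per_class examples:
    # SMOTE synthesizes minority samples by interpolating between
    # nearest neighbours; random under-sampling then trims the
    # majority classes down to the same size.
    counts = np.bincount(y)
    over = {c: target_per_class for c, n in enumerate(counts)
            if n < target_per_class}
    under = {c: target_per_class for c, n in enumerate(counts)
             if n > target_per_class}
    if over:
        X, y = SMOTE(sampling_strategy=over).fit_resample(X, y)
    if under:
        X, y = RandomUnderSampler(sampling_strategy=under).fit_resample(X, y)
    return X, y
```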
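The ensemble trains one sub-classifier per balanced subset of the skewed training set, averages the sub-classifiers' posteriors at test time, and reuses the concatenated posteriors as a compact feature representation. A sketch assuming scikit-learn-style models with a `predict_proba` method; the subset construction and the reading of the 40 dimensions as, e.g., 8 sub-classifiers x 5 FAU-Aibo classes are assumptions:

```python
import numpy as np

def train_ensemble(balanced_subsets, make_model):
    # One sub-classifier per balanced subset (each subset assumed to
    # keep all minority-class data plus one slice of the majority
    # classes, so little majority data is wasted overall).
    return [make_model().fit(Xs, ys) for Xs, ys in balanced_subsets]

def predict_ensemble(models, X):
    # Average the sub-classifiers' class posteriors, then arg-max.
    P = np.mean([m.predict_proba(X) for m in models], axis=0)
    return P.argmax(axis=1)

def posterior_features(models, X):
    # Concatenate the sub-classifiers' posteriors as a low-dimensional
    # representation; e.g. 8 models x 5 classes = 40 features in place
    # of the original 384 (the exact construction is an assumption).
    return np.hstack([m.predict_proba(X) for m in models])
```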
Table of Contents
Thesis Approval
Acknowledgments
Chinese Abstract
Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Background and Motivation
1.2 Literature Review
1.3 Thesis Organization
Chapter 2 Basic Framework and Tools
2.1 Baseline Feature Set
2.2 Support Vector Machine
2.3 Multilayer Perceptron
2.4 Tools: Theano
Chapter 3 Corpus and Methods
3.1 Corpus Description
3.2 Speaker Normalization
3.3 Data Balancing
3.3.1 Sampling Techniques
3.3.2 Model Combination
3.4 Skewness-Robust Neural Networks
3.4.1 Balanced Mini-Batch Training
3.4.2 Importance Weighting Based on Support Vector Machines
3.4.3 Skewness-Robust Neural Networks
Chapter 4 Experiments
4.1 Experimental Setup
4.1.1 Support Vector Machine
4.1.2 Multilayer Perceptron
4.2 Baseline Experiments and Speaker Normalization
4.3 Data Balancing Experiments
4.3.1 Combining Over-Sampling and Under-Sampling
4.3.2 Balanced Neural Network Training
4.4 Combining Sampling Methods and Model Combination
4.5 Importance Weighting and Skewness-Robustness Experiments
4.6 Feature Representation Learning Based on Ensemble Classifiers
4.7 Evaluation of Different Speaker Normalization Methods
Chapter 5 Conclusions and Future Work
References
[1] B. Schuller, S. Steidl, and A. Batliner, “The INTERSPEECH 2009 emotion challenge,” in Proceedings of INTERSPEECH, pp. 312–315, 2009.
[2] LISA lab, “Theano 0.7 documentation,” available at: http://deeplearning.net/software/theano/index.html, 2008–2016.
[3] R. Plutchik, “The nature of emotions: human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice,” American Scientist, vol. 89, no. 4, pp. 344–350, 2001.
[4] C. E. Osgood, “The nature and measurement of meaning,” Psychological Bulletin, vol. 49, no. 3, p. 197, 1952.
[5] R. Van Bezooijen, S. A. Otto, and T. A. Heenan, “Recognition of vocal expressions of emotion: a three-nation study to identify universal characteristics,” Journal of Cross-Cultural Psychology, vol. 14, no. 4, pp. 387–406, 1983.
[6] B.-C. Chiou and C.-P. Chen, “Feature space dimension reduction in speech emotion recognition using support vector machine,” in Proceedings of Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013.
[7] X. Cheng and Q. Duan, “Speech emotion recognition using Gaussian mixture model,” in Proceedings of the International Conference on Computer Application and System Modeling, 2012.
[8] F. Shah et al., “Automatic emotion recognition from speech using artificial neural networks with gender-dependent databases,” in Proceedings of International Conference on Advances in Computing, Control, and Telecommunication Technologies, pp. 162–164, 2009.
[9] A. Metallinou, A. Katsamanis, and S. Narayanan, “A hierarchical framework for modeling multimodality and emotional evolution in affective dialogs,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2401–2404, 2012.
[10] H. Hu, M.-X. Xu, and W. Wu, “GMM supervector based SVM with spectral features for speech emotion recognition,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 413–416, 2007.
[11] N. Kamaruddin and A. Wahab, “Emulating human cognitive approach for speech emotion using MLP and GenSoFNN,” in Proceedings of IEEE International Conference on Information and Communication Technology for the Muslim World (ICT4M), pp. 1–5, 2013.
[12] V. Sethu, E. Ambikairajah, and J. Epps, “Speaker normalisation for speech-based emotion detection,” in Proceedings of International Conference on Digital Signal Processing, pp. 611–614, 2007.
[13] B.-C. Chiou, “Cross-lingual automatic speech emotion recognition,” Master’s thesis, National Sun Yat-sen University, 2014.
[14] M. El Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: features, classification schemes, and databases,” Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011.
[15] S. Wu, T. H. Falk, and W.-Y. Chan, “Automatic speech emotion recognition using modulation spectral features,” Speech Communication, vol. 53, no. 5, pp. 768–785, 2011.
[16] A. Shahzadi, A. Ahmadyfard, A. Harimi, and K. Yaghmaie, “Speech emotion recognition using nonlinear dynamics features,” Turkish Journal of Electrical Engineering & Computer Sciences, vol. 23, no. Sup. 1, pp. 2056–2073, 2013.
[17] P. Henríquez, J. B. Alonso, M. A. Ferrer, C. M. Travieso, and J. R. Orozco-Arroyave, “Nonlinear dynamics characterization of emotional speech,” Neurocomputing, vol. 132, pp. 126–135, 2014.
[18] B. Schuller, G. Rigoll, and M. Lang, “Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. I–577, 2004.
[19] X. H. Le, G. Quénot, and E. Castelli, “Recognizing emotions for the audio-visual document indexing,” in Proceedings of the International Symposium on Computers and Communications, vol. 2, pp. 580–584, 2004.
[20] S. Steidl, Automatic Classification of Emotion-Related User States in Spontaneous Children’s Speech. PhD thesis, University of Erlangen-Nuremberg, Erlangen, Germany, 2009.
[21] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[22] B. Schuller, A. Batliner, S. Steidl, and D. Seppi, “Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge,” Speech Communication, vol. 53, no. 9, pp. 1062–1087, 2011.
[23] C.-C. Lee, E. Mower, C. Busso, S. Lee, and S. Narayanan, “Emotion recognition using a hierarchical binary decision tree approach,” in Proceedings of INTERSPEECH, 2009.
[24] M. Kockmann, L. Burget, and J. Černocký, “Brno University of Technology system for INTERSPEECH 2009 Emotion Challenge,” in Proceedings of INTERSPEECH, 2009.
[25] A. Rosenberg, “Classifying skewed data: importance weighting to optimize average recall,” in Proceedings of INTERSPEECH, pp. 2242–2245, 2012.
[26] Y. Attabi and P. Dumouchel, “Anchor models for emotion recognition from speech,” IEEE Transactions on Affective Computing, vol. 4, no. 3, pp. 280–290, 2013.
[27] Z. Zha, Yang, and Zhao, “Spontaneous speech emotion recognition via multiple kernel learning,” in Proceedings of the International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), 2016.
[28] D. Le and E. M. Provost, “Emotion recognition from spontaneous speech using hidden Markov models with deep belief networks,” in Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 216–221, 2013.
[29] J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982.
[30] NumPy developers, “NumPy,” available at: http://www.numpy.org/, 2006–2016.
[31] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio, “Theano: new features and speed improvements,” in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2012.
[32] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch, a scientific computing framework for LuaJIT,” available at: http://torch.ch/, 2002–2016.
[33] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: a MATLAB-like environment for machine learning,” in NIPS BigLearn Workshop, 2011.
[34] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: a CPU and GPU math expression compiler,” in Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.
[35] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, “A database of German emotional speech,” in Proceedings of INTERSPEECH, vol. 5, pp. 1517–1520, 2005.
[36] A. Estabrooks, T. Jo, and N. Japkowicz, “A multiple resampling method for learning from imbalanced data sets,” Computational Intelligence, vol. 20, no. 1, pp. 18–36, 2004.
[37] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[38] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning Workshop, 2015.
[39] F. Eyben, M. Wöllmer, and B. Schuller, “openSMILE: the Munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM International Conference on Multimedia, pp. 1459–1462, 2010.
[40] J. C. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods: Support Vector Learning, pp. 185–208, MIT Press, 1999.
[41] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” CoRR, arXiv:1207.0580, 2012.
[42] D. Le and E. M. Provost, “Data selection for acoustic emotion recognition: analyzing and comparing utterance and sub-utterance selection strategies,” in Proceedings of International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 146–152, 2015.
Fulltext
This electronic fulltext is licensed to users for personal, non-profit retrieval, reading, and printing for academic research purposes only. Please observe the relevant provisions of the Copyright Act of the Republic of China (Taiwan): do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.
Thesis access permission: unrestricted (fully open on and off campus)
Available:
Campus: available
Off-campus: available


Printed copies
Availability information for printed copies is relatively complete from ROC academic year 102 (2013–2014) onward. For printed copies from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
