Title page for etd-0809116-110726
論文名稱
Title
應用因素分析與識別向量於語音情緒辨識
Speech Emotion Recognition Using Factor Analysis and Identity Vectors
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
58
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2016-07-29
繳交日期
Date of Submission
2016-09-09
關鍵字
Keywords
識別向量、語音情緒辨識、高斯混合模型、支持向量機、聲道長度正規化
i-vector, Speech Emotion Recognition, Vocal Tract Length Normalization, Gaussian Mixture Model, Support Vector Machine
統計
Statistics
本論文已被瀏覽 5652 次,被下載 1906 次。
The thesis/dissertation has been browsed 5652 times, has been downloaded 1906 times.
中文摘要
本論文對於INTERSPEECH 2009 Emotion Challenge進行五類情緒的子挑戰,實驗於FAU Aibo愛寶語料庫,我們使用OpenSMILE擷取官方設定的基本特徵,並將語者辨別(Speaker Identification)以及語者驗證(Speaker Verification)上常用的高斯混合模型(Gaussian Mixture Model, GMM)系統,引用至語音情緒辨識的系統下,包含基本的GMM系統、改善訓練資料不足的GMM-UBM系統、使用超級向量(super-vectors)做為輸入特徵的GMM-SVM系統、對於語音特徵進行分析的識別向量(Identity Vector or i-vector)系統。

在動態模型部分,我們分別於GMM以及GMM-UBM系統得到最好的39.2%(UA)以及39.3%(UA)結果,相對於基準實驗,我們提高了3%的辨識率。而在靜態模型的部分,我們先行使用SMOTE以及Under-sampling來解決語料庫不平衡的問題,之後在GMM-SVM以及識別向量系統上,我們分別在使用SMOTE後得到了38.9%(UA)以及40.5%(UA)的辨識率,相較於IS 2009情緒挑戰裡的基準實驗,有了0.7%以及2.3%的提升。
本論文同時也證實,應用在語者辨別的系統,也可以將其應用至語音情緒辨識的系統裡,對於結果方面也有提升的效果。
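The basic GMM system named in the abstract lends itself to a short illustration. The following is a minimal sketch under our own assumptions (scikit-learn's GaussianMixture on frame-level features; not the thesis implementation): one mixture is trained per emotion class with EM, and a test utterance is assigned to the class whose mixture yields the highest average frame log-likelihood.

```python
# Minimal sketch of a per-emotion GMM classifier: one Gaussian mixture per
# emotion class, decided by the highest average frame log-likelihood.
# scikit-learn and the toy frames below are our own assumptions; the thesis
# uses OpenSMILE features from the FAU Aibo corpus.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_per_emotion(frames_by_emotion, n_components=32):
    """frames_by_emotion: dict mapping emotion label -> (n_frames, n_dims) array."""
    models = {}
    for emotion, frames in frames_by_emotion.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              max_iter=200, random_state=0)
        gmm.fit(frames)                         # EM training on all frames of this class
        models[emotion] = gmm
    return models

def classify_utterance(models, utterance_frames):
    """Return the emotion whose GMM gives the highest mean frame log-likelihood."""
    return max(models, key=lambda e: models[e].score(utterance_frames))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in frames for two emotions (39-dimensional, like MFCCs plus deltas).
    models = train_gmm_per_emotion({
        "anger":   rng.normal(0.0, 1.0, size=(500, 39)),
        "neutral": rng.normal(1.0, 1.0, size=(500, 39)),
    }, n_components=4)
    print(classify_utterance(models, rng.normal(0.9, 1.0, size=(80, 39))))
```

Diagonal covariances keep the per-class models cheap to train on frame-level features, which mirrors the usual GMM baseline setup.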
Abstract
In this thesis, we address the five-class open performance sub-challenge of the INTERSPEECH 2009 Emotion Challenge. Our experiments are evaluated on the well-known FAU Aibo database. We use the OpenSMILE toolkit to extract the low-level descriptors of the official feature set and to compute their delta coefficients. The Gaussian mixture model (GMM) is a popular approach in speaker identification and speaker verification, and we apply GMM-based systems to speech emotion recognition. Four systems are studied: a basic GMM system; a GMM-UBM system, which mitigates the shortage of training data; a GMM-SVM system, which uses GMM super-vectors as new input features; and an identity vector (i-vector) system, which applies factor analysis (FA) to the GMM super-vectors.
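The GMM-SVM stage builds on GMM super-vectors. As an illustrative sketch of that idea (our own assumption, using scikit-learn and Reynolds-style relevance MAP adaptation of the UBM means; the thesis may differ in detail), a super-vector for one utterance can be formed by adapting the universal background model's component means toward the utterance's frames and concatenating them:

```python
# Illustrative sketch (an assumption, not the thesis code) of a GMM super-vector:
# MAP-adapt the means of a universal background model (UBM) toward one
# utterance's frames (Reynolds-style relevance MAP) and stack the adapted
# means into a single long vector that can be fed to an SVM.
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapted_supervector(ubm: GaussianMixture, frames: np.ndarray,
                            relevance: float = 16.0) -> np.ndarray:
    """frames: (n_frames, n_dims) acoustic features of one utterance."""
    gamma = ubm.predict_proba(frames)              # (n_frames, n_components) responsibilities
    n_c = gamma.sum(axis=0)                        # soft frame counts per component
    first = gamma.T @ frames / np.maximum(n_c, 1e-10)[:, None]   # normalized first-order stats
    alpha = (n_c / (n_c + relevance))[:, None]     # relevance-MAP adaptation weights
    adapted_means = alpha * first + (1.0 - alpha) * ubm.means_
    return adapted_means.ravel()                   # (n_components * n_dims,) super-vector
```

The fixed-length super-vectors of the training and test utterances can then be classified with an SVM (e.g. sklearn.svm.SVC); the i-vector system goes one step further and models the super-vector with a low-dimensional total-variability factor analysis.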

With the dynamic modeling classifiers, we achieve an unweighted average (UA) recall of 39.2% with the GMM system and 39.3% with the GMM-UBM system, compared with a baseline of 35.5%. With the static modeling classifiers, we first use SMOTE and under-sampling to address the imbalance of the data, and then achieve a UA recall of 38.9% with the GMM-SVM system and 40.5% with the i-vector system, compared with a baseline of 38.2%. This thesis also confirms that systems developed for speaker recognition can be applied to speech emotion recognition and can improve its recognition accuracy.
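Two evaluation details above lend themselves to a short sketch: rebalancing the training data with SMOTE and reporting unweighted average (UA) recall, which is simply recall averaged over the classes regardless of their sizes. The following is a minimal, self-contained illustration with synthetic five-class data, assuming the imbalanced-learn and scikit-learn packages (not the thesis code or features):

```python
# Minimal, self-contained sketch: rebalance the training data with SMOTE and
# score with unweighted average (UA) recall, i.e. recall averaged over the
# classes regardless of their sizes. Synthetic data stands in for the real features.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=40, n_informative=20,
                           n_classes=5, weights=[0.5, 0.2, 0.15, 0.1, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority classes on the training partition only.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

y_pred = SVC(kernel="linear").fit(X_bal, y_bal).predict(X_test)

# UA recall = macro-averaged recall.
print("UA recall:", recall_score(y_test, y_pred, average="macro"))
```

Under-sampling, the alternative mentioned above, follows the same fit_resample pattern with imblearn.under_sampling.RandomUnderSampler.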
目次 Table of Contents
Thesis approval form i
Acknowledgments ii
Abstract (in Chinese) iii
ABSTRACT iv
List of Tables vii
List of Figures viii
Chapter 1 Introduction 1
1.1 Research motivation and background 1
1.2 Literature review 2
1.3 Thesis organization 5
Chapter 2 Basic Framework 6
2.1 Baseline feature set 6
2.1.1 Basic acoustic features 7
2.2 Gaussian mixture model 9
2.2.1 Single Gaussian probability density function 9
2.2.2 Parameter estimation for Gaussian mixture models 9
2.3 Support vector machine 11
Chapter 3 System Architecture 15
3.1 GMM system 15
3.2 GMM-UBM system 16
3.2.1 Universal background model 17
3.2.2 Maximum a posteriori estimation 18
3.3 GMM-SVM system 20
3.3.1 GMM super-vectors 20
3.4 Identity vector (i-vector) system 21
3.4.1 i-vector 23
Chapter 4 Corpus Description and Experimental Results 25
4.1 FAU Aibo corpus 25
4.2 Experimental setup 27
4.3 Methods for handling data imbalance 27
4.4 Baseline experiments 30
4.5 Experimental results of the GMM and GMM-UBM systems 32
4.6 Experimental results of the GMM-SVM and i-vector systems 34
4.7 VTLN experiments 36
4.7.1 Vocal tract length normalization 36
4.7.2 VTLN experimental results 38
Chapter 5 Conclusions and Future Work 42
References 43
電子全文 Fulltext
This electronic full text is licensed only for personal, non-commercial searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Availability information for printed copies is relatively complete only from academic year 102 (2013–2014) onward. To inquire about the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
