Thesis/dissertation etd-0730114-173740: detailed record
Title page for etd-0730114-173740
論文名稱
Title
跨語言自動化情緒語音辨識
Cross-Lingual Automatic Speech Emotion Recognition
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
60
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2014-07-24
繳交日期
Date of Submission
2014-09-01
關鍵字
Keywords
語音情感辨識、跨語言、情緒語料庫建置、直方圖均衡化
Histogram Equalization, Building Speech Emotion Database, Cross-Lingual, Speech Emotion Recognition
統計
Statistics
本論文已被瀏覽 5681 次,被下載 1368 次。
The thesis/dissertation has been browsed 5681 times and downloaded 1368 times.
中文摘要
本論文採用一個基於聲學特徵參數搭配支持向量機的語音情緒辨識系統,實驗於公開語料庫EMO-DB上進行。在EMO-DB上的基準實驗可達85.2%辨識率;在降維研究中,我們透過動態特徵、特徵群、泛函與主成分分析,成功將特徵集從基準的6552個特徵降至37個特徵,並仍保有80.2%的辨識率。我們仿照EMO-DB自行錄製國語、台語及客家語三種台灣語言的情緒語料庫,用於跨語言、跨語者和跨語料庫的實驗。另外,我們採用直方圖均衡法進行跨語言的語者與語言正規化,並將降維過程中得到的特徵集應用於正規化實驗。正規化實驗中,EMO-DB在4368維特徵集下透過語者正規化可得到最佳的90.8%辨識率;在加入台灣語料進行混合語料實驗時,即使加入三種台灣語料,仍能透過語言與語者正規化保有89.9%的辨識率,因此正規化幾乎能抵銷跨語言造成的影響。我們也將混合語言的方式應用於台灣語料庫,透過混合語料的語者正規化來改善台灣情緒語料庫的辨識率:在同時以三種語言訓練並進行語者正規化的情況下,國語、台語和客家語的辨識率能從單一語言時的68.5%、50.7%和54.6%分別提升至79%、76.8%和72.8%。為了排除錄音通道差異,我們自行轉錄德語語料庫後再進行正規化實驗,結果優於原始資料,且混合語料正規化後可得到最佳的91.6%辨識率。為了更貼近現實環境,我們也進行了少量句數的正規化實驗,確認我們的方法在少量句數下仍能維持不錯的成效。
Abstract
In this thesis, we propose a speech emotion recognition system that combines acoustic features with a support vector machine. Our experiments are conducted on the well-known Berlin Database of Emotional Speech (EMO-DB). The baseline on EMO-DB reaches 85.2% accuracy. In our feature-reduction study, we reduce the feature set from 6552 features to 37 while retaining 80.2% accuracy, by pruning dynamic features, feature groups, and functionals and by applying principal component analysis. We also construct a Mandarin, Taiwanese, and Hakka database of emotional speech, modeled on EMO-DB in composition and size, and use it for cross-speaker, cross-lingual, and cross-corpus experiments. Moreover, we apply speaker and language normalization by histogram equalization, and we use the feature sets obtained in the feature-reduction procedure in the normalization experiments. Speaker normalization on EMO-DB with 4368 features yields 90.8% accuracy, and even after adding all of the Taiwan emotional speech data, speaker and language normalization still achieves 89.9%. This shows that our normalization can almost eliminate the effect of cross-lingual training. Similarly, we evaluate multi-lingual training on our Taiwan Emotion Speech Corpus, where normalization also improves performance: when models are trained on all three languages with speaker normalization, the accuracy for Mandarin, Taiwanese, and Hakka rises from 68.5%, 50.7%, and 54.6% to 79%, 76.8%, and 72.8%, respectively. To exclude channel differences, we re-record EMO-DB with our own equipment and repeat the normalization experiments; these results are better than those obtained from the original data, with a best accuracy of 91.6% when training on all four languages with normalization. Finally, to better match real-world conditions, we evaluate normalization with only a small number of utterances from the test speaker; our method maintains its performance with as few as five utterances.
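The abstract centers on two ideas: histogram equalization of utterance-level acoustic features for speaker/language normalization, and dimension reduction (down to 37 features via pruning and principal component analysis) ahead of an SVM classifier. Below is a minimal Python sketch of those two ideas using NumPy and scikit-learn; it is an illustration under assumptions, not the thesis's implementation. The 384-dimensional random feature matrices, the label layout, the linear kernel, and the `histogram_equalize` helper are stand-ins for the real feature extraction (a 6552-dimensional functional set in the thesis) and experimental setup.

```python
# Minimal sketch (not the author's exact pipeline): rank/quantile-based
# histogram equalization of utterance-level features, then PCA + SVM.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def histogram_equalize(features, reference):
    """Map each feature dimension of `features` onto the empirical
    distribution of `reference` via rank/quantile matching.

    features:  (n_utterances, n_features) array for one speaker or language
    reference: (m_utterances, n_features) array defining the target distribution
    """
    equalized = np.empty_like(features, dtype=float)
    for j in range(features.shape[1]):
        # empirical CDF position of each value within its own group
        ranks = np.argsort(np.argsort(features[:, j]))
        quantiles = (ranks + 0.5) / features.shape[0]
        # read off the corresponding quantiles of the reference distribution
        equalized[:, j] = np.quantile(reference[:, j], quantiles)
    return equalized


# --- toy usage with random stand-ins for real acoustic features ---
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 384))                 # functional features per utterance (placeholder size)
y_train = rng.integers(0, 7, size=200)                # 7 emotion classes, as in EMO-DB
X_test_speaker = rng.normal(loc=0.5, size=(20, 384))  # unseen speaker with a shifted distribution

# normalize the unseen speaker's features toward the training distribution
X_test_norm = histogram_equalize(X_test_speaker, X_train)

# PCA reduction plus SVM, mirroring the dimension-reduction idea in the abstract
clf = make_pipeline(StandardScaler(), PCA(n_components=37), SVC(kernel="linear"))
clf.fit(X_train, y_train)
print(clf.predict(X_test_norm))
```

In the thesis's setting, the reference distribution for equalization would presumably be built per speaker or per language from training data, and the feature counts (6552, 4368, 37) come from the actual functional feature extraction rather than the toy shapes used here.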
目次 Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Motivation
1.2 Literature Review
1.3 Thesis Organization
Chapter 2 Research Methods and Basic Framework
2.1 Baseline Feature Set
2.2 Basic Acoustic Features
2.3 Support Vector Machine
Chapter 3 Corpora and Normalization Methods
3.1 Corpus Overview
3.1.1 German Corpus
3.1.2 Mandarin, Taiwanese, and Hakka Corpus
3.2 Feature Normalization
3.2.1 Speaker Normalization
3.2.2 Language Normalization
3.2.3 Emotion-Independent Language Normalization
Chapter 4 Experiments
4.1 Experimental Setup
4.2 Baseline Experiments
4.3 Feature Selection
4.3.1 Static and Dynamic Features
4.3.2 Feature Groups
4.3.3 Functionals
4.3.4 Principal Component Analysis
4.4 Normalization Procedure
4.5 EMO-DB Experiments
4.5.1 Cross-Corpus Experiments
4.5.2 Mixed-Corpus Training
4.5.3 Speaker Normalization with Different Numbers of Utterances
4.5.4 Re-Recorded German Corpus Experiments
4.5.4.1 Recording through Loudspeaker and Microphone
4.5.4.2 Direct Software Capture
4.6 Taiwan Corpus Experiments
4.6.1 Cross-Corpus Normalization
4.6.2 Cross-Lingual Normalization within the Corpus
4.6.3 Mixed-Corpus Normalization for Mandarin
4.6.4 Mixed-Corpus Normalization for Taiwanese
4.6.5 Mixed-Corpus Normalization for Hakka
Chapter 5 Conclusion and Future Work
電子全文 Fulltext
This electronic full text is licensed only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.
論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To inquire about printed theses from academic year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
