Thesis/dissertation etd-0730114-173740: detailed record
Title page for etd-0730114-173740
論文名稱
Title
跨語言自動化情緒語音辨識
Cross-Lingual Automatic Speech Emotion Recognition
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
60
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2014-07-24
繳交日期
Date of Submission
2014-09-01
關鍵字
Keywords
語音情感辨識、跨語言、情緒語料庫建置、直方圖均衡化
Histogram Equalization, Building Speech Emotion Database, Cross-Lingual, Speech Emotion Recognition
統計
Statistics
本論文已被瀏覽 5681 次,被下載 1368 次。
The thesis/dissertation has been browsed 5681 times and downloaded 1368 times.
中文摘要
本論文採用一個基於聲學特徵參數搭配支持向量機的語音情緒辨識系統,實驗於公開語料庫EMO-DB上進行。在EMO-DB上的基準實驗可達85.2%辨識率;在降維研究中,我們透過動態特徵、特徵群、泛函與主成分分析,成功將特徵集從基準的6552個特徵降至37個特徵,並仍保有80.2%的辨識率。我們仿照EMO-DB自行錄製國語、台語及客家語三種台灣語言的情緒語料庫,用於跨語言、跨語者和跨語料庫的實驗。另外,我們採用直方圖均衡法進行跨語言的語者與語言正規化,並將降維過程中得到的特徵集應用於正規化實驗。正規化實驗中,EMO-DB在4368維特徵集下透過語者正規化可得到最佳的90.8%辨識率;在加入台灣語料進行混合語料實驗時,即使加入三種台灣語料,仍能透過語言與語者正規化保有89.9%的辨識率,因此正規化幾乎能抵銷跨語言造成的影響。我們也將混合語言的方式應用於台灣語料庫,透過混合語料的語者正規化來改善台灣情緒語料庫的辨識率:在同時以三種語言訓練並進行語者正規化的情況下,國語、台語和客家語的辨識率能從單一語言時的68.5%、50.7%和54.6%分別提升至79%、76.8%和72.8%。為了排除錄音通道差異,我們自行轉錄德語語料庫後再進行正規化實驗,結果優於原始資料,且混合語料正規化後可得到最佳的91.6%辨識率。為了更貼近現實環境,我們也進行了少量句數的正規化實驗,確認我們的方法在少量句數下仍能維持不錯的成效。
Abstract
In this thesis, we propose a speech emotion recognition system that combines acoustic features with a support vector machine. Our experiments are conducted on the well-known Berlin Database of Emotional Speech (EMO-DB). The baseline on EMO-DB reaches 85.2% accuracy. In our feature-reduction study, we reduce the feature set from 6552 features to 37 while retaining 80.2% accuracy, by pruning dynamic features, feature groups, and functionals and by applying principal component analysis. We also construct a Mandarin, Taiwanese, and Hakka database of emotional speech, modeled on EMO-DB in composition and size, and use it for cross-speaker, cross-lingual, and cross-corpus experiments. Moreover, we apply speaker and language normalization by histogram equalization, and we use the feature sets obtained in the feature-reduction procedure in the normalization experiments. Speaker normalization on EMO-DB with 4368 features yields 90.8% accuracy, and even after adding all of the Taiwan emotional speech data, speaker and language normalization still achieves 89.9%. This shows that our normalization can almost eliminate the effect of cross-lingual training. Similarly, we evaluate multi-lingual training on our Taiwan Emotion Speech Corpus, where normalization also improves performance: when models are trained on all three languages with speaker normalization, the accuracy for Mandarin, Taiwanese, and Hakka rises from 68.5%, 50.7%, and 54.6% to 79%, 76.8%, and 72.8%, respectively. To exclude channel differences, we re-record EMO-DB with our own equipment and repeat the normalization experiments; these results are better than those obtained from the original data, with a best accuracy of 91.6% when training on all four languages with normalization. Finally, to better match real-world conditions, we evaluate normalization with only a small number of utterances from the test speaker; our method maintains its performance with as few as five utterances.
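The abstract centers on two ideas: histogram equalization of utterance-level acoustic features for speaker/language normalization, and dimension reduction (down to 37 features via pruning and principal component analysis) ahead of an SVM classifier. Below is a minimal Python sketch of those two ideas using NumPy and scikit-learn; it is an illustration under assumptions, not the thesis's implementation. The 384-dimensional random feature matrices, the label layout, the linear kernel, and the `histogram_equalize` helper are stand-ins for the real feature extraction (a 6552-dimensional functional set in the thesis) and experimental setup.

```python
# Minimal sketch (not the author's exact pipeline): rank/quantile-based
# histogram equalization of utterance-level features, then PCA + SVM.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def histogram_equalize(features, reference):
    """Map each feature dimension of `features` onto the empirical
    distribution of `reference` via rank/quantile matching.

    features:  (n_utterances, n_features) array for one speaker or language
    reference: (m_utterances, n_features) array defining the target distribution
    """
    equalized = np.empty_like(features, dtype=float)
    for j in range(features.shape[1]):
        # empirical CDF position of each value within its own group
        ranks = np.argsort(np.argsort(features[:, j]))
        quantiles = (ranks + 0.5) / features.shape[0]
        # read off the corresponding quantiles of the reference distribution
        equalized[:, j] = np.quantile(reference[:, j], quantiles)
    return equalized


# --- toy usage with random stand-ins for real acoustic features ---
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 384))                 # functional features per utterance (placeholder size)
y_train = rng.integers(0, 7, size=200)                # 7 emotion classes, as in EMO-DB
X_test_speaker = rng.normal(loc=0.5, size=(20, 384))  # unseen speaker with a shifted distribution

# normalize the unseen speaker's features toward the training distribution
X_test_norm = histogram_equalize(X_test_speaker, X_train)

# PCA reduction plus SVM, mirroring the dimension-reduction idea in the abstract
clf = make_pipeline(StandardScaler(), PCA(n_components=37), SVC(kernel="linear"))
clf.fit(X_train, y_train)
print(clf.predict(X_test_norm))
```

In the thesis's setting, the reference distribution for equalization would presumably be built per speaker or per language from training data, and the feature counts (6552, 4368, 37) come from the actual functional feature extraction rather than the toy shapes used here.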
目次 Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1.1 Research Motivation
1.2 Literature Review
1.3 Thesis Organization
Chapter 2 Research Methods and Basic Framework
2.1 Baseline Feature Set
2.2 Basic Acoustic Features
2.3 Support Vector Machine
Chapter 3 Corpora and Normalization Methods
3.1 Corpus Overview
3.1.1 German Corpus
3.1.2 Mandarin, Taiwanese, and Hakka Corpus
3.2 Feature Normalization
3.2.1 Speaker Normalization
3.2.2 Language Normalization
3.2.3 Emotion-Independent Language Normalization
Chapter 4 Experiments
4.1 Experimental Setup
4.2 Baseline Experiments
4.3 Feature Selection
4.3.1 Static and Dynamic Features
4.3.2 Feature Groups
4.3.3 Functionals
4.3.4 Principal Component Analysis
4.4 Normalization Procedure
4.5 EMO-DB Experiments
4.5.1 Cross-Corpus Experiments
4.5.2 Mixed-Corpus Training
4.5.3 Speaker Normalization with Different Numbers of Utterances
4.5.4 Re-Recorded German Corpus Experiments
4.5.4.1 Recording through Loudspeaker and Microphone
4.5.4.2 Direct Software Capture
4.6 Taiwan Corpus Experiments
4.6.1 Cross-Corpus Normalization
4.6.2 Cross-Lingual Normalization within the Corpus
4.6.3 Mixed-Corpus Normalization for Mandarin
4.6.4 Mixed-Corpus Normalization for Taiwanese
4.6.5 Mixed-Corpus Normalization for Hakka
Chapter 5 Conclusion and Future Work
電子全文 Fulltext
This electronic full text is licensed only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.
論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To inquire about printed theses from academic year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
