Title page for etd-0830110-111455
Title
基於隱藏式馬可夫模型之語者相關情緒語音合成
A Hidden Markov Model-Based Approach for Emotional Speech Synthesis
Department
Year, semester
Language
Degree
Number of pages
47
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2010-07-28
Date of Submission
2010-08-30
Keywords (Chinese)
model combination, linear regression, model interpolation, Mahalanobis distance, emotional speech, hidden Markov model, speech synthesis
Keywords (English)
speech synthesis, HMM, emotional expressiveness, model combination, linear regression, model interpolation, Mahalanobis distance
Statistics
The thesis/dissertation has been browsed 5680 times and downloaded 0 times.
Chinese Abstract
In this thesis, we use hidden Markov models to develop two methods for synthesizing emotional speech from a target speaker's neutral speech.
In the first method, we synthesize the target speaker's emotional speech by interpolating between the target speaker's neutral model and an emotional model from a database. We propose the monophone-based Mahalanobis distance (MBMD) to select an appropriate model and to estimate the interpolation weights.
In the second method, we use linear regression to describe the difference between the neutral model and an emotional model, and combine the parameters obtained from training the regression with the target speaker's neutral model to achieve the desired effect.
In our experiments, we synthesize speech conveying anger, happiness, and sadness and conduct objective evaluations. The results show that our methods can effectively synthesize emotional speech for the target speaker.
Abstract
In this thesis, we describe two approaches that automatically synthesize emotional speech for a target speaker based on hidden Markov models of his/her neutral speech.
In the interpolation-based method, the basic idea is model interpolation between the neutral model of the target speaker and an emotional model selected from a candidate pool. Both the selection of the emotional model and the computation of the interpolation weight are based on a model-distance measure; for this we propose the monophone-based Mahalanobis distance (MBMD).
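This record does not spell out the MBMD formula or the weight computation, so the following LaTeX lines are only a plausible sketch: a Mahalanobis distance averaged over the states s of a pair of monophone HMMs, followed by a convex interpolation of the state means.

d\bigl(\lambda^{(n)}, \lambda^{(e)}\bigr) = \frac{1}{S} \sum_{s=1}^{S} \sqrt{\bigl(\mu_s^{(n)} - \mu_s^{(e)}\bigr)^{\top} \Sigma_s^{-1} \bigl(\mu_s^{(n)} - \mu_s^{(e)}\bigr)}

\hat{\mu}_s = (1 - w)\,\mu_s^{(n)} + w\,\mu_s^{(e)}, \qquad 0 \le w \le 1

Here \mu_s^{(n)} and \mu_s^{(e)} are the neutral and emotional state means and \Sigma_s is a state covariance (which model it is taken from is an assumption here); the candidate emotional model with the smallest distance would be selected, with the weight w derived from that distance.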
In the parallel model combination (PMC) based method, the basic idea is to model the mismatch between the neutral model and the emotional model. We train a linear regression model to describe this mismatch and then combine the target speaker's neutral model with the trained regression model.
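A minimal code sketch of this combination step, assuming the state mean vectors of the neutral and emotional models are stacked into matrices and the mismatch is fitted per feature dimension by ordinary least squares; all names and the toy data below are illustrative assumptions, not the thesis's actual implementation.

import numpy as np

def fit_mismatch(neutral_means, emotional_means):
    """Fit one linear regression per feature dimension that maps
    neutral HMM state means to the corresponding emotional means."""
    slopes, intercepts = [], []
    for d in range(neutral_means.shape[1]):
        # np.polyfit with deg=1 returns (slope, intercept) of a least-squares line.
        a, b = np.polyfit(neutral_means[:, d], emotional_means[:, d], deg=1)
        slopes.append(a)
        intercepts.append(b)
    return np.array(slopes), np.array(intercepts)

def combine(target_neutral_means, slopes, intercepts):
    """Apply the learned neutral-to-emotional mismatch to a target
    speaker's neutral model, yielding a pseudo-emotional model."""
    return target_neutral_means * slopes + intercepts

# Toy example: 10 HMM states with 3-dimensional mean vectors.
rng = np.random.default_rng(0)
neutral = rng.normal(size=(10, 3))
emotional = 1.2 * neutral + 0.5 + rng.normal(scale=0.01, size=(10, 3))
slopes, intercepts = fit_mismatch(neutral, emotional)
target_emotional = combine(rng.normal(size=(10, 3)), slopes, intercepts)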
We evaluate our approaches on synthesized emotional speech conveying anger, happiness, and sadness with several subjective tests. Experimental results show that the implemented system is able to synthesize speech with the emotional expressiveness of the target speaker.
Table of Contents
List of Tables iii
List of Figures iv
Acknowledgments vi
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Organization of the Thesis 3
Chapter 2 Hidden Markov Model-Based Speech Synthesis System 4
2.1 HMM-Based Speech Synthesis 4
2.2 HMM-Based Mandarin Speech Synthesis System 7
2.2.1 Segmental Tonal Modeling 7
2.2.2 Question Set 8
2.2.3 Context-Dependent Label Format 9
Chapter 3 The Proposed Algorithm 11
3.1 Interpolation Based 11
3.1.1 Model Interpolation 11
3.1.2 The Proposed Algorithm 13
3.2 PMC Based 17
3.2.1 Model of the Environment 17
3.2.2 The Proposed Algorithm 18
Chapter 4 Experimental Results 22
4.1 Test Data 22
4.2 Experiments on the Interpolation-Based Method 23
4.2.1 Emotional Expressiveness Test 23
4.2.2 Naturalness Test 23
4.2.3 Similarity Test 24
4.2.4 Comparison with Naive Interpolation 25
4.3 Experiments on the PMC-Based Method 27
4.3.1 Linear Regression Model 27
4.3.2 Emotional Expressiveness Test 27
4.3.3 Naturalness Test 27
4.3.4 Similarity Test 33
4.4 Paired Comparison Test 33
Chapter 5 Conclusion and Future Work 35
5.1 Conclusion 35
5.2 Future Work 35
Fulltext
This electronic full text is licensed only for individual, non-profit retrieval, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: not available on or off campus
Available:
Campus: never available
Off-campus: never available

Printed copies
Information on public access to printed copies is relatively complete from academic year 102 (ROC calendar) onward. To check access information for printed copies from academic year 101 or earlier, please contact the printed thesis service desk of the Library and Information Office. We apologize for any inconvenience.
Availability: publicly available