論文使用權限 Thesis access permission: 自定論文開放時間 user-defined availability
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available
論文名稱 Title: 運用拉丁方陣評估模型內插與模型調適兩種情緒語音合成方法 Using Latin Square Design To Evaluate Model Interpolation And Adaptation Based Emotional Speech Synthesis
系所名稱 Department:
畢業學年期 Year, semester:
語文別 Language:
學位類別 Degree:
頁數 Number of pages: 49
研究生 Author:
指導教授 Advisor:
召集委員 Convenor:
口試委員 Advisory Committee:
口試日期 Date of Exam: 2012-06-29
繳交日期 Date of Submission: 2012-07-19
關鍵字 Keywords: model interpolation, Latin-square design, hidden Markov model, model adaptation, emotional speech synthesis, Mahalanobis distance
統計 Statistics: The thesis has been browsed 5714 times and downloaded 1385 times.
中文摘要 Chinese Abstract
In this thesis, we exploit the ability of hidden Markov models (HMMs) to synthesize speech of reasonable quality from a small amount of training data in order to implement a Mandarin speech synthesis system, and we further use the model's flexible parametric representation of speech to synthesize emotional speech. We apply model interpolation and model adaptation to turn a target speaker's neutral speech into speech carrying a specific emotion. In the model interpolation approach, a monophone-based Mahalanobis distance is used to select, from a speaker pool, the emotion models closest to the target speaker, and interpolation weights are estimated from these distances to synthesize emotional speech. In the model adaptation approach, an average voice model is trained for each emotion from a large collected corpus and then adapted with constrained maximum likelihood linear regression (CMLLR) so that the resulting emotional speech matches the target speaker's characteristics. In addition, we design a Latin-square evaluation that reduces systematic bias in subjective tests, making the results more credible and fair. In the experiments, we synthesize speech for the emotions happiness, anger, and sadness, and use the Latin-square design to evaluate similarity, naturalness, and emotional expressiveness. Based on the results, we give an overall comparison of the two synthesis methods and draw conclusions.
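As a rough illustration of the speaker-selection step described above, the sketch below assumes each speaker's emotion model is summarized by per-monophone mean vectors under a shared diagonal variance, selects the closest pool speakers by averaged Mahalanobis distance, and turns the distances into normalized interpolation weights. The data layout, function names, and inverse-distance weighting are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def mahalanobis_monophone(mean_a, mean_b, var):
    # Mahalanobis distance between two monophone mean vectors
    # under a shared diagonal covariance (given as a variance vector).
    d = mean_a - mean_b
    return float(np.sqrt(np.sum(d * d / var)))

def select_and_weight(target_means, pool, var, k=3):
    """Pick the k pool speakers closest to the target speaker
    (distance averaged over monophones) and convert the distances
    into normalized interpolation weights (closer => larger weight)."""
    scores = []
    for name, means in pool.items():
        dists = [mahalanobis_monophone(target_means[p], means[p], var[p])
                 for p in target_means]
        scores.append((name, sum(dists) / len(dists)))
    scores.sort(key=lambda x: x[1])          # nearest speakers first
    nearest = scores[:k]
    inv = [1.0 / (d + 1e-8) for _, d in nearest]
    total = sum(inv)
    return [(name, w / total) for (name, _), w in zip(nearest, inv)]
```

The weights sum to one, so the interpolated HMM parameters remain a convex combination of the selected speakers' emotion models.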
Abstract
In this thesis, we use hidden Markov models, which can synthesize speech of reasonable quality from a small amount of training data, to implement a Mandarin speech synthesis system, and we further exploit the flexibility of the model's parametric representation of speech to synthesize emotional speech. We apply model interpolation and model adaptation to synthesize speech in a particular emotion from neutral speech, without requiring emotional recordings from the target speaker. In model interpolation, we use a monophone-based Mahalanobis distance to select emotional models close to the target speaker from a pool of speakers, and we estimate interpolation weights from these distances to synthesize emotional speech. In model adaptation, we collect abundant data to train an average voice model for each emotion; these models are then adapted into speaker-specific emotional models with the CMLLR method. In addition, we design a Latin-square evaluation to reduce systematic bias in the subjective tests, making the results more credible and fair. We synthesize emotional speech for happiness, anger, and sadness, and use the Latin-square design to evaluate similarity, naturalness, and emotional expressiveness. Based on the results, we make a comprehensive comparison of the two methods for emotional speech synthesis and draw conclusions.
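The balancing idea behind the Latin-square evaluation can be illustrated with the standard cyclic construction: each system appears exactly once in every row (listener group) and every column (presentation position), so no system is systematically favored by ordering effects. This is a generic sketch of the technique, not the specific square or assignment procedure used in the thesis.

```python
def latin_square(n):
    # Cyclic construction: entry (i, j) = (i + j) mod n, so every
    # system index appears exactly once per row and per column.
    return [[(i + j) % n for j in range(n)] for i in range(n)]

def order_for_group(square, group, systems):
    # Map one row of the square to the concrete order in which a
    # given listener group hears the systems under test.
    return [systems[k] for k in square[group]]
```

With four systems and four listener groups, each system is heard first by exactly one group, second by exactly one group, and so on, which is what cancels the systematic position bias in subjective scores.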
目次 Table of Contents
List of Tables
List of Figures
Chapter 1  Introduction
  1.1 Motivation and Objectives
  1.2 Background
    1.2.1 Unit-selection Synthesis Systems
    1.2.2 Statistical Parametric Synthesis Systems
  1.3 Thesis Organization
Chapter 2  Methods and Basic Architecture
  2.1 Overview of the Basic System Architecture
  2.2 The Mandarin Phonetic System and Its Characteristics
  2.3 Context Dependency and Decision-tree Clustering
    2.3.1 Decision Trees and Question Sets
  2.4 Overview of Model Adaptation
    2.4.1 Maximum Likelihood Linear Regression (MLLR)
    2.4.2 Constrained Maximum Likelihood Linear Regression (CMLLR)
Chapter 3  Latin-square Design and Model Interpolation
  3.1 Introduction to Latin Squares
    3.1.1 Square Design Steps and Model Analysis
    3.1.2 Advantages and Disadvantages of the Design
  3.2 Latin-square Design for Evaluating Emotional Speech
  3.3 Model Interpolation
    3.3.1 Interpolation of HMMs
    3.3.2 Cost Analysis of Corpus Collection
    3.3.3 Speaker-specific Model Interpolation
Chapter 4  Experiments
  4.1 Experimental Setup
  4.2 Corpus Collection
  4.3 Evaluation of Model Interpolation and Adaptation
    4.3.1 Emotional Expressiveness Test
    4.3.2 Naturalness Test
    4.3.3 Speaker Similarity Test
Chapter 5  Conclusions and Future Work
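Section 2.4.2 of the outline covers CMLLR adaptation. Its defining property is that a single linear transform acts on both the mean and the covariance of each Gaussian, unlike unconstrained MLLR, where separate transforms are estimated. The sketch below only applies a given transform to one Gaussian's parameters; the maximum-likelihood estimation of the transform itself (the actual adaptation step) is omitted, and the function name is hypothetical.

```python
import numpy as np

def apply_cmllr(mean, cov, A, b):
    """Apply a constrained MLLR transform to one Gaussian:
    mean' = A @ mean + b and cov' = A @ cov @ A.T, i.e. the same
    matrix A is shared by the mean and covariance updates."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    return A @ mean + b, A @ cov @ A.T
```

Equivalently, CMLLR can be viewed as a feature-space transform, which is why one transform suffices for both parameters.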
電子全文 Fulltext
This electronic full text is licensed to users only for individual, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
紙本論文 Printed copies
Public-access information for printed theses is relatively complete from academic year 102 onward. To check access information for printed theses from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. 開放時間 Available: 已公開 available