Title page for etd-0719112-143512
Title
運用拉丁方陣評估模型內插與模型調適兩種情緒語音合成方法
Using Latin Square Design To Evaluate Model Interpolation And Adaptation Based Emotional Speech Synthesis
Department
Year, semester
Language
Degree
Number of pages
49
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2012-06-29
Date of Submission
2012-07-19
Keywords
model interpolation, Latin-square design, hidden Markov model, model adaptation, emotional speech synthesis, Mahalanobis distance
Statistics
The thesis/dissertation has been browsed 5714 times and downloaded 1385 times.
Abstract
  In this thesis, we exploit the ability of hidden Markov models (HMMs) to synthesize speech of reasonable quality from a small corpus to implement a Mandarin Chinese speech synthesis system, and we further exploit the flexibility of the model's parametric representation of speech to synthesize emotional speech. We apply model interpolation and model adaptation to synthesize speech with a specific emotion from a target speaker's neutral voice, without requiring any emotional recordings of that speaker. In the model interpolation method, we use a monophone-based Mahalanobis distance to select, from a pool of speakers, the emotional models closest to the target speaker, estimate their interpolation weights, and synthesize emotional speech from the interpolated model. In the model adaptation method, we train an average voice model for each emotion on a large collected corpus and then adapt it to the target speaker's characteristics with constrained maximum likelihood linear regression (CMLLR). In addition, we design a Latin-square evaluation to reduce systematic bias in subjective tests, making the results more credible and fair. In the experiments, we synthesize speech for three emotions (happiness, anger, and sadness) and use the Latin-square design to evaluate speaker similarity, naturalness, and emotional expressiveness. Based on the results, we give a comprehensive comparison of the two synthesis methods and draw conclusions.
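The interpolation pipeline described above (select nearby emotional models with a monophone-based Mahalanobis distance, then estimate interpolation weights) can be illustrated with a short sketch. This is a minimal Python illustration under stated assumptions, not the thesis implementation: each speaker's emotional model is reduced to per-monophone Gaussian mean vectors with diagonal covariances, the distance is averaged over monophones shared with the target speaker's neutral model, and the weights for the k closest speakers are normalized inverse distances (one plausible weighting rule; the thesis may estimate weights differently). All names are hypothetical.

```python
import numpy as np

def monophone_mahalanobis(target_means, cand_means, inv_var):
    """Average Mahalanobis distance over the monophones shared by the
    target speaker's neutral model and a candidate emotional model.

    target_means / cand_means: dict monophone -> spectral mean vector;
    inv_var: dict monophone -> elementwise inverse of a diagonal variance.
    """
    shared = target_means.keys() & cand_means.keys()
    dists = [np.sqrt(np.sum((target_means[ph] - cand_means[ph]) ** 2
                            * inv_var[ph]))
             for ph in shared]
    return float(np.mean(dists))

def interpolation_weights(distances, k=3):
    """Pick the k closest pool speakers and weight them by inverse
    distance, normalized to sum to one."""
    order = np.argsort(distances)[:k]
    inv = 1.0 / np.asarray(distances, dtype=float)[order]
    return order, inv / inv.sum()
```

The interpolated emotional model would then combine the selected speakers' HMM parameters with these weights.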
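The CMLLR adaptation step admits a compact statement. In constrained MLLR, the means and covariances of the Gaussians in a regression class share a single affine transform, which is equivalent to transforming the observation vectors themselves; a standard formulation of this equivalence (general CMLLR background, not text taken from the thesis) is:

```latex
% One shared transform (A, b), estimated by maximum likelihood on the
% target speaker's adaptation data, is applied in feature space:
\hat{\mathbf{o}}_t = A\,\mathbf{o}_t + \mathbf{b},
\qquad
p(\mathbf{o}_t) = |A|\,
  \mathcal{N}\!\bigl(A\,\mathbf{o}_t + \mathbf{b};\; \boldsymbol{\mu},\, \boldsymbol{\Sigma}\bigr).
% Equivalently, in model space every Gaussian moves under the same transform:
%   \hat{\boldsymbol{\mu}} = A^{-1}(\boldsymbol{\mu} - \mathbf{b}),
%   \hat{\boldsymbol{\Sigma}} = A^{-1}\,\boldsymbol{\Sigma}\,A^{-\top}.
```

Because a single transform adapts means and covariances jointly, CMLLR can be estimated from relatively little target-speaker data, which is what makes it suitable for pulling an average emotional voice model toward a new speaker.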
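The Latin-square evaluation works because each listener group hears every system, but on disjoint sentence sets and in varied positions, so sentence difficulty and presentation order balance out of the between-system comparison instead of accumulating as systematic bias. A minimal sketch of a cyclic construction follows (an assumption for illustration; the thesis does not state which particular square it used):

```python
def latin_square(n):
    """Cyclic n x n Latin square: cell (i, j) holds (i + j) mod n.

    Read rows as listener groups and columns as sentence sets; the cell
    value is the synthesis system assigned to that pair. Every system
    appears exactly once in each row and each column.
    """
    return [[(i + j) % n for j in range(n)] for i in range(n)]

# Example with three conditions (e.g., interpolation, adaptation,
# and a reference system):
for row in latin_square(3):
    print(row)
# [0, 1, 2]
# [1, 2, 0]
# [2, 0, 1]
```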
Table of Contents
List of Tables
List of Figures
Chapter 1  Introduction
1.1  Motivation and objectives
1.2  Background
1.2.1  Unit-selection synthesis systems
1.2.2  Statistical parametric synthesis systems
1.3  Organization of the thesis
Chapter 2  Methodology and basic system architecture
2.1  Overview of the basic system architecture
2.2  The Mandarin phonetic system and its characteristics
2.3  Context dependency and decision-tree classification
2.3.1  Decision trees and question sets
2.4  Overview of model adaptation methods
2.4.1  Maximum likelihood linear regression (MLLR)
2.4.2  Constrained maximum likelihood linear regression (CMLLR)
Chapter 3  Latin square design and model interpolation
3.1  Introduction to Latin squares
3.1.1  Design procedure and model analysis
3.1.2  Advantages and disadvantages of the design
3.2  A Latin square design for evaluating emotional speech
3.3  Model interpolation
3.3.1  Interpolation of HMMs
3.3.2  Cost analysis of corpus collection
3.3.3  Speaker-specific model interpolation
Chapter 4  Experiments
4.1  Experimental setup
4.2  Corpus collection
4.3  Evaluation of model interpolation and model adaptation
4.3.1  Emotional expressiveness test
4.3.2  Naturalness test
4.3.3  Speaker similarity test
Chapter 5  Conclusions and future work
Fulltext
This electronic full text is licensed for personal, non-profit searching, reading, and printing for academic research purposes only. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: user-defined release time
Available:
Campus: available
Off-campus: available


Printed copies
Public-access information for printed copies is relatively complete from academic year 102 (2013–2014) onward. To inquire about printed copies from academic year 101 (2012–2013) or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
