Thesis record etd-0910108-122526
Title
模擬棒球廣播之情緒化語音合成系統
Emotional Text-to-Speech System of Baseball Broadcast
Department
Year, semester
Language
Degree
Number of pages
62
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2008-08-28
Date of Submission
2008-09-10
Keywords
speech synthesis, concatenative speech synthesis, emotion conversion, prosody adjustment, prosodic rule
Statistics
This thesis has been browsed 5655 times and downloaded 1448 times.
Abstract (Chinese, translated)
This study builds an emotional text-to-speech system for baseball play-by-play reporting. The goal is for the synthesized speech to imitate, as closely as possible, the style of a radio baseball announcer, which requires the system to handle both the emotional quality of the announcer's voice and the extra on-court information the announcer adds during a broadcast. To let the system deliver this extra information by voice, we parse the sentences of the on-line game text and continuously update the game state from the parsed content: the number of runners and which bases are occupied, the current number of outs, the score, and the batter's performance in earlier at-bats. This information is used to generate additional sentences, which are inserted at suitable positions in the original text. The augmented text is first synthesized by a basic concatenative synthesizer. For prosody adjustment, prosodic rules are learned from a corpus of two actual broadcast baseball games, and these rules are applied to the sentences produced by the basic synthesizer so that the prosody of the synthesized speech conveys emotion. Finally, subjective listening tests are conducted to assess listeners' satisfaction with the system's output.
Abstract
In this study, we implement an emotional text-to-speech system for the limited domain of on-line play-by-play baseball game summaries; the Chinese Professional Baseball League (CPBL) is our target domain. Our goal is synthesized speech that is fluent and carries appropriate emotion. The system first parses the input text and tracks the on-court information, e.g., the number of runners and which bases are occupied, the number of outs, the score of each team, and the batter's performance in the game. The system then inserts additional sentences into the input text.
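The game-state tracking and sentence generation described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation (which parses Chinese play-by-play text); the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class GameState:
    """Running on-court state, updated as each play-by-play line is parsed."""
    bases: list = field(default_factory=lambda: [False, False, False])  # 1B, 2B, 3B occupied
    outs: int = 0
    away_score: int = 0
    home_score: int = 0

    def summary_sentence(self) -> str:
        """Render the current state as an additional sentence for insertion."""
        occupied = [name for name, on in zip(("first", "second", "third"), self.bases) if on]
        if not occupied:
            base_part = "bases empty"
        elif len(occupied) > 1:
            base_part = "runners on " + " and ".join(occupied)
        else:
            base_part = "runner on " + occupied[0]
        return f"{base_part}, {self.outs} out, score {self.away_score}-{self.home_score}"

state = GameState()
state.bases[0] = True  # e.g. the parser saw a single
state.outs = 1
print(state.summary_sentence())  # prints: runner on first, 1 out, score 0-0
```

A real parser would update such a state object from each parsed sentence and decide, by position in the text, where the generated summary sentence should be inserted.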
Then, the system synthesizes neutral speech from the augmented text and subsequently converts it to emotional speech. Our approach to the conversion is to simulate a baseball broadcaster: the system learns and applies the prosody of a real broadcaster. To learn this prosody, we recorded two baseball games and analyzed the prosodic features of the emotional utterances.
These observations are used to derive prosodic rules for emotion conversion. A subjective evaluation studies listeners' preferences regarding both the inserted additional sentences and the emotion conversion.
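As a rough illustration of how such a prosodic rule might operate, the sketch below applies scale factors to the F0 contour of a neutral utterance: widening the pitch range around the mean and raising the mean, a common shape for rule-based neutral-to-excited conversion. The function name and the specific factors are hypothetical, not values from the thesis.

```python
def apply_prosodic_rule(f0_contour, pitch_scale=1.3, range_scale=1.2):
    """Convert a neutral F0 contour (Hz, 0 = unvoiced) toward an excited one:
    widen excursions around the mean (range_scale), then raise the mean
    (pitch_scale). Unvoiced frames are passed through unchanged."""
    voiced = [v for v in f0_contour if v > 0]
    mean_f0 = sum(voiced) / len(voiced)
    return [(v - mean_f0) * range_scale + mean_f0 * pitch_scale if v > 0 else 0.0
            for v in f0_contour]

neutral = [0, 180, 200, 220, 190, 0]    # frame-level F0 in Hz
excited = apply_prosodic_rule(neutral)  # higher mean, wider range
```

In practice the modified contour would then be imposed on the concatenated waveform with a pitch-modification technique such as PSOLA.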
Table of Contents
1 Introduction
1.1 Background
1.2 Motivation
1.3 Thesis Organization
2 Review
2.1 Concatenation-Based TTS
2.2 Speech Emotion Conversion
3 Basic Text-to-Speech Module
3.1 Speech Inventory
3.2 Pre-Processing of the Synthesis Units
3.2.1 Pitch Tracking
3.2.2 Energy Normalization
3.3 Basic TTS Framework
3.4 Synthesizer
4 Emotional Speech Corpus and Analysis
4.1 Emotional Speech Corpus Construction
4.2 Classification of Emotional Corpus
4.3 F0 Contour Analysis
4.4 Stressed Syllables
5 Additional Sentence Generation Module
5.1 On-court Information Parser
5.2 Additional Sentence Insertion
6 Experiment and Evaluation
6.1 Speech Emotion Conversion Module
6.1.1 Text Analyzer
6.1.2 F0 Extraction
6.1.3 Rhythmic Stress
6.1.4 Semantic Stress
6.1.5 Speech Synthesizer
6.2 Evaluation
6.2.1 Perceptual Experiment
6.2.2 Preference Test
6.2.3 Additional Sentence Preference Test
6.3 Discussion
6.4 Cross-fading Effect
7 Conclusion and Future Work
7.1 Conclusion
7.2 Future Work
Fulltext
The electronic full text is licensed for personal, non-profit searching, reading, and printing for academic research purposes only. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: available on campus immediately; off-campus access withheld for one year.
Available:
Campus: available
Off-campus: available


Printed copies
Access information for printed copies is relatively complete only for academic year 102 and later. For access information on printed copies from academic year 101 or earlier, please contact the printed-thesis service desk of the library. We apologize for any inconvenience.
Available: available
