National Sun Yat-sen University (國立中山大學), thesis/dissertation, A Design of Recognition Rate Improving Strategy For English Speech Recognition System

Title	A Design of Recognition Rate Improving Strategy For English Speech Recognition System (英文語音辨識系統增進辨識率之策略研究)
Department	Department of Electrical Engineering
Year, semester	Spring semester of Academic Year 99 (ROC calendar, 2010-2011)	Language	Chinese
Degree	Master	Number of pages	73
Author	Ming-Chang Hung (洪明昌)
Advisor	Chih-Chien Chen (陳志堅)
Convenor	Er-Hui Lu (盧而輝)
Advisory Committee	Xiao-Song Bo (柏小松); Tsung Lee (李聰); Chii-Maw Uang (汪啟茂)
Date of Exam	2011-07-18	Date of Submission	2011-08-27
Keywords	Mel-frequency cepstral coefficients, Phonotactics, Hidden Markov model, Linear predictive cepstral coefficients, English speech recognition system
Statistics	This thesis/dissertation has been viewed 5,696 times and downloaded 1,018 times.

Abstract
Britain established its maritime hegemony in 1588, and British colonial activity subsequently spread the English language to North America, India, Africa, and Australia. After World War I ended in 1918, the United States became the world's leading economic power, and the world financial center shifted from London to New York. After World War II ended in 1945, the standing of the United States in international politics, economics, and technology rose further. The United Nations, founded on October 24, 1945, adopted English, Chinese, French, Spanish, Arabic, and Russian as its six official working languages. These historical events drove a succession of geographic expansions that made English the most widely used international language. Beyond these political, economic, and technological advantages, Britain is home to the British Museum, the largest comprehensive museum in the world. Founded in London in 1753, it holds more than 13 million archaeological artifacts from around the world, a remarkably rich cultural resource. The objective of this research is to build an English speech recognition system that helps us learn English more effectively and, in turn, broadens our horizons.

This thesis investigates strategies for improving the recognition rate of an English speech recognition system. The 989 commonly used English monosyllables serve as the basis for training and recognition. The training corpus is built by recording each monosyllable for 14 rounds, alternating between two tones: odd rounds are read with the high pitch of tone 1, and even rounds with the falling pitch of tone 4. The pitch period is used to locate the tail endpoint of each utterance, which improves the accuracy of endpoint detection. Mel-frequency cepstral coefficients and linear predictive cepstral coefficients are used as the feature parameters, and a hidden Markov model (HMM) serves as the monosyllable recognition model. The number of HMM states is set to 10, and phonotactic rules are applied as a further check to improve the recognition rate.

On a notebook computer with a 2.4 GHz Intel Core i5 CPU M450 running Fedora 14, the system achieves a 92.94% correct recognition rate on a database of 6,812 English phrases, with an average recognition time within 1.5 seconds per phrase.
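The abstract describes recognition with one hidden Markov model per monosyllable and 10 states per model. As a rough illustration only, the sketch below scores a sequence of feature vectors against each model with the forward procedure and picks the best-scoring model. The left-to-right topology, the single diagonal Gaussian per state, the stay probability of 0.6, and every function and dictionary key name are assumptions made for this example; this record does not say how the thesis implements these details.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Return log(exp(a) + exp(b)) without overflow."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def left_to_right_log_transitions(n_states=10, stay=0.6):
    """Log transition matrix where each state either repeats or advances by one.

    The stay probability is an illustrative placeholder, not a trained value.
    """
    log_a = [[NEG_INF] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        log_a[i][i] = math.log(stay)
        log_a[i][i + 1] = math.log(1.0 - stay)
    log_a[-1][-1] = 0.0  # the last state can only repeat
    return log_a


def log_gaussian(frame, mean, var):
    """Log density of one feature vector under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )


def forward_log_likelihood(frames, model):
    """log P(frames | model) via the forward procedure, starting in state 0."""
    means, variances, log_a = model["means"], model["vars"], model["log_a"]
    n = len(means)
    alpha = [NEG_INF] * n
    alpha[0] = log_gaussian(frames[0], means[0], variances[0])
    for frame in frames[1:]:
        new_alpha = []
        for j in range(n):
            # Sum (in the log domain) over every predecessor state i.
            total = NEG_INF
            for i in range(n):
                total = log_add(total, alpha[i] + log_a[i][j])
            new_alpha.append(total + log_gaussian(frame, means[j], variances[j]))
        alpha = new_alpha
    # Total likelihood: sum over all states at the final frame.
    result = NEG_INF
    for value in alpha:
        result = log_add(result, value)
    return result


def recognize(frames, models):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: forward_log_likelihood(frames, models[name]))
```

In a setting like the one the abstract describes, `frames` would presumably hold the MFCC and LPCC vectors of one utterance and `models` one trained entry per monosyllable class; the model parameters here are placeholders to be supplied by training.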

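Table of Contents
Thesis Approval Form
Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1-1 Research Motivation
  1-2 Research Objectives
  1-3 Chapter Overview
Chapter 2 The Historical Evolution of English and Its Pronunciation Characteristics
  2-1 Classification of Language Families and Their Usage
  2-2 History of the Evolution of English
  2-3 English Pronunciation
Chapter 3 Introduction to Speech Signal Processing Techniques
  3-1 Syllable Endpoint Labeling
    3-1-1 Energy
    3-1-2 Zero Crossing Rate
    3-1-3 Linear Prediction Coefficients Error Energy
  3-2 MFCC Feature Extraction
    3-2-1 Pre-emphasis
    3-2-2 Frame Blocking
    3-2-3 Windowing
    3-2-4 Discrete Fourier Transform
    3-2-5 Mel-Frequency Filter Bank
    3-2-6 Discrete Cosine Transform
    3-2-7 LPC-Cepstrum (Linear Predictive Cepstral Coefficients)
Chapter 4 Hidden Markov Model (HMM)
  4-1 Signal Models
  4-2 Introduction to Hidden Markov Models
  4-3 The Three Problems Solved by Hidden Markov Models
  4-4 Building a Hidden Markov Model
    4-4-1 Initialization
    4-4-2 Computing State Observation Sequence Probabilities
    4-4-3 Parameter Estimation
  4-5 Reestimation
    4-5-1 Forward Procedure
    4-5-2 Backward Procedure
    4-5-3 Forward-Backward Procedure
    4-5-4 Reestimation of the State Transition Probability Matrix
    4-5-5 Reestimation of the State Observation Probability Matrix
  4-6 Viterbi Algorithm
Chapter 5 Introduction to the English Speech Recognition System
  5-1 System Architecture
    5-1-1 Model Training System Architecture
    5-1-2 Recognition System Architecture
  5-2 Equipment and Environment
  5-3 Design Method
    5-3-1 Phonetic Symbol Codes
    5-3-2 Selection of Syllable Classes and Construction of the Phrase Database
    5-3-3 Detecting Speech Activity and Cessation
    5-3-4 Syllable Endpoint Labeling
    5-3-5 Acoustic Feature Extraction
    5-3-6 HMM Training Method
    5-3-7 Decision Method
Chapter 6 Recognition Strategy Study and Experiment Design
  6-1 Experimental Parameter Settings and Number of Simulated Phrases
  6-2 Experiments on Adjusting the Number of Training Rounds and the Training Method
    6-2-1 Number of Training Rounds for Monosyllable Class Models
    6-2-2 Training Method for Monosyllable Class Models
  6-3 Experiment on Changing the Endpoint Labeling Method
    6-3-1 Effect on Recognition Rate of Endpoint Labeling That Preserves Complete Consonants
  6-4 Experiment on Using Multiple Acoustic Feature Parameters
    6-4-1 Effect of Single versus Multiple Acoustic Features on Recognition Rate
  6-5 Experiment on Adjusting the Number of HMM States
    6-5-1 Effect of the Number of HMM States on Recognition Rate
Chapter 7 Conclusions and Future Work
  7-1 Conclusions
  7-2 Future Work
References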


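Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: user-defined availability. Available: on campus, available; off campus, available. File: etd-0827111-202317.pdf
Printed copies
Public-access information for printed theses is relatively complete from academic year 102 onward. For printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available.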



Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS