國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,使用差分貝氏資訊準則及支援向量機於混合語言語音自動分段與辨識,Automatic Segmentation and Identification of Mixed-Language Speech Using delta-BIC and Support Vector Machines

論文名稱 Title	使用差分貝氏資訊準則及支援向量機於混合語言語音自動分段與辨識 Automatic Segmentation and Identification of Mixed-Language Speech Using delta-BIC and Support Vector Machines
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	96 學年度第 2 學期 The spring semester of Academic Year 96	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	66
研究生 Author	王聖富 Sheng-Fu Wang
指導教授 Advisor	陳嘉平 Chia-Ping Chen
召集委員 Convenor	范俊逸 Chun-I Fan
口試委員 Advisory Committee	王新民, 柯正雯 Hsing-Min Wang; Cheng-Wen Ko
口試日期 Date of Exam	2008-07-29	繳交日期 Date of Submission	2008-09-09
關鍵字 Keywords	辨識、差分貝氏資訊準則、分段、支援向量機 LID, delta-BIC, Segmentation, Support Vector Machines
統計 Statistics	本論文已被瀏覽 5654 次，被下載 0 次 The thesis/dissertation has been browsed 5654 times, has been downloaded 0 times.

中文摘要
這篇論文提出方法，用來分段及辨識混合語言的語音資料。自動語言辨識可分成四個步驟：特徵參數擷取、分段、片段分類、與重新標註。特徵參數擷取的部份，我們比較群延遲特徵 (group delay feature, GDF) 和傳統梅爾頻率倒頻譜參數 (Mel-frequency cepstral coefficient, MFCC) 兩種不同的特徵參數。不同於傳統特徵參數取自於傅立葉轉換後的強度，群延遲特徵使用相位頻譜。在語言分段的部份，我們比較差分貝氏資訊準則 (delta-Bayesian information criterion, delta-BIC) 與支援向量機 (support vector machines, SVMs) 等兩種不同方法。差分貝氏資訊準則使用聲學參數，用於將輸入語句切割成一連串語言相依的片段。再使用 K－平均演算法 (the K-means algorithm) 進行分群。最後，重新標註用於辨識各分群的語言。支援向量機則在完成訓練模型後，直接進行自動語言分段及辨識。考慮腔調可能產生的影響，我們使用台灣口音英語 (English Across Taiwan) 語料庫。在基礎為 57.77% 的音框正確率，可以得到 78.13% 的結果。
Abstract
This thesis proposes an approach to segmenting and identifying mixed-language speech. Automatic language identification (LID) can be divided into four steps: feature extraction, segmentation, segment clustering, and re-labeling. In feature extraction, we compare the group delay feature (GDF) with the conventional Mel-frequency cepstral coefficient (MFCC) feature. Unlike traditional features, which are derived from the Fourier transform magnitude, the GDF uses the phase spectrum. In segmentation, we compare the delta Bayesian information criterion (delta-BIC) with support vector machines (SVMs). Delta-BIC is applied to acoustic features to segment the input utterance into a sequence of language-dependent segments. The segments are then clustered using the K-means algorithm. Finally, re-labeling determines the language of each cluster. The SVMs, once trained, perform segmentation and identification directly. To account for the effect of accent, we evaluate our system on the English Across Taiwan (EAT) corpus. The experimental results show that the system achieves a frame hit rate of 78.13%, compared with a baseline of 57.77%.
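As an illustration of the delta-BIC segmentation step named in the abstract, the following is a minimal sketch of single change-point detection in the style of the Bayesian information criterion test of Chen and Gopalakrishnan. It is not taken from the thesis. It assumes each segment is modeled by one full-covariance Gaussian, that a positive score indicates a change point, and that the feature frames are already extracted; the function names, the `margin` and `penalty_weight` parameters, and the synthetic data are all hypothetical.

```python
import numpy as np


def _logdet_cov(frames):
    """Log-determinant of the maximum-likelihood covariance of the frames."""
    cov = np.atleast_2d(np.cov(frames, rowvar=False, bias=True))
    # Small diagonal floor (an assumption) keeps short windows well conditioned.
    cov = cov + 1e-6 * np.eye(frames.shape[1])
    _, logdet = np.linalg.slogdet(cov)
    return logdet


def delta_bic(frames, split, penalty_weight=1.0):
    """Score the hypothesis that the window changes model at index `split`.

    Convention assumed here: a positive value favors two Gaussians
    (a change point) over one Gaussian for the whole window.
    """
    n, d = frames.shape
    left, right = frames[:split], frames[split:]
    gain = 0.5 * (
        n * _logdet_cov(frames)
        - len(left) * _logdet_cov(left)
        - len(right) * _logdet_cov(right)
    )
    # Penalty for the extra mean vector and covariance matrix.
    penalty = 0.5 * penalty_weight * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty


def best_change_point(frames, margin=20, penalty_weight=1.0):
    """Return the split index with the highest delta-BIC and its score."""
    candidates = range(margin, len(frames) - margin)
    scores = [delta_bic(frames, t, penalty_weight) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])


def demo():
    """Synthetic 4-dimensional 'features' whose mean shifts at frame 200."""
    rng = np.random.default_rng(0)
    first = rng.normal(0.0, 1.0, size=(200, 4))
    second = rng.normal(3.0, 1.0, size=(200, 4))
    return best_change_point(np.vstack([first, second]))
```

Calling `demo()` returns the estimated boundary and its score. In a full system of the kind the abstract describes, a search like this would be run over sliding windows of real acoustic features, and the resulting segments would then be clustered and re-labeled.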

目次 Table of Contents
中文摘要 (Chinese Abstract)
Abstract
誌謝 (Acknowledgments)
Table of Contents
List of Tables
List of Figures
1 Introduction
    1.1 Background
    1.2 Motivation
    1.3 Purposes
    1.4 Thesis Organization
2 Review
    2.1 Mono-lingual LID
        2.1.1 Acoustic Features
        2.1.2 Prosody Features
        2.1.3 Phonotactics
        2.1.4 Acoustic Model
    2.2 Mixed-language LID
        2.2.1 Methods for Segmentation
        2.2.2 Classifier
3 Methods
    3.1 System I
        3.1.1 Feature Extraction
        3.1.2 Segmentation
        3.1.3 Segment Clustering
        3.1.4 Re-label
    3.2 System II
        3.2.1 Types of SVMs
        3.2.2 Kernel Function
        3.2.3 Probability Estimates
    3.3 System III
        3.3.1 Shifted Delta Cepstrum
4 Experimental Results
    4.1 System I
    4.2 System II
    4.3 System III
5 Conclusions and Future work
Reference

