National Sun Yat-sen University, thesis/dissertation, Accuracy Improvement for RNA Secondary Structure Prediction with SVM

Title	Accuracy Improvement for RNA Secondary Structure Prediction with SVM
Department	Department of Computer Science and Engineering
Year, semester	Spring semester of Academic Year 96 (ROC calendar, 2007-2008)	Language	English
Degree	Master	Number of pages	60
Author	Chia-Hung Chang (張嘉宏)
Advisor	Chang-Biau Yang (楊昌彪)
Convenor	Yue-Li Wang (王有禮)
Advisory Committee	Shih-Chung Chen (陳世中); Chia-Ning Yang (楊佳寧); Yow-Ling Shiue (薛佑玲)
Date of Exam	2008-07-14	Date of Submission	2008-07-30
Keywords	RNA, secondary structure, support vector machine, machine learning, classification
Statistics	The thesis/dissertation has been browsed 5659 times and downloaded 801 times.

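Chinese Abstract (English translation)
Ribonucleic acid (RNA) is an important genetic material found throughout living organisms. Unlike deoxyribonucleic acid (DNA), it is a single-stranded long-chain molecule. In aqueous solution and in vivo, nucleotides along the chain bond to one another through hydrogen bonding and form an intramolecular helical secondary structure. The simplest RNA secondary structure is called a nested structure. Some RNAs, however, form a more complex secondary structure, which we call a pseudoknot. Many existing RNA secondary structure prediction programs have limited predictive power, and each has its own strengths. The main goal of this study is to integrate existing prediction tools in order to improve the overall accuracy of RNA secondary structure prediction. We propose an RNA sequence analysis method that serves as a preprocessing step for choosing an RNA secondary structure prediction tool. The method relies on the classification ability of the support vector machine to select, in advance, the prediction tool best suited to a given sequence. The RNA sequence data used in this study were obtained from two databases, PseudoBase and RNA SSTRAND. Using cross-validation, we tested a total of 723 real RNA sequences. The experimental results show that we improved not only the overall prediction accuracy but also both the sensitivity and the selectivity of the predictions.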
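The distinction between nested and pseudoknotted structures can be made concrete with a short sketch. It assumes a structure is given as a list of base pairs (i, j) of sequence positions, and it uses the usual crossing criterion: two pairs (i, j) and (k, l) form a pseudoknot when i < k < j < l. The function name and the example pair lists are hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def is_pseudoknotted(pairs):
    """Return True if any two base pairs cross (i < k < j < l)."""
    # Normalize each pair so that the smaller index comes first.
    normalized = sorted(tuple(sorted(p)) for p in pairs)
    # combinations() keeps the sorted order, so i <= k for every pair of pairs.
    for (i, j), (k, l) in combinations(normalized, 2):
        if i < k < j < l:
            return True
    return False


# Hypothetical examples: a fully nested hairpin and an H-type-like crossing.
nested = [(0, 9), (1, 8), (2, 7)]
knotted = [(0, 6), (1, 5), (3, 10), (4, 9)]

print(is_pseudoknotted(nested))
print(is_pseudoknotted(knotted))
```

The pairwise check is quadratic in the number of base pairs, which is adequate for illustration; it says nothing about how pknotsRG or NUPACK represent structures internally.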
Abstract
Ribonucleic acid (RNA) sometimes folds into a complex structure called a pseudoknot. Prediction of RNA secondary structures has drawn much attention from both biologists and computer scientists. Consequently, many useful tools have been developed for RNA secondary structure prediction, with or without pseudoknots. These tools have their individual strengths and weaknesses. We therefore propose a hybrid feature extraction method that integrates two prediction tools, pknotsRG and NUPACK, with a support vector machine (SVM). We first extract useful features from the target RNA sequence and then use SVM classification to decide which prediction tool the sequence prefers. Our test data set contains 723 RNA sequences: 202 pseudoknotted sequences obtained from PseudoBase and 521 nested sequences obtained from RNA SSTRAND. Experimental results show that our method improves not only the overall accuracy but also the sensitivity and the selectivity for the target sequences. Our method serves as a preprocessing step for analyzing RNA sequences before the RNA secondary structure prediction tools are applied. Our main contribution is the ability to combine existing methods and make the prediction tools more accurate.
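As an illustration of the two evaluation measures named in the abstract, the following sketch computes sensitivity and selectivity for a predicted set of base pairs against a reference set. It assumes the common definitions, sensitivity = TP / (TP + FN) and selectivity = TP / (TP + FP); the exact definitions used in the thesis are not given in this record and may differ. The function name and the example data are hypothetical.

```python
def sensitivity_selectivity(reference_pairs, predicted_pairs):
    """Return (sensitivity, selectivity) of predicted base pairs.

    Assumed definitions (not confirmed by the thesis record):
      sensitivity = TP / (TP + FN) = TP / number of reference pairs
      selectivity = TP / (TP + FP) = TP / number of predicted pairs
    """
    # Treat (i, j) and (j, i) as the same base pair.
    reference = {tuple(sorted(p)) for p in reference_pairs}
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    true_positives = len(reference & predicted)
    sensitivity = true_positives / len(reference) if reference else 0.0
    selectivity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, selectivity


# Hypothetical example: two of the four reference pairs are recovered,
# and one of the three predicted pairs is wrong.
reference = [(0, 9), (1, 8), (2, 7), (3, 6)]
predicted = [(0, 9), (1, 8), (2, 6)]
sens, sel = sensitivity_selectivity(reference, predicted)
print(f"sensitivity={sens:.3f}, selectivity={sel:.3f}")
```

Computing both measures for each candidate tool on a given sequence is one plausible way to decide which tool a sequence "prefers" when labeling training data, though the labeling rule actually used in the thesis is not stated here.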

Table of Contents
List of Figures
List of Tables
Abstract
Chapter 1. Introduction
Chapter 2. Preliminary
    2.1 Prediction of RNA Secondary Structure with Dynamic Programming
        2.1.1 pknotsRG
        2.1.2 NUPACK
    2.2 The Support Vector Machine (SVM)
        2.2.1 Kernel Functions
        2.2.2 Soft Margin
        2.2.3 LIBSVM
Chapter 3. Effective Features for RNA Structure Prediction
    3.1 Feature Extraction
        3.1.1 The Compositional Factor
        3.1.2 The Bi-transitional Factor
        3.1.3 The Distributional Factor
        3.1.4 The Tri-transitional Factor
        3.1.5 The Potential Base-pairing Factor
        3.1.6 The Nucleotide Proportional Factor
        3.1.7 The Potential Single-stranded Factor
        3.1.8 The Sequence Specific Score
        3.1.9 The Segmental Factor
        3.1.10 An Example of Feature Extraction
    3.2 Our Method with SVM
Chapter 4. Experimental Results
    4.1 The Source of Our Data and the Evaluation Criteria
    4.2 The Performance
        4.2.1 Parameter Searching
        4.2.2 Self-Consistency Test
        4.2.3 The Jackknife Test
Chapter 5. Conclusion
Bibliography


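Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: available on campus immediately, available off campus after one year. Availability: on campus, available; off campus, available. File: etd-0730108-234319.pdf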
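Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. To look up public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. Availability: available.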
