Title page for thesis/dissertation etd-0730108-234319
Title
利用SVM 提升RNA 二級結構預測準確度之方法
Accuracy Improvement for RNA Secondary Structure Prediction with SVM
Department
Year, semester
Language
Degree
Number of pages
60
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2008-07-14
Date of Submission
2008-07-30
Keywords
RNA, secondary structure, support vector machine, machine learning, classification
Statistics
This thesis/dissertation has been viewed 5659 times and downloaded 801 times.
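Chinese Abstract (English translation)
Ribonucleic acid (RNA) is an important genetic material found throughout living organisms. Unlike deoxyribonucleic acid (DNA), it is a single-stranded long-chain molecule. In aqueous solution and inside organisms, nucleotides along the chain bond to one another through hydrogen bonding, forming a secondary structure of intramolecular helices. The simplest RNA secondary structure is called a nested structure. Some RNAs, however, form more complex secondary structures, which we call pseudoknotted structures. Many existing RNA secondary structure prediction programs have limited predictive power, and each has its own strengths. The main goal of this study is to integrate existing prediction tools to raise the overall accuracy of RNA secondary structure prediction. We propose an RNA sequence analysis method as a preprocessing step for choosing a tool to predict RNA secondary structure. The method relies mainly on the classification ability of a support vector machine to select, in advance, the prediction tool suited to the sequence. The RNA sequence data used in this study come from two databases, PseudoBase and RNA SSTRAND. Using cross-validation, we tested a total of 723 real RNA sequences. The experimental results show that we not only raised the overall prediction accuracy but also improved both the sensitivity and the selectivity of prediction.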
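As an illustration of the two measures reported alongside overall accuracy, the following minimal Python sketch computes sensitivity and selectivity for a predicted set of base pairs. The definitions used here (sensitivity = TP / (TP + FN), selectivity = TP / (TP + FP)) are the ones commonly used for RNA structure prediction and are an assumption; the record does not state the thesis's exact formulas. The function name `base_pair_metrics` and the toy base pairs are hypothetical.

```python
from typing import Set, Tuple

BasePair = Tuple[int, int]  # (i, j) positions of two paired nucleotides


def base_pair_metrics(reference: Set[BasePair],
                      predicted: Set[BasePair]) -> Tuple[float, float]:
    """Return (sensitivity, selectivity) of a predicted base-pair set.

    Assumed definitions (not taken from the thesis):
      sensitivity = TP / (TP + FN)  -- share of true pairs that were found
      selectivity = TP / (TP + FP)  -- share of predicted pairs that are true
    """
    tp = len(reference & predicted)   # pairs in both sets
    fn = len(reference - predicted)   # true pairs that were missed
    fp = len(predicted - reference)   # predicted pairs that are wrong
    sensitivity = tp / (tp + fn) if reference else 0.0
    selectivity = tp / (tp + fp) if predicted else 0.0
    return sensitivity, selectivity


# Toy data, invented for illustration only.
reference = {(0, 9), (1, 8), (2, 7), (3, 6)}
predicted = {(0, 9), (1, 8), (4, 6)}
sensitivity, selectivity = base_pair_metrics(reference, predicted)
print(f"sensitivity={sensitivity:.3f} selectivity={selectivity:.3f}")
```

Here two of the four reference pairs are recovered and one of the three predicted pairs is wrong, so the sketch reports a sensitivity of 0.5 and a selectivity of 2/3.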
Abstract
Ribonucleic acid (RNA) sometimes folds into a complex structure called a pseudoknot. Prediction of RNA secondary structures has drawn much attention from both biologists and computer scientists, and many useful tools have been developed for RNA secondary structure prediction, with or without pseudoknots. Each of these tools has its own strengths and weaknesses. We therefore propose a hybrid feature extraction method that integrates two prediction tools, pknotsRG and NUPACK, with a support vector machine (SVM). We first extract useful features from the target RNA sequence and then use SVM classification to decide which prediction tool is preferable for it. Our test data set contains 723 RNA sequences: 202 pseudoknotted sequences obtained from PseudoBase and 521 nested sequences obtained from RNA SSTRAND. Experimental results show that our method improves not only the overall accuracy but also the sensitivity and the selectivity of prediction for the target sequences. Our method serves as a preprocessing step for analyzing RNA sequences before an RNA secondary structure prediction tool is applied. Our main contribution is the ability to combine existing methods so that the prediction tools become more accurate.
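The two-step idea in the abstract (extract features from a sequence, then let a classifier pick the prediction tool) might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the thesis's implementation: the two feature groups shown (nucleotide fractions and adjacent-pair fractions) are generic stand-ins whose exact definitions in the thesis are not given on this page, and `placeholder_classifier` is a hand-written rule that merely takes the place of a trained SVM. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Callable, List, Sequence

NUCLEOTIDES = "ACGU"


def composition(seq: str) -> List[float]:
    """Fraction of each nucleotide (A, C, G, U) in the sequence."""
    counts = Counter(seq)
    return [counts[base] / len(seq) for base in NUCLEOTIDES]


def dinucleotide_frequencies(seq: str) -> List[float]:
    """Fraction of each of the 16 possible adjacent nucleotide pairs."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)  # avoid division by zero for length-1 input
    return [pairs[a + b] / total for a, b in product(NUCLEOTIDES, repeat=2)]


def extract_features(seq: str) -> List[float]:
    """Build a 20-dimensional feature vector (4 + 16 values).

    Assumption: these generic features only illustrate the idea; the
    thesis's own factors are not specified on this page.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input
    if not seq:
        raise ValueError("empty sequence")
    return composition(seq) + dinucleotide_frequencies(seq)


def choose_tool(seq: str,
                classify: Callable[[List[float]], int],
                tools: Sequence[str] = ("pknotsRG", "NUPACK")) -> str:
    """Return the name of the tool selected by `classify` for `seq`.

    `classify` maps a feature vector to an index into `tools`; in practice
    it would wrap a trained SVM.
    """
    return tools[classify(extract_features(seq))]


def placeholder_classifier(features: List[float]) -> int:
    """Hypothetical stand-in for a trained SVM, NOT a real decision rule."""
    gc_fraction = features[1] + features[2]  # C and G fractions
    return 0 if gc_fraction > 0.5 else 1


print(choose_tool("GGCAU", placeholder_classifier))
```

With a trained model, only the `classify` argument would change. For example, a fitted scikit-learn `SVC` (which is built on LIBSVM) named `clf` could be plugged in as `lambda f: int(clf.predict([f])[0])`, assuming its labels were encoded as indices into `tools`.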
Table of Contents
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
Chapter 1. Introduction
Chapter 2. Preliminary
2.1 Prediction of RNA Secondary Structure with Dynamic Programming
2.1.1 pknotsRG
2.1.2 NUPACK
2.2 The Support Vector Machine (SVM)
2.2.1 Kernel Functions
2.2.2 Soft Margin
2.2.3 LIBSVM
Chapter 3. Effective Features for RNA Structure Prediction
3.1 Feature Extraction
3.1.1 The Compositional Factor
3.1.2 The Bi-transitional Factor
3.1.3 The Distributional Factor
3.1.4 The Tri-transitional Factor
3.1.5 The Potential Base-pairing Factor
3.1.6 The Nucleotide Proportional Factor
3.1.7 The Potential Single-stranded Factor
3.1.8 The Sequence Specific Score
3.1.9 The Segmental Factor
3.1.10 An Example of Feature Extraction
3.2 Our Method with SVM
Chapter 4. Experimental Results
4.1 The Source of Our Data and the Evaluation Criteria
4.2 The Performance
4.2.1 Parameter Searching
4.2.2 Self-Consistency Test
4.2.3 The Jackknife Test
Chapter 5. Conclusion
BIBLIOGRAPHY
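Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid breaking the law.
Thesis access permission: on campus, available immediately; off campus, withheld for one year
Available:
Campus: available
Off-campus: available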


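Printed copies
Public-access information for printed theses is relatively complete from academic year 102 onward. To look up public-access information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available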
