博碩士論文 etd-0906111-150445 詳細資訊
Title page for etd-0906111-150445
論文名稱
Title
利用蛋白質序列資訊之胺基酸接觸狀態預測
Protein Contact Prediction Based on Protein Sequences
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
57
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2011-08-31
繳交日期
Date of Submission
2011-09-06
關鍵字
Keywords
懲罰判別分析、預測、K-最近鄰居法、胺基酸接觸狀態、支持向量機
prediction, contact, SVM, KNN, PDA
統計
Statistics
本論文已被瀏覽 5682 次,被下載 1095 次
This thesis has been browsed 5682 times and downloaded 1095 times.
中文摘要
蛋白質的三級結構維繫著其生物功能,而蛋白質折疊支撐著三級結構;胺基酸之間的鍵結影響蛋白質折疊的形成,進而穩定蛋白質結構。因此,蛋白質接觸狀態聯繫著蛋白質的結構組成和生物功能分析。在本篇論文,我們提出一個新的方法來預測胺基酸之間的接觸狀態,並且利用預測準確率來評估實驗結果。我們使用三種預測工具:支持向量機、K-最近鄰居法和懲罰判別分析,分別對訓練資料進行自我測試,並取出預測準確度最高的預測工具 (支持向量機) 來執行測試資料的預測;其中訓練資料的蛋白質取自PDB-REPRDB,測試資料的蛋白質取自前人的研究。實驗結果表明,三種胺基酸接觸狀態的預測分別達到24.84%、15.68%和8.23%的準確度,與隨機預測準確度比較之下 (5.31%、3.33%和1.12%),有顯著的提升。
Abstract
The biological function of a protein is maintained mainly by its three-dimensional structure. Protein folds support that structure, and the inter-residue contacts within the protein influence both the formation of the folds and the stability of the structure. Protein contacts therefore play a critical role in building protein structures and analyzing biological functions. In this thesis, we propose a methodology to predict the residue-residue contacts of a target protein and develop a new measurement to evaluate prediction accuracy. We compare three classifiers, the support vector machine (SVM), the k-nearest neighbor algorithm (KNN), and penalized discriminant analysis (PDA), by self-testing on the training set, which is derived from the representative protein chains of the PDB (PDB-REPRDB). We then apply the best classifier (SVM) to a testing set of 173 protein chains taken from a previous study. The experimental results show that our prediction accuracy reaches 24.84%, 15.68%, and 8.23% for the three categories of contacts, a substantial improvement over random prediction (5.31%, 3.33%, and 1.12%, respectively).
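The accuracy figures in the abstract compare ranked predictions against a random baseline. This record does not define the measurement or the three contact categories, so the following is only a minimal sketch under assumed definitions: accuracy is taken as the fraction of the k top-scoring residue pairs that are true contacts, and the random baseline as the fraction of true contacts among all candidate pairs. The function names, the toy scores, and the choice of k are hypothetical.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]  # (residue index i, residue index j), with i < j


def top_k_accuracy(scores: Dict[Pair, float], true_contacts: Set[Pair], k: int) -> float:
    """Fraction of the k highest-scoring residue pairs that are true contacts."""
    if k <= 0 or not scores:
        return 0.0
    ranked = sorted(scores, key=lambda pair: scores[pair], reverse=True)[:k]
    hits = sum(1 for pair in ranked if pair in true_contacts)
    return hits / len(ranked)


def random_baseline(candidates: Set[Pair], true_contacts: Set[Pair]) -> float:
    """Expected accuracy when candidate pairs are picked uniformly at random."""
    if not candidates:
        return 0.0
    return len(candidates & true_contacts) / len(candidates)


# Toy example (hypothetical classifier scores, not data from the thesis).
scores = {(1, 8): 0.9, (2, 10): 0.8, (3, 12): 0.4, (4, 15): 0.3, (5, 20): 0.1}
true_contacts = {(1, 8), (3, 12)}

acc = top_k_accuracy(scores, true_contacts, k=2)
base = random_baseline(set(scores), true_contacts)
print(f"top-2 accuracy = {acc:.2f}, random baseline = {base:.2f}")
```

By the abstract's own figures, the reported accuracies are roughly 4.7, 4.7, and 7.3 times the corresponding random baselines.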
目次 Table of Contents
ABSTRACT (p. 0)
Chapter 1. Introduction (p. 1)
Chapter 2. Preliminaries (p. 6)
  2.1 Definition of Contacts (p. 6)
  2.2 Position Specific Scoring Matrix (p. 8)
  2.3 Support Vector Machine (p. 9)
  2.4 The K-nearest Neighbor Algorithm (p. 13)
  2.5 Penalized Discriminant Analysis (p. 15)
  2.6 Previous Work (p. 16)
Chapter 3. Our Method (p. 19)
  3.1 Contact Extraction (p. 19)
  3.2 Feature Extraction (p. 20)
    3.2.1 The Sequence Separation (p. 20)
    3.2.2 The Amino Acid Composition (p. 20)
    3.2.3 The Position Specific Scoring Matrix (p. 21)
    3.2.4 Normalization of Features (p. 23)
  3.3 Evaluation of Classifiers (p. 23)
  3.4 Prediction of Contacts (p. 25)
  3.5 The Number of Ranked Predicted Contacts (p. 27)
Chapter 4. Experiments (p. 29)
  4.1 Datasets (p. 29)
    4.1.1 The Training Dataset (p. 29)
    4.1.2 The Testing Dataset (p. 33)
  4.2 Performance Evaluation (p. 34)
  4.3 Experimental Results (p. 35)
Chapter 5. Conclusion (p. 39)
BIBLIOGRAPHY (p. 40)
電子全文 Fulltext
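This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.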
論文使用權限 Thesis access permission:自定論文開放時間 user define
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
開放時間 Available: 已公開 available
