National Sun Yat-sen University, thesis/dissertation, Protein Contact Prediction Based on Protein Sequences

Title	Protein Contact Prediction Based on Protein Sequences
Department	Department of Computer Science and Engineering
Year, semester	Fall semester of Academic Year 100	Language	English
Degree	Master	Number of pages	57
Author	Dong-Jian Lin (林東建)
Advisor	Chang-Biau Yang (楊昌彪)
Convenor	Chung-Nan Lee (李宗南)
Advisory Committee	Kuo-Si Huang (黃國璽); Yow-Ling Shiue (薛佑玲)
Date of Exam	2011-08-31	Date of Submission	2011-09-06
Keywords	prediction, contact, support vector machine (SVM), k-nearest neighbor algorithm (KNN), penalized discriminant analysis (PDA)
Statistics	The thesis/dissertation has been browsed 5682 times and downloaded 1095 times.

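Abstract (translated from the Chinese)
The tertiary structure of a protein sustains its biological function, and protein folding underpins the tertiary structure. Bonding between amino acids influences how a protein folds and in turn stabilizes the protein structure. Protein contacts therefore link the structural composition of a protein to the analysis of its biological function. In this thesis, we propose a new method for predicting contacts between amino acids and use prediction accuracy to evaluate the experimental results. We apply three prediction tools, the support vector machine, the k-nearest neighbor algorithm, and penalized discriminant analysis, to self-testing on the training data, and then use the tool with the highest prediction accuracy (the support vector machine) to predict the testing data. The proteins in the training data are taken from PDB-REPRDB, and the proteins in the testing data are taken from previous work. The experimental results show that prediction for the three types of residue contacts reaches accuracies of 24.84%, 15.68%, and 8.23%, a significant improvement over random prediction (5.31%, 3.33%, and 1.12%).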
Abstract
The biological function of a protein is maintained mainly by its three-dimensional structure. Protein folds support this three-dimensional structure, and the inter-residue contacts in the protein affect both the formation of the folds and the stability of the structure. Protein contacts therefore play a critical role in building protein structures and analyzing biological functions. In this thesis, we propose a methodology to predict the residue-residue contacts of a target protein and develop a new measurement to evaluate the accuracy of prediction. We compare three prediction tools, the support vector machine (SVM), the k-nearest neighbor algorithm (KNN), and penalized discriminant analysis (PDA), by self-testing on the training set, which is derived from the database of representative protein chains from PDB (PDB-REPRDB). We then apply the best classifier (SVM) to a testing set of 173 protein chains taken from a previous study. The experimental results show that the accuracy of our prediction reaches 24.84%, 15.68%, and 8.23% for three categories of contacts, a large improvement over random prediction (5.31%, 3.33%, and 1.12%, respectively).
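The comparison against random prediction reported in the abstract can be illustrated with a minimal sketch. This record does not give the thesis's contact definition, its three contact categories, or its new measurement, so everything below is an assumption made for illustration: a contact category is modeled as a minimum sequence separation `min_sep`, accuracy is taken as the fraction of the `k` highest-scoring residue pairs that are true contacts, and the random baseline is the fraction of true contacts among all candidate pairs. The function names and the toy data are hypothetical and do not come from the thesis.

```python
from itertools import combinations


def candidate_pairs(length, min_sep):
    """All residue pairs (i, j), 0-based, with j - i >= min_sep (assumed category rule)."""
    return [(i, j) for i, j in combinations(range(length), 2) if j - i >= min_sep]


def top_k_accuracy(scores, true_contacts, k):
    """Fraction of the k highest-scoring pairs that are true contacts."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for pair in ranked if pair in true_contacts)
    return hits / len(ranked)


def random_baseline(length, min_sep, true_contacts):
    """Expected accuracy when candidate pairs are picked uniformly at random."""
    pairs = candidate_pairs(length, min_sep)
    if not pairs:
        return 0.0
    hits = sum(1 for pair in pairs if pair in true_contacts)
    return hits / len(pairs)


# Toy example with made-up numbers (not data from the thesis).
length, min_sep = 10, 6
true_contacts = {(0, 7), (1, 8), (2, 9)}
scores = {pair: 0.1 for pair in candidate_pairs(length, min_sep)}
scores[(0, 7)] = 0.9  # true contact ranked first
scores[(1, 8)] = 0.8  # true contact ranked second
scores[(0, 9)] = 0.7  # false positive ranked third

accuracy = top_k_accuracy(scores, true_contacts, k=3)
baseline = random_baseline(length, min_sep, true_contacts)
```

In this toy case two of the three top-ranked pairs are true contacts, while three of the ten candidate pairs are contacts overall, so the ranked prediction beats the random baseline. In practice the scores would come from a classifier such as the SVM named in the abstract.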

Table of Contents
ABSTRACT
Chapter 1. Introduction
Chapter 2. Preliminaries
  2.1 Definition of Contacts
  2.2 Position Specific Scoring Matrix
  2.3 Support Vector Machine
  2.4 The K-nearest Neighbor Algorithm
  2.5 Penalized Discriminant Analysis
  2.6 Previous Work
Chapter 3. Our Method
  3.1 Contact Extraction
  3.2 Feature Extraction
    3.2.1 The Sequence Separation
    3.2.2 The Amino Acid Composition
    3.2.3 The Position Specific Scoring Matrix
    3.2.4 Normalization of Features
  3.3 Evaluation of Classifiers
  3.4 Prediction of Contacts
  3.5 The Number of Ranked Predicted Contacts
Chapter 4. Experiments
  4.1 Datasets
    4.1.1 The Training Dataset
    4.1.2 The Testing Dataset
  4.2 Performance Evaluation
  4.3 Experimental Results
Chapter 5. Conclusion
BIBLIOGRAPHY


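Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: user-defined availability date. Available: Campus: available; Off-campus: available. etd-0906111-150445.pdf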
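Printed copies
Public information about printed copies is relatively complete from Academic Year 102 onward. To inquire about public information for printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available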

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team │ Mail │ Development and operations: Knowledge Innovation Division, LIS