National Sun Yat-sen University,thesis/dissertation,Prediction for the Essential Protein with the Support Vector Machine

Title	Prediction for the Essential Protein with the Support Vector Machine (以支持向量機為基礎之必要性蛋白質預測)
Department	Department of Computer Science and Engineering
Year, semester	Fall semester of Academic Year 100	Language	English
Degree	Master	Number of pages	67
Author	Zih-Jie Yang (楊子杰)
Advisor	Chang-Biau Yang (楊昌彪)
Convenor	Jen-Sen Lin (林振盛)
Advisory Committee	Yung-Hsing Peng (彭永興); Yow-Ling Shiue (薛佑玲)
Date of Exam	2011-08-31	Date of Submission	2011-09-06
Keywords	bioinformatics, essential protein, protein-protein interaction, support vector machine, feature set
Statistics	This thesis has been browsed 5738 times and downloaded 1713 times.

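Abstract (translated from Chinese)
Essential proteins have a profound influence on cellular life, yet they are hard to detect. Protein-protein interaction is one way to test whether a protein is essential. We note that many studies extract topological features from protein-protein interactions to predict essential proteins. However, the function of a protein is also a clue to its essentiality. In this thesis, we use sequence features that affect protein function, together with topological features and protein features, to build support vector machine models for predicting essential proteins. In our experiments, we download the Scere20070107 file from the DIP database, which contains 4873 proteins and 17166 interactions. In this file the ratio of essential to nonessential proteins is 1:4, which is considerably imbalanced. On the imbalanced data, the best F-measure, MCC, AIC and BIC obtained by our models are 0.5197, 0.4371, 0.2428 and 0.2543, respectively. We also build a balanced dataset with a 1:1 ratio. On the balanced data, the best F-measure, MCC, AIC and BIC obtained by our models are 0.7742, 0.5484, 0.3603 and 0.3828, respectively. Our results are better than those of previous methods.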
Abstract
Essential proteins deeply affect cellular life, but they are hard to identify. Protein-protein interaction (PPI) data offer one way to determine whether a protein is essential, and many researchers predict essential proteins with feature sets composed of topological properties of the PPI network. However, the function of a protein is also a clue to its essentiality. In this thesis, we build SVM models for predicting essential proteins from a feature set that combines sequence properties, which can influence protein function, with topological properties and protein properties. In our experiments, we use the Scere20070107 dataset downloaded from the DIP database, which contains 4873 proteins and 17166 interactions. The ratio of essential to nonessential proteins in this dataset is nearly 1:4, so it is imbalanced. On the imbalanced dataset, the best F-measure, MCC, AIC and BIC values of our models are 0.5197, 0.4671, 0.2428 and 0.2543, respectively. We also build a balanced dataset with a 1:1 ratio. On the balanced dataset, the best F-measure, MCC, AIC and BIC values of our models are 0.7742, 0.5484, 0.3603 and 0.3828, respectively. Our results are superior to all previous results under the various measurements.
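As an illustration of the two standard measures named in the abstract, the following minimal Python sketch computes the F-measure and the Matthews correlation coefficient (MCC) from confusion-matrix counts, and builds a 1:1 dataset by random undersampling. The counts, the function names, and the use of undersampling are assumptions made for illustration only; the abstract does not state how the balanced dataset was constructed, and the thesis's AIC and BIC measures are not reproduced here.

```python
import math
import random


def f_measure(tp, fp, fn):
    """F-measure: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def undersample(essential, nonessential, seed=0):
    """Hypothetical balancing step: sample both classes down to the smaller size."""
    rng = random.Random(seed)
    k = min(len(essential), len(nonessential))
    return rng.sample(essential, k), rng.sample(nonessential, k)


# Made-up counts with a 1:4 class ratio (100 essential, 400 nonessential).
tp, fn, fp, tn = 50, 50, 30, 370
print(f_measure(tp, fp, fn), mcc(tp, tn, fp, fn))

# Made-up protein identifiers, balanced to 1:1.
pos, neg = undersample(list(range(100)), list(range(100, 500)))
print(len(pos), len(neg))
```

Unlike accuracy, both measures stay informative when one class dominates, which is why they suit a dataset with a 1:4 class ratio.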

Table of Contents
ABSTRACT
Chapter 1. Introduction
Chapter 2. Preliminaries
  2.1 Support Vector Machine
  2.2 Database of Protein and PPI
  2.3 Position Specific Scoring Matrix
  2.4 Topological Properties
    2.4.1 Degree
    2.4.2 Bottleneck (BN)
    2.4.3 Edge Percolated Component (EPC)
    2.4.4 Maximum Neighborhood Component (MNC)
    2.4.5 Density of Maximum Neighborhood Component (DMNC)
    2.4.6 Neighbors' Intra-degree (NID)
    2.4.7 Clustering Coefficient (CCo)
    2.4.8 Betweenness Centrality (BC)
    2.4.9 Closeness Centrality (CC)
    2.4.10 Clique Level (KL)
  2.5 Methods for Essential Protein Prediction
    2.5.1 Score Ranking
    2.5.2 Prediction by Classifiers
Chapter 3. Our Method
  3.1 Feature Extraction
    3.1.1 Topological Properties
    3.1.2 Bit String Implementation of Double Screening Scheme
    3.1.3 Protein Properties
    3.1.4 Sequence Properties
    3.1.5 Other Properties
  3.2 Our Method with SVM
Chapter 4. Experimental Results
  4.1 PPI Dataset
  4.2 Data Balancing
  4.3 Performance Evaluation
  4.4 Experimental Results and Comparison
Chapter 5. Conclusion and Future Work
BIBLIOGRAPHY


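Fulltext
This electronic full text is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law. Thesis access permission: user-defined availability date. Available: Campus: available; Off-campus: available. etd-0906111-094417.pdf
Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. To inquire about public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available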

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS