博碩士論文 etd-0906111-094417 詳細資訊
Title page for etd-0906111-094417
論文名稱
Title
以支持向量機為基礎之必要性蛋白質預測
Prediction for the Essential Protein with the Support Vector Machine
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
67
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2011-08-31
繳交日期
Date of Submission
2011-09-06
關鍵字
Keywords
生物資訊、必要性蛋白質、蛋白質交互作用、支持向量機、特徵集
bioinformatics, essential protein, protein-protein interaction, support vector machine, feature set
統計
Statistics
本論文已被瀏覽 5738 次,被下載 1713
This thesis/dissertation has been browsed 5738 times and downloaded 1713 times.
中文摘要
必要性蛋白質對細胞生命影響非常深,但我們很難去偵測必要性蛋白質。蛋白質交互作用為其中一種檢測蛋白質是否為必要性蛋白質的方法。我們注意到很多研究方法從蛋白質交互作用擷取拓樸的特徵去預測必要性蛋白質。然而,蛋白質的功能也是一條線索去決定他的必要性。在本篇論文中,我們利用影響蛋白質功能的序列特徵、拓樸和蛋白質特徵去建立支持向量機的模型來預測必要性蛋白質。在我們的實驗中,我們從DIP資料庫中下載Scere20070107檔案,其中包含了4873條蛋白質和17166個交互作用。在此檔案中必要性蛋白質和非必要性蛋白質的比例相當不平衡為1:4。在不平衡的資料中,我們的模型得到最好的F-measure、MCC、AIC和BIC分別為0.5197、0.4371、0.2428和0.2543。我們另外建立了比例為1:1的平衡資料。在平衡資料中,我們的模型得到最好的F-measure、MCC、AIC和BIC分別為0.7742、0.5484、0.3603和0.3828。我們的研究結果均優於以前的研究方法與結果 。
Abstract
Essential proteins profoundly affect cellular life, but they are hard to identify. Protein-protein interaction (PPI) data offer one way to determine whether a protein is essential. Many researchers predict essential proteins with feature sets composed of topological properties derived from PPI networks. However, the function of a protein is also a clue to its essentiality. In this thesis, we build support vector machine (SVM) models for predicting essential proteins, using a feature set that contains sequence properties that can influence protein function, topological properties, and protein properties. In our experiments, we download the Scere20070107 file, which contains 4873 proteins and 17166 interactions, from the DIP database. The ratio of essential to nonessential proteins is nearly 1:4, so the dataset is imbalanced. On the imbalanced dataset, the best values of F-measure, MCC, AIC and BIC of our models are 0.5197, 0.4671, 0.2428 and 0.2543, respectively. We also build a balanced dataset with a ratio of 1:1. On the balanced dataset, the best values of F-measure, MCC, AIC and BIC of our models are 0.7742, 0.5484, 0.3603 and 0.3828, respectively. Our results are superior to all previous results across the various measures.
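Two of the measures reported above, F-measure and MCC (Matthews correlation coefficient), are both computed from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions and is not taken from the thesis. The function names and the example counts are hypothetical and chosen only to mimic a roughly 1:4 class ratio. AIC and BIC are omitted because the abstract does not state how they are defined.

```python
import math


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F-measure (F1): harmonic mean of precision and recall."""
    if tp == 0:
        # No true positives: precision and/or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: return 0 when any row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts: 100 essential and 400 nonessential proteins.
    tp, fn, fp, tn = 50, 50, 40, 360
    print(f"F-measure = {f_measure(tp, fp, fn):.4f}")
    print(f"MCC       = {mcc(tp, tn, fp, fn):.4f}")
```

On an imbalanced dataset, a classifier that labels every protein as nonessential reaches high accuracy but scores 0 on both functions, which is one reason such measures are preferred over plain accuracy when the classes are imbalanced.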
目次 Table of Contents
ABSTRACT
Chapter 1. Introduction
Chapter 2. Preliminaries
2.1 Support Vector Machine
2.2 Database of Protein and PPI
2.3 Position Specific Scoring Matrix
2.4 Topological Properties
2.4.1 Degree
2.4.2 Bottleneck (BN)
2.4.3 Edge Percolated Component (EPC)
2.4.4 Maximum Neighborhood Component (MNC)
2.4.5 Density of Maximum Neighborhood Component (DMNC)
2.4.6 Neighbors’ Intra-degree (NID)
2.4.7 Clustering Coefficient (CCo)
2.4.8 Betweenness Centrality (BC)
2.4.9 Closeness Centrality (CC)
2.4.10 Clique Level (KL)
2.5 Methods for Essential Protein Prediction
2.5.1 Score Ranking
2.5.2 Prediction by Classifiers
Chapter 3. Our Method
3.1 Feature Extraction
3.1.1 Topological Properties
3.1.2 Bit String Implementation of Double Screening Scheme
3.1.3 Protein Properties
3.1.4 Sequence Properties
3.1.5 Other Properties
3.2 Our Method with SVM
Chapter 4. Experimental Results
4.1 PPI Dataset
4.2 Data Balancing
4.3 Performance Evaluation
4.4 Experimental Results and Comparison
Chapter 5. Conclusion and Future Work
BIBLIOGRAPHY
電子全文 Fulltext
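This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.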
論文使用權限 Thesis access permission: 自定論文開放時間 user-defined release time
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
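Public-access information for printed theses is relatively complete from academic year 102 onward. To inquire about public-access information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available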
