National Sun Yat-sen University, thesis/dissertation, 生物序列近似比對之演算法, Algorithms for Near-optimal Alignment Problems on Biosequences

Title	生物序列近似比對之演算法 Algorithms for Near-optimal Alignment Problems on Biosequences
Department	Department of Computer Science and Engineering
Year, semester	Spring semester of Academic Year 96	Language	English
Degree	Ph.D.	Number of pages	140
Author	曾國尊 Kuo-Tsung Tseng
Advisor	楊昌彪 Chang-Biau Yang
Convenor	張貿翔 M. S. Chang
Advisory Committee	戴顯權 S. C. Tai; 吳邦一 Bang-Ye Wu; 王炳豐 Biing-Feng Wang; 林耀鈴 Yaw-Ling Lin; 李宗南 Chung-Nan Lee
Date of Exam	2007-06-09	Date of Submission	2008-08-26
Keywords	LCS, protein, biosequence, algorithm, near-optimal
Statistics	The thesis/dissertation has been browsed 5803 times and downloaded 1104 times.

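Chinese Abstract
This dissertation studies algorithms for near-optimal alignment of biosequences. Biosequence alignment has long been an important problem in bioinformatics. It aims to use alignments to reveal the similarities and differences among the sequences of organisms, in the hope of better understanding how the proteins in living organisms are structured or how they are related by evolution. The accuracy of biosequence alignment, however, has long been criticized by biologists, chiefly because the scoring schemes used in alignment are not entirely correct. This dissertation therefore first assumes that the scoring scheme is correct and seeks, among the alignments with the same score, those that are more biologically meaningful. It then extends the problem to alignments whose scores are not equal to the optimum but close to it, and asks whether some of them are more biologically meaningful. Finally, since disagreement among scoring schemes may leave users at a loss, it discusses how to obtain the so-called best alignment when several scoring schemes are used at the same time. To improve the accuracy of biosequence alignment, the dissertation enlarges the problem step by step, discussing and solving each version. Most of the problems are solved in quadratic time, and the remainder in cubic time. Future work may reduce the time and space requirements, or examine the accuracy of the alignments in greater depth.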
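The first problem described in this abstract rests on the fact that several alignments can share the optimal score. The following is a minimal sketch of that fact, not the dissertation's algorithm: it assumes a simple match/mismatch/linear-gap scoring (the function name `count_optimal_alignments` and the default scores are illustrative assumptions) and counts the co-optimal global alignments of two sequences with a quadratic-time dynamic program.

```python
def count_optimal_alignments(a, b, match=1, mismatch=-1, gap=-1):
    """Return (optimal_score, number_of_optimal_global_alignments).

    Alignments are counted as distinct paths in the alignment lattice.
    The scoring values are illustrative assumptions, not a real
    substitution matrix.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    count = [[0] * (m + 1) for _ in range(n + 1)]
    count[0][0] = 1

    # Borders: a prefix aligned against nothing has exactly one alignment.
    for i in range(1, n + 1):
        score[i][0] = i * gap
        count[i][0] = 1
    for j in range(1, m + 1):
        score[0][j] = j * gap
        count[0][j] = 1

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + sub, count[i - 1][j - 1]),  # (mis)match
                (score[i - 1][j] + gap, count[i - 1][j]),          # gap in b
                (score[i][j - 1] + gap, count[i][j - 1]),          # gap in a
            ]
            best = max(s for s, _ in candidates)
            score[i][j] = best
            # Every predecessor that ties for the best score contributes paths.
            count[i][j] = sum(c for s, c in candidates if s == best)

    return score[n][m], count[n][m]


if __name__ == "__main__":
    print(count_optimal_alignments("AA", "A"))  # (0, 2)
```

With these assumed scores, `"AA"` against `"A"` has two optimal alignments (the single `A` can pair with either `A`), which is the kind of tie that an extra criterion would have to break.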
Abstract
With advances in biological techniques, the amount of biosequence data, such as DNA, RNA, and protein sequences, is growing explosively. It is almost impossible to handle such a huge amount of data by manpower alone, so great computing power is essential. Biosequence data can be processed in several ways: finding identical biosequences, searching for similar biosequences, or mining the signatures of biosequences. All of these are based on the same family of problems, the biosequence alignment problems. In this dissertation, we study the biosequence alignment problems with the aim of raising the biological meaning of optimal or near-optimal alignments, since biologists and computer scientists sometimes dispute the biological meaning of the mathematically optimal alignment obtained under a given scoring function.
We first study methods for improving the optimal alignment of two given biosequences. The optimal alignment is usually not unique, so there should be a best one among the optimal alignments. We try to extract it by defining additional criteria that judge the goodness of alignments when the traditional methods cannot decide which is better. Two algorithms are proposed for the newly defined biosequence alignment problems, namely the smoothest optimal alignment problem and the most conserved optimal alignment problem. Some other criteria are also discussed, since most of them can be handled in a similar way.
We then observe that the most biologically meaningful alignment may not be an optimal one, since no scoring matrix is perfect. We look for candidates among the near-optimal alignments: we present a tracing marking function to obtain all near-optimal alignments and use the "most conserved" criterion to filter them. This is named the near-optimal block alignment (NBA) problem.
Finally, since existing scoring matrices are known to be imperfect, we consider how to choose the winner when multiple scoring matrices are applied, and we define some reasonable schemes for deciding the winner alignment. In this dissertation, we discuss and solve algorithms for near-optimal alignment problems on biosequences. In the future, we would like to perform experiments to support or reject these concepts.
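The abstract does not spell out its tracing marking function, so the following is only a sketch of a standard forward/backward dynamic-programming technique for the same general idea, under stated assumptions. It marks every lattice cell that lies on at least one global alignment scoring within `delta` of the optimum. The function names, the simple match/mismatch/linear-gap scoring, and the parameter `delta` are illustrative assumptions rather than the dissertation's definitions.

```python
def forward_scores(a, b, match=1, mismatch=-1, gap=-1):
    """f[i][j] = best global alignment score of a[:i] against b[:j]."""
    n, m = len(a), len(b)
    f = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        f[i][0] = i * gap
    for j in range(1, m + 1):
        f[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            f[i][j] = max(f[i - 1][j - 1] + sub,
                          f[i - 1][j] + gap,
                          f[i][j - 1] + gap)
    return f


def near_optimal_cells(a, b, delta, **scoring):
    """Return (optimal_score, set of lattice cells on a near-optimal path).

    A cell (i, j) is marked when the best alignment forced through it
    scores at least optimal_score - delta.
    """
    n, m = len(a), len(b)
    f = forward_scores(a, b, **scoring)
    # Backward scores come from aligning the reversed sequences:
    # the best score of a[i:] against b[j:] is r[n - i][m - j].
    r = forward_scores(a[::-1], b[::-1], **scoring)
    opt = f[n][m]
    marked = {
        (i, j)
        for i in range(n + 1)
        for j in range(m + 1)
        if f[i][j] + r[n - i][m - j] >= opt - delta
    }
    return opt, marked


if __name__ == "__main__":
    opt, cells = near_optimal_cells("AA", "A", delta=0)
    print(opt, sorted(cells))  # 0 [(0, 0), (1, 0), (1, 1), (2, 1)]
```

With `delta=0` only cells on optimal paths are marked. Raising `delta` widens the marked region, and a secondary criterion such as conservation could then be applied to the alignments inside it.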

Table of Contents
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS
LIST OF ABBREVIATION
ABSTRACT
1 Introduction
2 Preliminaries
  2.1 Notations
  2.2 The Longest Common Subsequence Problem
    2.2.1 Dynamic Programming Algorithm for 2-LCS
    2.2.2 Linear Space Algorithm for 2-LCS
  2.3 The Sequence Alignment Problem
    2.3.1 Near-optimal Alignment
    2.3.2 Sequence Alignment Problem with Multiple Scoring Matrices
3 The Better Alignment among the Output Alignments
  3.1 Motivation
  3.2 The Newly Defined Biosequence Alignment Problems
    3.2.1 The Smoothest Optimal Alignment
    3.2.2 The Most Conserved Optimal Alignment
    3.2.3 The Miscellaneous Reasonable Optimal Alignments
  3.3 The Algorithms for Solving the Newly Defined Problems
    3.3.1 An Algorithm for the Smoothest Optimal Alignment
    3.3.2 An Algorithm for the Most Conserved Optimal Alignment
  3.4 Summary and Discussion
4 Near-optimal Block Alignment
  4.1 Motivation
  4.2 Tracings in the Alignment Lattice
  4.3 An Algorithm for Near-optimal Block Alignment
  4.4 Comparing NBA with Affine Gap Penalty
  4.5 Summary and Discussion
5 Finding Winner Alignments with Multiple Scoring Matrices
  5.1 Motivation
  5.2 Losing Score Lattice
  5.3 Finding the Winner Alignment
  5.4 Variants of the Comparing Function
  5.5 Summary and Discussion
6 Conclusions
BIBLIOGRAPHY
INDEX


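Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law. Thesis access permission: unrestricted (fully open on and off campus). Available: Campus: available; Off-campus: available. etd-0826108-182735.pdf
Printed copies
Public access information for printed theses is relatively complete from academic year 102 onward. To look up public access information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available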

Office of Library and Information Services, National Sun Yat-sen University │ Contact us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS