博碩士論文 etd-0826108-182735 詳細資訊
Title page for etd-0826108-182735
論文名稱
Title
生物序列近似比對之演算法
Algorithms for Near-optimal Alignment Problems on Biosequences
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
140
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2007-06-09
繳交日期
Date of Submission
2008-08-26
關鍵字
Keywords
演算法、蛋白質、近似、生物序列
LCS, protein, biosequence, algorithm, near-optimal
統計
Statistics
本論文已被瀏覽 5803 次,被下載 1104
This thesis/dissertation has been browsed 5803 times and downloaded 1104 times.
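中文摘要 Chinese Abstract
This dissertation studies algorithms for near-optimal alignment of biosequences. Biosequence alignment has long been an important problem in bioinformatics. It aims to use alignments to identify the similarities and differences among the sequences of organisms, in order to better understand how protein structures form or how they are related by evolution.
However, the accuracy of biosequence alignment has long been criticized by biologists, mainly because the scoring schemes used in alignment are not entirely correct. This dissertation therefore first assumes that the scoring scheme is correct and seeks, among the alignments with the same score, the one that is more biologically meaningful. Next, the problem is extended to ask whether, among alignments whose scores are not equal to but close to the highest score, some are more biologically meaningful. Finally, since the variety of scoring schemes may leave users at a loss, this dissertation also discusses how to obtain the so-called best alignment when multiple scoring schemes are used at the same time.
To improve the accuracy of biosequence alignment, this dissertation enlarges the problem step by step, discussing and solving each stage in turn. Most of the problems are solved in quadratic time, and the rest in cubic time. Future work may reduce the time and space requirements, or explore the accuracy of the alignments in more depth.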
Abstract
With the improvement of biological techniques, the amount of biosequence
data, such as DNA, RNA and protein sequences, is growing explosively.
It is almost impossible to handle such a huge amount of data purely by manpower,
so great computing power is essential.
There are several ways to treat biosequence data: finding identical biosequences,
searching for similar biosequences, or mining the signatures of biosequences.
All of these are based on the same problem, the biosequence alignment
problem.
In this dissertation, we study biosequence alignment problems in order to
raise the biological meaning of optimal or near-optimal alignments, since
biologists and computer scientists sometimes dispute
the biological meaning of the mathematically optimal alignment
obtained from a given scoring function.

We first study methods to improve the optimal alignment of two given
biosequences. Since the optimal alignment is usually not unique, there
should exist a best one among the optimal alignments, and we try to
extract it by defining additional criteria to judge the goodness of
alignments when the traditional methods cannot decide which one is better.
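
To illustrate why the optimal alignment is usually not unique, the following
minimal sketch counts the optimal global alignments of two sequences. It is
not the dissertation's algorithm. It assumes a simple linear scoring scheme
(the `match`, `mismatch` and `gap` values are illustrative), and it counts
every distinct path through the dynamic programming table as a separate
alignment.

```python
def count_optimal_alignments(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best_score, number_of_optimal_global_alignments).

    Assumed scoring (illustrative only): +match / +mismatch per aligned
    pair and +gap per gap symbol. Each distinct path through the table
    is counted as one alignment.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    ways = [[0] * (m + 1) for _ in range(n + 1)]
    ways[0][0] = 1
    for i in range(1, n + 1):          # first column: only gaps in b
        score[i][0] = i * gap
        ways[i][0] = 1
    for j in range(1, m + 1):          # first row: only gaps in a
        score[0][j] = j * gap
        ways[0][j] = 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + sub, ways[i - 1][j - 1]),  # pair a[i-1], b[j-1]
                (score[i - 1][j] + gap, ways[i - 1][j]),          # gap in b
                (score[i][j - 1] + gap, ways[i][j - 1]),          # gap in a
            ]
            best = max(s for s, _ in candidates)
            score[i][j] = best
            # Sum the path counts of every predecessor that attains the best score.
            ways[i][j] = sum(w for s, w in candidates if s == best)
    return score[n][m], ways[n][m]


print(count_optimal_alignments("AA", "A"))  # (0, 2)
```

Under this assumed scoring, even "AA" against "A" has two optimal alignments
(the single "A" may pair with either position), so an additional criterion is
needed to prefer one of them.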

Two algorithms are proposed for solving the newly defined biosequence
alignment problems: the smoothest optimal alignment problem and the most
conserved optimal alignment problem. Some other criteria are also discussed,
since most of them can be handled in a similar way.

We then observe that the most biologically meaningful alignment may not
be the optimal one, since there is no perfect scoring matrix. We therefore
look for candidates among the near-optimal alignments: we present a tracing
marking function to obtain all near-optimal alignments and use the criterion
"the most conserved" to filter them. This is named the
near-optimal block alignment (NBA) problem.
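
One common way to locate near-optimal alignments, sketched below, combines a
forward and a backward dynamic programming pass: a lattice cell lies on some
alignment scoring within `delta` of the optimum exactly when its best prefix
score plus its best suffix score reaches that bound. This is a generic
illustration and not the tracing marking function of the dissertation. The
scoring values and the name `near_optimal_cells` are assumptions.

```python
def near_optimal_cells(a, b, delta, match=1, mismatch=-1, gap=-1):
    """Return (optimal_score, set of lattice cells (i, j) that lie on at
    least one global alignment scoring >= optimal_score - delta).

    Assumed linear scoring; illustrative only.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i] == b[j] else mismatch

    # fwd[i][j]: best score for aligning a[:i] with b[:j].
    fwd = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0:
                cands.append(fwd[i - 1][j - 1] + sub(i - 1, j - 1))
            if i > 0:
                cands.append(fwd[i - 1][j] + gap)
            if j > 0:
                cands.append(fwd[i][j - 1] + gap)
            fwd[i][j] = max(cands)

    # bwd[i][j]: best score for aligning a[i:] with b[j:].
    bwd = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            cands = []
            if i < n and j < m:
                cands.append(bwd[i + 1][j + 1] + sub(i, j))
            if i < n:
                cands.append(bwd[i + 1][j] + gap)
            if j < m:
                cands.append(bwd[i][j + 1] + gap)
            bwd[i][j] = max(cands)

    optimum = fwd[n][m]
    cells = {
        (i, j)
        for i in range(n + 1)
        for j in range(m + 1)
        if fwd[i][j] + bwd[i][j] >= optimum - delta
    }
    return optimum, cells


opt, cells = near_optimal_cells("AA", "A", delta=0)
print(opt, sorted(cells))  # 0 [(0, 0), (1, 0), (1, 1), (2, 1)]
```

With `delta=0` only cells on optimal alignments are marked. Increasing
`delta` widens the marked region, and a secondary criterion can then be
applied to the alignments that stay inside it.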

Finally, since existing scoring matrices are known to be imperfect,
we consider how to choose the winner
when multiple scoring matrices are applied. We define some
reasonable schemes to decide the winner alignment.
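
The abstract does not spell out the schemes, so the sketch below uses one
hypothetical rule purely for illustration: among a given set of candidate
alignments, pick the one whose worst shortfall from the optimum, taken over
all scoring schemes, is smallest. The scheme tuples `(match, mismatch, gap)`
and the names `optimal_score`, `alignment_score` and `pick_winner` are
assumptions and do not come from the dissertation.

```python
def optimal_score(a, b, scheme):
    """Best global alignment score of a and b under (match, mismatch, gap)."""
    match, mismatch, gap = scheme
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def alignment_score(row_a, row_b, scheme):
    """Score of one given alignment, written as two gapped rows ('-' = gap)."""
    match, mismatch, gap = scheme
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total


def pick_winner(a, b, candidates, schemes):
    """Hypothetical rule: minimise the largest shortfall over all schemes."""
    optima = [optimal_score(a, b, s) for s in schemes]

    def worst_shortfall(alignment):
        row_a, row_b = alignment
        return max(
            opt - alignment_score(row_a, row_b, s)
            for opt, s in zip(optima, schemes)
        )

    winner = min(candidates, key=worst_shortfall)
    return winner, worst_shortfall(winner)


schemes = [(1, -1, -1), (1, -1, -3)]           # cheap gaps vs. expensive gaps
candidates = [("AB", "BA"), ("AB-", "-BA")]    # ungapped vs. gapped alignment
print(pick_winner("AB", "BA", candidates, schemes))  # (('AB', 'BA'), 1)
```

In this toy case the gapped alignment is optimal under the first scheme but
falls 3 below the optimum under the second, while the ungapped one is never
more than 1 below, so this particular rule selects the ungapped alignment.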

In this dissertation, we discuss near-optimal alignment problems on
biosequences and present algorithms for solving them.
In the future, we would like to carry out experiments to support
or reject these concepts.
目次 Table of Contents
TABLE OF CONTENTS
Page
LIST OF FIGURES iii
LIST OF TABLES v
LIST OF SYMBOLS viii
LIST OF ABBREVIATION x
ABSTRACT xi
1 Introduction 1
2 Preliminaries 5
2.1 Notations 5
2.2 The Longest Common Subsequence Problem 7
2.2.1 Dynamic Programming Algorithm for 2-LCS 9
2.2.2 Linear Space Algorithm for 2-LCS 18
2.3 The Sequence Alignment Problem 20
2.3.1 Near-optimal Alignment 22
2.3.2 Sequence Alignment Problem with Multiple Scoring Matrices 25
3 The Better Alignment among the Output Alignments 27
3.1 Motivation 27
3.2 The Newly Defined Biosequence Alignment Problems 29
3.2.1 The Smoothest Optimal Alignment 29
3.2.2 The Most Conserved Optimal Alignment 32
3.2.3 The Miscellaneous Reasonable Optimal Alignments 37
3.3 The Algorithms for Solving the Newly Defined Problems 40
3.3.1 An Algorithm for the Smoothest Optimal Alignment 40
3.3.2 An Algorithm for the Most Conserved Optimal Alignment 44
3.4 Summary and Discussion 54
4 Near-optimal Block Alignment 56
4.1 Motivation 56
4.2 Tracings in the Alignment Lattice 58
4.3 An Algorithm for Near-optimal Block Alignment 65
4.4 Comparing NBA with Affine Gap Penalty 79
4.5 Summary and Discussion 81
5 Finding Winner Alignments with Multiple Scoring Matrices 82
5.1 Motivation 82
5.2 Losing Score Lattice 86
5.3 Finding the Winner Alignment 87
5.4 Variants of the Comparing Function 98
5.5 Summary and Discussion 106
6 Conclusions 107
BIBLIOGRAPHY 110
INDEX 118
電子全文 Fulltext
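This electronic full text is licensed to users solely for academic research, for personal, non-profit searching, reading, and printing. Please comply with the Copyright Act of the Republic of China: do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as not to violate the law.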
論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
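Public information on printed copies is relatively complete from academic year 102 onward. To look up public information on printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services (圖資處). We apologize for any inconvenience.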
開放時間 available 已公開 available
