Detailed record for thesis/dissertation etd-0806116-215529
Title page for etd-0806116-215529
Title
編輯距離問題與相關問題演算法之回顧
A Survey on the Algorithms of the Edit Distance Problem and Related Variants
Department
Year, semester
Language
Degree
Number of pages
174
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2016-09-06
Date of Submission
2016-09-07
Keywords
Genome Rearrangement, Similarity, Dynamic Programming, Longest Common Subsequence, Edit Distance, Block Edit
Statistics
This thesis/dissertation has been browsed 5715 times and downloaded 485 times.
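Chinese Abstract
The edit distance problem has been studied for decades. Given two sequences
(strings) A and B, the edit distance problem is to find the minimum cost of
transforming A into B. Depending on the edit operations that may be used, the
costs of those operations, and the format of the input strings, the problem can
be divided into many variants. The edit distance problems on run-length encoded
strings and on cyclic strings are variants on the input sequences. The block
edit problem is a variant on the operations. The edit distance problem
considering consecutive deletions and insertions is a variant on the operation
costs. In addition, the genome rearrangement problem can be viewed as a variant
of the edit distance problem. In this thesis, we review many algorithms for the
edit distance problem, its variants, and the genome rearrangement problem. We
also implement different algorithms and run a number of experiments to show
their practical execution efficiency.

Keywords: Edit Distance, Block Edit, Genome Rearrangement, Longest Common
Subsequence, Dynamic Programming, Similarity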
Abstract
The edit distance problem has been studied for several decades. Given two sequences
(strings) A and B of lengths m and n, respectively, with m ≤ n, the edit distance
problem is to find the minimum cost of the operations required to transform A into B.
Depending on the model of cost functions, operations, and input sequences, the
problem has several variants. The edit distance problems on run-length encoded
sequences and on cyclic sequences are variants on the input. The block edit problem
is a variant on the operations. The edit distance considering consecutive insertions
and deletions is a variant on the cost function. In addition, the genome
rearrangement problem can be viewed as a variant whose operations include
inversions, reversals, and transpositions. In this thesis, we survey algorithms for
the edit distance problem, its variants, and the genome rearrangement problem. We
also perform experiments to illustrate the execution efficiency of the various
algorithms.
Keywords: Edit Distance, Block Edit, Genome Rearrangement, Longest Common
Subsequence, Dynamic Programming, Similarity
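The basic problem stated in the abstract can be illustrated with a minimal sketch of the classic dynamic programming recurrence. It is not taken from the thesis. It assumes unit costs for insertion, deletion, and substitution, whereas the thesis covers more general cost models, and the function name `edit_distance` is chosen here only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b (assumed cost model:
    insertion, deletion, and substitution each cost 1)."""
    m, n = len(a), len(b)
    # prev[j] is the distance between the first i-1 characters of a
    # and the first j characters of b; initially i-1 == 0.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        # curr[0]: deleting all i characters of a[:i] yields the empty string.
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            substitute = prev[j - 1] + (a[i - 1] != b[j - 1])  # match costs 0
            delete = prev[j] + 1       # drop a[i-1]
            insert = curr[j - 1] + 1   # add b[j-1]
            curr[j] = min(substitute, delete, insert)
        prev = curr
    return prev[n]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

The sketch keeps only two rows of the table, so it uses O(n) space and O(mn) time. Recovering the actual sequence of edit operations would require keeping the full table or a traceback structure.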
Table of Contents
VERIFICATION FORM
THANKS
CHINESE ABSTRACT
ENGLISH ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS
1 Introduction
2 Preliminaries
2.1 Edit Distance
2.2 Alignment
2.3 Longest Common Subsequence
2.4 Longest Increasing Subsequence
2.5 Run-length Encoding
2.6 Dynamic Programming
3 Edit Distance and Similarity
3.1 Simple Version by Wagner and Fischer
3.2 Algorithm by Lowrance and Wagner
3.3 Edit Distance of Considering Consecutive Insertions and Deletions by Waterman et al.
3.4 Time Bounds by Wong and Chandra
3.5 Algorithm by Masek and Paterson
3.6 Relation between Edit Distance and Similarity by Smith et al.
3.7 Algorithm for Edit Distance Considering Consecutive Insertions and Deletions with Linear Cost Function by Gotoh
3.8 Algorithm for Edit Distance Considering Consecutive Insertions and Deletions with Concave Cost Function by Waterman
3.9 Distance Function by Smith et al.
3.10 Algorithm by Myers
3.11 Algorithm for Edit Distance with Concave Cost Function by Miller and Myers
3.12 Algorithm for Edit Distance of Considering Consecutive Insertions and Deletions with Mixed Convex and Concave Cost Function by Eppstein
3.13 Algorithm by Wu et al.
3.14 Cyclic String-to-String Correction Problem by Maes
3.15 Edit Distance on Run-length Encoding Sequences by Bunke and Csirik
3.16 Edit Distance of Cyclic Sequences by Marzal and Barrachina
3.17 Edit Distance of Run-length Encoding Sequences by Arbell et al.
3.18 Edit Distance of RNA Structures by Jiang et al.
3.19 Edit Distance between Run-length Encoding Sequence and Uncompressed Sequence by Liu et al.
3.20 Block Edit Problem by Ann et al.
4 Genome Rearrangement
4.1 Sequence Alignment with Non-overlapping Inversions by Schoniger and Waterman
4.2 Exact and Approximation Algorithms for Reversal Distance on Permutations by Kececioglu and Sankoff
4.3 Lower Bounds for Reversal Distance on Permutations by Bafna et al.
4.4 Approximation Algorithm for Transposition Distance on Permutations by Walter et al.
4.5 Upper Bounds and Lower Bounds for Reversal Distance and Transposition Distance on Binary Sequences by Christie and Irving
4.6 An Efficient Algorithm for Computing Non-overlapping Inversion and Transposition Distance by Ta et al.
5 Experimental Results
5.1 Data Generation
5.2 Experimental Results
6 Conclusions and Future Work
BIBLIOGRAPHY
A Miscellaneous Experimental Results
References
[1] A. Aggarwal, M. M. Klawe, S. Moran, P. Shor, and R. Wilber, "Geometric
applications of a matrix-searching algorithm," Algorithmica, Vol. 2, No. 1,
pp. 195-208, 1987.
[2] A. V. Aho, D. S. Hirschberg, and J. D. Ullman, "Bounds on the complexity
of the longest common subsequence problem," Journal of the ACM, Vol. 23,
No. 1, pp. 1-12, 1976.
[3] L. Allison and T. I. Dix, "A bit-string longest-common-subsequence algorithm,"
Information Processing Letters, Vol. 23, pp. 305-310, 1986.
[4] H. Y. Ann, C. B. Yang, Y. H. Peng, and B. C. Liaw, "Efficient algorithms for the
block edit problems," Information and Computation, Vol. 208, No. 3, pp. 221-229,
2010.
[5] H. Y. Ann, C. B. Yang, C.-T. Tseng, and C. Y. Hor, "A fast and simple algorithm
for computing the longest common subsequence of run-length encoded
strings," Information Processing Letters, Vol. 108, pp. 360-364, 2008.
[6] O. Arbell, G. M. Landau, and J. S. B. Mitchell, "Edit distance of run-length
encoded strings," Information Processing Letters, Vol. 83, No. 6, pp. 307-314,
2002.
[7] V. Bafna and P. A. Pevzner, "Genome rearrangements and sorting by reversals,"
SIAM Journal on Computing, Vol. 25, No. 2, pp. 272-289, 1996.
[8] V. Bafna and P. A. Pevzner, "Sorting by transpositions," SIAM Journal on
Discrete Mathematics, Vol. 11, No. 2, pp. 224-240, 1998.
[9] A. Bergeron, J. Mixtacki, and J. Stoye, "Reversal distance without hurdles
and fortresses," Proceedings of 15th Annual Combinatorial Pattern Matching
Symposium, Vol. 3109, Istanbul, Turkey, pp. 389-399, 2004.
[10] P. Berman and S. Hannenhalli, "Fast sorting by reversal," Proceedings of the 7th
Annual Symposium on Combinatorial Pattern Matching, London, UK, pp. 168-185, 1996.
[11] P. Berman, S. Hannenhalli, and M. Karpinski, "1.375-approximation algorithm
for sorting by reversals," Proceedings of the 10th Annual European Symposium
on Algorithms, Vol. 2461, Rome, Italy, pp. 200-210, 2002.
[12] H. Bunke and J. Csirik, "An algorithm for matching run-length coded strings,"
Computing, Vol. 50, No. 4, pp. 297-314, 1993.
[13] H. Bunke and J. Csirik, "An improved algorithm for computing the edit distance
of run-length coded strings," Information Processing Letters, Vol. 54, No. 2,
pp. 93-96, 1995.
[14] A. Caprara, "Sorting by reversals is difficult," Proceedings of the First International Conference on Computational Molecular Biology, NM, USA, pp. 75-83,
1997.
[15] A. Caprara, "Sorting permutations by reversals and Eulerian cycle decompositions," SIAM Journal on Discrete Mathematics, Vol. 12, No. 1, pp. 91-100,
1999.
[16] D. A. Christie, "A 3/2-approximation algorithm for sorting by reversals," Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms,
Pennsylvania, USA, pp. 244-252, 1998.
[17] D. A. Christie and R. W. Irving, "Sorting strings by reversals and by transpositions," SIAM Journal on Discrete Mathematics, Vol. 14, No. 2, pp. 193-206,
2001.
[18] G. Cormode and S. Muthukrishnan, "The string edit distance matching problem
with moves," Proceedings of the 13th Annual ACM-SIAM Symposium on
Discrete Algorithms, San Francisco, USA, pp. 667-676, 2002.
[19] N. El-Mabrouk, "Reconstructing an ancestral genome using minimum segments
duplications and reversals," Journal of Computer and System Sciences, Vol. 65,
No. 3, pp. 442-464, 2002.
[20] D. Eppstein, "Sequence comparison with mixed convex and concave cost,"
Journal of Algorithms, Vol. 11, pp. 85-101, 1990.
[21] F. Ergun, S. Muthukrishnan, and C. Sahinalp, "Comparing sequences with segment rearrangements," Proceedings of the 23rd Foundations of Software Technology and Theoretical Computer Science, Mumbai, India, pp. 183-194, 2003.
[22] Z. Galil and R. Giancarlo, "Speeding up dynamic programming with application
to molecular biology," Theoretical Computer Science, Vol. 64, 1989.
[23] M. R. Garey and D. S. Johnson, "Complexity results for multiprocessor scheduling
under resource constraints," SIAM Journal on Computing, Vol. 4, pp. 397-411, 1975.
[24] O. Gotoh, "An improved algorithm for matching biological sequences," Journal
of Molecular Biology, Vol. 162, No. 3, pp. 705-708, 1982.
[25] J. Gregor and M. G. Thomason, "Dynamic programming alignment of sequences
representing cyclic patterns," IEEE Transactions on Pattern Analysis
and Machine Intelligence, Vol. 15, No. 2, pp. 129-135, 1993.
[26] S. Hannenhalli, "Polynomial-time algorithm for computing translocation distance
between genomes," Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching, Espoo, Finland, pp. 162-176, 1996.
[27] S. Hannenhalli and P. A. Pevzner, "Transforming cabbage into turnip: polynomial
algorithm for sorting signed permutations by reversals," Proceedings of the
27th Annual Symposium on Theory of Computing, New York, USA, pp. 178-189, 1995.
[28] T. Hartman and R. Shamir, "A simpler 1.5-approximation algorithm for sorting
by transpositions," Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, Morelia, Mexico, pp. 156-169, 2003.
[29] D. S. Hirschberg, "A linear space algorithm for computing maximal common
subsequences," Communications of the ACM, Vol. 18, No. 6, pp. 341-343, 1975.
[30] J. W. Hunt and T. G. Szymanski, "A fast algorithm for computing longest common subsequences," Communications of the ACM, Vol. 20, No. 5, pp. 350-353, 1977.
[31] T. Jiang, G. Lin, B. Ma, and K. Zhang, "A general edit distance between RNA
structures," Journal of Computational Biology, Vol. 9, No. 2, pp. 371-388, 2002.
[32] H. Kaplan and N. Shafrir, "The greedy algorithm for edit distance with moves,"
Information Processing Letters, Vol. 97, No. 1, pp. 23-27, 2006.
[33] J. Kececioglu and D. Sankoff, "Exact and approximation algorithms for the inversion distance between two permutations," Proceedings of the 4th Annual
Symposium on Combinatorial Pattern Matching, Vol. 684, Berlin, Germany,
pp. 87-105, 1993.
[34] J. Kececioglu and D. Sankoff, "Exact and approximation algorithms for sorting by reversals, with application to genome rearrangement," Algorithmica, Vol. 13,
No. 1-2, pp. 180-210, 1995.
[35] J. W. Kim, A. Amir, G. M. Landau, and K. Park, "Computing similarity of run-length-encoded strings with affine gap penalty," Theoretical Computer Science,
Vol. 395, No. 2-3, pp. 268-282, 2008.
[36] M. M. Klawe and D. J. Kleitman, "An almost linear time algorithm for generalized
matrix searching," SIAM Journal on Discrete Mathematics, Vol. 3,
pp. 81-97, 1990.
[37] P. Kolman, "Approximating reversal distance for strings with bounded number
of duplicates in linear time," Proceedings of 30th International Symposium on
Mathematical Foundations of Computer Science, Gdansk, Poland, pp. 580-590,
2005.
[38] J. J. Liu, G. S. Huang, Y. L. Wang, and R. C. T. Lee, "Edit distance for a run-length-encoded string and an uncompressed string," Information Processing
Letters, Vol. 105, No. 1, pp. 12-16, 2007.
[39] J. J. Liu, Y. L. Wang, and R. C. T. Lee, "Finding a longest common subsequence
between a run-length-encoded string and an uncompressed string," Journal of
Complexity, Vol. 24, No. 2, pp. 173-184, 2008.
[40] D. Lopresti and A. Tomkins, "Block edit models for approximate string matching,"
Theoretical Computer Science, Vol. 181, pp. 159-179, 1997.
[41] R. Lowrance and R. A. Wagner, "An extension of the string-to-string correction
problem," Journal of the ACM, Vol. 22, No. 2, pp. 177-188, 1975.
[42] M. Maes, "On a cyclic string-to-string correction problem," Information
Processing Letters, Vol. 35, pp. 73-78, 1990.
[43] A. Marzal and S. Barrachina, "Speeding up the computation of the edit distance
for cyclic strings," Proceedings of the 15th International Conference on
Pattern Recognition, Barcelona, Spain, pp. 895-898, 2000.
[44] W. J. Masek and M. S. Paterson, "A faster algorithm computing string edit
distances," Journal of Computer and System Sciences, Vol. 20, pp. 18-31, 1980.
[45] W. Miller and E. W. Myers, "Sequence comparison with concave weighting
functions," Bulletin of Mathematical Biology, Vol. 50, No. 2, pp. 97-120, 1988.
[46] S. Muthukrishnan and S. C. Sahinalp, "Approximate nearest neighbors and sequence comparison with block operations," Proceedings of the 32nd Symposium
on the Theory of Computing, Portland, USA, pp. 416-424, 2000.
[47] E. W. Myers, "An O(ND) difference algorithm and its variations," Algorithmica, Vol. 1, pp. 251-266, 1986.
[48] S. B. Needleman and C. D. Wunsch, "A general method applicable to the
search for similarities in the amino acid sequence of two proteins," Journal of
Molecular Biology, Vol. 48, No. 3, pp. 443-453, 1970.
[49] M. Schoniger and M. S. Waterman, "A local algorithm for DNA sequence
alignment with inversions," Bulletin of Mathematical Biology, Vol. 54, No. 4,
pp. 521-536, 1992.
[50] P. H. Sellers, "An algorithm for the distance between two finite sequences,"
Journal of Combinatorial Theory, Vol. 16, pp. 253-258, 1974.
[51] P. H. Sellers, "On the theory and computation of evolutionary distances," SIAM
Journal on Applied Mathematics, Vol. 26, No. 4, pp. 787-793, 1974.
[52] D. Shapira and J. A. Storer, "Large edit distance with multiple block operations,"
Proceedings of 10th International Symposium, String Processing and
Information Retrieval, Vol. 2857, Manaus, Brazil, pp. 369-377, 2003.
[53] D. Shapira and J. A. Storer, "Edit distance with move operations," Journal of
Discrete Algorithms, Vol. 5, No. 2, pp. 380-392, 2007.
[54] T. F. Smith and M. S. Waterman, "Comparison of biosequences," Advances in
Applied Mathematics, Vol. 2, pp. 482-489, 1981.
[55] T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences,"
Journal of Molecular Biology, Vol. 147, No. 1, pp. 195-197, 1981.
[56] T. F. Smith, M. S. Waterman, and C. Burks, "The statistical distribution of nucleic acid similarities," Nucleic Acids Research, pp. 645-656, 1985.
[57] T. F. Smith, M. S. Waterman, and W. M. Fitch, "Comparative biosequence metrics," Journal of Molecular Evolution, Vol. 18, pp. 38-46, 1981.
[58] T. T. Ta, C. Y. Lin, and C. L. Lu, "An efficient algorithm for computing non-overlapping inversion and transposition distance," Proceedings of 32nd Workshop on Combinatorial Mathematics and Computation Theory, Taichung, Taiwan, pp. 55-61, 2015.
[59] E. Ukkonen, "Algorithms for approximate string matching," Information and
Control, Vol. 64, pp. 100-118, 1985.
[60] R. A. Wagner, "On the complexity of the extended string-to-string correction
problem," Proceedings of Seventh Annual ACM Symposium on Theory of Computing, New York, USA, pp. 218-223, 1975.
[61] R. A. Wagner and M. J. Fischer, "The string-to-string correction problem,"
Journal of the ACM, Vol. 21, No. 1, pp. 168-173, 1974.
[62] M. E. M. T. Walter, Z. Dias, and J. Meidanis, "A new approach for approximating the transposition distance," Proceedings of the Seventh International Symposium on String Processing and Information Retrieval, Corunna, Spain, pp. 199-208, 2000.
[63] M. S. Waterman, "Efficient sequence alignment algorithms," Journal of Theoretical Biology, Vol. 108, pp. 333-337, 1984.
[64] M. S. Waterman, T. F. Smith, and W. Beyer, "Some biological sequence metrics,"
Advances in Mathematics, Vol. 20, No. 3, pp. 367-387, 1976.
[65] R. Wilber, "The concave least-weight subsequence problem revisited," Journal
of Algorithms, Vol. 9, pp. 418-425, 1988.
[66] C. K. Wong and A. K. Chandra, "Bounds of the string editing problem," Journal of the ACM, Vol. 23, pp. 13-16, 1976.
[67] S. Wu, U. Manber, G. Myers, and W. Miller, "An O(NP) sequence comparison algorithm," Information Processing Letters, Vol. 35, pp. 317-323, 1990.
[68] C. B. Yang and R. C. T. Lee, "Systolic algorithms for the longest common
subsequence problem," Journal of the Chinese Institute of Engineers, Vol. 10, No. 6, pp. 691-699, 1987.
[69] F. F. Yao, "Speed-up in dynamic programming," SIAM Journal on Algebraic
Discrete Methods, Vol. 3, pp. 532-540, 1982.
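Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research. Please comply with the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: unrestricted (fully open on and off campus)
Available:
Campus: available
Off-campus: available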


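Printed copies
Public-access information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To inquire about printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available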
