論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available
論文名稱 Title |
編輯距離問題與相關問題演算法之回顧 A Survey on the Algorithms of the Edit Distance Problem and Related Variants |
||
系所名稱 Department |
|||
畢業學年期 Year, semester |
語文別 Language |
||
學位類別 Degree |
頁數 Number of pages |
174 |
|
研究生 Author |
|||
指導教授 Advisor |
|||
召集委員 Convenor |
|||
口試委員 Advisory Committee |
|||
口試日期 Date of Exam |
2016-09-06 |
繳交日期 Date of Submission |
2016-09-07 |
關鍵字 Keywords |
相似度、動態規劃、最長共同子序列、基因重組、區塊編輯、編輯距離 Genome Rearrangement, Similarity, Dynamic Programming, Longest Common Subsequence, Edit Distance, Block Edit |
||
統計 Statistics |
This thesis has been viewed 5715 times and downloaded 485 times. |
中文摘要 |
摘 要 編輯距離問題已經被研究數十年。給予兩個序列(字串)A 和B,編輯距離問 題即是求得將A 轉換成B 的最小花費。根據可以被使用的編輯操作,編輯操作 的花費,以及輸入的字串格式,編輯距離問題可以分為多種的變形問題。變動長 度編碼字串以及循環的字串的編輯距離問題是輸入序列方面的變型問題。區塊編 輯問題是運算方面的變型問題。考慮連續刪除以及新增編輯操作的編輯距離問題 是編輯操作花費方面的變型問題。除此之外,基因重組問題也可以視為是編輯距 離問題的一種變型問題。在本論文中,我們回顧許多編輯距離問題相關的演算法, 變型問題以及基因重組問題。我們也透過實作不同的演算法來進行許多實驗,進 而說明不同演算法的實際執行效率。 關鍵詞:編輯距離、區塊編輯、基因重組、最長共同子序列、動態規劃、相似度 |
Abstract |
The edit distance problem has been studied for several decades. Given two sequences (strings) A and B of lengths m and n, respectively, with m ≤ n, the edit distance problem is to find the minimum total cost of the edit operations required to transform A into B. Depending on the cost function, the allowed operations, and the form of the input sequences, the problem has several variants. Edit distance on run-length encoded sequences and on cyclic sequences are variants of the input. The block edit problem is a variant of the operations. Edit distance that accounts for consecutive insertions and deletions is a variant of the cost function. In addition, the genome rearrangement problem can be viewed as a variant whose operations include inversions, reversals, and transpositions. In this thesis, we survey algorithms for the edit distance problem, its variants, and the genome rearrangement problem. We also implement several of these algorithms and run experiments to compare their practical execution efficiency. Keywords: Edit Distance, Block Edit, Genome Rearrangement, Longest Common Subsequence, Dynamic Programming, Similarity |
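To make the problem statement in the abstract concrete, the following is a minimal sketch of the classic dynamic-programming recurrence for edit distance, assuming unit costs for insertion, deletion, and substitution. It is an illustration only and is not taken from the thesis; the function name `edit_distance` and the unit-cost model are assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance between a and b (illustrative sketch).

    Assumes insertion, deletion, and substitution each cost 1.
    Runs in O(m * n) time and O(min(m, n)) extra space.
    """
    # With symmetric unit costs the distance is symmetric, so keep the
    # shorter string along the row to minimize memory.
    if len(a) > len(b):
        a, b = b, a

    # prev[i] holds the distance between a[:i] and the current prefix of b.
    # Against the empty prefix of b, a[:i] needs i deletions.
    prev = list(range(len(a) + 1))

    for j, cb in enumerate(b, start=1):
        # Transforming the empty prefix of a into b[:j] takes j insertions.
        curr = [j]
        for i, ca in enumerate(a, start=1):
            substitution = prev[i - 1] + (0 if ca == cb else 1)
            insertion = prev[i] + 1      # insert cb
            deletion = curr[i - 1] + 1   # delete ca
            curr.append(min(substitution, insertion, deletion))
        prev = curr

    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

The variants named in the abstract change one ingredient of this recurrence: the cost function (for example, gap penalties for consecutive insertions and deletions), the allowed operations (block edits, reversals, transpositions), or the input representation (run-length encoded or cyclic sequences).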
目次 Table of Contents |
VERIFICATION FORM
THANKS
CHINESE ABSTRACT
ENGLISH ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS
1 Introduction
2 Preliminaries
2.1 Edit Distance
2.2 Alignment
2.3 Longest Common Subsequence
2.4 Longest Increasing Subsequence
2.5 Run-length Encoding
2.6 Dynamic Programming
3 Edit Distance and Similarity
3.1 Simple Version by Wagner and Fischer
3.2 Algorithm by Lowrance and Wagner
3.3 Edit Distance of Considering Consecutive Insertions and Deletions by Waterman et al.
3.4 Time Bounds by Wong and Chandra
3.5 Algorithm by Masek and Paterson
3.6 Relation between Edit Distance and Similarity by Smith et al.
3.7 Algorithm for Edit Distance Considering Consecutive Insertions and Deletions with Linear Cost Function by Gotoh
3.8 Algorithm for Edit Distance Considering Consecutive Insertions and Deletions with Concave Cost Function by Waterman
3.9 Distance Function by Smith et al.
3.10 Algorithm by Myers
3.11 Algorithm for Edit Distance with Concave Cost Function by Miller and Myers
3.12 Algorithm for Edit Distance of Considering Consecutive Insertions and Deletions with Mixed Convex and Concave Cost Function by Eppstein
3.13 Algorithm by Wu et al.
3.14 Cyclic String-to-String Correction Problem by Maes
3.15 Edit Distance on Run-length Encoding Sequences by Bunke and Csirik
3.16 Edit Distance of Cyclic Sequences by Marzal and Barrachina
3.17 Edit Distance of Run-length Encoding Sequences by Arbell et al.
3.18 Edit Distance of RNA Structures by Jiang et al.
3.19 Edit Distance between Run-length Encoding Sequence and Uncompressed Sequence by Liu et al.
3.20 Block Edit Problem by Ann et al.
4 Genome Rearrangement
4.1 Sequence Alignment with Non-overlapping Inversions by Schoniger and Waterman
4.2 Exact and Approximation Algorithms for Reversal Distance on Permutations by Kececioglu and Sankoff
4.3 Lower Bounds for Reversal Distance on Permutations by Bafna et al.
4.4 Approximation Algorithm for Transposition Distance on Permutations by Walter et al.
4.5 Upper Bounds and Lower Bounds for Reversal Distance and Transposition Distance on Binary Sequences by Christie and Irving
4.6 An Efficient Algorithm for Computing Non-overlapping Inversion and Transposition Distance by Ta et al.
5 Experimental Results
5.1 Data Generation
5.2 Experimental Results
6 Conclusions and Future Work
BIBLIOGRAPHY
A Miscellaneous Experimental Results |
參考文獻 References |
[1] A. Aggarwal, M. M. Klawe, S. Moran, P. Shor, and R. Wilber, "Geometric applications of a matrix-searching algorithm," Algorithmica, Vol. 2, No. 1, pp. 195-208, 1987.
[2] A. V. Aho, D. S. Hirschberg, and J. D. Ullman, "Bounds on the complexity of the longest common subsequence problem," Journal of the ACM, Vol. 23, No. 1, pp. 1-12, 1976.
[3] L. Allison and T. I. Dix, "A bit-string longest-common-subsequence algorithm," Information Processing Letters, Vol. 23, pp. 305-310, 1986.
[4] H. Y. Ann, C. B. Yang, Y. H. Peng, and B. C. Liaw, "Efficient algorithms for the block edit problems," Information and Computation, Vol. 208, No. 3, pp. 221-229, 2010.
[5] H. Y. Ann, C. B. Yang, C.-T. Tseng, and C. Y. Hor, "A fast and simple algorithm for computing the longest common subsequence of run-length encoded strings," Information Processing Letters, Vol. 108, pp. 360-364, 2008.
[6] O. Arbell, G. M. Landau, and J. S. B. Mitchell, "Edit distance of run-length encoded strings," Information Processing Letters, Vol. 83, No. 6, pp. 307-314, 2002.
[7] V. Bafna and P. A. Pevzner, "Genome rearrangements and sorting by reversals," SIAM Journal on Computing, Vol. 25, No. 2, pp. 172-289, 1996.
[8] V. Bafna and P. A. Pevzner, "Sorting by transpositions," SIAM Journal on Discrete Mathematics, Vol. 11, No. 2, pp. 224-240, 1998.
[9] A. Bergeron, J. Mixtacki, and J. Stoye, "Reversal distance without hurdles and fortresses," Proceedings of the 15th Annual Combinatorial Pattern Matching Symposium, Vol. 3109, Istanbul, Turkey, pp. 389-399, 2004.
[10] P. Berman and S. Hannenhalli, "Fast sorting by reversal," Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching, London, UK, pp. 168-185, 1996.
[11] P. Berman, S. Hannenhalli, and M. Karpinski, "1.375-approximation algorithm for sorting by reversals," Proceedings of the 10th Annual European Symposium on Algorithms, Vol. 2461, Rome, Italy, pp. 200-210, 2002.
[12] H. Bunke and J. Csirik, "An algorithm for matching run-length coded strings," Computing, Vol. 50, No. 4, pp. 297-314, 1993.
[13] H. Bunke and J. Csirik, "An improved algorithm for computing the edit distance of run-length coded strings," Information Processing Letters, Vol. 54, No. 2, pp. 93-96, 1995.
[14] A. Caprara, "Sorting by reversals is difficult," Proceedings of the First International Conference on Computational Molecular Biology, NM, USA, pp. 75-83, 1997.
[15] A. Caprara, "Sorting permutations by reversals and eulerian cycle decompositions," SIAM Journal on Discrete Mathematics, Vol. 12, No. 1, pp. 91-100, 1999.
[16] D. A. Christie, "A 3/2-approximation algorithm for sorting by reversals," Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, Pennsylvania, USA, pp. 244-252, 1998.
[17] D. A. Christie and R. W. Irving, "Sorting strings by reversals and by transpositions," SIAM Journal on Discrete Mathematics, Vol. 14, No. 2, pp. 193-206, 2001.
[18] G. Cormode and S. Muthukrishnan, "The string edit distance matching problem with moves," Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, USA, pp. 667-676, 2002.
[19] N. El-Mabrouk, "Reconstructing an ancestral genome using minimum segments duplications and reversals," Journal of Computer and System Sciences, Vol. 65, No. 3, pp. 442-464, 2002.
[20] D. Eppstein, "Sequence comparison with mixed convex and concave cost," Journal of Algorithms, Vol. 11, pp. 85-101, 1990.
[21] F. Ergun, S. Muthukrishnan, and C. Sahinalp, "Comparing sequences with segment rearrangements," Proceedings of the 23rd Foundations of Software Technology and Theoretical Computer Science, Mumbai, India, pp. 183-194, 2003.
[22] Z. Galil and R. Giancarlo, "Speeding up dynamic programming with application to molecular biology," Theoretical Computer Science, Vol. 64, 1989.
[23] M. R. Garey and D. S. Johnson, "Complexity results for multiprocessor scheduling under resource constraints," SIAM Journal on Computing, Vol. 4, pp. 397-411, 1975.
[24] O. Gotoh, "An improved algorithm for matching biological sequences," Journal of Molecular Biology, Vol. 162, No. 3, pp. 705-708, 1982.
[25] J. Gregor and M. G. Thomason, "Dynamic programming alignment of sequences representing cyclic patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, No. 2, pp. 129-135, 1993.
[26] S. Hannenhalli, "Polynomial-time algorithm for computing translocation distance between genomes," Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching, Espoo, Finland, pp. 162-176, 1996.
[27] S. Hannenhalli and P. A. Pevzner, "Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals," Proceedings of the 27th Annual Symposium on Theory of Computing, New York, USA, pp. 178-189, 1995.
[28] T. Hartman and R. Shamir, "A simpler 1.5-approximation algorithm for sorting by transpositions," Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, Morelia, Mexico, pp. 156-169, 2003.
[29] D. S. Hirschberg, "A linear space algorithm for computing maximal common subsequence," Communications of the ACM, Vol. 24, pp. 664-675, 1975.
[30] J. W. Hunt and T. G. Szymanski, "A fast algorithm for computing longest common subsequences," Communications of the ACM, Vol. 20, No. 5, pp. 350-353, 1977.
[31] T. Jiang, G. Lin, B. Ma, and K. Zhang, "A general edit distance between RNA structures," Journal of Computational Biology, Vol. 9, No. 2, pp. 371-388, 2002.
[32] H. Kaplan and N. Shafrir, "The greedy algorithm for edit distance with moves," Information Processing Letters, Vol. 97, No. 1, pp. 23-27, 2006.
[33] J. Kececioglu and D. Sankoff, "Exact and approximation algorithms for the inversion distance between two permutations," Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching, Vol. 684, Berlin, Germany, pp. 87-105, 1993.
[34] J. Kececioglu and D. Sankoff, "Exact and approximation algorithms for sorting by reversals, with application to genome rearrangement," Algorithmica, Vol. 13, No. 1-2, pp. 180-210, 1995.
[35] J. W. Kim, A. Amir, G. M. Landau, and K. Park, "Computing similarity of run-length-encoded strings with affine gap penalty," Theoretical Computer Science, Vol. 395, No. 2-3, pp. 268-282, 2008.
[36] M. M. Klawe and D. J. Kleitman, "An almost linear time algorithm for generalized matrix searching," SIAM Journal on Discrete Mathematics, Vol. 3, pp. 81-97, 1990.
[37] P. Kolman, "Approximating reversal distance for strings with bounded number of duplicates in linear time," Proceedings of the 30th International Symposium on Mathematical Foundations of Computer Science, Gdansk, Poland, pp. 580-590, 2005.
[38] J. J. Liu, G. S. Huang, Y. L. Wang, and R. C. T. Lee, "Edit distance for a run-length-encoded string and an uncompressed string," Information Processing Letters, Vol. 105, No. 1, pp. 12-16, 2007.
[39] J. J. Liu, Y. L. Wang, and R. C. T. Lee, "Finding a longest common subsequence between a run-length-encoded string and an uncompressed string," Journal of Complexity, Vol. 24, No. 2, pp. 173-184, 2008.
[40] D. Lopresti and A. Tomkins, "Block edit models for approximate string matching," Theoretical Computer Science, Vol. 181, pp. 159-179, 1997.
[41] R. Lowrance and R. A. Wagner, "An extension of the string-to-string correction problem," Journal of the ACM, Vol. 22, No. 2, pp. 177-188, 1975.
[42] M. Maes, "On a cyclic string-to-string correction problem," Information Processing Letters, Vol. 35, pp. 73-78, 1990.
[43] A. Marzal and S. Barrachina, "Speeding up the computation of the edit distance for cyclic strings," Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, Spain, pp. 895-898, 2000.
[44] W. J. Masek and M. S. Paterson, "A faster algorithm computing string edit distances," Journal of Computer and System Sciences, Vol. 20, pp. 18-31, 1980.
[45] W. Miller and E. W. Myers, "Sequence comparison with concave weighting functions," Bulletin of Mathematical Biology, Vol. 50, No. 2, pp. 97-120, 1988.
[46] S. Muthukrishnan and S. C. Sahinalp, "Approximate nearest neighbors and sequence comparison with block operations," Proceedings of the 32nd Symposium on the Theory of Computing, Portland, USA, pp. 416-424, 2000.
[47] E. W. Myers, "An O(ND) difference algorithm and its variations," Algorithmica, No. 1, pp. 251-266, 1986.
[48] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, Vol. 48, No. 3, pp. 443-453, 1970.
[49] M. Schoniger and M. S. Waterman, "A local algorithm for DNA sequence alignment with inversions," Bulletin of Mathematical Biology, Vol. 54, No. 4, pp. 521-536, 1992.
[50] P. H. Sellers, "An algorithm for the distance between two finite sequences," Journal of Combinatorial Theory, Vol. 16, pp. 253-258, 1974.
[51] P. H. Sellers, "On the theory and computation of evolutionary distances," SIAM Journal on Applied Mathematics, Vol. 26, No. 4, pp. 787-793, 1974.
[52] D. Shapira and J. A. Storer, "Large edit distance with multiple block operations," Proceedings of the 10th International Symposium, String Processing and Information Retrieval, Vol. 2857, Manaus, Brazil, pp. 369-377, 2003.
[53] D. Shapira and J. A. Storer, "Edit distance with move operations," Journal of Discrete Algorithms, Vol. 5, No. 2, pp. 380-392, 2007.
[54] T. F. Smith and M. S. Waterman, "Comparison of biosequences," Advances in Applied Mathematics, Vol. 2, pp. 482-489, 1981.
[55] T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, Vol. 147, No. 1, pp. 195-197, 1981.
[56] T. F. Smith, M. S. Waterman, and C. Burks, "The statistical distribution of nucleic acid similarities," Nucleic Acids Research, pp. 645-656, 1985.
[57] T. F. Smith, M. S. Waterman, and W. M. Fitch, "Comparative biosequence metrics," Journal of Molecular Evolution, Vol. 18, pp. 38-46, 1981.
[58] T. T. Ta, C. Y. Lin, and C. L. Lu, "An efficient algorithm for computing nonoverlapping inversion and transposition distance," Proceedings of the 32nd Workshop on Combinatorial Mathematics and Computation Theory, Taichung, Taiwan, pp. 55-61, 2015.
[59] E. Ukkonen, "Algorithms for approximate string matching," Information and Control, Vol. 64, pp. 100-118, 1985.
[60] R. A. Wagner, "On the complexity of the extended string-to-string correction problem," Proceedings of the Seventh Annual ACM Symposium on Theory of Computing, New York, USA, pp. 218-223, 1975.
[61] R. A. Wagner and M. J. Fischer, "The string-to-string correction problem," Journal of the ACM, Vol. 21, No. 1, pp. 168-173, 1974.
[62] M. E. M. T. Walter, Z. Dias, and J. Meidanis, "A new approach for approximating the transposition distance," Proceedings of the Seventh International Symposium on String Processing Information Retrieval, Corunna, Spain, pp. 199-208, 2000.
[63] M. S. Waterman, "Efficient sequence alignment algorithms," Journal of Theoretical Biology, Vol. 108, pp. 333-337, 1984.
[64] M. S. Waterman, T. F. Smith, and W. Beyer, "Some biological sequence metrics," Advances in Mathematics, Vol. 20, No. 3, pp. 367-387, 1976.
[65] R. Wilber, "The concave least-weight subsequence problem revisited," Journal of Algorithms, Vol. 9, pp. 418-425, 1988.
[66] C. K. Wong and A. K. Chandra, "Bounds of the string editing problem," Journal of the ACM, Vol. 23, pp. 13-16, 1976.
[67] S. Wu, U. Manber, G. Myers, and W. Miller, "An O(NP) sequence comparison algorithm," Information Processing Letters, Vol. 35, pp. 317-323, 1990.
[68] C. B. Yang and R. C. T. Lee, "Systolic algorithms for the longest common subsequence problem," Journal of the Chinese Institute of Engineers, Vol. 10, No. 6, pp. 691-699, 1987.
[69] F. F. Yao, "Speed-up in dynamic programming," SIAM Journal on Algebraic Discrete Methods, Vol. 3, pp. 532-540, 1982. |
電子全文 Fulltext |
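This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the text without authorization, so as to avoid violating the law. |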
紙本論文 Printed copies |
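Availability information for printed theses is relatively complete from academic year 102 onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available |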