國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,多重序列的最長共同子序列之基因演算法 ,A Genetic Algorithm for the Longest Common Subsequence of Multiple Sequences

論文名稱 Title	多重序列的最長共同子序列之基因演算法 A Genetic Algorithm for the Longest Common Subsequence of Multiple Sequences
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	97 學年度第 1 學期 The fall semester of Academic Year 97	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	57
研究生 Author	蔣宗翰 Chung-Han Chiang
指導教授 Advisor	楊昌彪 Chang-Biau Yang
召集委員 Convenor	張儀興 none
口試委員 Advisory Committee	范俊逸, 蕭學宏, 李宗南 none; none; none
口試日期 Date of Exam	2008-07-14	繳交日期 Date of Submission	2009-01-06
關鍵字 Keywords	最長共同子序列、基因演算法 longest common subsequence, multiple sequences, genetic algorithm
統計 Statistics	本論文已被瀏覽 5702 次，被下載 2036 次 The thesis/dissertation has been browsed 5702 times, has been downloaded 2036 times.

中文摘要
許多的方法已經被提出在尋找最長共同子序列(LCS)的問題上，而這些方法在最壞的情況下時，其時間複雜度為O(n^2)，n是指輸入序列的長度。然而，當輸入序列的長度n在非常大的時候，這些演算法會變得不可實行。近來，k條序列間的最長共同子序列(k-LCS, k≧2)的問題變得越來越引人注意。已有一些演算法被提出來為解決此問題，但是，為解此問題而所需的執行時間依然太長，以致於無法實行。在本論文中，我們提出一種基因演算法來解決k-LCS問題，其時間複雜度為O(Gpk(n + |P_j|))，G為所經過的世代數，p為樣板序列的數量，k為輸入序列的數量，而n與|P_j|分別為輸入序列的長度以及樣板序列的長度。如同我們的實驗結果，當輸入序列數量為20、輸入序列長度為1000時，我們演算法的效能比率(|CS|/|LCS|)是大於0.8的，其中，|CS|所指的是我們所找到的解答長度，而|LCS|是真正的LCS長度。我們與Expansion演算法以及BNMAS演算法做效能比率的比較，當輸入序列的數量從2到20條，輸入的序列長度為100到2000時，我們所求得的效能比率是非常好。
Abstract
Various approaches have been proposed for finding the longest common subsequence (LCS) of two sequences. These algorithms usually take $O(n^2)$ time in the worst case, where $n$ is the length of the input sequences, and they become infeasible when $n$ is very large. Recently, the $k$-LCS problem ($k \geq 2$), which asks for the LCS of $k$ sequences, has attracted increasing attention. Some algorithms have been proposed for it, but their execution time is still too long to be practical. In this thesis, we propose a genetic algorithm for solving the $k$-LCS problem with time complexity $O(Gpk(n + |P_j|))$, where $G$ is the number of generations, $p$ is the number of template patterns, $k$ is the number of input sequences, and $n$ and $|P_j|$ are the lengths of the input sequences and of the template patterns, respectively. Our experimental results show that when $k$ is 20 and $n$ is 1000, the performance ratio ($|CS|/|LCS|$) of our algorithm is greater than 0.8, where $|CS|$ denotes the length of the solution we find and $|LCS|$ denotes the length of the real (optimal) LCS. In terms of performance ratio, our algorithm is much better than the Expansion Algorithm and the BNMAS Algorithm when the number of input sequences varies from 2 to 20 and the length of the input sequences varies from 100 to 2000.
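The building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's genetic algorithm, which this record does not describe in detail. `lcs_length` is the textbook quadratic dynamic program for two sequences. `is_common_subsequence` tests a candidate pattern against $k$ sequences with a linear scan per sequence; that cost is consistent with the $k(n + |P_j|)$ factor in the stated complexity, but treating it as the thesis's fitness test is an assumption. All function names are hypothetical.

```python
from typing import Sequence


def lcs_length(a: str, b: str) -> int:
    """Textbook O(|a| * |b|) dynamic program for the LCS length of two sequences."""
    prev = [0] * (len(b) + 1)  # DP row for the previous prefix of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)  # extend a match diagonally
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a symbol of a or b
        prev = curr
    return prev[-1]


def is_subsequence(pattern: str, seq: str) -> bool:
    """Single left-to-right scan of seq, O(|seq| + |pattern|) time."""
    it = iter(seq)
    # `ch in it` advances the iterator until ch is found (or seq is exhausted).
    return all(ch in it for ch in pattern)


def is_common_subsequence(pattern: str, seqs: Sequence[str]) -> bool:
    """Check a candidate pattern (assumed role: template pattern) against all k sequences."""
    return all(is_subsequence(pattern, s) for s in seqs)


def performance_ratio(cs: str, lcs_len: int) -> float:
    """|CS| / |LCS| as defined in the abstract; lcs_len must be the optimal length."""
    return len(cs) / lcs_len
```

For example, `lcs_length("ABCBDAB", "BDCABA")` evaluates to 4, and a common subsequence of length 3 for the same pair gives a performance ratio of 0.75.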

目次 Table of Contents
LIST OF FIGURES
LIST OF TABLES
中文摘要 (Chinese Abstract)
ABSTRACT
1 Introduction
2 Preliminaries
  2.1 Notations
  2.2 The Longest Common Subsequence Problem
  2.3 Dynamic Programming Algorithm for 2-LCS
  2.4 The Multiple Sequence Alignment Problem
  2.5 The Genetic Algorithm
3 Previous Work of the k-LCS Problem
  3.1 Motivation
  3.2 Related Works
    3.2.1 The Expansion Algorithm
    3.2.2 The Best Next for Maximal Available Symbols
    3.2.3 Starting from Scratch: Growing LCS with Evolution
    3.2.4 ACO for k-LCS
4 Our Genetic Algorithm for k-LCS
5 Experimental Results and Discussion
  5.1 Experimental Results
  5.2 Discussions
6 Conclusion
BIBLIOGRAPHY

電子全文 Fulltext
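This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission	校內立即公開，校外一年後公開 Released on campus immediately; released off campus after one year
開放時間 Available	校內 Campus：已公開 available	校外 Off-campus：已公開 available
etd-0106109-080018.pdf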
紙本論文 Printed copies
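Availability information for printed theses is relatively complete from academic year 102 onward. For printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services.
開放時間 Available	已公開 available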
