國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,尋找最相似序列之剪枝搜尋演算法,A Prune-and-Search Algorithm for Finding the Most Similar Sequence

論文名稱 Title	尋找最相似序列之剪枝搜尋演算法 A Prune-and-Search Algorithm for Finding the Most Similar Sequence
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	106 學年度第 2 學期 The spring semester of Academic Year 106	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	75
研究生 Author	洪修傑 Hsiu-Chieh Hung
指導教授 Advisor	楊昌彪 Chang-Biau Yang
召集委員 Convenor	陳嘉平 Chia-Ping Chen
口試委員 Advisory Committee	黃國璽, 彭永興, 何秋誼 Kuo-Si Huang; Yung-Hsing Peng; Chiou-Yi Hor
口試日期 Date of Exam	2018-07-30	繳交日期 Date of Submission	2018-08-02
關鍵字 Keywords	L1範數、曼哈頓距離、三角不等式、編輯距離、最相似序列、文字序列資料庫 L1-norm, Manhattan distance, triangle inequality, character-sequence database, edit distance, the most similar sequence

中文摘要
如果想用循序搜尋法在文字序列資料庫中找出最相似的序列，是非常耗時的。在過去的研究中，都沒有任何關於循序搜尋法或者其它搜尋順序方法的機率理論分析。因此本論文將以理論的形式來探討，當給定一條查詢序列時，資料庫中的搜尋順序。為了減少搜索空間，我們計算出查詢序列在曼哈頓距離模型上的搜尋機率，以找出下一條需要比較的序列。根據這個搜尋機率，第三條以及之後需要比較的序列都能被決定出來。因此，我們提出了一個搜尋策略⟨0.81, 1⟩來找出最相似的序列，0.81 和1 都分別代表參考序列和查詢序列之間的編輯距離倍數。我們的搜尋策略也利用了編輯距離的三角不等式來加快搜尋速度。透過這種方式，我們的搜尋策略就能把不須比較的序列刪除以加快搜尋的效率。
Abstract
Finding the most similar sequence in a character-sequence database is very time-consuming when a one-by-one sequential search is used. Previous studies offer no theoretical analysis of the probability behavior of sequential search or of any other search order. This thesis gives a theoretical treatment of the order in which database sequences should be compared with a given query sequence. To reduce the search space, the search probability for the query sequence is calculated under the Manhattan distance model and used to select the next sequence to compare. On this basis, the third and all subsequent compared sequences can be determined. A search strategy with parameters ⟨0.81, 1⟩ is therefore proposed for finding the most similar sequence, where 0.81 and 1 are multipliers of the edit distance between the reference sequence and the query sequence. The strategy also exploits the triangle inequality of the edit distance to speed up the search, which lets it prune sequences that need not be compared.
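The pruning idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's ⟨0.81, 1⟩ strategy, whose details are not given on this page. It only shows the generic triangle-inequality bound |d(q, r) - d(r, s)| ≤ d(q, s) for a query q, a reference sequence r, and a database sequence s. The names `edit_distance`, `most_similar`, and `ref_index` are hypothetical, and the choice of the first database entry as the reference is an assumption made for illustration.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute), row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def most_similar(query: str, database: List[str],
                 ref_index: int = 0) -> Tuple[str, int, int]:
    """Return (best sequence, its distance, number of query comparisons).

    Assumption for illustration: one reference sequence, and its distances
    to all database sequences are computed up front (offline in practice).
    """
    ref = database[ref_index]
    ref_dists = [edit_distance(ref, s) for s in database]

    d_qr = edit_distance(query, ref)
    best_seq, best_dist, compared = ref, d_qr, 1

    # Triangle inequality: d(query, s) >= |d(query, ref) - d(ref, s)|.
    # Visit candidates in ascending order of this lower bound.
    order = sorted((abs(d_qr - d_rs), k)
                   for k, d_rs in enumerate(ref_dists) if k != ref_index)
    for lower_bound, k in order:
        if lower_bound >= best_dist:
            break  # no remaining candidate can be strictly closer
        d = edit_distance(query, database[k])
        compared += 1
        if d < best_dist:
            best_seq, best_dist = database[k], d
    return best_seq, best_dist, compared


if __name__ == "__main__":
    db = ["kitten", "sitting", "mitten", "bitten", "abcdefgh"]
    print(most_similar("fitten", db))
```

Sorting by the lower bound means the loop can stop at the first candidate whose bound reaches the best distance found so far, so every later candidate is pruned without an edit-distance computation. Ties between equally close sequences are resolved in favor of the one found first.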

目次 Table of Contents
THESIS VERIFICATION FORM
THESIS AUTHORIZATION FORM
ACKNOWLEDGEMENTS
CHINESE ABSTRACT
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS
Chapter 1. Introduction
Chapter 2. Preliminaries
2.1 The Longest Common Subsequence Problem
2.2 The Edit Distance
2.3 The Triangle Inequality of the Edit Distance
2.4 The Lp-norm and Manhattan Distance
2.5 Finding the Most Similar Sequence in a Character-sequence Database
Chapter 3. The Prune-and-Search Algorithm
3.1 The First Compared Sequence
3.2 The Second Compared Sequence
3.2.1 The Intersection of Two Circles
3.2.2 The Expected Value of the Manhattan Distance Model
3.3 The Third and the Subsequent Sequences
3.4 The Algorithm for the MSS Problem
Chapter 4. Experimental Results
Chapter 5. Conclusion
BIBLIOGRAPHY

電子全文 Fulltext
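This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission：自定論文開放時間 user-defined
開放時間 Available：校內 Campus：已公開 available；校外 Off-campus：已公開 available
etd-0702118-135347.pdf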
紙本論文 Printed copies
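Availability information for printed theses is relatively complete from Academic Year 102 onward. To look up availability information for printed theses from Academic Year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available：已公開 available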
