Title page for etd-0702118-135347
Title
尋找最相似序列之剪枝搜尋演算法
A Prune-and-Search Algorithm for Finding the Most Similar Sequence
Department
Year, semester
Language
Degree
Number of pages
75
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2018-07-30
Date of Submission
2018-08-02
Keywords
L1-norm, Manhattan distance, triangle inequality, character-sequence database, edit distance, the most similar sequence
Statistics
This thesis/dissertation has been browsed 5673 times and downloaded 50 times.
Abstract
Finding the most similar sequence in a character-sequence database is very time-consuming when a one-by-one sequential search is applied. Previous studies offer no probabilistic analysis of the sequential search or of any other search order. This thesis studies, from a theoretical standpoint, the order in which the sequences of a database should be searched for a given query sequence. To reduce the search space, the search probability for the query sequence under the Manhattan distance model is calculated and used to select the next sequence to compare. Accordingly, the third and all subsequent compared sequences can be determined. Hence, a search strategy with parameter ⟨0.81, 1⟩ is proposed for finding the most similar sequence, where 0.81 and 1 are multipliers of the edit distance between the reference sequence and the query sequence. The strategy also uses the triangle inequality of the edit distance to accelerate the search. In this way, it improves search efficiency by pruning sequences that do not need to be compared.
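The triangle-inequality pruning mentioned in the abstract can be illustrated with a minimal Python sketch. It does not reproduce the thesis's ⟨0.81, 1⟩ strategy or its probability-based search order; it only shows the generic lower bound |d(q, r) - d(r, s)| ≤ d(q, s), which holds for the unit-cost edit distance d. The function names `edit_distance` and `most_similar`, the use of a single fixed reference sequence, and the assumption that reference-to-database distances are precomputed are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit (Levenshtein) distance via the classic dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def most_similar(query: str, database: List[str],
                 ref_index: int = 0) -> Tuple[str, int, int]:
    """Return (best sequence, its distance, number of query comparisons).

    Assumption: one fixed reference sequence is used, and its distances to
    all database sequences could be precomputed offline.
    """
    ref = database[ref_index]
    ref_dist = [edit_distance(ref, s) for s in database]  # offline step

    d_qr = edit_distance(query, ref)
    best_seq, best_dist = ref, d_qr
    comparisons = 1

    for i, s in enumerate(database):
        if i == ref_index:
            continue
        # Triangle inequality: d(query, s) >= |d(query, ref) - d(ref, s)|.
        lower_bound = abs(d_qr - ref_dist[i])
        if lower_bound >= best_dist:
            continue  # s cannot beat the current best, so skip it
        d = edit_distance(query, s)
        comparisons += 1
        if d < best_dist:
            best_seq, best_dist = s, d
    return best_seq, best_dist, comparisons


if __name__ == "__main__":
    db = ["banana", "bandana", "cabana", "xxxxxxxxxxxxxxxx"]
    print(most_similar("banan", db))
```

Because a sequence is skipped only when its lower bound already matches or exceeds the best distance found so far, the result has the same distance as an exhaustive scan; only the number of full edit-distance computations against the query changes.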
Table of Contents
THESIS VERIFICATION FORM
THESIS AUTHORIZATION FORM
ACKNOWLEDGEMENTS
CHINESE ABSTRACT
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF SYMBOLS
Chapter 1. Introduction
Chapter 2. Preliminaries
2.1 The Longest Common Subsequence Problem
2.2 The Edit Distance
2.3 The Triangle Inequality of the Edit Distance
2.4 The Lp-norm and Manhattan Distance
2.5 Finding the Most Similar Sequence in a Character-sequence Database
Chapter 3. The Prune-and-Search Algorithm
3.1 The First Compared Sequence
3.2 The Second Compared Sequence
3.2.1 The Intersection of Two Circles
3.2.2 The Expected Value of the Manhattan Distance Model
3.3 The Third and the Subsequent Sequences
3.4 The Algorithm for the MSS Problem
Chapter 4. Experimental Results
Chapter 5. Conclusion
BIBLIOGRAPHY
References
[1] A. Aggarwal, L. J. Guibas, J. Saxe, and P. W. Shor, "A linear-time algorithm for computing the Voronoi diagram of a convex polygon," Discrete & Computational Geometry, Vol. 4, No. 6, pp. 591-604, 1989.
[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, Vol. 215, No. 2, pp. 403-410, 1990.
[3] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Research, Vol. 25, No. 17, pp. 3389-3402, 1997.
[4] O. Arbell, G. Landau, and J. Mitchell, "Edit distance of run-length encoded strings," Information Processing Letters, Vol. 83, No. 6, pp. 307-314, 2002.
[5] P. Bille, "A survey on tree edit distance and related problems," Theoretical Computer Science, Vol. 337, No. 1-3, pp. 217-239, 2005.
[6] L. Chen and R. Ng, "On the marriage of Lp-norms and edit distance," Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, pp. 792-803, 2004.
[7] F. Y. L. Chin, A. D. Santis, A. L. Ferrara, N. L. Ho, and S. K. Kim, "A simple algorithm for the constrained sequence problems," Information Processing Letters, Vol. 90, No. 4, pp. 175-179, 2004.
[8] F. Corpet, "Multiple sequence alignment with hierarchical clustering," Nucleic Acids Research, Vol. 16, No. 22, pp. 10881-10890, 1988.
[9] R. J. Hathaway, J. C. Bezdek, and Y. K. Hu, "Generalized fuzzy c-means clustering strategies using Lp norm distances," IEEE Transactions on Fuzzy Systems, Vol. 8, No. 5, pp. 576-582, 2000.
[10] D. G. Higgins, A. J. Bleasby, and R. Fuchs, "CLUSTAL V: improved software for multiple sequence alignment," Bioinformatics, Vol. 8, No. 2, pp. 189-191, 1992.
[11] D. G. Higgins and P. M. Sharp, "CLUSTAL: a package for performing multiple sequence alignment on a microcomputer," Gene, Vol. 73, No. 1, pp. 237-244, 1988.
[12] D. S. Hirschberg, "A linear space algorithm for computing maximal common subsequences," Communications of the ACM, Vol. 18, No. 6, pp. 341-343, 1975.
[13] D. S. Hirschberg, "Algorithms for the longest common subsequence problem," Journal of the ACM, Vol. 24, No. 4, pp. 664-675, 1977.
[14] C. S. Iliopoulos and M. S. Rahman, "New efficient algorithms for LCS and constrained LCS problem," Information Processing Letters, Vol. 106, No. 1, pp. 13-18, 2008.
[15] E. Keogh and C. A. Ratanamahatana, "Exact indexing of dynamic time warping," Knowledge and Information Systems, Vol. 7, No. 3, pp. 358-386, 2005.
[16] E. J. Keogh and M. J. Pazzani, "Derivative dynamic time warping," Proceedings of the First Society for Industrial and Applied Mathematics International Conference on Data Mining, Vol. 1, Chicago, IL, USA, pp. 5-7, 2001.
[17] M. A. Larkin, G. Blackshields, N. P. Brown, R. Chenna, P. A. McGettigan, H. McWilliam, F. Valentin, I. M. Wallace, A. Wilm, R. Lopez, J. D. Thompson, T. J. Gibson, and D. G. Higgins, "Clustal W and Clustal X version 2.0," Bioinformatics, Vol. 23, No. 21, pp. 2947-2948, 2007.
[18] W. J. Masek and M. S. Paterson, "A faster algorithm computing string edit distances," Journal of Computer and System Sciences, Vol. 20, No. 1, pp. 18-31, 1980.
[19] S. McGinnis and T. L. Madden, "BLAST: at the core of a powerful and diverse set of sequence analysis tools," Nucleic Acids Research, Vol. 32, pp. W20-W25, 2004.
[20] G. Navarro, "A guided tour to approximate string matching," ACM Computing Surveys, Vol. 33, No. 1, pp. 31-88, 2001.
[21] S. Park, W. W. Chu, J. Yoon, and C. Hsu, "Efficient searches for similar subsequences of different lengths in sequence databases," Proceedings of 16th International Conference on Data Engineering, San Diego, USA, pp. 23-32, 2000.
[22] E. Ristad and P. Yianilos, "Learning string-edit distance," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 5, pp. 522-532, 1998.
[23] S. Salvador and P. Chan, "Toward accurate dynamic time warping in linear time and space," Intelligent Data Analysis, Vol. 11, No. 5, pp. 561-580, 2007.
[24] F. Sievers, A. Wilm, D. Dineen, T. J. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert, J. Söding, J. D. Thompson, and D. G. Higgins, "Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega," Molecular Systems Biology, Vol. 7, p. 539, 2011.
[25] J. D. Thompson, T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins, "The CLUSTAL X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools," Nucleic Acids Research, Vol. 25, No. 24, pp. 4876-4882, 1997.
[26] J. D. Thompson, D. G. Higgins, and T. J. Gibson, "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice," Nucleic Acids Research, Vol. 22, No. 22, pp. 4673-4680, 1994.
[27] R. Wagner and M. Fischer, "The string-to-string correction problem," Journal of the ACM, Vol. 21, No. 1, pp. 168-173, 1974.
[28] M. S. Waterman, T. F. Smith, and W. A. Beyer, "Some biological sequence metrics," Advances in Mathematics, Vol. 20, No. 3, pp. 367-387, 1976.
[29] S. Wu, U. Manber, and G. Myers, "An O(NP) sequence comparison algorithm," Information Processing Letters, Vol. 35, pp. 317-323, 1990.
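Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid breaking the law.
Thesis access permission: user-defined availability date
Available:
Campus: available
Off-campus: available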


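Printed copies
Public-access information for printed theses is relatively complete from academic year 102 onward. To inquire about public-access information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available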
