國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,一個以雜湊過濾的方法來支援基因資料庫的近似比對,A Hash Trie Filter Approach to Approximate String Match for Genomic Databases

論文名稱 Title	一個以雜湊過濾的方法來支援基因資料庫的近似比對 A Hash Trie Filter Approach to Approximate String Match for Genomic Databases
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	93 學年度第 2 學期 The spring semester of Academic Year 93	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	110
研究生 Author	徐敏哲 Min-tze Hsu
指導教授 Advisor	張玉盈 Ye-in Chang
召集委員 Convenor	黃三益 San-yi Huang
口試委員 Advisory Committee	李建億 Chien-i Lee
口試日期 Date of Exam	2005-06-17	繳交日期 Date of Submission	2005-06-28
關鍵字 Keywords	區域順序、過濾器方法、全域順序、近似字串搜尋、基因序列資料庫 global order, genomic sequence databases, local order, approximate string match, filter methods

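中文摘要 Chinese Abstract
Genomic sequence databases such as GenBank and EMBL are frequently used by molecular biologists to search for homologous similarity. The sequences stored in these databases range from thousands to millions of characters each, and their number grows at a striking rate every year, so a good index data structure combined with an efficient search method has become an urgent need. Genomic sequences are composed of four nucle…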
Abstract
Genomic sequence databases such as GenBank and EMBL are widely used by molecular biologists for homology searching. Because each genomic sequence is long and the databases keep growing, efficient search methods for fast queries are increasingly important. DNA sequences are composed of four kinds of nucleotides, so a genomic sequence can be regarded as a text string. However, a genomic sequence has no concept of words, which makes searching a genomic database much more difficult. For genomic sequences we consider Approximate String Matching (ASM) with k errors, where the k errors are caused by insertion, deletion, and replacement operations. Filtration of the DNA sequence is a widely adopted technique for reducing the number of text areas (i.e., candidates) that need further verification. Most filter methods first split the database sequence into q-grams. A sequence of grams (subpatterns) that matches some part of the text is passed on as a candidate. Matching the grams against parts of the text can be sped up by using an index structure for exact match. The candidates are then examined by dynamic programming to obtain the final result. However, most previous ASM methods consider only the local order within each gram; only the (k + s) h-samples filter considers the global order of the sequence of matched grams. Although the (k + s) h-samples filter preserves this global order, it still has some disadvantages. First, for a text area to be a candidate in the (k + s) h-samples filter, the number of ordered matched grams, s, is always fixed at 2, which results in low precision. Second, the (k + s) h-samples filter builds the index for the query patterns at query time. In this thesis, we propose a new approximate string matching method, the hash trie filter, for efficient searching in genomic databases.
We build a hash trie at pre-computing time for the genomic sequences stored in the database. Although the size q of each split gram is decided by the same formula used in the (k + s) h-samples filter, we propose a different way to find the ordered subpatterns in the text T. Moreover, we reduce the number of candidates by pruning unreasonable matched positions. Furthermore, unlike the (k + s) h-samples filter, which always uses s = 2 to decide whether s matched subpatterns form a candidate, our method decides s dynamically, which increases precision. The simulation results show that our hash trie filter outperforms the (k + s) h-samples filter in terms of response time, number of verified candidates, and precision under different query pattern lengths and different error levels.
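The general filter-then-verify idea described in the abstract can be illustrated with a minimal Python sketch. This is not the hash trie filter of the thesis, nor the (k + s) h-samples filter: it is the simpler, well-known partition (pigeonhole) q-gram filter, which ignores the global order of matched grams. It assumes a plain dictionary as the exact-match index and uses Sellers' dynamic programming for verification. The function names `build_qgram_index`, `ends_within_k`, and `search`, and the choice q = m // (k + 1), are illustrative assumptions and do not come from the thesis.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Exact-match index: map each q-gram of the text to its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def ends_within_k(pattern, text, k):
    """Sellers' dynamic programming.

    Returns {end: dist} for every prefix length `end` of `text` at which
    some substring ending there is within `k` edits (insertion, deletion,
    replacement) of `pattern`.
    """
    m = len(pattern)
    col = list(range(m + 1))          # column for the empty text prefix
    hits = {}
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]            # D[0][j-1], always 0
        col[0] = 0                    # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            new = min(prev_diag + cost,   # match or replacement
                      col[i] + 1,         # extra character in the text
                      col[i - 1] + 1)     # pattern character missing from the text
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            hits[j] = col[m]
    return hits


def search(pattern, text, k):
    """Filter with k + 1 disjoint pattern pieces, then verify each candidate."""
    m, n = len(pattern), len(text)
    q = m // (k + 1)                  # assumed piece length (pigeonhole rule)
    if q == 0:
        raise ValueError("pattern is too short for k errors")
    # In a real system the index would be built once, before any query.
    index = build_qgram_index(text, q)
    windows = set()
    for piece in range(k + 1):
        offset = piece * q
        for pos in index.get(pattern[offset:offset + q], ()):
            # An occurrence containing this exact piece lies in this window.
            lo = max(0, pos - offset - k)
            hi = min(n, pos - offset + m + k)
            windows.add((lo, hi))
    results = {}
    for lo, hi in sorted(windows):
        for end, dist in ends_within_k(pattern, text[lo:hi], k).items():
            key = lo + end
            results[key] = min(dist, results.get(key, dist))
    return results                    # {end position in text: edit distance}
```

For example, `search("ACGTTCGT", "GGGGACGTACGTGGGG", 1)` finds the piece `ACGT` in the index, verifies only the windows around those hits, and reports a match ending at text position 12 with one error. Because k errors can damage at most k of the k + 1 disjoint pieces, at least one piece of any true occurrence survives unchanged, so the filter never discards a real match; its weakness is precision, which is the aspect the thesis addresses through ordered subpatterns and a dynamically chosen s.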

目次 Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
1.1 Genomics
1.2 Query Types of Genomic Databases
1.3 String Matching Methods
1.3.1 Indexing Methods for Exact Match
1.3.2 Approximate String Match (ASM)
1.4 Motivations
1.5 Organization of the Thesis
2. A Survey
2.1 Linear Scan Methods
2.2 Index Structures for ASM
2.2.1 Inverted Indices
2.2.2 The Suffix Tree
2.2.3 ASM Methods on Index Structures
2.3 Filter Methods for ASM
2.3.1 The q-gram Filter
2.3.2 LET & SET Filters
2.3.3 The l-tuple Filter
2.3.4 The h-samples Filter
2.3.5 The (k+s) h-samples Filter
2.3.6 The Counting Filter
3. A Hash Trie Filter Approach for ASM
3.1 The Definition
3.2 The Construction of a Hash Trie
3.3 Query and Search
3.3.1 Step 1: Deciding the Length q0 of a Subpattern
3.3.2 Step 2: Construct the Table PT for the Subpatterns
3.3.3 Traversing the Hash Trie
3.3.4 Step 3: The Pruning Step
3.3.5 Step 4: Finding the Ordered Subpatterns
3.3.6 Step 5: Verification with Dynamic Programming
4. Performance
4.1 The Real DNA Sequences
4.2 Simulation Results
4.2.1 Time for Constructing a Hash Trie
4.2.2 Performance under Short Query Patterns
4.2.3 Performance under Long Query Patterns
4.2.4 Performance under Low Error Levels
4.2.5 Performance under High Error Levels
4.3 Analysis
5. Conclusion
5.1 Summary
5.2 Future Works
BIBLIOGRAPHY


電子全文 Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. 論文使用權限 Thesis access permission: 校內校外均不公開 not available on or off campus. 開放時間 Available: 校內 Campus: 永不公開 not available; 校外 Off-campus: 永不公開 not available.
紙本論文 Printed copies
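Public information on printed theses is relatively complete from Academic Year 102 onward. For printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. 開放時間 Available: 已公開 available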

