National Sun Yat-sen University, thesis/dissertation, An ACGT-Words Tree for Efficient Data Access in Genomic Databases

Title	An ACGT-Words Tree for Efficient Data Access in Genomic Databases (一個有效基因資料截取的 ACGT-Words 樹)
Department	Department of Computer Science and Engineering
Year, semester	Spring semester of Academic Year 91	Language	English
Degree	Master	Number of pages	99
Author	Jen-Wei Hu (胡仁維)
Advisor	Ye-In Chang (張玉盈)
Convenor	Tei-Wei Kuo (郭大維)
Advisory Committee	Chien-I Lee (李建億); San-Yi Huang (黃三益)
Date of Exam	2003-06-20	Date of Submission	2003-07-25
Keywords	DNA sequence, genomic databases, indexing, suffix tree, suffix array
Statistics	This thesis has been browsed 5679 times and downloaded 2283 times.

Abstract
Genomic sequence databases, such as GenBank and EMBL, are widely used by molecular biologists for homology searching. As these databases grow, indexing the sequences for fast queries becomes increasingly important. DNA sequences are composed of four kinds of nucleotides, so genomic sequences can be regarded as text strings. As in conventional databases, some approaches use indexes to provide efficient access to the data. The inverted-list indexing approach uses hashing to store the database sequences. However, a perfect hash function is difficult to construct, and collisions in the hash table may occur frequently. In contrast to the inverted-list approach, other data structures, such as the suffix tree, the suffix array, and the suffix binary search tree, have been used to index genomic sequences. A common characteristic of these suffix-tree-like data structures is that they store all suffixes of the sequences and do not break the sequences into words. The suffix tree is conceptually simple and easy to construct, but it requires too much storage space. The suffix array and the suffix binary search tree require less storage space than the suffix tree, but because they use binary search to find the query sequence, they spend too much time on searching. Another data structure, the word suffix tree, uses the concept of words and stores only some of the suffixes to index the DNA sequence. Although the word suffix tree reduces the storage space, it can miss information during the search process.

In this thesis, we propose a new index structure, the ACGT-Words tree, to efficiently support query processing in genomic databases. We define a concept of words that differs from the word definition used in the word suffix tree, and we use it to separate both the DNA sequences stored in the database and the query sequence into distinct words. Our approach does not store all of the suffixes of the database sequences and therefore needs less space than the suffix tree approach. We also propose an efficient search algorithm for sequence matching based on the ACGT-Words tree, which takes less time to finish a search than the suffix array approach. Our approach also avoids the missed matches that can occur with the word suffix tree.

Based on the ACGT-Words tree, we then propose one improved operation for data insertion and two improved operations for searching. In the improved insertion operation, we first sort the generated ACGT-Words and use this preprocessing step to speed up construction of the tree structure. The two improved search operations provide better performance when the query sequence satisfies certain conditions. The simulation results show that the ACGT-Words tree outperforms the suffix tree in terms of storage and the suffix array in terms of processing time. Moreover, the improved operations on the ACGT-Words tree take less time for construction and searching than the original operations or the suffix array.
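
The suffix array baseline that the abstract compares against can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's code and not the ACGT-Words tree itself, whose word definition is not given in this record. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the construction is a naive sort chosen for brevity rather than an efficient algorithm. It shows why every query on a suffix array costs a binary search over all suffixes of the sequence.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order."""
    # Naive construction: sort suffix start positions by the suffix they denote.
    # Good enough for a short illustration; real indexes use faster algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions at which `pattern` occurs in `text`."""
    m = len(pattern)
    if m == 0:
        return []

    # First binary search: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Second binary search: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with the pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    dna = "ACGTACGA"
    sa = build_suffix_array(dna)
    print(find_occurrences(dna, sa, "ACG"))
```

Each of the two binary searches takes O(log n) steps over the n suffixes and compares up to m characters per step, so a query of length m costs O(m log n) character comparisons. This is the search cost that the abstract's comparison against the suffix array refers to.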

Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
  1.1 Genomics
  1.2 Query Types of Genomic Databases
  1.3 Indexing Methods
  1.4 Motivations
  1.5 Organization of the Thesis
2. A Survey
  2.1 Inverted Indices
  2.2 Suffix Tries
  2.3 Suffix Trees
    2.3.1 The Definition
    2.3.2 Construction
    2.3.3 Searching
  2.4 Suffix Arrays
    2.4.1 The Definition
    2.4.2 Construction
    2.4.3 Searching
  2.5 Suffix Binary Search Trees
  2.6 Word Suffix Trees
    2.6.1 The Definition
    2.6.2 Construction
3. The ACGT-Words Tree
  3.1 The Definition
  3.2 Tree Construction
  3.3 Search
4. Improvements of Operations in an ACGT-Words Tree
  4.1 Improvement of the Insertion Operation
  4.2 Improvements of the Search Operation
    4.2.1 Case 1
    4.2.2 Case 2
5. Performance
  5.1 Generation of Synthetic Data
  5.2 Simulation Result
    5.2.1 The Suffix Tree vs. the ACGT-Words Tree
    5.2.2 The Suffix Array vs. the ACGT-Words Tree
    5.2.3 The ACGT-Words Tree vs. the Improved Insertion Operation
    5.2.4 The ACGT-Words Tree vs. the Improved Search Operations
  5.3 Performance Result for the Input from Genomic Databases
6. Conclusion
  6.1 Summary
  6.2 Future Work


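Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: released on and off campus after one year. Available: Campus: available; Off-campus: available. etd-0725103-114311.pdf
Printed copies
Availability information for printed theses is relatively complete from Academic Year 102 onward. To look up availability information for printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available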


Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS