論文名稱 Title	一個於蛋白質資料庫中有效率地基於位元模式來探勘順序項目之方法 An Efficient Bit-Pattern-Based Algorithm for Mining Sequential Patterns in Protein Databases
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	97 學年度第 2 學期 The spring semester of Academic Year 97	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	86
研究生 Author	鄭尹涵 Yin-han Jeng
指導教授 Advisor	張玉盈 Ye-In Chang
召集委員 Convenor	陳健輝 Gen-Huey Chen
口試委員 Advisory Committee	黃三益, 李建億 San-Yi Huang; Chien-I Lee
口試日期 Date of Exam	2009-06-12	繳交日期 Date of Submission	2009-06-26
關鍵字 Keywords	蛋白質資料庫、位元模式、順序項目 Sequential Patterns, Protein Databases, Bit-Pattern-Based
統計 Statistics	本論文已被瀏覽 5728 次，被下載 2144 次 The thesis/dissertation has been browsed 5728 times, has been downloaded 2144 times.

中文摘要
蛋白質是生物細胞與組織的結構元素，它對於生命有機體是很重要的建構元素。重覆片段是指在蛋白質中出現的頻率夠大的某些片段。這樣的片段可以定義出蛋白質中重要的功能區域，分辨此蛋白質的家族，及找出此蛋白質的功能。此外，它也在物種的進化上提供了有價值的資訊。重覆片段包含不定長度的間隔。考慮到在蛋白質資料庫上探勘沒有間隔長度限制的連續重覆片段問題，我們也許可以使用搜尋連續重覆片段的演算法。然而，在蛋白質資料庫中，短片段在蛋白質序列的出現順序非常重要，而且這些短片段有可能重覆出現在一個蛋白質序列相當多次。我們不能直接使用傳統的連續重覆片段搜尋方法來在探勘。在蛋白質資料庫上在探勘連續出現的重覆片段已發展出相當多的演算法，例如，SP-index演算法。這些演算法是先在解答空間中列舉出所有限制長度的重覆片段(短片段)，再來找出所有的連續重覆片段。SP-index演算法是基於傳統的連續重覆片段搜尋方法，且考慮重覆片段會重覆出現在同一序列上的問題。雖然SP-index演算法考慮到生物資訊的特性，但它仍然包含了很費時的步驟，也就是建SP-tree來找出高頻率的重覆片段這個步驟上。在這個步驟，此演算法必須去追蹤很多節點，才能得到結果。因此，在這篇論文中，我們提出一個以位元模式為基礎的演算法去改進SP-index演算法的缺點。首先，我們將蛋白質序列轉換成位元序列。再來，我們利用AND運算元去得到高頻率短片段。因為我們使用了位元運算方法，這使得我們可以有效率地得到所有的高頻率短片段。然後，我們將不必要的短片段移除，這樣的做法導致當進入最後一步驟時，我們不需要測試太多的高頻率短片段。最後，我們用OR運算法得到最長的重覆片段。在這個步驟中，我們測試兩個短片段是否可以被連接來建構長片段，而且只要測試一次就可得到結果。因為我們只把焦點放在短片段出現的位置，我們只需要用OR運算元來判斷算出來的位元序列，便可得到結果。使用這樣的方式，我們可以避免很多的測試過程。根據測試真實生物資料的結果，可以得知，我們可以改善SP-index演算法的效率。此外，根據模擬的結果，我們可以得知，由於SP-index演算法在產生連續重覆片段上需要測試多個節點來得到結果，所以我們提出的演算法所需要的處理時間會比SP-index演算法來得短。
Abstract
Proteins are the structural components of living cells and tissues, and thus an important building block of all living organisms. Patterns in protein sequences are subsequences that appear frequently. Such patterns often denote important functional regions in proteins and can be used to characterize a protein family or to discover the function of a protein. Moreover, they provide valuable information about the evolution of species. Patterns may contain gaps of arbitrary size. To solve the no-gap-limit sequential pattern problem in a protein database, we might consider using an algorithm for mining sequential patterns. However, in a protein database, the order in which segments appear in a protein sequence is important, and a segment may appear many times in the same protein sequence. Therefore, we cannot directly use traditional sequential pattern mining algorithms. Many algorithms have been proposed to mine sequential patterns in protein databases, for example, the SP-index algorithm. They first enumerate patterns of limited size (segments) in the solution space and then find all patterns. The SP-index algorithm is based on the traditional sequential pattern mining algorithms and takes into account the multiple appearances of a segment in a protein sequence. Although the SP-index algorithm considers the characteristics of biological data, it still contains a time-consuming step, which constructs the SP-tree to find the frequent patterns. In this step, it has to trace many nodes to get the result. Therefore, in this thesis, we propose a Bit-Pattern-based (BP) algorithm to overcome this disadvantage of the SP-index algorithm. First, we transform the protein sequences into bit sequences. Second, we construct the frequent segments by using the AND operator; because this is a bitwise operation, the frequent segments can be obtained efficiently. We then prune unnecessary frequent segments, so that fewer frequent segments need to be tested in the following step. Third, we use the OR operator to get the longest pattern. In this step, we test whether two segments can be linked together to construct a longer segment, and a single test gives the result. Because we focus only on the positions at which a segment appears, we can apply the OR operator and then examine the resulting bit sequence to get the result. Thus, we avoid many testing processes. Our performance study on real biological data shows that the BP algorithm is more efficient than the SP-index algorithm. Moreover, our simulation results show that the proposed algorithm can reduce the processing time by up to 50% compared with the SP-index algorithm, since the SP-index algorithm has to trace many nodes to construct the longest pattern.
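The first two steps of the abstract (bit sequences, then AND) can be illustrated with a minimal Python sketch. It is not the thesis's Procedure Bit Pattern or Procedure BitAND, whose details are not given in this record. It assumes one bitmask per residue per sequence, with bit i standing for position i, so that "shifting by k positions" becomes a right shift of the integer. The function names and the example sequences are hypothetical.

```python
from collections import defaultdict


def to_bit_sequences(seq):
    """Step 1 (assumed encoding): one integer bitmask per residue.

    Bit i of bits[r] is set when seq[i] == r.
    """
    bits = defaultdict(int)
    for i, residue in enumerate(seq):
        bits[residue] |= 1 << i
    return dict(bits)


def extend_segment(seg_bits, seg_len, residue_bits):
    """Step 2 (assumed): append one residue to a segment with a shift and an AND.

    seg_bits marks the start positions of a segment of length seg_len.
    Shifting residue_bits right by seg_len aligns "residue at i + seg_len"
    with "segment starts at i", so the AND keeps only the valid starts.
    """
    return seg_bits & (residue_bits >> seg_len)


def segment_bits(seq_bits, segment):
    """Start-position bitmask of a contiguous segment such as "ACD"."""
    acc = seq_bits.get(segment[0], 0)
    for k, residue in enumerate(segment[1:], start=1):
        acc = extend_segment(acc, k, seq_bits.get(residue, 0))
    return acc


def start_positions(bits):
    """Decode a bitmask back into a sorted list of start positions."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def support(database, segment):
    """Number of sequences that contain the segment at least once."""
    return sum(
        1 for seq in database if segment_bits(to_bit_sequences(seq), segment)
    )


if __name__ == "__main__":
    db = ["ABACDAB", "ACDB", "BBAB"]  # hypothetical toy database
    bits = to_bit_sequences(db[0])
    print(start_positions(segment_bits(bits, "AB")))   # [0, 5]
    print(start_positions(segment_bits(bits, "ACD")))  # [2]
    print(support(db, "AB"), support(db, "ACD"))       # 2 2
```

Because a segment can occur several times in one sequence, the bitmask keeps every start position, while the support count only asks whether the mask is non-zero. The OR-based linking of Step 3 is not sketched here, since the abstract does not specify how the resulting bit sequence is judged.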

目次 Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
  1.1 Proteins
  1.2 Patterns in Protein Sequences
    1.2.1 Examples of Patterns
    1.2.2 Definitions of Protein Sequence Patterns
  1.3 Discovering Patterns
    1.3.1 Extracting Deterministic Motifs
    1.3.2 Extracting Probability Motifs
  1.4 Motivation
  1.5 Organization of Thesis
2. A Survey of Protein Motif Discovering Algorithms
  2.1 Deterministic Motifs Finding
    2.1.1 The PrefixSpan Algorithm
    2.1.2 The SPAM Algorithm
    2.1.3 The SP-index Algorithm
  2.2 Probabilistic Motifs Finding
    2.2.1 The Gibbs Sampling Algorithm
    2.2.2 The EM Algorithm
    2.2.3 The MEME Algorithm
3. The BP Algorithm
  3.1 Definitions and Problem Statement
  3.2 The Proposed Algorithm
    3.2.1 Step 1: Transforming Protein Sequences into Bit Sequences
    3.2.2 Step 2: Finding Frequent Segments
      3.2.2.1 Using the AND Operator to Find All Frequent Segments
      3.2.2.2 Pruning Unnecessary Frequent Segments
    3.2.3 Step 3: Finding the Longest Pattern Using the OR Operator
    3.2.4 A Comparison
4. Performance
  4.1 Synthetic Data and Biological Data
  4.2 Biological Data
  4.3 Synthetic Data
5. Conclusion
  5.1 Summary
  5.2 Future Work
BIBLIOGRAPHY

LIST OF FIGURES
Figure
1.1 The structures of proteins
1.2 The motifs in protein sequences
1.3 A PROSITE pattern
1.4 A classification of motifs in protein sequences
1.5 Example 1: Searching patterns on the SP-tree in the SP-index algorithm
1.6 The pattern tree in the SP-index algorithm
1.7 Example 2: Searching segment {ACD} on the SP-tree in the SP-index algorithm
2.1 The lexicographic sequence tree
2.2 The SPAM algorithm with pruning
2.3 The S-step processing on sequence bitmap ({a})
2.4 The I-step processing on sequence bitmap ({a},{b})
2.5 The SP-index in Example for database D shown in Table 2.3
2.6 The segment tree
2.7 The pattern tree
2.8 The pseudo-code of the EM algorithm
2.9 The pseudo-code of MEME algorithm
3.1 Procedure Bit Pattern
3.2 Procedure BitAND
3.3 The process of finding the bit sequences of segment {AB} in Step 2 of protein sequence S1: (a) two bit sequences of segment {A} and {B}; (b) shifting the bit sequence of segment {B} by one position; (c) the resulting bit sequence of segment {AB} after applying the AND operator.
3.4 The process of finding the bit sequences of segment {ACD} in Step 2 of protein sequence S1: (a) two bit sequences of segment {AC} and {D}; (b) shifting the bit sequence of segment {D} by two positions; (c) the resulting bit sequences of segment {ACD} after applying the AND operator.
3.5 Procedure Link Pattern
3.6 Procedure Find Longest Pattern
3.7 The pattern {AB}{ACD}
3.8 The process of changing the bit sequence of pattern {AB}{ACD}
3.9 The pattern {AB}{ACD} and {AB}{ACDA}
3.10 The process of finding the longest pattern
3.11 The generalized suffix tree
3.12 The SP-index
3.13 The segment tree in the SP-index algorithm
3.14 The pattern tree in the SP-index algorithm
4.1 A comparison of two algorithms with different values of support for biological data
4.2 A comparison of two algorithms with different values of number of protein sequences for biological data
4.3 A comparison of two algorithms with different values of the length of protein sequences for biological data
4.4 A comparison of two algorithms with different values of the support for synthetic data
4.5 A comparison of two algorithms with different values of the number of protein sequences for synthetic data
4.6 A comparison of two algorithms with different values of the minimal length of segments for synthetic data
4.7 A comparison of two algorithms with different values of the length of protein sequences for synthetic data
4.8 A comparison of two algorithms with different values of the maximal length of frequent segment for synthetic data

LIST OF TABLES
Table
1.1 The codes of 20 amino acids
1.2 The database DB1
2.1 A sequence database
2.2 Projected databases and sequential patterns
2.3 The database D
2.4 An example of Position Weighted Matrix.
3.1 Description of parameters
3.2 The database DB
3.3 The BM table
3.4 The BPS table
3.5 The Pruning Table
3.6 New bit sequences for segment {AB}
3.7 The result of pattern {AB}{AB}
3.8 The result of pattern {AB}{ACD}
3.9 The result of pattern {AB}{ACD}
3.10 The result of pattern {AB}{ACD}{ACD}
3.11 The result of pattern {AB}{ACD}{ACDA}
3.12 The result of pattern {AB}{ACDA}
3.13 The database DB2
3.14 The BM table
3.15 All frequent segments of length 2
3.16 All frequent segments of length 3
3.17 All frequent segments of length 4
3.18 The BPS table
3.19 The PT table
4.1 Parameters of the data generator [21].

參考文獻 References
[1] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. of the 11th Int. Conf. on Data Eng., pp. 3–14, 1995.
[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic Local Alignment Search Tool,” Journal of Mol Biol, Vol. 215, No. 3, pp. 403–410, Oct. 1990.
[3] J. Ayres, J. Gehrke, T. Yiu, and J. Flannick, “Sequential PAttern Mining Using a Bitmap Representation,” Proc. of the Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), pp. 215–224, 2002.
[4] T. Bailey and C. Elkan, “Fitting a Mixture Model by Expectation Maximization to Discover Motifs in Biopolymers,” Proc. of the Second Intl. Conf. on Intelligent Systems for Molecular Biology, pp. 28–36, 1994.
[5] T. L. Bailey and C. Elkan, “Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization,” Machine Learning, Vol. 21, No. 1–2, pp. 51–80, Oct. 1995.
[6] A. Brazma, I. Eidhammer, and D. Gilbert, “Approaches to the Automatic Discovery of Patterns in Biosequences,” Journal of Computational Biology, Vol. 5, No. 2, pp. 279–305, Dec. 1998.
[7] M. Crochemore and M. Sagot, Motifs in Sequences: Localization and Extraction. Marcel Dekker, first ed., 2001.
[8] P. G. Ferreira and P. J. Azevedo, “Query Driven Sequence Pattern Mining,” Proc. of the XXI Simpósio Brasileiro de Banco de Dados (SBBD), 2006.
[9] P. G. Ferreira and P. J. Azevedo, “Evaluating Deterministic Motif Significance Measures in Protein Databases,” Algorithms for Molecular Biology, Vol. 2, No. 16, Dec. 2007.
[10] V. Guralnik and G. Karypis, “A Scalable Algorithm for Clustering Protein Sequences,” Proc. of the Workshop on Data Mining in Bioinformatics, pp. 73–80, 2001.
[11] J. Ho, L. Lukov, and S. Chawla, “Sequential Pattern Mining with Constraints on Large Protein Databases,” Proc. of the Int. Conf. on Management of Data, pp. 20–22, 2005.
[12] I. Jonassen, “http://www.ii.uib.no/ inge/patterns.html,” Patterns in biosequences.
[13] C. Lawrence, S. Altschul, M. Boguski, J. Liu, A. Neuwald, and J. Wootton, “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment,” Science, Vol. 262, No. 5131, pp. 208–214, Oct. 1993.
[14] C. Lawrence and A. Reilly, “An Expectation Maximization (EM) Algorithm for The Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences,” Protein, Vol. 7, No. 1, pp. 41–55, Oct. 1993.
[15] C. T. Lee, “Computational Biology,” http://www.csie.ncnu.edu.tw/ rctlee/biology.html.
[16] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu, “PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth,” pp. 215–226, 2001.
[17] PROSITE, “http://ca.expasy.org/prosite/”.
[18] X. Shang, Z. Li, and W. Li, “Mining Functional Associated Patterns from Biological Network Data,” Proc. of the 2009 ACM Symposium on Applied Computing, pp. 1488–1489, 2009.
[19] R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements,” Proc. of the 5th Intl. Computer Symp. Workshop on Artificial Intelligence, pp. 1682–1687, 2002.
[20] Z. Sun, J. Yang, and J. S. Deogun, “MISAE: a new approach for regulatory motif extraction,” Proc. of the 2004 IEEE Computational Systems Bioinformatics Conf., pp. 173–181, 2004.
[21] K. Wang, Y. Xu, and J. X. Yu, “Scalable Sequential Pattern Mining for Biological Sequences,” Proc. of the 13th ACM International Conf. on Information and Knowledge Management, pp. 178–187, 2004.
[22] Wikipedia, “http://en.wikipedia.org/wiki/Protein”.
[23] J. Yang, J. S. Deogun, and Z. Sun, “A New Scheme for Protein Sequence Motif Extraction,” Proc. of the 38th Hawaii Int. Conf. on System Sciences, pp. 280a–280a, 2005.
[24] J. Yang, J. S. Deogun, and Z. Sun, “Finding Patterns in Biological Sequences by Longest Common Subsequences and Shortest Common Supersequences,” Proc. of the 6th IEEE Symposium on BioInformatics and BioEngineering, pp. 53–60, 2006.

電子全文 Fulltext
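This electronic full text is licensed only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.
論文使用權限 Thesis access permission：校內外都一年後公開 withheld (released on and off campus after one year)
開放時間 Available：校內 Campus：已公開 available；校外 Off-campus：已公開 available
etd-0626109-152123.pdf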
紙本論文 Printed copies
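Availability information for printed theses is relatively complete from academic year 102 onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 available：已公開 available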

國立中山大學圖書與資訊處 │ 諮詢服務：2452 論文審查小組 │ 服務信箱 │ 系統開發維運：圖資處知識創新組

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS