論文使用權限 Thesis access permission:校內外都一年後公開 withheld
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available
論文名稱 Title |
用於循序資料分群之混合式尋找特徵方法 Hybrid Algorithms of Finding Features for Clustering Sequential Data |
||
系所名稱 Department |
|||
畢業學年期 Year, semester |
語文別 Language |
||
學位類別 Degree |
頁數 Number of pages |
92 |
|
研究生 Author |
|||
指導教授 Advisor |
|||
召集委員 Convenor |
|||
口試委員 Advisory Committee |
|||
口試日期 Date of Exam |
2010-06-11 |
繳交日期 Date of Submission |
2010-07-08 |
關鍵字 Keywords |
蛋白質資料庫、特徵、循序資料、分群 Clustering, Protein databases, Sequential data, Features |
||
統計 Statistics |
本論文已被瀏覽 5690 次,被下載 1053 次 The thesis/dissertation has been browsed 5690 times, has been downloaded 1053 times. |
中文摘要 |
蛋白質是生物細胞與組織的結構元素,它對於生命有機體是個重要的 建構元素。重覆片段是指在蛋白質序列中出現的頻率夠大。這樣的片 段可以定義出蛋白質中重要的功能區域,分辨此蛋白質的家族,及找 出此蛋白質的功能。此外,它也提供了一些關物種進化有價值的資 訊。蛋白質序列分群把有著類似結構的分在同一群,如此一來便有助 於研究蛋白質序列的相似功能。有許多已經被提出的演算法都是根據 蛋白質序列的相似性去作蛋白質的分群,即根據在蛋白質資料庫中序 列的重複片段去作蛋白質的分群,例如,Feature-based clustering 演算法的全域方法和局部方法。他們利用 mining sequential patterns 的演算法找出序列重複片段來解決在蛋白質序列資料庫中 沒有 gap 限制的序列重複片段問題,然後分別去找出全域特徵和局部 特徵來作分群。Feature-based clustering 演算法是一個截然不同 的蛋白質分群方法,不需要作所有對於所有的分析,並且使用趨近於 線性複雜度的 K - means 分群演算法。雖然 Feature-based clustering 演算法具有可擴展性,並且可以得到相當不錯的分群結 果,但是它分開執行全域方法和局部方法會耗費較多的時間。因此, 在這論文中,我們提出了 Hybrid 演算法來找出並標記為特徵來改進 Feature-based clustering 演算法。我們從局部特徵與封閉頻繁序 列重複片段之間的關係,觀察到一個有趣的結果。我們的重要發現 是,在封閉頻繁序列重複片段中的某些特徵可以拆成幾個局部的特 徵,而且這幾個局部特徵出現在資料庫中的總數量會等於在封閉頻繁 序列重複片段中的所對應的特徵出現在資料庫中的數量。全域方法和 局部方法在找出序列重覆片段之後都有兩個階段,尋找特徵和標記特 徵。在我們的 Hybrid 演算法中的方法 1(LocalG),我們首先找到局 部的特徵並且作標記。然後,我們找到了全域的特徵。最後,我們從 已經標記好的局部特徵向量有效率地去標記出全域特徵的向量。在我 們的 Hybrid 演算法的方法 2(CLoseLG),我們首先直接找出封閉頻 繁序列的重覆片段。接下來,我們有效率地從封閉頻繁序列的重覆片 段找到局部的候選特徵,然後標記出局部特徵。最後,我們找出全域 特徵並且作標記。根據模擬的結果,我們證明了我們提出的 Hybrid 演算法 比 Feature-based clustering 演算法更有效率。 |
Abstract |
Proteins are the structural components of living cells and tissues, and thus an important building block in all living organisms. Patterns in proteins sequences are some subsequences which appear frequently. Patterns often denote important functional regions in proteins and can be used to characterize a protein family or discover the function of proteins. Moreover, it provides valuable information about the evolution of species. Grouping protein sequences that share similar structure helps in identifying sequences with similar functionality. Many algorithms have been proposed for clustering proteins according to their similarity, i.e., sequential patterns in protein databases, for example, feature-based clustering algorithms of the global approach and the local approach. They use the algorithm of mining sequential patterns to solve the no-gap-limit sequential pattern problem in a protein sequences database, and then find global features and local features separately for clustering. Feature-based clustering algorithms are entirely different approaches to protein clustering that do not require an all-against-all analysis and use a near-linear complexity K-means based clustering algorithm. Although feature-based clustering algorithms are scalable and lead to reasonably good clusters, they consume time on performing the global approach and the local approach separately. Therefore, in this thesis, we propose hybrid algorithms to find and mark features for feature-based clustering algorithms. We observe an interesting result from the relation between the local features and the closed frequent sequential patterns. The important observation which we find is that some features in the closed frequent sequential patterns can be taken apart to several features in the local selected features and the total support number of these features in the local selected features is equal to the support number of the corresponding feature in the closed frequent sequential patterns. There are two phases, find-feature and mark-feature, in the global approach and the local approach after mining sequential patterns. In our hybrid algorithms of Method 1 (LocalG), we first find and mark the local features. Then, we find the global features. Finally, we mark the bit vectors of the global features efficiently from the bit vector of the local features. In our hybrid algorithms of Method 2 (CLoseLG), we first find the closed frequent sequential patterns directly. Next, we find local candidate features efficiently from the closed frequent sequential patterns and then mark the local features. Finally, we find and mark the global features. From our performance study based on the biological data and the synthetic data, we show that our proposed hybrid algorithms are more efficient than the feature-based algorithm. |
目次 Table of Contents |
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Patterns in Protein Sequences . . . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Definitions of Protein Sequence Patterns . . . . . . . . . . . . 2 1.3 The K-means Clustering Algorithm . . . . . . . . . . . . . . . . . . . 8 1.4 The Problem of Protein Clustering . . . . . . . . . . . . . . . . . . . 9 1.5 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.6 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 16 2. A Survey of Algorithms for Protein Clustering . . . . . . . . . . . 17 2.1 Mining Sequential Patterns . . . . . . . . . . . . . . . . . . . . . . . . 17 2.1.1 The PrefixSpan Algorithm . . . . . . . . . . . . . . . . . . . 19 2.2 The Vector-Space K-Means Clustering Algorithm . . . . . . . . . . . 21 2.3 Feature-Based Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4 CloSpan: Closed Sequential Patterns Mining . . . . . . . . . . . . . . 28 2.5 The CONTOUR Algorithm . . . . . . . . . . . . . . . . . . . . . . . 29 3. Hybrid Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.1 Observation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2 Method 1 (LocalG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3 Method 2 (CLoseLG) . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Page 3.4 Counterexamples for CONTOUR . . . . . . . . . . . . . . . . . . . . 45 4. Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.1 The Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.2 Biological Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.3 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 |
參考文獻 References |
[1] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. of the 11th Int. Conf. on Data Eng. , pp. 3 – 14, 1995. [2] S. Altschul, T. Madden, A. Schafer, J. Zhang, Z. Zhang, W. Miller, and D. Lip- man, “Gapped Blast and Psi-Blast: A New Generation of Protein Database Search Program,” Proc. of the Nucleic Acids Research, pp. 3389 – 3402, 1997. [3] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, “Sequential Pattern Mining Us- ing a Bitmap Representation,” Proc. of the 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 429–435, 2002. [4] M. Crochemore and M. Sagot, Motifs in Sequences: Localization and Extraction. Marcel Dekker, first ed., 2001. [5] P. G. Ferreira and P. J. Azevedo, “Query Driven Sequence Pattern Mining,” Proc. of the XXI Simpsio Brasileiro de Banco de Dados(SBBD), 2006. [6] P. G. Ferreira and P. J. Azevedo, “Evaluating Deterministic Motif Significance Measures in Protein Databases,” Algorithms fo Molecular Biology, Vol. 2, No. 16, Dec. 2007. [7] V. Guralnik and G. Karypis, “A Scalable Algorithm for Clustering Sequential Data,” Proc. of IEEE Int. Conf. on Data Mining, pp. 179–186, 2001. [8] J. Han, J. Pei, and Y. Yin, “CURE: An Efficient Clustering Algorithm for Large Databases,” Information Systems, Vol. 26, No. 1, pp. 35–58, March 2001. [9] J. Han, J. Pei, B. M. Asl, Q. Chen, U. Dayal, and M. C. Hsu, “FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining,” Proc. of the 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 355–359, 2000. [10] I. Jonassen, “http://www.ii.uib.no/ inge/patterns.html,” Patterns in biose- quences. [11] L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction to Cluster Anslysis,” Wiley Series in Probability and Mathematical Statistics Applied Probability and Statistics, New York: Wiley, 1990. [12] L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction to Cluster Anslysis,” Wiley Series in Probability and Mathematical Statistics Applied Probability and Statistics, New York: Wiley, 1990. [13] C. Lee, “Computational Biology,” http://www.csie.ncnu.edu.tw/ rctlee/biology.html. [14] M. Lin and S. Lee, “Incremental Update on Sequential Patterns in Large Database,” Proc. of the 10th IEEE Int. Conf. Tools with Artificial Intelligence, pp. 24 – 31, 1998. [15] G. K. M.J. Joshi and V. Kumar, “Universal Formulation of Sequential Pat- terns,” Technical report, University of Minnesota, Department of Computer Sci- ence Minneapolis, 1999. [16] M.J.Zaki, “Efficient Enumeration of Frequent Sequences,” Proc. of the 7th Int. Conf. on Information and Knowledge Management, pp. 68 – 75, 1998. [17] J. Pei, J. Han, M. A. Behzad, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu, “PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth,” Int. Conf. on Data Engineering, pp. 215–224, 2001. [18] R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements,” Proc. of the 5th Int. Conf. on Extending Database Technology, pp. 3 – 17, 1996. [19] M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document Clus- tering Techniques,” Proc. of the Knowledge Discovery and Data Mining, pp. 1 – 2, 2000. [20] J. Wang and J. Han, “BIDE: Efficient Mining of Frequent Closed Sequences,” Proc. of the 20th Int. Conf. on Data Engineering, pp. 79–90, 2004. [21] J. Wang, Y. Zhang, L. Zhou, G. Karypis, and C. C. Aggarwal, “Discriminating Subsequence Discovery for Sequence Clustering,” Proc. of the Sequential Data Mining, pp. 605–610, 2007. [22] J. Wang, Y. Zhang, L. Zhou, G. Karypis, and C. C. Aggarwal, “CONTOUR: an Efficient Algorithm for Discovering Discriminating Subsequences,” Proc. of the Data Mining and Knowledge Discovery, pp. 1 – 29, 2009. [23] Wikipedia, “http://en.wikipedia.org/wiki/Protein,” . [24] X. Yan, J. Han, and R. Afshar, “CloSpan: Mining Closed Sequential Patterns in Large Datasets,” Proc. of the Sequential Data Mining, pp. 166–177, 2003. [25] J. Yang, J. S. Deogun, and Z. Sun, “A New Scheme for Protein Sequence Motif Extraction,” Proc. of the 38th Hawii Int. Conf. on System Sciences, pp. 280a – 280a, 2005. [26] M. Zaki and C. Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset Mining,” Proc. of the SIAM Int. Conf. on Data Mining, pp. 457 – 473, 2002. |
電子全文 Fulltext |
本電子全文僅授權使用者為學術研究之目的,進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定,切勿任意重製、散佈、改作、轉貼、播送,以免觸法。 論文使用權限 Thesis access permission:校內外都一年後公開 withheld 開放時間 Available: 校內 Campus: 已公開 available 校外 Off-campus: 已公開 available |
紙本論文 Printed copies |
紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊,請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。 開放時間 available 已公開 available |
QR Code |