Responsive image
博碩士論文 etd-0708110-144724 詳細資訊
Title page for etd-0708110-144724
論文名稱
Title
用於循序資料分群之混合式尋找特徵方法
Hybrid Algorithms of Finding Features for Clustering Sequential Data
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
92
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2010-06-11
繳交日期
Date of Submission
2010-07-08
關鍵字
Keywords
蛋白質資料庫、特徵、循序資料、分群
Clustering, Protein databases, Sequential data, Features
統計
Statistics
本論文已被瀏覽 5690 次,被下載 1053
The thesis/dissertation has been browsed 5690 times, has been downloaded 1053 times.
中文摘要
蛋白質是生物細胞與組織的結構元素,它對於生命有機體是個重要的
建構元素。重覆片段是指在蛋白質序列中出現的頻率夠大。這樣的片
段可以定義出蛋白質中重要的功能區域,分辨此蛋白質的家族,及找
出此蛋白質的功能。此外,它也提供了一些關物種進化有價值的資
訊。蛋白質序列分群把有著類似結構的分在同一群,如此一來便有助
於研究蛋白質序列的相似功能。有許多已經被提出的演算法都是根據
蛋白質序列的相似性去作蛋白質的分群,即根據在蛋白質資料庫中序
列的重複片段去作蛋白質的分群,例如,Feature-based clustering
演算法的全域方法和局部方法。他們利用 mining sequential
patterns 的演算法找出序列重複片段來解決在蛋白質序列資料庫中
沒有 gap 限制的序列重複片段問題,然後分別去找出全域特徵和局部
特徵來作分群。Feature-based clustering 演算法是一個截然不同
的蛋白質分群方法,不需要作所有對於所有的分析,並且使用趨近於
線性複雜度的 K - means 分群演算法。雖然 Feature-based
clustering 演算法具有可擴展性,並且可以得到相當不錯的分群結
果,但是它分開執行全域方法和局部方法會耗費較多的時間。因此,
在這論文中,我們提出了 Hybrid 演算法來找出並標記為特徵來改進
Feature-based clustering 演算法。我們從局部特徵與封閉頻繁序
列重複片段之間的關係,觀察到一個有趣的結果。我們的重要發現
是,在封閉頻繁序列重複片段中的某些特徵可以拆成幾個局部的特
徵,而且這幾個局部特徵出現在資料庫中的總數量會等於在封閉頻繁
序列重複片段中的所對應的特徵出現在資料庫中的數量。全域方法和
局部方法在找出序列重覆片段之後都有兩個階段,尋找特徵和標記特
徵。在我們的 Hybrid 演算法中的方法 1(LocalG),我們首先找到局
部的特徵並且作標記。然後,我們找到了全域的特徵。最後,我們從
已經標記好的局部特徵向量有效率地去標記出全域特徵的向量。在我
們的 Hybrid 演算法的方法 2(CLoseLG),我們首先直接找出封閉頻
繁序列的重覆片段。接下來,我們有效率地從封閉頻繁序列的重覆片
段找到局部的候選特徵,然後標記出局部特徵。最後,我們找出全域
特徵並且作標記。根據模擬的結果,我們證明了我們提出的 Hybrid
演算法 比 Feature-based clustering 演算法更有效率。
Abstract
Proteins are
the structural components of living cells and tissues, and thus an
important building block in all living organisms. Patterns in
proteins sequences are some subsequences which appear frequently.
Patterns often denote important functional regions in proteins and
can be used to characterize a protein family or discover the
function of proteins. Moreover, it provides valuable information
about the evolution of species. Grouping protein sequences that
share similar structure helps in identifying sequences with similar
functionality. Many algorithms have been proposed for clustering
proteins according to their similarity, i.e., sequential
patterns in protein databases, for example, feature-based clustering
algorithms of the global approach and the local approach. They use
the algorithm of mining sequential patterns to solve the
no-gap-limit sequential pattern problem in a protein sequences
database, and then find global features and local features
separately for clustering. Feature-based clustering algorithms are
entirely different approaches to protein clustering that do not
require an all-against-all analysis and use a near-linear
complexity K-means based clustering algorithm. Although
feature-based clustering algorithms are scalable and lead to
reasonably good clusters, they consume time on performing the global
approach and the local approach separately. Therefore, in this
thesis, we propose hybrid algorithms to find and mark features for
feature-based clustering algorithms. We observe an interesting
result from the relation between the local features and the closed
frequent sequential patterns. The important observation which we
find is that some features in the closed frequent sequential
patterns can be taken apart to several features in the local
selected features and the total support number of these features in
the local selected features is equal to the support number of the
corresponding feature in the closed frequent sequential patterns.
There are two phases, find-feature and mark-feature, in the global
approach and the local approach after mining sequential patterns. In
our hybrid algorithms of Method 1 (LocalG), we first find and mark
the local features. Then, we find the global features. Finally, we
mark the bit vectors of the global features efficiently from the bit
vector of the local features. In our hybrid algorithms of Method 2
(CLoseLG), we first find the closed frequent sequential patterns
directly. Next, we find local candidate features efficiently from
the closed frequent sequential patterns and then mark the local
features. Finally, we find and mark the global features. From our
performance study based on the biological data and the synthetic
data, we show that our proposed hybrid algorithms are more efficient
than the feature-based algorithm.
目次 Table of Contents
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Proteins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Patterns in Protein Sequences . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Definitions of Protein Sequence Patterns . . . . . . . . . . . . 2
1.3 The K-means Clustering Algorithm . . . . . . . . . . . . . . . . . . . 8
1.4 The Problem of Protein Clustering . . . . . . . . . . . . . . . . . . . 9
1.5 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 16
2. A Survey of Algorithms for Protein Clustering . . . . . . . . . . . 17
2.1 Mining Sequential Patterns . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 The PrefixSpan Algorithm . . . . . . . . . . . . . . . . . . . 19
2.2 The Vector-Space K-Means Clustering Algorithm . . . . . . . . . . . 21
2.3 Feature-Based Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 CloSpan: Closed Sequential Patterns Mining . . . . . . . . . . . . . . 28
2.5 The CONTOUR Algorithm . . . . . . . . . . . . . . . . . . . . . . . 29
3. Hybrid Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1 Observation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Method 1 (LocalG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Method 2 (CLoseLG) . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Page
3.4 Counterexamples for CONTOUR . . . . . . . . . . . . . . . . . . . . 45
4. Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1 The Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Biological Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
參考文獻 References
[1] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. of the 11th
Int. Conf. on Data Eng. , pp. 3 – 14, 1995.
[2] S. Altschul, T. Madden, A. Schafer, J. Zhang, Z. Zhang, W. Miller, and D. Lip-
man, “Gapped Blast and Psi-Blast: A New Generation of Protein Database
Search Program,” Proc. of the Nucleic Acids Research, pp. 3389 – 3402, 1997.
[3] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, “Sequential Pattern Mining Us-
ing a Bitmap Representation,” Proc. of the 8th ACM SIGKDD Int. Conf. on
Knowledge Discovery and Data Mining, pp. 429–435, 2002.
[4] M. Crochemore and M. Sagot, Motifs in Sequences: Localization and Extraction.
Marcel Dekker, first ed., 2001.
[5] P. G. Ferreira and P. J. Azevedo, “Query Driven Sequence Pattern Mining,”
Proc. of the XXI Simpsio Brasileiro de Banco de Dados(SBBD), 2006.
[6] P. G. Ferreira and P. J. Azevedo, “Evaluating Deterministic Motif Significance
Measures in Protein Databases,” Algorithms fo Molecular Biology, Vol. 2, No. 16,
Dec. 2007.
[7] V. Guralnik and G. Karypis, “A Scalable Algorithm for Clustering Sequential
Data,” Proc. of IEEE Int. Conf. on Data Mining, pp. 179–186, 2001.
[8] J. Han, J. Pei, and Y. Yin, “CURE: An Efficient Clustering Algorithm for Large
Databases,” Information Systems, Vol. 26, No. 1, pp. 35–58, March 2001.
[9] J. Han, J. Pei, B. M. Asl, Q. Chen, U. Dayal, and M. C. Hsu, “FreeSpan:
Frequent Pattern-Projected Sequential Pattern Mining,” Proc. of the 6th ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 355–359,
2000.
[10] I. Jonassen, “http://www.ii.uib.no/ inge/patterns.html,” Patterns in biose-
quences.
[11] L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction
to Cluster Anslysis,” Wiley Series in Probability and Mathematical Statistics
Applied Probability and Statistics, New York: Wiley, 1990.
[12] L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction
to Cluster Anslysis,” Wiley Series in Probability and Mathematical Statistics
Applied Probability and Statistics, New York: Wiley, 1990.
[13] C. Lee, “Computational Biology,”
http://www.csie.ncnu.edu.tw/ rctlee/biology.html.
[14] M. Lin and S. Lee, “Incremental Update on Sequential Patterns in Large
Database,” Proc. of the 10th IEEE Int. Conf. Tools with Artificial Intelligence,
pp. 24 – 31, 1998.
[15] G. K. M.J. Joshi and V. Kumar, “Universal Formulation of Sequential Pat-
terns,” Technical report, University of Minnesota, Department of Computer Sci-
ence Minneapolis, 1999.
[16] M.J.Zaki, “Efficient Enumeration of Frequent Sequences,” Proc. of the 7th
Int. Conf. on Information and Knowledge Management, pp. 68 – 75, 1998.
[17] J. Pei, J. Han, M. A. Behzad, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu,
“PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern
Growth,” Int. Conf. on Data Engineering, pp. 215–224, 2001.
[18] R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and
Performance Improvements,” Proc. of the 5th Int. Conf. on Extending Database
Technology, pp. 3 – 17, 1996.
[19] M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document Clus-
tering Techniques,” Proc. of the Knowledge Discovery and Data Mining, pp. 1 –
2, 2000.
[20] J. Wang and J. Han, “BIDE: Efficient Mining of Frequent Closed Sequences,”
Proc. of the 20th Int. Conf. on Data Engineering, pp. 79–90, 2004.
[21] J. Wang, Y. Zhang, L. Zhou, G. Karypis, and C. C. Aggarwal, “Discriminating
Subsequence Discovery for Sequence Clustering,” Proc. of the Sequential Data
Mining, pp. 605–610, 2007.
[22] J. Wang, Y. Zhang, L. Zhou, G. Karypis, and C. C. Aggarwal, “CONTOUR: an
Efficient Algorithm for Discovering Discriminating Subsequences,” Proc. of the
Data Mining and Knowledge Discovery, pp. 1 – 29, 2009.
[23] Wikipedia, “http://en.wikipedia.org/wiki/Protein,” .
[24] X. Yan, J. Han, and R. Afshar, “CloSpan: Mining Closed Sequential Patterns
in Large Datasets,” Proc. of the Sequential Data Mining, pp. 166–177, 2003.
[25] J. Yang, J. S. Deogun, and Z. Sun, “A New Scheme for Protein Sequence Motif
Extraction,” Proc. of the 38th Hawii Int. Conf. on System Sciences, pp. 280a –
280a, 2005.
[26] M. Zaki and C. Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset
Mining,” Proc. of the SIAM Int. Conf. on Data Mining, pp. 457 – 473, 2002.
電子全文 Fulltext
本電子全文僅授權使用者為學術研究之目的,進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定,切勿任意重製、散佈、改作、轉貼、播送,以免觸法。
論文使用權限 Thesis access permission:校內外都一年後公開 withheld
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊,請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。
開放時間 available 已公開 available

QR Code