博碩士論文 etd-0620108-172233 詳細資訊
Title page for etd-0620108-172233
論文名稱
Title
一個以大項目集為基礎於DNA微陣列資料中探勘子空間分群之方法
A Large Itemset-Based Approach to Mining Subspace Clusters from DNA Microarray Data
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
75
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2008-05-30
繳交日期
Date of Submission
2008-06-20
關鍵字
Keywords
pCluster、高頻模式樹、大項目集、微陣列、子空間分群
Large Itemset, Microarray, Subspace Clustering, pCluster, FP-tree
中文摘要
DNA 微陣列是在實驗性分子生物學上最新的發展之一,並且開啟了產生分子資訊以表現許多生物系統或臨床興趣之資料集的可能性,而分群技術已被證明能幫助理解基因功能、基因調節、細胞進程以及細胞亞型。研究人員證明出大部分的情況下,多筆基因會構成一種疾病,也就刺激研究者去找出某些基因在某些條件下有相似的表現。大部分的子空間分群模組都依據物件在所有條件或部分條件下的距離來定義其相似性,然而,物件間即使距離很遠也可能有很強烈的相關性。許多已提出的方法,例如:pCluster 和zCluster,即為找出某些基因在某些條件下有一致性表現的子空間分群,然而,這兩個方法都包含很費時的步驟,也就是建構基因對的最大維度集合以及分佈其字首樹每個節點上的基因資訊。因此,在這篇論文中,我們提出一個以大項目集為基礎的分群演算法來改進pCluster 和 zCluster 的缺點。首先,我們避免產生基因對的最大維度集合,我們只建構條件對的最大維度集合以降低處理時間。再來,我們轉換從條件對的最大維度集合中挖掘出最大可能基因集合的任務為挖掘出其大項目集的問題,我們利用了挖掘關聯式法則中大項目集的概念,其中大項目集表示在交易資料中出現次數夠多的項目所組成的集合。由於我們只對擁有夠多基因的子空間分群感興趣,因此我們值得去注意在條件對的最大維度集合中出現夠多次的基因集合;換句話說,我們想從條件對的最大維度集合中找出大項目集,因此我們便可獲得和夠多條件對有關的基因集合。在這一步驟中,我們善用一個有效找出大項目集的資料結構之一的高頻模式樹之修正版本,從條件對的最大維度集合中找出基因的大項目集。因此,我們便可以避免複雜的分佈過程,並且利用高頻模式樹大量地降低搜尋空間。最後,我們發展一個演算法從搜尋完高頻模式樹之後的基因集合和條件對中建構出最後的分群。由於我們只對夠大並且不屬於任何分群的分群感興趣,因此我們交替地合併或擴大基因集合及條件集合來建構盡可能大的子空間分群以滿足需求。根據模擬的結果,我們可以證明由於前人方法需要建構基因對的最大維度集合,所以我們提出的方法比前人的方法需要較短的處理時間。
Abstract
DNA microarrays are one of the latest breakthroughs in experimental molecular biology and have opened the possibility of creating datasets of molecular information that represent many systems of biological or clinical interest. Clustering techniques have proven helpful for understanding gene function, gene regulation, cellular processes, and subtypes of cells. Investigations show that, more often than not, several genes contribute to a disease, which motivates researchers to identify subsets of genes whose expression levels are similar under a subset of conditions. Most subspace clustering models define similarity among objects by distances over either all or only a subset of the dimensions. However, strong correlations may still exist among a set of objects even if they are far apart from each other as measured by the distance functions. Many techniques, such as pCluster and zCluster, have been proposed to find subspace clusters in which a subset of genes shows coherent expression on a subset of conditions. However, both contain time-consuming steps, namely constructing gene-pair MDSs (maximal dimension sets) and distributing the gene information to each node of a prefix tree. Therefore, in this thesis, we propose a Large Itemset-Based Clustering (LISC) algorithm to overcome these disadvantages of the pCluster and zCluster algorithms. First, we avoid constructing the gene-pair MDSs; we construct only the condition-pair MDSs to reduce the processing time. Second, we transform the task of mining the possible maximal gene sets into the problem of mining large itemsets from the condition-pair MDSs. We make use of the concept of the large itemset from association rule mining, where a large itemset is a set of items that appears in a sufficient number of transactions.
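To make the notion of a large itemset concrete, the following minimal sketch counts, by brute force, how many transactions contain each itemset and keeps those that reach a minimum support. The function name `large_itemsets`, the parameter `min_support`, and the toy gene labels are illustrative assumptions and do not come from the thesis; a practical miner would use Apriori or an FP-tree rather than enumerating every subset.

```python
from itertools import combinations

def large_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets in at least min_support transactions.

    Brute-force enumeration for illustration only (hypothetical helper,
    exponential in transaction size).
    """
    counts = {}
    for transaction in transactions:
        items = sorted(set(transaction))
        # Count every non-empty subset of this transaction once.
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_support}

# Toy transactions: each one is a set of (made-up) gene labels.
transactions = [
    {"g1", "g2", "g3"},
    {"g1", "g2"},
    {"g2", "g3"},
    {"g1", "g2", "g4"},
]
result = large_itemsets(transactions, min_support=3)
print(result)
```

With a minimum support of 3, only the itemsets built from `g1` and `g2` survive, because `g3` and `g4` appear in too few transactions.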
Since we are only interested in subspace clusters whose gene sets are as large as possible, it is desirable to focus on those gene sets that have reasonably large support from the condition-pair MDSs. In other words, we want to find the large itemsets in the condition-pair MDSs and thereby obtain the gene sets associated with enough condition pairs. In this step, we use a revised version of the FP-tree structure, which has been shown to be one of the most efficient data structures for mining large itemsets, to find the large itemsets of genes from the condition-pair MDSs. Thus, we avoid the complex distributing operation and reduce the search space dramatically by using the FP-tree structure. Finally, we develop an algorithm to construct the final clusters from the gene sets and condition pairs obtained after searching the FP-tree. Since we are interested in clusters that are large enough and are not contained in any other cluster, we alternately combine or extend the gene sets and the condition sets to construct subspace clusters that are as large as possible. Our simulation results show that the proposed algorithm needs a shorter processing time than the previously proposed algorithms, since they need to construct gene-pair MDSs.
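The abstract does not spell out how a condition-pair MDS is computed. One plausible reading, borrowed from the pairwise-clustering idea of the pCluster model, is sketched below: for a fixed pair of conditions, sort the genes by the difference of their two expression values and slide a window whose spread stays within a threshold. The names `condition_pair_mds`, `delta`, and `min_genes`, and the toy expression values, are assumptions made for illustration; the thesis's actual Step 1 may differ.

```python
import math

def condition_pair_mds(expr, cond_a, cond_b, delta, min_genes):
    """Maximal gene sets whose (cond_a - cond_b) differences span at most delta.

    expr maps a gene label to its list of expression values.
    Hypothetical sketch, not the thesis's implementation.
    """
    diffs = sorted((vals[cond_a] - vals[cond_b], gene) for gene, vals in expr.items())
    maximal_sets = []
    start = 0
    for end in range(len(diffs)):
        # Shrink from the left until the window's spread is within delta.
        while diffs[end][0] - diffs[start][0] > delta:
            start += 1
        # The window is maximal if it cannot be extended to the right.
        at_last = end == len(diffs) - 1
        if at_last or diffs[end + 1][0] - diffs[start][0] > delta:
            if end - start + 1 >= min_genes:
                maximal_sets.append(frozenset(g for _, g in diffs[start:end + 1]))
    return maximal_sets

# Toy data: g2 is g1 shifted up by 100, so the two are far apart in distance
# yet rise and fall together across the conditions.
expr = {
    "g1": [10, 20, 5],
    "g2": [110, 120, 105],
    "g3": [50, 61, 70],
    "g4": [30, 10, 0],
}
mds_01 = condition_pair_mds(expr, 0, 1, delta=1, min_genes=2)
distance_g1_g2 = math.dist(expr["g1"], expr["g2"])
print(mds_01, distance_g1_g2)
```

In this toy input, `g1` and `g2` land in the same set for conditions 0 and 1 even though their Euclidean distance is large, which is the kind of coherence that distance-based similarity misses. Each returned gene set could then be treated as one transaction when mining large itemsets.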
目次 Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
1.1 DNA Microarrays
1.2 Clustering in a DNA Microarray
1.2.1 Clustering
1.2.2 Subspace Clustering
1.2.3 Related Work
1.3 Motivation
1.4 Organization of Thesis
2. A Survey of Subspace Cluster Algorithms
2.1 Biclustering
2.2 δ-Clusters
2.3 pCluster
2.4 zCluster
3. The LISC Algorithm
3.1 Definitions and Problem Statement
3.2 The Proposed Algorithm
3.2.1 Step 1: Finding Condition-Pair MDSs
3.2.2 Step 2: Mining Conditional Pattern Bases
3.2.3 Step 3: Constructing Subspace Clusters
4. Performance
4.1 Data Sets
4.2 Experiment Result
4.2.1 Synthetic Data
4.2.2 Real Microarray Datasets
5. Conclusion
5.1 Summary
5.2 Future Work
BIBLIOGRAPHY
電子全文 Fulltext
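This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.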
論文使用權限 Thesis access permission:校內校外均不公開 not available
開放時間 Available:
校內 Campus:永不公開 not available
校外 Off-campus:永不公開 not available


紙本論文 Printed copies
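Public-availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up this information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.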
開放時間 available 已公開 available
