國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,一個以大項目集為基礎於DNA微陣列資料中探勘子空間分群之方法,A Large Itemset-Based Approach to Mining Subspace Clusters from DNA Microarray Data

論文名稱 Title	一個以大項目集為基礎於DNA微陣列資料中探勘子空間分群之方法 A Large Itemset-Based Approach to Mining Subspace Clusters from DNA Microarray Data
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	96 學年度第 2 學期 The spring semester of Academic Year 96	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	75
研究生 Author	蔡月琪 Yueh-Chi Tsai
指導教授 Advisor	張玉盈 Ye-In Chang
召集委員 Convenor	陳健輝 Gen-huey Chen
口試委員 Advisory Committee	黃三益, 林宣華, 李建億 San-Yi Huang; Shian-Hua Lin; Chien-I Lee
口試日期 Date of Exam	2008-05-30	繳交日期 Date of Submission	2008-06-20
關鍵字 Keywords	pCluster、高頻模式樹、大項目集、微陣列、子空間分群 Large Itemset, Microarray, Subspace Clustering, pCluster, FP-tree

中文摘要
DNA 微陣列是在實驗性分子生物學上最新的發展之一，並且開啟了產生分子資訊以表現許多生物系統或臨床興趣之資料集的可能性，而分群技術已被證明能幫助理解基因功能、基因調節、細胞進程以及細胞亞型。研究人員證明出大部分的情況下，多筆基因會構成一種疾病，也就刺激研究者去找出某些基因在某些條件下有相似的表現。大部分的子空間分群模組都依據物件在所有條件或部分條件下的距離來定義其相似性，然而，物件間即使距離很遠也可能有很強烈的相關性。許多已提出的方法，例如：pCluster 和zCluster，即為找出某些基因在某些條件下有一致性表現的子空間分群，然而，這兩個方法都包含很費時的步驟，也就是建構基因對的最大維度集合以及分佈其字首樹每個節點上的基因資訊。因此，在這篇論文中，我們提出一個以大項目集為基礎的分群演算法來改進pCluster 和 zCluster 的缺點。首先，我們避免產生基因對的最大維度集合，我們只建構條件對的最大維度集合以降低處理時間。再來，我們轉換從條件對的最大維度集合中挖掘出最大可能基因集合的任務為挖掘出其大項目集的問題，我們利用了挖掘關聯式法則中大項目集的概念，其中大項目集表示在交易資料中出現次數夠多的項目所組成的集合。由於我們只對擁有夠多基因的子空間分群感興趣，因此我們值得去注意在條件對的最大維度集合中出現夠多次的基因集合；換句話說，我們想從條件對的最大維度集合中找出大項目集，因此我們便可獲得和夠多條件對有關的基因集合。在這一步驟中，我們善用一個有效找出大項目集的資料結構之一的高頻模式樹之修正版本，從條件對的最大維度集合中找出基因的大項目集。因此，我們便可以避免複雜的分佈過程，並且利用高頻模式樹大量地降低搜尋空間。最後，我們發展一個演算法從搜尋完高頻模式樹之後的基因集合和條件對中建構出最後的分群。由於我們只對夠大並且不屬於任何分群的分群感興趣，因此我們交替地合併或擴大基因集合及條件集合來建構盡可能大的子空間分群以滿足需求。根據模擬的結果，我們可以證明由於前人方法需要建構基因對的最大維度集合，所以我們提出的方法比前人的方法需要較短的處理時間。
Abstract
DNA microarrays are one of the latest breakthroughs in experimental molecular biology and have made it possible to create datasets of molecular information that represent many systems of biological or clinical interest. Clustering techniques have proven helpful for understanding gene function, gene regulation, cellular processes, and subtypes of cells. Investigations show that, more often than not, several genes contribute to a disease, which motivates researchers to identify a subset of genes whose expression levels are similar under a subset of conditions. Most subspace clustering models define similarity among objects by distances over either all or only a subset of the dimensions. However, strong correlations may still exist among a set of objects even if they are far apart from each other as measured by the distance functions. Many techniques, such as pCluster and zCluster, have been proposed to find subspace clusters in which a subset of genes shows coherent expression on a subset of conditions. However, both contain time-consuming steps: constructing gene-pair MDSs (maximal dimension sets) and distributing the gene information to each node of a prefix tree. Therefore, in this thesis, we propose a Large Itemset-Based Clustering (LISC) algorithm to overcome the disadvantages of the pCluster and zCluster algorithms. First, we avoid constructing the gene-pair MDSs; we construct only the condition-pair MDSs to reduce the processing time. Second, we transform the task of mining the possible maximal gene sets into the problem of mining large itemsets from the condition-pair MDSs. We make use of the concept of the large itemset from association rule mining, where a large itemset is a set of items that appears in a sufficient number of transactions.
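To make the large-itemset notion concrete, the following minimal sketch treats each condition-pair MDS as a transaction whose items are gene identifiers, and lists every gene set whose support reaches a threshold. The gene names, the sample data, the `min_support` parameter, and the function name `large_itemsets` are illustrative assumptions, not taken from the thesis. The enumeration is a naive level-by-level search chosen for readability, not the mining procedure of LISC.

```python
from itertools import combinations

def large_itemsets(transactions, min_support):
    """Return {frozenset(genes): support} for every gene set contained in
    at least `min_support` transactions (naive enumeration, for illustration)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions)) if transactions else []
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            cand = frozenset(candidate)
            # Support = number of transactions that contain the whole candidate.
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                result[cand] = support
                found = True
        if not found:
            break  # no large set of this size, so no larger set can be large
    return result

# Hypothetical condition-pair MDSs: each entry is the gene set of one MDS.
mds_gene_sets = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g2", "g3", "g4"},
    {"g2", "g3"},
]
large = large_itemsets(mds_gene_sets, min_support=3)
for genes, support in sorted(large.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(genes), support)
```

With this toy input, `g4` appears in only two MDSs and is discarded, while the pairs `{g1, g2}` and `{g2, g3}` each appear in three and are kept as large itemsets.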
Since we are interested only in subspace clusters whose gene sets are as large as possible, it is desirable to pay attention to those gene sets that have reasonably large support from the condition-pair MDSs. In other words, we want to find the large itemsets from the condition-pair MDSs, and thereby obtain the gene sets that are associated with enough condition pairs. In this step, we use a revised version of the FP-tree structure, which has been shown to be one of the most efficient data structures for mining large itemsets, to find the large itemsets of genes from the condition-pair MDSs. Thus, we avoid the complex distributing operation and reduce the search space dramatically. Finally, we develop an algorithm to construct the final clusters from the gene sets and the condition pairs obtained after searching the FP-tree. Since we are interested in clusters that are large enough and are not contained in any other cluster, we alternately combine or extend the gene sets and the condition sets to construct subspace clusters that are as large as possible. Our simulation results show that the proposed algorithm needs less processing time than the previously proposed algorithms, because those algorithms need to construct gene-pair MDSs.
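The FP-tree step can be illustrated with the standard structure from the frequent-pattern mining literature. The abstract does not describe how the thesis revises the FP-tree, so the sketch below shows only the unmodified construction, applied to hypothetical gene sets standing in for condition-pair MDSs. The names `FPNode`, `build_fp_tree`, and `paths`, and the sample data, are assumptions made for illustration.

```python
from collections import Counter

class FPNode:
    """One node of a standard FP-tree: an item, a count, and child links."""
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    """Build an FP-tree over the items whose support is >= min_support."""
    freq = Counter(item for t in transactions for item in set(t))
    keep = {i for i, c in freq.items() if c >= min_support}
    root = FPNode()
    for t in transactions:
        # Order items by descending support (ties broken by name) so that
        # transactions sharing frequent items share a common prefix path.
        ordered = sorted((i for i in set(t) if i in keep),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, freq

def paths(node, prefix=()):
    """Yield (path, count) for every node below `node`, for inspection."""
    for child in node.children.values():
        path = prefix + (child.item,)
        yield path, child.count
        yield from paths(child, path)

# Hypothetical condition-pair MDSs: each entry is the gene set of one MDS.
mds_gene_sets = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g2", "g3", "g4"},
    {"g2", "g3"},
]
root, freq = build_fp_tree(mds_gene_sets, min_support=3)
tree = dict(paths(root))
for path, count in sorted(tree.items()):
    print(" -> ".join(path), count)
```

In this toy case the four gene sets collapse into four tree nodes because they share the prefix `g2`, which is the kind of compression that lets an FP-tree-based search avoid rescanning every MDS.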

目次 Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
1.1 DNA Microarrays
1.2 Clustering in a DNA Microarray
1.2.1 Clustering
1.2.2 Subspace Clustering
1.2.3 Related Work
1.3 Motivation
1.4 Organization of Thesis
2. A Survey of Subspace Cluster Algorithms
2.1 Biclustering
2.2 δ-Clusters
2.3 pCluster
2.4 zCluster
3. The LISC Algorithm
3.1 Definitions and Problem Statement
3.2 The Proposed Algorithm
3.2.1 Step 1: Finding Condition-Pair MDSs
3.2.2 Step 2: Mining Conditional Pattern Bases
3.2.3 Step 3: Constructing Subspace Clusters
4. Performance
4.1 Data Sets
4.2 Experiment Result
4.2.1 Synthetic Data
4.2.2 Real Microarray Datasets
5. Conclusion
5.1 Summary
5.2 Future Work
BIBLIOGRAPHY

電子全文 Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.
論文使用權限 Thesis access permission：not available on campus or off campus
開放時間 Available：校內 Campus：not available (never released)　校外 Off-campus：not available (never released)
紙本論文 Printed copies
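Public information about printed theses is relatively complete from academic year 102 onward. To look up public information about printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available：available (already released)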

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS