National Sun Yat-sen University, thesis/dissertation, Efficient Biclustering Methods for Microarray Databases

Title	Efficient Biclustering Methods for Microarray Databases
Department	Department of Computer Science and Engineering
Year, semester	The spring semester of Academic Year 98 (2009-2010)	Language	English
Degree	Ph.D.	Number of pages	134
Author	Jiun-Rung Chen (陳俊榮)
Advisor	Ye-In Chang (張玉盈)
Convenor	Wei-Pang Yang (楊維邦)
Advisory Committee	Tei-Wei Kuo (郭大維); Vincent Shin-Mu Tseng (曾新穆); Chungnan Lee (李宗南); Chiang Lee (李強); Suh-Yin Lee (李素瑛)
Date of Exam	2010-05-28	Date of Submission	2010-06-14
Keywords	coherent evolution, microarray, bicluster, coherent value, Human Genome Project
Statistics	The thesis/dissertation has been browsed 5751 times and downloaded 1252 times.

Abstract
The Human Genome Project has led to the generation of enormous quantities of biological data, such as microarray data. Because the volume of this data is so large, data mining techniques can help biologists analyze it efficiently. For microarray data, biclustering, which clusters rows (e.g., genes) and columns (e.g., conditions) simultaneously, has proved of great value for finding interesting patterns, and several types of biclusters have been proposed. To mine biclusters with coherent values, most previous methods need to compute Maximum Dimension Sets (MDSs) for every pair of genes in the microarray data. Since the number of genes is far larger than the number of conditions, this step is inefficient. To mine biclusters with coherent evolutions, the Co-gclustering method was proposed; it can simultaneously find biclusters with both coregulated and negative-coregulated patterns. However, its time complexity is exponential in the number of conditions, which makes it inefficient.
Therefore, in this dissertation, to solve the biclustering problem for microarray databases efficiently, we first propose a Condition-Enumeration Tree (CE-Tree) method, which mines biclusters with coherent values. Second, we propose an Up-Down Bit Pattern (UDB) method, which mines biclusters with coherent evolutions. In the CE-Tree method, instead of generating MDSs for every pair of genes, we generate MDSs only for every pair of conditions. We then expand the CE-Tree in a special local breadth-first within global depth-first manner to find the clustering result efficiently. Experimental results on real data show that the CE-Tree method mines biclusters more efficiently than several previous methods.
In the UDB method, we use up-down bit patterns to record the condition pairs on which a gene is upregulated or downregulated. We then apply bit operations and a heuristic idea to these up-down bit patterns to find the clustering result efficiently. Compared with the Co-gclustering method, the UDB method reduces the time complexity from exponential to polynomial. Experimental results on real data show that the UDB method is more efficient than the Co-gclustering method.
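As a rough illustration of the up-down bit pattern idea described in the abstract, the following Python sketch encodes each condition pair of one gene as one bit of an "up" mask or a "down" mask, and then compares two genes with bitwise AND. It is not the dissertation's implementation: the function names up_down_bits and shared_pairs, the threshold parameter, the ordering of condition pairs, and the sample values are all assumptions made for illustration, and the heuristic clustering step is not shown.

```python
from itertools import combinations


def up_down_bits(expr, threshold):
    """Encode one gene's expression row as (up, down) integer bitmasks.

    Bit k stands for the k-th condition pair (i, j) with i < j, in the
    order produced by itertools.combinations (an assumed ordering).
    A bit is set in `up` when the value rises by at least `threshold`
    from condition i to condition j, and in `down` when it falls by at
    least `threshold`. `threshold` is a hypothetical parameter.
    """
    up = down = 0
    for k, (i, j) in enumerate(combinations(range(len(expr)), 2)):
        diff = expr[j] - expr[i]
        if diff >= threshold:
            up |= 1 << k
        elif diff <= -threshold:
            down |= 1 << k
    return up, down


def shared_pairs(p, q):
    """Compare two (up, down) patterns with bit operations.

    Returns (same, opposite): `same` marks condition pairs on which both
    genes move in the same direction, `opposite` marks pairs on which
    they move in opposite directions.
    """
    (up_p, down_p), (up_q, down_q) = p, q
    same = (up_p & up_q) | (down_p & down_q)
    opposite = (up_p & down_q) | (down_p & up_q)
    return same, opposite


if __name__ == "__main__":
    # Three made-up genes measured under three conditions.
    g1 = up_down_bits([1.0, 3.0, 2.0], threshold=0.5)
    g2 = up_down_bits([2.0, 4.5, 3.2], threshold=0.5)
    g3 = up_down_bits([5.0, 2.0, 3.5], threshold=0.5)
    print(shared_pairs(g1, g2))  # g1 and g2 move together on every pair
    print(shared_pairs(g1, g3))  # g1 and g3 move oppositely on every pair
```

With n conditions, each mask in this encoding holds n(n-1)/2 bits per gene, so comparing two genes takes a few AND/OR operations on those masks.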

Table of Contents
LIST OF FIGURES
LIST OF TABLES
1. Introduction
    1.1 Computational Biology
    1.2 Clustering for Microarray Databases
    1.3 Related Work of Biclustering
    1.4 Motivations and Contributions
        1.4.1 A Condition-Enumeration Tree Method of Biclustering with Coherent Values
        1.4.2 An Up-Down Bit Pattern Method of Biclustering with Coherent Evolutions
    1.5 Organization of the Dissertation
2. A Survey of Biclustering Methods
    2.1 Methods of Biclustering with Coherent Values
        2.1.1 Cheng and Church’s Method
        2.1.2 The pClustering Method
        2.1.3 The zCluster Method
        2.1.4 The MicroCluster Method
    2.2 Methods of Biclustering with Coherent Evolutions
        2.2.1 The BiModule Method
        2.2.2 The OP-Cluster Method
        2.2.3 Cheung et al.’s Method
        2.2.4 The Co-gclustering Method
3. A Condition-Enumeration Tree Method of Biclustering with Coherent Values
    3.1 A Condition-Enumeration Tree Method
        3.1.1 Step 1: Generating Condition-Pair MDSs
        3.1.2 Step 2: The Pruning Step
        3.1.3 Step 3: The Joining Step
            3.1.3.1 The Condition Enumeration Tree
            3.1.3.2 The ⊗ Operation
            3.1.3.3 The Bounding Techniques
        3.1.4 Improving the Joining Process
            3.1.4.1 The Signature Table
            3.1.4.2 The New Joining Process
    3.2 Performance
        3.2.1 Experimental Data
        3.2.2 Accuracy of the CE-Tree Method
        3.2.3 Efficiency of the CE-Tree Method
            3.2.3.1 Generating Object-Pair MDSs
            3.2.3.2 The MicroCluster Method
        3.2.4 Discussion on pClusters from Real Data
        3.2.5 Efficiency of Pruning and Bounding Techniques
    3.3 Summary
4. An Up-Down Bit Pattern Method of Biclustering with Coherent Evolutions
    4.1 An Up-Down Bit Pattern Method
        4.1.1 Step 1: Determining Up-Down Bit Patterns
        4.1.2 Step 2: Clustering Genes Based on Up-Down Bit Patterns
        4.1.3 An Improved Version of Step 2
        4.1.4 Step 3: Post-processing the Clusters
    4.2 Performance
        4.2.1 Simulation Results on Synthetic Data Sets
        4.2.2 Experimental Results on Real Data Sets
    4.3 A Summary
5. Conclusions
    5.1 Summary
    5.2 The Future Research Direction
BIBLIOGRAPHY


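Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: released on and off campus after one year (withheld). Available: Campus: available; Off-campus: available. etd-0614110-161634.pdf
Printed copies
Availability information for printed theses is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available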

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS