國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,一個以有效率的聯集運算來於DNA微陣列資料集中挖掘封閉式大項目集的方法,An Efficient Union Approach to Mining Closed Large Itemsets in DNA Microarray Datasets

論文名稱 Title	一個以有效率的聯集運算來於DNA微陣列資料集中挖掘封閉式大項目集的方法 An Efficient Union Approach to Mining Closed Large Itemsets in DNA Microarray Datasets
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	94 學年度第 2 學期 The spring semester of Academic Year 94	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	75
研究生 Author	李立文 Li-Wen Lee
指導教授 Advisor	張玉盈 Ye-In Chang
召集委員 Convenor	黃三益 San-Yi Huang
口試委員 Advisory Committee	陳健輝, 李建億 Gen-huey Chen; Chien-I Lee
口試日期 Date of Exam	2006-06-02	繳交日期 Date of Submission	2006-07-05
關鍵字 Keywords	基因表現、封閉式項目集、微陣列、列列舉、大項目集 row enumeration, microarray, large itemsets, gene expression, closed itemsets

中文摘要
一個DNA 微陣列是在研究許多基因在不同情況下之基因表現程度的一個好用的工具。在微陣列資料集中挖掘關聯式法則，可以讓我們知道基因和基因之間是如何互相影響，以及哪些基因會經常性的一起表現出來。挖掘封閉式大項目集在減少從DNA 微陣列資料集中所挖掘出的結果上有幫助，而所謂一個封閉式項目集就是一個沒有任何其他項目集包含它並且和它有相同support值的項目集。由於在一個DNA 微陣列資料集中基因（一列代表一個基因）的數量遠大於樣本（一行代表一個樣本）的數量，這使得那些使用行列舉（column enumeration）的傳統資料挖掘演算法面臨了極大的挑戰。所謂行列舉是指藉由行的排列組合來列舉項目集。因此，一些使用列列舉（row enumeration）的演算法，如RERII，被提出來解決這一個問題，所謂列列舉是指藉由列的排列組合來列舉項目集。雖然RERII演算法比其他行列舉的演算法節省更多的記憶體空間並有更好的效能，但是它在列列舉樹中的每個節點都需要相當複雜的交集運算來產生封閉式項目集。在這篇論文中，我們提出了一個基於聯集運算的演算法，UMiner，來於DNA 微陣列中挖掘封閉式大項目集。我們的方法是一個列列舉的方法。首先，我們將轉置表中的每一列依序加入一個字首樹，而一個轉置表記錄了一個項目在原來的表中出現位置的相關資訊。然後，走訪這棵字首樹來產生一個記錄著一個節點以及此節點下在字首樹中子節點範圍之資訊的列與節點對照表。接著我們藉由對列與節點對照表中項目群組裡的項目集做聯集運算，來產生封閉式項目集。由於我們不使用交集運算來為每個列舉節點產生封閉式項目集，因此我們可以降低在列列舉樹中每個節點上所需要的時間複雜度。再則，我們開發了四種刪減的技術來減少在列列舉樹中可能的封閉式項目集。藉由以聯集運算來取代複雜的交集運算並且設計四種刪減技術來減少列列舉樹中分枝的數量，我們的方法可以非常有效率地找出封閉式大項目集。在我們的效能分析中，我們使用了三種真實臨床資料，分別是乳癌、肺癌、以及白血病。從實驗結果中，我們展現出我們的UMiner方法不論在封閉式項目集中項目的平均數量為何，在各種support值下都比RERII方法有更快的速度。此外，從我們的模擬實驗中，我們亦展現出我們的方法在資料集裡列中的平均項目數量增加時，所增加的執行時間遠比RERII方法還要少很多。
Abstract
A DNA microarray is a useful tool for studying the expression levels of many genes under different conditions. Mining association rules in DNA microarray datasets can reveal how genes affect each other and which genes are usually co-expressed. Mining closed large itemsets helps reduce the size of the mining result, where a closed itemset is an itemset that has no proper superset with the same support value. Because the number of genes (stored in columns) in a DNA microarray dataset is much larger than the number of samples (stored in rows), traditional mining methods based on column enumeration, which enumerate itemsets from combinations of the items stored in columns, face a great challenge. Therefore, several row enumeration methods, e.g., RERII, have been proposed to solve this problem; row enumeration enumerates itemsets from combinations of rows. Although the RERII method saves more memory space and performs better than the other row enumeration methods, it needs complex intersection operations at each node of the row enumeration tree to generate the closed itemsets. In this thesis, we propose a new method, UMiner, which uses union operations to mine closed large itemsets in DNA microarray datasets. Our approach is a row enumeration method. First, we add all tuples of the transposed table to a prefix tree, where the transposed table records where each item appears in the original table. Next, we traverse this prefix tree to create a row-node table, which records each node and the related range of its child nodes in the prefix tree. Then we generate the closed itemsets by applying union operations to the itemsets in the item groups stored in the row-node table.
Since we do not use intersection operations to generate the closed itemset for each enumeration node, we reduce the time complexity required at each node of the row enumeration tree. Moreover, we develop four pruning techniques to reduce the number of candidate closed itemsets in the row enumeration tree. By replacing the complex intersection operations with union operations and using the four pruning techniques to reduce the number of branches in the row enumeration tree, our method finds closed large itemsets very efficiently. In our performance study, we use three real clinical datasets: breast cancer, lung cancer, and AML-ALL (leukemia). The experimental results show that UMiner is faster than RERII at every support value tested, regardless of the average length of the closed large itemsets. Moreover, our simulation results show that the processing time of our method increases much more slowly than that of RERII as the average number of items per row increases.
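The definitions used in the abstract (transposed table, support, closed large itemset, row enumeration) can be illustrated with a small sketch. This is not the UMiner or RERII algorithm: it is a brute-force baseline that intersects every combination of rows, which is the kind of intersection work the thesis sets out to avoid. The function names `transpose` and `closed_large_itemsets` and the toy table are assumptions made for illustration only.

```python
from itertools import combinations


def transpose(table):
    """Build a transposed table: item -> set of row ids containing that item."""
    transposed = {}
    for row_id, items in table.items():
        for item in items:
            transposed.setdefault(item, set()).add(row_id)
    return transposed


def closed_large_itemsets(table, min_support):
    """Brute-force row enumeration (illustrative baseline, NOT UMiner).

    The intersection of any set of rows is a closed itemset, so enumerating
    all row combinations and intersecting them yields every closed itemset.
    """
    transposed = transpose(table)
    row_ids = sorted(table)
    closed = {}
    for size in range(1, len(row_ids) + 1):
        for rows in combinations(row_ids, size):
            # Items shared by all rows in this combination.
            itemset = frozenset.intersection(*(frozenset(table[r]) for r in rows))
            if not itemset:
                continue
            # Support = number of rows that contain every item of the itemset.
            support = len(set.intersection(*(transposed[i] for i in itemset)))
            if support >= min_support:
                closed[itemset] = support
    return closed


# Hypothetical toy dataset: three samples (rows) over three items.
table = {
    "r1": {"a", "b", "c"},
    "r2": {"a", "b"},
    "r3": {"a", "c"},
}
result = closed_large_itemsets(table, min_support=2)
for itemset, support in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
```

With a minimum support of 2, the sketch reports {a}, {a, b}, and {a, c}. The itemset {b} is not closed because its superset {a, b} has the same support. The cost grows exponentially with the number of rows, which is tolerable only because microarray datasets have few samples; the pruning and union techniques described in the abstract target exactly this cost.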

目次 Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
  1.1 Data Mining in Bioinformatics
    1.1.1 Closed Large Itemsets
    1.1.2 Mining Association Rules in the DNA Microarray Datasets
    1.1.3 Related Work of Mining Closed Large Itemsets
  1.2 Motivation
2. A Survey
  2.1 CARPENTER
    2.1.1 The Row Enumeration Tree
    2.1.2 The Transposed Table and the X-Conditional Transposed Table
    2.1.3 The CARPENTER Method
    2.1.4 Pruning Techniques
  2.2 FARMER
    2.2.1 The Transposed Table and the X-Conditional Transposed Table
    2.2.2 The FARMER Method
    2.2.3 Pruning Techniques
  2.3 RERII and REPT
    2.3.1 The Row Enumeration Tree of RERII
    2.3.2 Pruning Techniques of the RERII Method
3. A Union Approach
  3.1 The Algorithm
    3.1.1 Creating the Prefix Tree
    3.1.2 Creating a Row-Node Table
    3.1.3 Exploring the Row Enumeration Tree
  3.2 Pruning Techniques
4. Performance
  4.1 The Real DNA Microarray Dataset
  4.2 Experiment Results
  4.3 Simulation Model
  4.4 Simulation Result
5. Conclusion
  5.1 Summary
  5.2 Future Works
BIBLIOGRAPHY

電子全文 Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please observe the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission: not available on campus or off campus
開放時間 Available: 校內 Campus: not available; 校外 Off-campus: not available
紙本論文 Printed copies
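Availability information for printed theses is relatively complete from Academic Year 102 onward. To look up availability information for printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services.
開放時間 Available: available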

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS