National Sun Yat-sen University, thesis/dissertation, A High Growth-Rate Emerging Pattern for Data Classification in Microarray Datasets

Title	A High Growth-Rate Emerging Pattern for Data Classification in Microarray Datasets (一個用於微陣列資料集中之高成長率浮現樣式的資料分類方法)
Department	Department of Computer Science and Engineering
Year, semester	Spring semester of Academic Year 95	Language	English
Degree	Master	Number of pages	80
Author	Tsung-Bin Yang (楊宗彬)
Advisor	Ye-In Chang (張玉盈)
Convenor	Gen-huey Chen (陳健輝)
Advisory Committee	Chien-I Lee (李建億); Tei-Wei Kuo (郭大維); Ren-Hung Hwang (黃仁竑)
Date of Exam	2007-07-05	Date of Submission	2007-07-13
Keywords	Gene Expression, Microarray, Emerging Patterns, Classification, Data Mining

Abstract
Data classification is one of the important techniques in data mining. It has been applied widely, e.g., in disease diagnosis. Recently, data classification has been used for microarray datasets, where a microarray is a very good tool for studying gene expression levels in bioinformatics. For the data classification problem on microarray datasets, we consider two biological datasets that represent two sharply different classes for the same given set of tests. Basically, the classification process contains two phases: (1) the training phase, and (2) the testing phase. The purpose of the training phase is to find the representative Emerging Patterns (EPs) in each of these two datasets, where an EP is an itemset whose growth rate from one dataset to the other satisfies certain conditions. Note that the growth rate represents the degree of difference between the two datasets. After the training phase, we take the collection of EPs in each dataset as a classifier. In the testing phase, a test sample is assigned to one of the two datasets based on the result of a similarity function, which takes the growth rate and the support into consideration. The evaluation criterion for a classifier is its accuracy: the higher the accuracy, the better the performance. Several EP-based classifiers, e.g., the EJEP and the NEP strategies, have been proposed to achieve this goal. The EJEP strategy considers only those itemsets whose growth rates are infinite, since it claims that high growth rates may result in high accuracy. However, the EJEP strategy does not keep useful EPs whose growth rates are very high but not infinite. On the other hand, real-world data always contains noise. The NEP strategy takes noise into account and provides higher accuracy than the EJEP strategy. However, it may still miss some itemsets with high growth rates, which may result in low accuracy. Therefore, in this thesis, we propose a High Growth-rate EP (HGEP) strategy to overcome the disadvantages of the NEP and the EJEP strategies. In addition to the itemsets with infinite growth rates considered in the EJEP strategy and the noise patterns considered in the NEP strategy, our HGEP strategy also considers itemsets whose growth rates are finite but higher than the growth rates of all their proper subsets. In this way, itemsets with high growth rates yield high similarity, and high similarity assigns the test samples to the correct class. Therefore, our HGEP strategy can provide high accuracy. In our performance study, we use several real datasets to evaluate the average accuracy of these strategies. Moreover, we conduct a simulation study with increasing noise. The experimental results show that the average accuracy of our HGEP strategy is higher than that of the NEP strategy.
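The growth-rate condition described in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's exact definitions, so the code below assumes the standard emerging-pattern definition (growth rate is the support in the target dataset divided by the support in the background dataset, infinite when the background support is zero) and a CAEP-style scoring function as a stand-in for the unspecified similarity function. All function names, the `min_rate` threshold, and the toy gene labels are hypothetical; NEP noise handling and the CP-tree mining procedure are not modeled.

```python
from fractions import Fraction
from itertools import combinations
from math import inf


def support(itemset, dataset):
    """Fraction of samples in `dataset` that contain every item of `itemset`."""
    if not dataset:
        return Fraction(0)
    hits = sum(1 for sample in dataset if itemset <= sample)
    return Fraction(hits, len(dataset))


def growth_rate(itemset, background, target):
    """Growth rate from `background` to `target` (standard EP definition)."""
    s_bg = support(itemset, background)
    s_tg = support(itemset, target)
    if s_bg == 0:
        return inf if s_tg > 0 else 0
    return s_tg / s_bg


def is_high_growth(itemset, background, target, min_rate=1):
    """Assumed reading of the HGEP condition: the growth rate is infinite, or it
    is finite, above `min_rate` (hypothetical threshold), and strictly higher
    than the growth rate of every non-empty proper subset."""
    gr = growth_rate(itemset, background, target)
    if gr == inf:
        return True
    if gr <= min_rate:
        return False
    for size in range(1, len(itemset)):
        for subset in combinations(itemset, size):
            if growth_rate(frozenset(subset), background, target) >= gr:
                return False
    return True


def caep_style_score(sample, patterns, background, target):
    """CAEP-style aggregate (an assumption, not the thesis's function):
    sum of gr / (gr + 1) * support over the patterns contained in `sample`."""
    score = Fraction(0)
    for pattern in patterns:
        if pattern <= sample:
            gr = growth_rate(pattern, background, target)
            weight = Fraction(1) if gr == inf else Fraction(gr) / (gr + 1)
            score += weight * support(pattern, target)
    return score


# Toy discretized expression data; the item labels are made up for illustration.
A, B, C = "geneA_high", "geneB_high", "geneC_high"
healthy = [frozenset({A, B}), frozenset({A, C}), frozenset({B, C}), frozenset({C})]
tumor = [frozenset({A, B}), frozenset({A, B}), frozenset({A, B}), frozenset({A})]

pair = frozenset({A, B})
pair_rate = growth_rate(pair, healthy, tumor)       # (3/4) / (1/4) = 3
pair_is_hgep = is_high_growth(pair, healthy, tumor)  # 3 > 2 and 3 > 3/2
pair_score = caep_style_score(frozenset({A, B}), [pair], healthy, tumor)
```

In this toy data the pair {geneA_high, geneB_high} has a finite growth rate of 3 from healthy to tumor, higher than either single item (2 and 3/2), so an EJEP-style filter that keeps only infinite growth rates would discard it while the condition sketched here keeps it.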

Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
  1.1 Microarrays in Bioinformatics
  1.2 Disease Diagnosis and Data Mining
  1.3 Emerging Patterns
  1.4 Emerging Patterns and EP-based Classifiers
    1.4.1 Strategies of Two-Class Classification
    1.4.2 EP-based Classifiers
  1.5 Motivation
  1.6 Organization of Thesis
2. A Survey of EP-based Strategies
  2.1 Strategies for Mining Emerging Patterns
    2.1.1 Emerging Pattern
    2.1.2 Essential Jumping Emerging Pattern
    2.1.3 Noise-tolerant Emerging Pattern
  2.2 Classifiers Based on Emerging Patterns
    2.2.1 Classification by Aggregating Emerging Patterns
    2.2.2 Jumping Emerging Pattern-Classifier
    2.2.3 Prediction by Collective Likelihood of Emerging Patterns
3. The HGEP Classifier
  3.1 The High Growth Rate Emerging Pattern
  3.2 The Contrast Pattern Tree Structure
    3.2.1 The Ordered List
    3.2.2 The Contrast Pattern Tree (CP-tree)
    3.2.3 The Construction of the Contrast Pattern Tree
    3.2.4 The Difference Between CP-trees and FP-trees
  3.3 The Mining HGEPs Process
    3.3.1 The Merging Process
    3.3.2 Algorithms for Mining HGEPs
4. Performance
  4.1 The Real Microarray Datasets
  4.2 Experiment Results
5. Conclusion
  5.1 Summary
  5.2 Future Works
BIBLIOGRAPHY


Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please observe the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: not available on campus or off campus.
Printed copies
Public-availability information for printed theses is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. Availability: available.

