Master's/Doctoral Thesis etd-0713107-180703 Detailed Record
Title page for etd-0713107-180703
Title
A High Growth-Rate Emerging Pattern for Data Classification in Microarray Datasets
Department
Year, semester
Language
Degree
Number of pages
80
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2007-07-05
Date of Submission
2007-07-13
Keywords
Gene Expression, Microarray, Emerging Patterns, Classification, Data Mining
Statistics
The thesis/dissertation has been browsed 5618 times and has been downloaded 0 times.
Abstract
Data classification is one of the most important techniques in data mining. It has been applied widely in many applications, e.g., disease diagnosis. Recently, data classification techniques have been used for microarray datasets, where a microarray is a useful tool for studying gene expression levels in bioinformatics. For the data classification problem on microarray datasets, we consider two biological datasets that represent two extremely different classes for the same given sets of tests. Basically, the classification process contains two phases: (1) the training phase and (2) the testing phase. The purpose of the training phase is to find the representative Emerging Patterns (EPs) in each of these two datasets, where an EP is an itemset whose growth rate from one dataset to the other satisfies certain conditions. Note that the growth rate represents the degree of difference between the two datasets. After the training phase, we take the collection of EPs found in each dataset as a classifier. In the testing phase, a test sample is predicted to belong to one of the two classes based on the result of a similarity function, which takes both the growth rate and the support into consideration. The evaluation criterion for a classifier is its accuracy; obviously, the higher the accuracy of a classifier, the better its performance. Therefore, several EP-based classifiers, e.g., the EJEP and NEP strategies, have been proposed to achieve this goal. The EJEP strategy considers only those itemsets whose growth rates are infinite, since it assumes that high growth rates lead to high accuracy. However, the EJEP strategy does not keep useful EPs whose growth rates are very high but not infinite. On the other hand, real-world data always contains noise. The NEP strategy tolerates noise and provides higher accuracy than the EJEP strategy; however, it may still miss some itemsets with high growth rates, which may result in low accuracy. Therefore, in this thesis, we propose a High Growth-rate EP (HGEP) strategy to overcome the disadvantages of the EJEP and NEP strategies. In addition to the itemsets with infinite growth rates considered in the EJEP strategy and the noise-tolerant patterns considered in the NEP strategy, our HGEP strategy also considers those itemsets with finite growth rates that are higher than the growth rates of all their proper subsets. In this way, itemsets with high growth rates yield high similarity, and high similarity assigns test samples to the correct class. Therefore, our HGEP strategy can provide high accuracy. In our performance study, we use several real datasets to evaluate the average accuracy of the strategies, and we also conduct a simulation study with increasing noise. The experimental results show that the average accuracy of our HGEP strategy is higher than that of the NEP strategy.
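
As a concrete reference point, the short Python sketch below illustrates the quantities the abstract relies on: the support of an itemset, its growth rate from one dataset to the other, and the finite-growth-rate condition (growth rate exceeding that of every proper subset) used for HGEP candidates. The function names, the brute-force subset check, and the toy data are illustrative assumptions only; the thesis itself mines HGEPs with the CP-tree-based algorithms of Chapter 3, which are not reproduced here.

# Minimal sketch (not from the thesis) of the quantities described above.
# `support`, `growth_rate`, and `is_hgep_candidate` are hypothetical names;
# each dataset is modelled as a list of transactions (frozensets of items).
from itertools import combinations

def support(itemset, dataset):
    """Fraction of transactions in `dataset` containing every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in dataset if itemset <= t) / len(dataset) if dataset else 0.0

def growth_rate(itemset, d_from, d_to):
    """Growth rate of `itemset` from dataset `d_from` to dataset `d_to`."""
    s_from, s_to = support(itemset, d_from), support(itemset, d_to)
    if s_from == 0.0:
        return float("inf") if s_to > 0.0 else 0.0
    return s_to / s_from

def is_hgep_candidate(itemset, d_from, d_to):
    """Condition sketched in the abstract: an infinite growth rate, or a finite
    growth rate that exceeds the growth rate of every proper non-empty subset."""
    gr = growth_rate(itemset, d_from, d_to)
    if gr == float("inf"):
        return True
    proper_subsets = (s for k in range(1, len(itemset)) for s in combinations(itemset, k))
    return all(gr > growth_rate(s, d_from, d_to) for s in proper_subsets)

# Toy example (hypothetical data): "g2_high" doubles its support from D1 to D2.
D1 = [frozenset({"g1_low", "g2_high"}), frozenset({"g1_low"})]
D2 = [frozenset({"g1_low", "g2_high"}), frozenset({"g2_high"})]
print(growth_rate({"g2_high"}, D1, D2))                  # 2.0
print(is_hgep_candidate({"g1_low", "g2_high"}, D1, D2))  # compared against its subsets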
Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
1.1 Microarrays in Bioinformatics
1.2 Disease Diagnosis and Data Mining
1.3 Emerging Patterns
1.4 Emerging Patterns and EP-based Classifiers
1.4.1 Strategies of Two-Class Classification
1.4.2 EP-based Classifiers
1.5 Motivation
1.6 Organization of Thesis
2. A Survey of EP-based Strategies
2.1 Strategies for Mining Emerging Patterns
2.1.1 Emerging Pattern
2.1.2 Essential Jumping Emerging Pattern
2.1.3 Noise-tolerant Emerging Pattern
2.2 Classifiers Based on Emerging Patterns
2.2.1 Classification by Aggregating Emerging Patterns
2.2.2 Jumping Emerging Pattern-Classifier
2.2.3 Prediction by Collective Likelihood of Emerging Patterns
3. The HGEP Classifier
3.1 The High Growth Rate Emerging Pattern
3.2 The Contrast Pattern Tree Structure
3.2.1 The Ordered List
3.2.2 The Contrast Pattern Tree (CP-tree)
3.2.3 The Construction of the Contrast Pattern Tree
3.2.4 The Difference Between CP-trees and FP-trees
3.3 The Mining HGEPs Process
3.3.1 The Merging Process
3.3.2 Algorithms for Mining HGEPs
4. Performance
4.1 The Real Microarray Datasets
4.2 Experiment Results
5. Conclusion
5.1 Summary
5.2 Future Works
BIBLIOGRAPHY
Fulltext
This electronic fulltext is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: not available (neither on campus nor off campus)
Available:
On campus: not available (permanently restricted)
Off-campus: not available (permanently restricted)

Printed copies
Public-access information for printed theses is relatively complete from academic year 102 (2013-14) onward. To look up public-access information for printed theses from academic year 101 (2012-13) or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
