Title page for etd-0629105-045705
Title
A Class-rooted FP-tree Approach to Data Classification
Department
Year, semester
Language
Degree
Number of pages
85
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2005-06-17
Date of Submission
2005-06-29
Keywords
data mining, correlated attributes, association rules, classification, decision trees
Statistics
The thesis/dissertation has been browsed 5702 times and downloaded 0 times.
Abstract
Classification, an important problem in data mining, is a useful technique for prediction. The goal of the classification problem is to construct a classifier from a given training database and to predict the class of new, unlabeled data. Classification has been widely applied to many areas, such as medical diagnosis and weather prediction. The decision tree is the most popular classifier model, since it generates understandable rules and performs classification without complex computation. However, a major drawback of the decision tree model is that it examines only a single attribute at a time. In the real world, attributes in some databases are correlated with each other. Thus, we may improve the accuracy of the decision tree by discovering the correlations between attributes. The CAM method applies association rule mining techniques, such as the Apriori method, to discover attribute dependence. However, traditional methods for mining association rules are inefficient for classification applications and suffer from five problems: (1) the combinatorial explosion problem, (2) invalid candidates, (3) an unsuitable minimal support, (4) ignored class items, and (5) large itemsets without class items. The FP-growth method avoids the first two problems but still suffers from the remaining three; moreover, it introduces one more problem: nodes unnecessary for the classification problem make the FP-tree large and incompact. Furthermore, the workload of the CAM method is heavy because of its many database scans, and its attribute combination problem causes some misclassification. Therefore, in this thesis, we present an efficient and accurate decision tree building method that resolves the above six problems and reduces the database scanning overhead of the CAM method. We build a structure named the class-rooted FP-tree, which is similar to the FP-tree except that its root is always a class item. Instead of using the static minimal support applied in the FP-growth method, we determine the minimal support dynamically, which avoids some misjudgment of the large itemsets used for the classification problem. In the decision tree building phase, we provide a pruning strategy that reduces the number of database scans. We also solve the attribute combination problem of the CAM method and improve the accuracy. Our simulation shows that the proposed class-rooted FP-tree mining method outperforms other association rule mining methods in terms of storage usage. It also shows that our method improves on the CAM method in terms of the number of database scans and classification accuracy. Therefore, the mining strategy of our proposed method is applicable to any decision tree building method and provides high accuracy in the real world.
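To make the central data structure concrete, here is a minimal Python sketch of building class-rooted FP-trees, one tree per class value, with the root of each tree fixed to a class item. This is an illustrative sketch only: the record format, the item ordering, and every name in it are assumptions of this example, and the thesis's class lists, dynamically determined minimal support, and pruning strategy are not reproduced here.

class Node:
    def __init__(self, item):
        self.item = item        # an attribute-value item, or a class item at the root
        self.count = 0          # number of training records sharing this prefix
        self.children = {}      # item -> child Node

def build_class_rooted_fp_trees(records):
    """Build one tree per class value. Unlike the classic FP-tree,
    whose root is a null node, each root here is a class item, so
    every path (and every itemset mined from it) carries class data."""
    roots = {}
    for attribute_items, class_item in records:
        root = roots.setdefault(class_item, Node(class_item))
        root.count += 1
        node = root
        # The real method orders items by frequency before insertion;
        # for brevity this sketch keeps the given attribute order.
        for item in attribute_items:
            child = node.children.get(item)
            if child is None:
                child = Node(item)
                node.children[item] = child
            child.count += 1
            node = child
    return roots

# Toy training records: (attribute items, class item).
records = [
    (["outlook=sunny", "wind=weak"], "play=no"),
    (["outlook=sunny", "wind=strong"], "play=no"),
    (["outlook=rain", "wind=weak"], "play=yes"),
]
trees = build_class_rooted_fp_trees(records)
print(sorted(trees))  # ['play=no', 'play=yes']

Because every path starts at a class item, itemsets mined from these trees always contain class information, which is how the approach avoids the problems of ignored class items and of large itemsets without class items.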
Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
1.1 Data Mining
1.2 Classification
1.3 Classifier Model 1: The Decision Tree Classifiers
1.4 Classifier Model 2: Mining Association Rules
1.5 Motivations
1.6 Organization of the Thesis
2. A Survey
2.1 Methods for Building Decision Trees
2.1.1 The ID3 Method
2.1.2 The C4.5 Method
2.1.3 Modified Methods for Building Decision Trees
2.2 Mining Association Rules
2.2.1 The Apriori Method
2.2.2 The FP-growth Method
2.3 Associative Classification
2.4 The CAM Method
3. The Class-rooted FP-tree Approach
3.1 The Proposed Approach
3.2 Building the Decision Tree Classifier
3.3 The Mining Part
3.3.1 Building the Class Lists
3.3.2 Building the Class-rooted FP-trees
3.3.3 Mining from the Class-rooted FP-trees
3.4 Attribute Selection and Splitting
3.4.1 Attribute Selection in the Decision Tree Building Phase
3.4.2 Splitting the Training Database
4. Performance Study
4.1 The Performance Model
4.2 Simulation Results of Attribute Selection Workload
4.3 Simulation Results of Predictive Accuracy
5. Conclusion
5.1 Summary
5.2 Future Research Directions
BIBLIOGRAPHY
References
[1] R. L. Ackoff, "From Data to Wisdom," Journal of Applied Systems Analysis, Vol. 16, No. 1, pp. 3-9, 1989.

[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. of the 20th Int. Conf. on Very Large Databases, pp. 487-499, 1994.

[3] H. Alhammady and K. Ramamohanarao, "Using Emerging Patterns and Decision Trees in Rare-class Classification," Proc. of IEEE Int. Conf. on Data Mining, pp. 315-318, 2004.

[4] K. Alsabti, S. Ranka, and V. Singh, "CLOUDS: A Decision Tree Classifier for Large Database," Proc. of the 4th Int. Conf. on Knowledge Discovery and Data Mining, pp. 2-8, 1998.

[5] G. Bellinger, D. Castro, and A. Mills, "Data, Information, Knowledge, and Wisdom," http://www.systems-thinking.org/dikw/dikw.htm.

[6] M. J. A. Berry and G. Linoff, Data Mining Techniques for Marketing, Sales, and Customer Support. Wiley computer publishing, 1997.

[7] C. L. Blake and C. J. Merz, "UCI Repository of Machine Learning Databases," http://www.ics.uci.edu/~mlearn/MLRepository.html.

[8] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Wadsworth, 1984.

[9] D. R. Carvalho and A. A. Freitas, "A Hybrid Decision Tree/Genetic Algorithm Method for Data Mining," Information Sciences, Vol. 163, No. 1-3, pp. 13-35, 2004.

[10] J. Chen, H. Li, and S. Tang, "Association Rules Enhanced Classification of Underwater Acoustic Signal," Proc. of the 2001 IEEE Int. Conf. on Data Mining, pp. 582-583, 2001.

[11] M. S. Chen, J. Han, and P. S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. on Knowledge and Data Eng., Vol. 8, No. 6, pp. 866-883, 1996.

[12] J. Dougherty, R. Kohavi, and M. Sahami, "Supervised and Unsupervised Discretization of Continuous Features," Proc. of the 12th Int. Conf. on Machine Learning, pp. 194-202, 1995.

[13] U. M. Fayyad and K. B. Irani, "Multi-interval Discretization of Continuous Valued Attributes for Classification Learning," Proc. of the 13th Int. Joint Conf. on Artificial Intelligence, pp. 1022-1027, 1993.

[14] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.

[15] J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Y. Loh, "BOAT - Optimistic Decision Tree Construction," Proc. of the 1999 ACM SIGMOD Int. Conf. on Management of Data, pp. 169-180, 1999.

[16] J. Gehrke, R. Ramakrishnan, and V. Ganti, "RainForest - A Framework for Fast Decision Tree Construction of Large Database," Proc. of the 24th Int. Conf. on Very Large Databases, pp. 416-427, 1998.

[17] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns Without Candidate Generation," Proc. of the 2000 ACM SIGMOD Int. Conf. on Management of Data, Vol. 29, No. 2, pp. 1-12, 2000.

[18] E. B. Hunt, J. Marin, and P. J. Stone, Experiments in Induction. Academic Press, 1966.

[19] M. V. Joshi, G. Karypis, and V. Kumar, "ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets," Proc. of the 12th Int. Parallel Processing Symp., pp. 573-579, 1998.

[20] C. Kamath, "The Role of Parallel and Distributed Processing in Data Mining," Tech. Rep. UCRL-JC-142468, Newsletter of the IEEE Technical Committee on Distributed Processing, 2001.

[21] M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han, "Generalization and Decision Tree Induction: Efficient Classification in Data Mining," Proc. of the 7th Int. Workshop on Research Issues on Data Eng., pp. 111-120, 1997.

[22] R. Kufrin, "Generating C4.5 Production Rules in Parallel," Proc. of the 14th National Conf. on Artificial Intelligence, pp. 565-570, 1997.

[23] Y. S. Lee and S. S. Lin, "Improving Classification Accuracy Using Association Analysis," Proc. of the 14th Workshop on Object-Oriented Technology and Applications, pp. 95-100, 2003.

[24] Y. S. Lee and S. J. Yen, "Classification Based on Attribute Dependency," Proc. of the 6th Int. Conf. on Data Warehousing and Knowledge Discovery, pp. 259-268, 2004.

[25] Y. S. Lee, S. J. Yen, and C. W. Fang, "Discovery Categorical-Type Dependencies for Improving the Accuracy of Decision Trees," Proc. of Int. Computer Symp., pp. 1326-1333, 2002.

[26] Y. S. Lee, S. J. Yen, and C. W. Fang, "Discovery Numerical-Type Dependencies for Improving the Accuracy of Decision Trees," Proc. of the Int. Computer Symp., pp. 1845-1852, 2002.

[27] W. Li, J. Han, and J. Pei, "CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules," Proc. of the 2001 IEEE Int. Conf. on Data Mining, pp. 369-376, 2001.

[28] C. X. Ling and H. Zhang, "Toward Bayesian Classifiers with Accurate Probabilities," Proc. of Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 123-134, 2002.

[29] B. Liu, W. Hsu, and Y. Ma, "Integrating Classification and Association Rule Mining," Proc. of the 4th Int. Conf. on Knowledge Discovery and Data Mining, pp. 80-86, 1998.

[30] B. Liu, Y. Ma, and C. K. Wong, "Improving an Association Rule Based Classifier," Proc. of the 4th European Conf. on Principles and Practice of Knowledge Discovery in Databases, pp. 504-509, 2000.

[31] W. Y. Loh, "Regression Trees with Unbiased Variable Selection and Interaction Detection," Statistica Sinica, Vol. 12, No. 2, pp. 361-386, 2002.

[32] W. Y. Loh and Y. S. Shih, "Split Selection Methods for Classification Trees," Statistica Sinica, Vol. 7, No. 4, pp. 815-840, 1997.

[33] H. Mannila, "Data Mining: Machine Learning, Statistics, and Databases," Proc. of the 8th Int. Conf. on Scientific and Statistical Database Management, pp. 2-9, 1996.

[34] M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining," Proc. of the 5th Int. Conf. on Extending Database Technology, pp. 18-32, 1996.

[35] M. Mehta, J. Rissanen, and R. Agrawal, "MDL-based Decision Tree Pruning," Proc. of the 1st Int. Conf. on Knowledge Discovery and Data Mining, pp. 216-221, 1995.

[36] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, Vol. 1, No. 1, pp. 81-106, 1986.

[37] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[38] J. R. Quinlan, "Improved Use of Continuous Attributes in C4.5," Journal of Artificial Intelligence Research, Vol. 4, No. 1, pp. 77-90, 1996.

[39] R. Rastogi and K. Shim, "PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning," Proc. of the 24th Int. Conf. on Very Large Databases, pp. 404-415, 1998.

[40] N. Ren and M. R. Zargham, "Rule Extraction for Securities Analysis Based on Decision Tree Classification Model," Proc. of Int. Conf. on Information and Knowledge Eng., pp. 145-151, 2004.

[41] S. Ruggieri, "Efficient C4.5," IEEE Trans. on Knowledge and Data Eng., Vol. 14, No. 2, pp. 438-444, May 2002.

[42] N. L. Sarda and N. V. Srinivas, "An Adaptive Algorithm for Incremental Mining of Association Rules," Proc. of the 14th IEEE Int. Conf. on Data Eng., pp. 240-245, 1998.

[43] V. Schetinin, D. Partridge, W. J. Krzanowski, R. M. Everson, J. E. Fieldsend, T. C. Bailey, and A. Hernandez, "Experimental Comparison of Classification Uncertainty for Randomised and Bayesian Decision Tree Ensembles," Proc. of the 5th Int. Conf. on Intelligent Data Eng. and Automated Learning, pp. 726-732, 2004.

[44] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A Scalable Parallel Classifier for Data Mining," Proc. of the 22nd Int. Conf. on Very Large Databases, pp. 544-555, 1996.

[45] A. Silberschatz, M. Stonebraker, and J. D. Ullman, "Database Research: Achievements and Opportunities into the 21st Century," Report of the NSF Workshop on the Future of Database Systems Research, May 1995.

[46] A. Srivastava, E. Han, V. Kumar, and V. Singh, "Parallel Formulations of Decision-tree Classification Algorithms," Data Mining and Knowledge Discovery, Vol. 3, No. 3, pp. 237-261, 1999.

[47] T. Takamitsu, T. Miura, and I. Shioya, "Pre-pruning Decision Trees by Local Association Rules," Proc. of the 5th Int. Conf. on Intelligent Data Eng. and Automated Learning, pp. 148-151, 2004.

[48] H. Wang and C. Zaniolo, "CMP: A Fast Decision Tree Classifier Using Multivariate Predictions," pp. 449-460, 2000.

[49] K. Wang, S. Zhou, and Y. He, "Growing Decision Trees on Support-Less Association Rules," Proc. of the 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 265-269, 2000.

[50] S. M. Weiss, Predictive Data Mining. Morgan Kaufmann, 1998.

[51] X. Yin and J. Han, "CPAR: Classification Based on Predictive Association Rules," Proc. of the 2003 SIAM Int. Conf. on Data Mining, 2003.

[52] M. J. Zaki, C. T. Ho, and R. Agrawal, "Parallel Classification for Data Mining on Shared-Memory Multiprocessors," Proc. of the 15th Int. Conf. on Data Eng., pp. 198-205, 1999.

[53] J. Zhang and V. Honavar, "Learning Decision Tree Classifiers from Attribute Value Taxonomies and Partially Specified Data," Proc. of the 20th Int. Conf. on Machine Learning, pp. 369-376, 2003.

[54] H. Zhao and S. Ram, "Constrained Cascade Generalization of Decision Trees," IEEE Trans. on Knowledge and Data Eng., Vol. 16, No. 6, pp. 727-739, 2004.
Fulltext
The electronic fulltext is licensed to users solely for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.
Thesis access permission: not available on campus or off campus
Available:
On campus: not available (permanently restricted)
Off-campus: not available (permanently restricted)


Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 (2013) onward. To inquire about the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: publicly available
