Title page for etd-0909111-070933
Title
基於群組基因演算法之屬性分群改良方法
Improved Approaches for Attribute Clustering Based on the Group Genetic Algorithm
Department
Year, semester
Language
Degree
Number of pages
78
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2011-07-22
Date of Submission
2011-09-09
Keywords
feature selection, genetic algorithm, grouping genetic algorithm, data mining, attribute clustering
Statistics
The thesis/dissertation has been browsed 5674 times and downloaded 478 times.
Chinese Abstract (translated)
Feature selection is an important pre-processing technique in data mining and machine learning, especially when analyzing high-dimensional data. A suitable feature-selection algorithm not only reduces the complexity of the entire mining or learning process, but also improves the accuracy of the results. In the past, attribute clustering based on genetic algorithms was proposed as a method for feature selection: when a record lacks the value of a selected attribute, the values of other attributes in the same cluster can serve as substitutes. Because of the way traditional genetic operators work, however, a single clustering result can be represented in many different ways, which enlarges the solution space and raises the computational cost. In this thesis, we therefore propose two attribute-clustering methods based on the grouping genetic algorithm to improve the efficiency of attribute clustering. The first method uses the conventional grouping genetic algorithm to search for an attribute clustering suited to feature selection, evaluating a solution by the classification accuracy of all combinations of the clustering result and the balance of cluster sizes. The second method adopts the concept of using cluster centers as group representatives and proposes a new chromosome structure to speed up the overall search and give users more control during the solving process. Finally, we run experiments comparing the proposed methods with the traditional genetic algorithm.
Abstract
Feature selection is a pre-processing step in data mining and machine learning, and it plays an important role in analyzing high-dimensional data. Appropriately selected features not only reduce the complexity of the mining or learning process, but also improve the accuracy of the results. In the past, performing feature selection by attribute clustering was proposed: if similar attributes are clustered into groups, a missing attribute value can easily be replaced by the value of another attribute in the same group. Hong et al. also proposed several genetic algorithms for finding appropriate attribute clusters. Their approaches, however, suffered from the weakness that multiple chromosomes could represent the same attribute-clustering result (feasible solution) due to the combinatorial property of the encoding, thus causing a larger search space than needed. In this thesis, we therefore attempt to improve the performance of the GA-based attribute-clustering process by means of the grouping genetic algorithm (GGA). Two GGA-based attribute-clustering approaches are proposed. In the first approach, the general GGA representation and operators are used to reduce the redundancy of the chromosome representation for attribute clustering. In the second approach, a new encoding scheme with corresponding crossover and mutation operators is designed, and an improved fitness function is proposed to achieve better convergence speed and provide more flexible alternatives than the first approach. Finally, experiments are conducted to compare the efficiency and accuracy of the proposed approaches with those of the previous ones.
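The encoding redundancy described in the abstract, and the way a grouping-style representation removes it, can be illustrated with a minimal sketch (a hypothetical label-based encoding for illustration only, not the thesis's actual operators):

```python
from itertools import permutations

def label_encodings(partition, n_groups):
    """All label-based GA chromosomes that encode the same partition.

    partition: tuple giving a group label for each attribute,
    e.g. (0, 0, 1, 2) puts attributes 0 and 1 in one cluster.
    """
    encodings = set()
    for perm in permutations(range(n_groups)):
        # Relabeling the groups changes the chromosome, not the clustering.
        encodings.add(tuple(perm[g] for g in partition))
    return encodings

def canonical(partition):
    """Grouping-style canonical form: relabel groups by first appearance,
    so every chromosome for the same partition maps to one representative."""
    mapping, out = {}, []
    for g in partition:
        mapping.setdefault(g, len(mapping))
        out.append(mapping[g])
    return tuple(out)

# Four attributes in three clusters: {A0, A1}, {A2}, {A3}.
part = (0, 0, 1, 2)
print(len(label_encodings(part, 3)))  # 6 label chromosomes, one partition
print(canonical((2, 2, 0, 1)))        # (0, 0, 1, 2): same canonical form
```

With k clusters, up to k! label chromosomes collapse to a single canonical individual; this search-space reduction is what the GGA-based approaches exploit.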
Table of Contents
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Contributions 3
1.3 Thesis Organization 4
Chapter 2 Review of Related Work 5
2.1 Feature Selection 5
2.2 Attribute Dependency Measure 7
2.3 Attribute Clustering Based on Genetic Algorithms 8
2.4 Genetic Algorithms on the Grouping Problems 9
2.5 Grouping Genetic Algorithm 10
2.5.1 Chromosome Representation 11
2.5.2 Crossover 13
2.5.3 Mutation and Inversion 16
Chapter 3 GGA-Based Attribute Clustering 17
3.1 Chromosome Representation 18
3.2 Initial Population 20
3.3 Fitness and Selection 21
3.4 Crossover 23
3.5 Mutation and Inversion 24
3.6 The Proposed Algorithm 25
3.7 An Example 27
Chapter 4 Center-Based GGA on Attribute Clustering 34
4.1 Chromosome Representation 35
4.2 Initial Population 36
4.3 Fitness Function 37
4.4 CGGA Operators 44
4.5 The Proposed Algorithm for Clustering Attributes 47
4.6 An Example 49
Chapter 5 Experimental Results 55
5.1 Experimental Results of the First Approach 56
5.2 Experimental Results of the Second Approach 57
Chapter 6 Conclusion and Future Work 61
References 63
References
[1] A. Ben-Dor and Z. Yakhini, “Clustering gene expression patterns,” Journal of Computational Biology, Vol. 6, pp. 281-297, 1999.
[2] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, 1998.
[3] A. L. Blum and P. Langley, “Selection of relevant features and examples in machine learning,” Artificial Intelligence, Vol. 97, pp. 245-271, 1997.
[4] A. L. Blum and R. L. Rivest, “Training a 3-node neural network is NP-complete,” Neural Networks, Vol. 5, pp. 117-127, 1992.
[5] E. C. Brown and R. T. Sumichrast, “Evaluating performance advantages of grouping genetic algorithms,” Data Mining: Opportunities and Challenges, Vol. 18, pp. 1-12, 2005.
[6] R. M. Cole, Clustering with Genetic Algorithms, University of Western Australia, Master Thesis, pp. 2-3, 1998.
[7] M. Dash and H. Liu, “Feature selection for classification,” Intelligent Data Analysis, Vol. 1, pp. 131-156, 1997.
[8] M. Dash, K. Choi, P. Scheuermann and H. Liu, “Feature selection for clustering – a filter solution,” The Second International Conference on Data Mining, pp. 115-122, 2002.
[9] E. Falkenauer, “A new representation and operators for genetic algorithms applied to grouping problems,” Evolutionary Computation, Vol. 2, pp. 123-144, 1994.
[10] E. Falkenauer, “A hybrid grouping genetic algorithm for bin packing,” Journal of Heuristics, Vol. 2, pp. 5-30, 1996.
[11] K. Gao, M. Liu, K. Chen, N. Zhou and J. Chen, “Sampling-based tasks scheduling in dynamic grid environment,” The Fifth WSEAS International Conference on Simulation, Modeling and Optimization, pp.25-30, 2005.
[12] D. E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning, Addison Wesley, 1989.
[13] J. J. Grefenstette, “Optimization of control parameters for genetic algorithms,” IEEE Transactions on System Man, and Cybernetics, Vol. 16, No. 1, pp. 122-128, 1986.
[14] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, Vol. 3, pp. 1157-1182, 2003.
[15] J. Han, X. Hu and T.Y. Lin, “Feature selection based on rough set and information entropy,” The IEEE International Conference on Granular Computing, Vol. 1, pp. 153-158, 2005.
[16] J. Han, L. V. S. Lakshmanan, and R. T. Ng, “Constraint-based, multidimensional data mining,” Computer, Vol. 32, pp. 46-50, 1999.
[17] J. H. Holland. Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
[18] A. Homaifar, S. Guan, and G. E. Liepins, “A new approach on the traveling salesman problem by genetic algorithms,” The Fifth International Conference on Genetic Algorithms, 1993.
[19] T. P. Hong and Y. L. Liou, “Attribute Clustering in High Dimensional Feature Spaces,” The International Conference on Machine Learning and Cybernetics, pp.19-22, 2007.
[20] T. P. Hong and Y. L. Liou, “Attribute clustering with unknown cluster numbers,” The 2008 IEEE International Conference on Systems, Man, and Cybernetics, pp. 2772-2776, Singapore, 2008.
[21] T. P. Hong, P. C. Wang and Y. C. Lee, “An effective attribute clustering approach for feature selection and replacement,” Cybernetics and Systems, Vol. 40, No. 8, pp. 657-669, 2009.
[22] T. P. Hong, P. C. Wang, and C. K. Ting, “An evolutionary attribute clustering and selection method based on feature similarity,” The IEEE Congress on Evolutionary Computation, Spain, 2010.
[23] K. Hu, L. Diao, Y. Lu, and C. Shi, “A heuristic optimal reduct algorithm,” Lecture Notes in Computer Science, Vol. 1983, Springer, Berlin, pp. 139-144, 2000.
[24] Y. S. Kim, W. N. Street, and F. Menczer, “Feature selection in data mining,” Data Mining: Opportunities and Challenges, pp. 80-105, 2003.
[25] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial Intelligence, Vol. 97, pp. 273-324, 1997.
[26] P. Kudová, “Clustering genetic algorithm,” The Eighteenth International Conference on Database and Expert Systems Applications, pp. 138-142, 2007.
[27] Y. Li, S. C. K. Shiu and S. K. Pal, “Combining feature reduction and case selection in building CBR classifiers,” IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 3, pp. 415-429, 2006.
[28] U. Maulik and S. Bandyopadhyay, “Genetic algorithm-based clustering technique,” The Journal of the Pattern Recognition Society, pp. 1455-1465, 2000.
[29] M. Mitchell, An Introduction to Genetic Algorithms, MIT press, 1996.
[30] Z. Pawlak, “Rough set,” International Journal of Computer and Information Sciences, Vol. 11, No. 1, pp. 341-356, 1982.
[31] Z. Pawlak, “Why rough sets?,” The Fifth IEEE International Conference on Fuzzy Systems, Vol. 2, pp. 738-743, 1996.
[32] M. Sarkar, B. Yegnanarayana and D. Khemani, “A cluster algorithm using an evolutionary programming-based approach,” Pattern Recognition Letters, Vol. 18, pp. 975-986, 1997.
[33] H. Q. Sun and Z. Xiong, “Finding minimal reducts from incomplete information systems,” The Second International Conference on Machine Learning and Cybernetics, Vol. 1, pp. 350-354, 2003.
[34] L. Y. Tseng and S. B. Yang, “Genetic algorithms for clustering, feature selection and classification,” International Conference on Neural Networks, Vol. 3, pp. 1612-1616, 1997.
[35] J. X. Wei, H. Liu, Y. H. Sun and X. N. Su, “Application of genetic algorithm in document clustering,” International Conference on Information Technology and Computer Science, Vol. 1, pp. 145-148, 2009.
[36] J. Wróblewski, “Finding minimal reducts using genetic algorithms,” The Second Annual Joint Conference on Information Sciences, 1995.
[37] L. Yu and H. Liu, “Efficient feature selection via analysis of relevance and redundancy,” Journal of Machine Learning Research, Vol. 5, pp. 1205-1224, 2004.
[38] J. Zhang, J. Wang, D. Li, H. He, and J. Sun, “A new heuristic reduct algorithm based on rough sets theory,” Lecture Notes in Computer Science, Vol. 2762, Springer, Berlin, pp. 247-253, 2003.
[39] M. Zhang and J. T. Yao, “A rough sets based approach to feature selection,” The IEEE Annual Meeting of Fuzzy Information, pp. 434-439, 2004.
Fulltext
This electronic full text is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.
Thesis access permission: user-defined availability date
Available:
Campus: available
Off-campus: available


Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed-thesis service counter of the Library and Information Services. We apologize for any inconvenience.
Available: available
