博碩士論文 etd-0821109-092325 詳細資訊
Title page for etd-0821109-092325
論文名稱
Title
基於基因演算法之自動屬性分群與特徵選取
Automatic Attribute Clustering and Feature Selection Based on Genetic Algorithms
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
83
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2009-07-16
繳交日期
Date of Submission
2009-08-21
關鍵字
Keywords
折減子集、k-means、基因演算法、特徵分群、特徵選取
k-means, reduct, genetic algorithms, feature clustering, feature selection
統計
Statistics
本論文已被瀏覽 5694 次,被下載 1366 次。
The thesis/dissertation has been browsed 5694 times and downloaded 1366 times.
中文摘要 Chinese Abstract
Feature selection plays an important role in the pre-processing stage of data mining and machine learning. A good feature set not only yields high classification accuracy but also reduces the time needed for mining. When a training data set contains a large number of features, the computation time required is usually considerable. In this thesis, we propose three attribute clustering methods based on genetic algorithms as feature selection techniques. The first method encodes a chromosome as a string of positive integers, where each gene value indicates the cluster to which the corresponding feature belongs. A chromosome is evaluated by the classification accuracy of the feature sets formed from features of different clusters and by the balance of the number of features among the clusters. The second method uses a new fitness function that greatly reduces the time needed to scan the database. The third method changes the original chromosome encoding so that different genes store the centers of different clusters; this new encoding lets the genetic algorithm converge more quickly. Finally, the experiments compare the proposed methods with the clustering results obtained by k-means. The three algorithms proposed in this thesis achieve a trade-off between accuracy and computation time. In addition, with traditional feature selection techniques, the previously derived classification rules cannot be used if the values of the selected features cannot be obtained at inference time. The proposed algorithms handle missing feature values effectively: a missing feature can be replaced by another attribute in the same cluster. The proposed feature clustering methods are therefore more flexible than previous feature selection techniques.
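The encoding and fitness evaluation described above can be made concrete with a short sketch. The Python fragment below is an illustration only, under stated assumptions: the k-nearest-neighbour classifier, the weight alpha between accuracy and balance, the number of sampled subsets, and the balance formula are stand-ins chosen here for readability, not the exact definitions given in Chapter 3 of the thesis.

```python
# Illustrative sketch of the first method's integer-encoded chromosomes.
# The classifier, the weight alpha, and the balance formula are assumptions.
import random
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def random_chromosome(n_attributes, k):
    """Each gene holds the cluster index (0..k-1) of one attribute."""
    return [random.randrange(k) for _ in range(n_attributes)]

def representative_subset(chromosome, k):
    """Pick one attribute from every non-empty cluster to form a candidate feature set."""
    subset = []
    for cluster in range(k):
        members = [i for i, c in enumerate(chromosome) if c == cluster]
        if members:
            subset.append(random.choice(members))
    return subset

def fitness(chromosome, X, y, k, alpha=0.7, trials=5):
    """Weighted sum of (i) average accuracy of feature subsets sampled from the
    clusters and (ii) a balance term that favours equally sized clusters."""
    accs = []
    for _ in range(trials):
        cols = representative_subset(chromosome, k)
        clf = KNeighborsClassifier(n_neighbors=3)
        accs.append(cross_val_score(clf, X[:, cols], y, cv=3).mean())
    sizes = np.bincount(chromosome, minlength=k)
    balance = 1.0 - sizes.std() / (sizes.mean() + 1e-9)  # 1.0 when clusters are equal
    return alpha * np.mean(accs) + (1 - alpha) * balance

if __name__ == "__main__":
    # Example: evaluate a small random population (selection, crossover, and
    # mutation are omitted for brevity).
    from sklearn.datasets import load_iris
    X, y = load_iris(return_X_y=True)
    k = 2
    population = [random_chromosome(X.shape[1], k) for _ in range(10)]
    best = max(population, key=lambda c: fitness(c, X, y, k))
    print("best clustering of attributes:", best)
```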
Abstract
Feature selection is an important pre-processing step in data mining and machine learning. A good set of features can not only improve the accuracy of classification, but also reduce the time needed to derive rules. It is especially valuable when the number of attributes in a given training data set is very large. This thesis thus proposes three GA-based clustering methods for attribute clustering and feature selection. In the first method, each feasible clustering result is encoded as a chromosome of positive integers, with one gene per attribute; the value of a gene represents the cluster to which the attribute belongs. The fitness of each individual is evaluated using both the average accuracy of attribute substitutions within clusters and the cluster balance. The second method extends the first to improve its time performance. A new fitness function based on both the accuracy and the attribute dependency is proposed; it reduces the time spent scanning the database. The third approach uses another encoding scheme for representing chromosomes and achieves faster convergence and better results than the second. Finally, experimental comparisons with the k-means clustering approach and with all combinations of attributes show that the proposed approaches achieve a good trade-off between accuracy and time complexity. In addition, after feature selection, the rules derived from only the selected features are often hard to apply if some values of the selected features cannot be obtained in the current environment. This problem is easily handled by the proposed approaches: an attribute with missing values can be replaced by other attributes in the same cluster. The proposed approaches thus provide flexible alternatives for feature selection.
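The flexibility claimed at the end of the abstract, that a selected feature whose value is missing can be replaced by another attribute from the same cluster, can also be sketched briefly. In the Python fragment below, the cluster assignment, the attribute names, and the fallback order are hypothetical examples introduced here for illustration; they are not results taken from the thesis.

```python
# Minimal sketch of attribute substitution within clusters.
# The clustering and attribute names are hypothetical.
clusters = {0: ["blood_pressure", "pulse_rate"], 1: ["age", "years_employed"]}
representatives = {0: "blood_pressure", 1: "age"}  # attributes actually selected

def usable_attribute(cluster_id, record):
    """Return the selected attribute if its value is present in the record,
    otherwise fall back to another attribute from the same cluster."""
    preferred = representatives[cluster_id]
    if record.get(preferred) is not None:
        return preferred
    for alternative in clusters[cluster_id]:
        if record.get(alternative) is not None:
            return alternative
    return None  # every attribute of this cluster is missing for this record

# A record missing the selected attribute "blood_pressure":
record = {"pulse_rate": 72, "age": 47}
print(usable_attribute(0, record))  # -> "pulse_rate"
print(usable_attribute(1, record))  # -> "age"
```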
目次 Table of Contents
Chinese Abstract .................................................................................................... I
English Abstract .................................................................................................. III
CHAPTER 1 Introduction ................................................................................. 1
1.1 Background and Motivation ........................................................................................ 1
1.2 Contributions................................................................................................................ 4
1.3 Thesis Organization ..................................................................................................... 5
CHAPTER 2 Review of Related Works ............................................................ 6
2.1 The Concept of Clustering ........................................................................................... 6
2.2 Reduct .......................................................................................................................... 7
2.3 Genetic Algorithms ...................................................................................................... 9
CHAPTER 3 GA-Based Clustering with Accuracy Measurement .............. 12
3.1 Notation...................................................................................................................... 12
3.2 Chromosome Representation ..................................................................................... 13
3.3 Initial Population ........................................................................................................ 14
3.4 Fitness and Selection.................................................................................................. 14
3.5 Genetic Operators ...................................................................................................... 18
3.6 The Proposed Algorithm ............................................................................................ 21
3.7 An Example ................................................................................................................ 23
CHAPTER 4 GA-Based Clustering with Similarity ..................................... 29
4.1 Notation...................................................................................................................... 29
4.2 Fitness Function ......................................................................................................... 30
4.3 The Proposed Algorithm ............................................................................................ 34
4.4 An Example ................................................................................................................ 36
CHAPTER 5 GA-Based Clustering with a Novel Encoded Scheme ............ 42
5.1 Notation...................................................................................................................... 42
5.2 Chromosome Representation ..................................................................................... 43
5.3 Initial Population ........................................................................................................ 44
5.4 Fitness and Selection.................................................................................................. 45
5.5 Genetic Operators ...................................................................................................... 47
5.6 The Proposed Algorithm ............................................................................................ 47
5.7 An Example ................................................................................................................ 49
CHAPTER 6 Experimental Results ................................................................ 55
6.1 Experimental Results of Method (I) ........................................................................... 55
6.2 Experimental Results of Method (II) .......................................................................... 63
6.3 Experimental Results of Method (III) ......................................................................... 65
CHAPTER 7 Conclusions and Future Works ................................................ 68
References ........................................................................................................ 69
參考文獻 References
[1] Q. A. Al-Radaideh, M. N. Sulaiman, M. H. Selamat and H. Ibrahim, “Approximate reduct computation by rough sets based attribute weighting,” IEEE International Conference on Granular Computing, Vol. 2, pp. 383-386, 2005.
[2] A. L. Blum and R. L. Rivest, “Training a 3-node neural network is NP-complete,” Neural Networks, Vol. 5, pp. 117-127, 1992.
[3] K. J. Cios, L. A. Kurgan, R. Tadeusiewicz, M. Ogiela and L. S. Goodenday, “Knowledge discovery approach to automated cardiac SPECT diagnosis,” Artificial Intelligence in Medicine, Vol. 23, No. 2, pp. 149-169, 2001.
[4] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.
[5] J. J. Grefenstette, “Optimization of control parameters for genetic algorithms,” IEEE Transactions on Systems, Man, and Cybernetics, Vol. 16, No. 1, pp. 122-128, 1986.
[6] I. Graham and P. L. Jones, Expert Systems - Knowledge, Uncertainty and Decision, Chapman and Hall, Boston, pp. 117-158, 1988.
[7] K. Gao, M. Liu, K. Chen, N. Zhou and J. Chen, “Sampling-based tasks scheduling in dynamic grid environment,” The Fifth WSEAS International Conference on Simulation, Modeling and Optimization, pp. 25-30, 2005.
[8] J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
[9] A. Homaifar, S. Guan, and G. E. Liepins, “A new approach on the traveling salesman problem by genetic algorithms,” The Fifth International Conference on Genetic Algorithms, 1993.
[10] J. Han, X. Hu and T. Y. Lin, “Feature selection based on rough set and information entropy,” The IEEE International Conference on Granular Computing, Vol. 1, pp. 153-158, 2005.
[11] T. P. Hong and Y. L. Liou, “Attribute clustering in high dimensional feature spaces,” The International Conference on Machine Learning and Cybernetics, pp. 19-22, 2007.
[12] Y. Li, S. C. K. Shiu and S. K. Pal, “Combining feature reduction and case selection in building CBR classifiers,” IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 3, pp. 415-429, 2006.
[13] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, 1996.
[14] Z. Michalewicz, “Evolutionary computation: practical issues,” International Conference on Evolutionary Computation, pp. 30-39, 1996.
[15] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, 1994.
[16] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” The Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, 1967.
[17] Z. Pawlak, “Rough sets,” International Journal of Computer and Information Sciences, Vol. 11, No. 1, pp. 341-356, 1982.
[18] Z. Pawlak, “Why rough sets?” The Fifth IEEE International Conference on Fuzzy Systems, Vol. 2, pp. 738-743, 1996.
[19] E. Sanchez et al., Genetic Algorithms and Fuzzy Logic Systems: Soft Computing Perspectives (Advances in Fuzzy Systems - Applications and Theory, Vol. 7), World Scientific, 1997.
[20] S. Z. Selim and M. A. Ismail, “K-means type algorithms: A generalized convergence theorem and characterization of local optimality,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6, pp. 81-87, 1984.
[21] A. Skowron and C. Rauszer, “The discernibility matrices and functions in information systems,” Fundamenta Informaticae, Vol. 15, pp. 331-362, 1991.
[22] H. Q. Sun and Z. Xiong, “Finding minimal reducts from incomplete information systems,” The Second International Conference on Machine Learning and Cybernetics, Vol. 1, pp. 350-354, 2003.
[23] J. Wroblewski, “Finding minimal reducts using genetic algorithms,” The Second Annual Joint Conference on Information Sciences, pp. 186-189, 1995.
電子全文 Fulltext
This electronic full text is licensed only for users to search, read, and print it personally and on a non-profit basis for the purpose of academic research. Please observe the relevant provisions of the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.
論文使用權限 Thesis access permission: withheld for one year, then released both on and off campus
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Information on the public availability of printed theses is relatively complete from the 102nd academic year (2013-2014) onward. To inquire about printed theses from the 101st academic year (2012-2013) or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience this may cause.
開放時間 Available: 已公開 available
