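Name: 歐陽正 (Jeng Ouyang)

E-mail: not disclosed

Department: Graduate Program, Department of Electrical Engineering (Electrical Engineering)

Degree: Master

Graduation period: 2nd semester of academic year 97 (ROC calendar)

Thesis title (Chinese): 一個以相似度為基礎的資料縮減方法

Thesis title (English): A Similarity-based Data Reduction Approach

File: This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research.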

etd-0907109-164128.pdf

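Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid breaking the law.

Thesis access permission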

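Electronic thesis: fully open to on-campus and off-campus access

Thesis language / pages: Chinese / 47

Statistics: This thesis has been viewed 5185 times and downloaded 2284 times.

Abstract (Chinese): With the rapid growth of information technology, the volume of data to be processed has increased sharply, so designing an effective data reduction method is an important task, and it is the core of this thesis. In this thesis, we propose a similarity-based self-constructing fuzzy clustering algorithm for data reduction. Based on the statistical characteristics of the data, the algorithm groups similar instances into the same cluster. Clustering is complete once all the instances have been fed to the algorithm a single time, yielding the mean and standard deviation of each cluster; the set of these means is the result of the data abstraction. These few new representative points are then used in place of the large original dataset. The algorithm has two main advantages. First, it is faster and needs less memory. Second, the user does not have to decide in advance how many representative points to extract. In the experiments, we use several real-world datasets to verify that the algorithm runs faster than other methods and achieves a better reduction rate, and we test the quality of the abstracted data with three classifiers.

Abstract (English): Finding an efficient data reduction method for large-scale problems is an imperative task. In this paper, we propose a similarity-based self-constructing fuzzy clustering algorithm to sample instances for the classification task. Instances that are similar to each other are grouped into the same cluster. Once all the instances have been fed in, a number of clusters have been formed automatically. The statistical mean of each cluster is then taken to represent all the instances covered by that cluster. This approach has two advantages. One is that it runs faster and uses less memory. The other is that the user need not specify the number of new representative instances in advance. Experiments on real-world datasets show that our method runs faster and obtains a better reduction rate than other methods.

Keywords (Chinese): large-scale data, fuzzy similarity, data clustering, instance filtering, instance abstraction, data sampling, data reduction

Keywords (English): fuzzy similarity, large-scale dataset, data reduction, prototype reduction, instance-filtering, instance-abstraction

Table of contents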

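Abstract (Chinese) i

Abstract ii

Contents iii

List of Figures iv

List of Tables v

Chapter 1 Introduction 1

Chapter 2 Literature Review 4

2.1 Data Reduction Methods 4

2.2 Classifier Models 7

Chapter 3 Methodology 11

3.1 Self-Constructing Fuzzy Clustering Algorithm 11

3.2 Example 16

Chapter 4 Experiments and Results 20

4.1 Data Description 20

4.2 Experiment 1 21

4.3 Experiment 2 25

4.4 Experiment 3 26

4.5 Experiment 4 32

Chapter 5 Conclusion 35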

References 36

Oral defense committee

吳志宏 - Committee chair

歐陽振森 - Committee member

蔡賢亮 - Committee member

賴智錦 - Committee member

李錫智 - Advisor

Oral defense date: 2009-07-23

Submission date: 2009-09-07
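
The one-pass clustering described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's actual algorithm, whose details are not given in this record. The Gaussian membership function, the similarity threshold `rho`, the initial deviation `sigma0`, the rule of adding `sigma0` to each updated deviation, and all function and field names are assumptions made for illustration only. The sketch also ignores class labels; for a classification task one would probably run it separately on the instances of each class.

```python
import math


def gaussian_similarity(x, mean, std):
    # Assumed membership function: product of per-dimension Gaussians.
    return math.exp(-sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, mean, std)))


def self_constructing_clusters(instances, rho=0.5, sigma0=0.5):
    """Single pass over the data; the number of clusters is not fixed in advance.

    rho (similarity threshold) and sigma0 (initial deviation, must be > 0)
    are hypothetical parameters, not values taken from the thesis.
    """
    clusters = []
    for x in instances:
        # Find the existing cluster most similar to x.
        best, best_sim = None, -1.0
        for c in clusters:
            sim = gaussian_similarity(x, c["mean"], c["std"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None or best_sim < rho:
            # No cluster is similar enough: start a new one centred on x.
            clusters.append({
                "size": 1,
                "sum": list(x),
                "sum_sq": [xi * xi for xi in x],
                "mean": list(x),
                "std": [sigma0] * len(x),
            })
            continue
        # Otherwise fold x into the winning cluster and refresh its statistics.
        best["size"] += 1
        n = best["size"]
        for j, xj in enumerate(x):
            best["sum"][j] += xj
            best["sum_sq"][j] += xj * xj
            mean_j = best["sum"][j] / n
            var_j = max(best["sum_sq"][j] / n - mean_j * mean_j, 0.0)
            best["mean"][j] = mean_j
            # Assumption: adding sigma0 keeps the deviation strictly positive.
            best["std"][j] = math.sqrt(var_j) + sigma0
    return clusters


data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
# The cluster means serve as the reduced set of representative instances.
prototypes = [c["mean"] for c in self_constructing_clusters(data)]
```

In this sketch a larger `rho` makes instances harder to absorb into existing clusters, so it tends to produce more clusters and a lower reduction rate; the thesis itself may control this trade-off differently.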