Title page for thesis/dissertation record etd-0907109-164128
Title
A Similarity-based Data Reduction Approach (一個以相似度為基礎的資料縮減方法)
Department
Year, semester
Language
Degree
Number of pages
47
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2009-07-23
Date of Submission
2009-09-07
Keywords
fuzzy similarity, large-scale dataset, data clustering, data reduction, prototype reduction, instance filtering, instance abstraction, data sampling
Statistics
This thesis/dissertation has been viewed 5860 times and downloaded 2671 times.
Abstract (Chinese)
With the rapid growth of information technology, the amount of data to be processed has increased dramatically, so designing an effective data reduction method is an important task and is the focus of this thesis. We propose a similarity-based self-constructing fuzzy clustering algorithm for data reduction. Based on the statistical characteristics of the data, the algorithm groups similar instances into the same cluster. Clustering is complete after all the data have been fed into the algorithm once, yielding the mean and standard deviation of each cluster; the set of cluster means is the abstracted result, and these few new representative points replace the original mass of data. The algorithm has two main advantages: first, it is fast and has low memory requirements; second, the user need not decide in advance how many representative points to extract. In the experiments, we use several real-world datasets to verify that the algorithm runs faster and achieves a better reduction rate than other methods, and we evaluate the abstraction quality of the proposed method with three classifiers.
Abstract
Finding an efficient data reduction method for large-scale problems is an imperative task. In this thesis, we propose a similarity-based self-constructing fuzzy clustering algorithm to sample instances for classification tasks. Instances that are similar to one another are grouped into the same cluster. Once all the instances have been fed in, a number of clusters are formed automatically, and the statistical mean of each cluster is taken to represent all the instances covered by that cluster. This approach has two advantages. One is that it is fast and requires little memory. The other is that the number of new representative instances need not be specified in advance by the user. Experiments on real-world datasets show that our method runs faster and obtains a better reduction rate than other methods.
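To make the one-pass procedure concrete, below is a minimal sketch of a similarity-based self-constructing clustering of the kind described, not the thesis's exact algorithm: the Gaussian membership function, the similarity threshold rho, the initial deviation sigma0, and the name sc_fuzzy_cluster are illustrative assumptions.

```python
import numpy as np

def sc_fuzzy_cluster(X, rho=0.5, sigma0=0.1):
    """Single-pass similarity-based clustering for data reduction (sketch).

    Returns the cluster means, which serve as the reduced set of
    representative instances. rho, sigma0, and the Gaussian membership
    are assumed parameters, not the thesis's exact formulation.
    """
    # Per-cluster running sums: S[j] = sum of members, Q[j] = sum of squares,
    # n[j] = member count. Mean and deviation are derived from these, so one
    # pass over X suffices and memory grows with the number of clusters,
    # not the number of instances.
    S, Q, n = [], [], []
    for x in X:
        x = np.asarray(x, dtype=float)
        best, best_mu = -1, 0.0
        for j in range(len(n)):
            mean = S[j] / n[j]
            dev = np.sqrt(np.maximum(Q[j] / n[j] - mean ** 2, 0.0)) + sigma0
            # Gaussian fuzzy similarity of x to cluster j (product over dims)
            mu = float(np.prod(np.exp(-(((x - mean) / dev) ** 2))))
            if mu > best_mu:
                best, best_mu = j, mu
        if best_mu >= rho:
            # Similar enough: absorb x into the best-matching cluster
            S[best] += x
            Q[best] += x ** 2
            n[best] += 1
        else:
            # No sufficiently similar cluster exists: open a new one around x
            S.append(x.copy())
            Q.append(x ** 2)
            n.append(1)
    # The cluster means are the reduced, representative dataset
    return [S[j] / n[j] for j in range(len(n))]

# Toy usage: two well-separated 2-D blobs collapse to a few representatives
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.05, size=(500, 2)) for c in (0.2, 0.8)])
    reps = sc_fuzzy_cluster(X, rho=0.3)
    print(f"{len(reps)} representatives for {len(X)} instances")
```

In this sketch the number of representatives is not fixed in advance; it emerges from the threshold rho, mirroring the second advantage claimed above, and the running sums keep memory proportional to the number of clusters rather than the number of instances.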
Table of Contents
Abstract (Chinese) i
Abstract ii
Table of Contents iii
List of Figures iv
List of Tables v
Chapter 1  Introduction 1
Chapter 2  Literature Review 4
2.1 Data Reduction Methods 4
2.2 Classifier Models 7
Chapter 3  Methodology 11
3.1 Self-Constructing Fuzzy Clustering Algorithm 11
3.2 An Example 16
Chapter 4  Experiments and Results 20
4.1 Data Description 20
4.2 Experiment 1 21
4.3 Experiment 2 25
4.4 Experiment 3 26
4.5 Experiment 4 32
Chapter 5  Conclusions 35
References 36
Fulltext
This electronic full text is licensed solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please observe the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it, so as to avoid infringement.
Thesis access permission: unrestricted (fully open both on and off campus)
Available:
Campus: available
Off-campus: available


Printed copies
Public-access information for printed copies is relatively complete from academic year 102 (2013-14) onward. To inquire about the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
