國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,非對稱性分類分析解決策略之效能比較,Empirical Evaluations of Different Strategies for Classification with Skewed Class Distribution

論文名稱 Title	非對稱性分類分析解決策略之效能比較 Empirical Evaluations of Different Strategies for Classification with Skewed Class Distribution
系所名稱 Department	資訊管理學系 Department of Information Management
畢業學年期 Year, semester	92 學年度第 2 學期 The spring semester of Academic Year 92	語文別 Language	中文 Chinese
學位類別 Degree	碩士 Master	頁數 Number of pages	72
研究生 Author	凌士雄 Shih-Shiung Ling
指導教授 Advisor	魏志平, 鄭滄祥 Chih-Ping Wei; Tsang-Hsiang Cheng
召集委員 Convenor	黃三益 San-Yi Huang
口試委員 Advisory Committee	張德民 Te-Min Chang
口試日期 Date of Exam	2004-07-28	繳交日期 Date of Submission	2004-08-09
關鍵字 Keywords	分類分析、非對稱性分配、決策樹歸納技術、多專家分類器、增加少數法、減少多數法 Classification Analysis, Decision Tree Induction, Multi-classifier Committee Approach, Under-sampling, Over-sampling, Skewed Class Distribution
統計 Statistics	本論文已被瀏覽 5711 次，被下載 3992 次 The thesis/dissertation has been browsed 5711 times, has been downloaded 3992 times.

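中文摘要 Chinese Abstract (English translation)
Common classification analysis techniques can build models with good predictive performance when applied to datasets whose classes are evenly distributed. In practical applications such as credit card fraud detection, however, datasets often have a highly uneven (skewed) class distribution. Classification models built with ordinary techniques then tend to be strongly biased in their predictions and fail to classify the rare target instances correctly. Under-sampling, over-sampling, and the multi-classifier committee are the strategies most commonly used in the literature to handle skewed class distributions, yet few studies compare their effectiveness. This study therefore collected ten datasets with skewed class distributions, treated the skew in each with under-sampling, over-sampling, and the multi-classifier committee, and then built classifiers with the widely used C4.5 decision tree, in order to compare the strategies and to understand their characteristics and the situations each one suits. The empirical evaluation used 10-fold cross-validation and three measures: precision, recall, and the F1-measure. The results show that the multi-classifier committee effectively improves the classification of minority-class instances under all three measures. If an application emphasizes recall, over-sampling improves the classifier more effectively; if an application emphasizes precision, building the classifier directly from the original dataset is recommended.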
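The evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not code from the thesis: the function names `precision_recall_f1` and `k_fold_indices`, the labels `"fraud"` and `"ok"`, and the plain (non-stratified) fold split are all assumptions made for illustration.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    positive: Hashable,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the class treated as positive,
    which here is the rare minority class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Define each ratio as 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and deal them into k test folds of nearly
    equal size (assumption: a plain split, not stratified by class)."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


if __name__ == "__main__":
    y_true = ["fraud"] * 3 + ["ok"] * 7
    # A model that always predicts the majority class is 70% accurate here,
    # yet its precision, recall, and F1 on the minority class are all zero.
    always_ok = ["ok"] * 10
    print(precision_recall_f1(y_true, always_ok, positive="fraud"))
```

Measuring precision, recall, and F1 on the minority class rather than overall accuracy is what exposes the prediction bias the abstract describes.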
Abstract
Existing classification analysis techniques (e.g., decision tree induction) generally exhibit satisfactory classification effectiveness when dealing with data with non-skewed class distributions. However, real-world applications (e.g., churn prediction and fraud detection) often involve highly skewed data in decision outcomes. Such a highly skewed class distribution, if not properly addressed, would imperil the resulting learning effectiveness. In this study, we empirically evaluate three different approaches, namely the under-sampling, the over-sampling, and the multi-classifier committee approaches, for addressing classification with highly skewed class distributions. Due to its popularity, C4.5 is selected as the underlying classification analysis technique. Based on 10 datasets with highly skewed class distributions, our empirical evaluations suggest that the multi-classifier committee generally outperforms the under-sampling and the over-sampling approaches, using the recall rate, precision rate, and F1-measure as the evaluation criteria. Furthermore, for applications aiming at a high recall rate, the over-sampling approach is suggested. On the other hand, if the precision rate is the primary concern, the classification model induced directly from the original datasets is recommended.
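The three strategies compared in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the construction used in the thesis: the helper names are hypothetical, the 1:1 target class ratio is assumed, the committee is assumed to partition the majority class into disjoint chunks that are each paired with every minority example, and simple majority voting is assumed for combining the committee's predictions. Training a C4.5 tree on each resulting set is left out.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence


def under_sample(majority: Sequence, minority: Sequence, rng: random.Random) -> List:
    """Randomly discard majority examples until both classes are equal in size
    (assumption: a 1:1 target ratio)."""
    return rng.sample(list(majority), len(minority)) + list(minority)


def over_sample(majority: Sequence, minority: Sequence, rng: random.Random) -> List:
    """Randomly duplicate minority examples (with replacement) until both
    classes are equal in size (assumption: a 1:1 target ratio)."""
    minority = list(minority)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + minority + extra


def committee_training_sets(
    majority: Sequence, minority: Sequence, rng: random.Random
) -> List[List]:
    """Split the shuffled majority class into disjoint chunks of roughly the
    minority size and pair each chunk with all minority examples, giving one
    training set per committee member."""
    shuffled = list(majority)
    rng.shuffle(shuffled)
    n_members = max(1, len(shuffled) // len(minority))
    return [shuffled[i::n_members] + list(minority) for i in range(n_members)]


def majority_vote(predictions: Sequence[Hashable]) -> Hashable:
    """Combine the committee members' predictions for one instance."""
    return Counter(predictions).most_common(1)[0][0]


if __name__ == "__main__":
    rng = random.Random(0)
    majority = [("neg", i) for i in range(90)]
    minority = [("pos", i) for i in range(10)]
    print(len(under_sample(majority, minority, rng)))             # 20 examples
    print(len(over_sample(majority, minority, rng)))              # 180 examples
    print(len(committee_training_sets(majority, minority, rng)))  # 9 training sets
```

Under-sampling discards majority data, over-sampling repeats minority data, and the committee keeps every majority example while still giving each member a balanced training set.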

目次 Table of Contents
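Chapter 1: Introduction, p. 1
  Section 1: Research Background, p. 1
  Section 2: Research Motivation and Objectives, p. 2
  Section 3: Thesis Organization, p. 3
Chapter 2: Literature Review, p. 4
  Section 1: Classification Analysis Techniques, p. 4
    1. Decision Trees, p. 4
    2. Backpropagation Neural Networks, p. 5
    3. Nearest Neighbor Classification, p. 7
  Section 2: Strategies for Handling Skewed Class Distributions, p. 8
    1. Under-sampling, p. 8
    2. Over-sampling, p. 10
    3. Multi-classifier Committee, p. 11
Chapter 3: Empirical Datasets, p. 13
Chapter 4: Empirical Evaluation, p. 33
  Section 1: Construction of the Under-sampling Method, p. 33
  Section 2: Construction of the Over-sampling Method, p. 34
  Section 3: Evaluation Procedure and Evaluation Measures, p. 36
  Section 4: Analysis of Empirical Results, p. 39
Chapter 5: Conclusions, p. 67
  Section 1: Summary of Conclusions and Contributions, p. 67
  Section 2: Future Research Directions, p. 68
References, p. 69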

電子全文 Fulltext
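This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.
論文使用權限 Thesis access permission: released both on and off campus after one year (withheld)
開放時間 Available: 校內 Campus: available; 校外 Off-campus: available
etd-0809104-235914.pdf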
紙本論文 Printed copies
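Public-access information for printed theses is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: available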

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS