Title page for etd-0212103-235138
Title
不對稱性分類分析之研究
Classification Analysis Techniques for Skewed Class
Department
Year, semester
Language
Degree
Number of pages
49
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2003-01-27
Date of Submission
2003-02-12
Keywords
Data Mining, Classification Analysis, Skewed Class Distribution Problem, Clustering-based Multi-classifier Class-combiner, Decision Tree Induction, Multi-classifier Class-combiner Approach
Statistics
This thesis has been browsed 5779 times and downloaded 5424 times.
Chinese Abstract
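Classification analysis techniques in data mining can build prediction models with good classification effectiveness when the class distribution of the data set is balanced. In practical applications, however (e.g., churn prediction and credit card fraud detection), data sets often exhibit a skewed distribution problem in which the classes are very unevenly represented, so the prediction model fails to predict the class of the scarce target instances correctly. The multi-classifier class-combiner approach, reduction of the majority class, and augmentation of the minority class are the three main methods in the literature for addressing the skewed distribution problem. This study uses data clustering to improve the multi-classifier class-combiner approach from the literature and proposes a construction method for a clustering-based multi-classifier class-combiner. It also attempts to use the nearest distance, farthest distance, nearest average distance, and farthest average distance methods to improve how well the majority-reduction method from the literature handles the skewed distribution problem.
This study collected two real-world data sets that exhibit the skewed distribution problem, burn injury medical data and customer purchase data from a boutique hypermarket, and used decision-tree-based classifiers to test the classification effectiveness of the five proposed methods, with the multi-classifier construction method from the literature as the benchmark. Results from ten rounds of sampling-based validation show that, on both data sets, the prediction model built by the clustering-based multi-classifier method with the classes adjusted to an appropriate ratio (e.g., 1:2) achieved the best classification effectiveness.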

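Keywords: Data Mining, Classification Analysis, Skewed Class Distribution, Decision Tree, Multi-classifier Class-combiner, Clustering-based Multi-classifier Class-combiner.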
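
The abstract names four distance-based rules for reducing the majority class but does not define them. The following is a minimal sketch under one assumed reading: each majority instance is scored by its Euclidean distance to the minority instances (distance to the closest one, or the mean distance to all of them for the "average" variants), and the nearest or farthest instances are kept. The function name `select_majority`, the method strings, and the interpretation of the 1:2 ratio as minority:majority are all hypothetical and may differ from the thesis.

```python
import math
from statistics import mean

METHODS = {"nearest", "farthest", "nearest_average", "farthest_average"}

def select_majority(majority, minority, method="nearest", ratio=2):
    """Keep ratio * len(minority) majority instances chosen by a distance rule.

    Assumed scoring (not taken from the thesis): plain variants use the
    distance to the closest minority instance; "*_average" variants use the
    mean distance to all minority instances.
    """
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    use_mean = method.endswith("average")
    keep_far = method.startswith("farthest")

    def score(x):
        dists = [math.dist(x, m) for m in minority]
        return mean(dists) if use_mean else min(dists)

    # Sort ascending for "nearest" rules, descending for "farthest" rules.
    ranked = sorted(majority, key=score, reverse=keep_far)
    n_keep = min(len(majority), ratio * len(minority))
    return ranked[:n_keep]

# Toy data, invented for illustration only.
minority = [(0.0, 0.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (3.0, 0.0), (5.0, 5.0),
            (0.0, 2.5), (10.0, 0.0), (-4.0, 0.0)]

for name in sorted(METHODS):
    print(name, select_majority(majority, minority, name, ratio=1))
```

The reduced majority subset would then be combined with all minority instances to train a single classifier on a less skewed data set.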



Abstract
Existing classification analysis techniques (e.g., decision tree induction, backpropagation neural networks, and k-nearest neighbor classification) generally achieve satisfactory classification effectiveness on data with a non-skewed class distribution. However, real-world applications (e.g., churn prediction and fraud detection) often involve highly skewed decision outcomes (e.g., 2% churners and 98% non-churners). If not properly addressed, such a highly skewed class distribution imperils learning effectiveness and may yield a “null” prediction system that simply assigns every instance to the majority decision class of the training instances (e.g., predicting all customers as non-churners). In this study, we extend the multi-classifier class-combiner approach and propose a clustering-based multi-classifier class-combiner technique to address the highly skewed class distribution problem in classification analysis. In addition, we propose four distance-based methods that select a subset of the instances in the majority decision class to lower the degree of skewness in a data set. Using two real-world data sets (mortality prediction for burn patients and customer loyalty prediction), empirical results suggest that the proposed clustering-based multi-classifier class-combiner technique generally outperforms the traditional multi-classifier class-combiner approach and the four distance-based methods.

Keywords: Data Mining, Classification Analysis, Skewed Class Distribution Problem, Decision Tree Induction, Multi-classifier Class-combiner Approach, Clustering-based Multi-classifier Class-combiner Approach
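
The abstract does not spell out how the clustering-based multi-classifier class-combiner is built, so the following is only a sketch of one plausible reading: the majority instances are clustered, each cluster is paired with all minority instances to form a less skewed training subset, one base classifier is trained per subset, and a meta-level combiner learns how to map the pattern of base predictions to a final class. Everything here is an assumption for illustration: the function names, the farthest-point k-means initialisation, the nearest-centroid base learner (the thesis uses decision trees), the lookup-table combiner, and the toy data.

```python
import math
from collections import Counter

def centroid(rows):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def kmeans(points, k, iters=10):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        clusters = [cl for cl in clusters if cl]  # drop empty clusters
        centers = [centroid(cl) for cl in clusters]
    return clusters

def train_base(majority_part, minority):
    """Stand-in base learner (nearest centroid), not a decision tree."""
    c_major, c_minor = centroid(majority_part), centroid(minority)
    def classify(x):
        return "minority" if math.dist(x, c_minor) < math.dist(x, c_major) else "majority"
    return classify

def train_combiner(base_classifiers, instances, labels):
    """Meta-level lookup: pattern of base predictions -> most frequent true class."""
    table = {}
    for x, label in zip(instances, labels):
        pattern = tuple(clf(x) for clf in base_classifiers)
        table.setdefault(pattern, Counter())[label] += 1
    fallback = Counter(labels).most_common(1)[0][0]  # used for unseen patterns
    def classify(x):
        seen = table.get(tuple(clf(x) for clf in base_classifiers))
        return seen.most_common(1)[0][0] if seen else fallback
    return classify

def fit(majority, minority, k):
    """One base classifier per majority cluster, plus a combiner over them."""
    base = [train_base(cluster, minority) for cluster in kmeans(majority, k)]
    labels = ["majority"] * len(majority) + ["minority"] * len(minority)
    return base, train_combiner(base, majority + minority, labels)

# Toy data, invented for illustration: three majority groups, one minority group.
majority = [(10, 0), (11, 1), (9, -1), (10, 1),
            (-10, 0), (-11, 1), (-9, -1), (-10, -1),
            (0, 10), (1, 11), (-1, 9), (0, 11)]
minority = [(0, 0), (1, 0), (0, 1)]
base, model = fit(majority, minority, k=3)
print(model((0.5, 0.5)), model((10.5, 0.0)))
```

A learned combiner is used instead of a plain majority vote because each base classifier sees only one part of the majority class. On this toy data, two of the three base classifiers label the point (10.5, 0.0) as minority, so a simple vote would misclassify it, while the combiner maps that prediction pattern back to the majority class. For brevity the combiner is trained on the same instances as the base classifiers; a held-out set would be the more careful choice.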



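Table of Contents
Chapter 1 Introduction
Section 1 Research Background
Section 2 Research Motivation and Objectives
Section 3 Thesis Organization
Chapter 2 Literature Review
Section 1 Classification Analysis Techniques
Section 2 Existing Methods for Classifying Skewed Data
Chapter 3 Improved Methods for Classifying Skewed Data
Section 1 Clustering-based Multi-classifier Class-combiner
Section 2 Distance-based Selection for Majority-Class Reduction
Chapter 4 Empirical Evaluation
Section 1 Data Collection
Section 2 Evaluation Criteria and Procedure
Section 3 Empirical Results: Burn Injury Medical Data
1. Selecting the Best Class Ratio for the Multi-classifier Models
2. Selecting the Best Class Ratio for Distance-based Majority Reduction
3. Comparison of Classifier Effectiveness
Section 4 Empirical Results: Boutique Hypermarket Data
1. Selecting the Best Class Ratio for the Multi-classifier Models
2. Selecting the Best Class Ratio for Distance-based Majority Reduction
3. Comparison of Classifier Effectiveness
Chapter 5 Conclusions
Section 1 Summary and Contributions
Section 2 Research Limitations
Section 3 Future Research Directions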

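Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as not to violate the law.
Thesis access permission: available on and off campus after one year (withheld)
Available:
Campus: available
Off-campus: available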


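Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up this information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available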
