National Sun Yat-sen University (國立中山大學), thesis/dissertation, Anti-Spam Study: an Alliance-based Approach (以區域聯防為基礎之垃圾郵件防治研究)

Title: Anti-Spam Study: an Alliance-based Approach (以區域聯防為基礎之垃圾郵件防治研究)
Department: Department of Information Management
Year, semester: Spring semester of Academic Year 94
Language: English
Degree: Master
Number of pages: 91
Author: Yu-fen Chiu (邱郁芬)
Advisor: Jeng-Bing Chiang (鄭炳強); Chia-Mei Chen (陳嘉玫)
Convenor: Chu-Sing Yang (楊竹星)
Advisory Committee: Cheng-Fa Tsai (蔡正發)
Date of Exam: 2006-07-25
Date of Submission: 2006-09-12
Keywords: Rough set theory, Reinforcement learning, XCS classifier system, Spam, Text classification

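Abstract (translated from the Chinese)
The threat posed by spam grows more serious by the day, which underlines the value of spam-filtering techniques. Most current filtering techniques combine machine learning and data mining. They emphasize very high accuracy, but their false positive rate is not necessarily low, and in practice the losses caused by false positives are usually hard to recover. Many anti-spam solutions merely improve on one existing technique, and studies that mix several algorithms are rare. This study therefore proposes an alliance-based architecture that combines rough set theory, a genetic algorithm, and the XCS classifier system, with the aim of spreading timely information about spam widely so that mail servers can jointly block the flood of spam. Rough set theory is exceptionally capable of handling imprecise and incomplete data, and it is a rule-based algorithm whose output lends itself to exchange and sharing. Because computing the optimal combination of reducts in rough set theory is an NP-hard problem, a genetic algorithm, which can rapidly search, compare, and evolve good solutions over large amounts of data, is used to generate the spam-filtering rules. The reinforcement learning in XCS helps each mail server learn the mail classification criteria best suited to itself. Evaluated with several statistical methods, the alliance-based spam filtering results perform well, and two conclusions follow: (1) filtering rules exchanged from other mail servers do contribute to blocking more spam; (2) an anti-spam solution that mixes several algorithms can improve accuracy and the false positive rate at the same time.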
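To make the reduct idea concrete, here is a minimal sketch of what a reduct is and why an exhaustive search scales badly. It is not the thesis's implementation: the decision table, the attribute names (`has_link`, `all_caps_subject`, `known_sender`), and the helper functions are all hypothetical, and the search is plain brute force rather than a genetic algorithm.

```python
from itertools import combinations

# Hypothetical toy decision table; attribute names and rows are illustrative
# and are not taken from the thesis.
ATTRS = ("has_link", "all_caps_subject", "known_sender")
TABLE = [
    ((1, 1, 0), "spam"),
    ((1, 0, 0), "spam"),
    ((0, 0, 1), "ham"),
    ((1, 0, 1), "ham"),
    ((0, 1, 0), "spam"),
    ((0, 1, 1), "spam"),
]


def positive_region_size(attr_idx):
    """Count rows whose indiscernibility class on attr_idx has one decision."""
    classes = {}
    for cond, decision in TABLE:
        key = tuple(cond[i] for i in attr_idx)
        classes.setdefault(key, []).append(decision)
    return sum(len(ds) for ds in classes.values() if len(set(ds)) == 1)


def smallest_reducts():
    """Exhaustively find the smallest attribute subsets that preserve the
    positive region of the full attribute set (brute force, not a GA)."""
    all_idx = tuple(range(len(ATTRS)))
    target = positive_region_size(all_idx)
    for size in range(1, len(ATTRS) + 1):
        hits = [c for c in combinations(all_idx, size)
                if positive_region_size(c) == target]
        if hits:
            return [tuple(ATTRS[i] for i in c) for c in hits]
    return []


print(smallest_reducts())
```

With three attributes the loop checks at most seven subsets, but the number of candidate subsets doubles with every added attribute. That growth is the usual motivation for replacing exhaustive enumeration with a heuristic search such as a genetic algorithm.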
Abstract
The growing problem of spam has generated a need for reliable anti-spam filters. Many filtering techniques, combined with machine learning and data mining, are used to reduce the amount of spam. Such algorithms can achieve very high accuracy, but at the cost of some false positives, and false positives are generally prohibitively expensive in the real world. Much work has been done to improve specific algorithms for detecting spam, but less work has been reported on leveraging multiple algorithms in email analysis. This study presents an alliance-based approach to classifying, discovering, and exchanging interesting information on spam. The spam filter in this study is built on a combination of rough set theory (RST), a genetic algorithm (GA), and the XCS classifier system. RST can process imprecise and incomplete data such as spam. The GA speeds up the search for the optimal solution (i.e., the rules used to block spam). The reinforcement learning in XCS is a good mechanism for suggesting the appropriate classification for an email. The spam-filtering results of the alliance-based approach are evaluated with several statistical methods, and the filter performs well. Two main conclusions can be drawn from this study: (1) rules exchanged from other mail servers do help the filter block more spam than before; (2) a combination of algorithms both improves accuracy and reduces false positives in spam detection.
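The first conclusion can be illustrated with a small sketch of rule exchange between two mail servers. Everything here is an assumption made for illustration: the rule format (a rule fires when all of its tokens appear in a message), the sample messages, and the names `classify` and `evaluate` are hypothetical. Accuracy and false positive rate follow their standard definitions, which may differ from the performance criteria defined in the thesis.

```python
# Hypothetical rule format: a rule is a frozenset of tokens, and it flags a
# message as spam when every token in the rule occurs in the message.
def classify(rules, tokens):
    return "spam" if any(rule <= tokens for rule in rules) else "ham"


def evaluate(rules, mails):
    """Return (accuracy, false_positive_rate) using the standard definitions."""
    tp = fp = tn = fn = 0
    for tokens, label in mails:
        pred = classify(rules, tokens)
        if pred == "spam" and label == "spam":
            tp += 1
        elif pred == "spam" and label == "ham":
            fp += 1  # legitimate mail wrongly blocked
        elif pred == "ham" and label == "ham":
            tn += 1
        else:
            fn += 1  # spam that slipped through
    accuracy = (tp + tn) / len(mails)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


# Illustrative rule sets for a local server and one alliance peer.
local_rules = {frozenset({"viagra"}), frozenset({"free", "offer"})}
peer_rules = {frozenset({"lottery", "winner"})}

mails = [
    ({"cheap", "viagra"}, "spam"),
    ({"free", "offer", "now"}, "spam"),
    ({"lottery", "winner", "claim"}, "spam"),
    ({"meeting", "agenda"}, "ham"),
    ({"free", "lunch", "friday"}, "ham"),
]

print(evaluate(local_rules, mails))               # local rules only
print(evaluate(local_rules | peer_rules, mails))  # after importing peer rules
```

In this toy data the local rules miss the lottery message, and importing the peer's rule catches it without flagging either legitimate message. A real exchange would also need to vet imported rules, since a poorly chosen peer rule could raise the false positive rate.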

Table of Contents
Chapter 1 Introduction
  1.1 Recent Reports on Spam
  1.2 Problem Definition and Motivation
  1.3 Reader's Guide
Chapter 2 Related Works
  2.1 Spam Filtering Techniques Review
  2.2 Rough Sets Theory
  2.3 Genetic Algorithm
  2.4 XCS Classifier System
Chapter 3 Alliance-based Approach
  3.1 Single-server System
  3.2 System Architecture
  3.3 Performance Criteria
Chapter 4 Evaluation and Validation
  4.1 Design of Experiments
  4.2 Steps of Experiments
  4.3 The Respective Performances
  4.4 The Overall Performance
Chapter 5 Conclusions and Future Work
Appendix A: The Configuration of .procmailrc
Appendix B: Miscellaneous Notation and System Parameters
Bibliography


Fulltext
The electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Thesis access permission: open on campus after one year, never open off campus. Campus: available. Off-campus: not available.
Printed copies
Available.
