Detailed record for thesis/dissertation etd-0807108-164951
Title page for etd-0807108-164951
Title
以強化學習與協同合作為基礎的垃圾郵件防治研究
A Spam Filter Based on Reinforcement and Collaboration
Department
Year, semester
Language
Degree
Number of pages
63
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2008-07-25
Date of Submission
2008-08-07
Keywords
Spam, Data Mining and Artificial Intelligence, Rough Set Theory, Collaboration
Statistics
This thesis/dissertation has been browsed 5876 times and downloaded 12 times.
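Abstract (translated from the Chinese)
The flood of spam is a major nuisance for users, so a stable, reliable anti-spam filter is needed to help them screen it out. Most existing research focuses on generating filtering rules for a filter built on a single, standalone mail server. New spam, however, keeps appearing with new patterns designed to evade spam filters. This study proposes a collaborative approach for discovering and exchanging spam information and for reinforcing the mail server's own spam rules. By generating, reinforcing, and exchanging rules, it builds an automatic spam-filtering mechanism.
The study builds a rule-based spam filter on rough set theory: it identifies the characteristic patterns of spam, generates spam rules, and continually reinforces and exchanges those rules to filter spam. The experimental evaluation leads to the following conclusions:
(1) Chinese-language spam can be filtered by extracting features from mail headers.
(2) Spam filtering rules must be continually reinforced and updated to remain effective.
(3) Filtering rules exchanged through collaboration block more spam.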
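To make the rough set basis concrete, the following is a minimal sketch of the lower and upper approximations that rough set theory uses to separate certain from possible members of a class (here, spam). The decision table, the attribute names `received_continuous` and `has_reply_to`, and the function name `approximations` are hypothetical illustrations, not the attribute set or tooling used in the thesis.

```python
from collections import defaultdict

def approximations(table, attributes, target):
    """Return (lower, upper) approximations of `target` (a set of row ids)
    under the indiscernibility relation induced by `attributes`."""
    # Rows with identical values on the chosen attributes are indiscernible.
    classes = defaultdict(set)
    for row_id, row in table.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(row_id)

    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:   # class lies entirely inside the target
            lower |= members
        if members & target:    # class overlaps the target
            upper |= members
    return lower, upper

# Hypothetical decision table; attributes and values are illustrative only.
table = {
    1: {"received_continuous": "no",  "has_reply_to": "no",  "spam": True},
    2: {"received_continuous": "no",  "has_reply_to": "no",  "spam": True},
    3: {"received_continuous": "yes", "has_reply_to": "no",  "spam": True},
    4: {"received_continuous": "yes", "has_reply_to": "no",  "spam": False},
    5: {"received_continuous": "yes", "has_reply_to": "yes", "spam": False},
}
spam_rows = {i for i, r in table.items() if r["spam"]}
lower, upper = approximations(
    table, ["received_continuous", "has_reply_to"], spam_rows
)
print(sorted(lower))  # [1, 2]
print(sorted(upper))  # [1, 2, 3, 4]
```

Rows in the lower approximation support certain rules (every mail with those attribute values is spam), while rows that appear only in the upper approximation support rules that hold with some accuracy below 1.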
Abstract
The growing volume of spam has not only reduced people's productivity but has also become a security threat on the Internet. Mail servers should be able to filter out spam precisely even as it changes over time, and to manage, automatically and effectively, the growing number of spam rules that they generate. Most papers focus on only a single aspect of spam prevention, especially spam rule generation. In the real world, however, spam prevention is not just a matter of applying a data mining algorithm to generate rules; filtering spam correctly requires addressing many other issues as well.
In this work, we integrate three modules into a complete anti-spam system: a spam rule generation module, a spam rule reinforcement module, and a spam rule exchange module. A rule-based data mining approach generates exchangeable spam rules, user feedback reinforces those rules, and the rules are distributed and exchanged in a machine-readable XML format. The experimental results lead to the following conclusions: (1) The spam filter can filter Chinese-language spam by analyzing header characteristics. (2) Rules exchanged among mail servers improve the servers' spam recall and accuracy. (3) Reinforced rules are more effective.
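The reinforcement and exchange steps described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's implementation: the `SpamRule` class, the `support`/`correct` counters, the helper names, and the XML element and attribute names are all hypothetical, and the actual exchange schema is not given in this record.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class SpamRule:
    conditions: dict   # header attribute -> required value (hypothetical)
    support: int = 0   # number of mails the rule has matched
    correct: int = 0   # matches that user feedback confirmed as spam

    @property
    def accuracy(self):
        return self.correct / self.support if self.support else 0.0

    def matches(self, mail):
        return all(mail.get(k) == v for k, v in self.conditions.items())

def reinforce(rule, mail, user_says_spam):
    """Update a rule's counters from one piece of user feedback."""
    if rule.matches(mail):
        rule.support += 1
        if user_says_spam:
            rule.correct += 1

def rules_to_xml(rules):
    """Serialize rules to an assumed, illustrative XML layout."""
    root = ET.Element("rules")
    for rule in rules:
        node = ET.SubElement(root, "rule",
                             support=str(rule.support),
                             correct=str(rule.correct))
        for name, value in sorted(rule.conditions.items()):
            ET.SubElement(node, "condition", name=name, value=value)
    return ET.tostring(root, encoding="unicode")

def rules_from_xml(text):
    """Parse rules received from another mail server."""
    rules = []
    for node in ET.fromstring(text).findall("rule"):
        conditions = {c.get("name"): c.get("value")
                      for c in node.findall("condition")}
        rules.append(SpamRule(conditions,
                              int(node.get("support")),
                              int(node.get("correct"))))
    return rules

rule = SpamRule({"received_continuous": "no"})
reinforce(rule, {"received_continuous": "no"}, user_says_spam=True)
reinforce(rule, {"received_continuous": "no"}, user_says_spam=False)
reinforce(rule, {"received_continuous": "yes"}, user_says_spam=True)  # no match

received = rules_from_xml(rules_to_xml([rule]))
print(received[0].support, received[0].accuracy)  # 2 0.5
```

Carrying the counters along with each rule lets a receiving server apply its own thresholds (for example, a minimum support or accuracy) before adopting an exchanged rule; whether the thesis transmits such counters is an assumption here.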
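Table of Contents
Chapter 1 Introduction 1
Section 1 Research Background 1
Section 2 Research Motivation 4
Chapter 2 Literature Review 7
Section 1 Overview of E-mail 7
E-mail Systems 7
E-mail Architecture 8
Section 2 Related Work on Spam 11
Section 3 Machine Learning Techniques 15
Section 4 Rough Set Theory 17
Chapter 3 System Design 20
Section 1 Spam Characteristics 20
Section 2 Mail Filtering System 22
Mail Content Extraction Module 24
Mail Management Module 25
Rule Computation Module 25
Rule Management Module 27
Mail Filtering Module 28
Feedback Reinforcement Module 29
Rule Exchange Module 30
Section 3 Performance Metrics for Spam Filtering 31
Chapter 4 Experimental Results and Validation 34
Reinforcement Learning on a Single Server 36
Collaboration among Mail Servers 44
Chapter 5 Conclusion 51
Section 1 Contributions of This Study 51
Section 2 Future Work 51
References 52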


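List of Figures
Figure 1-1 Distribution of Common Spam 2
Figure 1-2 Top Ten Spam-Sending Sources 3
Figure 1-3 Collaboration Diagram 5
Figure 1-4 Collaboration Architecture 6
Figure 2-1 Mail Delivery Path 8
Figure 2-2 Mail Header 10
Figure 2-3 Illustration of the Mail Received Field 11
Figure 2-4 Comparison of Four Algorithms 15
Figure 2-5 Membership Functions of (A) a Crisp Set and (B) a Fuzzy Set 17
Figure 2-6 Sets Represented by the Lower Approximation and Upper Approximation 18
Figure 3-1 Received Example 20
Figure 3-2 Non-contiguous Received Fields 21
Figure 3-3 Illustration of a Problematic Path 21
Figure 3-4 System Module Architecture 23
Figure 3-5 ROSETTA Export to XML Format 27
Figure 3-6 Rule Import Flowchart 28
Figure 3-7 Mail Filtering Flowchart 29
Figure 3-8 Feedback Reinforcement Flowchart 30
Figure 4-1 System Flowchart 35
Figure 4-2 Spam Recall of Rule A at Different Support Levels 39
Figure 4-3 Spam Precision of Rule A at Different Support Levels 39
Figure 4-4 Week 2: Spam Precision of Rule A' at Different Support Levels 39
Figure 4-5 Week 2: Spam Recall of Rule A' at Different Support Levels 39
Figure 4-6 Week 3: Spam Precision of Rule A' at Different Support Levels 39
Figure 4-7 Week 3: Spam Recall of Rule A' at Different Support Levels 39
Figure 4-8 Mail A: Rule A vs. Rule A', Spam Recall 40
Figure 4-9 Mail A: Rule A vs. Rule A', Spam Precision 41
Figure 4-10 Mail A: Rule A vs. Rule A', Spam Accuracy 42
Figure 4-11 Mail A: Rule A vs. Rule A', Spam Miss Rate 42
Figure 4-12 Comparison of Rules with Support >= 5 and Accuracy = 0.8 43
Figure 4-13 Usage Ratio of Rule A' Spam Rules 44
Figure 4-14 Rule A' vs. Rule A' ∪ Rule B' Collaborative Rules: Spam Recall 46
Figure 4-15 Rule A' vs. Rule A' ∪ Rule B' Collaborative Rules: Spam Precision 46
Figure 4-16 Rule A' vs. Rule A' ∪ Rule B' Collaborative Rules: Spam Accuracy 47
Figure 4-17 Rule A' vs. Rule A' ∪ Rule B' Collaborative Rules: Miss Rate 48
Figure 4-18 Trend of Collaborative Rule B' Rules 49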

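List of Tables
Table 1-1 Spam Proportions 2
Table 2-1 Mail Agent Programs 7
Table 2-2 Mail Header Field Names and Descriptions 10
Table 2-3 Spam Prevention Methods 11
Table 2-4 Spam Filtering Methods 13
Table 2-5 Accuracy Comparison of the SVM and RST Algorithms in Intrusion Detection Systems (IDS) 16
Table 2-6 Performance Comparison of Naïve Bayes and RSC 16
Table 3-1 System Modules 23
Table 3-2 Mail Attributes 24
Table 3-3 Spam Decision Table 26
Table 3-4 Reduct Computation Methods Provided by ROSETTA 26
Table 3-5 Relationship Matrix of Actual versus Filtered Mail 31
Table 3-6 Comparison at the Same Error Rate 32
Table 4-1 Mail Data 36
Table 4-2 Mail Datasets 37
Table 4-3 Number of Rules for Mail A 38
Table 4-4 Number of Rules after Exchange between Mail A and Mail B (Rule A' ∪ Rule B') vs. without Exchange (Rule A') 45
Table 4-5 Number of Collaborative Spam Rules 50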
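Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; to avoid breaking the law, do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: available on campus after one year; never available off campus
Available:
Campus: available
Off-campus: not available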


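Printed copies
Availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the library's printed thesis service desk. We apologize for any inconvenience.
Available: available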
