博碩士論文 etd-0830115-201456 詳細資訊
Title page for etd-0830115-201456
論文名稱
Title
巨量資料分析平台建置:效能分析與惡意郵件過濾應用
Big Data Analytics Platform Establishment: Efficiency Analysis and Spam Email Filtering
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
69
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2015-09-11
繳交日期
Date of Submission
2015-09-30
關鍵字
Keywords
Hadoop、TF-IDF、分散式運算、巨量資料、惡意郵件過濾、樸素貝氏分類器
TF-IDF, naive Bayes classifier, distributed computing, spam email filtering, Hadoop, big data
統計
Statistics
本論文已被瀏覽 5714 次,被下載 21
The thesis/dissertation has been browsed 5714 times and downloaded 21 times.
中文摘要
巨量資料為資料分析應用帶來了許多新的挑戰,隨著資料量的不斷增加,傳統分析方法已無法勝任如此龐大的運算量,因此Hadoop被提出來解決上述問題,透過MapReduce與HDFS將數台運算節點結合為一個叢集運算平台。本論文建置了一套Hadoop巨量資料分析平台,並進行一系列的效能評估與分析,另外在平台上亦實作出一套惡意郵件過濾系統,藉由Term frequency-inverse document frequency (TF-IDF)與樸素貝氏分類器(Naive Bayes Classifier)的結合來提升惡意郵件過濾的效能。在實驗結果中,惡意郵件過濾的準確率能比Linux常用的SpamAssassin過濾系統更準確,且運算速度約為其10倍,大幅降低了惡意郵件為伺服器帶來的干擾。
Abstract
The era of big data has arrived. Beyond its benefits, big data poses new challenges for data analysis. The Hadoop platform was proposed to address them: it uses MapReduce and HDFS to analyze and store big data efficiently on a cluster of commodity computers. In this thesis, we built a Hadoop big data analytics platform and conducted a series of efficiency analyses on it. We also implemented a spam filtering system on the platform. The system combines term frequency-inverse document frequency (TF-IDF), which extracts keyword features from emails, with a naive Bayes classifier, which classifies each email as spam or non-spam. In the experiments, we compared our system with SpamAssassin, a robust spam filter widely used on Linux. The results show that our system detects most of the spam that SpamAssassin detects and rarely misclassifies non-spam emails as spam. Most importantly, it runs about 10 times faster than SpamAssassin.
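The pairing of TF-IDF features with a naive Bayes classifier described in the abstract can be illustrated with a minimal single-machine sketch. This is not the thesis implementation, which runs on Hadoop: the function names, the toy training emails, the smoothed IDF formula, and the choice to feed TF-IDF weights into a multinomial naive Bayes model as fractional counts are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency per term (assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(doc, idf):
    """Map a tokenized document to {term: tf * idf}; unseen terms are ignored."""
    counts = Counter(t for t in doc if t in idf)
    total = sum(counts.values()) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items()}

def train_nb(vectors, labels, vocab, alpha=1.0):
    """Multinomial naive Bayes over TF-IDF weights with additive smoothing."""
    class_docs = Counter(labels)
    weight = defaultdict(lambda: defaultdict(float))
    for vec, y in zip(vectors, labels):
        for t, w in vec.items():
            weight[y][t] += w
    model = {}
    for y, n_docs in class_docs.items():
        total = sum(weight[y].values()) + alpha * len(vocab)
        log_like = {t: math.log((weight[y][t] + alpha) / total) for t in vocab}
        model[y] = (math.log(n_docs / len(labels)), log_like)
    return model

def predict(vec, model):
    """Return the class with the highest log posterior score."""
    def score(y):
        log_prior, log_like = model[y]
        return log_prior + sum(w * log_like[t] for t, w in vec.items())
    return max(model, key=score)

# Hypothetical toy corpus; "ham" denotes non-spam.
train = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("cheap pills free offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
    ("lunch on monday with team", "ham"),
]
docs = [tokenize(text) for text, _ in train]
labels = [label for _, label in train]
idf = fit_idf(docs)
vectors = [tfidf(doc, idf) for doc in docs]
model = train_nb(vectors, labels, set(idf))

def classify(text):
    """Classify raw email text with the trained toy model."""
    return predict(tfidf(tokenize(text), idf), model)

print(classify("claim your free money"))   # spam
print(classify("monday project meeting"))  # ham
```

In a distributed setting, the document-frequency counts and per-class weight sums are the parts that would naturally be computed as map and reduce steps, since both are plain sums over documents.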
目次 Table of Contents
Approval Sheet i
Acknowledgements iii
Chinese Abstract iv
Abstract v
Contents vi
List of Figures viii
List of Tables x
Chapter 1 1
1.1 Introduction to Big Data 1
1.2 Motivation 3
1.3 Contributions 4
1.4 Organization 5
Chapter 2 6
2.1 Spam Email 8
2.1.1 Advertising Spam 8
2.1.2 Virus Spam 10
2.1.3 Phishing Spam 12
2.2 Spam Filtering Methods 13
2.2.1 Real-time Blackhole Lists 14
2.2.2 Distributed Checksum Clearinghouse 16
2.2.3 Decision Tree 16
2.2.4 Support Vector Machine 18
2.2.5 Apache SpamAssassin 19
Chapter 3 21
3.1 Overview 21
3.2 Hadoop Big Data Analytics Platform 22
3.2.1 Hadoop Distributed File System 23
3.2.2 MapReduce 26
3.2.3 Apache Mahout 27
3.3 Efficiency Analysis 29
3.4 Email Decoding 31
3.5 Term Frequency-Inverse Document Frequency 33
3.6 Naive Bayes Classifier 35
Chapter 4 39
Chapter 5 55
References 57
參考文獻 References
[1] Apache Hadoop, http://hadoop.apache.org.
[2] Apache Spark, http://spark.apache.org.
[3] L.-P. Jing, H.-K. Huang, and H.-B. Shi, “Improved feature selection approach TFIDF in text mining,” in Proceedings of International Conference on Machine Learning and Cybernetics, vol. 2, pp. 944-946, 2002.
[4] A. McCallum and K. Nigam, “A comparison of event models for naive Bayes text classification,” in Proceedings of AAAI Workshop on Learning for Text Categorization, vol. 752, pp. 41-48, 1998.
[5] RFC 6471, Overview of best email DNS-based list (DNSBL) operational practice, 2012.
[6] Rhyolite Software, Distributed Checksum Clearinghouses, http://www.rhyolite.com/dcc/.
[7] D. E. Johnson, F. J. Oles, T. Zhang, and T. Goetz, “A decision-tree-based symbolic rule induction system for text categorization,” IBM Systems Journal, vol. 41, no. 3, pp. 428-437, 2002.
[8] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[9] H. Drucker, D. Wu, and V. N. Vapnik, “Support vector machines for spam categorization,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1048-1054, 1999.
[10] Apache SpamAssassin, http://spamassassin.apache.org.
[11] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in Proceedings of IEEE Symposium on Mass Storage Systems and Technologies, pp. 1-10, 2010.
[12] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[13] Apache Mahout, http://mahout.apache.org.
[14] RFC 2047, MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text, 1996.
[15] V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam filtering with naive Bayes – which naive Bayes?” in Proceedings of the 3rd Conference on Email and Anti-Spam, 2006.
電子全文 Fulltext
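This electronic full text is licensed to users solely for academic research, for personal, non-profit searching, reading, and printing. Please comply with the Copyright Act of the Republic of China: do not reproduce, distribute, adapt, repost, or broadcast the text without authorization, so as to avoid breaking the law.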
論文使用權限 Thesis access permission: 自定論文開放時間 user-defined availability date
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
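Availability information for printed theses is relatively complete from academic year 102 onward. To inquire about printed theses from academic year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.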
開放時間 Available: 已公開 available
