National Sun Yat-sen University, thesis/dissertation, Big Data Analytics Platform Establishment: Efficiency Analysis and Spam Email Filtering

Title	Big Data Analytics Platform Establishment: Efficiency Analysis and Spam Email Filtering (巨量資料分析平台建置：效能分析與惡意郵件過濾應用)
Department	Department of Electrical Engineering
Year, semester	Fall semester of Academic Year 104	Language	English
Degree	Master	Number of pages	69
Author	Hsiang-Erh Lai (賴向二)
Advisor	Chia-Hung Yeh (葉家宏)
Convenor	Wu-Chih Hu (胡武誌)
Advisory Committee	Wen-Chuan Wu (吳汶涓); Chih-Yang Lin (林智揚)
Date of Exam	2015-09-11	Date of Submission	2015-09-30
Keywords	Hadoop, TF-IDF, distributed computing, big data, spam email filtering, naive Bayes classifier
Statistics	This thesis has been browsed 5714 times and downloaded 21 times.

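Abstract (translated from the Chinese)
Big data brings many new challenges to data analysis applications. As data volumes keep growing, traditional analysis methods can no longer handle the required computation, so Hadoop was proposed to address this problem: through MapReduce and HDFS, it combines several compute nodes into a cluster computing platform. This thesis builds a Hadoop big data analytics platform and carries out a series of performance evaluations and analyses. A spam email filtering system is also implemented on the platform, combining term frequency-inverse document frequency (TF-IDF) with a naive Bayes classifier to improve spam filtering performance. In the experimental results, the system filters spam more accurately than SpamAssassin, a filtering system commonly used on Linux, and runs about 10 times faster, greatly reducing the disturbance that spam causes to servers.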
Abstract
The era of big data has arrived. Beyond its potential for profit, big data brings new challenges for data analysis. The Hadoop platform was proposed to meet these challenges: it uses MapReduce and HDFS for efficient big data analytics and storage on a cluster of several commodity computers. In this thesis, we implemented a Hadoop big data platform and carried out several efficiency analyses. In addition, we implemented a spam filtering system on the Hadoop platform. The system uses term frequency-inverse document frequency (TF-IDF) to extract keyword features from emails and a naive Bayes classifier to classify each email as spam or non-spam. In the experiments, we compared our system with SpamAssassin, which is robust and widely used on Linux. The experimental results show that our system detects most of the spam that SpamAssassin detects and rarely misclassifies non-spam emails as spam. Most importantly, it runs about 10 times faster than SpamAssassin.
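The pipeline described in the abstract (TF-IDF keyword features fed to a naive Bayes classifier) can be illustrated with a minimal single-machine sketch. This is not the thesis implementation, which runs on Hadoop; the whitespace tokenizer, the tf = count / length and idf = log(N / df) variants, the use of TF-IDF weights as fractional counts in a multinomial model with add-one smoothing, and the four toy emails are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


class NaiveBayes:
    """Multinomial naive Bayes over weighted terms, with add-one smoothing."""

    def fit(self, feature_dicts, labels):
        self.vocab = {term for feats in feature_dicts for term in feats}
        self.class_counts = Counter(labels)
        self.term_weight = defaultdict(Counter)  # label -> term -> summed weight
        for feats, label in zip(feature_dicts, labels):
            for term, weight in feats.items():
                self.term_weight[label][term] += weight
        self.total_weight = {
            label: sum(self.term_weight[label].values())
            for label in self.class_counts
        }
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / n_docs)  # log prior
            denom = self.total_weight[label] + len(self.vocab)
            for term in tokenize(text):
                if term in self.vocab:  # terms unseen in training are skipped
                    score += math.log((self.term_weight[label][term] + 1.0) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical emails, not from the thesis).
train_docs = [
    "win cash prize now",
    "cheap prize win win",
    "meeting agenda for monday",
    "project meeting notes attached",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(tf_idf(train_docs), train_labels)
print(model.predict("win a cash prize"))
print(model.predict("project meeting on monday"))
```

Scores are accumulated in log space so that long emails do not underflow. In a distributed setting, the document-frequency counting and the per-class weight sums are the parts that would map naturally onto MapReduce jobs.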

Table of Contents
Approval Sheet
Acknowledgements
Chinese Abstract
Abstract
Contents
List of Figures
List of Tables
Chapter 1
  1.1 Introduction to Big Data
  1.2 Motivation
  1.3 Contributions
  1.4 Organization
Chapter 2
  2.1 Spam Email
    2.1.1 Advertising Spam
    2.1.2 Virus Spam
    2.1.3 Phishing Spam
  2.2 Spam Filtering Methods
    2.2.1 Real-time Blackhole Lists
    2.2.2 Distributed Checksum Clearinghouse
    2.2.3 Decision Tree
    2.2.4 Support Vector Machine
    2.2.5 Apache SpamAssassin
Chapter 3
  3.1 Overview
  3.2 Hadoop Big Data Analytics Platform
    3.2.1 Hadoop Distributed File System
    3.2.2 MapReduce
    3.2.3 Apache Mahout
  3.3 Efficiency Analysis
  3.4 Email Decoding
  3.5 Term Frequency-Inverse Document Frequency
  3.6 Naive Bayes Classifier
Chapter 4
Chapter 5
Reference


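Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: user-defined release date. Available: Campus: available; Off-campus: available. etd-0830115-201456.pdf
Printed copies
Public availability information for printed theses is relatively complete from Academic Year 102 onward. To look up this information for printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available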

Office of Library and Information Services, National Sun Yat-sen University │ Contact us: 2452, Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS