National Sun Yat-sen University, thesis/dissertation, An Improved Misra-Gries Algorithm for Finding Top-k Frequent Items and Its Parallelized Implementation

Title	An Improved Misra-Gries Algorithm for Finding Top-k Frequent Items and Its Parallelized Implementation (original title: 改良找尋Top-k頻繁項之Misra-Gries演算法與其平行化實作)
Department	Department of Computer Science and Engineering
Year, semester	Fall semester of Academic Year 106	Language	Chinese
Degree	Master	Number of pages	91
Author	Shun-ji Huang (黃順吉)
Advisor	Chun-hung Lin (林俊宏)
Convenor	Shi-huang Chen (陳璽煌)
Advisory Committee	Jiunn-ru Lai (賴俊如); Yun-nan Chang (張雲南); Wei-kuang Lai (賴威光)
Date of Exam	2017-09-04	Date of Submission	2017-10-25
Keywords	Frequent items, Top-K query, Misra-Gries algorithm, real-time streaming data, Apache Storm, parallel processing

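Abstract (translated from Chinese)
Finding the Top-K frequent items is a popular topic in data stream analysis with applications in many fields, such as monitoring network traffic or tracking product click-through rates for web advertisers. Because of hardware cost, we assume that memory is too limited to store every data item for processing. Instead, we use an approximate method that needs only bounded memory to process an unbounded data stream, at the cost of a reasonable error in the result. The Misra-Gries algorithm is one such approximate solution to the Top-K problem. However, its accuracy is very low on test data with low skew. In our experiments, when the Misra-Gries algorithm ran on test data generated according to Zipf's law with a skew of 0.4, its accuracy was only 32%. We therefore modify the update rule of the Misra-Gries algorithm and add the concept of a "sliding window" to raise the accuracy of the Top-K output. We name this improved version the "Sliding Window Misra-Gries algorithm", abbreviated "WMG". Under the limited-memory assumption, WMG uses only one more memory space than the Misra-Gries algorithm, and on the same skew-0.4 test data its accuracy rises to 80%. In addition, we define the error bound of WMG through theoretical analysis and support it with experiments. Furthermore, so that WMG can process streaming data in big-data environments, we implement a parallel version of WMG with Apache Storm. The experimental results confirm that the parallel version of WMG is suitable for big-data streaming environments: raising the degree of parallelism lets it process more data in the same amount of time (throughput), and the Top-K accuracy after parallelization is guaranteed to be no worse than that of the sequential version.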
Abstract
Finding the top-k frequent items over data streams is a popular and heavily studied problem in data stream analysis, with many applications such as network traffic monitoring and identifying frequently clicked products. Finding the exact top-k items is difficult, however, because the volume of data in a stream is so large that storing all of it before processing is costly. To deal with this problem, many algorithms have been proposed that produce approximate results with tolerable errors. The Misra-Gries algorithm is a counter-based algorithm used to find frequent items and top-k items in data streams. However, when the Misra-Gries algorithm processes test data whose distribution has low skew, the accuracy rate of the top-k result is very low. In our experiment, when the Misra-Gries algorithm processed test data generated from a Zipfian distribution with the skewness parameter set to 0.4, the accuracy rate of the top-k result was only 32%. To improve the accuracy rate, we partially modify the update rule of the Misra-Gries algorithm and add the concept of a "sliding window" to the algorithm. Although the top-k result is approximate, the error is guaranteed not to exceed an error bound. We name the proposed algorithm the "sliding window Misra-Gries algorithm", abbreviated "WMG". WMG is very simple and uses little memory. Compared with the Misra-Gries algorithm, WMG needs just one more memory space, and on the test data mentioned above it reaches an accuracy rate of 80%. In addition, we present a parallel version of WMG so that it can process larger volumes of streaming data. We explain how the WMG algorithm is parallelized and implement the parallel version with Apache Storm. The experimental results show that our parallel design achieves high throughput and that the accuracy rate of its top-k result is no worse than that of the sequential version.
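For readers unfamiliar with the baseline, the following is a minimal Python sketch of the classic Misra-Gries summary that the abstract takes as its starting point. It is not the WMG algorithm and not the thesis's Storm implementation, neither of which is specified in this record. The function names, the `num_counters` parameter, and the accuracy measure (fraction of the true top-k items that appear in the reported list) are assumptions made for illustration. `merge_summaries` follows the generally known approach for combining two Misra-Gries summaries (add the counters, then subtract the largest count that falls outside the budget), which is one plausible way to combine partial results from parallel workers.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List


def misra_gries(stream: Iterable[Hashable], num_counters: int) -> Dict[Hashable, int]:
    """Classic Misra-Gries summary that keeps at most `num_counters` counters."""
    if num_counters < 1:
        raise ValueError("num_counters must be at least 1")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def top_k(counters: Dict[Hashable, int], k: int) -> List[Hashable]:
    """Return the k items with the largest estimated counts (ties broken by name)."""
    ranked = sorted(counters.items(), key=lambda kv: (-kv[1], str(kv[0])))
    return [item for item, _ in ranked[:k]]


def top_k_accuracy(reported: List[Hashable], stream: List[Hashable], k: int) -> float:
    """Assumed accuracy measure: share of the exact top-k items that were reported."""
    exact_top = {item for item, _ in Counter(stream).most_common(k)}
    return len(exact_top & set(reported)) / k


def merge_summaries(a: Dict[Hashable, int], b: Dict[Hashable, int],
                    num_counters: int) -> Dict[Hashable, int]:
    """Merge two summaries back down to at most `num_counters` counters."""
    merged = Counter(a)
    merged.update(b)  # Counter.update adds counts for shared keys
    if len(merged) <= num_counters:
        return dict(merged)
    # Subtract the largest count that does not fit, then keep the positive rest.
    threshold = sorted(merged.values(), reverse=True)[num_counters]
    return {item: c - threshold for item, c in merged.items() if c > threshold}
```

As a small usage example, `misra_gries(["a"] * 5 + ["b"] * 3 + ["c", "d"], 2)` returns `{"a": 3, "b": 1}`: the two rare items each trigger one round of decrements, so the estimates undercount the true frequencies (5 and 3) while still ranking the items correctly. On low-skew data the decrements happen far more often, which is the accuracy problem the abstract describes.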

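Table of Contents
Thesis Approval Form
Abstract (Chinese)
Abstract
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Motivation and Objectives
1.2 Thesis Organization
Chapter 2 Problem Definition
Chapter 3 Related Work
Chapter 4 Background
4.1 Misra-Gries Algorithm
4.2 Space Saving Algorithm
Chapter 5 Improved Misra-Gries Algorithm
5.1 Window Definition
5.2 WMG Algorithm
5.3 Error Bound Analysis
5.4 Experimental Design and Data Description
5.4.1 Test Data
5.4.2 Parameters
5.4.3 Experiment 1
5.4.4 Experiment 2
5.4.5 Experiment 3
Chapter 6 Parallel Processing
6.1 Parallel Design of WMG
6.2 Tools Used
6.2.1 Storm Features
6.2.2 Storm Cluster Components
6.2.3 Storm Basic Concepts
6.3 Parallel Implementation with Storm
6.3.2 Input Spout
6.3.3 WMG Bolt
6.3.4 Merge Bolt
6.4 Experimental Environment
6.4.1 Host Hardware Information
6.4.2 Software and Library Versions
6.4.3 System Deployment Platform
6.5 Experimental Design and Data Description
Chapter 7 Conclusion
References
Appendix 1: Questions from the Degree Examination Committee and Responses
Appendix 2: Paper Submitted in English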


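Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: user-defined release time. Available: Campus: available. Off-campus: available. etd-0925117-160503.pdf
Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. To look up public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available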

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS