國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,一個以集合檢查來從資料串流中搜尋最大頻繁項目集的方法,A Set-Checking Algorithm for Mining Maximal Frequent Itemsets from Data Streams

論文名稱 Title	一個以集合檢查來從資料串流中搜尋最大頻繁項目集的方法 A Set-Checking Algorithm for Mining Maximal Frequent Itemsets from Data Streams
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	99 學年度第 2 學期 The spring semester of Academic Year 99	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	79
研究生 Author	林佩穎 Pei-Ying Lin
指導教授 Advisor	張玉盈 Ye-In Chang
召集委員 Convenor	陳健輝 Gen-Huey Chen
口試委員 Advisory Committee	林宣華, 李建億 Shian-Hua Lin; Chien-I Lee
口試日期 Date of Exam	2011-06-10	繳交日期 Date of Submission	2011-07-15
關鍵字 Keywords	最大頻繁項目集、晶格、地標視窗模型、項目集、資料串流 Itemset, Landmark Window Model, Lattice, Maximal Frequent Itemset, Data Stream
統計 Statistics	本論文已被瀏覽 5686 次，被下載 996 次 The thesis/dissertation has been browsed 5686 times, has been downloaded 996 times.

中文摘要
在資料探勘的領域中，從資料串流(data stream)中找出最大頻繁項目集(maximal frequent itemset)在線上探勘是一個重要的問題。最大頻繁項目集就是每個項目集(itemset)出現的次數大於等於支持值且此項目集不會是任何一個項目集的子集或超集。傳統資料庫搜尋最大頻繁項目集的演算法無法應用在資料串流中是因為資料串流有以下幾個特性： (1) 連續 (2) 快速 (3) 無資料上限 (4) 即時 (5) 一次搜尋完成。資料串流的探勘有幾個新的挑戰，第一，因為資料串流是連續且沒有限制資料量大小所以我們無法將所有的資料存在主要的記憶體或第二存取空間。第二，傳統探勘資料的方法存取資料是採用多次搜尋，因為資料串流只能一次搜尋所以不適用。第三，為了維持資料快速抵達以及期望快速回報答案，所以探勘資料串流需要快速和即時。為了解決以地標視窗模型(landmark window model)為基礎來從資料串流中搜尋最大頻繁項目集，Mao等人提出INSATNT演算法。在地標視窗模型裡是從一開始的時間到目前的時間搜尋資料，使用地標視窗模型的好處是我們可以很精確的找出我們要的答案。INSTANT演算法因為資料結構簡單，所以可以節省許多記憶體空間，但是在搜尋最大頻繁項目集要花費很多時間。當新的交易進來，INSTANT演算法和舊的交易做比較的次數會很多。在本論文裡，我們提出Set-Checking的方法在資料串流中用地標視窗模型搜尋最大頻繁項目集。我們提出的Set-Checking演算法是以晶格(lattice)的結構去存取我們目前所知道的資訊，所謂的晶格結構就是存在一個子節點是父節點的子集合。在每一個節點，我們會記錄項目集和支持值。當新的交易進來時，我們會針對交易的特性考慮五種集合： (1) equivalent (2) superset (3) subset (4) intersection (5) empty set。根據晶格的結構和以上五種集合的概念，我們可以很快新增我們的交易以及更新支持值。從我們的模擬結果可得知，我們提出的Set-Checking演算法比INSTANT演算法執行時間還要快。
Abstract
Online mining the maximal frequent itemsets over data streams is an important problem in data mining. The maximal frequent itemset is the itemset which the support is large or equal to the minimal support and the itemset is not the subset or superse of each itemset. Previous algorithms to mine the maximal frequent itemsets in the traditional database are not suitable for data streams. Because data streams have some characteristics: (1) continuous (2) fast (3) no data limit (4) real time (5) searching once, mining data streams have many new challenges. First, they are unrealistic to keep the entire stream in the main memory or even in a secondary storage area, since a data stream comes continuously and the amount of data is unbounded. Second, traditional methods of mining on stored datasets by multiple scans are infeasible, since the streaming data is passed only once. Third, mining streams requires fast, real-time processing in order to keep up with the high data arrival rate and mining results are expected to be available within short response time. In order to solve mining maximal frequent itemsets from data streams using the landmark window model, Mao et. al. propose the INSTANT algorithm. In the landmark window model, knowledge discovery is performed based on the values between the beginning time and the present. The advantage of using the landmark window model is that the results are correct as compared to the other models. The structure of the INSTANT algorithm is simple and it can save many memory space. But it takes long time in mining the maximal frequent itemsets. When the new transactions comes, the number of comparisons between the old transactions of INSATNT algorithm is too much. In this thesis, we propose the Set-Checking algorithm to mine frequent itemsets from data streams using the landmark window model. We use the structure of lattice to store our information. The structure of lattice records the subset relationship between the child node and the father node. For every node, we can record the itemset and the support. When the new transaction comes, we consider five relations: (1) equivalent (2) superset (3) subset (4) intersection (5) empty relations. According to the lattice structure of the five sets , we can add the transaction and the renew support efficiently. From our simulation result, we find that the process time of our Set-Checking algorithm is faster than that of the INSTANT algorithm.

目次 Table of Contents
LIST OF FIGURES.............................................................iii LIST OF TABLES................................................................vii 1. Intrduction........................................................................1 1.1 Mining Association Rules........................................2 1.2 Data Stream Mining...................................................2 1.2.1 Applications.........................................................2 1.2.2 Window Models in Data Streams..................4 1.3 Mining Maximal Frequent Itemsets in Data Streams...5 1.4 Related Work...9 1.5 Motivations.................................................................10 1.6 Organization of the Thesis......................................11 2. A Survey of Data Mining Algorithm.............................13 2.1 The Apriori Algoruthm for Association Rules Mining...13 2.2 The BBM Algorithm for Maximal Frequent Itemsets Mining...15 2.3 The MFI-TransSW ALgorithm for Frequent Itemsets Mining in Data Streams...18 2.4 Maximal Frequent Itemsets Mining in Data Streams...20 2.4.1 The INSATNT Algorithm.................................21 2.4.2 The DSM-MFI Algorithm.................................23 3. The Set-Checking Algorithm.......................................25 3.1 Observations for Mining Maximal Frequent Itemsets...25 3.2 The Proposed Algorithm.........................................26 3.2.1 Data structure..................................................31 3.2.2 Data Initialization.............................................32 3.2.3 Data Insertion..................................................33 3.2.4 The Maximal Frequent Itemset.....................40 3.2.5 A Comparison..................................................40 4. Performance 4.1 The Simulation Model.............................................50 4.2 Experiment Results.................................................51 5. Conclusion....................................................................59 5.1 Summary...................................................................59 5.2 Future Work...............................................................60 BIBLIOGRAPHY.................................................................61

參考文獻 References
[1] R. Agrawal and R. Srikant, "Fast Algorithm for Mining Association Rules in large databases," Proc. of the 20th Int. Conf. in Very Large Data Bases, pp. 490-501, 1994. [2] T. Calders, N. Dexters, and B. Goethals, “Mining Frequent Itemsets in a Stream,” Proc. of IEEE Int. Conf. on Data Mining, pp. 83-92, 2007. [3] J.H. Chang and W. S. Lee, “A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Stream,” Journal of Information Science and Engineering, Vol. 4, No. 11, pp. 753-762, July 2004. [4] M. S. Chen, J. Han and P. S. Yu, “Data Mining: An Overview from A Database Perspective,” IEEE Trans. On Knowledge and Data Eng., Vol. 8, No. 5, pp. 866-882, Dec. 1996. [5] Y. Chi, H. Wung, P. S. Yu, and R. R. Muntz, “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window,” Knowledge and Information Systems, pp. 59-66, 2004. [6] Y. Chi, H. Wung, P. S. Yu, and R. R. Muntz, “Catch the Moment: Maintaining Closed Frequent Itemsets over a Data Stream Sliding Window,” Proc. of the 4th IEEE Int. Conf. on Data Mining, pp. 265-294, 2006. [7] B. Datar and R. Widom, “Models and Issues in Data Stream Systems,” Proc. of 2002 ACM Symp. On Principles of Database Systems, Citeseer, 2002. [8] M. M. Gaber, A Zaslavsky, and S. Krishnaswamy, “Mining Data Streams” a Review,” ACM SIGMOD Record, Vol.34, No. 2, pp. 18-26, June 2005. [9] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu, “Mining Frequent Patterns in Data Streams at Multiple Time Granularities,” Nest Gnerartion Data Mining, Vol. 212, pp. 191-212, 2003. [10] N. Jiang and L. Gruenwald, “Research Issues in Data Stream Association Rule Mining,” ACM SIGMOD Record, Vol.35, No. 1, pp. 14-19, March 2006. [11] N. Jiang and L. Gruenwald, “CFI-stream: Mining Closed Frequent Itemsets in Data Streams,” Proc. of the 12th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 592-597, 2006. [12] J. L. Koh and S. N. Shin, “An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams,” Computer Science Data Warehousing and Knowledge Discovery, Vol. 4081, No. 1, 99. 352-362, Sept. 2006. [13] C.-H. Lee, C.-R. Lin and M.-S. Chen, “Sliding Window Filtering: An Efficient Method for Incremental Mining on a Time-Variant Database,” Information Systems, pp. 227-244, 2005. [14] W. Lee, C. T. Park, and S. J. Stolfo, “Automated Intrusion Detection Using NFR: Methods and Experiences,” Proc. of the 1st conference on Workshop on Intrusion Detection and Network Monitoring, pp. 63-72, 1999. [15] H. F. Li, S. Y. Lee, and M. K. Shan, “An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams,” Proc. of Int. Conf. on Principals and Practice of Knowledge Discovery in Databases, pp. 20-24, 2004. [16] H. F. Li and S. Y. Lee, “Mining Frequent Itemsets over Data Streams Using Efficient Window Sliding Techniques,” Expert Systems with Applications, Vol. 36, pp. 1466-1477, March 2009. [17] H. F. Li, S. Y. Lee, and M. K. Shan, “Online Mining (Recently) Maximal Frequent Itemsets from Data Streams,” In Proc. of the IEEE RIDE, 2005. [18] C. H. Lin, D. Y. Chiu, and A. L. P. Chen, “Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window,” Proc. of SIAM Int. Conf. on Data Mining, pp. 648-660, 2005. [19] G. Mao, X. Wu, X. Zhu, G. Chen, and C. Liu, ”Mining Maximal Frequent Itemsets From Data Streams,” Journal of Information Science, Vol. 33, No. 3, pp. 251-262, March 2007. [20] J. Srivastava, R. Cooley, M. Deshpande, and P. N. Tan, “Web Usage Mining: Discovery and Applications of Usage Patterns From Web Data,” ACM SIGMOD Explorations Newsletter, Vol. 1, No. 2, pp.12-23, Jan. 2000. [21] Y. Tao and D. Papadias, “Maintaining Sliding Window Skylines on Data Stream,” Journal of Information Science and Engineering, Vol. 18, No. 3, pp. 377-391, March 2006. [22] J.-W. Xin, G.-Q. Yang, J.-Z Sun, and Y.-P. Zhang, “A New Algorithm for Discovery Maximal Frequent Itemsets Based on Binary Vector Sets,” Machine Learning and Cybernetics, pp. 1120-1124, 2006.

電子全文 Fulltext
本電子全文僅授權使用者為學術研究之目的，進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定，切勿任意重製、散佈、改作、轉貼、播送，以免觸法。論文使用權限 Thesis access permission：校內外都一年後公開 withheld 開放時間 Available：校內 Campus：已公開 available 校外 Off-campus：已公開 available etd-0715111-180126.pdf
紙本論文 Printed copies
紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊，請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。開放時間 available 已公開 available

QR Code

國立中山大學圖書與資訊處 │ 諮詢服務：2452 論文審查小組 │ 服務信箱 │ 系統開發維運：圖資處知識創新組

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS