博碩士論文 etd-0625109-204302 詳細資訊
Title page for etd-0625109-204302
論文名稱
Title
一個於資料串流中有效率地以集合晶格來探勘封閉頻繁集的方法
An Efficient Subset-Lattice Algorithm for Mining Closed Frequent Itemsets in Data Streams
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
76
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2009-06-12
繳交日期
Date of Submission
2009-06-25
關鍵字
Keywords
關聯式法則、封閉頻繁集、資料串流、頻繁集
frequent itemset, closed frequent itemset, data streams, association rules
統計
Statistics
本論文已被瀏覽 5770 次,被下載 2718
The thesis/dissertation has been browsed 5770 times and downloaded 2718 times.
中文摘要
在資料探勘的領域中,線上探勘關聯式法則是一個重要的問題,而關聯式法則表示有些項目在一筆交易中的存在將會意味著其它項目在同一筆交易中的出現。在資料串流的領域中有許多利用關聯式法則的應用,像是市場分析、網路安全、感測網路和網頁追蹤。我們可以透過關聯式法則,從資料串流中潛在地萃取出一些有用的資訊。而找出封閉頻集(closed frequent itemsets)為其進階的工作,是為了找出能夠萃取出所有資訊的代表集合。形式上,封閉頻繁集是一種找不到superset與它具有相同支持值的頻繁集。由於資料串流的特性是連續、快速、且沒有資料上限。把資料串流中的資訊收集歸檔幾乎是不可能的事情,並且只允許對資料串流做一次掃描,沒有辦法透過需要許多次掃描的方法從資料串流中找出關聯式法則。因此在先前傳統資料庫的演算法中,尋找封閉頻繁集的方法已經無法適用在資料串流上。在另一方面,散佈在資料串流中的資料通常是隨著時間而改變,並且有許多的應用對於距離現在時間最近的資料有比較大的興趣。對於這類型的應用,有一種可以設定視窗大小來獲取資料的方法稱為”滑動式窗模型”(sliding window model) 來處理資料串流中最近的資料。NewMoment演算法是在許多有名的封閉頻繁集演算法中利用滑動式窗模型的其中之一。然而,NewMoment演算法卻不能夠有效率的在資料串流中來探勘封閉頻繁集,因為它們必須不僅產生封閉頻繁集,且還產生一些不必要的候選集來輔助演算法的運作。此外,當sliding window在遞增地更新時,NewMoment演算法必須重新建造整個樹狀資料結構。因此,在本論文中,我們提出一個以滑動視窗來探索的方法,Subset-Lattice演算法,將子集合的性質成為判斷情況嵌入在晶格的結構中,使得能有效率的從資料串流中找出封閉頻繁集。為了可以善用集合的特性,我們利用一種晶格資料結構來儲存資料。基本上,我們所提出的Subset-Lattice演算法當新增資料時考慮到五種集合的觀念 : (1) equivalent (2) superset (3) subset (4) intersection (5) empty relation,我們透過這五種集合的觀念以不產生後選集的方式來判斷封閉頻繁集。在Sliding Window更新的同時,我們的Subset-Lattice演算法將不會重新建構整個資料結構。此外,我們利用位元表示法來表示一筆項目集,並且使用位元運算來加速集合檢查和支撐值的計算。從我們的模擬結果得知,我們所提出的Subset-Lattice演算法可以比NewMoment演算法使用更少的記憶體和處理時間。當sliding window開始滑移時,所需要的執行時間可以節省至50%
Abstract
Online mining of association rules over data streams is an important problem in data mining, where an association rule states that the presence of some items in a transaction implies the presence of other items in the same transaction. Association rules over data streams have many applications, such as market analysis, network security, sensor networks, and web tracking.
Mining closed frequent itemsets is a further step in mining association rules; it aims to find a representative subset of the frequent itemsets from which all frequent itemsets can be derived. Formally, a closed frequent itemset is a frequent itemset that has no proper superset with the same support. Since data streams are continuous, high-speed, and unbounded, archiving everything from a data stream is impossible. That is, the stream can be scanned only once and must be processed in main memory. Therefore, earlier algorithms for mining closed frequent itemsets in traditional databases are not suitable for data streams. On the other hand, many applications are interested in the most recent data, and the Sliding Window Model meets this need by keeping only the most recent data within a window of a given size. One well-known algorithm for mining closed frequent itemsets based on the sliding window model is the NewMoment algorithm. However, the NewMoment algorithm cannot mine closed frequent itemsets in data streams efficiently, since it generates not only closed frequent itemsets but also many non-closed frequent itemsets. Moreover, when the data in the sliding window is incrementally updated, the NewMoment algorithm needs to reconstruct the whole tree structure. Therefore, in this thesis, we propose a sliding window approach, the Subset-Lattice algorithm, which embeds the subset property into the lattice structure to mine closed frequent itemsets efficiently. Basically, our proposed algorithm considers five kinds of set relations when new data is inserted: (1) equivalent, (2) superset, (3) subset, (4) intersection, and (5) empty relation. Using these five relations, we identify closed frequent itemsets without generating non-closed frequent itemsets.
Moreover, when the data in the sliding window is incrementally updated, our Subset-Lattice algorithm does not reconstruct the whole lattice structure. Therefore, our Subset-Lattice algorithm is more efficient than the Moment algorithm. Furthermore, we use bit patterns to represent itemsets and bit operations to speed up set checking. Our simulation results show that the Subset-Lattice algorithm needs less memory and less processing time than the NewMoment algorithm. When the window slides, the execution time can be reduced by up to 50%.
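The ideas named in the abstract can be illustrated with a small sketch. It is not the Subset-Lattice algorithm itself: the lattice structure and its incremental insertion and deletion are not reproduced here. It only shows, under assumed names (`ITEMS`, `to_bits`, `relation`, `support`, `closed_frequent`, and `to_items` are all hypothetical), how itemsets might be encoded as bit patterns, how the five set relations could be told apart with bit operations, and how the definition of a closed frequent itemset can be checked by brute force on a toy window.

```python
from itertools import combinations

ITEMS = "ABCD"  # hypothetical item universe; bit i stands for ITEMS[i]


def to_bits(itemset):
    """Encode an itemset such as "ACD" as an integer bit pattern."""
    bits = 0
    for item in itemset:
        bits |= 1 << ITEMS.index(item)
    return bits


def to_items(bits):
    """Decode a bit pattern back into a string of item names."""
    return "".join(item for i, item in enumerate(ITEMS) if bits >> i & 1)


def relation(x, y):
    """Classify how bit pattern x relates to bit pattern y."""
    common = x & y
    if x == y:
        return "equivalent"
    if common == y:
        return "superset"      # x contains every item of y
    if common == x:
        return "subset"        # every item of x is in y
    if common:
        return "intersection"  # partial overlap
    return "empty"             # no shared item


def support(x, window):
    """Count the transactions in the window that contain itemset x."""
    return sum(1 for t in window if t & x == x)


def closed_frequent(window, min_support):
    """Brute-force reference check: enumerate every non-empty itemset."""
    frequent = {}
    for size in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, size):
            x = to_bits(combo)
            s = support(x, window)
            if s >= min_support:
                frequent[x] = s
    # Closed: no proper superset has the same support.
    return {
        x: s for x, s in frequent.items()
        if not any(y != x and y & x == x and frequent[y] == s for y in frequent)
    }


# Toy window of four transactions (made-up data, not from the thesis).
window = [to_bits(t) for t in ("CD", "AB", "ABC", "ABC")]
closed = {to_items(x): s for x, s in closed_frequent(window, 2).items()}
print(closed)
```

The brute-force enumeration grows exponentially with the number of items, so it is useful only as a correctness reference for small examples; avoiding that kind of enumeration is the stated motivation for the approach in the abstract.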
目次 Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
1.1 Data Stream Mining
1.1.1 Applications
1.1.2 Window Models in Data Streams
1.1.3 Mining Frequent Itemsets in Data Streams
1.2 Mining Closed Frequent Itemsets in Data Streams
1.3 Related Work
1.4 Motivations
1.5 Organization of the Thesis
2. A Survey
2.1 Association Rules Mining
2.1.1 The Apriori Algorithm
2.1.2 The FP-Growth Algorithm
2.2 Frequent Itemsets Mining in Data Streams
2.2.1 The FP-Stream Algorithm
2.2.2 The DSTree Algorithm
2.3 Closed Frequent Itemsets Mining in Data Streams
2.3.1 The Moment Algorithm
2.3.2 The NewMoment Algorithm
3. The Subset-Lattice Algorithm
3.1 Interesting Observations for Mining Closed Frequent Itemsets
3.2 Interesting Observations of Set-Relations Between Frequent Itemsets
3.3 The Subset-Lattice Algorithm
3.3.1 Data Insertion
3.3.2 Data Deletion
3.3.3 Hash Table
3.4 A Comparison
4. Performance
4.1 The Simulation Model
4.2 Experiment Results
5. Conclusion
5.1 Summary
5.2 Future Work
BIBLIOGRAPHY
LIST OF FIGURES
1.1 An example of stock streams
1.2 The landmark window model
1.3 The tilted-time window model with logarithmic partition
1.4 The sliding window model
1.5 The data stream mining model
1.6 The processing model of data streams
1.7 Approximate mining result: (a) False-Positive answer, (b) False-Negative answer
1.8 Transaction Database T
1.9 Frequent itemsets of Transaction Database T with the minimal support = 2
1.10 Maximal large itemsets of Transaction Database T with the minimal support = 2
1.11 Closed large itemsets of Transaction Database T with the minimal support = 2
1.12 An example of the transaction data streams TDS
1.13 A comparison of the number of the node checks between the NewMoment algorithm and the Subset-Lattice algorithm
1.14 The result of processing the transaction data streams TDS by the NewMoment algorithm
1.15 The result of processing the transaction data streams TDS by the Subset-Lattice algorithm
2.1 Example 1 of the transaction database
2.2 Generation of candidate itemsets and large itemsets
2.3 Example 2 of the transaction database
2.4 The constructed FP-tree for Example 2
2.5 The pattern tree with tilted-time windows embedded in the FP-Stream
2.6 The FP-Streaming Algorithm
2.7 The DSTree after each batch of transactions is added
2.8 The Closed Enumeration Tree
2.9 Example of Transaction-Sensitive Window
2.10 The NewCET in the first window
3.1 The set-relations diagram between two transactions: (a) equivalent; (b) disjoint; (c) belong; (d) contain; (e) partial overlap
3.2 An Example of a Lattice
3.3 The flowchart of inserting a transaction NewT
3.4 Conditions of subsumption checking: (a) Case 1; (b) Case 2; (c) Case 3; (d) Case 4; (e) Case 5
3.5 An example of the data stream
3.6 Procedure InsertD
3.7 Function FindT
3.8 Procedure CheckSet
3.9 Subset-Lattice of transaction Tid(1) and transaction Tid(2)
3.10 Subset-Lattice of transaction Tid(3)
3.11 Subset-Lattice of transaction Tid(4)
3.12 Procedure Check relation
3.13 Subset-Lattice of transaction Tid(5)
3.14 Procedure DeleteD
3.15 The flowchart of deleting the oldest transaction
3.16 Subset-Lattice after deleting itemset {CD}: (a) the original lattice; (b) updating the Tidset of itemset {CD} and itemset {D}; (c) deleting itemset {D}; (d) deleting itemset {CD}
3.17 An example of the transaction data stream
3.18 The result of processing the transaction data stream by the NewMoment algorithm
3.19 The result of processing the transaction data stream by the Subset-Lattice algorithm
3.20 A comparison of the number of the node checks for different algorithms
4.1 A comparison of the processing time of loading the first window with different sliding window sizes
4.2 A comparison of the processing time of loading the first window with different minimum supports
4.3 A comparison of the average processing time of the window sliding with different minimum supports
4.4 A comparison of the average processing time of the window sliding with different sliding window sizes
4.5 A comparison of the number of nodes of loading the first window with different numbers of items
4.6 A comparison of the number of nodes of loading the first window with different sliding window sizes
4.7 A comparison of the number of nodes of loading the first window with different minimum supports
LIST OF TABLES
2.1 An example of the incoming stream data
2.2 Bit-sequences of items in each window
3.1 Variables
4.1 Parameters used in the experiment
4.2 Parameter values for synthetic data
參考文獻 References
BIBLIOGRAPHY
[1] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules in Large Databases,” Proc. of the 20th Int. Conf. on Very Large Data Bases, pp. 490–501, 1994.
[2] T. Calders, N. Dexters, and B. Goethals, “Mining Frequent Itemsets in a Stream,” Proc. of IEEE Int. Conf. on Data Mining, pp. 83–92, 2007.
[3] J. H. Chang and W. S. Lee, “A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Stream,” Journal of Information Science and Engineering, Vol. 4, No. 11, pp. 753–762, July 2004.
[4] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz, “Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window,” Proc. of the 4th IEEE Int. Conf. on Data Mining, pp. 59–66, 2004.
[5] C. Giannella, J. Han, J. Pei, X. Yan, and P. Yu, “Mining Frequent Patterns in Data Streams at Multiple Time Granularities,” Proc. of the NSF Workshop on Next Generation Data Mining, pp. 191–211, 2002.
[6] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach,” Data Mining and Knowledge Discovery, Vol. 8, No. 1, pp. 53–87, Jan. 2004.
[7] N. Jiang and L. Gruenwald, “Research Issues in Data Stream Association Rule Mining,” ACM SIGMOD Record, Vol. 35, No. 1, pp. 14–19, March 2006.
[8] N. Jiang and L. Gruenwald, “CFI-stream: Mining Closed Frequent Itemsets in Data Streams,” Proc. of the 12th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 592–597, 2006.
[9] J. L. Koh and S. N. Shin, “An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams,” Computer Science Data Warehousing and Knowledge Discovery, Vol. 4081, No. 1, pp. 352–362, Sept. 2006.
[10] W. Lee, C. T. Park, and S. J. Stolfo, “Automated Intrusion Detection Using NFR: Methods and Experiences,” Proc. of the 1st conference on Workshop on Intrusion Detection and Network Monitoring, pp. 63–72, 1999.
[11] C. K. S. Leung and Q. I. Khan, “DSTree: A Tree Structure for the Mining of Frequent Sets from Data Streams,” Proc. of the 6th Int. Conf. on Data Mining, pp. 205–226, 2006.
[12] C. K. S. Leung, Q. I. Khan, and T. Hoque, “CanTree: A Tree Structure for Efficient Incremental Mining of Frequent Patterns,” Proc. of the 5th Int. Conf. on Data Mining, pp. 274–281, 2005.
[13] H. F. Li, S. Y. Lee, and M. K. Shan, “DSM-PLW: Single-Pass Mining of Path Traversal Patterns over Streaming Web Click-Sequences,” Proc. of Computer Networks on Web Dynamics, pp. 1474–1487, 2006.
[14] H. F. Li, C. C. Ho, and S. Y. Lee, “A New Algorithm for Maintaining Closed Frequent Itemsets in Data Streams by Incremental Updates,” Proc. of the 6th IEEE Int. Conf. on Data Mining, pp. 672–676, 2007.
[15] H. F. Li, S. Y. Lee, and M. K. Shan, “An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams,” Proc. of Int. Conf. on Principles and Practice of Knowledge Discovery in Databases, pp. 20–24, 2004.
[16] C. H. Lin, D. Y. Chiu, Y. H. Wu, and A. L. P. Chen, “Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window,” Proc. of SIAM Int. Conf. on Data Mining, pp. 648–660, 2005.
[17] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Discovering Frequent Closed Itemsets for Association Rules,” Proc. of the 7th Int. Conf. on Database Theory, pp. 398–416, 1998.
[18] C. Raissi, P. Poncelet, and M. Teisseire, “Towards a New Approach for Mining Frequent Itemsets on Data Stream,” Intelligent Information Systems, Vol. 28, No. 1, pp. 23–36, Dec. 2007.
[19] J. Srivastava, R. Cooley, M. Deshpande, and P. N. Tan, “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data,” ACM SIGKDD Explorations Newsletter, Vol. 1, No. 2, pp. 12–23, Jan. 2000.
[20] Y. Tao and D. Papadias, “Maintaining Sliding Window Skylines on Data Stream,” Journal of Information Science and Engineering, Vol. 18, No. 3, pp. 377–391, March 2006.
電子全文 Fulltext
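This electronic full text is licensed to users solely for academic research, for personal, non-profit searching, reading, and printing. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as not to violate the law.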
論文使用權限 Thesis access permission:校內外都一年後公開 withheld
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
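Availability information for printed theses is relatively complete from academic year 102 onward. To inquire about the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.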
開放時間 available 已公開 available
