博碩士論文 etd-0728104-135327 詳細資訊
Title page for etd-0728104-135327
論文名稱
Title
一個於大型資料庫中以滑動視窗來探勘最大項目集合的方法
A Sliding-Window Approach to Mining Maximal Large Itemsets for Large Databases
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
89
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2004-06-04
繳交日期
Date of Submission
2004-07-28
關鍵字
Keywords
資料探索、關聯式法則、漸增式探索、分割、最大代表集
partition, Incremental Mining, Maximal Large Itemsets, Data Mining, Association Rules
統計
Statistics
本論文已被瀏覽 5652 次,被下載 0
This thesis/dissertation has been browsed 5652 times and downloaded 0 times.
中文摘要
探索關聯式法則,指的是如何從尚未分析的資料找出有用的資訊。而找出最大代表集(maximal large itemsets)為其進階工作,是為了找出能夠代表全部資訊的代表集合。以前用來找出最大代表集的方法可以分為兩類:窮舉法(exhausted)跟捷徑法(shortcut)。捷徑法會比窮舉法產生較少的項目集,且在時間與儲存空間上會有較佳的效能。在另一方面,當資料庫有更新動作時,可以重新執行整個探索的演算法。另一方法,則為漸增性探索(incremental mining)。這個方法可以有效率地更新有關關聯式法則的資訊,且不需要重新執行整個探索的演算法。然而,以前基於捷徑法的演算法無法支援最大代表集的漸增性探索。而用於漸增性探索的演算法,如SWF演算法,卻不能夠有效地支援最大代表集的探索,因為這些方法是基植於窮舉法。因此,在本篇論文中,我們著重在設計一個同時能在找出最大代表集與漸增性探索這兩方面都有良好效能的演算法。根據一些在最大代表集方面之觀察,例如,“一個項目集如果是大代表集的,那它的子項目集必定會是大代表集,所以這些子項目集不需要再被檢查”,我們提出一個以滑動視窗(Sliding-Window)來探索的方法,SWMax演算法,能同時有效地找出最大代表集及漸增性探索。我們的SWMax演算法是一個只需掃瞄兩次,且以分割資料庫為基礎的方法。我們將在掃瞄第一遍時找出所有大小分別為1與3的候選項目集,(C1, C3),還有大小分別是1與3的大代表集(L1, L3)。我們在掃瞄第一遍資料庫之後,會去生產虛擬的最大代表集。然後,我們利用L1去生產C2,用L3生產C4,用C4生產C5,直到Ck無法被生產出來為止。在第二次掃瞄資料庫時,我們對Ck做累計同時找出最大代表集。在漸增性探索方面,我們考慮兩種情況:(1)資料新增,(2)資料刪除。在這兩種情況下,當用SWF演算法,只要有大小為1的項目集在舊的資料庫無法成為候選項目而不被保留,那在更新之後的資料庫也決不會再出現,因此導致正確結果可能會被流失。也就是,可能會有遺漏的情況發生。因為SWF演算法只保留大小為2的項目集的資訊。而我們的SWMax演算法,則可以正確地進行漸增性探索。因為,我們保留了大小為1與3的項目集來支援更新資訊。在我們的效能測試中,我們會生產一些人造的資料庫,來模擬真實交易記錄的資料庫。從我們的模擬結果得知,我們的SWMax演算法可以比SWF演算法產生較少的項目集且需要較少的執行時間。
Abstract
Mining association rules is the nontrivial extraction of implicit,
previously unknown, and potentially useful information from data in databases.
Mining maximal large itemsets is a further task in mining association rules; it aims
to find a set of large (frequent) itemsets that is representative of all large
itemsets. Previous algorithms for mining maximal large itemsets fall into two
approaches: exhaustive and shortcut. The shortcut approach can generate fewer
candidate itemsets than the exhaustive approach,
resulting in better performance in terms of time and storage space.
When the transaction database is updated,
one possible approach is to re-run the mining algorithm on the whole
database. The other approach is incremental mining, which aims to maintain the
discovered association rules efficiently without re-running the mining algorithm.
However, previous shortcut-based algorithms for mining maximal large itemsets
cannot support incremental mining, while algorithms for incremental mining,
e.g., the SWF algorithm, cannot efficiently support mining maximal large
itemsets, since they are based on the exhaustive approach.
Therefore, in this thesis, we focus on the design of an
algorithm that performs well for both mining maximal large itemsets and incremental mining.
Based on some observations, for example, "if an itemset is large, all its
subsets must be large; therefore, those subsets need not be examined
further", we propose a sliding-window approach, the SWMax algorithm, for
efficiently mining maximal large itemsets and for incremental mining. Our
SWMax algorithm is a two-pass, partition-based approach. In the first pass, we find
all candidate 1-itemsets ($C_1$), candidate 3-itemsets ($C_3$), large 1-itemsets ($L_1$),
and large 3-itemsets ($L_3$).
After the first pass, we generate the virtual maximal large itemsets. Then we use
$L_1$ to generate $C_2$, $L_3$ to generate $C_4$, $C_4$ to generate $C_5$, and so on,
until no further $C_k$ can be generated. In the second pass, we use the virtual
maximal large itemsets to prune $C_k$ and determine the maximal large itemsets.
For incremental mining, we consider two cases: (1)
data insertion and (2) data deletion. In both cases, if an itemset
of size 1 is not large in the original database, the SWF algorithm cannot find it
in the updated database. That is, the incremental mining process of the SWF
algorithm can miss itemsets, because the SWF algorithm keeps only the $C_2$ information.
Our SWMax algorithm, in contrast, supports incremental mining
correctly, since it maintains $C_1$ and $C_3$.
In our simulation, we generate synthetic databases to simulate real transaction
databases. The simulation results show that our SWMax algorithm generates fewer
candidates and needs less time than the SWF algorithm.
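
To make the notion of a maximal large itemset concrete, here is a minimal Python sketch. It is not the SWMax algorithm or the SWF algorithm; it is a plain level-wise (exhaustive) search followed by a filter, written only to illustrate the definitions used in the abstract. The function names `large_itemsets` and `maximal_itemsets` are hypothetical, and the minimum support is assumed to be an absolute transaction count.

```python
from itertools import combinations


def large_itemsets(transactions, min_support):
    """Level-wise (exhaustive) search; returns {itemset: support count}.

    Assumption: min_support is an absolute number of transactions.
    """
    transactions = [frozenset(t) for t in transactions]
    # Candidate 1-itemsets: every distinct item in the database.
    candidates = {frozenset([item]) for t in transactions for item in t}
    large = {}
    while candidates:
        k = len(next(iter(candidates)))
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        large.update(level)
        # Join large k-itemsets into (k+1)-candidates, keeping only those
        # whose k-subsets are all large (every subset of a large itemset is large).
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k))
        }
    return large


def maximal_itemsets(large):
    """Keep only the large itemsets that have no large proper superset."""
    return {s for s in large if not any(s < other for other in large)}


# Toy database of four transactions (illustrative data only).
txns = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "d"}, {"c", "d"}]
large = large_itemsets(txns, min_support=2)
maximal = maximal_itemsets(large)
```

On this toy database there are eight large itemsets, but only two maximal ones, {a, b, c} and {d}. Every other large itemset is a subset of one of these two, which is the sense in which the maximal large itemsets represent all large itemsets.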
目次 Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
1.1 Mining Association Rules
1.1.1 Formal Problem Description
1.1.2 Applications
1.2 Mining Maximal Large Itemsets
1.3 Incremental Mining
1.4 Related Work
1.5 Motivations
1.6 Organization of the Thesis
2. A Survey of Algorithms for Mining Maximal Large Itemsets and Incremental Mining
2.1 The Apriori Algorithm
2.2 The Partition Algorithm
2.3 The Pincer-Search Algorithm
2.4 The Max-Miner Algorithm
2.5 The Pattern Decomposition Algorithm
2.6 The SWF Algorithm
3. The SWMax Algorithm
3.1 Interesting Observations for Mining Maximal Large Itemsets
3.2 The SWMax Algorithm
3.3 Interesting Observations for Incremental Mining
3.4 Incremental Mining Process
3.5 A Comparison
4. Performance
4.1 Generation of Synthetic Data
4.2 Experiments
4.3 A Comparison
5. Conclusion
5.1 Summary
5.2 Future Work
BIBLIOGRAPHY
A. The Counting Approach in the SWF Algorithm
參考文獻 References
[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. of the 20th Int. Conf. on Very Large Data Bases, 1994.
[2] R. J. Bayardo, "Efficiently Mining Long Patterns from Databases," Proc. of Int. Conf. on Data Eng., pp. 85-93, 1998.
[3] D. Burdick, M. Calimlim, and J. Gehrke, "Mafia: A Maximal Frequent Itemset Algorithm for Transactional Databases," Proc. of Int. Conf. on Data Eng., pp. 443-452, 2001.
[4] Y. I. Chang and Y. M. Hsieh, "Setm*-Lmax: An Efficient SET-Based Approach to Find Maximal Large Itemsets," Proc. of Int. Conf. on Computer Symposium: Workshop on Software Eng. and Database Systems, 2002.
[5] M. S. Chen, J. Han, and P. S. Yu, "Data Mining: An Overview from A Database Perspective," IEEE Trans. on Knowledge and Data Eng., Vol. 8, No. 5, pp. 866-882, Dec. 1996.
[6] D. W. Cheung, J. Han, V. T. Ng, and C. Y. Wong, "Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique," Proc. of the 12th IEEE Int. Conf. on Data Eng., 1996.
[7] D. W. Cheung, S. D. Lee, and B. Kao, "A General Incremental Technique for Maintaining Discovered Association Rules," Proc. on Database Systems for Advanced Applications, pp. 185-194, April 1997.
[8] U. M. Fayyad, G. P. Shapiro, P. Smyth, and R. Uthurusamy, "Advances in Knowledge Discovery and Data Mining," AAAI/MIT Press, 1996.
[9] Y. Fu, "Data Mining: Tasks, Techniques, and Applications," IEEE Potentials, Vol. 16, No. 4, pp. 18-20, 1997.
[10] C. H. Lee, C. R. Lin, and M. S. Chen, "Sliding-Window Filtering: An Efficient Algorithm for Incremental Mining," Proc. of the 10th ACM Int. Conf. on Information and Knowledge Management, pp. 263-270, 2001.
[11] D. I. Lin and Z. M. Kedem, "Pincer-Search: An Efficient Algorithm for Discovering the Maximum Frequent Set," IEEE Trans. on Knowledge and Data Eng., Vol. 14, No. 3, May/June 2002.
[12] J. S. Park, M. S. Chen, and P. S. Yu, "An Effective Hash-Based Algorithm for Mining Association Rules," Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pp. 175-186, 1999.
[13] V. Pudi and J. R. Haritsa, "Quantifying The Utility of The Past in Mining Large Databases," Information Systems, Vol. 25, No. 5, pp. 323-343, April 2000.
[14] N. L. Sarda and N. V. Srinivas, "An Adaptive Algorithm for Incremental Mining of Association Rules," Proc. of the 9th Int. Workshop on Database and Expert Systems Applications, pp. 240-246, 1998.
[15] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. of the 21st VLDB Conf., pp. 432-444, 1995.
[16] A. Silberschatz, M. Stonebraker, and J. D. Ullman, "Database Research: Achievements and Opportunities into The 21st Century," Tech. Rep. 26-27, Report NSF Workshop Future of Database Systems Research, May 1995.
[17] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka, "An Efficient Algorithm for The Incremental Updation of Association Rules in Large Databases," Proc. of the 3rd Int. Conf. on Knowledge Discovery and Data Mining, pp. 263-266, 1997.
[18] P. S. M. Tsai, C. C. Lee, and A. L. P. Chen, "An Efficient Approach for Incremental Association Rule Mining," Proc. of Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 74-83, 1999.
[19] Y. J. Tsay and Y. W. Chang-Chien, "An Efficient Cluster and Decomposition Algorithm for Mining Association Rules," Information Sciences, Vol. 160, pp. 161-171, Mar. 2004.
[20] M. J. Zaki and K. Gouda, "Fast Vertical Mining Using Diffsets," Proc. of the 9th Int. Conf. on Knowledge Discovery and Data Mining, pp. 24-27, 2003.
[21] M. J. Zaki and C. J. Hsiao, "CHARM: An Efficient Algorithm for Closed Itemset Mining," Proc. of the SIAM Int. Conf. on Data Mining, pp. 99-110, 2002.
[22] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New Algorithms for Fast Discovery of Association Rules," Proc. of the 3rd Int. Conf. on Knowledge Discovery and Data Mining, pp. 283-286, 1997.
[23] M. Zhang, D. C. B. Kao, and C. L. Yip, "Efficient Algorithms for Incremental Update of Frequent Sequences," Proc. of the 6th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 186-197, 2002.
[24] Q. Zou, D. J. W. Chu, and H. Chiu, "Pattern Decomposition Algorithm for Data Mining of Frequent Patterns," Knowledge and Information Systems, Vol. 4, No. 4, pp. 466-482, Oct. 2002.
電子全文 Fulltext
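This electronic full text is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.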
論文使用權限 Thesis access permission:校內校外均不公開 not available
開放時間 Available:
校內 Campus:永不公開 not available
校外 Off-campus:永不公開 not available


紙本論文 Printed copies
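Public-availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed-thesis service desk of the library and information office. We apologize for any inconvenience.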
開放時間 Available:已公開 available
