論文使用權限 Thesis access permission:校內校外均不公開 not available
開放時間 Available:
校內 Campus:永不公開 not available
校外 Off-campus:永不公開 not available
論文名稱 Title |
一個於大型資料庫中以滑動視窗來探勘最大項目集合的方法 A Sliding-Window Approach to Mining Maximal Large Itemsets for Large Databases |
||
系所名稱 Department |
|||
畢業學年期 Year, semester |
語文別 Language |
||
學位類別 Degree |
頁數 Number of pages |
89 |
|
研究生 Author |
|||
指導教授 Advisor |
|||
召集委員 Convenor |
|||
口試委員 Advisory Committee |
|||
口試日期 Date of Exam |
2004-06-04 |
繳交日期 Date of Submission |
2004-07-28 |
關鍵字 Keywords |
資料探索、關聯式法則、漸增式探索、分割、最大代表集 partition, Incremental Mining, Maximal Large Itemsets, Data Mining, Association Rules |
||
統計 Statistics |
本論文已被瀏覽 5652 次,被下載 0 次 The thesis/dissertation has been browsed 5652 times, has been downloaded 0 times. |
中文摘要 |
探索關聯式法則,指的是如何從尚未分析的資料找出有用的資訊。而找出最大代表集(maximal large itemsets)為其進階工作,是為了找出能夠代表全部資訊的代表集合。以前用來找出最大代表集的方法可以分為兩類:窮舉法(exhausted)跟捷徑法(shortcut)。捷徑法會比窮舉法產生較少的項目集,且在時間與儲存空間上會有較佳的效能。在另一方面,當資料庫有更新動作時,可以重新執行整個探索的演算法。另一方法,則為漸增性探索(incremental mining)。這個方法可以有效率地更新有關關聯式法則的資訊,且不需要重新執行整個探索的演算法。然而,以前基於捷徑法的演算法無法支援最大代表集的漸增性探索。而用於漸增性探索的演算法,如SWF演算法,卻不能夠有效地支援最大代表集的探索,因為這些方法是基植於窮舉法。因此,在本篇論文中,我們著重在設計一個同時能在找出最大代表集與漸增性探索這兩方面都有良好效能的演算法。根據一些在最大代表集方面之觀察,例如,“一個項目集如果是大代表集的,那它的子項目集必定會是大代表集,所以這些子項目集不需要再被檢查”,我們提出一個以滑動視窗(Sliding-Window)來探索的方法,SWMax演算法,能同時有效地找出最大代表集及漸增性探索。我們的SWMax演算法是一個只需掃瞄兩次,且以分割資料庫為基礎的方法。我們將在掃瞄第一遍時找出所有大小分別為1與3的候選項目集,(C1, C3),還有大小分別是1與3的大代表集(L1, L3)。我們在掃瞄第一遍資料庫之後,會去生產虛擬的最大代表集。然後,我們利用L1去生產C2,用L3生產C4,用C4生產C5,直到Ck無法被生產出來為止。在第二次掃瞄資料庫時,我們對Ck做累計同時找出最大代表集。在漸增性探索方面,我們考慮兩種情況:(1)資料新增,(2)資料刪除。在這兩種情況下,當用SWF演算法,只要有大小為1的項目集在舊的資料庫無法成為候選項目而不被保留,那在更新之後的資料庫也決不會再出現,因此導致正確結果可能會被流失。也就是,可能會有遺漏的情況發生。因為SWF演算法只保留大小為2的項目集的資訊。而我們的SWMax演算法,則可以正確地進行漸增性探索。因為,我們保留了大小為1與3的項目集來支援更新資訊。在我們的效能測試中,我們會生產一些人造的資料庫,來模擬真實交易記錄的資料庫。從我們的模擬結果得知,我們的SWMax演算法可以比SWF演算法產生較少的項目集且需要較少的執行時間。 |
Abstract |
Mining association rules, means a process of nontrivial extraction of implicit, previously and potentially useful information from data in databases. Mining maximal large itemsets is a further work of mining association rules, which aims to find the set of all subsets of large (frequent) itemsets that could be representative of all large itemsets. Previous algorithms to mining maximal large itemsets can be classified into two approaches: exhausted and shortcut. The shortcut approach could generate smaller number of candidate itemsets than the exhausted approach, resulting in better performance in terms of time and storage space. On the other hand, when updates to the transaction databases occur, one possible approach is to re-run the mining algorithm on the whole database. The other approach is incremental mining, which aims for efficient maintenance of discovered association rules without re-running the mining algorithms. However, previous algorithms for mining maximal large itemsets based on the shortcut approach can not support incremental mining for mining maximal large itemsets. While the algorithms for incremental mining, {it e.g.}, the SWF algorithm, could not efficiently support mining maximal large itemsets, since it is based on the exhausted approach. Therefore, in this thesis, we focus on the design of an algorithm which could provide good performance for both mining maximal itemsets and incremental mining. Based on some observations, for example, ``{it if an itemset is large, all its subsets must be large; therefore, those subsets need not to be examined further}", we propose a Sliding-Window approach, the SWMax algorithm, for efficiently mining maximal large itemsets and incremental mining. Our SWMax algorithm is a two-passes partition-based approach. We will find all candidate 1-itemsets ($C_1$), candidate 3-itemsets ($C_3$), large 1-itemsets ($L_1$), and large 3-itemsets ($L_3$) in the first pass. We generate the virtual maximal large itemsets after the first pass. Then, we use $L_1$ to generate $C_2$, use $L_3$ to generate $C_4$, use $C_4$ to generate $C_5$, until there is no $C_k$ generated. In the second pass, we use the virtual maximal large itemsets to prune $C_k$, and decide the maximal large itemsets. For incremental mining, we consider two cases: (1) data insertion, (2) data deletion. Both in Case 1 and Case 2, if an itemset with size equal to 1 is not large in the original database, it could not be found in the updated database based on the SWF algorithm. That is, a missing case could occur in the incremental mining process of the SWF algorithm, because the SWF algorithm only keeps the $C_2$ information. While our SWMax algorithm could support incremental mining correctly, since $C_1$ and $C_3$ are maintained in our algorithm. We generate some synthetic databases to simulate the real transaction databases in our simulation. From our simulation, the results show that our SWMax algorithm could generate fewer number of candidates and needs less time than the SWF algorithm. |
目次 Table of Contents |
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Mining Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Formal Problem Description . . . . . . . . . . . . . . . . . . . 2 1.1.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Mining Maximal Large Itemsets . . . . . . . . . . . . . . . . . . . . . 5 1.3 Incremental Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.5 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.6 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 14 2. A Survey of Algorithms for Mining Maximal Large Itemsets and Incremental Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.1 The Apriori Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 The Partition Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.3 The Pincer-Search Algorithm . . . . . . . . . . . . . . . . . . . . . . 23 2.4 The Max-Miner Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 26 2.5 The Pattern Decomposition Algorithm . . . . . . . . . . . . . . . . . 27 2.6 The SWF Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3. The SWMax Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.1 Interesting Observations for Mining Maximal Large Itemsets . . . . . 35 3.2 The SWMax Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3 Interesting Observations for Incremental Mining . . . . . . . . . . . . 56 3.4 Incremental Mining Process . . . . . . . . . . . . . . . . . . . . . . . 60 3.5 A Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4. Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.1 Generation of Synthetic Data . . . . . . . . . . . . . . . . . . . . . . 74 4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.3 A Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 A. The Counting Approach in the SWF Algorithm . . . . . . . . . . . 89 |
參考文獻 References |
[1] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules," Proc. of the 20th Int. Conf. on Very Large Data Bases, 1994. [2] R. J. Bayardo, Efficiently Mining Long Patterns from Databases," Proc. of Int. Conf. on Data Eng., pp. 85{93, 1998. [3] D. Burdick, M. Calimlim, and J. Gehrke, Ma a: A Maximal Frequent Item- set Algorithm for Transactional Databases," Proc. of Int. Conf. on Data Eng., pp. 443{452, 2001. [4] Y. I. Chang and Y. M. Hsieh, Setm*-Lmax: An Efficient SET-Based Approach to Find Maximal Large Itemsets," Proc. of Int. Conf. on Computer Symposium: Workshop on Software Eng. and Database Systems, 2002. [5] M. S. Chen, J. Han, and P. S. Yu, Data Mining: An Overview from A Database Perspective," IEEE Trans. on Knowledge and Data Eng., Vol. 8, No. 5, pp. 866{ 882, Dec. 1996. [6] D. W. Cheung, J. Han, V. T. Ng, and C. Y. Wong, Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique," Proc. of the 12th IEEE Int. Conf. on Data Eng., 1996. [7] D. W. Cheung, S. D. Lee, and B. Kao, A General Incremental Technique for Maintaining Discovered Association Rules," Proc. on Database Systems for Ad- vanced Applications, pp. 185{194, April 1997. [8] U. M. Fayyad, G. P. Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining," AAAI/MIT Press, 1996. [9] Y. Fu, Data Mining: Tasks, Techniques, and Applications," IEEE Potentials, Vol. 16, No. 4, pp. 18{20, 1997. [10] C. H. Lee, C. R. Lin, and M. S. Chen, Sliding-Window Filtering: An E? cient Algorithm for Incremental Mining," Proc. of the 10th ACM Int. Conf. on Information and Knowledge Management, pp. 263{270, 2001. [11] D. I. Lin and Z. M. Kedem, Pincer-Search: An Efficient Algorithm for Discov- ering the Maximum Frequent Set," IEEE Trans. on Knowledge and Data Eng., Vol. 14, No. 3, May/June 2002. [12] J. S. Park, M. S. Chen, and P. S. Yu, An E ective Hash-Based Algorithm for Mining Association Rules," Proc. of the ACM SIGMOD Int. Conf. on Manage- ment of Data, pp. 175{186, 1999. [13] V. Pudi and J. R. Haritsa, Quantifying The Utility of The Past in Mining Large Databases," Information Systems, Vol. 25, No. 5, pp. 323{343, April 2000. [14] N. L. Sarda and N. V. Srinivas, An Adaptive Algorithm for Incremental Mining of Association Rules," Proc. of the 9th Int. Workshop on Database and Expert Systems Applications, pp. 240{246, 1998. [15] A. Savasere, E. Omiecinski, and S. Navath, An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. of the 21st VLDB Conf., pp. 432{ 444, 1995. [16] A. Silbersrchatz, M. Stonebraker, and J. D. Ullman, Database Research: Achievements and Opportunities into The 21st Century," Tech. Rep. 26{27, Re- port NSF Workshop Future of Database Systems Research, May 1995. [17] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka, An Efficient Algorithm for The Incremental Updation of Association Rules in Large Databases," Proc. of the 3rd Int. Conf. on Knowledge Discovery and Data Mining, pp. 263{266, 1997. [18] P. S. M. Tsai, C. C. Lee, and A. L. P. Chen, An Efficient Approach for In- cremental Association Rule Mining," Proc. of Paci c-Asia Conf. on Knowledge Discovery and Data Mining, pp. 74{83, 1999. [19] Y. J. Tsay and Y. W. Chang-Chien, An Efficient Cluster and Decomposition Al- gorithm for Mining Association Rules," Information Sciences, Vol. 160, pp. 161{ 171, Mar. 2004. [20] M. J. Zaki and K. Gouda, Fast Vertical Mining Using Di sets," Proc. of the 9th Int. Conf. on Knowledge Discovery and Data Mining, pp. 24{27, 2003. [21] M. J. Zaki and C. J. Hsiao, CHARM: An Efficient Algorithm for Closed Itemset Mining," Proc. of the SIAM Int. Conf. on Data Mining, pp. 99{110, 2002. [22] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, New Algorithms for Fast Discovery of Association Rules," Proc. of the 3rd Int. Conf. on Knowledge Discovery and Data Mining, pp. 283{286, 1997. [23] M. Zhang, D. C. B. Kao, and C. L. Yip, Efficient Algorithms for Incremental Update of Frequent Sequences," Proc. of the 6th Paci c-Asia Conf. on Knowl- edge Discovery and Data Mining, pp. 186{197, 2002. [24] Q. Zou, D. J. W. Chu, and H. Chiu, Pattern Decomposition Algorithm for Data Mining of Frequent Patterns," Knowledge and Information Systems, Vol. 4, No. 4, pp. 466{482, Oct. 2002. |
電子全文 Fulltext |
本電子全文僅授權使用者為學術研究之目的,進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定,切勿任意重製、散佈、改作、轉貼、播送,以免觸法。 論文使用權限 Thesis access permission:校內校外均不公開 not available 開放時間 Available: 校內 Campus:永不公開 not available 校外 Off-campus:永不公開 not available 您的 IP(校外) 位址是 44.197.116.176 論文開放下載的時間是 校外不公開 Your IP address is 44.197.116.176 This thesis will be available to you on Indicate off-campus access is not available. |
紙本論文 Printed copies |
紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊,請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。 開放時間 available 已公開 available |
QR Code |