國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,以集合為基礎之有效率的資料關聯式法則挖掘方法 ,AN EFFICIENT SET-BASED APPROACH TO MINING ASSOCIATION RULES

論文名稱 Title	以集合為基礎之有效率的資料關聯式法則挖掘方法 AN EFFICIENT SET-BASED APPROACH TO MINING ASSOCIATION RULES
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	88 學年度第 2 學期 The spring semester of Academic Year 88	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	129
研究生 Author	謝佑明 Yu-Ming Hsieh
指導教授 Advisor	張玉盈 Ye-In Chang
召集委員 Convenor
口試委員 Advisory Committee	李建億, 郭大維, 黃三益 Chien-I Lee; Tei-Wei Kuo; San-Yih Hwang
口試日期 Date of Exam	2000-06-23	繳交日期 Date of Submission	2000-07-28
關鍵字 Keywords	關聯式法則、資料挖掘 association rule, data mining
統計 Statistics	本論文已被瀏覽 5673 次，被下載 0 次 The thesis/dissertation has been browsed 5673 times, has been downloaded 0 times.

中文摘要
在資料挖掘的領域裡，關聯式法則的發現是一個很重要的問題。基本上，對於一個買賣的交易資料庫，能夠去發現產品間的關聯式關係是很有用的。所謂的關聯式關係，乃是說，在同一筆交易中，某些產品的存在必然暗示著其它某些特定產品之存在。由於關聯式法則的挖掘必須在大筆交易資料庫裡，反覆地掃瞄，來找出不同的關聯式樣本。所以，它的處理數量是很大的，而執行效率的改善則是一個十分重要的課題。在這個問題當中，如何有效率地計算出大的項目集合(稱為Lk)為主要的工作，而所謂的大的項目集合乃是具有足夠交易筆數的項目集合。在這篇論文裡，我們提出以高階集合為基礎之結構化查詢語言來有效率地完成找到大的項目集合。以集合為基礎之方法主要能清楚地表達要的是什麼，而非像低階之方法一般要明確地指明如何去做，這裡所謂低階之方法乃指從資料庫裡一次只取出一筆記錄。以集合為基礎之方法的優點，如同SETM演算法一般，為簡單且穩定。但是，Houtsma與Swami所提的SETM演算法會產生很多不必要的候選項目集合。因此，我們提出一個新的以集合為基礎之演算法，稱之為SETM*。它保留了SETM演算法的優點，同時避免了SETM演算法會產生太多候選項目集合的缺點。在SETM*演算法中，我們藉由修改建立候選資料庫的方式來減少了它的大小，所謂候選資料庫乃是由包含候選項目集合之交易資料所組成的交易資料庫。再則，我們以SETM*演算法為基礎，提出三個演算法：SETM-2K，SETM-MaxK與SETM-Lmax。在SETM-2K演算法中，使用者指定一個k值，我們有效地以Lw為基礎來求得Lk，此時 $w=2^{\lceil \log_{2}k \rceil-1}$，而非一步一步地去求Lk。在SETM-MaxK演算法中，我們有效率地以Lw為基礎來求得Lk，此時 $L_{k} \neq \emptyset$, $L_{k+1}=\emptyset$ 與 $w=2^{\lceil \log_{2}k \rceil - 1}$，而非一步一步地去求Lk。在SETM-Lmax演算法中，我們採用向前方法從Lk中找到所有最大的大項目集合，此時k個項目集合並不被包含在j個項目集合的k個項目子集合中，除了k = MaxK (指最大數的k) 以外，此時 $1 \leq k < j \leq MaxK$, $L_{MaxK} \neq \emptyset$ 與 $L_{MaxK+1}=\emptyset$。在模擬測試中，我們建立幾個人造的關聯式資料庫，來模擬顧客的交易情況。從我們的模擬結果報告中顯示出，我們所提SETM*演算法，在不同的資料庫下，無論是執行過程所需的磁碟空間或總執行時間，皆優於SETM演算法。此外，從結果顯示中，我們所提的SETM-2K或SETM-MaxK演算法，在達到其目的之執行時間，皆比SETM或SETM*演算法來的少。再者，我們也顯示出所提出的向前方法(SETM-Lmax)比Agrawal所提出之向後方法，所需的時間來得少。
Abstract
Discovery of association rules is an important problem in the area of data mining. Given a database of sales transactions, it is desirable to discover the important associations among items such that the presence of some items in a transaction implies the presence of other items in the same transaction. Since mining association rules may require repeated scans of a large transaction database to find different association patterns, the amount of processing can be huge, and performance improvement is an essential concern. Within this problem, the major task is to count large itemsets efficiently, where a large itemset is a set of items appearing in a sufficient number of transactions. In this thesis, we propose efficient algorithms for mining association rules based on a high-level set-based approach. A set-based approach allows a clear expression of what needs to be done, as opposed to a low-level approach, which specifies exactly how the operations are carried out and retrieves one tuple from the database at a time. The advantage of a set-based approach, such as the SETM algorithm, is that it is simple and stable over the range of parameter values. However, the SETM algorithm proposed by Houtsma and Swami may generate too many invalid candidate itemsets. Therefore, in this thesis, we propose a set-based algorithm called SETM*, which retains the advantages of the SETM algorithm while avoiding its disadvantages. In the SETM* algorithm, we reduce the size of the candidate database by modifying the way it is constructed, where a candidate database is a transaction database formed with candidate $k$-itemsets. Then, based on the new way of constructing the candidate database in the SETM* algorithm, we propose the SETM-2K, SETM-MaxK and SETM-Lmax algorithms.
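To make the notions of a large itemset and of a set-based, join-and-count formulation concrete, here is a minimal Python sketch. It is an illustration only, not the thesis's SETM* procedure: the function name `large_itemsets`, the data layout, and the absolute `min_support` threshold are assumptions introduced here. It imitates relational operations (a join on the transaction id followed by a group-by with a count filter) using in-memory lists.

```python
from collections import Counter


def large_itemsets(transactions, min_support):
    """Return {k: {itemset: count}} for every large k-itemset.

    transactions: dict mapping a transaction id to an iterable of items
                  (hypothetical layout chosen for this sketch).
    min_support:  minimum number of supporting transactions (absolute count).
    """
    # SALES relation: one (tid, item) row per distinct purchased item.
    sales = sorted({(tid, item)
                    for tid, items in transactions.items()
                    for item in items})
    items_by_tid = {}
    for tid, item in sales:
        items_by_tid.setdefault(tid, []).append(item)

    # Candidate rows for k = 1: (tid, itemset), itemsets as sorted tuples.
    rows = [(tid, (item,)) for tid, item in sales]
    result = {}
    k = 1
    while rows:
        # GROUP BY itemset HAVING COUNT(*) >= min_support.
        counts = Counter(itemset for _, itemset in rows)
        large = {s: c for s, c in counts.items() if c >= min_support}
        if not large:
            break
        result[k] = large
        # Keep rows whose itemset is large, then join with SALES on tid,
        # extending each itemset by a strictly greater item.
        rows = [(tid, itemset + (item,))
                for tid, itemset in rows
                if itemset in large
                for item in items_by_tid[tid]
                if item > itemset[-1]]
        k += 1
    return result


if __name__ == "__main__":
    db = {1: ["A", "B", "C"], 2: ["A", "B"], 3: ["A", "C"], 4: ["B", "C"]}
    for size, sets in large_itemsets(db, min_support=2).items():
        print(size, sets)
```

In this toy database every single item and every pair is supported by at least two transactions, while the triple (A, B, C) appears only once and is therefore not large.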
In the SETM-2K algorithm, given a $k$, we efficiently construct $L_{k}$ based on $L_{w}$, where $w=2^{\lceil \log_{2}k \rceil - 1}$, instead of proceeding step by step. In the SETM-MaxK algorithm, we efficiently find the $L_{k}$ based on $L_{w}$, where $L_{k} \neq \emptyset$, $L_{k+1}=\emptyset$ and $w=2^{\lceil \log_{2}k \rceil - 1}$, instead of proceeding step by step. In the SETM-Lmax algorithm, we use a forward approach to find all maximal large itemsets from $L_{k}$, that is, the $k$-itemsets in $L_{k}$ that are not among the $k$-subsets of any $j$-itemset in $L_{j}$, except when $k=MaxK$, where $1 \leq k < j \leq MaxK$, $L_{MaxK} \neq \emptyset$ and $L_{MaxK+1}=\emptyset$. We conduct several experiments using different synthetic relational databases. The simulation results show that the SETM* algorithm outperforms the SETM algorithm in terms of both storage space and execution time for all relational database settings. Moreover, we show that the proposed SETM-2K and SETM-MaxK algorithms require less time to achieve their goals than the SETM or SETM* algorithms. Furthermore, we show that the proposed forward approach (SETM*-Lmax) to finding all maximal large itemsets requires less time than the backward approach proposed by Agrawal.
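Two of the definitions above can be checked with a short Python sketch. The helper names `base_level` and `maximal_large_itemsets` are hypothetical and introduced only for illustration. The first evaluates $w=2^{\lceil \log_{2}k \rceil - 1}$ with integer arithmetic. The second applies the definition of a maximal large itemset directly (no strict superset among the large itemsets of greater size); it is not the thesis's forward procedure, whose details the abstract does not give.

```python
def base_level(k):
    """Return w = 2^(ceil(log2 k) - 1) for an integer k >= 2.

    For a positive integer k, ceil(log2 k) equals (k - 1).bit_length(),
    which avoids floating-point rounding.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    return 2 ** ((k - 1).bit_length() - 1)


def maximal_large_itemsets(large_by_size):
    """Return the large itemsets that have no large strict superset.

    large_by_size: dict mapping a size k to an iterable of large k-itemsets
                   (hypothetical layout chosen for this sketch).
    """
    levels = sorted(large_by_size)
    maximal = []
    for k in levels:
        # All large itemsets of size j > k.
        bigger = [frozenset(s)
                  for j in levels if j > k
                  for s in large_by_size[j]]
        for s in large_by_size[k]:
            fs = frozenset(s)
            if not any(fs < b for b in bigger):
                maximal.append(fs)
    return maximal


if __name__ == "__main__":
    for k in (2, 3, 4, 5, 8, 9):
        print(k, base_level(k))
    large = {1: [("A",), ("B",), ("C",), ("D",)],
             2: [("A", "B"), ("A", "C"), ("B", "C")]}
    print(maximal_large_itemsets(large))
```

Note that $w < k \leq 2w$ for every $k \geq 2$, so the base level $L_{w}$ is always at least half the target size. In the example, the single item D is maximal because no large pair contains it, and every itemset at the highest level is maximal by definition.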

目次 Table of Contents
ABSTRACT
1. Introduction
    1.1 Data Mining
    1.2 Mining Association Rules
        1.2.1 Formal Problem Description
        1.2.2 Applications
        1.2.3 Related Work
    1.3 Motivations
    1.4 Organization of the Thesis
2. A Survey of Data Mining Techniques for Association Rules-Related Problems
    2.1 Mining Association Rules
        2.1.1 The Apriori Algorithm
        2.1.2 The AprioriTid Algorithm
        2.1.3 The DHP Algorithm
        2.1.4 The Boolean Algorithm
        2.1.5 The SETM Algorithm
    2.2 Mining Multiple-Level Association Rules
    2.3 Mining Sequential Patterns
    2.4 Mining Path Traversal Patterns
3. The SETM* Algorithm
    3.1 Some Interesting Observations
    3.2 The Algorithm
4. The SETM-2K Algorithm
    4.1 The Algorithm
    4.2 Examples
5. The SETM-MaxK Algorithm
    5.1 The Algorithm
    5.2 An Example
6. The SETM-Lmax Algorithm
    6.1 An Example
    6.2 The Algorithm
7. Performance
    7.1 Generation of Synthetic Data
    7.2 Experiments
        7.2.1 SETM vs. SETM*
        7.2.2 SETM* vs. SETM-2K
        7.2.3 SETM vs. SETM-MaxK
        7.2.4 SETM-Lmax
8. Conclusion
    8.1 Summary
    8.2 Future Research Directions
BIBLIOGRAPHY
A. The Hash Tree Structure in the Apriori Algorithm
B. The Apriori Algorithm
C. The AprioriTID Algorithm
D. The SETM Algorithm
E. The SETM* Algorithm
F. The Flowchart of the Generation of Synthetic Data
G. An Example of the Generation of Synthetic Data

參考文獻 References
[1] C.C. Aggarwal and P.S. Yu, "Mining Large Itemsets for Association Rules," Proc. 14th IEEE Int'l Conf. Data Engineering, pp. 23-31, March 1998.
[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases," Proc. 1993 ACM SIGMOD Int'l Conf. Management of Data, pp. 207-216, May 1993.
[3] R. Agrawal, C. Faloutsos, and A. Swami, "Efficient Similarity Search in Sequence Databases," Proc. Fourth Int'l Conf. Foundations of Data Organization and Algorithms, pp. 69-84, Oct. 1993.
[4] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases, pp. 490-501, Sept. 1994.
[5] R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. 11th IEEE Int'l Conf. Data Engineering, pp. 3-14, March 1995.
[6] R. Agrawal and K. Shim, "Developing Tightly-Coupled Applications on IBM DB2/CS Relational Database System: Methodology and Experience," IBM Research Report, 1995.
[7] R.J. Bayardo Jr., "Efficiently Mining Long Patterns from Databases," Proc. 1998 ACM SIGMOD Int'l Conf. Management of Data, pp. 85-93, June 1998.
[8] M. Bieber and J. Wan, "Backtracking in a Multiple-Window Hypertext Environment," Proc. ACM European Conf. Hypermedia Technology, pp. 158-166, 1994.
[9] S. Brin, R. Motwani, J.D. Ullman, and S. Tsur, "Dynamic Itemset Counting and Implication Rules," Proc. 1997 ACM SIGMOD Int'l Conf. Management of Data, pp. 255-264, 1997.
[10] E. Carmel, S. Crawford, and H. Chen, "Browsing in Hypertext: A Cognitive Study," IEEE Trans. on Systems, Man, and Cybernetics, Vol. 22, No. 5, pp. 865-883, Sept. 1992.
[11] L.D. Catledge and J.E. Pitkow, "Characterizing Browsing Strategies in the World-Wide Web," Computer Networks and ISDN Systems, Vol. 26, No. 6, pp. 1065-1073, Apr. 1995.
[12] M.-S. Chen, J.-S. Park, and P.S. Yu, "Data Mining for Path Traversal Patterns in a Web Environment," Proc. 16th IEEE Int'l Conf. Distributed Computing Systems, pp. 385-392, May 27-30, 1996.
[13] M.-S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. on Knowledge and Data Engineering, Vol. 8, No. 5, pp. 866-882, Dec. 1996.
[14] M.-S. Chen, J.-S. Park, and P.S. Yu, "Efficient Data Mining for Path Traversal Patterns," IEEE Trans. on Knowledge and Data Engineering, Vol. 10, No. 2, pp. 209-221, March/April 1998.
[15] W.H. Chen, Y.H. Wu, and A.L.P. Chen, "Web-flow Mining Techniques, Applications and System Implementations," Proc. of 1999 National Computer Symposium, Vol. 1, pp. 26-32, 1999.
[16] D.W. Cheung, J. Han, V.T. Ng, and C.Y. Wong, "Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique," Proc. 12th IEEE Int'l Conf. Data Engineering, pp. 106-114, Feb. 1996.
[17] David Wai-Lok Cheung, Sau Dan Lee, and Ben Kao, "A General Incremental Technique for Maintaining Discovered Association Rules," Proc. 5th Int'l Conf. on Database Systems for Advanced Applications, DASFAA'97, pp. 185-194, April 1-4, 1997.
[18] J. December and N. Randall, "The World Wide Web Unleashed," SAMS publishing, 1994.
[19] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
[20] Y. Fu, "Data Mining," IEEE Potentials, pp. 18-20, 1997.
[21] Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan, "Mining Very Large Databases," IEEE Computer, Vol. 32, No. 8, pp. 38-45, 1999.
[22] J. Han and Y. Fu, "Discovery of Multiple-level Association Rules from Large Databases," Proc. 21st Int'l Conf. Very Large Data Bases, pp. 420-432, Sep. 1995.
[23] J. Han and Y. Fu, "Mining of Multiple-level Association Rules from Large Databases," IEEE Trans. on Knowledge and Data Engineering, Vol. 11, No. 5, pp. 798-805, September/October 1999.
[24] C. Hidber, "Online Association Rule Mining," Proc. 1999 ACM SIGMOD Int'l Conf. Management of Data, pp. 145-156, 1999.
[25] M. Houtsma and A. Swami, "Set-oriented Mining for Association Rules in Relational Databases," Proc. 11th IEEE Int'l Conf. Data Engineering, pp. 25-33, 1995.
[26] Dao-I Lin and Zvi M. Kedem, "Pincer Search: A New Algorithm for Discovering the Maximum Frequent Set," Proc. of the 6th European Conf. on Extending Database Technology, pp. 105-119, 1998.
[27] I-Yuan Lin, Xin-Mao Huang, and Ming-Syan Chen, "Capturing User Access Patterns in the Web for Data Mining," Proc. 15th IEEE Int'l Conf. Data Engineering, pp. 345-348, 1999.
[28] M.-Y. Lin and S.-Y. Lee, "Incremental Update on Sequential Patterns in large Databases," IEEE Proc. 10th Int'l Conf. Tools with Artificial Intelligence, pp. 24-31, 1998.
[29] W. Lu, J. Han, and B.C. Ooi, "Discovery of General Knowledge in Large Spatial Databases," Proc. Far East Workshop Geographic Information Systems, pp. 275-289, Singapore, June 1993.
[30] H. Mannila, H. Toivonen, and A. Inkeri Verkamo, "Efficient Algorithms for Discovering Association Rules," Proc. AAAI Workshop Knowledge Discovery in Databases, pp. 181-192, July 1994.
[31] Heikki Mannila, "Data mining: machine learning, statistics, and databases," IEEE, July 1996.
[32] Andreas Mueller, "Fast Sequential and Parallel Algorithms for Association Rule Mining: A Comparison," Technical Report CS-TR-3515, Aug. 1995.
[33] J.-S. Park, M.-S. Chen, and P.S. Yu, "An Effective Hash Based Algorithm for Mining Association Rules," Proc. 1995 ACM SIGMOD Int'l Conf. Management of Data, pp. 175-186, May 1995.
[34] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu, "Mining Association Rules with Adjustable Accuracy," IBM Research Report, 1995.
[35] G. Piatetsky-Shapiro, "Discovery, Analysis, and Presentation of Strong Rules," G. Piatetsky-Shapiro and W.J. Frawley, eds., Knowledge Discovery in Databases, AAAI/MIT Press, pp. 229-238, 1991.
[36] N.L. Sarda and N.V. Srinivas, "An Adaptive Algorithm for Incremental Mining of Association Rules," Proc. 14th IEEE Int'l Conf. Data Engineering, pp. 240-245, 1998.
[37] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. 21st Int'l Conf. Very Large Data Bases, pp. 432-444, Sept. 1995.
[38] Sunita Sarawagi, Shiby Thomas, and Rakesh Agrawal, "Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications," Proc. 1998 ACM SIGMOD Int'l Conf. Management of Data, pp. 343-354, 1998.
[39] T. Shintani and M. Kitsuregawa, "Parallel Mining Algorithms for Generalized Association Rules with Classification Hierarchy," Proc. 1998 ACM SIGMOD Int'l Conf. Management of Data, pp. 25-36, 1998.
[40] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proc. 21st Int'l Conf. Very Large Data Bases, pp. 407-419, Sept. 1995.
[41] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," Advances in Database Technology-5th Int'l Conf. KDD'95, pp. 269-274, 1995.
[42] A. Silberschatz, M. Stonebraker, and J.D. Ullman, "Database Research: Achievements and Opportunities into the 21st Century," Report NSF Workshop Future of Database Systems Research, May 1995.
[43] H. Toivonen, "Sampling Large Databases for Association Rules," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 255-264, 1997.
[44] Dick Tsur, Jeffrey D. Ullman, Serge Abiteboul, Chris Clifton, Rajeev Motwani, Svetlozar Nestorov, and Arnon Rosenthal, "Query Flocks: A Generalization of Association-Rule Mining," Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 1-12, June 2-4, 1998.
[45] S.-Y. Wur and Y. Leu, "An Effective Boolean Algorithm for Mining Association Rules in Large Databases," Proc. 6th Int'l Conf. Database Systems for Advanced Applications, pp. 179-186, April 1999.
[46] S.-J. Yen and A. Chen, "An Efficient Approach to Discovering Knowledge from Large Databases," Proc. 4th Int'l Conf. Parallel and Distributed Information Systems, pp. 8-18, 1996.
[47] S.-J. Yen and A. Chen, "An Efficient Data Mining Technique for Discovering Interesting Association Rules," Proc. 8th Int'l Conf. Workshop Database and Expert System Applications, pp. 664-669, 1997.

電子全文 Fulltext
The electronic full text is licensed to users solely for personal, non-profit retrieval, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission	校內校外均不公開 not available
開放時間 Available	校內 Campus：永不公開 not available；校外 Off-campus：永不公開 not available
紙本論文 Printed copies
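Availability information for printed theses is relatively complete from Academic Year 102 onward. For printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services.
開放時間 Available	已公開 available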
