Thesis/dissertation etd-0728100-115617: detailed information
Title page for etd-0728100-115617
Title
以集合為基礎之有效率的資料關聯式法則挖掘方法
AN EFFICIENT SET-BASED APPROACH TO MINING ASSOCIATION RULES
Department
Year, semester
Language
Degree
Number of pages
129
Author
Advisor
Convenor

Advisory Committee
Date of Exam
2000-06-23
Date of Submission
2000-07-28
Keywords
association rule, data mining
Statistics
This thesis/dissertation has been browsed 5673 times and downloaded 0 times.
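Chinese Abstract (translated)
In the field of data mining, the discovery of association rules is an important problem. Given a database of sales transactions, it is useful to discover the associations among products, where an association means that the presence of certain products in a transaction implies the presence of certain other products in the same transaction. Because mining association rules requires repeatedly scanning a large transaction database to find different association patterns, the amount of processing is large, and improving execution efficiency is an important issue. Within this problem, the main task is to compute the large itemsets (denoted $L_{k}$) efficiently, where a large itemset is an itemset contained in a sufficient number of transactions. In this thesis, we propose using a high-level, set-based structured query language to find the large itemsets efficiently. A set-based approach states clearly what is wanted, rather than specifying explicitly how to obtain it as a low-level approach does, where a low-level approach retrieves one record at a time from the database. The advantage of a set-based approach, as with the SETM algorithm, is that it is simple and stable. However, the SETM algorithm proposed by Houtsma and Swami generates many unnecessary candidate itemsets. We therefore propose a new set-based algorithm called SETM*, which retains the advantages of SETM while avoiding its drawback of generating too many candidate itemsets. In the SETM* algorithm, we reduce the size of the candidate database by modifying the way it is built, where a candidate database is a transaction database made up of the transactions that contain candidate itemsets. Furthermore, based on the SETM* algorithm, we propose three algorithms: SETM*-2K, SETM*-MaxK and SETM*-Lmax. In the SETM*-2K algorithm, the user specifies a value $k$, and we efficiently obtain $L_{k}$ from $L_{w}$, where $w=2^{\lceil \log_{2}k \rceil - 1}$, rather than computing $L_{k}$ step by step. In the SETM*-MaxK algorithm, we efficiently obtain $L_{k}$ from $L_{w}$, where $L_{k} \not= \emptyset$, $L_{k+1}=\emptyset$ and $w=2^{\lceil \log_{2}k \rceil - 1}$, rather than computing $L_{k}$ step by step. In the SETM*-Lmax algorithm, we use a forward approach to find all maximal large itemsets from $L_{k}$, where a $k$-itemset is not contained in the $k$-subsets of any $j$-itemset, except when $k = MaxK$ (the largest value of $k$), with $1 \leq k < j \leq MaxK$, $L_{MaxK} \not= \emptyset$ and $L_{MaxK+1}=\emptyset$. In our simulations, we build several synthetic relational databases to model customer transactions. The simulation results show that the proposed SETM* algorithm outperforms the SETM algorithm on the different databases, both in the disk space required during execution and in total execution time. The results also show that the proposed SETM*-2K and SETM*-MaxK algorithms need less execution time to reach their goals than the SETM or SETM* algorithms. Finally, we show that the proposed forward approach (SETM*-Lmax) requires less time than the backward approach proposed by Agrawal.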
Abstract
Discovery of association rules is an important problem in the area of data
mining. Given a database of sales transactions, it is desirable to discover
the important associations among items such that the presence of some items
in a transaction implies the presence of other items in the same
transaction.
Since mining association rules may require repeatedly scanning a large
transaction database to find different association patterns, the amount of
processing can be huge, and performance improvement is an essential
concern.
Within this problem, efficiently counting large itemsets is the major task,
where a large itemset is a set of items
appearing in a sufficient number of transactions.
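To make the definition of a large itemset concrete, the following minimal Python sketch counts $k$-itemsets by brute force and keeps those that reach a support threshold. It only illustrates the definition and is not one of the thesis's algorithms; the function name `large_itemsets`, the `min_support` parameter, and the toy transactions are assumptions introduced here.

```python
from collections import Counter
from itertools import combinations


def large_itemsets(transactions, k, min_support):
    """Return the k-itemsets contained in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        # Every k-subset of a transaction's distinct items is counted once.
        for itemset in combinations(sorted(set(items)), k):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical toy data: five sales transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
]
L2 = large_itemsets(transactions, k=2, min_support=3)
print(L2)
```

With these assumed transactions, only the pair (bread, milk) appears in three transactions, so it is the only large 2-itemset at a threshold of 3.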
In this thesis, we propose efficient algorithms for mining association
rules based on a high-level set-based approach.
A set-based approach allows a clear expression of what needs to be done,
as opposed to a low-level approach, which specifies exactly how the operations are carried out
and retrieves one tuple from the database at a time.
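As an illustration of the set-based style, the sketch below states in a single SQL query which item pairs are wanted and leaves the evaluation strategy to the database engine. It is a hedged example written for SQLite through Python's `sqlite3` module, not the SETM or SETM* query from the thesis; the `sales` table, its columns, the sample rows, and the threshold are assumptions, and each (transaction, item) pair is assumed to appear at most once.

```python
import sqlite3

# Hypothetical sales table: one row per (transaction, item).
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "butter"), (2, "milk"),
    (3, "butter"), (3, "milk"),
    (4, "bread"), (4, "butter"),
    (5, "bread"), (5, "milk"),
]
min_support = 3  # assumed threshold, in number of transactions

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (trans_id INTEGER, item TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# The self-join pairs up items of the same transaction; a.item < b.item
# keeps each unordered pair once. GROUP BY / HAVING keeps frequent pairs.
query = """
SELECT a.item, b.item, COUNT(*) AS support
FROM sales AS a
JOIN sales AS b
  ON a.trans_id = b.trans_id AND a.item < b.item
GROUP BY a.item, b.item
HAVING COUNT(*) >= ?
ORDER BY a.item, b.item
"""
large_pairs = conn.execute(query, (min_support,)).fetchall()
conn.close()
print(large_pairs)
```

The query never says how to loop over tuples; a tuple-at-a-time program would instead fetch each row and maintain the counts itself.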
The advantage of a set-based approach, such as the SETM algorithm,
is that it is simple and stable over the range of parameter values.
However, the SETM algorithm proposed by Houtsma and Swami may generate too many invalid candidate itemsets.
Therefore, in this thesis, we propose a set-based algorithm called SETM*,
which provides the same advantages as the SETM algorithm
while avoiding its disadvantages.
In the SETM* algorithm, we reduce the size of the candidate database by
modifying the way it is constructed,
where a candidate database is a transaction database formed with candidate
$k$-itemsets.
Then, based on the new way of constructing the candidate database in the SETM*
algorithm, we propose the SETM*-2K, SETM*-MaxK and SETM*-Lmax algorithms.
In the SETM*-2K algorithm, given a $k$, we efficiently construct $L_{k}$
based on $L_{w}$, where $w=2^{\lceil \log_{2}k \rceil - 1}$,
instead of step by step.
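The value of $w$ is easiest to see with a few numbers. The short Python sketch below evaluates $w=2^{\lceil \log_{2}k \rceil - 1}$ with exact integer arithmetic; the function name `starting_level` and the restriction to $k \geq 2$ (so that $w$ is a positive integer) are assumptions made here, not details taken from the thesis.

```python
def starting_level(k):
    """Return w = 2 ** (ceil(log2(k)) - 1) for an integer k >= 2."""
    if k < 2:
        raise ValueError("k must be at least 2")
    # For k >= 1, (k - 1).bit_length() equals ceil(log2(k)) exactly,
    # which avoids floating-point rounding in math.log2.
    return 1 << ((k - 1).bit_length() - 1)


for k in range(2, 10):
    print(k, starting_level(k))
```

For example, $k=5$ through $k=8$ all give $w=4$, and $k=9$ gives $w=8$. In general $w$ is the largest power of two strictly below $k$, so $w < k \leq 2w$.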
In the SETM*-MaxK algorithm, we efficiently find $L_{k}$ based on $L_{w}$,
where $L_{k} \not= \emptyset$, $L_{k+1}=\emptyset$ and $w=2^{\lceil \log_{2}k \rceil - 1}$,
instead of step by step.
In the SETM*-Lmax algorithm, we use a forward approach to find all maximal large itemsets from $L_{k}$,
where a $k$-itemset is not included in the $k$-subsets of any $j$-itemset,
except when $k=MaxK$, with $1 \leq k < j \leq MaxK$,
$L_{MaxK} \not= \emptyset$ and $L_{MaxK+1}=\emptyset$.
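The maximality condition can be sketched in a few lines of Python. The code below keeps every itemset in $L_{MaxK}$ and, for smaller $k$, keeps a $k$-itemset only if it is not a subset of any large $j$-itemset with $j > k$. It illustrates the stated condition only and does not reproduce the thesis's forward procedure; the function name `maximal_large_itemsets`, the `levels` dictionary, and the sample itemsets are assumptions.

```python
def maximal_large_itemsets(levels):
    """levels maps k to the set of large k-itemsets, each a frozenset."""
    # MaxK is the largest k whose level is non-empty.
    max_k = max(k for k, itemsets in levels.items() if itemsets)
    maximal = set(levels[max_k])  # every itemset in L_MaxK is maximal
    for k in range(1, max_k):
        for itemset in levels.get(k, set()):
            # "<" on frozensets tests for a proper subset.
            contained = any(
                itemset < bigger
                for j in range(k + 1, max_k + 1)
                for bigger in levels.get(j, set())
            )
            if not contained:
                maximal.add(itemset)
    return maximal


# Hypothetical large itemsets over items a, b, c, d.
levels = {
    1: {frozenset("a"), frozenset("b"), frozenset("c"), frozenset("d")},
    2: {frozenset("ab"), frozenset("ac"), frozenset("bc")},
    3: {frozenset("abc")},
}
result = maximal_large_itemsets(levels)
print(result)
```

In this assumed example the maximal large itemsets are {a, b, c} and {d}, since every other large itemset is contained in {a, b, c}.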
We conduct several experiments using different synthetic relational databases.
The simulation results show that the
SETM* algorithm outperforms the SETM algorithm in terms of both storage space and
execution time for all relational database settings.
Moreover, we show that the proposed SETM*-2K
and SETM*-MaxK algorithms require less time to achieve their goals than the SETM or SETM* algorithms.
Furthermore, we show that the proposed forward approach (SETM*-Lmax)
to finding all maximal large itemsets requires less time than the backward approach proposed by Agrawal.

Table of Contents
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Mining Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Formal Problem Description . . . . . . . . . . . . . . . . . . . 5
1.2.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 14
2. A Survey of Data Mining Techniques for Association Rules-Related
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Mining Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 The Apriori Algorithm . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 The AprioriTid Algorithm . . . . . . . . . . . . . . . . . . . . 20
2.1.3 The DHP Algorithm . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.4 The Boolean Algorithm . . . . . . . . . . . . . . . . . . . . . 23
2.1.5 The SETM Algorithm . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Mining Multiple-Level Association Rules . . . . . . . . . . . . . . . . 33
2.3 Mining Sequential Patterns . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4 Mining Path Traversal Patterns . . . . . . . . . . . . . . . . . . . . . 40
3. The SETM* Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.1 Some Interesting Observations . . . . . . . . . . . . . . . . . . . . . . 44
3.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4. The SETM*-2K Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5. The SETM*-MaxK Algorithm . . . . . . . . . . . . . . . . . . . . . . 67
5.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.2 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6. The SETM*-Lmax Algorithm . . . . . . . . . . . . . . . . . . . . . . 75
6.1 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7. Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.1 Generation of Synthetic Data . . . . . . . . . . . . . . . . . . . . . . 85
7.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.2.1 SETM vs. SETM* . . . . . . . . . . . . . . . . . . . . . . . . 88
7.2.2 SETM* vs. SETM*-2K . . . . . . . . . . . . . . . . . . . . . . 95
7.2.3 SETM* vs. SETM*-MaxK . . . . . . . . . . . . . . . . . . . . 99
7.2.4 SETM*-Lmax . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . 106
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
A. The Hash Tree Structure in the Apriori Algorithm . . . . . . . . . 113
B. The Apriori Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 114
C. The AprioriTID Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 116
D. The SETM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
E. The SETM* Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 122
F. The Flowchart of the Generation of Synthetic Data . . . . . . . . 125
G. An Example of the Generation of Synthetic Data . . . . . . . . . . 128
References
[1] C.C. Aggarwal and P.S. Yu, "Mining Large Itemsets for Association Rules," Proc. 14th IEEE Int'l Conf. Data Engineering, pp. 23-31, March 1998.
[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases," Proc. 1993 ACM SIGMOD Int'l Conf. Management of Data, pp. 207-216, May 1993.
[3] R. Agrawal, C. Faloutsos, and A. Swami, "Efficient Similarity Search in Sequence Databases," Proc. Fourth Int'l Conf. Foundations of Data Organization and Algorithms, pp. 69-84, Oct. 1993.
[4] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases, pp. 490-501, Sept. 1994.
[5] R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. 11th IEEE Int'l Conf. Data Engineering, pp. 3-14, March 1995.
[6] R. Agrawal and K. Shim, "Developing Tightly-Coupled Applications on IBM DB2/CS Relational Database System: Methodology and Experience," IBM Research Report, 1995.
[7] R. J. Bayardo Jr., "Efficiently Mining Long Patterns from Databases," Proc. 1998 ACM SIGMOD Int'l Conf. Management of Data, pp. 85-93, June 1998.
[8] M. Bieber and J. Wan, "Backtracking in a Multiple-Window Hypertext Environment," Proc. ACM European Conf. Hypermedia Technology, pp. 158-166, 1994.
[9] S. Brin, R. Motwani, J.D. Ullman, and S. Tsur, "Dynamic Itemset Counting and Implication Rules," Proc. 1997 ACM SIGMOD Int'l Conf. Management of Data, pp. 255-264, 1997.
[10] E. Carmel, S. Crawford, and H. Chen, "Browsing in Hypertext: A Cognitive Study," IEEE Trans. on Systems, Man, and Cybernetics, Vol. 22, No. 5, pp. 865-883, Sept. 1992.
[11] L.D. Catledge and J.E. Pitkow, "Characterizing Browsing Strategies in the World-Wide Web," Computer Networks and ISDN Systems, Vol. 26, No. 6, pp. 1065-1073, Apr. 1995.
[12] M.-S. Chen, J.-S. Park, and P.S. Yu, "Data Mining for Path Traversal Patterns in a Web Environment," Proc. 16th IEEE Int'l Conf. Distributed Computing Systems, pp. 385-392, May 27-30, 1996.
[13] M.-S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. on Knowledge and Data Engineering, Vol. 8, No. 5, pp. 866-882, Dec. 1996.
[14] M.-S. Chen, J.-S. Park, and P.S. Yu, "Efficient Data Mining for Path Traversal Patterns," IEEE Trans. on Knowledge and Data Engineering, Vol. 10, No. 2, pp. 209-221, March/April 1998.
[15] W.H. Chen, Y.H. Wu and A.L.P. Chen, "Web-flow Mining Techniques, Applications and System Implementations," Proc. of 1999 National Computer Symposium, Vol. 1, pp. 26-32, 1999.
[16] D.W. Cheung, J. Han, V.T. Ng, and C.Y. Wong, "Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique," Proc. 12th IEEE Int'l Conf. Data Engineering, pp. 106-114, Feb. 1996.
[17] David Wai-Lok Cheung, Sau Dan Lee, and Ben Kao, "A General Incremental Technique for Maintaining Discovered Association Rules," Proc. 5th Int'l Conf. on Database Systems for Advanced Applications, DASFAA'97, pp. 185-194, April 1-4, 1997.
[18] J. December and N. Randall, "The World Wide Web Unleashed," SAMS publishing, 1994.
[19] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
[20] Y. Fu, "Data Mining," IEEE Potentials, pp. 18-20, 1997.
[21] Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan, "Mining Very Large Databases," IEEE Computer, Vol. 32, No. 8, pp. 38-45, 1999.
[22] J. Han and Y. Fu, "Discovery of Multiple-level Association Rules from Large Databases," Proc. 21st Int'l Conf. Very Large Data Bases, pp. 420-432, Sep. 1995.
[23] J. Han and Y. Fu, "Mining of Multiple-level Association Rules from Large Databases," IEEE Trans. on Knowledge and Data Engineering, Vol. 11, No. 5, pp. 798-805, September/October 1999.
[24] C. Hidber, "Online Association Rule Mining," Proc. 1999 ACM SIGMOD Int'l Conf. Management of Data, pp. 145-156, 1999.
[25] M. Houtsma and A. Swami, "Set-oriented Mining for Association Rules in Relational Databases," Proc. 11th IEEE Int'l Conf. Data Engineering, pp. 25-33, 1995.
[26] Dao-I Lin and Zvi M. Kedem, "Pincer Search: A New Algorithm for Discovering the Maximum Frequent Set," Proc. of the 6th European Conf. on Extending Database Technology, pp. 105-119, 1998.
[27] I-Yuan Lin, Xin-Mao Huang, and Ming-Syan Chen, "Capturing User Access Patterns in the Web for Data Mining," Proc. 15th IEEE Int'l Conf. Data Engineering, pp. 345-348, 1999.
[28] M.-Y. Lin and S.-Y. Lee, "Incremental Update on Sequential Patterns in large Databases," IEEE Proc. 10th Int'l Conf. Tools with Artificial Intelligence, pp. 24-31, 1998.
[29] W. Lu, J. Han, and B.C. Ooi, "Discovery of General Knowledge in Large Spatial Databases," Proc. Far East Workshop Geographic Information Systems, pp. 275-289, Singapore, June 1993.
[30] H. Mannila, H. Toivonen, and A. Inkeri Verkamo, "Efficient Algorithms for Discovering Association Rules," Proc. AAAI Workshop Knowledge Discovering in Databases, pp. 181-192, July 1994.
[31] Heikki Mannila, "Data mining: machine learning, statistics, and databases," IEEE, July 1996.
[32] Andreas Mueller, "Fast Sequential and Parallel Algorithms for Association Rule Mining: A Comparison," Technical Report CS-TR-3515, Aug. 1995.
[33] J.-S. Park, M.-S. Chen, and P.S. Yu, "An Effective Hash Based Algorithm for Mining Association Rules," Proc. 1995 ACM SIGMOD Int'l Conf. Management of Data, pp. 175-186, May 1995.
[34] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu, "Mining Association Rules with Adjustable Accuracy," IBM Research Report, 1995.
[35] G. Piatetsky-Shapiro, "Discovery, Analysis, and Presentation of Strong Rules," G. Piatetsky-Shapiro and W.J. Frawley, eds., Knowledge Discovery in Databases, AAAI/MIT Press, pp. 229-238, 1991.
[36] N.L. Sarda and N.V. Srinivas, "An Adaptive Algorithm for Incremental Mining of Association Rules," Proc. 14th IEEE Int'l Conf. Data Engineering, pp. 240-245, 1998.
[37] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. 21st Int'l Conf. Very Large Data Bases, pp. 432-444, Sept. 1995.
[38] Sunita Sarawagi, Shiby Thomas, and Rakesh Agrawal, "Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications," Proc. 1998 ACM SIGMOD Int'l Conf. Management of Data, pp. 343-354, 1998.
[39] T. Shintani and M. Kitsuregawa, "Parallel Mining Algorithms for Generalized Association Rules with Classification Hierarchy," Proc. 1998 ACM SIGMOD Int'l Conf. Management of Data, pp. 25-36, 1998.
[40] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proc. 21st Int'l Conf. Very Large Data Bases, pp. 407-419, Sept. 1995.
[41] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," Advances in Database Technology-5th Int'l Conf. KDD'95, pp. 269-274, 1995.
[42] A. Silberschatz, M. Stonebraker, and J.D. Ullman, "Database Research: Achievements and Opportunities into the 21st Century," Report NSF Workshop Future of Database Systems Research, May 1995.
[43] H. Toivonen, "Sampling Large Databases for Association Rules," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 255-264, 1997.
[44] Dick Tsur, Jeffrey D. Ullman, Serge Abiteboul, Chris Clifton, Rajeev Motwani, Svetlozar Nestorov, and Arnon Rosenthal, "Query Flocks: A Generalization of Association-Rule Mining," Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 1-12, June 2-4, 1998.
[45] S.-Y. Wur and Y. Leu, "An Effective Boolean Algorithm for Mining Association Rules in Large Databases," Proc. 6th Int'l Conf. Database Systems for Advanced Applications, pp. 179-186, April 1999.
[46] S.-J. Yen and A. Chen, "An Efficient Approach to Discovering Knowledge from Large Databases," Proc. 4th Int'l Conf. Parallel and Distributed Information Systems, pp. 8-18, 1996.
[47] S.-J. Yen and A. Chen, "An Efficient Data Mining Technique for Discovering Interesting Association Rules," Proc. 8th Int'l Conf. Workshop Database and Expert System Applications, pp. 664-669, 1997.
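Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.
Thesis access permission: not available on campus or off campus
Available:
Campus: not available
Off-campus: not available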


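Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 onward. To look up public-availability information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the library and information services office (圖資處). We apologize for any inconvenience.
Availability: available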
