博碩士論文 etd-0729104-165736 詳細資訊
Title page for etd-0729104-165736
論文名稱
Title
一個於大型資料庫中以位元為基礎來探勘順序項目的有效方法
An Efficient Bitmap-Based Approach to Mining Sequential Patterns for Large Databases
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
92
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2004-06-04
繳交日期
Date of Submission
2004-07-29
關鍵字
Keywords
序列樣式、位元基礎探勘、資料探勘、知識發現、樣式分析
Bitmap Based Mining, Knowledge Discovery, Sequential Patterns, Pattern Analysis, Data Mining
統計
Statistics
本論文已被瀏覽 5735 次,被下載 0
The thesis/dissertation has been browsed 5735 times and downloaded 0 times.
中文摘要
資料探勘(Data Mining)的工作是在驚人的資料量中找出有用的資訊。在資料探勘的領域中,序列樣式探勘(Mining Sequential Patterns)是研究的一項重點。對於一個交易資料庫而言,所謂序列樣式是指在一段時間內,顧客購買的各項商品間具有某種相關性。若能利用序列樣式探勘找出這些關聯,就能夠提出更好的銷售策略以吸引更多顧客。然而,由於交易資料庫中包含龐大的資料量,且在探勘的過程中會不停地掃描資料庫,因此如何改善執行效率是很重要的課題。在Srikant和Agrawal提出的GSP演算法中,他們使用了複雜的資料結構來儲存和產生候選項目,所產生的候選項目滿足〝頻繁項目集的子集也是頻繁項目〞的特性。雖然此特性減少了候選項目的數量,然而進行候選項目計算的過程仍然花了過多的時間。在Ayres等人提出的SPAM演算法中,他們利用了位元運算的特性,使計算候選項目所花的時間大幅降低。然而它卻產生了過多的候選項目,太多不可能成為頻繁項目集的候選項目依然被產生出來,使它的效率降低。在本篇論文中,我們提出一個新的以位元為基礎的演算法,藉由修改GSP演算法中產生候選項目的方法及採用SPAM演算法中位元運算的方式,我們的演算法可以更有效率地探勘序列樣式。也就是,我們的方法採取類似GSP演算法中產生候選項目的方法以減少候選項目的數量,以及採取類似SPAM演算法中位元運算的方式以減少候選項目計算所花的時間。在我們提出的演算法中,將項目集分成兩種情況,同時發生(標示成AB)和先後發生(標示成A->B)。在同時發生的情形中,根據窮舉法,候選項目的數量為C(n,k)。為避免過多的候選項目產生,我們利用〝頻繁項目集的子集也是頻繁項目〞的特性,將候選項目的個數從C(n,k)降到C(y,k),k <= y < n。在先後發生的情形中,候選項目的產生是利用一個特別的合成運算來達成,例如合成A->B與B->C會得到A->B->C。除此之外,我們也考慮了另外兩種情況:(1)合成A->B與A->C會得到A->BC;(2)合成A->C與B->C會得到AB->C。候選項目計算的方法與SPAM演算法相似(位元運算)。從我們的模擬結果中顯示,基於將交易資料庫用相同的位元表示法情況下,由於產生的候選項目在我們所提出的演算法比SPAM演算法來的少,因此執行時所花的時間可以少於SPAM演算法。
Abstract
The task of data mining is to find useful information within enormous sets of data. One important research area of data mining is mining sequential patterns. For a transaction database, a sequential pattern means that there are some relations among the items bought by customers over a period of time. If we can find these relations by mining sequential patterns, we can devise better selling strategies to attract more customers. However, since the transaction database contains a large amount of data and is scanned again and again during the mining process, improving the running efficiency is an important topic. The GSP algorithm proposed by Srikant and Agrawal uses a complex data structure to store and generate candidates. The generated candidates satisfy the property that "the subsets of a frequent itemset are also frequent". This property leads to fewer candidates; however, GSP still spends too much time counting candidates. The SPAM algorithm proposed by Ayres et al. uses bitwise operations to reduce the time for counting candidates. However, it generates too many candidates that can never become frequent itemsets, which decreases its efficiency. In this thesis, we propose a new bitmap-based algorithm. By modifying the candidate generation method of the GSP algorithm and applying the bitwise operations of the SPAM algorithm, the proposed algorithm can mine sequential patterns efficiently. That is, we use a candidate generation method similar to that of the GSP algorithm to reduce the number of candidates, and a counting method similar to that of the SPAM algorithm to reduce the time for counting candidates. In the proposed algorithm, we classify the itemsets into two cases: simultaneous occurrence (denoted AB) and sequential occurrence (denoted A->B). In the case of simultaneous occurrence, the number of candidates is C(n,k) under exhaustive enumeration. To avoid generating too many candidates, we use the property that "the subsets of a frequent itemset are also frequent" to reduce the number of candidates from C(n,k) to C(y,k), where k <= y < n. In the case of sequential occurrence, the candidates are generated by a special join operation which combines, for example, A->B and B->C into A->B->C. Moreover, we consider two other cases: (1) combining A->B and A->C into A->BC; (2) combining A->C and B->C into AB->C. The method of counting candidates is similar to that of the SPAM algorithm (i.e., bitwise operations). Our simulation results show that, using the same bit representation for the transaction database, the proposed algorithm requires less processing time than the SPAM algorithm, because it generates fewer candidates.
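As a rough illustration of the two ideas summarized in the abstract (join-based candidate generation and bitwise counting), the following Python sketch may help. It is not the thesis's implementation. The function names (`item_bitmaps`, `simultaneous`, `sequential`, `join_pairs`, `support`), the bitmap layout (one Python integer per item per customer, with bit t standing for the t-th transaction), and the toy database are all assumptions made for this example, and the join covers only the three single-item cases named in the abstract.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

Sequence = List[Set[str]]  # one customer's transactions, in time order


def item_bitmaps(seq: Sequence) -> Dict[str, int]:
    """Bit t of bitmaps[item] is set when the item occurs in transaction t."""
    bitmaps: Dict[str, int] = defaultdict(int)  # missing items map to 0
    for t, transaction in enumerate(seq):
        for item in transaction:
            bitmaps[item] |= 1 << t
    return bitmaps


def simultaneous(a: int, b: int) -> int:
    """AB: both operands occur in the same transaction."""
    return a & b


def sequential(a: int, b: int) -> int:
    """A->B: b occurs strictly after the first occurrence of a."""
    if a == 0:
        return 0
    lowest = a & -a               # isolate the lowest set bit of a
    later = ~((lowest << 1) - 1)  # mask of every bit position above it
    return later & b


def join_pairs(p: Tuple[str, str], q: Tuple[str, str]) -> list:
    """Join two 2-sequences x->y (hypothetical helper, single items only)."""
    (a, b), (c, d) = p, q
    out = []
    if b == c:                    # A->B, B->C  =>  A->B->C
        out.append(((a,), (b,), (d,)))
    if a == c and b != d:         # A->B, A->C  =>  A->BC
        out.append(((a,), tuple(sorted((b, d)))))
    if b == d and a != c:         # A->C, B->C  =>  AB->C
        out.append((tuple(sorted((a, c))), (b,)))
    return out


def support(db: List[Sequence], pattern: Callable[[Dict[str, int]], int]) -> int:
    """Number of customers whose bitmap for the pattern is non-zero."""
    return sum(1 for seq in db if pattern(item_bitmaps(seq)) != 0)


# Toy database (an assumption for illustration): four customers.
db = [
    [{"A"}, {"B"}, {"C"}],
    [{"A", "B"}, {"C"}],
    [{"A"}, {"B", "C"}],
    [{"B"}, {"A"}],
]

print(join_pairs(("A", "B"), ("B", "C")))                         # candidate A->B->C
print(support(db, lambda m: sequential(m["A"], m["B"])))          # support of A->B
print(support(db, lambda m: sequential(m["A"], m["B"] & m["C"])))  # support of A->BC
```

In this sketch a candidate is counted with one or two AND operations per customer and no rescan of the transactions, which is the kind of saving the abstract attributes to bitwise counting. The join only proposes candidates, which would still need to be checked against the minimum support.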
目次 Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
1.1 Data Mining
1.2 Mining Sequential Patterns
1.3 Formal Problem Definitions
1.4 Related Work
1.5 Motivations
1.6 Organization of the Thesis
2. A Survey
2.1 Mining Association Rules
2.1.1 The Apriori Algorithm
2.1.2 The Boolean Algorithm
2.2 Mining Sequential Patterns
2.2.1 The AprioriAll Algorithm
2.2.2 The GSP Algorithm
2.2.3 The SPADE Algorithm
2.2.4 The PrefixSpan Algorithm
2.2.5 The DELISP Algorithm
2.2.6 The SPAM Algorithm
3. The Bitmap-Based Approach
3.1 The Bitmap Representation
3.2 Definitions
3.3 The Algorithm
4. Performance
4.1 Generation of Synthetic Data
4.2 Simulation Results
4.2.1 Time Scalability
4.2.2 Sensitivity to Parameters
5. Conclusion
5.1 Summary
5.2 Future Work
BIBLIOGRAPHY
A. The Flowchart of the Generation of Synthetic Data
B. An Example of the Generation of Synthetic Data
參考文獻 References
[1] R. L. Ackoff, "From Data to Wisdom," Journal of Applied Systems Analysis, Vol. 16, pp. 3-9, 1989.
[2] R. Agrawal and J. C. Shafer, "Parallel Mining of Association Rules," IEEE Trans. on Knowledge and Data Eng., Vol. 8, No. 6, pp. 962-969, Dec. 1996.
[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. of the 20th Int. Conf. Very Large Data Bases, pp. 487-499, 1994.
[4] R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. of the 11th Int. Conf. on Data Eng., pp. 3-14, 1995.
[5] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, "Sequential Pattern Mining Using a Bitmap Representation," Proc. of the 8th Int. Conf. on Knowledge Discovery and Data Mining, pp. 429-435, 2002.
[6] G. Bellinger, D. Castro, and A. Mills, "Data, Information, Knowledge, and Wisdom." http://www.systems-thinking.org/dikw/dikw.htm.
[7] Y. L. Chen, S. S. Chen, and P. Y. Hsu, "Mining Hybrid Sequential Patterns and Sequential Rules," Information Systems, Vol. 27, No. 5, pp. 345-362, 2002.
[8] G. Gardarin, P. Pucheral, and F. Wu, "Bitmap Based Algorithms for Mining Association Rules," Proc. of the 14th Bases de Données Avancées, pp. 157-176, 1998.
[9] M. Garofalakis, R. Rastogi, and K. Shim, "Mining Sequential Patterns with Regular Expression Constraints," IEEE Trans. on Knowledge and Data Eng., Vol. 14, No. 3, pp. 530-552, May/June 2002.
[10] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining," Proc. of the 6th Int. Conf. on Knowledge Discovery and Data Mining, pp. 335-359, 2000.
[11] C. Kamath, "The Role of Parallel and Distributed Processing in Data Mining," Tech. Rep. UCRL-JC-142468, Newsletter of the IEEE Technical Committee on Distributed Processing, 2001.
[12] W. P. Lee, "Introduction to Data Mining." http://www.datamining.org.tw/dl/checkload1.asp?data no 68.
[13] M. Y. Lin and S. Y. Lee, "Incremental Update on Sequential Patterns in Large Databases," Proc. of the 10th Int. Conf. on Tools with Artificial Intelligence, pp. 24-31, 1998.
[14] M. Y. Lin, S. Y. Lee, and S. S. Wang, "DELISP: Efficient Discovery of Generalized Sequential Patterns by Delimited Pattern-Growth Technology," Proc. of the 6th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 198-209, 2002.
[15] F. Masseglia, P. Poncelet, and M. Teisseire, "Incremental Mining of Sequential Patterns in Large Databases," Data and Knowledge Eng., Vol. 46, No. 1, pp. 97-121, July 2003.
[16] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," Proc. of the 17th Int. Conf. on Data Eng., pp. 215-226, 2001.
[17] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," Proc. of the 5th Int. Conf. on Extending Database Technology, pp. 3-17, 1996.
[18] C. Y. Wang, T. P. Hong, and S. S. Tseng, "Maintenance of Discovered Sequential Patterns for Record Modification," Proc. of the Int. Computer Symp. Workshop on Artificial Intelligence, pp. 1682-1687, 2002.
[19] W. Wang, J. Yang, and P. S. Yu, "Mining Patterns in Long Sequential Data with Noise," ACM SIGKDD Explorations Newsletter, Vol. 2, No. 2, pp. 28-33, Dec. 2000.
[20] S. Y. Wur and Y. Leu, "An Effective Boolean Algorithm for Mining Association Rules in Large Databases," Proc. of the 6th Int. Conf. on Database Systems for Advanced Applications, pp. 179-186, 1999.
[21] J. Yang, W. Wang, P. S. Yu, and J. Han, "Mining Long Sequential Patterns in a Noisy Environment," Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 406-417, 2002.
[22] M. J. Zaki, "Efficient Enumeration of Frequent Sequences," Proc. of the 7th Int. Conf. on Information and Knowledge Management, pp. 68-75, 1998.
[23] M. J. Zaki, "Parallel Sequence Mining on Shared-Memory Machines," Journal of Parallel and Distributed Computing, Vol. 61, No. 3, pp. 401-426, March 2001.
電子全文 Fulltext
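本電子全文僅授權使用者為學術研究之目的,進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定,切勿任意重製、散佈、改作、轉貼、播送,以免觸法。
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please observe the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid breaking the law.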
論文使用權限 Thesis access permission:校內校外均不公開 not available
開放時間 Available:
校內 Campus:永不公開 not available
校外 Off-campus:永不公開 not available


紙本論文 Printed copies
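紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊,請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。
Public-access information for printed theses is relatively complete from academic year 102 onward. To inquire about printed theses from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.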
開放時間 available 已公開 available
