國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,在資料探勘中頻繁項目集的精簡表示法,A Study on Compact Representation of Frequent Itemsets in Data Mining

論文名稱 Title	在資料探勘中頻繁項目集的精簡表示法 A Study on Compact Representation of Frequent Itemsets in Data Mining
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	105 學年度第 1 學期 The fall semester of Academic Year 105	語文別 Language	英文 English
學位類別 Degree	博士 Ph.D.	頁數 Number of pages	103
研究生 Author	黃正男 Jheng-Nan Huang
指導教授 Advisor	江明朝, 洪宗貝 Ming-chao Chiang; Tzung-Pei Hong
召集委員 Convenor	林文揚 Wen-yang Lin
口試委員 Advisory Committee	林威成, 李健興, 高宏宇, 李宗南 Kawuu W. Lin; Chang-Shing Lee; Hung-Yu Kao; Chungnan Lee
口試日期 Date of Exam	2016-09-07	繳交日期 Date of Submission	2016-11-14
關鍵字 Keywords	封閉項目集、最大項目集、頻繁項目集、資料挖掘、近似支持度 closed itemsets, maximal itemsets, frequent itemset, data mining, approximate support
統計 Statistics	本論文已被瀏覽 5667 次，被下載 91 次 The thesis/dissertation has been browsed 5667 times, has been downloaded 91 times.

中文摘要
在資料探勘中找出頻繁項目集對於產生關聯規則而言，是一個非常關鍵的步驟。若最小支持度被設定在一個非常小的數值，或者被處理的交易資料庫的交易項目非常多的時候，頻繁項目集的數目常可能會相當巨大，因此在過去有一些精簡表示的方法被提出，以有效紀錄頻繁項目集。例如，最大項目集的方法用保存頻繁與非頻繁項目集間的分隔線方式，以減少保存頻繁項目集所需的資訊。此種方法可回復所有頻繁項目集，但卻無法得知這些頻繁項目集的支持度。相較於最大項目集，封閉項目集的做法則可以同時回復頻繁項目集及其支持度，但所需之保存數量較多。因此在本論文中，我們提出一種介於最大項目集和封閉項目集之間的近似表示方式，其能正確回復完整的頻繁項目集，並能計算出這些頻繁項目集的近似支持度。我們也提出並推導了這個近似表示法的一些特性。此外我們利用兩種不同的資料結構來組織頻繁項目集並分別提出了兩個不同的方法來求其近似表示。第一個方法採用前綴樹的結構，而第二個方法則使用了流量網路的方式。最後我們也針對不同的資料集及變換多個參數值來驗證這兩個方法的效能，而實驗結果也顯示我們所提的方法和封閉項目集相比，確實可以達到較佳的壓縮比率。
Abstract
In data mining, finding frequent itemsets is a critical step for deriving association rules. The number of frequent itemsets may be huge if the minimum support is set at a low value or if the number of items in a transaction database is large. In the past, some approaches were proposed to record frequent itemsets with compact representation. For example, the approach of maximal itemsets keeps a borderline composed of the maximal itemsets, which separate frequent itemsets from non-frequent ones. It can recover all the frequent itemsets, but cannot get their real supports back. On the contrary, the approach of closed itemsets can correctly recover each frequent itemset and its support. In the thesis, we propose approximate representation of derived frequent itemsets, which lies between maximal itemsets and closed itemsets. It can correctly get back all the derived itemsets and can calculate their approximate supports. Some properties with the representation are also derived. Two methods are then proposed to get the approximate representation by organizing frequent itemsets with two special data structures. The first one adopts a prefix tree, and the other utilizes a flow network. Finally, some experiments on different datasets and with a variety of parameter values are conducted to show the performance of the two methods. The results reflect the proposed methods have better compact rates than the closed itemsets.

目次 Table of Contents
誌謝 + i 中文摘要 + iii ABSTRACT + iv Chapter 1 INTRODUCTION + 1 1.1. Background and Motivation + 1 1.2. Thesis Organization + 4 Chapter 2 REVIEW OF RELATED WORKS + 5 2.1. Maximal Itemsets + 5 2.2. Closed Itemsets + 6 2.3. Other Concepts + 8 Chapter 3 THE METHOD BASED ON PREFIX TREE + 11 3.1. Term Definitions + 11 3.2. Problem Statement + 18 3.3. Proposed Concepts and Methods + 22 3.4. Experiments + 38 3.4.1. Experimental Datasets and Criteria + 38 3.4.2. Experimental Results + 39 Chapter 4 THE METHOD BASED ON FLOW NETWORK + 50 4.1. Term Definitions + 50 4.1.1. Flow Network + 51 4.1.2. Lattice of Frequent Itemsets + 52 4.1.3. Simple Flow Network of Frequent Itemsets + 54 4.1.4. Complex Flow Network of Frequent Itemsets + 57 4.2. Preliminary Observations and Problem Statement + 59 4.3. Proposed Concepts and Methods + 65 4.3.1. Setting Capacities of Correcting Edge Pairs + 65 4.3.2. Dividing the itemsets at a Level as Intervals + 69 4.4. Experiments + 77 4.4.1. Experimental Datasets and Criteria + 77 4.4.2. Experimental Results + 79 Chapter 5 DISCUSSION AND CONCLUSION + 85 5.1. Comparison of the Two Proposed Methods + 85 5.2. Comparison of the Two Methods with Closed Sets and Free Sets + 86 5.3. CONCLUSION AND FUTURE WORK + 87 REFERENCES + 88

參考文獻 References
[1]. Rakesh Agrawal, Tomasz Imieliński and Arun Swami. Mining Association Rules between Sets of Items in Large Databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207-216. 1993. [2]. Roberto J. Bayardo Jr. Efficiently Mining Long Patterns from Databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 85-93. 1998. [3]. Richard Bellman. On a Routing Problem. Quarterly of Applied Mathematics, Vol. 16, No. 1, pp. 87-90. 1958. [4]. Jean-François Boulicaut, Artur Bykowski and Christophe Rigotti. Free-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency Queries. International Journal of Data Mining and Knowledge Discovery, Vol. 7, Issue 1, pp. 5-22. 2003. [5]. Toon Calders and Bart Goethals. Non-Derivable Itemset Mining. International Journal of Data Mining and Knowledge Discovery, Vol. 14, Issue 1, pp. 171-206. 2007. [6]. Varun Chandola and Vipin Kumar. Summarization – Compressing Data into an Informative Representation. International Journal of Knowledge and Information Systems, Vol. 12, Issue 3, pp. 355-378. 2007. [7]. Manike Chiranjeevi and Om Hari. Modified GUIDE (LM) Algorithm for Mining Maximal High Utility Patterns from Data Streams. International Journal of Computational Intelligence Systems, Vol. 8, Issue 3, pp. 517-529. 2015. [8]. Arianna Gallo, Tijl De Bie and Nello Cristianini. MINI: Mining Informative Non-Redundant Itemsets. Proceedings of the 11th Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 438-435. 2007. [9]. Jiawei Han, Jian Pei and Yiwen Yin. Mining Frequent Patterns without Candidate Generation. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 1-12. 2000. [10]. Jochen Hipp, Ulrich Güntzer and Gholamreza Nakhaeizadeh. Algorithms for Association Rule Mining - A General Survey and Comparison. ACM SIGKDD Explorations Newsletter, Vol. 2, Issue 1, pp. 58-64. 2000. [11]. Matthijs van Leeuwen and Antti Ukkonen. Discovering Skylines of Subgroup Sets. International Journal of Machine Learning and Knowledge Discovery in Databases, Vol. 8190, pp. 272-287. 2013. [12]. Ming-Yen Lin, Tzer-Fu Tu and Sue-Chen Hsueh. High Utility Pattern Mining Using the Maximal Itemset Property and Lexicographic Tree Structures. International Journal of Information Sciences, Vol. 215, pp. 1-14. 2012. [13]. Junqiang Liu, Yunhe Pan, Ke Wang and Jiawei Han. Mining Frequent Item Sets by Opportunistic Projection. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 229-238. 2002. [14]. Michael Mampaey, Nikolaj Tatti and Jilles Vreeken. Tell Me What I Need to Know: Succinctly Summarizing Data with Itemsets. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 573-581. 2011. [15]. Kleanthis-Nikolaos Kontonasios and Tijl DeBie. Formalizing Complex Prior Information to Quantify Subjective Interestingness of Frequent Pattern Sets. Proceedings of the 11th International Conference on Advances in Intelligent Data Analysis, pp. 161-171. 2012. [16]. Fatemeh Nori, Mahmood Deypir and Mohamad Hadi Sadreddini. A Sliding Window Based Algorithm for Frequent Closed Itemset Mining over Data Streams. The Journal of Systems and Software, Vol. 86, Issue 3, pp. 615-623. 2013. [17]. Matthijs van Leeuwen and Arno Knobbe. Diverse Subgroup Set Discovery. International Journal of Data Mining and Knowledge Discovery, Vol. 25, Issue 2, pp. 208-242. 2012. [18]. Jefrey Lijffijt, Panagiotis Papapetrou and Kai Puolamäki. A Statistical Significance Testing Approach to Mining the Most Informative Set of Patterns. International Journal of Data Mining and Knowledge Discovery, Vol. 28, Issue 1, pp. 238-263. 2012. [19]. Y. Okada, T. Tada, K. Fukuta and T. Nagashima. Audio Classification based on a Closed Itemset Mining Algorithm. International Conference on Computer Information Systems and Industrial Management Applications. 2010. [20]. Nicolas Pasquier, Yves Bastide, Rafik Taouil and Lotfi Lakhal. Discovering Frequent Closed Itemsets for Association Rules. Proceedings of the 7th International Conference on Database Theory, pp. 398-416. 1999. [21]. Jian Pei, Guozhu Dong, Wei Zou and Jiawei Han. Mining Condensed Frequent-Pattern Bases. International Journal of Knowledge and Information Systems, Vol. 6, Issue 5, pp. 570-594. 2004. [22]. Matteo Riondato and Eli Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ACM Transactions on Knowledge Discovery from Data, Vol. 8, Issue 4, Article No. 20. 2014. [23]. Prabha S, Shanmugapriya S and Duraiswamy K. A Survey on Closed Frequent Pattern Mining. International Journal of Computer Applications, Vol. 63, Number 14, pp. 47-52. 2013. [24]. Nikolaj Tatti and Michael Mampaey. Using Background Knowledge to Rank Itemsets. International Journal of Data Mining and Knowledge Discovery, Vol. 21, Issue 2, pp. 293-309. 2010. [25]. Yun Unil, Lee Gangin and Ryu Keun Ho. Mining Maximal Frequent Patterns by Considering Weight Conditions over Data Streams. International Journal of Knowledge-Based Systems, Vol. 55, pp. 49-65. 2014. [26]. Jianyong Wang and George Karypis. On Efficiently Summarizing Categorical Databases. International Journal of Knowledge and Information Systems, Vol. 9, Issue 1, pp. 19-37. 2006. [27]. Geoffrey I. Webb. Self-Sufficient Itemsets: An Approach to Screening Potentially Interesting Associations between Items. ACM Transactions on Knowledge Discovery from Data, Vol. 4, Issue 1, Article No. 3. 2010. [28]. Geoffrey I. Webb and Jilles Vreeken. Efficient Discovery of the Most Interesting Associations. ACM Transactions on Knowledge Discovery from Data, Vol. 8, Issue 3, Article No. 15. 2014. [29]. Yang Xiang, Ruoming Jin, David Fuhry and Feodor F. Dragan. Summarizing Transactional Databases with Overlapped Hyperrectangles. International Journal of Data Mining and Knowledge Discovery, Vol. 23, Issue 2, pp. 215-251. 2011. [30]. Dong Xin, Jiawei Han, Xifeng Yan and Hong Cheng. Mining Compressed Frequent-Pattern Sets. Proceedings of the 31st International Conference on Very Large Data Bases, pp. 709-720. 2005.

電子全文 Fulltext
本電子全文僅授權使用者為學術研究之目的，進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定，切勿任意重製、散佈、改作、轉貼、播送，以免觸法。論文使用權限 Thesis access permission：自定論文開放時間 user define 開放時間 Available：校內 Campus：已公開 available 校外 Off-campus：已公開 available etd-1014116-114733.pdf
紙本論文 Printed copies
紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊，請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。開放時間 available 已公開 available

QR Code

國立中山大學圖書與資訊處 │ 諮詢服務：2452 論文審查小組 │ 服務信箱 │ 系統開發維運：圖資處知識創新組

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS