博碩士論文 etd-0625109-171938 詳細資訊
Title page for etd-0625109-171938
論文名稱
Title
一個於資料串流中提供高精確度的GDense分群方法
The GDense Algorithm for Clustering Data Streams with High Quality
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
77
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2009-06-12
繳交日期
Date of Submission
2009-06-25
關鍵字
Keywords
密度基礎、分群、資料串流、格子基礎
density-based, grid-based, clustering, data streams
統計
Statistics
本論文已被瀏覽 5685 次,被下載 1295
The thesis/dissertation has been browsed 5685 times, has been downloaded 1295 times.
中文摘要

近幾年來,挖掘資料串流已漸漸廣泛地被研究。資料串流是一群動態、連續、沒有範圍、即時的資料項目。資料串流的產生非常快速,而且挖掘過程中只能夠讀取資料本身一遍。資料分群的問題是給定一群存在d維空間的n個資料,根據分群演算法將資料分成k個群組,而達到讓群組裡的資料相似度達最大,群組間的相似度達到最小。傳統分群技術到了資料串流的環境裡,會遇到儲存空間不足、較低的精確度,及因更新造成效率減低的問題。一般來說,我們可以將資料分群的演算法分成四大類:分割式、階層式、密度基礎和格子基礎的分群方法。以格子基礎的方法之優點為可以處理較大的資料庫。而以密度為基礎的分群方法,分群過程中的新增與刪除資料只會影響到鄰近的資料,而不會影響到全部。結合了格子基礎及密度基礎分群方法的優點,CDS-Tree演算法已被提出。雖然CDS-Tree可以處理大量的資料,但是其分群的精確度會受到格子分割大小及密集格子的門檻值影響。因此,在此篇論文中,我們提出一個高精確度的資料串流分群演算法,GDense。GDense演算法因為有兩種分割的機制:格子和四分格及兩種密度門檻值:δ和(1/4)δ,因此能提高精確度。在我們的演算法中,資料新增的部分,關於格子和四分格有三個因素決定了七個情況,而在資料刪除的部分有五個因素決定了十個情況。在我們的模擬研究中,不管在什麼情況(包含資料點個數、格子數目、滑動視窗大小、密集格子的門檻值)下,我們的分群純度都比CDS-Tree演算法高。接著我們加入雜質來做比較,不管雜質總數的多少,我們GDense演算法的精確度仍然比CDS-Tree演算法高且分群純度的差異可以改善20%。
Abstract
In recent years, mining data streams has been widely studied. A data stream is a
sequence of dynamic, continuous, unbounded, real-time data items that arrives at a
very high rate and can be read only once. In data mining, clustering is a useful
technique for discovering interesting data in the underlying data objects. The
clustering problem can be defined formally as follows: given n data points in a
d-dimensional metric space, partition the data points into k clusters such that the
data points within a cluster are more similar to each other than to data points in
different clusters. In a data stream environment, clustering faces three
difficulties: storage overhead, low clustering quality, and low update efficiency.
Current clustering algorithms can be broadly classified into four categories:
partitioning, hierarchical, density-based, and grid-based approaches. The advantage
of grid-based algorithms is that they can handle large databases. In density-based
approaches, the insertion or deletion of a data point affects the current clustering
only in the neighborhood of that point. The CDS-Tree algorithm was proposed to
combine the advantages of the grid-based and density-based approaches. Although it
can handle large databases, its clustering quality is limited by the grid partition
size and the threshold for a dense cell. Therefore, in this thesis, we present
GDense, a new high-quality clustering algorithm for data streams. GDense achieves
high quality through two kinds of partition, cells and quadcells, and two density
thresholds, δ and (1/4)δ. In the data insertion part of the algorithm, 3 factors
concerning the cell and the quadcell determine 7 cases. In the deletion part, 5
factors concerning the cell determine 10 cases. In our simulation results,
regardless of the condition (the number of data points, the number of cells, the
size of the sliding window, or the threshold of a dense cell), the clustering purity
of GDense is always higher than that of the CDS-Tree algorithm. We also compare the
purity of the two algorithms on data with outliers. Whether the number of outliers
is large or small, the clustering purity of GDense remains higher than that of the
CDS-Tree algorithm, and the improvement in clustering purity is about 20%.
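The two-level partition described in the abstract can be illustrated with a minimal sketch. It is not the GDense algorithm itself: the 7 insertion cases and 10 deletion cases are not spelled out here, so the sketch only counts points per cell and per quadcell and applies the two thresholds. The function name `dense_units`, the unit-square domain, the 2 x 2 split of each cell into quadcells, and the "at least δ" comparison are all assumptions made for illustration.

```python
from collections import Counter


def dense_units(points, cells_per_side, delta):
    """Return (dense_cells, dense_quadcells) for 2-D points in [0, 1)^2.

    Assumptions (not taken from the thesis): each cell is split into
    2 x 2 equal quadcells, a cell is dense when it holds at least
    `delta` points, and a quadcell is dense when it holds at least
    `delta / 4` points.
    """
    cell_counts = Counter()
    quad_counts = Counter()
    quads_per_side = 2 * cells_per_side
    for x, y in points:
        # Index of the quadcell containing the point (clamped to the grid).
        qx = min(int(x * quads_per_side), quads_per_side - 1)
        qy = min(int(y * quads_per_side), quads_per_side - 1)
        quad_counts[(qx, qy)] += 1
        # Every 2 x 2 block of quadcells forms one cell.
        cell_counts[(qx // 2, qy // 2)] += 1
    dense_cells = {c for c, n in cell_counts.items() if n >= delta}
    dense_quads = {q for q, n in quad_counts.items() if n >= delta / 4}
    return dense_cells, dense_quads


# Two nearby points and one distant point on a 2 x 2 grid of cells.
sample = [(0.10, 0.10), (0.12, 0.11), (0.60, 0.60)]
cells, quads = dense_units(sample, cells_per_side=2, delta=8)
print(cells, quads)
```

With δ = 8, no cell reaches the cell threshold, but the quadcell holding the two nearby points reaches δ/4 = 2. This is the kind of local concentration that a single coarse threshold would miss, which is one plausible reading of why the finer partition helps quality.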
目次 Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. A Survey of Algorithms for Data Clustering
1.1 K-Means
1.2 DBSCAN
1.3 CLIQUE
1.4 CURE
1.5 The CDS-Tree
1.5.1 Related Algorithms of the CDS-Tree
1.5.2 Granularity Adjustment
1.6 The DUCstream
2. The GDense Algorithm for Clustering Data Streams with High Quality
2.1 Assumptions
2.2 The Basic Idea and Definitions
2.3 The Proposed Algorithm
2.3.1 Step 1: Data Insertion
2.3.2 Step 2: Data Deletion
2.4 Examples
3. Performance
3.1 The Performance Model
3.2 Simulation Results
3.2.1 Special Data Sets
4. Conclusion
4.1 Summary
4.2 Future Work
BIBLIOGRAPHY
參考文獻 References
[1] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic Subspace
Clustering of High Dimensional Data for Data Mining Applications,” ACM SIGMOD
Conf., pp. 94–105, 1998.
[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and Issues
in Data Stream Systems,” Proc. of the 21st ACM SIGMOD-SIGACT-SIGART
Symposium on Principles of Database Systems, pp. 1–16, 2002.
[3] B. Babcock, M. Datar, R. Motwani, and L. O’Callaghan, “Maintaining Variance
and k-medians over Data Stream Windows,” Proc. of the 22nd ACM SIGMOD-
SIGACT-SIGART Symposium on Principles of Database Systems, pp. 234–243,
2003.
[4] V. Chaoji, M. Hasan, S. Salem, and M. J. Zaki, “SPARCL: Efficient
and Effective Shape-based Clustering,” Proc. of the IEEE Int. Conf. on Data
Mining, 2008.
[5] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise,” Proc. of the 2nd Int.
Conf. on Knowledge Discovery and Data Mining, pp. 226–231, 1996.
[6] M. M. Gaber, S. Krishnaswamy, and A. Zaslavsky, “Cost-Efficient Mining
Techniques for Data Streams,” Proc. of the Australasian Workshop on Data Mining
and Web Intelligence, pp. 109–114, 2004.
[7] J. Gao, J. Li, Z. Zhang, and P.-N. Tan, “An Incremental Data Stream Clustering
Algorithm Based on Dense Units Detection,” Proc. of the 13th Pacific-Asia Conf.
on Knowledge Discovery and Data Mining, pp. 420–425, 2005.
[8] S. Guha, R. Rastogi, and K. Shim, “CURE: An Efficient Clustering Algorithm
for Large Databases,” ACM SIGMOD Record, Vol. 27, No. 2, pp. 73–84, June
1998.
[9] Q. He, K. Chang, E. Lim, and J. Zhang, “Bursty Feature Representation for
Clustering Text Streams,” Proc. of the SIAM Conf. on Data Mining, 2007.
[10] L. Kaufman and P. J. Rousseeuw, “Finding Groups in Data: An Introduction
to Cluster Analysis,” Wiley Series in Probability and Mathematical Statistics:
Applied Probability and Statistics, New York: Wiley, 1990.
[11] H. Sun, G. Yu, Y. Bao, F. Zhao, and D. Wang, “CDS-Tree: An Effective Index
for Clustering Arbitrary Shapes,” Proc. of the 15th Int. Workshop on Research
Issues in Data Engineering: Stream Data Mining and Applications, pp. 81–88,
2005.
電子全文 Fulltext
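This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.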
論文使用權限 Thesis access permission: 校內外都一年後公開 withheld (released on and off campus after one year)
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
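Public-access information for printed theses is relatively complete from academic year 102 onward. To look up public-access information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.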
開放時間 available 已公開 available
