國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,一個以Hilbert curve為基礎的有效支援大型空間資料分群方法 ,An Efficient Hilbert Curve-based Clustering Strategy for Large Spatial Databases

論文名稱 Title	一個以Hilbert curve為基礎的有效支援大型空間資料分群方法 An Efficient Hilbert Curve-based Clustering Strategy for Large Spatial Databases
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	91 學年度第 2 學期 The spring semester of Academic Year 91	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	82
研究生 Author	陸雲泰 Yun-Tai Lu
指導教授 Advisor	張玉盈 Ye-In Chang
召集委員 Convenor	郭大維 Tei-Wei Kuo
口試委員 Advisory Committee	黃三益, 李建億 San-Yi Huang; Chien-I Lee
口試日期 Date of Exam	2003-06-20	繳交日期 Date of Submission	2003-07-25
關鍵字 Keywords	資料分群、空間曲線、階層式分群方法、空間資料探勘、格子基礎分群方法 hierarchical clustering, grid-based clustering, clustering, space-filling curve, spatial data mining
統計 Statistics	本論文已被瀏覽 5660 次，被下載 3755 次 The thesis/dissertation has been browsed 5660 times, has been downloaded 3755 times.

中文摘要
近來，數以百萬計的資料庫正在被使用，因此我們需要一種新的技術，它能自動的將原始資料轉換成為有用的資訊及知識。資料挖掘就是一種可以從資料庫中分析原始資料，將隱含的、先前並不知道的的資訊從資料庫中萃取出來的技術。而空間資料挖掘是資料挖掘中的一個分支，它主要用在處理空間資料。在空間資料挖掘中，資料分群 (data clustering) 乃是從原始資料中發掘出我們有興趣的資料，是一種非常有用的技術。資料分群是將一群存在 d 維空間的 n 個資料，根據分群的原則，將資料分成 k 個群組。這個原則乃是：讓群組裡的資料相似度達最大，群組間的相似度達最小。群組分析已經被應用在非常多的領域，例如醫藥、社會研究、生物資訊、地圖分析和網際地理資訊系統 (GIS)。在近年來，許多的研究人員專注於找到一個有效的方法，可以快速的解決資料分群的問題。一般來說，我們可以將資料分群的演算法分成四大類：分割式、階層式、密度基礎和格子基礎的分群方法。k-means演算法是屬於分割式分群方法，它是一個非常普遍且廣泛被使用的資料分群方法。但是k-means演算法主要的缺點，乃在於它的分群品質受到開始選擇的參數k很大的影響，並且它只適用於發掘圓形的群組。此外，k-means演算法要花很多的時間來做運算，而無法處理大量的資料。因此，在此篇論文研究中，我們針對處理大量的空間資料庫提出一個新的演算法，它結合了階層式分群的方法和格子基礎的分群方法的資料結構。我們使用格子基礎的分群方法，是因為它可以有效地處理大量的資料。此外，我們使用階層式分群方法，來結合這些格子，來幫助我們找出真實的群組出來。基本上，我們利用了the Hilbert curve以提供一種規則順序來儲存我們的資料。The Hilbert curve是space-filling curve的其中一種，space-filling curve是一種連續的路徑。這些路徑經過在空間中的每一筆資料，並且以一對一的對應關係將高維度的資料轉換成一維的號碼順序。而使用space-filling curve的主要目的是保留空間中的距離關係，若資料在二維空間中相近，則儘量將這些資料存到相近的一維空間磁碟中。這種對應關係可以幫助我們減少磁碟的存取次數和加快資料分群的速度。我們所提出的新演算法只需輸入一個參數，並且能提供使用者一個適當的值。在我們的模擬研究中，我們可以發現我們所提出的演算法在處理大量的資料時，執行時間比其它演算法來的少。而當我們所處理的資料量增加時，我們演算法的執行時間增加的非常緩慢。最後，我們的演算法可以處理一些任意形狀的群組，而這些群組是k-means演算法所不能發掘出來的。
Abstract
Recently, millions of databases have been used and we need a new technique that can automatically transform the processed data into useful information and knowledge. Data mining is the technique of analyzing data to discover previously unknown information and spatial data mining is the branch of data mining that deals with spatial data. In spatial data mining, clustering is one of useful techniques for discovering interesting data in the underlying data objects. The problem of clustering is that give n data points in a d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than data points in different clusters. Cluster analysis has been widely applied to many areas such as medicine, social studies, bioinformatics, map regions and GIS, etc. In recent years, many researchers have focused on finding efficient methods to the clustering problem. In general, we can classify these clustering algorithms into four approaches: partition, hierarchical, density-based, and grid-based approaches. The k-means algorithm which is based on the partitioning approach is probably the most widely applied clustering method. But a major drawback of k-means algorithm is that it is difficult to determine the parameter k to represent ``natural' cluster, and it is only suitable for concave spherical clusters. The k-means algorithm has high computational complexity and is unable to handle large databases. Therefore, in this thesis, we present an efficient clustering algorithm for large spatial databases. It combines the hierarchical approach with the grid-based approach structure. We apply the grid-based approach, because it is efficient for large spatial databases. Moreover, we apply the hierarchical approach to find the genuine clusters by repeatedly combining together these blocks. Basically, we make use of the Hilbert curve to provide a way to linearly order the points of a grid. Note that the Hilbert curve is a kind of space-filling curves, where a space-filling curve is a continuous path which passes through every point in a space once to form a one-one correspondence between the coordinates of the points and the one-dimensional sequence numbers of the points on the curve. The goal of using space-filling curve is to preserve the distance that points which are close in 2-D space and represent similar data should be stored close together in the linear order. This kind of mapping also can minimize the disk access effort and provide high speed for clustering. This new algorithm requires only one input parameter and supports the user in determining an appropriate value for it. In our simulation, we have shown that our proposed clustering algorithm can have shorter execution time than other algorithms for the large databases. Since the number of data points is increased, the execution time of our algorithm is increased slowly. Moreover, our algorithm can deal with clusters with arbitrary shapes in which the k-means algorithm can not discover.

目次 Table of Contents
ABSTRACT LIST OF FIGURES LIST OF TABLES 1. Introduction 1.1 Spatial Data Mining 1.2 Clustering 1.3 Space Filling Curves 1.4 Motivations 1.5 Organization of Thesis 2. Survey 2.1 K-Means 2.2 CLARANS 2.3 HAC 2.4 CURE 2.5 DBSCAN 2.6 STING 3. The Clustering Algorithm Based on the Hilbert Curve 3.1 The Basic Idea 3.2 The Algorithm 3.2.1 The First Round 3.2.3 The Second Round 3.2.3 The Third Round 3.2.4 The Fourth Round 4. Performance 4.1 Performance Measures 4.2 The Simulation Model 4.3 Simulation Results 4.3.1 Time Scalability 4.3.2 Sensitivity to Parameters 4.3.3 Special Data Sets 5. Conclusion 5.1 Summary 5.2 Future Work

參考文獻 References
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications," ACM SIGMOD Conf., pp. 94-105, 1998. M. Ankerst, M. M. Breunig, H. P. Kriegel, and J. Sander, "OPTICS: Ordering Points To Identify the Clustering Structure," ACM SIGMOD Conf., pp. 49-60, 1999. A. Ben-Dor, R. Shamir, and Z. Yakhini, "Clustering Gene Expression Patterns," Proc. of the 3rd Annual Int. Conf. on Computational Molecular Biology, pp. 33-42, 1999. P. Berkhin, "Survey of Clustering Data Mining Techniques," Accrue Software, Inc., 2002. M. S. Chen, J. Han, and P. S. Yu, "Data Mining: An Overview from Database Perspective," IEEE Trans. on Knowledge and Data Eng., Vol. 8, No. 6, pp. 866-883, Dec. 1996. M. Ester, H. P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. of the 2nd Int. Conf. on KDD, pp. 226-231, 1996. Christos Faloutsos, "Gray Codes for Partial Match and Range Queries," IEEE Trans. on Software Eng., Vol. 14, No. 10, pp. 1381-1393, Aug. 1988. Christos Faloutsos and Shari Roseman, "Fractals for Secondary Key Retrieval," ACM SIGACT-SIGMOD-SIGART Symposium on PODS, pp. 247-252, 1989. S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical," Int. Conf. on Data Eng., pp. 512-521, 1999. S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," Information Systems, Vol. 26, No. 1, pp. 35-58, March 2001. H. V. Jagadish, "Linear Clustering of Objects with Multiple Attributes," ACM SIGMOD Conf., pp. 332-342, 1990. A. K. Jain and R. C. Dubes, "Algorithms for Clustering Data," Prentice Hall, Englewood Cliffs, New Jersey, 1988. A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review," ACM Computing Survey, Vol. 31, No. 3, pp. 264-323, Sept. 1999. G. Karypis, E. H. Han, and V. Kumar, "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling," IEEE Trans. on Computer, Vol. 32, No. 8, pp. 68-75, Aug. 1999. L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis," John Wiley & Sons, Inc., New York, 1990. E. Kolatch, "Clustering Algorithms for Spatial Databases: A Survey," Dept. of Computer Science, University of Maryland, College Park, 2001. B. Lent, A. Swami, and J. Widom, "Clustering Association Rules," Proc. of the 13th Int. Conf. on Data Eng., pp. 220-231, 1997. BongKi Moon, H. V. Jagadish, Christos Faloutsos, and Joel H. Saltz, "Analysis of the Clustering Properties of the Hilbert Space-Filling Curve," IEEE Trans. on Knowledge and Data Eng., Vol. 13, No. 1, pp. 124-141, Jan. 2001. R. T. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proc. of the 20th VLDB Conf., pp. 144-155, 1994. R. T. Ng and J. Han, "CLARANS: A Method for Clustering Objects for Spatial Data Mining," IEEE Trans. on Knowledge and Data Eng., Vol. 14, No. 5, pp. 1003-1016, Sept./Oct. 2002. C. F. Olson, "Parallel Algorithms for Hierarchical Clustering," Parallel Computing, Vol. 21, No. 8, pp. 1313-1325, Aug. 1995. Jack A. Orenstein and T. H. Merrett, "A Class of Data Structures for Associative Searching," Proc. Symp. on PODS, pp. 181-190, 1984. Jack A. Orenstein, "Spatial Query Processing in an Object-Oriented Database System," Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 326-336, 1986. G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases," Proc. of the 24th VLDB Conf., pp. 428-439, 1998. L. Y. Tseng and S. B. Yang, "A Genetic Approach to the Automatic Clustering Problem," Pattern Recognition, Vol. 34, No. 2, pp. 415-424, Feb. 2001. E. M. Voorhees, "Implementing Agglomerative Hierarchical Clustering Algorithms for Use in Document Retrieval," Information Proc.& Management, pp. 465-476, 1986. W. Wang, J. Yang, and R. Muntz, "STING: A Statistical Information Grid Approach to Spatial Data," Proc. of the 23th VLDB Conf., pp. 186-195, 1997. C. P. Wei, Y. H. Lee, and C. M. Hsu, "Empirical Comparison of Fast Clustering Algorithms for Large Data Sets," Proc. of the 33rd Hawaii Int.~Conf.~on System Sciences, Maui, Hawaii, Jan. 2000. O. R. Zaiane, A. Foss, C. H. Lee, and W. wang, "On Data Clustering Analysis: Scalability, Constraints, and Validation," Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 28-39, 2002. T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: A Efficient Data Clustering Method for Very Large Databases," ACM SIGMOD Int. Conf. on Management of Data, pp. 103-114, 1996.

電子全文 Fulltext
本電子全文僅授權使用者為學術研究之目的，進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定，切勿任意重製、散佈、改作、轉貼、播送，以免觸法。論文使用權限 Thesis access permission：校內外都一年後公開 withheld 開放時間 Available：校內 Campus：已公開 available 校外 Off-campus：已公開 available etd-0725103-112256.pdf
紙本論文 Printed copies
紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊，請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。開放時間 available 已公開 available

QR Code

國立中山大學圖書與資訊處 │ 諮詢服務：2452 論文審查小組 │ 服務信箱 │ 系統開發維運：圖資處知識創新組

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS