Title page for etd-0616108-212404
Title: An Efficient Parameter-Relationship-Based Approach for Projected Clustering (一個以參數關係為基礎之有效率投影叢集方法)
Department:
Year, semester:
Language:
Degree:
Number of pages: 81
Author:
Advisor:
Convenor:
Advisory Committee:
Date of Exam: 2008-05-30
Date of Submission: 2008-06-16
Keywords: Projected Clustering, Large Itemset, Clustering, Association Rule, Bioinformatics
Statistics: This thesis/dissertation has been browsed 5635 times and downloaded 0 times.
Abstract
The clustering problem has been discussed extensively in the database literature as a tool for many applications, for example, bioinformatics. Traditional clustering algorithms consider all of the dimensions of an input dataset in an attempt to learn as much as possible about each object described. In high-dimensional data, however, many of the dimensions are often irrelevant. Therefore, projected clustering has been proposed. A projected cluster is a subset C of data points together with a subset D of dimensions such that the points in C are closely clustered in the subspace of dimensions D. Many algorithms have been proposed to find projected clusters; most of them fall into three categories: partitioning, density-based, and hierarchical. The DOC algorithm is a well-known density-based algorithm for projected clustering. It uses a Monte Carlo algorithm to iteratively compute projected clusters, and proposes a formula to measure the quality of a cluster. The FPC algorithm is an extended version of the DOC algorithm; it mines large itemsets to find the dimensions of a projected cluster. Finding large itemsets is the main goal of mining association rules, where a large itemset is a set of items whose number of occurrences in the dataset exceeds a given threshold. Although the FPC algorithm uses the technique of mining large itemsets to speed up finding projected clusters, it still requires many user-specified parameters. Moreover, in its first step, the FPC algorithm chooses the medoid by random sampling several times, which takes a long time and may still pick a bad medoid. Furthermore, the quality of a cluster can be calculated in a more refined way if the weight of each dimension is taken into consideration. Therefore, in this thesis, we propose an algorithm that remedies these disadvantages. First, we observe the relationship between parameters and propose a parameter-relationship-based algorithm that needs only two parameters, instead of the three required by most projected clustering algorithms. Next, our algorithm chooses the medoid with the median; we choose the medoid only once, and the quality of the resulting clusters is better than that of the FPC algorithm. Finally, our quality measure considers the weight of each dimension of the cluster, assigning different weights according to the number of occurrences of each dimension. This measure makes the quality of the projected clusters found by our algorithm better than that of the FPC algorithm, and it avoids clusters that contain too many irrelevant dimensions. Our simulation results show that our algorithm outperforms the FPC algorithm in terms of both execution time and clustering quality.
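The two ideas highlighted in the abstract — picking the medoid deterministically via the per-dimension median instead of repeated random sampling, and scoring a cluster by a quality formula over its size and dimensions — can be sketched as follows. The thesis's exact formulas are not reproduced in the abstract, so `median_medoid` and the `beta` parameter here are illustrative assumptions; `doc_quality` follows the DOC-style measure mu(|C|, |D|) = |C| * (1/beta)^|D| rather than the weighted variant the thesis proposes.

```python
import statistics

def median_medoid(points):
    """Return the data point closest (in squared Euclidean distance)
    to the per-dimension median -- one deterministic pass, in contrast
    to FPC's repeated random medoid sampling."""
    dims = len(points[0])
    median = [statistics.median(p[d] for p in points) for d in range(dims)]
    return min(points, key=lambda p: sum((p[d] - median[d]) ** 2 for d in range(dims)))

def doc_quality(n_points, n_dims, beta=0.25):
    """DOC-style cluster quality: mu(|C|, |D|) = |C| * (1/beta)^|D|,
    rewarding both more points and more relevant dimensions."""
    return n_points * (1.0 / beta) ** n_dims

points = [(1.0, 2.0), (1.2, 1.9), (5.0, 9.0), (1.1, 2.1)]
print(median_medoid(points))       # -> (1.1, 2.1), the point nearest the median
print(doc_quality(10, 2))          # -> 160.0
```

A dimension-weighted variant, as described in the abstract, would scale each dimension's contribution by how often it occurs among the mined itemsets before plugging it into the quality measure.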
Table of Contents
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Analyzing Microarray Data Using Clustering Analysis . . . . . . . . . 8
1.3 Projected Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Organization of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2. A Survey of Algorithms for Projected Clustering . . . . . . . . . . 22
2.1 CLIQUE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 PROCLUS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 ORCLUS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 DOC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 FPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3. A Parameter-Relationship-based Approach . . . . . . . . . . . . . . 35
3.1 The Relationship Between Parameters . . . . . . . . . . . . . . . . . 35
3.2 The Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 A Comparison with the FPC Algorithm . . . . . . . . . . . . . . . . 50
4. Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1 The Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Fulltext
This electronic fulltext is licensed to users only for personal, non-commercial searching, reading, and printing for academic research purposes. Please comply with the copyright law of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: not available on campus or off campus
Available:
Campus: never available
Off-campus: never available


Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed-thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
Available: already released
