Title page for etd-0809107-165527
Title
A clustering scheme for large high-dimensional document datasets
Department
Year, semester
Language
Degree
Number of pages
45
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2007-07-26
Date of Submission
2007-08-09
Keywords
Document clustering, text mining, high-dimensional data clustering, dimension reduction
Statistics
This thesis/dissertation has been viewed 5889 times and downloaded 0 times.
Abstract (Chinese)
This thesis proposes a new clustering scheme. With this scheme, the execution time can be greatly reduced compared with the original method. Because document datasets are usually large and high-dimensional, general clustering methods require considerable computation time. Our method applies clustering and dimension reduction together. First, we partition the data into several parts, cluster one part, and reduce the dimensionality according to the clustering result. Next, another part of the data is added, combined with the previous part, and converted to the reduced dimensionality; the clustering and dimension-reduction steps are then repeated until all partitions have been processed. Because the data are partitioned and the dimensionality is reduced, we can handle large datasets that the original clustering method cannot. Moreover, by combining different clustering methods with different dimension-reduction methods, the clustering time can be improved to a certain degree.
Abstract
People pay more and more attention to document clustering methods. Because of the high dimensionality and the large size of document datasets, clustering methods usually need a lot of time to run. We propose a scheme that makes a clustering algorithm much faster than the original. We partition the whole dataset into several parts. First, we use one of these parts for clustering. Then, according to the labels obtained from clustering, we reduce the number of features by a certain ratio. We then add another part of the data, convert it to the lower-dimensional space, and cluster again. This is repeated until all partitions have been used. According to the experimental results, the scheme can run about twice as fast as the original clustering method.
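The loop described in the abstract (cluster one partition, keep a fraction of the features based on the cluster labels, fold in the next partition, repeat) can be sketched in code. The following is a minimal illustration only, not the thesis's actual implementation: it assumes plain k-means and a simple between-cluster-variance feature score, whereas the thesis also considers ISODATA and the reduction criterion may differ.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns a label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance of every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def score_features(X, labels, k):
    """Score each feature by the variance of its per-cluster means:
    features that separate the clusters well score high (an assumed
    criterion, chosen here only for illustration)."""
    means = np.array([X[labels == j].mean(0) for j in range(k)
                      if (labels == j).any()])
    return means.var(0)

def incremental_cluster(X, k=3, n_parts=4, keep_ratio=0.5, seed=0):
    """Partition X, then alternate clustering and feature reduction,
    adding one partition (projected to the kept features) per round."""
    parts = np.array_split(X, n_parts)
    kept = np.arange(X.shape[1])      # retained feature indices (global)
    acc = parts[0]
    for i in range(n_parts):
        labels = kmeans(acc, k, seed=seed)
        n_keep = max(1, int(len(kept) * keep_ratio))
        top = np.argsort(score_features(acc, labels, k))[::-1][:n_keep]
        kept = kept[top]              # reduce the dimensionality
        acc = acc[:, top]
        if i + 1 < n_parts:
            # Project the next partition onto the surviving features.
            acc = np.vstack([acc, parts[i + 1][:, kept]])
    return kmeans(acc, k, seed=seed), kept
```

Because each k-means pass after the first runs over progressively fewer features, the per-iteration cost shrinks even as rows accumulate, which is the source of the speed-up the abstract reports.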
Table of Contents
Abstract (Chinese) i
Abstract ii
Table of Contents iii
List of Figures iv
List of Tables v
Chapter 1 Introduction 1
Chapter 2 Literature Review 5
2.1 Vector Space Model 5
2.2 Clustering Methods 6
2.2.1 k-means 9
2.2.2 ISODATA 11
2.3 Dimension Reduction 13
Chapter 3 Methodology 15
3.1 Motivation 15
3.2 Scheme Architecture 16
Chapter 4 Experimental Results and Analysis 21
4.1 Experiment: k-means 23
4.2 Experiment: ISODATA 29
4.3 Large Datasets 34
Chapter 5 Conclusions and Future Work 35
5.1 Conclusions 35
5.2 Future Work 36
Chapter 6 References 37
References
[1] S. M. Rüger and S. E. Gauch, “Feature Reduction for Document Clustering and Classification,” Technical Report, Computing Department, Imperial College, London, UK, 2000.
[2] D. Sullivan, “Document Warehousing and Text Mining,” Wiley Computer Publishing, 2001.
[3] J. Moore, E. H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, and B. Mobasher, “Web Page Categorization and Feature Selection Using Association Rule and Principal Component Clustering,” in 7th Workshop on Information Technologies and Systems, 1997.
[4] L. D. Baker and A. McCallum, “Distributional Clustering of Words for Text Classification,” in Proceedings of the 21st Annual International ACM SIGIR Conference, pp. 96-103, 1998.
[5] N. Slonim and N. Tishby, “The Power of Word Clusters for Text Classification,” in 23rd European Colloquium on Information Retrieval Research, 2001.
[6] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, “Distributional Word Clusters vs. Words for Text Categorization,” Journal of Machine Learning Research, pp. 1-48, 2002.
[7] F. Pereira, N. Tishby, and L. Lee, “Distributional Clustering of English Words,” in Meeting of the Association for Computational Linguistics, pp. 183-190, 1993.
[8] I. Dhillon, S. Mallela, and R. Kumar, “A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification,” Journal of Machine Learning Research, pp. 1265-1287, 2003.
[9] Y. Yang and J. O. Pedersen, “A Comparative Study on Feature Selection in Text Categorization,” in Proceedings of the 14th International Conference on Machine Learning, pp. 412-420, 1997.
[10] I. Dhillon, Y. Guan, and J. Fan, “Efficient Clustering of Very Large Document Collections,” in Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers, pp. 357-381, 2001.
[11] I. Dhillon, J. Kogan, and C. Nicholas, “Feature Selection and Document Clustering,” in A Comprehensive Survey of Text Mining, pp. 73-100, 2003.
[12] J. Kogan, M. Teboulle, and C. Nicholas, “Data Driven Similarity Measures for k-Means Like Clustering Algorithms,” Information Retrieval, pp. 331-349, 2005.
[13] I. Dhillon and D. Modha, “Concept Decompositions for Large Sparse Text Data Using Clustering,” Machine Learning, pp. 143-175, 2001.
[14] R. O. Duda and P. E. Hart, “Pattern Classification and Scene Analysis,” Wiley & Sons, New York, 1973.
[15] E. Piazza, “Comparison of Different Classification Algorithms of NOAA AVHRR Images,” Proceedings of SPIE, July 2000.
[16] G. Salton and M. McGill, “Introduction to Modern Information Retrieval,” McGraw-Hill, New York, 1983.
[17] P. Willett, “Recent Trends in Hierarchical Document Clustering: A Critical Review,” Information Processing and Management, Vol. 24, No. 5, pp. 557-597, 1988.
[18] V. Faber, “Clustering and the Continuous k-Means Algorithm,” Los Alamos Science, No. 22, pp. 138-144, 1994.
[19] G. H. Ball and D. J. Hall, “ISODATA, a Novel Method of Data Analysis and Classification,” Technical Report, Stanford Research Institute, Stanford, CA, 1965.
[20] M. S. Chen, J. Han, and P. S. Yu, “Data Mining: An Overview from a Database Perspective,” IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883, December 1996.
[21] http://www.daviddlewis.com/resources/testcollections/reuters21578
Fulltext
This electronic full text is licensed only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: not available on campus or off campus
Available:
Campus: permanently unavailable
Off-campus: permanently unavailable


Printed copies
Availability information for printed copies is relatively complete from academic year 102 onward. For availability of printed copies from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
