Title page for etd-0809107-165527
Title
A clustering scheme for large high-dimensional document datasets
Department
Year, semester
Language
Degree
Number of pages
45
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2007-07-26
Date of Submission
2007-08-09
Keywords
Document clustering, text mining, high-dimensional data clustering, dimension reduction
Statistics
This thesis/dissertation has been viewed 5889 times and downloaded 0 times.
Abstract (Chinese)
This thesis proposes a new clustering scheme. With this scheme, the execution time can be greatly reduced compared with the original method. Because document datasets are usually large and high-dimensional, general clustering methods require considerable computation time. Our method applies clustering and dimension reduction together. First, we partition the data into several parts, cluster one part, and reduce the dimensionality according to the clustering result. Next, another part of the data is added, combined with the previous part, and converted to the reduced dimensionality; the clustering and dimension-reduction steps are then repeated until all partitions have been processed. Because the data are partitioned and the dimensionality is reduced, we can handle large datasets that the original clustering method cannot. Moreover, by combining different clustering methods with different dimension-reduction methods, the clustering time can be improved to a certain degree.
Abstract
People pay more and more attention to document clustering methods. Because of the high dimensionality and the large size of document datasets, clustering methods usually need a lot of time to run. We propose a scheme that makes a clustering algorithm much faster than the original. We partition the whole dataset into several parts. First, we use one of these parts for clustering. Then, according to the labels obtained from clustering, we reduce the number of features by a certain ratio. We then add another part of the data, convert it to the lower-dimensional space, and cluster again. This is repeated until all partitions have been used. According to the experimental results, the scheme can run about twice as fast as the original clustering method.
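The loop described in the abstract (cluster one partition, keep a fraction of the features based on the cluster labels, fold in the next partition, repeat) can be sketched in code. The following is a minimal illustration only, not the thesis's actual implementation: it assumes plain k-means and a simple between-cluster-variance feature score, whereas the thesis also considers ISODATA and the reduction criterion may differ.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns a label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance of every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def score_features(X, labels, k):
    """Score each feature by the variance of its per-cluster means:
    features that separate the clusters well score high (an assumed
    criterion, chosen here only for illustration)."""
    means = np.array([X[labels == j].mean(0) for j in range(k)
                      if (labels == j).any()])
    return means.var(0)

def incremental_cluster(X, k=3, n_parts=4, keep_ratio=0.5, seed=0):
    """Partition X, then alternate clustering and feature reduction,
    adding one partition (projected to the kept features) per round."""
    parts = np.array_split(X, n_parts)
    kept = np.arange(X.shape[1])      # retained feature indices (global)
    acc = parts[0]
    for i in range(n_parts):
        labels = kmeans(acc, k, seed=seed)
        n_keep = max(1, int(len(kept) * keep_ratio))
        top = np.argsort(score_features(acc, labels, k))[::-1][:n_keep]
        kept = kept[top]              # reduce the dimensionality
        acc = acc[:, top]
        if i + 1 < n_parts:
            # Project the next partition onto the surviving features.
            acc = np.vstack([acc, parts[i + 1][:, kept]])
    return kmeans(acc, k, seed=seed), kept
```

Because each k-means pass after the first runs over progressively fewer features, the per-iteration cost shrinks even as rows accumulate, which is the source of the speed-up the abstract reports.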
Table of Contents
Abstract (Chinese) i
Abstract ii
Table of Contents iii
List of Figures iv
List of Tables v
Chapter 1 Introduction 1
Chapter 2 Literature Review 5
2.1 Vector Space Model 5
2.2 Clustering Methods 6
2.2.1 k-means 9
2.2.2 ISODATA 11
2.3 Dimension Reduction 13
Chapter 3 Methodology 15
3.1 Motivation 15
3.2 Scheme Architecture 16
Chapter 4 Experimental Results and Analysis 21
4.1 Experiment: k-means 23
4.2 Experiment: ISODATA 29
4.3 Large Datasets 34
Chapter 5 Conclusions and Future Work 35
5.1 Conclusions 35
5.2 Future Work 36
Chapter 6 References 37
References
[1] S. M. Rüger and S. E. Gauch, “Feature Reduction for Document Clustering and Classification,” Technical Report, Computing Department, Imperial College, London, UK, 2000.
[2] D. Sullivan, “Document Warehousing and Text Mining,” Wiley Computer Publishing, 2001.
[3] J. Moore, E. H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, and B. Mobasher, “Web Page Categorization and Feature Selection Using Association Rule and Principal Component Clustering,” in 7th Workshop on Information Technologies and Systems, 1997.
[4] L. D. Baker and A. McCallum, “Distributional Clustering of Words for Text Classification,” in Proceedings of the 21st Annual International ACM SIGIR Conference, pp. 96-103, 1998.
[5] N. Slonim and N. Tishby, “The Power of Word Clusters for Text Classification,” in 23rd European Colloquium on Information Retrieval Research, 2001.
[6] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, “Distributional Word Clusters vs. Words for Text Categorization,” Journal of Machine Learning Research, pp. 1-48, 2002.
[7] F. Pereira, N. Tishby, and L. Lee, “Distributional Clustering of English Words,” in Meeting of the Association for Computational Linguistics, pp. 183-190, 1993.
[8] I. Dhillon, S. Mallela, and R. Kumar, “A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification,” Journal of Machine Learning Research, pp. 1265-1287, 2003.
[9] Y. Yang and J. O. Pedersen, “A Comparative Study on Feature Selection in Text Categorization,” in Proceedings of the 14th International Conference on Machine Learning, pp. 412-420, 1997.
[10] I. Dhillon, Y. Guan, and J. Fan, “Efficient Clustering of Very Large Document Collections,” in Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers, pp. 357-381, 2001.
[11] I. Dhillon, J. Kogan, and C. Nicholas, “Feature Selection and Document Clustering,” in A Comprehensive Survey of Text Mining, pp. 73-100, 2003.
[12] J. Kogan, M. Teboulle, and C. Nicholas, “Data Driven Similarity Measures for k-Means Like Clustering Algorithms,” Information Retrieval, pp. 331-349, 2005.
[13] I. Dhillon and D. Modha, “Concept Decompositions for Large Sparse Text Data Using Clustering,” Machine Learning, pp. 143-175, 2001.
[14] R. O. Duda and P. E. Hart, “Pattern Classification and Scene Analysis,” Wiley & Sons, New York, 1973.
[15] E. Piazza, “Comparison of Different Classification Algorithms of NOAA AVHRR Images,” Proceedings of SPIE, July 2000.
[16] G. Salton and M. McGill, “Introduction to Modern Information Retrieval,” McGraw-Hill, New York, 1983.
[17] P. Willett, “Recent Trends in Hierarchical Document Clustering: A Critical Review,” Information Processing and Management, Vol. 24, No. 5, pp. 557-597, 1988.
[18] V. Faber, “Clustering and the Continuous k-Means Algorithm,” Los Alamos Science, No. 22, pp. 138-144, 1994.
[19] G. H. Ball and D. J. Hall, “ISODATA, a Novel Method of Data Analysis and Classification,” Technical Report, Stanford Research Institute, Stanford, CA, 1965.
[20] M. S. Chen, J. Han, and P. S. Yu, “Data Mining: An Overview from a Database Perspective,” IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883, December 1996.
[21] http://www.daviddlewis.com/resources/testcollections/reuters21578
Fulltext
This electronic full text is licensed only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: not available on campus or off campus
Available:
Campus: permanently unavailable
Off-campus: permanently unavailable


Printed copies
Availability information for printed copies is relatively complete from academic year 102 onward. For availability of printed copies from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
