Thesis/Dissertation etd-0809107-165527: Detailed Record



Name: Jing-wen Chen (陳經文)    E-mail: not disclosed
Department: Electrical Engineering (電機工程學系研究所)
Degree: Master    Graduation term: 2nd semester, academic year 95 (2006-2007)
Title (Chinese): 一個處理大量高維度文件的分群架構
Title (English): A clustering scheme for large high-dimensional document datasets
Files
  • etd-0809107-165527.pdf
  • This electronic full text is licensed only for personal, non-profit searching, reading, and printing for academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.

    Access rights

    Electronic thesis: not available on or off campus

    Language/pages: Chinese / 45
    Statistics: this thesis has been viewed 5054 times and downloaded 0 times
    Abstract (Chinese, translated): This thesis proposes a new clustering scheme that greatly shortens execution time compared with the original methods. Document datasets are typically large and high-dimensional, so ordinary clustering methods require considerable computation time. Our approach interleaves clustering with dimensionality reduction. First, the dataset is split into several parts; one part is clustered, and the clustering result guides a reduction of the feature dimensions. Another part is then added, combined with the previous data, and converted to the reduced dimensionality; clustering and dimensionality reduction are repeated in this way until all parts have been processed. Because the data is partitioned and its dimensionality reduced, we can handle datasets too large for the original clustering methods, and by combining different clustering and dimensionality-reduction methods we can improve the clustering time to a considerable degree.
    Abstract (English): Document clustering methods are receiving more and more attention. Because of the high dimensionality and large size of document datasets, clustering methods usually need a great deal of time. We propose a scheme that makes a clustering algorithm much faster than the original. We partition the whole dataset into several parts. First, one of these parts is clustered. Then, according to the cluster labels, the number of features is reduced by a certain ratio. Another part of the data is added, converted to the lower dimensionality, and clustered again. This is repeated until all partitions have been used. According to the experimental results, this scheme can run twice as fast as the original clustering method.
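    The partition-cluster-reduce loop described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the feature-scoring rule (spread of per-cluster feature means), the plain k-means, and all names (`incremental_cluster`, `keep_ratio`, and so on) are assumptions made for the sketch.

    ```python
    import random

    def kmeans(data, k, iters=20, seed=0):
        """Plain k-means on lists of feature vectors (squared Euclidean distance)."""
        rng = random.Random(seed)
        centers = [list(p) for p in rng.sample(data, k)]
        labels = [0] * len(data)
        for _ in range(iters):
            # Assignment step: attach each point to its nearest center.
            for i, p in enumerate(data):
                labels[i] = min(range(k),
                                key=lambda j: sum((a - b) ** 2
                                                  for a, b in zip(p, centers[j])))
            # Update step: move each center to the mean of its points.
            for j in range(k):
                pts = [p for p, lab in zip(data, labels) if lab == j]
                if pts:
                    centers[j] = [sum(col) / len(pts) for col in zip(*pts)]
        return labels

    def select_features(data, labels, keep_ratio):
        """Keep the features whose per-cluster means are most spread out
        (one plausible label-guided criterion; the thesis may use another)."""
        d = len(data[0])
        cluster_means = []
        for j in sorted(set(labels)):
            pts = [p for p, lab in zip(data, labels) if lab == j]
            cluster_means.append([sum(col) / len(pts) for col in zip(*pts)])
        scores = []
        for f in range(d):
            col = [m[f] for m in cluster_means]
            mu = sum(col) / len(col)
            scores.append(sum((v - mu) ** 2 for v in col))
        keep = max(1, int(d * keep_ratio))
        top = sorted(range(d), key=lambda f: scores[f], reverse=True)[:keep]
        return sorted(top)  # preserve column order

    def incremental_cluster(data, k=3, parts=4, keep_ratio=0.5):
        """Partition the dataset, then alternate clustering with feature
        reduction, folding in one more partition per round."""
        n, d = len(data), len(data[0])
        bounds = [round(i * n / parts) for i in range(parts + 1)]
        chunks = [data[bounds[i]:bounds[i + 1]] for i in range(parts)]
        feats = list(range(d))               # surviving original feature indices
        work = [list(p) for p in chunks[0]]  # working set, projected onto feats
        for i in range(parts):
            labels = kmeans(work, k, seed=i)
            if i == parts - 1:
                return labels, feats         # labels now cover the whole dataset
            sel = select_features(work, labels, keep_ratio)
            feats = [feats[s] for s in sel]
            work = [[row[s] for s in sel] for row in work]
            work += [[p[f] for f in feats] for p in chunks[i + 1]]
    ```

    With `parts=1` this degenerates to ordinary k-means on the full dataset; the intended saving comes from running most of the clustering passes on fewer features and, in the early rounds, fewer points.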
    Keywords (Chinese)
  • 文件分群 (document clustering)
  • 文件探勘 (text mining)
  • 高維度資料分群 (high-dimensional data clustering)
  • 維度縮減 (dimension reduction)
    Keywords (English)
  • Document clustering
  • Text mining
  • High-dimensional data clustering
  • Dimension reduction
  • Table of Contents
    Abstract (Chinese) i
    Abstract (English) ii
    Table of Contents iii
    List of Figures iv
    List of Tables v
    Chapter 1  Introduction 1
    Chapter 2  Literature Review 5
    2.1 Vector space model 5
    2.2 Clustering methods 6
    2.2.1 k-means 9
    2.2.2 ISODATA 11
    2.3 Dimensionality reduction 13
    Chapter 3  Methodology 15
    3.1 Motivation 15
    3.2 Proposed scheme 16
    Chapter 4  Experimental Results and Analysis 21
    4.1 Experiment: k-means 23
    4.2 Experiment: ISODATA 29
    4.3 Large datasets 34
    Chapter 5  Conclusions and Future Work 35
    5.1 Conclusions 35
    5.2 Future work 36
    Chapter 6  References 37
    References
    [1] S. M. Rüger and S. E. Gauch, “Feature Reduction for Document Clustering and Classification,” Technical report, Computing Department, Imperial College, London, UK, 2000.
    [2] D. Sullivan, Document Warehousing and Text Mining, Wiley Computer Publishing, p. 326, 2001.
    [3] J. Moore, E. H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, and B. Mobasher, “Web Page Categorization and Feature Selection Using Association Rule and Principal Component Clustering,” In 7th Workshop on Information Technologies and Systems, 1997.
    [4] L. D. Baker and A. McCallum, “Distributional Clustering of Words for Text Classification,” In Proceedings of the 21st Annual International ACM SIGIR Conference, pp. 96-103, 1998.
    [5] N. Slonim and N. Tishby, “The Power of Word Clusters for Text Classification,” In 23rd European Colloquium on Information Retrieval Research, 2001.
    [6] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, “Distributional Word Clusters vs. Words for Text Categorization,” Journal of Machine Learning Research, pp. 1-48, 2002.
    [7] F. Pereira, N. Tishby, and L. Lee, “Distributional Clustering of English Words,” In Meeting of the Association for Computational Linguistics, pp. 183-190, 1993.
    [8] I. Dhillon, S. Mallela, and R. Kumar, “A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification,” Journal of Machine Learning Research, pp. 1265-1287, 2003.
    [9] Y. Yang and J. O. Pedersen, “A Comparative Study on Feature Selection in Text Categorization,” In Proceedings of the 14th International Conference on Machine Learning, pp. 412-420, 1997.
    [10] I. Dhillon, Y. Guan, and J. Fan, “Efficient Clustering of Very Large Document Collections,” In Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers, pp. 357-381, 2001.
    [11] I. Dhillon, J. Kogan, and M. Nicholas, “Feature Selection and Document Clustering,” In A Comprehensive Survey of Text Mining, pp. 73-100, 2003.
    [12] J. Kogan, M. Teboulle, and C. Nicholas, “Data Driven Similarity Measures for k-Means Like Clustering Algorithms,” Information Retrieval, pp. 331-349, 2005.
    [13] I. Dhillon and D. Modha, “Concept Decompositions for Large Sparse Text Data using Clustering,” Machine Learning, pp. 143-175, 2001.
    [14] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley & Sons, New York, 1973.
    [15] E. Piazza, “Comparison of different classification algorithms of NOAA AVHRR images,” Proceedings SPIE, July 2000.
    [16] G. Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
    [17] P. Willett, “Recent Trends in Hierarchical Document Clustering: A Critical Review,” Information Processing and Management, Vol. 24, No. 5, pp. 557-597, 1988.
    [18] V. Faber, “Clustering and the Continuous k-Means Algorithm,” Los Alamos Science, No. 22, pp. 138-144, 1994.
    [19] G. H. Ball and D. J. Hall, “ISODATA, a novel method of data analysis and classification,” Technical Report, Stanford University, Stanford, 1965.
    [20] M. S. Chen, J. Han, and P. S. Yu, “Data Mining: An Overview from Database Perspective,” IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883, December 1996.
    [21] http://www.daviddlewis.com/resources/testcollections/reuters21578
    Thesis committee
  • 洪宗貝 - Convener
  • 吳志宏 - Committee member
  • 林文揚 - Committee member
  • 郭忠民 - Committee member
  • 李錫智 - Advisor

    Defense date: 2007-07-26    Submission date: 2007-08-09



    For any questions, please contact the thesis review team.