Title page for etd-0809107-164953
Title
一個以信賴度為基礎的階層式文件分群方式
A Confidence-based Hierarchical Word Clustering for Document Classification
Department
Year, semester
Language
Degree
Number of pages
51
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2007-07-26
Date of Submission
2007-08-09
Keywords
Classification, Word Clustering, Confidence, Hierarchical
Statistics
The thesis/dissertation has been browsed 5816 times and downloaded 0 times.
Abstract
We propose a novel feature reduction approach that groups words hierarchically into clusters, which then serve as new features for document classification. Initially, each word constitutes its own cluster. We calculate the mutual confidence between every pair of distinct words. The two clusters containing the pair of words with the highest mutual confidence are merged into a new cluster. Merging is repeated until the mutual confidence of every unprocessed pair of words falls below a predefined threshold or only one cluster remains. The result is a hierarchy of word clusters. The user can decide how many clusters to use and select them from a given level of the hierarchy as the new features for document classification. Experimental results show that our method performs better than other methods.
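The merging procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the record does not give the formula for mutual confidence, so the sketch assumes it is the smaller of the two association-rule confidences conf(a -> b) and conf(b -> a), computed from document co-occurrence. The function names `mutual_confidence`, `cluster_words`, and `cluster_features` are hypothetical.

```python
from itertools import combinations


def mutual_confidence(docs_a, docs_b):
    """ASSUMED definition: min(conf(a -> b), conf(b -> a)) over document sets."""
    if not docs_a or not docs_b:
        return 0.0
    both = len(docs_a & docs_b)
    return min(both / len(docs_a), both / len(docs_b))


def cluster_words(documents, threshold=0.0):
    """Return (merge history, final clusters) for a list of tokenized documents."""
    # Map each word to the set of document indices in which it occurs.
    occurrences = {}
    for idx, doc in enumerate(documents):
        for word in set(doc):
            occurrences.setdefault(word, set()).add(idx)

    # Score every pair of distinct words once, highest score first.
    pairs = sorted(
        ((mutual_confidence(occurrences[a], occurrences[b]), a, b)
         for a, b in combinations(sorted(occurrences), 2)),
        key=lambda t: (-t[0], t[1], t[2]),
    )

    # Initially, each word constitutes its own cluster.
    cluster_of = {w: frozenset([w]) for w in occurrences}
    n_clusters = len(cluster_of)
    history = []
    for score, a, b in pairs:
        # Stop when remaining pairs fall below the threshold or one cluster is left.
        if score < threshold or n_clusters == 1:
            break
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue  # both words already belong to the same cluster
        merged = ca | cb
        for w in merged:
            cluster_of[w] = merged
        n_clusters -= 1
        history.append((score, merged))
    return history, set(cluster_of.values())


def cluster_features(doc, clusters):
    """Count the tokens of `doc` that fall into each cluster (the new features)."""
    ordered = sorted(clusters, key=sorted)
    return [sum(1 for w in doc if w in c) for c in ordered]


docs = [
    ["stock", "market", "price"],
    ["stock", "market"],
    ["goal", "match"],
    ["goal", "match", "price"],
]
history, clusters = cluster_words(docs, threshold=0.6)
features = cluster_features(docs[0], clusters)
```

Each entry in `history` records one merge, so replaying only the first few entries corresponds to choosing a level of the hierarchy with more clusters. The resulting cluster counts can then be passed to any document classifier in place of the original word counts.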
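Table of Contents
Abstract (Chinese) i
Abstract ii
Table of Contents iii
List of Figures v
List of Tables vi
Chapter 1 Introduction 1
1.1 Research Background and Motivation 1
1.2 Research Objectives 3
1.3 Thesis Organization 5
Chapter 2 Related Work 6
2.1 Definition of Text Mining 6
2.2 Introduction to Text Mining and Its Workflow 7
2.3 Clustering 9
2.4 Feature Selection 12
2.5 Feature Extraction 14
2.6 Naïve Bayes Classifier 17
Chapter 3 Our Method 19
3.1 Research Objectives and Framework 19
3.2 Detailed Description of the Method 23
3.3 An Illustrative Example 27
Chapter 4 Experimental Analysis and Results 33
4.1 Experiment on Reuters-21578 33
4.2 Experiment on Cora 36
Chapter 5 Conclusion 40
5.1 Conclusion 40
5.2 Future Work 40
References 42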
References
[1] D. Sullivan, Document Warehousing and Text Mining, Wiley Computer Publishing, 2001, pp. 326.

[2] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill Book Company, 1983.

[3] J. Dorre, P. Gerstl and R. Seiffert, Text Mining: Finding Nuggets in Mountains of Textual Data, Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 398-401.

[4] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

[5] P. Willett, Recent Trends in Hierarchical Document Clustering: A Critical Review, Information Processing and Management, 24(5), 1988, pp. 557-597.

[6] F. Sebastiani, Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.

[7] Y. Yang and J. O. Pedersen, A comparative study on feature selection in text categorization. In Proceedings of 14th International Conference on Machine Learning, Morgan Kaufmann, 1997, pp. 412-420.

[8] L. D. Baker and A. McCallum, Distributional clustering of words for text classification. In SIGIR’98: Proceedings of the 21st Annual International ACM SIGIR, pp. 96–103. ACM, August 1998.

[9] N. Slonim and N. Tishby, The power of word clusters for text classification. In Proceedings of the 23rd European Colloquium on Information Retrieval Research (ECIR), 2001.

[10] R. Bekkerman, R. El-Yaniv, Y. Winter, and N. Tishby, On feature distributional clustering for text categorization. In ACM SIGIR, pp. 146–153, 2001.

[11] F. Pereira, N. Tishby and L. Lee, Distributional clustering of English words. In 31st Annual Meeting of ACL, 1993, pp. 183-190.

[12] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, Vol. 41, No. 6, 1990, pp. 391-407.

[13] I. S. Dhillon, S. Mallela and R. Kumar, A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification. Journal of Machine Learning Research 3, 2003, pp. 1265-1287.

[14] http://www.daviddlewis.com/resources/testcollections/reuters21578

[15] A. McCallum, K. Nigam and L. Ungar, Efficient Clustering of High-dimensional Data Sets with Application to Reference Matching. In Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining, 2000, pp. 169-178.
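Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: not available on campus or off campus
Available:
Campus: never available
Off-campus: never available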

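Printed copies
Public-access information for printed theses is relatively complete from academic year 102 onward. To inquire about printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available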