論文使用權限 Thesis access permission:校內校外均不公開 not available
開放時間 Available:
校內 Campus:永不公開 not available
校外 Off-campus:永不公開 not available
論文名稱 Title |
一個以信賴度為基礎的階層式文件分群方式 A Confidence-based Hierarchical Word Clustering for Document Classification |
||
系所名稱 Department |
|||
畢業學年期 Year, semester |
語文別 Language |
||
學位類別 Degree |
頁數 Number of pages |
51 |
|
研究生 Author |
|||
指導教授 Advisor |
|||
召集委員 Convenor |
|||
口試委員 Advisory Committee |
|||
口試日期 Date of Exam |
2007-07-26 |
繳交日期 Date of Submission |
2007-08-09 |
關鍵字 Keywords |
分類器、文字分群、信賴度、階層式分群 Classification, Word Clustering, Confidence, Hierarchical |
||
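中文摘要 Chinese Abstract (English translation) |
We propose a new method for reducing the dimensionality of word features. Word features are clustered hierarchically: they are merged into new clusters, which serve as new features for classification. Initially, each word feature is treated as a cluster of its own, and we compute the confidence value between every pair of distinct word clusters. The two distinct word features with the highest mutual confidence are then merged into a new cluster. Merging is repeated until only one cluster remains or a threshold is reached. In this way, the user obtains a hierarchy of word clusters and can choose the desired number of clusters by selecting them from a given level. The new features are then used for document classification. Experimental results show that our method performs well and outperforms the other methods compared. |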
Abstract |
We propose a novel feature reduction approach that groups words hierarchically into clusters, which can then be used as new features for document classification. Initially, each word constitutes its own cluster. We calculate the mutual confidence between every pair of distinct words. The two clusters containing the pair of words with the highest mutual confidence are merged into a new cluster. This merging process is repeated until the mutual confidence of every unprocessed pair of words is smaller than a predefined threshold, or until only one cluster remains. In this way, a hierarchy of word clusters is obtained. The user can then select the clusters at a chosen level of the hierarchy to use as new features for document classification. Experimental results on the Reuters-21578 and Cora data sets show that our method can perform better than the other methods compared. |
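The merging procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not define mutual confidence, so the sketch assumes the association-rule confidence conf(a→b) = |documents containing a and b| / |documents containing a| and takes the smaller of the two directions as the mutual value. The names `mutual_confidence`, `cluster_words`, `to_cluster_features`, and the toy `DOCS` corpus are all hypothetical.

```python
from itertools import combinations

# Toy corpus (hypothetical): each document is the set of words it contains.
DOCS = [
    {"stock", "market", "price"},
    {"stock", "market"},
    {"goal", "match"},
    {"goal", "match", "price"},
]


def mutual_confidence(docs, a, b):
    """ASSUMED definition: min(conf(a->b), conf(b->a)) over document counts."""
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    if n_a == 0 or n_b == 0:
        return 0.0
    return min(n_ab / n_a, n_ab / n_b)


def cluster_words(docs, threshold):
    """Merge word clusters pair by pair, highest mutual confidence first.

    Stops when the next unprocessed word pair scores below `threshold`
    or when a single cluster remains. Returns the final clusters and the
    merge history (the hierarchy, from the bottom up).
    """
    vocab = sorted(set().union(*docs))
    cluster_of = {w: frozenset([w]) for w in vocab}  # each word starts alone
    n_clusters = len(vocab)
    scored = [(mutual_confidence(docs, a, b), a, b)
              for a, b in combinations(vocab, 2)]
    scored.sort(key=lambda item: -item[0])  # stable sort: ties keep word order
    merges = []
    for score, a, b in scored:
        if score < threshold or n_clusters <= 1:
            break
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue  # both words already share a cluster
        merged = ca | cb
        for w in merged:
            cluster_of[w] = merged
        n_clusters -= 1
        merges.append((score, merged))
    return set(cluster_of.values()), merges


def to_cluster_features(doc, clusters):
    """Represent a document by how many of its words fall in each cluster."""
    ordered = sorted(clusters, key=sorted)  # fixed, reproducible feature order
    return [len(doc & c) for c in ordered]


if __name__ == "__main__":
    clusters, history = cluster_words(DOCS, threshold=0.6)
    for c in sorted(clusters, key=sorted):
        print(sorted(c))
    print(to_cluster_features(DOCS[0], clusters))
```

With the toy corpus and a threshold of 0.6, this sketch leaves three clusters ({goal, match}, {market, stock}, {price}), so each document is described by three features instead of five words. Lowering the threshold lets merging continue further up the hierarchy. The resulting feature vectors could then be passed to any classifier, such as the Naïve Bayes classifier listed in the table of contents.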
目次 Table of Contents |
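Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background and Motivation
1.2 Research Objectives
1.3 Thesis Organization
Chapter 2 Related Work
2.1 Definition of Text Mining
2.2 Text Mining: Overview and Process
2.3 Clustering
2.4 Feature Selection
2.5 Feature Extraction
2.6 Naïve Bayes Classifier
Chapter 3 Our Method
3.1 Objectives and Framework
3.2 Detailed Description of the Method
3.3 An Illustrative Example
Chapter 4 Experimental Analysis and Results
4.1 Experiment on Reuters-21578
4.2 Experiment on Cora
Chapter 5 Conclusion
5.1 Conclusion
5.2 Future Work
References |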