National Sun Yat-sen University, thesis/dissertation, A Confidence-based Hierarchical Word Clustering for Document Classification

Title	A Confidence-based Hierarchical Word Clustering for Document Classification (一個以信賴度為基礎的階層式文件分群方式)
Department	Department of Electrical Engineering
Year, semester	Spring semester, Academic Year 95 (2006-2007)	Language	Chinese
Degree	Master	Number of pages	51
Author	Kai-Tai Yin (殷開泰)
Advisor	Shie-jue Lee (李錫智)
Convenor	Tzung-pei Hong (洪宗貝)
Advisory Committee	Chung-Ming Kuo (郭忠民); Wen-Yang Lin (林文揚); Chih-Hung Wu (吳志宏)
Date of Exam	2007-07-26	Date of Submission	2007-08-09
Keywords	Classification, Word Clustering, Confidence, Hierarchical Clustering

Abstract
We propose a novel feature reduction approach that groups words hierarchically into clusters, which then serve as new features for document classification. Initially, each word constitutes its own cluster. We compute the mutual confidence between every pair of distinct words. The two clusters containing the pair of words with the highest mutual confidence are merged into a new cluster. Merging is repeated until only one cluster remains or the mutual confidences of all unprocessed word pairs fall below a predefined threshold. The result is a hierarchy of word clusters, from which the user selects a level according to the desired number of clusters; the clusters at that level are used as the new features for document classification. Experiments on the Reuters-21578 and Cora data sets indicate that the method performs better than the other methods compared.
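The merging procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the record does not give the formula for mutual confidence, so the sketch assumes it is the smaller of the two directional co-occurrence confidences, conf(a -> b) and conf(b -> a), measured over documents. The names `mutual_confidence`, `hierarchical_word_clusters`, and `cluster_features`, as well as the toy documents, are hypothetical.

```python
from itertools import combinations


def mutual_confidence(docs, a, b):
    """ASSUMED definition: min(conf(a -> b), conf(b -> a)) over documents.

    The thesis record does not state the formula; this is a stand-in.
    """
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)
    if n_a == 0 or n_b == 0:
        return 0.0
    return min(n_ab / n_a, n_ab / n_b)


def hierarchical_word_clusters(docs, threshold=0.0):
    """Return the hierarchy as a list of levels (each a list of frozensets).

    Level 0 has one cluster per word; each later level has one cluster fewer.
    """
    vocab = sorted(set().union(*docs))
    cluster_of = {w: frozenset([w]) for w in vocab}
    levels = [sorted(set(cluster_of.values()), key=sorted)]
    # Visit word pairs from highest to lowest mutual confidence.
    pairs = sorted(
        ((mutual_confidence(docs, a, b), a, b) for a, b in combinations(vocab, 2)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    for score, a, b in pairs:
        # Stop when the remaining pairs fall below the threshold
        # or when only one cluster is left.
        if score < threshold or len(levels[-1]) == 1:
            break
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue  # both words already share a cluster
        merged = ca | cb
        for w in merged:
            cluster_of[w] = merged
        levels.append(sorted(set(cluster_of.values()), key=sorted))
    return levels


def cluster_features(tokens, clusters):
    """Map a tokenized document to one count per word cluster."""
    return [sum(1 for t in tokens if t in c) for c in clusters]


# Toy corpus (hypothetical): each document is a set of words.
docs = [
    {"stock", "market"},
    {"stock", "market", "trade"},
    {"goal", "match"},
    {"goal", "match", "trade"},
]
levels = hierarchical_word_clusters(docs, threshold=0.6)
features = cluster_features(["stock", "market", "trade", "stock"], levels[-1])
```

Choosing a level from `levels` corresponds to the user picking the number of clusters; `cluster_features` then turns each document into a vector with one entry per cluster, which could be passed to a classifier such as Naïve Bayes.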

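Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Background and Motivation
  1.2 Research Objectives
  1.3 Thesis Organization
Chapter 2 Related Work
  2.1 Definition of Text Mining
  2.2 Overview and Process of Text Mining
  2.3 Clustering
  2.4 Feature Selection
  2.5 Feature Extraction
  2.6 Naïve Bayes Classifier
Chapter 3 Our Method
  3.1 Objectives and Framework
  3.2 Detailed Description of the Method
  3.3 An Illustrative Example
Chapter 4 Experimental Analysis and Results
  4.1 Experiment on Reuters-21578
  4.2 Experiment on Cora
Chapter 5 Conclusions
  5.1 Conclusions
  5.2 Future Work
References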


Fulltext
The electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: not available on or off campus. Available: Campus: never available. Off-campus: never available.
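Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. For printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. Availability: publicly available.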
