Thesis/Dissertation etd-0809107-164953: Detailed Record



Name Kai-Tai Yin (殷開泰) E-mail not disclosed
Department Department of Electrical Engineering, Graduate Institute
Degree Master Graduation term 2nd semester, academic year 95 (ROC calendar)
Title (Chinese) 一個以信賴度為基礎的階層式文件分群方式
Title (English) A Confidence-based Hierarchical Word Clustering for Document Classification
Files
  • etd-0809107-164953.pdf
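  • The electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes.
    Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid breaking the law.
    Thesis access rights

    Electronic thesis: not available on or off campus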

    Language/Pages Chinese/51
    Statistics This thesis has been viewed 5065 times and downloaded 0 times.
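    Abstract (Chinese, translated) We propose a new method for reducing the dimensionality of word features. Word features are clustered hierarchically, and the words merged into a cluster serve as a new feature that can be used for classification. Initially, each word feature is treated as its own cluster, and we compute the confidence between every pair of distinct word clusters. The two distinct word features with the highest mutual confidence are then merged into a new cluster. Merging is repeated until only one cluster remains or a threshold is reached. This yields a hierarchy of word clusters, from which the user can choose the desired number of clusters by selecting a level. The new features are then used for document classification. Experimental results show that our method performs well and outperforms other methods.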
    Abstract (English) We propose a novel feature reduction approach that groups words hierarchically into clusters, which can then be used as new features for document classification. Initially, each word constitutes a cluster. We calculate the mutual confidence between any two different words. The pair of clusters containing the two words with the highest mutual confidence is combined into a new cluster. This merging process is iterated until all the mutual confidences between the unprocessed pairs of words are smaller than a predefined threshold or only one cluster remains. In this way, a hierarchy of word clusters is obtained. The user can choose the clusters at a certain level to be used as new features for document classification. Experimental results show that our method performs better than other methods.
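The merging procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the record does not give the formula for mutual confidence, so the sketch assumes it is the smaller of the two association-rule confidences conf(a→b) and conf(b→a), computed from document co-occurrence. The names `mutual_confidence`, `cluster_words`, and `pick_level` are hypothetical.

```python
from itertools import combinations


def mutual_confidence(docs_a, docs_b):
    """ASSUMED definition: min(conf(a->b), conf(b->a)) over document sets."""
    if not docs_a or not docs_b:
        return 0.0
    both = len(docs_a & docs_b)
    return min(both / len(docs_a), both / len(docs_b))


def cluster_words(documents, threshold=0.0):
    """Return the hierarchy as a list of levels; each level is a set of clusters.

    documents: list of sets of words. Level 0 has one cluster per word;
    every later level has exactly one cluster fewer than the one before it.
    """
    # Map each word to the ids of the documents that contain it.
    postings = {}
    for doc_id, words in enumerate(documents):
        for word in words:
            postings.setdefault(word, set()).add(doc_id)
    vocab = sorted(postings)

    # Score every pair of distinct words once, highest score first
    # (ties broken alphabetically so the result is deterministic).
    pairs = sorted(
        ((mutual_confidence(postings[a], postings[b]), a, b)
         for a, b in combinations(vocab, 2)),
        key=lambda t: (-t[0], t[1], t[2]),
    )

    cluster_of = {w: frozenset([w]) for w in vocab}
    levels = [set(cluster_of.values())]
    for score, a, b in pairs:
        # Stop when remaining pairs fall below the threshold or one cluster is left.
        if score < threshold or len(levels[-1]) == 1:
            break
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue  # the two words already share a cluster
        merged = ca | cb
        for w in merged:
            cluster_of[w] = merged
        levels.append((levels[-1] - {ca, cb}) | {merged})
    return levels


def pick_level(levels, k):
    """Return the level of the hierarchy that has exactly k clusters."""
    for level in levels:
        if len(level) == k:
            return level
    raise ValueError(f"no level with {k} clusters")
```

Calling `cluster_words(docs, threshold=0.6)` on a small corpus stops as soon as the best remaining pair scores below 0.6, while the default threshold of 0.0 keeps merging until a single cluster remains. `pick_level` then selects the level with the desired number of clusters, and each chosen cluster can stand in for its member words as one feature of the document vector.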
    Keywords (Chinese)
  • Classifier
  • Word clustering
  • Confidence
  • Hierarchical clustering
    Keywords (English)
  • Classification
  • Word Clustering
  • Confidence
  • Hierarchical
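    Table of contents Abstract (Chinese)
    Abstract
    Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
    1.1 Research Background and Motivation
    1.2 Research Objectives
    1.3 Thesis Organization
    Chapter 2 Related Work
    2.1 Definition of Text Mining
    2.2 Introduction to Text Mining and Its Workflow
    2.3 Clustering
    2.4 Feature Selection
    2.5 Feature Extraction
    2.6 Naïve Bayes Classifier
    Chapter 3 Our Method
    3.1 Research Objectives and Framework
    3.2 Detailed Description of the Method
    3.3 An Illustrative Example
    Chapter 4 Experimental Analysis and Results
    4.1 Experiment on Reuters-21578
    4.2 Experiment on Cora
    Chapter 5 Conclusion
    5.1 Conclusion
    5.2 Future Work
    References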
    Oral defense committee
  • 洪宗貝 - Convener
  • 吳志宏 - Committee member
  • 林文揚 - Committee member
  • 郭忠民 - Committee member
  • 李錫智 - Advisor
    Defense date 2007-07-26 Submission date 2007-08-09
