Title page for etd-0810110-175700
Title
一個混合式多標籤文件分類方法
A Mixed Approach for Multi-Label Document Classification
Department
Year, semester
Language
Degree
Number of pages
54
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2010-07-21
Date of Submission
2010-08-10
Keywords
multi-label document classification, fuzzy similarity measure, relevance score, information retrieval
Statistics
This thesis/dissertation has been viewed 5811 times and downloaded 1686 times.
Abstract
In single-label document classification, each document belongs to exactly one category. A document that belongs to two or more categories at once is called a multi-label document, and classifying such documents accurately has become an active research topic in recent years. In this thesis, we propose fuzzy similarity measure multi-label K nearest neighbors (FSMLKNN), an algorithm for multi-label document classification that combines a fuzzy similarity measure with the multi-label K nearest neighbors (MLKNN) algorithm. The method uses the fuzzy similarity measure to compute the similarity between a test document and the center of each category cluster, and combines this with MLKNN, which substantially improves efficiency and also improves accuracy. In the experiments, we compare FSMLKNN with existing classification methods, namely the C4.5 decision tree, the support vector machine (SVM), and MLKNN. The results show that FSMLKNN is more efficient than the other methods while achieving good accuracy.
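The two-stage idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so everything here is an assumption: the fuzzy similarity is taken to be a fuzzy Jaccard ratio (sum of minima over sum of maxima), category centers are plain averages of term weights, and the final step is a simple majority vote among neighbors rather than the MAP estimate that MLKNN actually uses. The names `fuzzy_similarity`, `category_centers`, `predict`, `k`, and `threshold` are hypothetical.

```python
from collections import Counter


def fuzzy_similarity(a, b):
    """Fuzzy Jaccard similarity of two term-weight dicts (an assumed measure)."""
    terms = set(a) | set(b)
    num = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    den = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return num / den if den else 0.0


def category_centers(train):
    """Average the term weights of the training documents in each category."""
    sums, counts = {}, Counter()
    for vec, labels in train:
        for label in labels:
            counts[label] += 1
            center = sums.setdefault(label, {})
            for term, weight in vec.items():
                center[term] = center.get(term, 0.0) + weight
    return {c: {t: w / counts[c] for t, w in s.items()} for c, s in sums.items()}


def predict(doc, train, k=3, threshold=0.2):
    """Return the set of labels predicted for doc (simplified, not real MLKNN)."""
    centers = category_centers(train)
    # Stage 1: keep only categories whose center is similar enough to the document.
    candidates = {c for c, ctr in centers.items()
                  if fuzzy_similarity(doc, ctr) >= threshold}
    # Stage 2: search for neighbors only among documents carrying a candidate label.
    pool = [(vec, labels) for vec, labels in train if candidates & set(labels)]
    neighbors = sorted(pool, key=lambda d: fuzzy_similarity(doc, d[0]),
                       reverse=True)[:k]
    # Simplified vote: a label is assigned if more than half the neighbors carry it.
    votes = Counter(label for _, labels in neighbors for label in labels)
    return {label for label, n in votes.items() if n > len(neighbors) / 2}


train = [
    ({"ball": 1.0, "team": 1.0}, {"sports"}),
    ({"ball": 1.0, "score": 1.0}, {"sports"}),
    ({"vote": 1.0, "party": 1.0}, {"politics"}),
    ({"vote": 1.0, "team": 1.0, "ball": 1.0}, {"sports", "politics"}),
]
result = predict({"ball": 1.0, "team": 1.0}, train)
```

In this reading, any efficiency gain would come from stage 1: comparing a document against a handful of category centers is cheap, and it shrinks the pool that the more expensive neighbor search has to scan. Whether the thesis filters the search in this way is not stated in the abstract.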
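Table of Contents
Abstract (Chinese)
Abstract (English)
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Overview
1.2 Research Motivation
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Multi-Label Problem Transformation
2.2 Classification Methods
2.2.1 Decision Tree C4.5
2.2.2 Support Vector Machine
2.2.3 MLKNN
2.2.4 Fuzzy Similarity Method
Chapter 3 System Overview
3.1 Document Classification System Architecture
3.2 Document Preprocessing
3.3 Feature Selection
3.4 Feature Selection Methods
Chapter 4 Our Method
4.1 Fuzzy Similarity Clustering
4.2 MLKNN Classification
Chapter 5 Experimental Results and Analysis
5.1 Document Collections
5.2 Evaluation Methods
5.3 Experimental Results
5.3.1 Experiment 1
5.3.2 Experiment 2
Chapter 6 Conclusions and Future Work
References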

References
[1] R. Baeza-Yates and B. Ribeiro-Neto, “Modern Information Retrieval,” Addison Wesley, 1999.
[2] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning Multi-Label Scene Classification,” Pattern Recognition, vol. 37, no. 9, pages 1757-1771, 2004.
[3] Y. C. Chang, S. M. Chen, and C. J. Liau, “Multilabel Text Categorization Based on a New Linear Classifier Learning Method and a Category-Sensitive Refinement Method,” Expert Systems with Applications, pages 1948-1953, 2008.
[4] S. Diplaris, G. Tsoumakas, P. Mitkas, and I. Vlahavas, “Protein Classification with Multiple Algorithms,” Panhellenic Conference on Informatics, vol. 3746, pages 448-456, 2005.
[5] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami, “Inductive Learning Algorithms and Representation for Text Categorization,” ACM International Conference on Information and Knowledge Management, pages 148-155, 1998.
[6] N. Fuhr and C. Buckley, “A Probabilistic Learning Approach for Document Indexing,” ACM Transactions on Information Systems, vol. 9, no. 3, pages 223-248, 1991.
[7] I. J. Good, “The Estimation of Probabilities: An Essay on Modern Bayesian Methods,” MIT Press, 1965.
[8] D. A. Hull, “Improving Text Retrieval for the Routing Problem Using Latent Semantic Indexing,” ACM International Conference on Research and Development in Information Retrieval, pages 282-289, 1994.
[9] T. Joachims, “A Probabilistic Analysis of the Rocchio Algorithm with Tfidf for Text Categorization,” International Conference on Machine Learning, pages 143-151, 1997.
[10] T. Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features,” European Conference on Machine Learning, pages 137-142, 1998.
[11] D. D. Lewis and M. Ringuette, “A Comparison of Two Learning Algorithms for Text Categorization,” Third Annual Symposium on Document Analysis and Information Retrieval, pages 81-93, 1994.
[12] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “RCV1: A New Benchmark Collection for Text Categorization Research,” Journal of Machine Learning Research, vol. 5, pages 361-397, 2004.
[13] T. Mitchell, “Machine Learning,” McGraw-Hill, 1997.
[14] J. R. Quinlan, “Induction of Decision Trees,” Machine Learning, vol. 1, pages 81-106, 1986.
[15] J. R. Quinlan, “C4.5: Programs for Machine Learning,” Morgan Kaufmann, 1993.
[16] J. J. Rocchio, “Relevance Feedback in Information Retrieval,” The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313-323, 1971.
[17] G. Salton and M. J. McGill, “Introduction to Modern Information Retrieval,” McGraw-Hill Book Company, 1983.
[18] F. Sebastiani, “Machine Learning in Automated Text Categorization,” ACM Computing Surveys, vol. 34, no. 1, pages 1-47, 2002.
[19] R. Saracoğlu, K. Tütüncü, and N. Allahverdi, “A New Approach on Search for Similar Documents with Multiple Categories Using Fuzzy Clustering,” Expert Systems with Applications, pages 2545-2554, 2008.
[20] S. Tan, “Neighbor-weighted K-nearest Neighbor for Unbalanced Text Corpus,” Expert Systems with Applications, vol. 28, no. 4, pages 667-671, 2005.
[21] S. Tan, “An Effective Refinement Strategy for KNN Text Classifier,” Expert Systems with Applications, vol. 30, no. 2, pages 290-298, 2006.
[22] G. Tsoumakas and I. Katakis, “Multi-label Classification: An Overview,” International Journal of Data Warehousing and Mining, vol. 3, no. 3, pages 1-13, 2007.
[23] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Mining Multi-label Data,” Data Mining and Knowledge Discovery Handbook (draft of preliminary accepted chapter), O. Maimon, L. Rokach (Ed.), Springer, 2nd edition, 2009.
[24] D. H. Widyantoro and J. Yen, “A Fuzzy Similarity Approach in Text Classification Task,” IEEE International Conference on Fuzzy Systems, vol. 2, pages 653-658, 2000.
[25] M. L. Zhang and Z. H. Zhou, “A K-nearest Neighbor Based Algorithm for Multi-label Classification,” IEEE International Conference on Granular Computing, vol. 2, pages 718-721, 2005.
[26] M. L. Zhang and Z. H. Zhou, “MLKNN: A Lazy Learning Approach to Multi-Label Learning,” Pattern Recognition, vol. 40, pages 2038-2048, 2007.
[27] http://disi.unitn.it/moschitti/corpora.htm
[28] http://people.csail.mit.edu/jrennie/20Newsgroups/
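Full text
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.
Thesis access permission: unrestricted (fully open on and off campus)
Available:
Campus: available
Off-campus: available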


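Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up this information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available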
