Detailed record for thesis etd-0907111-062138
Title page for etd-0907111-062138
論文名稱 Title:
一個文件相似度測量方法及其應用
A Document Similarity Measure and Its Applications
系所名稱 Department:
畢業學年期 Year, semester:
語文別 Language:
學位類別 Degree:
頁數 Number of pages: 50
研究生 Author:
指導教授 Advisor:
召集委員 Convenor:
口試委員 Advisory Committee:
口試日期 Date of Exam: 2011-07-05
繳交日期 Date of Submission: 2011-09-07
關鍵字 Keywords:
相似度測量方法、文件相似度、多標籤、單標籤、準確度、文件分類、亂度、文件分群
k-means, document similarity, similarity measure, BEP, F1, single-label, multi-label, accuracy, text classification, entropy, document clustering, k-NN, ML-KNN
統計 Statistics:
本論文已被瀏覽 5765 次,被下載 1476 次。
The thesis/dissertation has been browsed 5765 times and has been downloaded 1476 times.
中文摘要 Chinese Abstract
In this thesis, we propose a new document similarity measure and apply it to text classification and document clustering. When measuring the similarity between two documents, we consider three cases: (a) the term feature appears in both documents, (b) the term feature appears in only one of the documents, and (c) the term feature appears in neither document. For the first case, we set a lower bound and assign the similarity according to the difference between the feature values of the two documents; for the second case, we assign a negative value regardless of the feature values; for the third case, the feature is ignored. We apply this measure to the similarity-based single-label classifier k-NN and the multi-label classifier ML-KNN, and extend it to measure the similarity between a document and a set of documents for document clustering, where a k-means-like algorithm is used. Experimental results show that our method performs better than the other methods.
Abstract
In this thesis, we propose a novel similarity measure for document processing and apply it to text classification and clustering. For two documents, the proposed measure takes three cases into account: (a) the considered feature appears in both documents, (b) the considered feature appears in only one document, and (c) the considered feature appears in neither document. For the first case, the similarity is given a lower bound and decreases with the difference between the feature values of the two documents. For the second case, a fixed value is assigned regardless of the magnitude of the feature value. For the last case, the feature has no effect on the similarity. We apply the measure to the similarity-based single-label classifier k-NN and the multi-label classifier ML-KNN, and extend it to measure the similarity between a document and a set of documents for clustering with a k-means-like algorithm, comparing its effectiveness with that of other measures. Experimental results show that the proposed measure works more effectively than the others.
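The record does not reproduce the formula defined in Chapter 3 of the thesis, so the following is a minimal, hypothetical sketch of the three-case idea described in the abstract. The function name pairwise_similarity, the normalized difference used for case (a), the averaging over active features, and the default values of lam and lower_bound are illustrative assumptions rather than the thesis's actual definition; lam stands in for the λ parameter varied in Figures 4.1, 4.4, and 4.6.

```python
import numpy as np

def pairwise_similarity(x, y, lam=1.0, lower_bound=0.5):
    """Illustrative three-case document similarity (sketch, not the thesis's exact formula).

    x, y        : 1-D arrays of non-negative feature values (e.g. tf-idf) for two documents
    lam         : fixed penalty for features present in only one document (assumed role of λ)
    lower_bound : floor on the per-feature similarity when a feature appears in both documents
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    both = (x > 0) & (y > 0)       # case (a): feature appears in both documents
    only_one = (x > 0) ^ (y > 0)   # case (b): feature appears in exactly one document
    # case (c): feature absent from both documents is ignored entirely

    contrib = np.zeros_like(x)
    # case (a): start from 1 and decay with the normalized difference of the feature
    # values, but never drop below the lower bound
    diff = np.abs(x[both] - y[both]) / (x[both] + y[both])
    contrib[both] = np.maximum(1.0 - diff, lower_bound)
    # case (b): a fixed negative penalty, independent of the feature magnitude
    contrib[only_one] = -lam

    # average over the features that matter (cases (a) and (b) only)
    n_active = np.count_nonzero(both | only_one)
    return contrib.sum() / n_active if n_active else 0.0

# toy usage with two short feature vectors
doc_a = [0.2, 0.0, 0.7, 0.1]
doc_b = [0.3, 0.5, 0.0, 0.1]
print(pairwise_similarity(doc_a, doc_b, lam=1.0))
```

Under these assumptions, such a measure could be dropped into a k-NN or ML-KNN classifier, or into a k-means-like clustering loop, in place of cosine similarity, which is how the abstract describes its applications.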
目次 Table of Contents
Table of Contents
Chinese Abstract I
Abstract II
Table of Contents III
List of Figures V
List of Tables VI
Chapter 1. Introduction 1
Chapter 2. Literature Review 3
2.1 Traditional Similarity Measures 3
2.2 Applications of Related Classification and Clustering Algorithms 5
2.2.1 The k-NN Single-Label Classification Algorithm 6
2.2.2 The ML-KNN Multi-Label Classification Algorithm 7
2.2.3 The k-means-like Clustering Algorithm 8
Chapter 3. Methodology 10
3.1 Motivation 10
3.2 The Proposed Method 10
3.2.1 Overview of the Method 10
3.2.2 Justification of the Method 12
3.3 Illustrative Examples 18
Chapter 4. Experimental Results and Analysis 24
4.1 Experimental Data 24
4.2 Single-Label Text Classification 26
4.3 Multi-Label Text Classification 29
4.4 Document Clustering 32
Chapter 5. Conclusion 36
References 37
List of Figures
Figure 2.1 Illustration of cosine similarity 3
Figure 2.2 Illustration of the k-NN single-label classification algorithm 7
Figure 4.1 Classification results of our method under different values of λ 27
Figure 4.2 Classification results of the compared methods on the WebKB collection 28
Figure 4.3 Classification results of the compared methods on the 8 Reuters-21578 collections 29
Figure 4.4 Results of our method under different values of λ 30
Figure 4.5 Comparison of the methods on RCV1 31
Figure 4.6 Accuracy of our method under different values of λ 33
Figure 4.7 Comparison of document clustering results on WebKB 34
Figure 4.8 Comparison of document clustering results on the top 8 Reuters-21578 collections 35
List of Tables
Table 4.1 Document distribution of the WebKB collection 23
Table 4.2 Document distribution of the top 8 Reuters collections 24
Table 4.3 The 5 RCV1 sub-collections 25
Table 4.4 Classification accuracy of the compared methods on the WebKB collection 26
Table 4.5 Classification accuracy of the compared methods on the 8 Reuters-21578 collections 27
Table 4.6 Statistical significance tests of accuracy on WebKB 27
Table 4.7 F1 of the compared methods on the RCV1 collections 30
Table 4.8 BEP of the compared methods on the RCV1 collections 31
Table 4.9 Statistical significance test results for F1 on RCV1 31
Table 4.10 Statistical significance tests of accuracy and entropy on the top 8 Reuters-21578 collections 34
參考文獻 References
References
[1] http://web.ist.utl.pt/acardoso/datasets/.
[2] http://www.cs.technion.ac.il/ronb/thesis.html.
[3] http://www.daviddlewis.com/resources/testcollections/reuters21578/.
[4] P. K. Agarwal and C. M. Procopiuc. Exact and approximation algorithms for clustering. Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 658–667, 1998.
[5] D. W. Aha. Lazy learning: Special issue editorial. Artificial Intelligence Review, 11(1-5):7–10, 1997.
[6] D. Cai, X. He, and J. Han. Document clustering using locality preserving indexing. IEEE Transactions on Knowledge and Data Engineering, 17(12):1624–1637, 2005.
[7] H. Chim and X. Deng. Efficient phrase-based document similarity for clustering. IEEE Transactions on Knowledge and Data Engineering, 20(9):1217–1229, 2008.
[8] M. Craven, D. DiPasquo, D. Freitag, A. K. McCallum, T. M. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. Proceedings of the 15th National Conference on Artificial Intelligence, 1998.
[9] I. S. Dhillon, J. Kogan, and C. Nicholas. Feature selection and document clustering. In M. W. Berry, editor, A Comprehensive Survey of Text Mining, 2003.
[10] I. S. Dhillon and D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143–175, 2001.
[11] J. D’hondt, J. Vertommen, P.-A. Verhaegen, D. Cattrysse, and J. R. Duflou. Pairwise-adaptive dissimilarity measure for document clustering. Information Sciences, 180:2341–2358, 2010.
[12] C. G. González, W. B. Jr., and A. L. V. Rodrigues. Density of closed balls in real-valued and autometrized Boolean spaces for clustering applications. 19th Brazilian Symposium on Artificial Intelligence, pages 8–22, 2008.
[13] K. M. Hammouda and M. S. Kamel. Efficient phrase-based document indexing for web document clustering. IEEE Transactions on Knowledge and Data Engineering, 16(10):1279–1296, 2004.
[14] K. M. Hammouda and M. S. Kamel. Hierarchically distributed peer-to-peer document clustering and cluster summarization. IEEE Transactions on Knowledge and Data Engineering, 21(5):681–698, 2009.
[15] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Second Edition, Morgan Kaufmann, Elsevier, 2006.
[16] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. International Conference on Machine Learning, pages 143–151, 1997.
[17] T. Joachims and F. Sebastiani. Guest editors’ introduction to the special issue on automated text categorization. Journal of Intelligent Information Systems, 18(2/3):103–105, 2002.
[18] T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881–892, 2002.
[19] H. Kim, P. Howland, and H. Park. Dimension reduction in text classification with support vector machines. Journal of Machine Learning Research, 6:37–53, 2005.
[20] S.-B. Kim, K.-S. Han, H.-C. Rim, and S. H. Myaeng. Some effective techniques for naïve Bayes text classification. IEEE Transactions on Knowledge and Data Engineering, 18(11):1457–1466, 2006.
[21] K. Knight. Mining online text. Communications of the ACM, 42(11):58–61, 1999.
[22] J. Kogan, C. Nicholas, and V. Volkovich. Text mining with information-theoretic clustering. Computing in Science and Engineering, 5(6):52–59, 2003.
[23] J. Kogan, M. Teboulle, and C. K. Nicholas. Data driven similarity measures for k-means like clustering algorithms. Information Retrieval, 8(2):331–349, 2005.
[24] S. Kolliopoulos and S. Rao. A nearly linear-time approximation scheme for the Euclidean k-median problem. Seventh Annual European Symposium on Algorithms, pages 362–371, 1999.
[25] V. Lertnattee and T. Theeramunkong. Multidimensional text classification for drug information. IEEE Transactions on Information Technology in Biomedicine, 8(3):306–312, 2004.
[26] D. D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[27] M. G. Michie. Use of the Bray-Curtis similarity measure in cluster analysis of foraminiferal data. Mathematical Geology, 14(6):661–667, 1982.
[28] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[29] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.
[30] K. Nigam, A. K. McCallum, S. Thrun, and T. M. Mitchell. Learning to classify text from labeled and unlabeled documents. Proceedings of 15th National Conference on Artificial Intelligence, 1998.
[31] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
[32] T. W. Schoenharl and G. Madey. Evaluation of measurement techniques for the validation of agent-based simulations against streaming data. International Conference on Computational Science, 2008.
[33] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
[34] C. Silva, U. Lotric, B. Ribeiro, and A. Dobnikar. Distributed text classification with an ensemble kernel-based learning approach. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 40(3):287–297, 2010.
[35] A. Strehl and J. Ghosh. Value-based customer grouping from large retail data-sets. SPIE Conference on Data Mining and Knowledge Discovery, 4057:33–42, 2000.
[36] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2006.
[37] M. L. Zhang and Z. H. Zhou. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.
[38] T. Zhang, Y. Y. Tang, B. Fang, and Y. Xiang. Document clustering in correlation similarity measure space. IEEE Transactions on Knowledge and Data Engineering (to appear), 2011.
[39] Y. Zhao and G. Karypis. Comparison of agglomerative and partitional document clustering algorithms. The Workshop on Clustering High Dimensional Data and its Applications at the Second SIAM International Conference on Data Mining, pages 83–93, 2002.
電子全文 Fulltext
This electronic full text is licensed only for users to search, read, and print personal copies for non-profit academic research purposes. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without permission, so as to avoid infringement.
論文使用權限 Thesis access permission: 自定論文開放時間 (user-defined release date)
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Public-access information for printed copies is relatively complete from the 102 academic year (2013/14) onward. To look up public-access information for printed copies from the 101 academic year (2012/13) or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
