Title page for etd-0826109-151344
Title
A Self-Constructing Fuzzy Feature Clustering for Text Categorization
Department
Year, semester
Language
Degree
Number of pages
75
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2009-07-22
Date of Submission
2009-08-26
Keywords
text classification, feature reduction, feature clustering, feature extraction, fuzzy clustering, fuzzy similarity
Statistics
The thesis/dissertation has been browsed 5793 times and downloaded 3612 times.
Abstract (Chinese)
Feature clustering is an effective way to reduce the dimensionality of data. This study proposes a self-constructing fuzzy feature clustering method based on fuzzy similarity. Each document feature is represented by a word pattern formed from the distribution of that word across the classes. All feature vectors are grouped by incremental fuzzy clustering, using a Gaussian function as the membership function; the mean and deviation of the membership function describe a cluster's center and the spread of its data, respectively. The membership degree serves as the clustering criterion, so feature vectors that are sufficiently similar are placed in the same cluster. After the algorithm has formed the clusters, each cluster represents one newly extracted feature, obtained as a weighted combination of all the feature vectors in that cluster.
In the proposed algorithm, the derived membership functions closely and appropriately characterize the actual distribution of the training data. Moreover, the user need not specify the number of clusters in advance, which avoids the trial-and-error process of searching for the optimal number of extracted features. We use the Usenet newsgroups collection (20 Newsgroups) and the web-directory collection (Cade 12) as experimental data, and classify the documents with a support vector machine. Experimental results show that, compared with other methods, our algorithm not only reduces the classifier's training time but also improves its accuracy.
Abstract
Feature clustering is a powerful method for reducing the dimensionality of feature vectors in text classification. In this thesis, we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test; words that are similar to one another are placed in the same cluster. Each cluster is characterized by a membership function with a statistical mean and deviation. When all the words have been fed in, a desired number of clusters is formed automatically, and we then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in that cluster.
With this algorithm, the derived membership functions match closely with, and properly describe, the real distribution of the training data. In addition, the user need not specify the number of extracted features in advance, so trial-and-error for determining the appropriate number of extracted features is avoided. We use the 20 Newsgroups data set and the Cade 12 web directory as experimental data, and adopt the support vector machine to classify the documents. Experimental results show that our method runs faster and obtains better extracted features than other methods.
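The procedure described in the abstract can be sketched in code: each word's pattern (its distribution over the classes) is fed in incrementally, a Gaussian membership degree decides whether it joins an existing cluster or starts a new one, and each final cluster yields one extracted feature as a weighted combination of its words. This is a simplified illustration under assumed details, not the thesis's exact formulation: the threshold `rho`, the initial deviation `sigma0`, and the incremental update formulas are assumptions made for demonstration.

```python
import numpy as np

def membership(x, mean, dev):
    """Gaussian membership degree of word pattern x in a cluster."""
    return float(np.prod(np.exp(-((x - mean) / dev) ** 2)))

def cluster_word_patterns(patterns, rho=0.64, sigma0=0.2):
    """Incrementally group word patterns: a new cluster is created when no
    existing cluster's membership degree reaches the threshold rho."""
    clusters = []  # each: dict with mean, running sum of squares, size, dev
    assign = []
    for x in patterns:
        degrees = [membership(x, c["mean"], c["dev"]) for c in clusters]
        if degrees and max(degrees) >= rho:
            j = int(np.argmax(degrees))
            c = clusters[j]
            n = c["size"]
            # incremental update of the cluster mean and sample deviation
            new_mean = (c["mean"] * n + x) / (n + 1)
            c["sq"] += x ** 2
            c["size"] = n + 1
            c["mean"] = new_mean
            var = (c["sq"] - (n + 1) * new_mean ** 2) / n  # n = new size - 1
            c["dev"] = np.sqrt(np.maximum(var, 0.0)) + sigma0
        else:
            # no cluster is similar enough: self-construct a new one
            clusters.append({"mean": x.astype(float),
                             "sq": x.astype(float) ** 2,
                             "size": 1,
                             "dev": np.full(len(x), sigma0)})
            j = len(clusters) - 1
        assign.append(j)
    return clusters, assign

def extract_features(doc_term, patterns, clusters):
    """One extracted feature per cluster: a weighted combination of the
    words, here weighted by their membership degrees in each cluster."""
    W = np.zeros((doc_term.shape[1], len(clusters)))
    for i, x in enumerate(patterns):
        for j, c in enumerate(clusters):
            W[i, j] = membership(x, c["mean"], c["dev"])
    return doc_term @ W
```

Because clusters are created on demand, the number of extracted features emerges from the data and the threshold rather than being fixed in advance, which is the property the abstract emphasizes.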
Table of Contents
Abstract (Chinese) i
Abstract ii
List of Figures iii
List of Tables v
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Organization 5
Chapter 2 Literature Review 6
2.1 Automatic Text Classification 6
2.2 Cluster Analysis 7
2.3 Feature Reduction 13
2.3.1 Feature Selection 14
2.3.2 Feature Extraction 15
2.3.3 Feature Clustering 18
2.4 Classifiers 21
2.4.1 Soft-Margin Support Vector Machines 23
Chapter 3 Methodology 25
3.1 Preprocessing 26
3.1.1 Case Folding 27
3.1.2 Stemming 27
3.1.3 Stop-word Removal 28
3.2 Our Method 29
3.2.1 Word Patterns 29
3.2.2 Word Clusters 31
3.2.3 Self-Constructing Fuzzy Feature Clustering 32
3.2.4 Feature Extraction 38
3.3 An Example 39
Chapter 4 Experimental Results and Discussion 43
4.1 Data Sets 43
4.2 Evaluation Criteria 44
4.3 Experimental Results 45
4.3.1 20 Newsgroups 45
4.3.2 Cade 12 45
4.4 Problems and Discussion 57
4.4.1 Time Complexity Analysis 57
4.4.2 Discussion 57
Chapter 5 Conclusions 59
References 60
Fulltext
This electronic fulltext is authorized only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; unauthorized reproduction, distribution, adaptation, reposting, or broadcasting may violate the law.
Thesis access permission: unrestricted (fully open on and off campus)
Available:
Campus: available
Off-campus: available


Printed copies
Public-access information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To inquire about the public-access status of printed theses from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
