Title page for etd-0826109-151344
Title
A Self-Constructing Fuzzy Feature Clustering for Text Categorization
Department
Year, semester
Language
Degree
Number of pages
75
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2009-07-22
Date of Submission
2009-08-26
Keywords
text classification, feature reduction, feature clustering, feature extraction, fuzzy clustering, fuzzy similarity
Statistics
The thesis/dissertation has been browsed 5793 times and downloaded 3612 times.
Abstract (Chinese)
Feature clustering is an effective way to reduce the dimensionality of data. This study proposes a self-constructing fuzzy feature clustering method based on fuzzy similarity. Each document feature is represented by a word pattern formed from the distribution of that word across the classes. All feature vectors are grouped by incremental fuzzy clustering, using a Gaussian function as the membership function; the mean and deviation of the membership function describe a cluster's center and the spread of its data, respectively. The membership degree serves as the clustering criterion, so feature vectors that are sufficiently similar are placed in the same cluster. After the algorithm has formed the clusters, each cluster represents one newly extracted feature, obtained as a weighted combination of all the feature vectors in that cluster.
In the proposed algorithm, the derived membership functions closely and appropriately characterize the actual distribution of the training data. Moreover, the user need not specify the number of clusters in advance, which avoids the trial-and-error process of searching for the optimal number of extracted features. We use the Usenet newsgroups collection (20 Newsgroups) and the web-directory collection (Cade 12) as experimental data, and classify the documents with a support vector machine. Experimental results show that, compared with other methods, our algorithm not only reduces the classifier's training time but also improves its accuracy.
Abstract
Feature clustering is a powerful method for reducing the dimensionality of feature vectors in text classification. In this thesis, we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test; words that are similar to one another are placed in the same cluster. Each cluster is characterized by a membership function with a statistical mean and deviation. When all the words have been fed in, a desired number of clusters is formed automatically, and we then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in that cluster.
With this algorithm, the derived membership functions match closely with, and properly describe, the real distribution of the training data. In addition, the user need not specify the number of extracted features in advance, so trial-and-error for determining the appropriate number of extracted features is avoided. We use the 20 Newsgroups data set and the Cade 12 web directory as experimental data, and adopt the support vector machine to classify the documents. Experimental results show that our method runs faster and obtains better extracted features than other methods.
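The procedure described in the abstract can be sketched in code: each word's pattern (its distribution over the classes) is fed in incrementally, a Gaussian membership degree decides whether it joins an existing cluster or starts a new one, and each final cluster yields one extracted feature as a weighted combination of its words. This is a simplified illustration under assumed details, not the thesis's exact formulation: the threshold `rho`, the initial deviation `sigma0`, and the incremental update formulas are assumptions made for demonstration.

```python
import numpy as np

def membership(x, mean, dev):
    """Gaussian membership degree of word pattern x in a cluster."""
    return float(np.prod(np.exp(-((x - mean) / dev) ** 2)))

def cluster_word_patterns(patterns, rho=0.64, sigma0=0.2):
    """Incrementally group word patterns: a new cluster is created when no
    existing cluster's membership degree reaches the threshold rho."""
    clusters = []  # each: dict with mean, running sum of squares, size, dev
    assign = []
    for x in patterns:
        degrees = [membership(x, c["mean"], c["dev"]) for c in clusters]
        if degrees and max(degrees) >= rho:
            j = int(np.argmax(degrees))
            c = clusters[j]
            n = c["size"]
            # incremental update of the cluster mean and sample deviation
            new_mean = (c["mean"] * n + x) / (n + 1)
            c["sq"] += x ** 2
            c["size"] = n + 1
            c["mean"] = new_mean
            var = (c["sq"] - (n + 1) * new_mean ** 2) / n  # n = new size - 1
            c["dev"] = np.sqrt(np.maximum(var, 0.0)) + sigma0
        else:
            # no cluster is similar enough: self-construct a new one
            clusters.append({"mean": x.astype(float),
                             "sq": x.astype(float) ** 2,
                             "size": 1,
                             "dev": np.full(len(x), sigma0)})
            j = len(clusters) - 1
        assign.append(j)
    return clusters, assign

def extract_features(doc_term, patterns, clusters):
    """One extracted feature per cluster: a weighted combination of the
    words, here weighted by their membership degrees in each cluster."""
    W = np.zeros((doc_term.shape[1], len(clusters)))
    for i, x in enumerate(patterns):
        for j, c in enumerate(clusters):
            W[i, j] = membership(x, c["mean"], c["dev"])
    return doc_term @ W
```

Because clusters are created on demand, the number of extracted features emerges from the data and the threshold rather than being fixed in advance, which is the property the abstract emphasizes.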
Table of Contents
Abstract (Chinese) i
Abstract ii
List of Figures iii
List of Tables v
Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 2
1.3 Organization 5
Chapter 2 Literature Review 6
2.1 Automatic Text Classification 6
2.2 Cluster Analysis 7
2.3 Feature Reduction 13
2.3.1 Feature Selection 14
2.3.2 Feature Extraction 15
2.3.3 Feature Clustering 18
2.4 Classifiers 21
2.4.1 Soft-Margin Support Vector Machines 23
Chapter 3 Methodology 25
3.1 Preprocessing 26
3.1.1 Case Folding 27
3.1.2 Stemming 27
3.1.3 Stop-word Removal 28
3.2 Our Method 29
3.2.1 Word Patterns 29
3.2.2 Word Clusters 31
3.2.3 Self-Constructing Fuzzy Feature Clustering 32
3.2.4 Feature Extraction 38
3.3 An Example 39
Chapter 4 Experimental Results and Discussion 43
4.1 Data Sets 43
4.2 Evaluation Criteria 44
4.3 Experimental Results 45
4.3.1 20 Newsgroups 45
4.3.2 Cade 12 45
4.4 Problems and Discussion 57
4.4.1 Time Complexity Analysis 57
4.4.2 Discussion 57
Chapter 5 Conclusions 59
References 60
Fulltext
This electronic fulltext is authorized only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; unauthorized reproduction, distribution, adaptation, reposting, or broadcasting may violate the law.
Thesis access permission: unrestricted (fully open on and off campus)
Available:
Campus: available
Off-campus: available


Printed copies
Public-access information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To inquire about the public-access status of printed theses from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
