國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,整體學習應用於中文文件分類,Ensemble Learning for Text Classification

論文名稱 Title	整體學習應用於中文文件分類 Ensemble Learning for Text Classification
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	106 學年度第 1 學期 The fall semester of Academic Year 106	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	63
研究生 Author	賴駿豪 Jyun-Hao Lai
指導教授 Advisor	楊昌彪 Chang-Biau Yang
召集委員 Convenor	洪宗貝 Tzung-Pei Hong
口試委員 Advisory Committee	曾國尊, 陳嘉平, 謝孫源 Kuo-Tsung Tseng; Chia-Ping Chen; Sun-Yuan Hsieh
口試日期 Date of Exam	2017-08-31	繳交日期 Date of Submission	2017-09-07
關鍵字 Keywords	整體學習、中文文件分類、支援向量機器 (SVM)、特徵產生、行為知識空間 (BKS) support vector machine (SVM), ensemble learning, behavior knowledge space (BKS), Chinese text classification, feature generation
統計 Statistics	本論文已被瀏覽 5677 次，被下載 253 次 The thesis/dissertation has been browsed 5677 times, has been downloaded 253 times.

中文摘要
文件分類（text categorization, document classification, or document categorization）的問題是將文件給定一個預先定義好的類別。這個問題已經在許多領域進行了研究，如：圖書館學，資訊科學與電腦科學。目前為止也有大量的文獻是關於文件分類的研究，但卻鮮少有人討論中文文件分類。本論文中，我們研究了中文文件分類。由於中文文件分類目前並沒有公開的資料集，因此我們的資料來自 Yahoo 新聞網站。其中，約有50,000篇中文新聞，並分成9類。我們將這些資料來源分為五種：（1）全文，（2）標題，（3）第一段，（4）全文和標題，以及（5）標題和第一段，作為訓練資料。我們使用三種特徵產生方法（TF-IDF，$\chi^2$ 和 IG）來產生每個文件的特徵向量。接著，我們採用SVM作為分類器，因此有15個SVM分類器被訓練。下一步，任選三個分類器透過BKS方法進行整體學習，所以將有 $\binom{15}{3}=455$ 個整體分類器被建構。根據實驗數據，我們建議使用 TF-IDF（全文和標題），$\chi^2$（標題），IG（標題）作為整體分類器在中文新聞分類上表現較佳，準確度為79.04％。
Abstract
Text classification (also called text categorization, document classification, or document categorization) is the problem of assigning a given document to one of a set of predefined classes. The problem has been studied in many fields, such as library science, information science, and computer science. Although many studies have been devoted to text classification, few of them discuss Chinese text classification. In this thesis, we study the Chinese text classification problem. Since no public dataset is available for this problem, our experimental dataset was downloaded from the Yahoo News website. The dataset consists of about 50,000 Chinese news articles in 9 classes. From these news documents we form five types of sources: (1) full text, (2) title, (3) first paragraph, (4) full text and title, and (5) title and first paragraph. We use three feature generation methods (TF-IDF, $\chi^2$, and IG) to produce a feature vector for each document. We adopt the SVM as our base classifier, so 3 × 5 = 15 SVM classifiers are trained, one for each combination of feature generation method and source. Next, we choose any three of them to form an ensemble classifier by the BKS method, so a total of $\binom{15}{3} = 455$ ensemble classifiers are constructed. The experimental results show that our suggested ensemble classifier, formed by TF-IDF (full text and title), $\chi^2$ (title), and IG (title), achieves a prediction accuracy of 79.04%.
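The counting argument and the BKS combination step in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the class name `BKS`, the toy labels, and the majority-vote fallback for prediction patterns never seen in training are assumptions made for illustration, and the SVM training itself is omitted (the base predictions are given directly).

```python
from collections import Counter, defaultdict
from itertools import combinations, product
from math import comb

# The five text sources and three feature generation methods named in the abstract.
SOURCES = ["full text", "title", "first paragraph",
           "full text + title", "title + first paragraph"]
FEATURES = ["TF-IDF", "chi-square", "IG"]

# One base classifier per (feature method, source) pair: 3 * 5 = 15.
base_classifiers = list(product(FEATURES, SOURCES))
# Every unordered triple of base classifiers is one candidate ensemble.
triples = list(combinations(base_classifiers, 3))


class BKS:
    """Behavior knowledge space combiner (illustrative sketch)."""

    def __init__(self):
        # Maps a tuple of base predictions to a histogram of true labels.
        self.table = defaultdict(Counter)

    def fit(self, base_predictions, true_labels):
        for preds, label in zip(base_predictions, true_labels):
            self.table[tuple(preds)][label] += 1
        return self

    def predict(self, preds):
        cell = self.table.get(tuple(preds))
        if cell:
            # Most frequent true label observed for this prediction pattern.
            return cell.most_common(1)[0][0]
        # Assumed fallback for unseen patterns: plain majority vote.
        return Counter(preds).most_common(1)[0][0]


# Toy training data: each row holds the outputs of three base classifiers.
train_preds = [("sport", "sport", "finance"),
               ("sport", "sport", "finance"),
               ("sport", "sport", "finance"),
               ("health", "health", "health")]
train_labels = ["finance", "finance", "sport", "health"]

bks = BKS().fit(train_preds, train_labels)
seen = bks.predict(("sport", "sport", "finance"))
unseen = bks.predict(("travel", "travel", "health"))
```

In this toy data the pattern (sport, sport, finance) was labeled finance more often than sport, so the table lookup returns finance even though a majority vote would return sport. That ability to learn systematic disagreements among base classifiers is what distinguishes BKS from simple voting.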

目次 Table of Contents
VERIFICATION FORM
THESIS AUTHORIZATION FORM
THANKS
CHINESE ABSTRACT
ENGLISH ABSTRACT
LIST OF FIGURES
LIST OF TABLES
Chapter 1. Introduction
Chapter 2. Preliminaries
  2.1 The Text Classification Problem
    2.1.1 The Vector Space Model
    2.1.2 Chinese Segmentation
    2.1.3 Stop Words
  2.2 Feature Generation and Feature Selection
    2.2.1 Term Frequency-Inverse Document Frequency (TF-IDF)
    2.2.2 Chi-square Statistic ($\chi^2$)
    2.2.3 Information Gain (IG)
  2.3 The Ensemble Learning with the Behavior Knowledge Space Method
Chapter 3. Our Method
  3.1 The Dataset
  3.2 Our Training Method
Chapter 4. Experimental Results
Chapter 5. Conclusion
BIBLIOGRAPHY
Appendixes
  A. Art and Education News
  B. Entertainment News
  C. Finance News
  D. Health News
  E. Politics News
  F. Society News
  G. Sport News
  H. Technology News
  I. Travel News
  J. Chinese Stop Words


電子全文 Fulltext
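This electronic full text is licensed to users solely for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.
論文使用權限 Thesis access permission：自定論文開放時間 user define
開放時間 Available：校內 Campus：已公開 available；校外 Off-campus：已公開 available
etd-0802117-142741.pdf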
紙本論文 Printed copies
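Availability information for printed copies is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 available：已公開 available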



Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team │ Development and operations: Knowledge Innovation Division, LIS