博碩士論文 etd-0802117-142741 詳細資訊
Title page for etd-0802117-142741
論文名稱
Title
整體學習應用於中文文件分類
Ensemble Learning for Text Classification
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
63
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2017-08-31
繳交日期
Date of Submission
2017-09-07
關鍵字
Keywords
整體學習、中文文件分類、支援向量機器 (SVM)、特徵產生、行為知識空間 (BKS)
support vector machine (SVM), ensemble learning, behavior knowledge space (BKS), Chinese text classification, feature generation
統計
Statistics
本論文已被瀏覽 5677 次,被下載 253
The thesis/dissertation has been browsed 5677 times, has been downloaded 253 times.
中文摘要
文件分類(text categorization, document classification, or document categorization)的問題是將文件給定一個預先定義好的類別。這個問題已經在許多領域進行了研究,如:圖書館學,資訊科學與電腦科學。目前為止也有大量的文獻是關於文件分類的研究,但卻鮮少有人討論中文文件分類。
本論文中,我們研究了中文文件分類。由於中文文件分類目前並沒有公開的資料集,因此我們的資料來自Yahoo新聞網站。其中,約有50,000篇中文新聞,並分成9類。我們將這些資料來源分為五種:(1)全文,(2)標題,(3)第一段,(4)全文和標題,以及(5)標題和第一段,作為訓練資料。我們使用三種特徵產生方法(TF-IDF,χ^2和IG)來產生每個文件的特徵向量。接著,我們採用SVM作為分類器,因此有15個SVM分類器被訓練。下一步,任選三個分類器透過BKS方法進行整體學習,所以將有 C(15, 3) = 455 個整體分類器被建構。根據實驗數據,我們建議使用 TF-IDF(全文和標題),χ^2(標題),IG(標題)作為整體分類器在中文新聞分類上表現較佳,準確度為79.04%。
Abstract
The text classification problem (also called text categorization, document classification, or document categorization) is to assign a given document to one of a set of predefined classes. The problem has been studied in many fields, such as library science, information science, and computer science. Although many studies have been devoted to text classification, few of them discuss Chinese text classification.
In this thesis, we study the Chinese text classification problem. Since there is no public dataset for this problem, our experimental dataset was downloaded from the Yahoo news web site. The dataset consists of about 50,000 Chinese news articles in 9 classes. We organize these news documents into five types of sources: (1) full text, (2) title, (3) first paragraph, (4) full text and title, and (5) title and first paragraph. We use three feature generation methods (TF-IDF, χ^2, and IG) to produce the feature vector of each document. We adopt the SVM method as our base classifier, so 15 SVM classifiers are trained (three feature generation methods times five sources). Next, we choose any three of them to form an ensemble classifier with the BKS method, so a total of C(15, 3) = 455 ensemble classifiers are constructed. The experimental results show that the ensemble classifier we suggest, formed by TF-IDF (full text and title), χ^2 (title), and IG (title), achieves a prediction accuracy of 79.04%.
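The combination step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the classifier names, the toy labels, and the plurality-vote fallback for unseen decision tuples are all assumptions made for illustration. It enumerates the 455 possible triples of base classifiers and shows the core idea of the behavior knowledge space (BKS) method, which stores, for each tuple of base-classifier decisions, how often each true class occurred in training.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical labels for the 15 base classifiers:
# 3 feature generation methods x 5 text sources.
FEATURES = ["TF-IDF", "chi2", "IG"]
SOURCES = ["full text", "title", "first paragraph",
           "full text + title", "title + first paragraph"]
BASE = [(f, s) for f in FEATURES for s in SOURCES]

# Every way to pick three base classifiers for one ensemble.
TRIPLES = list(combinations(BASE, 3))


def bks_fit(base_preds, labels):
    """Build the BKS table.

    base_preds: one list of predicted classes per base classifier,
                aligned with `labels` (the true classes).
    Returns a mapping from a decision tuple to a Counter of true classes.
    """
    table = defaultdict(Counter)
    for decisions, label in zip(zip(*base_preds), labels):
        table[decisions][label] += 1
    return table


def bks_predict(table, decisions):
    """Return the true class seen most often for this decision tuple.

    For a tuple never seen in training, fall back to a plurality vote
    over the base decisions (an assumption of this sketch).
    """
    cell = table.get(tuple(decisions))
    if cell:
        return cell.most_common(1)[0][0]
    return Counter(decisions).most_common(1)[0][0]


# Toy training data (invented): three base classifiers, four documents.
toy_preds = [
    ["sport", "finance", "finance", "finance"],  # classifier A
    ["sport", "sport", "sport", "finance"],      # classifier B
    ["sport", "sport", "sport", "health"],       # classifier C
]
toy_labels = ["sport", "finance", "finance", "finance"]
toy_table = bks_fit(toy_preds, toy_labels)
```

In the toy data, the decision tuple ("finance", "sport", "sport") always co-occurs with the true class "finance", so the BKS lookup returns "finance" even though a majority vote over the three decisions would return "sport". This is the kind of systematic base-classifier disagreement the BKS table can exploit.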
目次 Table of Contents
VERIFICATION FORM
THESIS AUTHORIZATION FORM
THANKS
CHINESE ABSTRACT
ENGLISH ABSTRACT
LIST OF FIGURES
LIST OF TABLES
Chapter 1. Introduction
Chapter 2. Preliminaries
2.1 The Text Classification Problem
2.1.1 The Vector Space Model
2.1.2 Chinese Segmentation
2.1.3 Stop Words
2.2 Feature Generation and Feature Selection
2.2.1 Term Frequency-Inverse Document Frequency (TF-IDF)
2.2.2 Chi-square Statistic (χ^2)
2.2.3 Information Gain (IG)
2.3 The Ensemble Learning with the Behavior Knowledge Space Method
Chapter 3. Our Method
3.1 The Dataset
3.2 Our Training Method
Chapter 4. Experimental Results
Chapter 5. Conclusion
BIBLIOGRAPHY
Appendixes
A. Art and Education News
B. Entertainment News
C. Finance News
D. Health News
E. Politics News
F. Society News
G. Sport News
H. Technology News
I. Travel News
J. Chinese Stop Words
參考文獻 References
[1] “Jieba Chinese text segmentation.” https://github.com/fxsjy/jieba.
[2] “Yahoo news.” https://tw.news.yahoo.com/.
[3] J. Bell, “The most anticipated, and beautifully designed, museums opening in 2017,” 2017. http://edition.cnn.com/2017/01/05/arts/new-museums-openingin-2017/.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, Vol. 3, pp. 993-1022, 2003.
[5] A. Broder, M. Fontoura, V. Josifovski, and L. Riedel, “A semantic approach to contextual advertising,” Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, USA, pp. 559-566, 2007.
[6] N. Cancedda, E. Gaussier, C. Goutte, and J.-M. Renders, “Word-sequence kernels,” Journal of Machine Learning Research, Vol. 3, pp. 1059-1082, 2003.
[7] C.-H. Chan, A. Sun, and E.-P. Lim, “Automated online news classification with personalization,” Proceedings of the 4th International Conference of Asian Digital Library (ICADL2001), Bangalore, India, pp. 320-329, Dec. 2001.
[8] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, Vol. 2, No. 3, Article 27, pp. 1-27, 2011.
[9] T. G. Dietterich, “Ensemble methods in machine learning,” Multiple Classifier Systems, Vol. 1857, pp. 1-15, 2000.
[10] H. Drucker, D. Wu, and V. N. Vapnik, “Support vector machines for spam categorization,” IEEE Transactions on Neural Networks, Vol. 10, No. 5, pp. 1048-1054, 1999.
[11] T. Dunning, “Accurate methods for the statistics of surprise and coincidence,” Computational Linguistics, Vol. 19, No. 1, pp. 61-74, 1993.
[12] Y. S. Huang and C. Y. Suen, “The behavior-knowledge space method for combination of multiple classifiers,” Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '93), pp. 347-352, June 1993.
[13] H. Kim, P. Howland, and H. Park, “Dimension reduction in text classification with support vector machines,” Journal of Machine Learning Research, Vol. 6, No. 1, pp. 37-53, 2005.
[14] M. Ikonomakis, S. Kotsiantis, and V. Tampakas, “Text classification using machine learning techniques,” WSEAS Transactions on Computers, Vol. 4, No. 8, pp. 966-974, 2005.
[15] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” Proceedings of the 10th European Conference on Machine Learning (ECML), Chemnitz, Germany, pp. 137-142, 1998.
[16] E. Leopold and J. Kindermann, “Text categorization with support vector machines. How to represent texts in input space?,” Machine Learning, Vol. 46, pp. 423-444, 2002.
[17] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, “Text classification using string kernels,” Journal of Machine Learning Research, Vol. 2, No. 2, pp. 419-444, 2002.
[18] A. McCallum and K. Nigam, “A comparison of event models for naive Bayes text classification,” Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, pp. 41-48, 1998.
[19] K. R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An introduction to kernel-based learning algorithms,” IEEE Transactions on Neural Networks, Vol. 12, No. 2, pp. 181-201, 2001.
[20] A. K. Nassirtoussi, S. Aghabozorgi, T. Y. Wah, and D. C. L. Ngo, “Text mining of news-headlines for FOREX market prediction: A multi-layer dimension reduction algorithm with semantics and sentiment,” Expert Systems with Applications, Vol. 42, No. 1, pp. 306-324, 2015.
[21] T. H. Nguyen, K. Shirai, and J. Velcin, “Sentiment analysis on social media for stock movement prediction,” Expert Systems with Applications, Vol. 42, No. 24, pp. 9603-9611, 2015.
[22] W. Nuij, V. Milea, F. Hogenboom, F. Frasincar, and U. Kaymak, “An automated framework for incorporating news into stock trading strategies,” IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 4, pp. 823-835, 2014.
[23] Š. Raudys and F. Roli, “The behavior knowledge space fusion method: Analysis of generalization error and strategies for performance improvement,” Multiple Classifier Systems (T. Windeatt and F. Roli, eds.), Vol. 2709 of Lecture Notes in Computer Science, pp. 55-64, 2003.
[24] G. Salton, A. Wong, and C. S. Yang, “A vector space model for automatic indexing,” Communications of the ACM, Vol. 18, No. 11, pp. 613-620, 1975.
[25] F. Sebastiani, “Machine learning in automated text categorization,” ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47, 2002.
[26] M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for classification tasks,” Information Processing and Management, Vol. 45, No. 4, pp. 427-437, 2009.
[27] H. C. Tu, A Text-Mining Approach to the Authorship Attribution Problem of Dream of the Red Chamber. National Taiwan University, 2014.
[28] G. Valentini and F. Masulli, “Ensembles of learning machines,” Italian Workshop on Neural Nets, Heidelberg, Germany, pp. 3-20, 2002.
[29] Y. Yang, “A study of thresholding strategies for text categorization,” Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, USA, pp. 137-145, 2001.
[30] Y. Yang and J. O. Pedersen, “A comparative study on feature selection in text classification,” Proceedings of the 14th International Conference on Machine Learning (ICML), Nashville, Tennessee, USA, pp. 412-420, 1997.
[31] W. Zhang, T. Yoshida, and X. Tang, “A comparative study of TF*IDF, LSI and multi-words for text classification,” Expert Systems with Applications, Vol. 38, No. 3, pp. 2758-2765, 2011.
電子全文 Fulltext
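This electronic full text is licensed to users solely for academic research, for personal, non-profit searching, reading, and printing. Please comply with the Copyright Act of the Republic of China: do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.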
論文使用權限 Thesis access permission:自定論文開放時間 user define
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
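Public-access information for printed theses is relatively complete from academic year 102 onward. To look up public-access information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.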
開放時間 available 已公開 available
