Thesis/dissertation etd-0729105-110206: detailed record
Title page for etd-0729105-110206
Title
以正範例與未分類範例為學習資料之混合式文件分類技術
An Ensemble Approach for Text Categorization with Positive and Unlabeled Examples
Department
Year, semester
Language
Degree
Number of pages
53
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2005-07-20
Date of Submission
2005-07-29
Keywords
Single-Class Classification, Text Mining, Positive Examples, Text Categorization, Unlabeled Examples, Ensemble Approach
Statistics
This thesis/dissertation has been browsed 5742 times and downloaded 2043 times.
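Abstract (translated from the Chinese)
Text categorization techniques automatically learn a classification model from pre-categorized training examples and use that model to assign uncategorized documents to the correct category. In the traditional two-class setting, the training examples must include both positive and negative examples. In many real-world situations, however, negative examples are costly to obtain, whereas positive and unlabeled examples are much easier to collect. To address the limitations of existing algorithms that learn only from positive and unlabeled examples, this study proposes a hybrid framework based on the ensemble concept and uses spam email filtering as the evaluation case. The empirical evaluation shows that the proposed approach achieves more stable and reliable classification results than the PNB and PEBL algorithms.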
Abstract
Text categorization is the process of assigning new documents to predefined categories on the basis of one or more classification models induced from a set of pre-categorized training documents. In a typical dichotomous classification scenario, the set of training documents includes both positive and negative examples; that is, each of the two categories is associated with training documents. However, in many real-world text categorization applications, positive and unlabeled documents are readily available, whereas acquiring samples of negative documents is extremely expensive or even impossible. In this study, we propose and develop an ensemble approach, referred to as E2, to address the limitations of existing algorithms for learning from positive and unlabeled training documents. Using spam email filtering as the evaluation application, our empirical evaluation results suggest that the proposed E2 technique exhibits more stable and reliable performance than PNB and PEBL.
Table of Contents
CHAPTER 1 INTRODUCTION
1.1 BACKGROUND
1.2 RESEARCH MOTIVATION AND OBJECTIVES
1.3 ORGANIZATION OF THE THESIS

CHAPTER 2 LITERATURE REVIEW
2.1 TEXT CATEGORIZATION
2.2 LEARNING FROM POSITIVE AND UNLABELED EXAMPLES
2.2.1 Positive Na
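Fulltext
The electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast the text without authorization, to avoid violating the law.
Thesis access permission: withheld; open both on and off campus after one year
Available:
Campus: available
Off-campus: available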


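Printed copies
Availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To inquire about the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available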
