論文使用權限 Thesis access permission: withheld (released both on and off campus after one year)
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available
論文名稱 Title |
以正範例與未分類範例為學習資料之混合式文件分類技術 An Ensemble Approach for Text Categorization with Positive and Unlabeled Examples |
系所名稱 Department |
|||
畢業學年期 Year, semester |
語文別 Language |
||
學位類別 Degree |
頁數 Number of pages |
53 |
研究生 Author |
|||
指導教授 Advisor |
|||
召集委員 Convenor |
|||
口試委員 Advisory Committee |
|||
口試日期 Date of Exam |
2005-07-20 |
繳交日期 Date of Submission |
2005-07-29 |
關鍵字 Keywords |
正範例、文件分類、文件探勘、單一類別分類、未分類範例、混合式方法 Single-Class Classification, Text Mining, Positive Examples, Text Categorization, Unlabeled Examples, Ensemble Approach |
統計 Statistics |
本論文已被瀏覽 5742 次,被下載 2043 次 This thesis has been browsed 5742 times and downloaded 2043 times. |
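中文摘要 Chinese Abstract |
Text categorization techniques automatically learn a classification model from pre-categorized training examples and use that model to assign uncategorized documents to the correct categories. In the traditional two-class setting, the training examples must include both positive and negative examples. In many real-world situations, however, obtaining negative examples is very costly, whereas positive and unlabeled examples are much easier to obtain. This study therefore addresses the limitations of existing algorithms that learn only from positive and unlabeled examples by proposing a hybrid framework based on the ensemble concept, with spam email filtering as the evaluation case. The empirical evaluation results show that the proposed approach achieves more stable and reliable classification results than the PNB and PEBL algorithms. |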
Abstract |
Text categorization is the process of assigning new documents to predefined document categories on the basis of one or more classification models induced from a set of pre-categorized training documents. In a typical dichotomous classification scenario, the set of training documents includes both positive and negative examples; that is, each of the two categories is associated with training documents. However, in many real-world text categorization applications, positive and unlabeled documents are readily available, whereas acquiring samples of negative documents is extremely expensive or even impossible. In this study, we propose and develop an ensemble approach, referred to as E2, to address the limitations of existing algorithms for learning from positive and unlabeled training documents. Using spam email filtering as the evaluation application, our empirical evaluation results suggest that the proposed E2 technique exhibits more stable and reliable performance than PNB and PEBL. |
目次 Table of Contents |
CHAPTER 1 INTRODUCTION
1.1 BACKGROUND
1.2 RESEARCH MOTIVATION AND OBJECTIVES
1.3 ORGANIZATION OF THE THESIS
CHAPTER 2 LITERATURE REVIEW
2.1 TEXT CATEGORIZATION
2.2 LEARNING FROM POSITIVE AND UNLABELED EXAMPLES
2.2.1 Positive Na |
電子全文 Fulltext |
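This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law. |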
紙本論文 Printed copies |
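Public-availability information for printed theses is relatively complete from academic year 102 onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk at the 圖資處 (Office of Library and Information Services). We apologize for any inconvenience. 開放時間 Available: 已公開 available |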