博碩士論文 etd-0816110-145751 詳細資訊
Title page for etd-0816110-145751
論文名稱
Title
隨新進文件學習之漸進式意向模式
Incremental Aspect Model Learning on Streaming Documents
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
68
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2010-07-27
繳交日期
Date of Submission
2010-08-16
關鍵字
Keywords
意向模型、機率式潛在語意索引、漸進式分群
Incremental Clustering, Probabilistic Latent Semantic Indexing, Aspect Model
統計
Statistics
本論文已被瀏覽 5861 次,被下載 1481
The thesis/dissertation has been browsed 5861 times, has been downloaded 1481 times.
中文摘要
隨著網際網路的發展,日益龐大的資料使得使用者必須尋求輔助工具來幫助他們更得到有用資訊。資訊擷取技術就是其中一種主要的工具可以幫助使用者減輕他們於資訊處理時的負擔。然而,現今的資訊擷取技術並沒有辦法處理在動態環境(如網際網路)下文件隨時更新的狀況。過去的方法是只要有新資料就必須重新建立模式才可應用,但是這種方法既不實際、也不具效率、更耗費太大的成本。所以發展可以處理文件隨時更新的資訊擷取技術成為一個重要的研究議題。
因此本研究提出一個資訊擷取相關技術,遞增式意向模型,可以隨著時間文件的更新挖掘資料中蘊含的潛在意向。遞增式意向模式包含二個階段:第一階段是藉由機率式潛在語意索引方法來建立一個起始的意向模型;第二階段則隨著時間文件的收集將舊資料移除,融入新資料,並持續更新現有的意向模型。當有顯著的意向出現時即代表新的概念被發現。
我們提出三個實驗來驗證我們所提的方法。前二個實驗是檢驗遞增式意向模式處理文件分群的能力,實驗結果顯示遞增式意向模式不僅有好的分群表現,同時也具有穩健性能挖掘新的意向。第三個實驗是檢驗遞增式意向模式追蹤故事的能力,我們舉前一陣最熱門的話題之一「2010世界杯足球賽」來分析,結果顯示遞增式意向模式在一段時間內確實可以挖掘圍繞世足杯事件不同的主題。這些實驗的結果驗證遞增式意向模式於實務運用的可行性。
Abstract
With the growth of the Internet, the sheer volume of online data drives users to rely on tools that help them obtain the useful information they want. Information retrieval (IR) techniques are among the major tools that ease users' information-processing load. However, most current IR models do not handle streaming information, which is characteristic of today's Web environment. Rebuilding a model from all the data at hand every time new information arrives is impractical, inefficient, and costly. Instead, IR models that adapt incrementally to streaming information should be considered in such a dynamic environment.
This research therefore proposes an IR-related technique, the incremental aspect model (ISM), which not only uncovers latent aspects from the collected documents but also adapts the aspect model to streaming documents chronologically. ISM has two stages: in Stage I, we employ the probabilistic latent semantic indexing (PLSI) technique to build a primary aspect model; in Stage II, out-of-date data are removed and new data are folded in, and the aspect model can be expanded using the derived spectral method if significant new aspects exist.
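As a rough illustration of the two stages, the following is a minimal pure-Python sketch of standard PLSI fitted with EM, plus the usual folding-in step in which P(w|z) is held fixed and only P(z|d) is estimated for a new document. It is not the thesis's implementation: the function names, the toy count matrix, and the iteration count are assumptions made for this example, and neither the removal of out-of-date data nor the spectral expansion to new aspects is shown.

```python
import random


def normalize(values):
    """Scale non-negative numbers so they sum to 1 (uniform if they sum to 0)."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsi_em(counts, n_aspects, n_iter=50, seed=0, p_w_z=None):
    """Fit PLSI by EM on a document-term count matrix (hypothetical helper).

    counts[d][w] is the count of term w in document d.
    If p_w_z is supplied, it is kept fixed (folding-in) and only P(z|d)
    is estimated for the given documents.
    Returns (p_w_z, p_z_d) with p_w_z[z][w] = P(w|z), p_z_d[d][z] = P(z|d).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    fold_in = p_w_z is not None
    if not fold_in:
        p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
                 for _ in range(n_aspects)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_aspects)])
             for _ in range(n_docs)]

    for _ in range(n_iter):
        new_wz = [[0.0] * n_words for _ in range(n_aspects)]
        new_zd = [[0.0] * n_aspects for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(w | z) * P(z | d).
                post = normalize([p_w_z[z][w] * p_z_d[d][z]
                                  for z in range(n_aspects)])
                # Accumulate expected counts for the M-step.
                for z in range(n_aspects):
                    new_wz[z][w] += n * post[z]
                    new_zd[d][z] += n * post[z]
        # M-step: renormalize expected counts into probabilities.
        p_z_d = [normalize(row) for row in new_zd]
        if not fold_in:
            p_w_z = [normalize(row) for row in new_wz]
    return p_w_z, p_z_d


# Stage I (sketch): build a primary aspect model from an initial toy collection.
initial_docs = [
    [4, 3, 2, 0, 0, 0],
    [3, 4, 1, 0, 0, 0],
    [0, 0, 0, 2, 4, 3],
    [0, 0, 1, 3, 3, 4],
]
p_w_z, p_z_d = plsi_em(initial_docs, n_aspects=2)

# Stage II (simplified): fold a newly arrived document into the existing aspects.
new_doc = [0, 0, 0, 5, 2, 2]
folded_wz, new_p_z_d = plsi_em([new_doc], n_aspects=2, p_w_z=p_w_z)
print(new_p_z_d[0])  # aspect mixture P(z | new document)
```

Folding-in is cheap because it touches only the new document's mixing weights. It cannot introduce a new aspect on its own, which is why Stage II also needs a separate mechanism for expanding the model when significant new aspects appear.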
Three experiments are conducted to verify ISM. Results from the first two experiments show that ISM performs robustly in incremental text clustering tasks. In Experiment III, ISM performs storyline tracking on the 2010 Soccer World Cup event. The results illustrate ISM's ability to learn incrementally and to discover different themes around the event at any time. The feasibility of the proposed approach in real applications is thus demonstrated.
目次 Table of Contents
1. Introduction 1
1.1 Overview 1
1.2 Objective of the research 2
1.3 Organization of the thesis 3
2. Literature review 4
2.1 Information retrieval 4
2.1.1 IR Term-weighting Schemes 5
2.1.2 Classic IR model 6
(1) Boolean model 6
(2) Vector Space model 6
(3) Probabilistic model 7
2.1.3 LSI and PLSI models 8
(1) LSI 8
(2) PLSI 10
2.2 Text mining 12
2.2.1 Text Categorization 13
2.2.2 Text Clustering 14
2.3 Incremental Clustering Techniques 17
(1) Incremental PLSI 17
(2) Incrementally built aspect model 18
3. Proposed approach 20
3.1 Initial Aspect Model Building Stage 20
Step1 Data Collecting 21
Step2 Document Preprocessing 21
(1) Part-of-Speech Tagging 22
(2) Stemming 23
(3) Term weighting 24
Step3 Primary Aspect Model Building 24
Step4 Topic Interpreting 26
3.2 Incremental Aspect Model Updating Stage 27
Step1 Old Data Discarding 29
Step2 New Data Handling 29
Step3 Incremental Aspect Model Building 30
4. Experiments and Results 33
4.1 Experimental Design 33
4.2 Experiment I 36
4.3 Experiment II 38
4.4 Experiment III 43
5. Conclusions 47
5.1 Concluding Remarks 47
5.2 Future Work 48
Reference 50
Appendix I 55
Appendix II 56
參考文獻 References
Andrews, N. O., and Fox, E. A. (2007). Recent Developments in Document Clustering. Technical Report TR-07-35, Department of Computer Science, Virginia Tech.
Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison-Wesley Longman, Boston, Mass.
Basu, A., Walters, C., and Shepherd, M. (2003). Support vector machines for text categorization. Proceedings of the 36th Annual Hawaii International Conference on System Sciences, 2003, 7.
Beil, F., Ester, M., and Xu, X. (2002). Frequent term-based text clustering. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 436-442.
Berry, M. W., Dumais, S. T., and O'Brien, G. W. (1995). Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), 573-595.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993-1022.
Brants, T., Chen, F., and Tsochantaridis, I. (2002). Topic-based document segmentation with probabilistic latent semantic analysis. Proceedings of the Eleventh International Conference on Information and Knowledge Management, 218.
Chou, T., and Chen, M. Ch. (2008). Using Incremental PLSI for Threshold-Resilient Online Event Analysis. IEEE Transactions on Knowledge and Data Engineering, 20(3).
Coburn, A. (2008). Lingua::EN::Tagger-Part-of-speech tagger for English natural language processing. Available from http://search.cpan.org/dist/Lingua-EN-Tagger/Tagger.pm.
Dasgupta, A., Drineas, P., Harb, B., Josifovski, V., and Mahoney, M. W. (2007). Feature selection methods for text classification. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 239.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 1-38.
Feldman, R., and Sanger, J. (2007). The text mining handbook: Advanced approaches in analyzing unstructured data. Cambridge University Press, Computational Linguistics, 34(1).
Foltz, P. W., and Dumais, S. T. (1992). Personalized information delivery: An analysis of information filtering methods. Communications of the ACM, 35(12), 51-60.
Globerson, A., and Tishby, N. (2003). Sufficient dimensionality reduction. The Journal of Machine Learning Research, 3, 1307-1331.
Guyon, I., and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, 1157-1182.
Hahn, U., and Mani, I. (2000). The challenges of automatic summarization. IEEE-Computer, 33(11), 29-36.
Hofmann, T. (1999). Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 50-57.
Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1), 177-196.
Hofmann, T. (2004). Latent semantic models for collaborative filtering. ACM Transactions on Information Systems (TOIS), 22(1), 115.
Jin, X., Zhou, Y., and Mobasher, B. (2004). Web usage mining based on probabilistic latent semantic analysis. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 197-205.
Kakkonen, T., Myller, N., Sutinen, E., and Timonen, J. (2008). Comparison of dimension reduction methods for automated essay grading. Journal of Educational Technology and Society, 11(3), 275-288.
Kim, Y. M., Pessiot, J. F., Amini, M. R., and Gallinari, P. (2008). An extension of PLSA for document clustering. Proceeding of the 17th ACM Conference on Information and Knowledge Management, 1345-1346.
Kim, Y. S., Chang, J. H., and Zhang, B. T. (2002). A comparative evaluation of data-driven models in translation selection of machine translation. Proceedings of the 19th International Conference on Computational Linguistics-Volume 1, 7.
Landauer, T. K., and Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240.
Li, C. H., and Park, S. C. (2007). Neural network for text classification based on singular value decomposition. Proceedings of 7th IEEE International Conference on Computer and Information Technology, 47-52.
McCallum, A., and Nigam, K. (1998). A comparison of event models for naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization.
Merkl, D., and Rauber, A. (2000). Document classification with unsupervised artificial neural networks. In Soft Computing in Information Retrieval: Techniques and Applications, 50, 102-121.
Pilászy, I. (2005). Text categorization and support vector machines. Proceedings of the 6th International Symposium of Hungarian Researchers on Computational Intelligence.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.
Praks, P., Dvorský, J., and Snášel, V. (2003). Latent semantic indexing for image retrieval systems. Proceedings of SIAM International Conference on Applied Linear Algebra.
Remeikis, N., Skučas, I., and Melninkaitė, V. (2004). Text categorization using neural networks initialized with decision trees. Informatica, 15(4), 551-564.
Robertson, S. E., and Jones, K. S. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129-146.
Rosell, M. (2009). Introduction to information retrieval and text clustering. KTH CSC.
Ruge, G. (1997). Automatic detection of thesaurus relations for information retrieval applications. Foundations of Computer Science, 499-506.
Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 620.
Saravanan, M., and Raman, S. (2002). The term distribution model for summarization of multiple documents. Proceedings of the Indo European Conference on Multilingual Communication Technologies (IEMCT 2002), 182–192.
Sebastiani, F. (2005). Text categorization. Text Mining and its Application to Intelligence, CRM and Knowledge Management.
Song, W., and Park, S. C. (2007). A novel document clustering model based on latent semantic analysis. Third International Conference on Semantics, Knowledge and Grid, 539-542.
Steinbach, M., Karypis, G., and Kumar, V. (2000). A comparison of document clustering techniques. KDD Workshop on Text Mining, 400, 525-526.
Surendran, A., and Sra, S. (2006). Incremental aspect models for mining document streams. Proceedings of 17th European Conference on Knowledge Discovery in Databases: PKDD 2006, 633-640.
Wang, V. M. C. J., and Banerjee, S. (2005). A neuro-SVM model for text classification using latent semantic indexing. Proceedings of International Joint Conference on Neural Networks, 564-569.
Wei, J., Bressan, S., and Ooi, B. C. (2002). Mining term association rules for automatic global query expansion: Methodology and preliminary results. Proceedings of the 1st International Conference of Web Information Systems Engineering, 366-373.
Zamir, O., and Etzioni, O. (1998). Web document clustering: A feasibility demonstration. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 46-54.
Zelikovitz, S., and Hirsh, H. (2001). Using LSI for text classification in the presence of background text. Proceedings of the 10th International Conference on Information and Knowledge Management, 118.
Zelikovitz, S., and Hirsh, H. (2004). Transductive LSI for short text classification problems. Proceedings of the Seventeenth International FLAIRS Conference, 67–72.
Zhang, H. P., Xu, H. B., Bai, S., Wang, B., and Cheng, X. Q. (2004). Experiments in TREC 2004 novelty track at CAS-ICT. NIST Special Publication 500-261: The Thirteenth Text REtrieval Conference (TREC 2004).
電子全文 Fulltext
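This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.
論文使用權限 Thesis access permission: available on campus immediately, available off campus after one year (off campus withheld)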
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
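Availability information for printed theses is relatively complete from academic year 102 onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the library and information office. We apologize for any inconvenience.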
開放時間 available 已公開 available