博碩士論文 etd-0619117-145750 詳細資訊
Title page for etd-0619117-145750
論文名稱
Title
基於矩陣分解的主題推薦與發現
Topic Recommendation and Discovery based on Matrix Factorization
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
49
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2017-07-13
繳交日期
Date of Submission
2017-07-19
關鍵字
Keywords
主題發現、非負矩陣分解、推薦、主題模型
Non-negative Matrix Factorization, Topic Discovery, Recommendation, Topic Modeling
統計
Statistics
本論文已被瀏覽 5953 次,被下載 122 次。
The thesis/dissertation has been browsed 5953 times, has been downloaded 122 times.
中文摘要
隨著網路的發達,現在在網路上有越來越多的文件,因為很多的資訊都與文字息息相關,像是新聞或文章。因此,有很多的學者利用這些文件來做文字分析。而非負矩陣分解是一種用非機率的方法用來分解文集。在這篇論文中,我們提出利用稀疏限制的非負矩陣分解來做 k 個主題的主題模型。此外,我們想在稀疏限制的非負矩陣分解中加入一個與作者相關的矩陣,並且找出在主題中隱藏的部分。它可以給我們更多的資訊,進而幫助我們找出來的主題更加集中。其中,決定主題個數 k 是一個困難但我們必須解決的問題,所以我們利用互資訊與穩定度來評估主題數 k。它可以提供對於主題數 k 的一個參考。除此之外,我們想要利用 Jensen-Shannon divergence 來找出每個主題在不同時間裡面的詞語的改變。它可以計算出主題之間的距離,並且我們可以利用 Hungarian algorithm 找出不同時間中對應的主題。
Abstract
With the growth of the Internet, more and more text documents, such as news and articles, are available online, because much information is conveyed as text. Researchers have therefore used these documents for text analysis. Non-negative Matrix Factorization (NMF) is a non-probabilistic method for factorizing a document corpus represented as a matrix. In this thesis, we propose using sparsity-constrained NMF (nonsmooth NMF, nsNMF) for topic modeling with k topics. Moreover, we aim to incorporate an author-related matrix into nsNMF in order to find the hidden parts of the topics; this provides additional information and makes the discovered topics more concentrated. Choosing the number of topics k is a critical but difficult problem, so we use mutual information and stability to evaluate k, which gives a reference for choosing it. In addition, we use the Jensen-Shannon divergence to find how the terms of each topic change over time: it measures the distance between topics, and the Hungarian algorithm is then used to match corresponding topics across time periods.
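The final step described in the abstract, matching topics across time periods, can be illustrated with a small sketch. This is not the thesis's code: the function names (`js_divergence`, `hungarian`, `match_topics`), the toy topic-term weights, and the choice of log base 2 are all assumptions made for illustration. Each topic is treated as a non-negative weight vector over a shared vocabulary (for example, a row of an NMF factor), normalized to a probability distribution; the Jensen-Shannon divergence gives the pairwise cost, and a standard O(n^3) Hungarian (Kuhn-Munkres) routine finds the minimum-cost one-to-one matching.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute 0 by convention.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def hungarian(cost):
    """Minimum-cost assignment; returns the column chosen for each row.

    Potentials-based Hungarian algorithm; requires len(cost) <= len(cost[0]).
    """
    n, m = len(cost), len(cost[0])
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently assigned to column j
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [math.inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], math.inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # flip assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def match_topics(topics_a, topics_b):
    """Match each topic in topics_a to one in topics_b by minimum total JSD."""
    norm = lambda t: [w / sum(t) for w in t]
    pa, pb = [norm(t) for t in topics_a], [norm(t) for t in topics_b]
    cost = [[js_divergence(a, b) for b in pb] for a in pa]
    return [(i, j, cost[i][j]) for i, j in enumerate(hungarian(cost))]


# Hypothetical topic-term weights over a 4-term vocabulary, two periods.
topics_period_1 = [[8, 1, 1, 0], [0, 1, 8, 1], [1, 8, 0, 1]]
topics_period_2 = [[0, 2, 7, 1], [2, 7, 0, 1], [7, 1, 1, 1]]
matches = match_topics(topics_period_1, topics_period_2)
for i, j, d in matches:
    print(f"period-1 topic {i} <-> period-2 topic {j} (JSD = {d:.4f})")
```

In practice a library routine such as `scipy.optimize.linear_sum_assignment` could replace the hand-written `hungarian` function. Note that `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance (the square root of the divergence), so its values are not directly comparable to those of `js_divergence` above.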
目次 Table of Contents
1. Introduction 1
2. Background and Related works 4
2.1 LDA 4
2.2 SVD 5
2.3 NMF 6
3. Method 9
3.1 How many topics k? 9
3.2 Nonsmooth Non-negative Matrix Factorization (nsNMF) 12
3.2 nsNMF with constraint 14
3.3 Topic Discovery 15
4. Experiment & Result 17
4.1 Data and Preprocessing 17
4.2 COOL3C news 19
4.2.1 Document-Term Matrix 19
4.2.2 TF-IDF 20
4.2.3 SVD 21
4.2.4 nsNMF 22
4.2.5 How many topics k? 24
4.2.6 Topic Modeling 25
4.2.7 Article Recommendation 27
4.3 arXiv.ML papers 29
4.3.1 How many topics k? 29
4.3.2 Topic Modeling 31
4.3.3 nsNMF with constraint 32
4.3.4 Topic Discovery 33
5. Conclusion 35
6. Reference 37
電子全文 Fulltext
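This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.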
論文使用權限 Thesis access permission:自定論文開放時間 user-defined
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
開放時間 Available: 已公開 available