Master's/doctoral thesis etd-0626116-121217: detailed information
Title page for etd-0626116-121217
論文名稱
Title
從新聞挖掘社會議題事件之研究
Research on Detecting Emerging Events From News Data
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
51
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2016-07-08
繳交日期
Date of Submission
2016-07-28
關鍵字
Keywords
文字探勘、主題模型、潛在狄氏分配、事件偵測、中文自然語言處理
Topic Model, Online-LDA, Event Detection, Text mining, Chinese Natural Language Processing
統計
Statistics
本論文已被瀏覽 6008 次,被下載 377
This thesis/dissertation has been browsed 6008 times and downloaded 377 times.
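中文摘要 Chinese Abstract
In today's Internet era, information such as online news is generated on the web continuously over time. Extracting important events from these large data streams and identifying trends in important social issues is therefore an important research topic. To address this problem, this study proposes a method that combines Chinese text mining with topic modelling to automatically detect social events from publicly available online news.
To validate the proposed method, we take the well-known online news site 「蘋果日報」 (Apple Daily) as an example and use the keyword 「陸客」 (tourists from mainland China) to collect news related to mainland Chinese tourists from 2008 to 2015. We first preprocess the raw data with Chinese natural language processing methods, then model it with the Online Latent Dirichlet Allocation (Online-LDA) topic model to find the time periods that show relatively large changes over time, and extract the new events that occurred in them. We designed an experiment to verify the correctness of the new events found by our method; the results show that the method can effectively detect new events in newly arriving Chinese news.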
Abstract
The Internet now provides highly diverse information, and enormous amounts of it, such as online news, are generated continuously. The rapid growth of online news makes it difficult to identify new and emerging events manually. To address this problem, we propose an approach that uses text mining techniques and topic modelling to automatically detect new events from online Chinese news sources.

To evaluate our method, we build our dataset by scraping the Chinese news website “AppleDaily” from 2008 to 2015, keeping news articles that contain a keyword related to tourists from China. We preprocess the raw data with a Chinese natural language processing tool and use the Online-LDA topic model to find new events. Finally, we conduct an experiment to measure the performance of the proposed method. The experimental results show that our online event detection method is effective at detecting and tracking new events in Chinese news as it arrives in streams.
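This record does not spell out how the thesis measures topic change over time, but the table of contents lists Jensen–Shannon divergence (Section 3.4) after the Online LDA step. As a rough illustration only, the sketch below computes the Jensen–Shannon divergence between word distributions of consecutive time slices and flags slices whose divergence exceeds a threshold. The function names, the example distributions, and the threshold are all hypothetical and are not taken from the thesis.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions.

    p and q are equal-length sequences of probabilities that each sum to 1.
    The result lies in [0, 1]: 0 for identical distributions, 1 for disjoint ones.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution M = (P + Q) / 2
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute 0
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def flag_changed_slices(slices, threshold):
    """Return indices i (i >= 1) where slice i diverges from slice i-1 by more than threshold."""
    return [
        i
        for i in range(1, len(slices))
        if js_divergence(slices[i - 1], slices[i]) > threshold
    ]


# Hypothetical word distributions of one topic in three consecutive time slices
# (assumption: illustrative numbers, not data from the thesis).
topic_over_time = [
    [0.5, 0.3, 0.2],
    [0.5, 0.3, 0.2],
    [0.1, 0.1, 0.8],
]
changed = flag_changed_slices(topic_over_time, threshold=0.1)
```

Using log base 2 keeps the divergence bounded between 0 and 1, which makes a fixed threshold easier to interpret. In practice the threshold would need to be tuned on the data.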
目次 Table of Contents
CHAPTER 1 – Introduction
1.1 Background
1.2 Motivation
CHAPTER 2 – Related Work
2.1 Chinese Natural Language Processing
2.2 Topic Model: Online LDA
2.3 Event Detection and Tracking Systems
CHAPTER 3 – The Proposed Approach
3.1 Research Skeleton
3.2 Data Preprocessing and News Schema Construction
3.3 Online LDA clustering
3.4 Jensen–Shannon Divergence
3.5 Emerging Terms Extracting
3.6 Emerging Terms Clustering
3.7 New Event Detecting
CHAPTER 4 – Dataset and Experiment
4.1 Dataset
4.2 Dataset Preprocessing
4.3 Event Detection
4.4 Experiment Settings
CHAPTER 5 – Evaluation Results
5.1 Experiment Result
5.2 Discussion
CHAPTER 6 – Conclusion
References
參考文獻 References
Allan, J., Papka, R., & Lavrenko, V. (1998). On-line New Event Detection and Tracking. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 37–45. http://doi.org/10.1.1.45.9162
AlSumait, L., Barbará, D., & Domeniconi, C. (2008). On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. Proceedings - IEEE International Conference on Data Mining, ICDM, 3–12. http://doi.org/10.1109/ICDM.2008.140
Becker, H. (2011). Identification and Characterization of Events in Social Media.
Blei, D. M., & Lafferty, J. D. (2006). Dynamic Topic Models. International Conference on Machine Learning, 113–120. http://doi.org/10.1145/1143844.1143859
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(4-5), 993–1022. http://doi.org/10.1162/jmlr.2003.3.4-5.993
Cataldi, M., Di Caro, L., & Schifanella, C. (2010). Emerging Topic Detection on Twitter based on Temporal and Social Terms Evaluation. MDMKDD ’10, 1–10. http://doi.org/10.1145/1814245.1814249
Culotta, A. (2010). Towards detecting influenza epidemics by analyzing Twitter messages. 1st Workshop on Social Media Analytics, (May), 115–122. http://doi.org/10.1145/1964858.1964874
Diao, Q. (2012). Finding Bursty Topics From Microblogs, (July), 8–14.
Fung, G. P. C., Yu, J. X., Yu, P. S., & Lu, H. (2005). Parameter free bursty events detection in text streams. VLDB ’05 Proceedings of the 31st International Conference on Very Large Data Bases, 1, 181–192. http://doi.org/10.1.1.60.2671
He, Q., Chang, K., & Lim, E.-P. (2007). Analyzing feature trajectories for event detection. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR ’07, 207. http://doi.org/10.1145/1277741.1277779
Hoffman, M. D., Blei, D. M., & Bach, F. (2010). Online Learning for Latent Dirichlet Allocation. Advances in Neural Information Processing Systems, 23, 1–9. http://doi.org/10.1145/1835804.1835928
Kummerfeld, J. K., Tse, D., Curran, J. R., & Klein, D. (2013). An Empirical Examination of Challenges in Chinese Parsing. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 98–103.
Blei, D. M., & Lafferty, J. D. (2006). Correlated Topic Models. Advances in Neural Information Processing Systems 18, 147–154.
Landauer, T. K., & Dumais, S. T. (1997). A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104(2), 211–240. http://doi.org/10.1037/0033-295X.104.2.211
Lau, J., Collier, N., & Baldwin, T. (2012). On-line Trend Analysis with Topic Models: #twitter Trends Detection Topic Model Online. International Conference on Computational Linguistics (COLING), 2(December), 1519–1534. Retrieved from https://www.aclweb.org/anthology/C/C12/C12-1093.pdf
Levy, R., & Manning, C. (2003). Is it harder to parse Chinese, or the Chinese Treebank? Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - ACL ’03, 1, 439–446. http://doi.org/10.3115/1075096.1075152
Lin, J. (1991). Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory, 37(1), 145–151.
Ma, W.-Y., & Chen, K.-J. (2003). Introduction to CKIP Chinese word segmentation system for the first international Chinese Word Segmentation Bakeoff. Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, 17, 168–171. http://doi.org/10.3115/1119250.1119276
Osborne, M., Petrovic, S., & McCreadie, R. (2012). Bieber no more: First Story Detection using Twitter and Wikipedia. SIGIR 2012 Workshop on Time-Aware Information Access, (June).
Peng, F., Feng, F., & McCallum, A. (2004). Chinese Segmentation and New Word Detection using Conditional Random Fields. Proceedings of Coling 2004: The 20th International Conference on Computational Linguistics, 562–568. http://doi.org/10.3115/1220355.1220436
Petrović, S., Osborne, M., & Lavrenko, V. (2010). Streaming first story detection with application to twitter. NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference, 181–189. Retrieved from http://www.scopus.com/inward/record.url?eid=2-s2.0-80053272732&partnerID=tZOtx3y1
Qian, X., & Liu, Y. (2012). Joint Chinese word segmentation, POS tagging and parsing. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), (July), 501–511. Retrieved from http://dl.acm.org/citation.cfm?id=2391007
Reuter, T., Papadopoulos, S., Petkos, G., Mezaris, V., Kompatsiaris, Y., Cimiano, P., … Geva, S. (2013). Social event detection at MediaEval 2013: Challenges, datasets, and evaluation. CEUR Workshop Proceedings, 1043, 1–2.
Sun, W. (2010). Word-based and Character-based Word Segmentation Models: Comparison and Combination. Coling 2010: Posters, (August), 1211–1219. Retrieved from http://www.aclweb.org/anthology/C10-2139
Wang, M., Voigt, R., & Manning, C. D. (2014). Two Knives Cut Better Than One: Chinese Word Segmentation with Dual Decomposition. ACL, 193–198. Retrieved from http://www.aclweb.org/anthology/P/P14/P14-2032
Xue, N. (2003). Chinese Word Segmentation as Character Tagging. Computational Linguistics and Chinese Language Processing, 8(1), 29–48. http://doi.org/10.3115/1119250.1119278
Yang, Y., Pierce, T., & Carbonell, J. (1998). A Study of Retrospective and On-line Event Detection. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR ’98, 28–36. http://doi.org/10.1145/290941.290953
Zhang, Y., & Clark, S. (2007). Chinese Segmentation with a Word-Based Perceptron Algorithm. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, (June), 840–847. Retrieved from http://www.aclweb.org/anthology/P07-1106
Zhao, W., Shu, B., Jiang, J., & Song, Y. (2012). Identifying event-related bursts via social media activities. Proceedings of the 2012 …, (July), 1466–1477. Retrieved from http://dl.acm.org/citation.cfm?id=2391116
電子全文 Fulltext
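This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.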
論文使用權限 Thesis access permission: 自定論文開放時間 user-defined release date
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
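Public-access information for printed theses is relatively complete from academic year 102 onward. To look up public-access information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.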
開放時間 available 已公開 available
