博碩士論文 etd-0624118-155432 詳細資訊
Title page for etd-0624118-155432
論文名稱
Title
跨語言主題模型分析之研究
A Research On Cross-Lingual Topic Analysis
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
63
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2018-07-23
繳交日期
Date of Submission
2018-07-24
關鍵字
Keywords
跨語言主題模型、文字向量空間、多語言對應文本、多語言主題模型、主題模型
Cross-lingual topic model, Topic modeling, Polylingual topic model, Parallel corpus, Word vector space, LDA
統計
Statistics
本論文已被瀏覽 6137 次,被下載 794 次。
The thesis/dissertation has been browsed 6137 times, has been downloaded 794 times.
中文摘要 Chinese abstract
Most prior work on cross-lingual topics is built on multilingual corpora, with the Polylingual Topic Model published by Mimno being the most representative. However, such cross-lingual topic models are all constrained by the corpus structure: their performance degrades as the proportion of aligned articles across languages in the corpus decreases. Aligned multilingual collections, such as the European Parliament proceedings or Hong Kong government announcements, provide the same content in several language versions, but such resources are hard to obtain, and their article types and volume are scarce compared with ordinary text. As for extracting the topics of each language, relying on machine translation or human translators is not only time-consuming and costly, but domain-specific vocabulary also affects translation accuracy.
The topics people discuss differ from region to region, yet previous polylingual topic model research can only extract the topics that the languages discuss in common. The method proposed in this thesis constructs a cross-lingual topic model using three techniques for mapping word vector spaces across languages. It requires no aligned multilingual corpus, breaking the limitation of previous polylingual topic models; besides performing comparably to Mimno's polylingual topic model on shared multilingual topics, it can also effectively extract topics discussed in only a single language.
Abstract
Most cross-lingual topic models in previous work rely on a parallel or comparable corpus. The polylingual topic model (PLTM) proposed by Mimno et al. (2009) is the most representative one. However, parallel or comparable corpora such as Europarl and Wikipedia are not available in many cases. In this thesis, we propose a method that combines cross-lingual word vector space mapping with topic modeling (LDA): the mapping aligns the word vector spaces of different languages, and LDA groups words into topics. Combining the two techniques yields our cross-lingual topic model.
In contrast to PLTM, our approach needs no comparable or parallel corpus to construct the cross-lingual topic model, and it can identify topics discussed in only a single language.
We compare the performance of PLTM and our approach on the UM-Corpus (Tian et al., 2014), an English-Chinese bilingual corpus. The evaluation results show that our approach aligns topics across languages properly, with performance comparable to that of PLTM.
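The method sketched in the abstract rests on mapping one language's word vector space into another's over a seed dictionary of translation pairs. As an illustrative sketch only (minimal NumPy code, not the thesis implementation), the two closed-form mappings named in the outline, linear projection by least squares (Section 3.2.1) and an orthogonal transformation by SVD (Section 3.2.3), can be written as:

```python
import numpy as np

def least_squares_map(X_src, X_tgt):
    """Linear projection: the W minimizing ||X_src @ W - X_tgt||_F^2,
    where rows of X_src/X_tgt are embeddings of seed translation pairs."""
    W, *_ = np.linalg.lstsq(X_src, X_tgt, rcond=None)
    return W

def orthogonal_map(X_src, X_tgt):
    """Orthogonal (Procrustes) solution via SVD: W = U @ Vt,
    with U, S, Vt = svd(X_src.T @ X_tgt); W preserves lengths and angles."""
    U, _, Vt = np.linalg.svd(X_src.T @ X_tgt)
    return U @ Vt

# Toy seed dictionary: 5 translation pairs in 3-dimensional embedding
# spaces, where the target space is an exact rotation of the source space.
rng = np.random.default_rng(0)
X_src = rng.standard_normal((5, 3))
W_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # hidden orthogonal map
X_tgt = X_src @ W_true

W = orthogonal_map(X_src, X_tgt)
print(np.allclose(W, W_true))  # the SVD solution recovers the hidden map
```

In practice `X_src` and `X_tgt` would hold embeddings for a bilingual seed lexicon; the orthogonal variant preserves cosine similarities between word vectors, which is why it is often preferred for cross-lingual retrieval.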
目次 Table of Contents
TABLE OF CONTENTS

論文審定書 i
摘要 ii
Abstract iii

CHAPTER 1 – Introduction 1
CHAPTER 2 – Related Work 7
2.1 Cross-lingual Topic Model 7
2.2 Cross-lingual Word Representation 9
2.3 Topic Model with Word Representation 12
CHAPTER 3 – Our Approach 15
3.1 Word representation 16
3.2 Word vector mapping method 16
3.2.1 Linear Projection by Least Squares 17
3.2.2 Linear Projection with CCA 18
3.2.3 Orthogonal Transformations by SVD 19
3.3 Cross-Lingual Topic Model (CLTM) 20
CHAPTER 4 – Experiments 23
4.1 Data Collection 23
4.2 Word representation for each language 24
4.3 Mapping word vectors across languages 25
4.4 Topic number setting 28
4.5 Cross-lingual Topic Model (CLTM) 32
4.6 Experiment Design 35
4.7 Experimental Result 36
4.7.1 Entropy of each topic model 36
4.7.2 Jensen Shannon Divergence (JSD) of document topic distribution 38
4.7.3 Word coherence of topic 40
CHAPTER 5 – Conclusion 45
5.1 Future work 45
Reference 47
Appendix 53
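Section 4.7.2 above evaluates models by the Jensen-Shannon divergence between document topic distributions (Lin, 1991). As an illustrative sketch (not the thesis code), the symmetric divergence, bounded by log 2 in nats, can be computed as:

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5*KL(p||m) + 0.5*KL(q||m) with m = (p+q)/2.
    Symmetric, 0 iff p == q, and at most log(2) in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # 0*log(0) is taken as 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical topic distributions give 0; disjoint ones give log(2).
print(jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5]))
print(round(jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0]), 6))  # 0.693147
```

A low JSD between the topic distributions of a document and its translation indicates that the model assigns aligned topics across the two languages.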
參考文獻 References
References
1. Ammar, W., Mulcaire, G., Tsvetkov, Y., Lample, G., Dyer, C., & Smith, N. A. (2016). Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.
2. Banerjee, A., Dhillon, I. S., Ghosh, J., & Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(Sep), 1345-1382.
3. Batmanghelich, K., Saeedi, A., Narasimhan, K., & Gershman, S. (2016). Nonparametric spherical topic modeling with word embeddings. arXiv preprint arXiv:1604.00126.
4. Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., & Gauvain, J.-L. (2006). Neural probabilistic language models. In Innovations in Machine Learning (pp. 137-186). Springer.
5. Blei, D. M., & Jordan, M. I. (2003). Modeling annotated data. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 127-134). https://doi.org/10.1145/860435.860460
6. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
7. Boyd-Graber, J., & Blei, D. M. (2009, June). Multilingual topic models for unaligned text. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (pp. 75-82). AUAI Press.
8. Collobert, R., & Weston, J. (2008, July). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning (pp. 160-167). ACM.
9. Das, R., Zaheer, M., & Dyer, C. (2015). Gaussian lda for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Vol. 1, pp. 795-804).
10. Dhillon, I. S., & Sra, S. (2003). Modeling data using directional distributions. Technical Report TR-03-06, Department of Computer Sciences, The University of Texas at Austin. ftp://ftp.cs.utexas.edu/pub/techreports/tr03-06.ps.gz
11. Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd (Vol. 96, No. 34, pp. 226-231).
12. Faruqui, M., & Dyer, C. (2014). Improving vector space word representations using multilingual correlation. Association for Computational Linguistics.
13. Haghighi, A., Liang, P., Berg-Kirkpatrick, T., & Klein, D. (2008). Learning bilingual lexicons from monolingual corpora. Proceedings of ACL-08: Hlt, 771-779.
14. Hassan, S., & Mihalcea, R. (2011, August). Semantic Relatedness Using Salient Semantic Analysis. In Aaai.
15. Jarmasz, M. (2012). Roget's thesaurus as a lexical resource for natural language processing. arXiv preprint arXiv:1204.0140.
16. Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
17. Klementiev, A., Titov, I., & Bhattarai, B. (2012). Inducing crosslingual distributed representations of words. Proceedings of COLING 2012, 1459-1474.
18. Koehn, P. (2005, September). Europarl: A parallel corpus for statistical machine translation. In MT summit (Vol. 5, pp. 79-86).
19. Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information theory, 37(1), 145-151.
20. Liu, X., Duh, K., & Matsumoto, Y. (2015). Multilingual Topic Models for Bilingual Dictionary Extraction. ACM Transactions on Asian and Low-Resource Language Information Processing, 14(3), 11.
21. Lu, A., Wang, W., Bansal, M., Gimpel, K., & Livescu, K. (2015). Deep multilingual correlation for improved word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 250-256).
22. Maaten, L. van der, & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.
23. Mann, G. S., Mimno, D., & McCallum, A. (2006). Bibliometric impact measures leveraging topic analysis. In ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 65-74). https://doi.org/10.1145/1141753.1141765
24. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
25. Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
26. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
27. Mimno, D., Wallach, H. M., Naradowsky, J., Smith, D. A., & McCallum, A. (2009, August). Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 (pp. 880-889). Association for Computational Linguistics.
28. Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011, July). Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 262-272). Association for Computational Linguistics.
29. Moody, C. E. (2016). Mixing Dirichlet topic models and word embeddings to make lda2vec. arXiv preprint arXiv:1605.02019.
30. Ni, X., Sun, J. T., Hu, J., & Chen, Z. (2009, April). Mining multilingual topics from Wikipedia. In Proceedings of the 18th International Conference on World Wide Web (pp. 1155-1156). ACM.
31. Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
32. Prettenhofer, P., & Stein, B. (2010). Cross-language text classification using structural correspondence learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 1118-1127).
33. Röder, M., Both, A., & Hinneburg, A. (2015, February). Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (pp. 399-408). ACM.
34. Smith, S. L., Turban, D. H., Hamblin, S., & Hammerla, N. Y. (2017). Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.
35. Strube, M., & Ponzetto, S. P. (2006, July). WikiRelate! Computing semantic relatedness using Wikipedia. In AAAI (Vol. 6, pp. 1419-1424).
36. Tam, Y.-C., & Schultz, T. (2007). Bilingual LSA-based translation lexicon adaptation for spoken language translation. In Interspeech 2007 (pp. 2461-2464).
37. Tian, L., Wong, D. F., Chao, L. S., Quaresma, P., Oliveira, F., & Yi, L. (2014). UM-Corpus: A large English-Chinese parallel corpus for statistical machine translation. In LREC (pp. 1837-1842).
38. Titov, I., & McDonald, R. (2008, April). Modeling online reviews with multi-grain topic models. In Proceedings of the 17th International Conference on World Wide Web (pp. 111-120). ACM.
39. Wei, X., & Croft, W. B. (2006). LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 178-185). https://doi.org/10.1145/1148170.1148204
40. Xiao, M., & Guo, Y. (2013). Semi-supervised representation learning for cross-lingual text classification. In EMNLP (pp. 1465-1475).
41. Xing, C., Wang, D., Liu, C., & Lin, Y. (2015). Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1006-1011).
42. Zhao, B., & Xing, E. P. (2006). BiTAM: Bilingual topic admixture models for word alignment. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions (pp. 969-976). https://doi.org/10.3115/1273073.1273197
43. Zhou, G., He, T., & Zhao, J. (2014). Bridging the language gap: Learning distributed semantics for cross-lingual sentiment classification. In NLPCC 2014 (pp. 138-149). Springer.
44. Zhou, H., Chen, L., Shi, F., & Huang, D. (2015). Learning bilingual sentiment word embeddings for cross-language sentiment classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 430-440).
45. Zhou, X., Wan, X., & Xiao, J. (2016). Cross-lingual sentiment classification with bilingual document representation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016) (pp. 1403-1412).
46. Zou, W. Y., Socher, R., Cer, D., & Manning, C. D. (2013). Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1393-1398).
電子全文 Fulltext
This electronic full text is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Public-access information for printed theses is relatively complete from academic year 102 onward. To inquire about printed theses from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
