Detailed record for thesis/dissertation etd-0616118-181354
Title
影像情境對於對話系統的影響
The Impacts of Image Contexts on Dialogue Systems
Department
Year, semester
Language
Degree
Number of pages
61
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2018-06-25
Date of Submission
2018-07-17
Keywords
Image recognition, Natural language, Dialogue, Machine learning, Neural networks, Recurrent neural networks, Convolutional neural networks
Statistics
The thesis/dissertation has been browsed 5961 times and downloaded 0 times.
Chinese Abstract (translated)
Conversing with computers is increasingly common in daily life. Such systems are used not only to execute commands and look up information; they are also expected to provide companionship and entertainment, conversing with people at any time like a real friend. Most past dialogue systems only produced fixed replies to commands, so people could not sense any vitality in them and still treated them as computer systems. To build a more realistic dialogue system, this study trains a neural network on dialogues that are closer to everyday life. Everyday conversation is not just simple question answering: replies are not limited to single nouns or yes/no answers, and should be diverse and interesting. Conversation also responds not only to the words spoken but to the people and things in view, producing replies that fit the situation. This study incorporates both of these elements into a dialogue system, using TV-series subtitles as the data. Besides offering diverse conversational responses, TV series also supply video frames from which the dialogue system can draw additional context.

This study uses a deep neural network that combines a convolutional neural network, which has performed well in image recognition in recent years, with a recurrent neural network, which excels at natural language, to build a dialogue system that considers both images and text. The study examines how much images help during conversation, whether the open-ended dialogue in TV-series subtitles is suitable for training, whether the plausibility of generated responses can be quantified, and, through human evaluation, whether the model can produce responses that people find reasonable.
Abstract
Chatting with machines is not only possible but increasingly common in our daily lives. Through such systems we can execute commands as well as obtain companionship and entertainment. In the past, most dialogue systems only produced canned replies based on the instructions they received, so people could not sense any vitality in the machine and still regarded their chatting partners as computer systems. To develop a more realistic dialogue system, this study adopts a deep neural network trained on more everyday, lifelike dialogues. Everyday conversation involves more than simple question answering: answers are not just a single noun or a yes-or-no response, and there are many diverse and interesting replies. Moreover, suitable responses depend not only on the conversational content but also on the environmental context. This study develops a dialogue system that takes both of these essential factors into account and uses TV series as the training data. These datasets contain not only conversational content but also video frames that capture the context and situation in which each conversation occurs.

To explore the effect of image context on utterances in a dialogue, this work uses a deep neural network model that combines a convolutional neural network (which works well in image recognition) with a recurrent neural network (which performs well on natural language). It aims to develop a dialogue system that considers both images and utterances, and to find better ways to evaluate the system's responses, including defining a quantitative measure and designing a questionnaire to verify the models learned from the datasets.
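The architecture described above (a pretrained CNN supplying image features that condition a recurrent seq2seq dialogue model) can be sketched roughly as follows. This is a minimal, untrained illustration rather than the thesis's actual model: the toy dimensions, the hand-rolled GRU cell, the tiny vocabulary, and the additive fusion of image features into the decoder's initial state are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not taken from the thesis).
IMG_DIM, EMB_DIM, HID_DIM, VOCAB = 512, 32, 64, 6
words = ["<pad>", "<s>", "</s>", "hi", "how", "you"]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Hand-rolled GRU cell, a stand-in for the recurrent units."""
    def __init__(self, in_dim, hid_dim):
        s = 0.1
        self.Wz = rng.normal(0, s, (hid_dim, in_dim + hid_dim))
        self.Wr = rng.normal(0, s, (hid_dim, in_dim + hid_dim))
        self.Wh = rng.normal(0, s, (hid_dim, in_dim + hid_dim))
    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                      # update gate
        r = sigmoid(self.Wr @ xh)                      # reset gate
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

emb = rng.normal(0, 0.1, (VOCAB, EMB_DIM))             # word embeddings
enc, dec = GRUCell(EMB_DIM, HID_DIM), GRUCell(EMB_DIM, HID_DIM)
W_img = rng.normal(0, 0.1, (HID_DIM, IMG_DIM))         # projects CNN features
W_out = rng.normal(0, 0.1, (VOCAB, HID_DIM))           # hidden -> vocab logits

def respond(token_ids, img_feat, max_len=5):
    # Encode the input utterance with the encoder GRU.
    h = np.zeros(HID_DIM)
    for t in token_ids:
        h = enc.step(emb[t], h)
    # Fuse the image context into the decoder's initial state.
    h = np.tanh(h + W_img @ img_feat)
    # Greedy decoding until </s> or the length limit.
    out, tok = [], 1  # start from <s>
    for _ in range(max_len):
        h = dec.step(emb[tok], h)
        tok = int(np.argmax(W_out @ h))
        if tok == 2:  # </s>
            break
        out.append(words[tok])
    return out

img = rng.normal(size=IMG_DIM)   # stand-in for pretrained-CNN frame features
reply = respond([4, 5], img)     # encode "how you", decode an (untrained) reply
print(reply)
```

In the full system, `img_feat` would come from a CNN applied to the video frame at the time of the utterance, and the weights would be learned jointly; the sketch only shows how the two information streams meet in one model.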
Table of Contents
Abstract (Chinese) ii
Abstract (English) iii
1. Introduction 1
1-1. Research Background 1
1-2. Research Motivation and Objectives 2
2. Literature Review 3
2-1. Dialogue 3
2-1-1. Recurrent Neural Networks 3
2-1-2. SeqToSeq Models 4
2-1-3. Dialogue Systems 5
2-2. Visual Processing 5
2-2-1. Image Processing 5
2-2-2. Video Processing 7
2-3. Vision and Natural Language 7
3. Research Methods 9
3-1. Datasets 10
3-1-1. Dataset Overview 10
3-1-2. Dataset Selection 11
3-1-3. Subtitle Processing 12
3-2. Evaluation Methods 14
3-2-1. Automatic Evaluation 15
3-2-2. Human Evaluation 16
3-3. Models 16
3-3-1. Image-and-Utterance Model 16
3-3-2. Utterance-Only Model 20
3-3-3. Similarity Model 20
4. Experiments and Results 22
4-1. Dataset Analysis 22
4-2. Model Training 24
4-2-1. Frame Extraction 24
4-2-2. Training Strategy 25
4-2-3. Convergence 26
4-3. Impact of Images 28
4-3-1. Impact on Model Output 28
4-3-2. Image Analysis 33
4-4. Training Results 35
4-4-1. Learning on the Training Data 35
4-4-2. Similarity Model 38
4-4-3. Human Evaluation Results 40
4-5. General Discussion 41
5. Conclusions and Future Work 44
5-1. Conclusions 44
5-2. Future Work 44
References 46
Appendix: Human Evaluation Questions 49
Fulltext
This electronic full text is licensed only for personal, non-commercial searching, reading, and printing for academic research purposes. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: user-defined release date
Available:
Campus: publicly available
Off-campus: publicly available


Printed copies
Public-access information for printed theses is relatively complete from academic year 102 (2013-2014) onward. To look up access information for printed theses from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Availability: publicly available
