博碩士論文 etd-0223117-131536 詳細資訊
Title page for etd-0223117-131536
論文名稱
Title
深度加強式學習結合閘門網路
Deep Reinforcement Learning with a Gating Network
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
63
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2017-03-10
繳交日期
Date of Submission
2017-03-27
關鍵字
Keywords
加強式學習、深度加強式學習、深度學習、閘門網路、類神經網路
Reinforcement Learning, Deep Reinforcement Learning, Deep Learning, Gating network, Neural network
統計
Statistics
本論文已被瀏覽 5644 次,被下載 58 次。
The thesis/dissertation has been browsed 5644 times, has been downloaded 58 times.
中文摘要 Chinese Abstract
In reinforcement learning, a reward function is specified in advance, and an agent then interacts with a dynamic environment to gather experience and update its policy, eventually achieving optimal control. An exact model of the environment is not required; all that is needed is to place the robot in the environment it is to learn and to provide an appropriate reward function. In a complex environment, however, it is sometimes difficult to design a reward function for a complicated task. For example, when a robot learns to play soccer, an overly complex reward function makes learning difficult and slow. Moreover, traditional reinforcement learning is better suited to problems with few states; when the number of states is extremely large, it runs into the curse of dimensionality. To overcome the learning difficulty and the curse of dimensionality, this thesis proposes an algorithm that combines deep reinforcement learning with a gating network. Building on deep reinforcement learning, a deep network can extract features layer by layer even when the input state is as large as the raw pixels of an image, thereby resolving the curse of dimensionality encountered in the past. In addition, by using a gating network to fuse previously learned policies, the robot achieves learning results faster than one that does not reuse old policies. Two games are adopted as the experimental environments in this thesis: Flappy Bird and ping-pong. In each game, specific simple reward functions are first provided for training, and the robot is then given a more difficult task, in the spirit of the divide-and-conquer strategy. Two gating-network architectures are presented, an in-parallel gating network and an in-serial gating network, and both shorten the robot's training time.
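To make the tabular setting contrasted above concrete, the following is a minimal sketch of one-step Q-learning, in which a state-action value table is updated from the reward feedback gathered while interacting with the environment. The environment interface (reset, step, actions) and the hyperparameter values are hypothetical and only for illustration; they are not taken from the thesis.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Tabular one-step Q-learning: keep a table of state-action values and
    # adjust it from the reward signal observed while interacting with env.
    q = defaultdict(float)  # q[(state, action)] -> estimated return
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration over the (assumed finite) action set.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)  # hypothetical interface
            # Temporal-difference update toward r + gamma * max_a' Q(s', a').
            best_next = max(q[(next_state, a)] for a in env.actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q

Because the table grows with the number of distinct states, this scheme becomes impractical when states are raw image pixels, which is the curse of dimensionality the abstract refers to.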
Abstract
Reinforcement Learning (RL) is a good way to train a robot since it does not need an exact model of the environment. All that is needed is to let a learning agent interact with the environment under an appropriate reward function, which is associated with the goal of the task that the agent is expected to accomplish. Unfortunately, it is hard to craft a suitable reward function for a complicated problem, such as a soccer game in which the goal of scoring is not directly related to the mission or the role the player is asked to play by the coach. Besides, the tabular method for approximating returns in RL is better suited to environments with few states; in a huge state space, RL methods face the curse of dimensionality. To alleviate these difficulties, this thesis proposes a deep reinforcement learning method regulated by gating networks. By the merit of deep neural networks, even when raw image pixels are taken as states, latent features can be trained and implicitly extracted layer by layer from the raw data. In the proposed method, a composed policy is obtained by a gating network that regulates the outputs of several deep learning modules, each of which is trained for an individual policy. In this thesis, two video games, Flappy Bird and ping-pong, are adopted as testbeds to examine the performance of the proposed method. In the proposed architecture, each deep learning policy module is first trained with a simple reward function. Through the gating networks, these simple policies can be composed into a more sophisticated one so as to accommodate more complicated tasks, akin to the divide-and-conquer strategy. The proposed architecture has two kinds of arrangements: one is called the in-parallel gating network, and the other the in-serial gating network. The outcomes show that both can efficiently shorten the training time.
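As a concrete illustration of the in-parallel arrangement described above, here is a minimal PyTorch sketch, not the thesis's actual implementation: a small gating network maps the state to module weights and blends the per-action outputs of pre-trained policy modules. The class name, layer sizes, and the assumption that each module maps a flat state vector to action values are all illustrative assumptions.

import torch
import torch.nn as nn

class InParallelGating(nn.Module):
    # A gating network produces state-dependent weights that blend the
    # per-action outputs of several pre-trained (and here frozen) policy modules.
    def __init__(self, policy_modules, state_dim, hidden_dim=64):
        super().__init__()
        self.policy_modules = nn.ModuleList(policy_modules)
        for p in self.policy_modules.parameters():
            p.requires_grad = False  # reuse old policies as-is; only the gate learns
        self.gate = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, len(policy_modules)),
            nn.Softmax(dim=-1),
        )

    def forward(self, state):
        weights = self.gate(state)                                             # (batch, n_modules)
        outputs = torch.stack([m(state) for m in self.policy_modules], dim=1)  # (batch, n_modules, n_actions)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)                    # composed action values

# Hypothetical usage: compose two modules trained on simple rewards.
# composed = InParallelGating([policy_a, policy_b], state_dim=8)
# q_values = composed(torch.randn(1, 8))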
目次 Table of Contents
Thesis Verification Letter i
Abstract (in Chinese) iii
Abstract iv
List of Figures viii
List of Tables x
1. Introduction 1
1-1 Motivation 1
1-2 Historical Review 2
1-3 Thesis Organization 3
2. Background 4
2-1 Reinforcement Learning 4
2-2 Neural Networks 7
2-3 Convolutional Neural Networks 11
2-3.1 Basic Concepts 11
2-3.2 Backpropagation in Convolutional Layers 17
2-3.3 Backpropagation in Pooling Layers 26
2-4 Deep Reinforcement Learning 27
3. Methodology 32
3-1 In-Parallel Gating Network 32
3-2 In-Serial Gating Network 36
4. Simulation 37
4-1 Flappy Bird 40
4-2 PingPong 44
5. Conclusions 48
5-1 Summary 48
5-2 Future Work 48
References 49
參考文獻 References
[1] C. J. C. H. Watkins and P. Dayan, “Technical Note: Q-Learning,” Machine Learning, vol. 8, no. 3, pp. 272–292, 1992.
[2] J. L. Lin, K. S. Hwang, W. C. Jiang and Y. J. Chen, “Gait Balance and Acceleration of a Biped Robot Based on Q-Learning,” IEEE Access, vol. 4, pp. 2439-2449, 2016.
[3] K. S. Hwang, J. L. Lin, T. C. Huang, and H. J. Hsu, “Humanoid Robot Gait Imitation,” SICE Annual Conference, pp. 2124-2128, 2014.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” NIPS, 2013.
[5] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, pp. 386-408, 1958.
[6] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. I. Cambridge, MA: Bradford Books, pp. 318–362, 1986.
[7] Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[8] R. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduction,” MIT Press, Cambridge, MA, 1998.
[9] K. Fukushima, “Neocognitron,” Scholarpedia, vol. 2, no. 1, p. 1717, 2007.
[10] D. H. Hubel and T. N. Wiesel, “Receptive fields of single neurones in the cat's striate cortex,” J. Physiol., 1959.
[11] M. D. Zeiler and R. Fergus. “Visualizing and understanding convolutional networks,” ECCV, 2014.
[12] M. Riedmiller, “Neural fitted Q iteration − first experiences with a data efficient neural reinforcement learning method,” Proc. 16th Eur. Conf. Mach. Learn., pp. 317–328, 2005.
[13] S. Lange and M. Riedmiller, “Deep auto-encoder neural networks in reinforcement learning,” The 2010 International Joint Conference on Neural Networks (IJCNN), 2010.
[14] L. J. Lin, “Reinforcement learning for robots using neural networks,” Technical report, DTIC Document, 1993.
[15] G. Tesauro, “Temporal difference learning and TD-Gammon,” Commun. ACM, vol. 38, no. 3, pp. 58–68, 1995.
[16] J. A. Boyan, “Modular neural networks for learning context-dependent game strategies,” M.S. thesis, University of Cambridge, 1992.
[17] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” in Proc. Int. Conf. Learn. Representations, 2016.
[18] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” arXiv preprint arXiv:1603.07285, 2016.
[19] “視覺皮質(Visual cortex),” Smallcollation.blogspot.tw, 2017. [Online]. Available: https://smallcollation.blogspot.tw/2013/06/visual-cortex.html#gsc.tab=0. [Accessed: 15- Mar- 2017]
[20] “Stanford University CS231n: Convolutional Neural Networks for Visual Recognition,” CS231n.stanford.edu, 2017. [Online]. Available: http://cs231n.stanford.edu/. [Accessed: 15- Mar- 2017]
電子全文 Fulltext
This electronic fulltext is licensed only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid infringement.
論文使用權限 Thesis access permission: 自定論文開放時間 user-defined release time
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Availability information for printed copies is relatively complete from academic year 102 (ROC calendar) onward. To inquire about the availability of printed copies from academic year 101 or earlier, please contact the printed thesis service desk of the Library and Information Office. We apologize for any inconvenience.
開放時間 Available: 已公開 available
