博碩士論文 etd-1028114-170500 詳細資訊
Title page for etd-1028114-170500
論文名稱
Title
基於關鍵狀態的逆向增強式學習演算法
Inverse Reinforcement Learning based on Critical State
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
57
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2014-11-22
繳交日期
Date of Submission
2014-11-28
關鍵字
Keywords
加強式學習、逆向加強式學習、獎懲函數、獎懲特徵建構、學徒學習
Reward Feature Construction, Apprenticeship Learning, Inverse Reinforcement Learning, Reward Function, Reinforcement Learning
統計
Statistics
本論文已被瀏覽 5702 次,被下載 663 次。
The thesis/dissertation has been browsed 5702 times, has been downloaded 663 times.
中文摘要 Chinese Abstract
Reinforcement learning lets a learning agent obtain reward information by interacting with a dynamic environment and uses that information to update its policy, thereby optimizing control. A key element of reinforcement learning is the reward function: the most succinct expression of the expert's intention. In complex problems, however, the reward function is often difficult to specify, and for this reason inverse reinforcement learning (IRL) has attracted attention. IRL is mainly used to recover the reward function of a Markov Decision Process. Conventional IRL algorithms require a set of reward-function indexes and a set of demonstration trajectories, but in complex problems it is often hard to choose appropriate reward indexes, so the entire state space is sometimes used directly as the index set. This thesis proposes an Inverse Reinforcement Learning algorithm based on Critical States (IRLCS): given one set of correct demonstration trajectories and one set of incorrect demonstration trajectories, it compares the two, extracts suitable critical states from the whole state space to serve as reward indexes, and derives a succinct, meaningful reward function. Experimental results show that, compared with using the entire state space, the learned policy is closer to the expert's and the computational cost is greatly reduced. The results of this thesis are presented in a video on YouTube: http://youtu.be/cMaOdoTt4Hw.
Abstract
Reinforcement Learning (RL) enables an agent to learn by interacting with a dynamic environment. One fundamental assumption of existing RL algorithms is that the reward function, the most succinct representation of the designer's intention, must be provided beforehand, yet in complex problems it is difficult to specify an appropriate reward function. The goal of inverse reinforcement learning (IRL) is to find a reward function for a Markov Decision Process. An IRL process requires a set of reward indexes and good example traces demonstrated by an expert; in complex problems, however, selecting a suitable set of reward indexes is difficult. In this thesis, the Inverse Reinforcement Learning based on Critical State (IRLCS) algorithm is proposed to find a succinct and meaningful reward function. IRLCS selects a set of reward indexes from the whole state space by comparing the differences between good and bad demonstrations. Experimental results show that IRLCS finds a strategy close to the expert's strategy while saving a large amount of computation time. The research results are presented in a video on YouTube: http://youtu.be/cMaOdoTt4Hw.
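The record itself contains no code, but the core step the abstract describes, choosing reward indexes (critical states) by contrasting how often the good and bad demonstrations visit each state, can be sketched briefly. The following Python snippet is a minimal illustration, not the author's implementation: the trajectory format (lists of state ids), the visit-frequency comparison, and the threshold cutoff are all assumptions made for the example.

```python
from collections import Counter

def visit_frequency(traces):
    """Normalized visit counts over all states appearing in the trajectories."""
    counts = Counter(state for trace in traces for state in trace)
    total = sum(counts.values())
    return {state: count / total for state, count in counts.items()}

def select_critical_states(good_traces, bad_traces, threshold=0.1):
    """Pick states whose visit frequency differs markedly between the good
    (expert) and bad (failed) demonstrations; these become the reward indexes.
    `threshold` is an assumed cutoff, not a value taken from the thesis."""
    good_f = visit_frequency(good_traces)
    bad_f = visit_frequency(bad_traces)
    states = set(good_f) | set(bad_f)
    return sorted(s for s in states
                  if abs(good_f.get(s, 0.0) - bad_f.get(s, 0.0)) > threshold)

# Toy example: state 2 occurs only in the good traces and state 4 only in the
# bad traces, so both are selected as critical states.
good = [[0, 1, 2, 3], [0, 2, 3]]
bad = [[0, 1, 4, 3], [0, 4, 3]]
print(select_critical_states(good, bad))   # -> [2, 4]
```

Keeping only such contrasting states makes the reward feature vector far smaller than the full state space, which is where the computational savings reported in the abstract come from.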
目次 Table of Contents
摘要 i
Abstract ii
LIST OF FIGURES v
LIST OF TABLES vi
I. INTRODUCTION 1
1.1 Preface 1
1.2 Motivation and Objective 2
1.3 Markov Decision Process 3
1.4 Reinforcement Learning 3
1.5 Q-learning Algorithm 5
1.6 Inverse Reinforcement Learning 7
1.7 TrAdaBoost 8
1.8 Related Works 9
1.9 Organization of thesis 10
II. PROPOSED METHOD 11
2.1 Apprenticeship Learning 11
2.2 Inverse Reinforcement Learning Via Orthogonal Projection 12
2.2.1 Reward Index 12
2.2.2 Iteration Algorithm 14
2.3 Reward Index Construction 18
2.3.1 Impurity Function 19
2.3.2 Visit Frequency of States 21
2.3.3 Visit Frequency of State-Action Pairs 22
2.4 Inverse Reinforcement Learning Based on Critical State 25
III. EXPERIMENT 32
3.1 Experiment Environment 32
3.2 Purpose of the experiment 35
3.3 The Results of Experiment 35
3.3.1 Collision-avoidance behavior 36
3.3.2 Driving-as-fast-as-possible behavior 40
3.4 Conclusion of Experiment Results 43
IV. CONCLUSION 44
4.1 Summary 44
4.2 Future Work 45
REFERENCES 46
參考文獻 References
[1] S. Levine, Z. Popovic, and V. Koltun, “Feature construction for inverse reinforcement learning,” Advances in Neural Information Processing Systems, vol. 23, 2010.
[2] C. J. C. H. Watkins and P. Dayan, “Technical note: Q-learning,” Machine Learning, vol. 8, no. 3–4, pp. 279–292, 1992.
[3] A. Ng and S. Russell, “Algorithms for inverse reinforcement learning,” Proceedings of the 17th International Conference on Machine Learning, pp. 663–670, 2000.
[4] P. Abbeel and A. Ng, “Apprenticeship learning via inverse reinforcement learning,” Proceedings of the 21st International Conference on Machine Learning, p. 1, 2004.
[5] R. E. Schapire, “A brief introduction to boosting,” Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 1401-1406, 1999.
[6] W. Dai, Q. Yang, G. Xue, and Y. Yu, “Boosting for transfer learning,” Proceedings of the 24th International Conference on Machine Learning, pp. 193–200, New York, NY, USA, 2007.
[7] J. Kolter, P. Abbeel, and A. Ng, “Hierarchical apprenticeship learning with application to quadruped locomotion,” Advances in Neural Information Processing Systems, vol. 20, 2008.
[8] P. Abbeel, A. Coates, and A. Ng, “Autonomous helicopter aerobatics through apprenticeship learning,” International Journal of Robotics Research, vol. 29, no. 13, pp. 1608–1639, 2010.
[9] P. Abbeel, D. Dolgov, A. Ng, and S. Thrun, “Apprenticeship learning for motion planning with application to parking lot navigation,” IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1083–1090, 2008.
[10] S. Chung and H. Huang, “A mobile robot that understands pedestrian spatial behaviors,” IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5861–5866, 2010.
[11] S.-Y. Chen, H. Qian, J. Fan, Z.-J. Jin, and M.-L. Zhu, “Modified reward function on abstract features in inverse reinforcement learning,” Journal of Zhejiang University - Science C, vol. 11, no. 9, pp. 718–723, 2010.
[12] D. Grollman and A. Billard, “Donut as I do: Learning from failed demonstrations,” International Conference on Robotics and Automation, Shanghai, 2011.
[13] R. Balian, “Entropy, a Protean concept,” Poincaré Seminar 2003, pp. 119–144.
[14] M. Lopes, F. Melo, and L. Montesano, “Active learning for reward estimation in inverse reinforcement learning,” Machine Learning and Knowledge Discovery in Databases, pp. 31–46, 2009.
[15] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, 1998.
[16] J. Tang, A. Singh, N. Goehausen, and P. Abbeel, “Parameterized maneuver learning for autonomous helicopter flight,” IEEE International Conference on Robotics and Automation (ICRA), pp. 1142–1148, 2010.
電子全文 Fulltext
The electronic full text is licensed only for users to search, read, and print it personally and non-commercially for the purpose of academic research. Please observe the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid infringement.
論文使用權限 Thesis access permission:校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
Public-access information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up access information for printed theses from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
