Master's/Doctoral Thesis etd-0906115-151230: Detailed Record
Title page for etd-0906115-151230
Title: Action Segmentation and Learning by Inverse Reinforcement Learning
Department:
Year, semester:
Language:
Degree:
Number of pages: 70
Author:
Advisor:
Convenor:
Advisory Committee:
Date of Exam: 2015-10-02
Date of Submission: 2015-10-06
Keywords: Upper Confidence Bounds, Adaboost classifier, reward function, inverse reinforcement learning, reinforcement learning
Statistics: The thesis/dissertation has been viewed 5,716 times and downloaded 502 times.
Abstract (translated from Chinese)
Reinforcement learning enables an agent to learn the behaviors needed to complete a task through trial and error, but when the agent faces tasks of varying difficulty, the reward function is often hard to define. To address this problem, this thesis builds on inverse reinforcement learning and combines an Adaboost classifier with a weighting scheme based on the concept of Upper Confidence Bounds (UCB) to construct reward functions for complex behaviors. Inverse reinforcement learning uses the expert's interaction with the environment to let the agent, by imitation, construct a reward function whose intent is similar to the expert's. During imitation, the agent continuously compares its errors against the expert and uses Adaboost to assign a different weight to each state. These weights are then combined with a per-state confidence level determined by UCB to derive a suitable reward function. For complex tasks, the thesis uses a state encoding method and action segmentation to simplify the task, and applies inverse reinforcement learning with the weighting method to find a suitable reward function, helping the agent imitate the expert's behavior more quickly. Finally, simulations of a maze environment and a soccer robot environment verify the practicality of the proposed method, and the simulation results show that learning speed is indeed significantly improved.
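For readers unfamiliar with the two ingredients named above, the standard forms they come from are sketched below: the textbook AdaBoost weight update and the UCB1 index (Auer, 2002). This is background notation only, not necessarily the exact weighting the thesis derives.

```latex
% Textbook AdaBoost weight update over samples/states i at round t,
% where \epsilon_t is the weighted error of weak hypothesis h_t:
\[
\alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},
\qquad
w_{t+1}(i) = \frac{w_t(i)\,\exp\!\bigl(-\alpha_t\, y_i\, h_t(x_i)\bigr)}{Z_t}
\]

% UCB1 index, read here as a per-state confidence term, where \bar{r}_s is the
% current estimate for state s, n_s the visit count of s, and n the total visits:
\[
\mathrm{UCB}(s) = \bar{r}_s + \sqrt{\frac{2\ln n}{n_s}}
\]
```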
Abstract
Reinforcement learning allows an agent to learn behaviors through trial and error. However, as the difficulty of a mission increases, its reward function also becomes harder to define. By combining an Adaboost classifier with the concept of Upper Confidence Bounds (UCB), this thesis proposes a method based on inverse reinforcement learning to construct the reward function of a complex mission. Inverse reinforcement learning lets the agent rebuild a reward function by imitating the expert's interaction with the environment. During imitation, the agent continuously compares itself against the expert, and the proposed method assigns a specific weight to each state via Adaboost. Each weight is then combined with that state's confidence from UCB to construct an approximate reward function. The thesis uses a state encoding method and action segmentation to simplify the problem, and then applies the proposed method to determine a suitable reward function. Finally, simulations of a maze environment and a soccer robot environment validate the proposed method and show that it clearly reduces the learning time.
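As a rough, hypothetical illustration of the combination described in the abstract (not the thesis's actual implementation), the sketch below mixes AdaBoost-style per-state weights with a UCB-style confidence term to form a per-state reward estimate. All names here (ucb_confidence, adaboost_state_weights, reward_estimate) are invented for this example.

```python
import math

def ucb_confidence(visit_counts, total_visits):
    """UCB1-style exploration term per state: larger for rarely visited states."""
    return [math.sqrt(2.0 * math.log(max(total_visits, 1)) / max(n, 1))
            for n in visit_counts]

def adaboost_state_weights(weights, agent_matches_expert, epsilon):
    """One AdaBoost-like round: raise weights of states where the agent's action
    disagrees with the expert's, lower them where it agrees."""
    epsilon = min(max(epsilon, 1e-6), 1.0 - 1e-6)        # keep the log well-defined
    alpha = 0.5 * math.log((1.0 - epsilon) / epsilon)    # round confidence
    updated = [w * math.exp(-alpha if match else alpha)
               for w, match in zip(weights, agent_matches_expert)]
    z = sum(updated)                                     # renormalize to a distribution
    return [w / z for w in updated]

def reward_estimate(weights, confidences):
    """Hypothetical combination: scale each state's weight by its UCB confidence.
    One of many plausible ways to merge the two quantities."""
    return [w * c for w, c in zip(weights, confidences)]

# Toy usage on four states; the agent matches the expert in states 0 and 2.
w = [0.25, 0.25, 0.25, 0.25]
w = adaboost_state_weights(w, [True, False, True, False], epsilon=0.25)
r = reward_estimate(w, ucb_confidence([10, 2, 10, 2], total_visits=24))
print(r)
```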
Table of Contents
Abstract (in Chinese) i
Abstract ii
TABLE OF CONTENTS iii
LIST OF FIGURES v
LIST OF TABLES vii
I. INTRODUCTION 1
1.1 Motivation 1
1.2 Organization of Thesis 2
II. BACKGROUND KNOWLEDGE 3
2.1 Reinforcement Learning 3
2.2 Inverse Reinforcement Learning 5
2.3 Adaboost Classifier 8
III. ADABOOST-LIKE INVERSE REINFORCEMENT LEARNING (I) 10
3.1 Adaboost-Like Inverse Reinforcement Learning (I) in Detail 10
3.2 An Example for AL-IRL(I) 14
3.3 Proof of Gradient Searching Method 22
IV. PROPOSED METHOD AND RELATIVE WORK 27
4.1 Adaboost-Like Inverse Reinforcement Learning (II) 28
4.2 Action Segment 33
4.3 State Encoding Method 34
V. SIMULATION RESULT 37
5.1 Simulation of Maze Environment 37
5.1.1 Behavior 1: Seeking Goal A 38
5.1.2 Behavior 2: Seeking Goal B 41
5.1.3 State Encoding Method 43
5.1.4 Adaboost-Like Inverse Reinforcement Learning (II) 44
5.2 Simulation of Soccer Robot 50
5.2.1 Behavior 1: Chasing Ball 51
5.2.2 Behavior 2: Obstacle Avoidance 53
5.2.3 Behavior 3: Positioning 54
5.2.4 State Encoding Method 56
5.2.5 Adaboost-Like Inverse Reinforcement Learning (II) 57
VI. CONCLUSION AND FUTURE RESEARCH DIRECTION 63
6.1 Conclusion of Thesis 63
6.2 Future Research Direction 64
REFERENCES 65
Fulltext
This electronic full text is licensed to users for academic research purposes only, for personal, non-profit searching, reading, and printing. Please observe the relevant provisions of the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.
Thesis access permission: unrestricted (fully open on and off campus)
Available:
Campus: available
Off-campus: available


Printed copies
Information on the availability of printed copies is relatively complete for academic year 102 (ROC calendar) and later. To check the availability of printed copies from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
