博碩士論文 etd-0801114-040509 詳細資訊
Title page for etd-0801114-040509
論文名稱
Title
Shaped-Q學習於多代理人合作研究
A Study on Multi-Agent Cooperation by Shaped-Q Learning
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
60
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2014-07-25
繳交日期
Date of Submission
2014-09-03
關鍵字
Keywords
多代理人系統、合作、加強式學習
Reinforcement learning, Multi-agent system, Cooperation
統計
Statistics
本論文已被瀏覽 5655 次,被下載 44 次。
The thesis/dissertation has been browsed 5655 times, has been downloaded 44 times.
中文摘要
在論文中我們主要探討多代理人在無通訊環境中合作的問題。也就是說,代理人們無法在環境中彼此通訊,因此在學習合作的過程中無法藉由協調達成彼此的共識。我們提出的概念是:為了達成彼此合作,每位代理人在做出決策前,利用自己過去的經驗去推測對方的動作。藉由這個方式,代理人可以在無通訊的環境中達成合作,並成功地完成任務。
加強式學習(Reinforcement Learning)是一種嘗試錯誤的學習方式,換句話說,代理人可以藉由加強式學習來學習如何達成目標。當代理人在無通訊的環境下決定動作時,若彼此無法達成共識,便會造成停滯。因此,如何在無通訊的環境下設計一個能減少停滯發生、進而提高學習效率的策略,成為一項重要的議題。
本論文提出一種方式:在學習合作的過程中,每位代理人建立 Cooperative Tendency Table (CTT),以紀錄每項動作的合作傾向值,且 CTT 會隨著學習的過程而更新。每項動作的合作傾向值乘以其 q-value 即為該動作的 Shaped-Q 值,代理人以此策略決定目前所要採取的動作。因此,代理人可以透過這種方法迅速達成共識,降低停滯的發生,並提高學習效率。
除此之外,我們所提出的方式不僅記憶體需求量比 Win or Learn Fast Policy Hill-Climbing (WoLF-PHC) 少,效率也比 WoLF-PHC 好。換句話說,我們提出的方式可以讓代理人在無通訊的環境下使用較少的記憶體,並更有效率地完成任務。研究結果以影片呈現於 http://youtu.be/CFS-KzOtMOg
Abstract
In this thesis, we primarily address the problem of multi-agent cooperation in a communication-free environment; that is, the agents cannot communicate with one another, and therefore cannot reach a consensus through coordination while learning to cooperate. Our concept is that, in order to cooperate, each agent uses its own past experience to infer the other agents' actions before making a decision. With this concept, the agents can cooperate with one another and successfully complete their tasks without communication.
Reinforcement learning is a trial-and-error learning method; in other words, agents can learn how to achieve a goal through reinforcement learning. When agents choose actions in a communication-free environment, however, they may fail to reach a consensus, and this causes stagnation. An important issue is therefore how to design a policy that reduces the occurrence of stagnation and thereby improves learning efficiency in such an environment.
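For reference, the tabular Q-learning update that the thesis reviews in Section 2.2 can be sketched as follows. This is a minimal, standard formulation; the parameter values and the state/action encoding are illustrative assumptions, not the thesis's own settings.

# Minimal tabular Q-learning update (standard form; parameters are illustrative).
from collections import defaultdict

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.9   # discount factor (assumed value)

q_table = defaultdict(float)  # maps (state, action) -> q-value

def q_update(state, action, reward, next_state, actions):
    # Trial-and-error step: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    best_next = max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] += ALPHA * (reward + GAMMA * best_next - q_table[(state, action)])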
This thesis proposes a method in which, while learning to cooperate, each agent builds a Cooperative Tendency Table (CTT) to record a cooperative tendency value for each action; the CTT is updated throughout the learning process. The cooperative tendency value of each action is multiplied by the action's q-value to obtain its Shaped-Q value, and this policy determines the action to take. With this method, agents can quickly reach a consensus with one another, which reduces the occurrence of stagnation and improves learning efficiency.
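As a rough illustration of the action-selection rule just described (the Shaped-Q value of an action is its cooperative tendency value multiplied by its q-value), the sketch below keeps a per-agent CTT alongside the Q-table. The abstract does not specify how the CTT is updated, so the reward-driven update shown here is only an assumption; the names and the step size BETA are hypothetical.

# Sketch of Shaped-Q action selection with a Cooperative Tendency Table (CTT).
# The CTT update rule below is an assumption; the thesis defines it in Section 3.3.
from collections import defaultdict

BETA = 0.1                      # assumed CTT adjustment step

q_table = defaultdict(float)    # (state, action) -> q-value
ctt = defaultdict(lambda: 1.0)  # (state, action) -> cooperative tendency value

def shaped_q(state, action):
    # Shaped-Q value: cooperative tendency of the action times its q-value.
    return ctt[(state, action)] * q_table[(state, action)]

def select_action(state, actions):
    # Choose the action with the largest Shaped-Q value in the current state.
    return max(actions, key=lambda a: shaped_q(state, a))

def update_ctt(state, action, cooperation_succeeded):
    # Assumed update: raise the tendency of actions that led to successful
    # cooperation and lower it otherwise, keeping the value non-negative.
    delta = BETA if cooperation_succeeded else -BETA
    ctt[(state, action)] = max(ctt[(state, action)] + delta, 0.0)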
In addition, the proposed method not only requires less memory than Win or Learn Fast Policy Hill-Climbing (WoLF-PHC) but also performs better. In other words, it lets agents use less memory and complete tasks more efficiently in a communication-free environment. The research results are demonstrated in a video on YouTube: http://youtu.be/CFS-KzOtMOg
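A back-of-the-envelope view of the memory claim, assuming purely tabular representations (the thesis's exact bookkeeping may differ): WoLF-PHC, as defined by Bowling and Veloso, maintains a Q-table, a current policy, and an average policy over state-action pairs plus a per-state visit count, whereas the method above needs only a Q-table and a CTT.

# Illustrative table-size comparison (entry counts only; not the thesis's measurement).
def wolf_phc_entries(num_states, num_actions):
    # Q(s,a) + policy pi(s,a) + average policy avg_pi(s,a) + visit count C(s)
    return 3 * num_states * num_actions + num_states

def shaped_q_entries(num_states, num_actions):
    # Q(s,a) + cooperative tendency CTT(s,a)
    return 2 * num_states * num_actions

print(wolf_phc_entries(100, 4))  # 1300
print(shaped_q_entries(100, 4))  # 800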
目次 Table of Contents
致謝 i
摘要 ii
Abstract iii
LIST OF FIGURES vi
LIST OF TABLES vii
I. INTRODUCTION 1
1.1 Motivation 1
1.2 Organization of Thesis 3
II. BACKGROUND 4
2.1 Reinforcement Learning 4
2.2 Q-learning Algorithm 6
2.3 Literature Surveys 9
III. PROPOSED METHOD 10
3.1 Dilemma Problem of Multi-Agent System 10
3.2 The Definition of Stagnation 11
3.3 Cooperative Tendency Table 12
3.4 The Definition of Shaped-Q 15
IV. SIMULATION 21
4.1 Task Description – Transport Object 22
4.2 Simulation Results – Transport Object 24
4.3 Task Description – Mountain Car 27
4.4 Simulation Results – Mountain Car 29
4.5 Task Description – Seesaw 31
4.6 Simulation Results – Seesaw 34
4.7 Simulation Tools – Webots 35
4.8 Design of the Scene 36
V. CONCLUSION 45
REFERENCES 46
參考文獻 References
[1] K. Hirai, M. Hirose, Y. Haikawa, and T. Takenaka, “The Development of Honda Humanoid Robot,” IEEE International Conference on Robotics & Automation, pp. 1321-1326, 1998.
[2] P. Stone, R. S. Sutton, and G. Kuhlmann, “Reinforcement Learning for RoboCup-Soccer Keepaway,” Adaptive Behavior, pp. 165-188, 2005.
[3] E. Yang and D. Gu, “Multiagent Reinforcement Learning for Multi-robot Systems: A Survey,” Technical Report CSM-404, University of Essex, 2004.
[4] M. Tan, “Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents,” Proceedings of the Tenth International Conference on Machine Learning, pp. 330-337, 1993.
[5] C. Claus and C. Boutilier, “The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems,” Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pp. 746-752, 1998.
[6] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[7] Y. Lizhi, Q. Haiyan, J. Rozenblit, and F. Szidarovszky, “Multi-agent Learning Model with Bargaining,” Proceedings of the Winter Simulation Conference, pp. 934-940, 2006.
[8] M. Bowling and M. Veloso, “Rational and Convergent Learning in Stochastic Games,” Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pp. 1021-1026, 2001.
[9] M. Bowling and M. Veloso, “Multiagent Learning Using a Variable Learning Rate,” Artificial Intelligence, pp. 215-250, 2002.
[10] 謝淑貞, 賽局理論 (Game Theory), 三民書局 (San Min Book Co.), 1999.
[11] W. Ying and L. Haoxiang, “Q-learning Based Multi-robot Box-Pushing with Minimal Switching of Actions,” Proceedings of the IEEE International Conference on Automation and Logistics, pp. 640-643, 2008.
[12] N. J. Nilsson, Introduction to Machine Learning, Robotics Laboratory, Department of Computer Science, Stanford University.
[13] C. J. C. H. Watkins, Learning from Delayed Rewards, PhD thesis, Cambridge University, 1989.
[14] M. Abramson and H. Wechsler, “Tabu Search Exploration for On-policy Reinforcement Learning,” Proceedings of the International Joint Conference on Neural Networks, pp. 2910-2915, 2003.
[15] Z. Xiaogang and L. Zhijing, “An Optimized Q-Learning Algorithm Based on the Thinking of Tabu Search,” International Symposium on Computational Intelligence and Design, pp. 533-536, 2008.
[16] E. Monacelli, C. Riman, R. Thieffry, I. Mougharbel, and S. Delaplace, “A Reactive Assistive Role Switching for Interaction Management in Cooperative Tasks,” Proceedings of the International Conference on Intelligent Robots and Systems, pp. 5118-5123, 2006.
[17] M. N. Ahmadabadi and M. Asadpour, “Expertness Based Cooperative Q-learning,” IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, pp. 66-76, 2002.
[18] W. Zhidong, Y. Hirata, and K. Kosuge, “Dynamic Object Closure by Multiple Mobile Robot and Random Caging Formation Testing,” Proceedings of the International Conference on Intelligent Robots and Systems, pp. 3675-3681, 2006.
[19] "Webots Reference Manual," [Online]. Available: http://www.cyberbotics.com/reference/.
[20] "Webots User Guide," [Online]. Available: http://www.cyberbotics.com/guide/.
電子全文 Fulltext
本電子全文僅授權使用者為學術研究之目的,進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定,切勿任意重製、散佈、改作、轉貼、播送,以免觸法。
This electronic fulltext is licensed only for personal, non-commercial searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission: 自定論文開放時間 user defined
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊,請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。
Availability information for printed copies is relatively complete from academic year 102 (ROC calendar) onward. To check the availability of printed copies from academic year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
