博碩士論文 etd-0801114-040509 詳細資訊
Title page for etd-0801114-040509
論文名稱
Title
Shaped-Q學習於多代理人合作研究
A Study on Multi-Agent Cooperation by Shaped-Q Learning
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
60
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2014-07-25
繳交日期
Date of Submission
2014-09-03
關鍵字
Keywords
多代理人系統、合作、加強式學習
Reinforcement learning, Multi-agent system, Cooperation
統計
Statistics
本論文已被瀏覽 5655 次,被下載 44 次。
The thesis/dissertation has been browsed 5655 times, has been downloaded 44 times.
中文摘要
在論文中我們主要探討多代理人在無通訊環境中合作的問題。也就是說,代理人們無法在環境中彼此通訊,因此在學習合作的過程中無法藉由協調達成彼此的共識。我們提出的概念是:為了達成彼此合作,每位代理人在做出決策前,利用自己過去的經驗去推測對方的動作。藉由這個方式,代理人可以在無通訊的環境中達成合作,並成功地完成任務。
加強式學習(Reinforcement Learning)是一種嘗試錯誤的學習方式,換句話說,代理人可以藉由加強式學習來學習如何達成目標。當代理人在無通訊的環境下決定動作時,若彼此無法達成共識,便會造成停滯。因此,如何在無通訊的環境下設計一個能減少停滯發生、進而提高學習效率的策略,成為一項重要的議題。
本論文提出一種方式:在學習合作的過程中,每位代理人建立 Cooperative Tendency Table (CTT),以紀錄每項動作的合作傾向值,且 CTT 會隨著學習的過程而更新。每項動作的合作傾向值乘以其 q-value 即為該動作的 Shaped-Q 值,代理人以此策略決定目前所要採取的動作。因此,代理人可以透過這種方法迅速達成共識,降低停滯的發生,並提高學習效率。
除此之外,我們所提出的方式不僅記憶體需求量比 Win or Learn Fast Policy Hill-Climbing (WoLF-PHC) 少,效率也比 WoLF-PHC 好。換句話說,我們提出的方式可以讓代理人在無通訊的環境下使用較少的記憶體,並更有效率地完成任務。研究結果以影片呈現於 http://youtu.be/CFS-KzOtMOg
Abstract
In this thesis, we primarily address the problem of multi-agent cooperation in a communication-free environment; that is, the agents cannot communicate with one another, and therefore cannot reach a consensus through coordination while learning to cooperate. Our concept is that, in order to cooperate, each agent uses its own past experience to infer the other agents' actions before making a decision. With this concept, the agents can cooperate with one another and successfully complete their tasks without communication.
Reinforcement learning is a trial-and-error learning method; in other words, agents can learn how to achieve a goal through reinforcement learning. When agents choose actions in a communication-free environment, however, they may fail to reach a consensus, and this causes stagnation. An important issue is therefore how to design a policy that reduces the occurrence of stagnation and thereby improves learning efficiency in such an environment.
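For reference, the tabular Q-learning update that the thesis reviews in Section 2.2 can be sketched as follows. This is a minimal, standard formulation; the parameter values and the state/action encoding are illustrative assumptions, not the thesis's own settings.

# Minimal tabular Q-learning update (standard form; parameters are illustrative).
from collections import defaultdict

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 0.9   # discount factor (assumed value)

q_table = defaultdict(float)  # maps (state, action) -> q-value

def q_update(state, action, reward, next_state, actions):
    # Trial-and-error step: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    best_next = max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] += ALPHA * (reward + GAMMA * best_next - q_table[(state, action)])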
This thesis proposes a method in which, while learning to cooperate, each agent builds a Cooperative Tendency Table (CTT) to record a cooperative tendency value for each action; the CTT is updated throughout the learning process. The cooperative tendency value of each action is multiplied by the action's q-value to obtain its Shaped-Q value, and this policy determines the action to take. With this method, agents can quickly reach a consensus with one another, which reduces the occurrence of stagnation and improves learning efficiency.
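As a rough illustration of the action-selection rule just described (the Shaped-Q value of an action is its cooperative tendency value multiplied by its q-value), the sketch below keeps a per-agent CTT alongside the Q-table. The abstract does not specify how the CTT is updated, so the reward-driven update shown here is only an assumption; the names and the step size BETA are hypothetical.

# Sketch of Shaped-Q action selection with a Cooperative Tendency Table (CTT).
# The CTT update rule below is an assumption; the thesis defines it in Section 3.3.
from collections import defaultdict

BETA = 0.1                      # assumed CTT adjustment step

q_table = defaultdict(float)    # (state, action) -> q-value
ctt = defaultdict(lambda: 1.0)  # (state, action) -> cooperative tendency value

def shaped_q(state, action):
    # Shaped-Q value: cooperative tendency of the action times its q-value.
    return ctt[(state, action)] * q_table[(state, action)]

def select_action(state, actions):
    # Choose the action with the largest Shaped-Q value in the current state.
    return max(actions, key=lambda a: shaped_q(state, a))

def update_ctt(state, action, cooperation_succeeded):
    # Assumed update: raise the tendency of actions that led to successful
    # cooperation and lower it otherwise, keeping the value non-negative.
    delta = BETA if cooperation_succeeded else -BETA
    ctt[(state, action)] = max(ctt[(state, action)] + delta, 0.0)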
In addition, the proposed method not only requires less memory than Win or Learn Fast Policy Hill-Climbing (WoLF-PHC) but also performs better. In other words, it lets agents use less memory and complete tasks more efficiently in a communication-free environment. The research results are demonstrated in a video on YouTube: http://youtu.be/CFS-KzOtMOg
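A back-of-the-envelope view of the memory claim, assuming purely tabular representations (the thesis's exact bookkeeping may differ): WoLF-PHC, as defined by Bowling and Veloso, maintains a Q-table, a current policy, and an average policy over state-action pairs plus a per-state visit count, whereas the method above needs only a Q-table and a CTT.

# Illustrative table-size comparison (entry counts only; not the thesis's measurement).
def wolf_phc_entries(num_states, num_actions):
    # Q(s,a) + policy pi(s,a) + average policy avg_pi(s,a) + visit count C(s)
    return 3 * num_states * num_actions + num_states

def shaped_q_entries(num_states, num_actions):
    # Q(s,a) + cooperative tendency CTT(s,a)
    return 2 * num_states * num_actions

print(wolf_phc_entries(100, 4))  # 1300
print(shaped_q_entries(100, 4))  # 800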
目次 Table of Contents
致謝 i
摘要 ii
Abstract iii
LIST OF FIGURES vi
LIST OF TABLES vii
I. INTRODUCTION 1
1.1 Motivation 1
1.2 Organization of Thesis 3
II. BACKGROUND 4
2.1 Reinforcement Learning 4
2.2 Q-learning Algorithm 6
2.3 Literature Surveys 9
III. PROPOSED METHOD 10
3.1 Dilemma Problem of Multi-Agent System 10
3.2 The Definition of Stagnation 11
3.3 Cooperative Tendency Table 12
3.4 The Definition of Shaped-Q 15
IV. SIMULATION 21
4.1 Task Description – Transport Object 22
4.2 Simulation Results – Transport Object 24
4.3 Task Description – Mountain Car 27
4.4 Simulation Results – Mountain Car 29
4.5 Task Description – Seesaw 31
4.6 Simulation Results – Seesaw 34
4.7 Simulation Tools – Webots 35
4.8 Design of the Scene 36
V. CONCLUSION 45
REFERENCES 46
參考文獻 References
[1] K. Hirai, M. Hirose, Y. Haikawa, and T. Takenaka, “The Development of Honda Humanoid Robot,” IEEE International Conference on Robotics & Automation, pp. 1321-1326, 1998.
[2] P. Stone, R. S. Sutton, and G. Kuhlmann, “Reinforcement Learning for RoboCup-Soccer Keepaway,” Adaptive Behavior, pp. 165-188, 2005.
[3] E. Yang and D. Gu, “Multiagent Reinforcement Learning for Multi-robot Systems: A Survey,” Technical Report CSM-404, University of Essex, 2004.
[4] M. Tan, “Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents,” Proceedings of the Tenth International Conference on Machine Learning, pp. 330-337, 1993.
[5] C. Claus and C. Boutilier, “The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems,” Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pp. 746-752, 1998.
[6] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[7] Y. Lizhi, Q. Haiyan, J. Rozenblit, and F. Szidarovszky, “Multi-agent Learning Model with Bargaining,” Proceedings of the Winter Simulation Conference, pp. 934-940, 2006.
[8] M. Bowling and M. Veloso, “Rational and Convergent Learning in Stochastic Games,” Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pp. 1021-1026, 2001.
[9] M. Bowling and M. Veloso, “Multiagent Learning Using a Variable Learning Rate,” Artificial Intelligence, pp. 215-250, 2002.
[10] 謝淑貞, 賽局理論 (Game Theory), 三民書局 (San Min Book Co.), 1999.
[11] W. Ying and L. Haoxiang, “Q-learning Based Multi-robot Box-Pushing with Minimal Switching of Actions,” Proceedings of the IEEE International Conference on Automation and Logistics, pp. 640-643, 2008.
[12] N. J. Nilsson, Introduction to Machine Learning, Robotics Laboratory, Department of Computer Science, Stanford University.
[13] C. J. C. H. Watkins, Learning from Delayed Rewards, PhD thesis, Cambridge University, 1989.
[14] M. Abramson and H. Wechsler, “Tabu Search Exploration for On-policy Reinforcement Learning,” Proceedings of the International Joint Conference on Neural Networks, pp. 2910-2915, 2003.
[15] Z. Xiaogang and L. Zhijing, “An Optimized Q-Learning Algorithm Based on the Thinking of Tabu Search,” International Symposium on Computational Intelligence and Design, pp. 533-536, 2008.
[16] E. Monacelli, C. Riman, R. Thieffry, I. Mougharbel, and S. Delaplace, “A Reactive Assistive Role Switching for Interaction Management in Cooperative Tasks,” Proceedings of the International Conference on Intelligent Robots and Systems, pp. 5118-5123, 2006.
[17] M. N. Ahmadabadi and M. Asadpour, “Expertness Based Cooperative Q-learning,” IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, pp. 66-76, 2002.
[18] W. Zhidong, Y. Hirata, and K. Kosuge, “Dynamic Object Closure by Multiple Mobile Robot and Random Caging Formation Testing,” Proceedings of the International Conference on Intelligent Robots and Systems, pp. 3675-3681, 2006.
[19] "Webots Reference Manual," [Online]. Available: http://www.cyberbotics.com/reference/.
[20] "Webots User Guide," [Online]. Available: http://www.cyberbotics.com/guide/.
電子全文 Fulltext
本電子全文僅授權使用者為學術研究之目的,進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定,切勿任意重製、散佈、改作、轉貼、播送,以免觸法。
This electronic fulltext is licensed only for personal, non-commercial searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission: 自定論文開放時間 user defined
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊,請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。
Availability information for printed copies is relatively complete from academic year 102 (ROC calendar) onward. To check the availability of printed copies from academic year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 available
