Title page for etd-0025116-130314
Title
Adaptive Exploration Strategies for Reinforcement Learning
Department
Year, semester
Language
Degree
Number of pages
34
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2016-01-23
Date of Submission
2016-01-25
Keywords
Reinforcement learning, Tabu search, state aggregation, trade-off between exploration and exploitation, decision tree, ε-greedy
Statistics
The thesis/dissertation has been browsed 5683 times and downloaded 15 times.
Abstract
In reinforcement learning, an agent learns to achieve a goal by trial and error. When the method is applied in a real environment, however, it is difficult to decide how to partition the state space. A second problem arises when the agent takes actions according to its policy during learning: it must balance exploitation and exploration, that is, whether to explore new regions to gain experience or to exploit existing knowledge to obtain the maximum reward. To address these problems, this thesis first proposes a decision-tree-based algorithm that adaptively partitions the state space, and then introduces decreasing Tabu search and an adaptive exploration strategy on top of it to handle the exploitation-exploration trade-off. Decreasing Tabu search places each action the agent takes into a Tabu list; when the list is full, the first action in the list is released, and the size of the list shrinks according to the number of times the goal has been reached successfully. The adaptive exploration strategy adjusts the exploration rate based on information entropy rather than keeping it fixed or tuning it by hand. Finally, maze simulations are used to validate the practicality of the proposed method, and the simulation results show that it clearly improves the learning speed.
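A minimal Python sketch of the two exploration mechanisms the abstract describes may help make them concrete. The details the abstract leaves open are filled with illustrative assumptions, not the thesis's exact formulas: the softmax-entropy mapping, the epsilon bounds, and the shrink-by-one rule on success are all my own choices.

```python
import math
import random
from collections import deque

def entropy_epsilon(q_values, eps_min=0.05, eps_max=0.9):
    """Map the normalized entropy of a softmax over Q-values to an
    exploration rate: near-uniform Q-values (high entropy, little learned)
    give a large epsilon; one clearly dominant action gives a small one.
    The specific mapping is an assumption, not the thesis's formula."""
    m = max(q_values)
    exp_q = [math.exp(q - m) for q in q_values]   # shift for numerical stability
    total = sum(exp_q)
    probs = [e / total for e in exp_q]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    h_max = math.log(len(q_values))               # entropy of the uniform distribution
    return eps_min + (eps_max - eps_min) * (h / h_max if h_max > 0 else 1.0)

class DecreasingTabuList:
    """FIFO Tabu list whose capacity shrinks as the goal is reached more often."""

    def __init__(self, capacity=8, min_capacity=1):
        self.capacity = capacity
        self.min_capacity = min_capacity
        self.items = deque()

    def add(self, state_action):
        if len(self.items) >= self.capacity:
            self.items.popleft()                  # release the first (oldest) action
        self.items.append(state_action)

    def __contains__(self, state_action):
        return state_action in self.items

    def on_goal_reached(self):
        # Fewer actions are forbidden late in training, so exploitation
        # gradually takes over as successes accumulate (shrink-by-one is
        # an illustrative choice).
        self.capacity = max(self.min_capacity, self.capacity - 1)
        while len(self.items) > self.capacity:
            self.items.popleft()

def select_action(q_row, state, tabu):
    """Entropy-driven epsilon-greedy choice over actions not in the Tabu list."""
    eps = entropy_epsilon(q_row)
    allowed = [a for a in range(len(q_row)) if (state, a) not in tabu]
    if not allowed:                               # everything Tabu: fall back to all actions
        allowed = list(range(len(q_row)))
    if random.random() < eps:
        action = random.choice(allowed)
    else:
        action = max(allowed, key=lambda a: q_row[a])
    tabu.add((state, action))                     # taken action enters the Tabu list
    return action
```

In a Q-learning loop one would call select_action(Q[s], s, tabu) at each step and tabu.on_goal_reached() at the end of each successful episode, so the Tabu list steadily loosens as the agent succeeds.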
Table of Contents
Thesis Approval Form
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
List of Symbols
Chapter 1 Introduction
1.1 Motivation
1.2 Thesis Organization
Chapter 2 Background
2.1 Markov Decision Processes
2.2 Reinforcement Learning
2.2.1 Q-Learning
2.3 Tabu Search
Chapter 3 Proposed Method
3.1 Decision-Tree-Based Adaptive State Space Partitioning
3.1.1 Partial Markov Decision Processes
3.1.2 Building the Decision Tree
3.2 The Exploitation-Exploration Trade-off
3.2.1 Decreasing Tabu Search
3.2.2 Adaptive Exploration Strategy
Chapter 4 Experimental Results
4.1 Maze Simulation Experiments
Chapter 5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
References
References
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
[2] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement Learning: A Survey,” Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.
[3] 李天岩, “熵 (Entropy),” 數學傳播 (Mathmedia), vol. 13, no. 3.
[4] C. J. C. H. Watkins and P. Dayan, “Technical Note: Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279-292, 1992.
[5] M. Abramson and H. Wechsler, “Tabu Search Exploration for On-policy Reinforcement Learning,” in Proceedings of the International Joint Conference on Neural Networks, vol. 4, pp. 2910-2915, 2003.
[6] X. Zhang and Z. Liu, “An Optimized Q-Learning Algorithm Based on the Thinking of Tabu Search,” in Proceedings of the International Symposium on Computational Intelligence and Design (ISCID '08), vol. 1, pp. 533-536, 2008.
[7] M. Tokic, “Adaptive ε-greedy Exploration in Reinforcement Learning Based on Value Difference,” Advances in Artificial Intelligence, vol. 6359, pp. 203-210, 2010.
[8] A. F. Atiya, A. G. Parlos, and L. Ingber, “A Reinforcement Learning Method Based on Adaptive Simulated Annealing,” in Proceedings of the 46th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS '03), vol. 1, pp. 121-124, 2003.
[9] M. A. Wiering and H. van Hasselt, “Ensemble Algorithms in Reinforcement Learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 38, no. 4, pp. 930-936, 2008.
[10] M. Coggan, “Exploration and Exploitation in Reinforcement Learning,” in Proceedings of the 4th International Conference on Computational Intelligence and Multimedia Applications, pp. 1-44, 2001.
[11] T. K. Das, A. Gosavi, S. Mahadevan, and N. Marchalleck, “Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning,” Management Science, vol. 45, pp. 560-574, 1999.
[12] R. Dearden, N. Friedman, and D. Andre, “Model Based Bayesian Exploration,” in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 150-159, 1999.
[13] 陳昱仁, “基於自我組織決策樹多重代理人之策略分享機制” (A strategy-sharing mechanism for multiple agents based on self-organizing decision trees), Ph.D. dissertation, National Chung Cheng University, Chiayi, 2009.
[14] 羅嘉耀, “以一種創新的自適應性探索策略革新加強式學習理論之架構” (Innovating the framework of reinforcement learning with a novel adaptive exploration strategy), Ph.D. dissertation, National Chung Cheng University, Chiayi, 2012.
[15] W.-Y. Loh and Y.-S. Shih, “Split Selection Methods for Classification Trees,” Statistica Sinica, vol. 7, pp. 815-840, 1997.
Fulltext
This electronic fulltext is licensed only for personal, non-commercial retrieval, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: user-defined release period
Available:
On campus: available
Off campus: available


Printed copies
Availability information for printed copies is relatively complete from the 102nd academic year (2013-2014) onward. To look up the availability of printed copies from the 101st academic year (2012-2013) or earlier, please contact the printed thesis service desk at the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
