論文使用權限 Thesis access permission: 校內校外完全公開 unrestricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available
論文名稱 Title: 可參考旁觀者暗示性評斷修正之加強式學習 (A reinforcement learning method with implicit critics from a bystander)
系所名稱 Department:
畢業學年期 Year, semester:
語文別 Language:
學位類別 Degree:
頁數 Number of pages: 54
研究生 Author:
指導教授 Advisor:
召集委員 Convenor:
口試委員 Advisory Committee:
口試日期 Date of Exam: 2016-08-17
繳交日期 Date of Submission: 2016-08-28
關鍵字 Keywords: Stochastic reinforcement learning, Reinforcement learning, Actor critic, Deep learning, Facial expression recognition
統計 Statistics: The thesis/dissertation has been browsed 5679 times and downloaded 769 times.
中文摘要 Chinese Abstract
In reinforcement learning, an agent gains experience through countless trials; with continued training it can learn the behaviors needed to complete different tasks. In a human-machine collaborative environment, however, the agent interacts not only with the environment but is also closely coupled to the user's behavior and intentions. This thesis takes the well-known actor-critic architecture in reinforcement learning as the scaffold of a Q-learning mechanism and introduces the concept of stochastic action generation to solve the problem that traditional Q-learning cannot produce continuous actions. The proposed learning architecture, Actor-Critic-Q (ACQ) learning, is then extended so that the robot not only learns designated behaviors from a built-in reward function, but also revises those built-in behavior patterns by regularly observing the user's facial expressions, gaining experience from its contact with the user and thereby learning customized behavior. In other words, the learning agent learns its behavior policy from the predefined reward function and, in addition, adjusts that policy through its interactions with people, achieving policy customization. Facial expression recognition is trained with deep learning; once trained, the recognized expressions are mapped to two states, positive and negative, and this signal serves as the reward signal of a second ACQ. The results of a swamp-crossing experiment show that the proposed dual-ACQ learning architecture indeed allows the robot to shift from the habitual behavior learned from the built-in reward function to behavior that conforms to the user's implicit critiques.
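The abstract above describes collapsing a recognized facial expression into a positive/negative signal that serves as the reward for the second ACQ module. A minimal sketch of such a mapping follows; the label set and its grouping are illustrative assumptions, not the thesis's actual expression classes.

```python
# Illustrative sketch only: the expression labels and their positive/negative
# grouping below are assumptions, not taken from the thesis.
POSITIVE = {"happy", "surprise"}
NEGATIVE = {"angry", "sad", "disgust", "fear"}

def expression_reward(label: str) -> int:
    """Collapse a multi-class expression label into a dichotomous reward
    usable as the implicit-critic signal of a second learning module."""
    if label in POSITIVE:
        return +1
    if label in NEGATIVE:
        return -1
    return 0  # neutral or unrecognized: no implicit critique from the user

print(expression_reward("happy"), expression_reward("angry"))  # → 1 -1
```

The dichotomy keeps the user-facing reward channel as simple as the environment's built-in reward, so the same learning machinery can consume both.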
Abstract |
In reinforcement learning, an agent accumulates experience through numerous trials to learn behavior policies for different tasks. In a human-computer cooperative environment, however, the agent interacts not only with the environment but also with the humans in it. This thesis adopts the actor-critic model, one of the most popular reinforcement learning methods, as the scaffold of the proposed Q-learning. The method introduces a stochastic action-generating function to address the problem of producing continuous actions, which traditional Q-learning cannot solve. Building on this, the thesis designs a reinforcement learning architecture, called Actor-Critic-Q (ACQ), that allows the agent to learn a behavior policy from its original reward function while also modifying its built-in behavior by observing the user's emotions. That is, the agent gains experience from the user's implicit critiques, facial expressions in this case, to learn customized behavior. For facial expression recognition, deep learning is used to train the recognition ability of the learning agent. In recall of the neural network, facial expressions are classified into a dichotomy, positive or negative, and this signal serves as the reward signal to the other module of the dual ACQs. The experiments show that the proposed dual-ACQ architecture allows the agent to shift from the behavior policy learned with the original reward function to a compromise policy that takes the user's implicit critiques into account.
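The stochastic action generation described above is in the spirit of stochastic real-valued reinforcement learning: sample a continuous action from a Gaussian centered on the actor's current mean and reinforce it against a critic's baseline, so the mean drifts toward actions that beat expectations. The toy task, learning rates, and annealing schedule below are illustrative assumptions, not the thesis's ACQ equations.

```python
import random

def train_srv(target=0.7, episodes=3000, lr_actor=0.05, lr_critic=0.1, seed=0):
    """Sketch of stochastic continuous-action learning (assumed setup):
    the actor keeps a mean mu, samples a ~ N(mu, sigma), and moves mu
    toward actions whose reward exceeds the critic's baseline estimate."""
    rng = random.Random(seed)
    mu, baseline = 0.0, 0.0      # actor's action mean, critic's reward estimate
    sigma = 0.3                  # exploration width, annealed below
    for _ in range(episodes):
        a = rng.gauss(mu, sigma)             # stochastic continuous action
        r = -(a - target) ** 2               # hypothetical task reward
        delta = r - baseline                 # reinforcement vs. expectation
        mu += lr_actor * delta * (a - mu)    # push mean toward better-than-expected actions
        baseline += lr_critic * delta        # critic tracks expected reward
        sigma = max(0.02, sigma * 0.999)     # explore less as learning settles
    return mu

print(train_srv())  # mean converges near the (hypothetical) target action 0.7
```

Sampling from a distribution rather than taking an argmax over a discrete action table is what lets this scheme emit continuous actions, which is the gap in traditional Q-learning that the thesis identifies.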
目次 Table of Contents |
摘要 Abstract (Chinese)
Abstract
目錄 Table of Contents
圖次 List of Figures
表次 List of Tables
I. Introduction
    1.1 Motivation
    1.2 Thesis organization
II. Background
    2.1 Reinforcement learning
    2.2 Actor Critic
    2.3 Continuous State and Action Q-Learning
        2.3.1 Adaptive Critic Methods
        2.3.2 CMAC-Based Q-Learning
        2.3.3 Q-AHC
        2.3.4 The proposed method
    2.4 Stochastic reinforcement learning
    2.5 Deep learning
    2.6 Optical flow
III. Proposed Method
    3.1 Facial expression recognition with a stacked sparse autoencoder
    3.2 Actor-Critic-Q
    3.3 Actor-Critic-Q with continuous actions
    3.4 Dual ACQ
IV. Simulation Results
    4.1 Facial expression recognition results
    4.2 Comparison of discrete and continuous actions
    4.3 Comparison of the influence of different [ ]
    4.4 Human intervention
        4.4.1 Maze 1
        4.4.2 Maze 2
V. Conclusions and Future Work
    5.1 Conclusion
    5.2 Future work
REFERENCES
電子全文 Fulltext |
This electronic full text is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please observe the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
紙本論文 Printed copies |
Public-access information for printed copies is relatively complete from academic year 102 (ROC calendar) onward. To inquire about the access status of printed copies from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. 開放時間 Available: 已公開 available