1 Introduction to Reinforcement Learning
Freek Stulp

2 Overview
General principles of RL
Markov Decision Process as model
Values of states: V(s)
Values of state-actions: Q(a,s)
Exploration vs. Exploitation
Issues in RL
Conclusion

3 General principles of RL
Neural networks are supervised learning algorithms: for each input, we know the desired output.
What if we don't know the output for each input? (Example: a flight control system.)
Idea: let the agent learn how to achieve certain goals itself, through interaction with the environment.

4 General principles of RL
Let the agent learn how to achieve certain goals itself, through interaction with the environment. By itself, this does not solve the problem!
Rewards are used to specify goals (example: training a dog).
(Diagram: the agent sends an action to the environment; the environment returns a percept and a reward.)
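To make this interaction loop concrete, here is a minimal sketch (not from the slides): a toy corridor environment and an agent that acts at random. The ToyEnvironment class, its step() method, and the reward values are made-up assumptions for illustration only.

```python
import random

class ToyEnvironment:
    """Hypothetical 1-D corridor: the agent starts at position 0 and wants to reach position 3."""
    def __init__(self):
        self.position = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the percept is simply the new position
        self.position = max(0, min(3, self.position + action))
        reward = 1.0 if self.position == 3 else -0.1   # goal reward vs. small step cost
        done = self.position == 3
        return self.position, reward, done

env = ToyEnvironment()
done = False
while not done:
    action = random.choice([-1, +1])                 # the agent chooses an action
    percept, reward, done = env.step(action)         # the environment returns percept and reward
    print(f"action={action:+d} percept={percept} reward={reward}")
```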

5 Popular model: MDPs
A Markov Decision Process is a tuple {S, A, R, T}:
Set of states S
Set of actions A
Reward function R
Transition function T
Markov property: the transition probability T_ss' depends only on s and s', not on the earlier history.
Policy: π: S → A
Problem: find the policy π that maximizes the reward.
Discounted reward: r_0 + γ r_1 + γ^2 r_2 + ... + γ^n r_n
(Diagram: a trajectory s_0 → a_0, r_0 → s_1 → a_1, r_1 → s_2 → a_2, r_2 → s_3.)
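As a small worked illustration of the discounted reward (the reward sequence and the value of γ below are made-up numbers, not taken from the slides):

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example trajectory rewards r_0..r_3 (made-up values)
print(discounted_return([0.0, 0.0, 1.0, 10.0], gamma=0.9))  # 0.81 + 7.29 = 8.1
```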

6 Values of states: V^π(s)
Definition of the value V^π(s): the cumulative reward obtained when starting in state s and executing policy π until a terminal state is reached.
The optimal policy yields V*(s).
(Grid-world figure: the rewards R per cell, the values V^π(s) under a random policy, and the values V*(s) under the optimal policy.)
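A sketch of how V^π(s) could be estimated empirically, assuming a tiny chain world with a -1 reward per step (this toy world is invented here; it is not the grid world of the slide): average the cumulative reward of many rollouts under the policy.

```python
import random

# Hypothetical 4-state chain: states 0..3, state 3 is terminal, each step costs -1.
TERMINAL = 3

def step(state, action):
    next_state = max(0, min(TERMINAL, state + action))
    return next_state, -1.0

def rollout_value(start, policy, max_steps=100):
    """Cumulative reward of one episode from `start` while following `policy`."""
    state, total = start, 0.0
    for _ in range(max_steps):
        if state == TERMINAL:
            break
        state, reward = step(state, policy(state))
        total += reward
    return total

random_policy = lambda s: random.choice([-1, +1])
optimal_policy = lambda s: +1            # always move toward the terminal state

for s in range(TERMINAL):
    v_pi = sum(rollout_value(s, random_policy) for _ in range(2000)) / 2000
    v_star = rollout_value(s, optimal_policy)
    print(f"state {s}: V_pi(s) ≈ {v_pi:.1f}   V*(s) = {v_star:.1f}")
```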

7 Determining V^π(s)
Dynamic programming: V(s) = R(s) + γ Σ_s' T_ss' V(s')
- Necessary to consider all states.
TD-learning: V(s) ← V(s) + α (R(s) + γ V(s') - V(s))
+ Only the visited states are used.
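Below is a minimal sketch of the TD update for V(s) in the same kind of toy chain world; the learning rate α, discount γ, and episode count are illustrative assumptions, not values from the slides.

```python
import random

TERMINAL = 3
ALPHA, GAMMA = 0.1, 1.0          # learning rate and discount (illustrative choices)
V = [0.0] * (TERMINAL + 1)       # value table, one entry per state

for episode in range(5000):
    s = 0
    while s != TERMINAL:
        a = random.choice([-1, +1])                  # random policy
        s_next = max(0, min(TERMINAL, s + a))
        r = -1.0                                     # reward: -1 per step
        # TD update: V(s) <- V(s) + alpha * (r + gamma*V(s') - V(s))
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next

print([round(v, 1) for v in V])   # estimated V_pi(s) for the random policy
```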

8 Values of state-actions: Q(a,s)
Q-values Q(a,s): the value of doing action a in state s.
Dynamic programming: Q(a,s) = R(s) + γ Σ_s' T_ss' max_a' Q(a',s')
TD-learning: Q(a,s) ← Q(a,s) + α (R(s) + γ max_a' Q(a',s') - Q(a,s))
T does not appear in this formula: model-free learning!
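A sketch of the model-free TD update for Q(a,s) (tabular Q-learning) in the toy chain world; note that the update itself never uses the transition function T, which only lives inside the simulated environment. All constants are illustrative assumptions.

```python
import random

TERMINAL = 3
ACTIONS = [-1, +1]
ALPHA, GAMMA = 0.1, 1.0
Q = {(a, s): 0.0 for s in range(TERMINAL + 1) for a in ACTIONS}

def env_step(s, a):
    """Simulated environment; the learner never looks inside this function."""
    s_next = max(0, min(TERMINAL, s + a))
    return s_next, -1.0

for episode in range(5000):
    s = 0
    while s != TERMINAL:
        a = random.choice(ACTIONS)            # behaviour policy: explore at random
        s_next, r = env_step(s, a)
        best_next = max(Q[(a2, s_next)] for a2 in ACTIONS)
        # Q(a,s) <- Q(a,s) + alpha * (r + gamma * max_a' Q(a',s') - Q(a,s))
        Q[(a, s)] += ALPHA * (r + GAMMA * best_next - Q[(a, s)])
        s = s_next

for s in range(TERMINAL):
    print(s, {a: round(Q[(a, s)], 1) for a in ACTIONS})
```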

9 Exploration vs. Exploitation
Only exploitation: new (maybe better) paths are never discovered.
Only exploration: what is learned is never exploited.
Good trade-off: explore first to learn, exploit later to benefit.
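One common way to implement this trade-off, offered here as an assumption since the slide does not name a specific scheme, is ε-greedy action selection: explore with probability ε and exploit the current Q-values otherwise, decaying ε over time so that the agent explores first and exploits later.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore (random action), otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                           # exploration
    return max(actions, key=lambda a: Q.get((a, state), 0.0))   # exploitation

# Example: decay epsilon from 1.0 towards 0.05 over the course of learning
Q = {(+1, 0): 0.5, (-1, 0): -0.2}   # made-up Q-values
for episode in range(5):
    epsilon = max(0.05, 1.0 - episode / 4)
    print(episode, epsilon, epsilon_greedy(Q, 0, [-1, +1], epsilon))
```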

10 Some issues
Hidden state: if you don't know where you are, you can't know what to do.
Curse of dimensionality: very large state spaces.
Continuous state/action spaces: all the algorithms above use discrete tables; what about continuous values?
Many of your articles discuss solutions to these problems.

11 Conclusion
RL: learning through interaction and rewards.
The Markov Decision Process is a popular model.
Values of states: V(s)
Values of state-actions: Q(a,s) (model-free!)
Still some problems... not quite ready for complex real-world problems yet, but research is underway!

12 Literature
Artificial Intelligence: A Modern Approach. Stuart Russell and Peter Norvig.
Machine Learning. Tom M. Mitchell.
Reinforcement Learning: A Tutorial. Mance E. Harmon and Stephanie S. Harmon.
