1 Bayesian Reinforcement Learning (Machine Learning RCC, 16th June 2011)

2 Outline
– Introduction to Reinforcement Learning
– Overview of the field
– Model-based BRL
– Model-free RL

3 References
– ICML-07 Tutorial, P. Poupart, M. Ghavamzadeh, Y. Engel
– Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto

4 Machine Learning
– Unsupervised Learning
– Reinforcement Learning
– Supervised Learning

5 Definitions
– State: the situation the agent is in
– Action: the choice the agent makes in a state
– Reward: scalar feedback received after each action
– Policy: a mapping from states to actions
– Reward function: a mapping from states and actions to rewards

6 Markov Decision Process
[Diagram: a trajectory x0, a0, r0, x1, a1, r1, ... generated by the policy, the transition probability and the reward function]
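The slide's notation is not captured by the transcript; in the standard MDP notation used on the following slides (states x, actions a, rewards r), the three labelled components are

    \text{policy: } a_t \sim \pi(a \mid x_t), \qquad
    \text{transition probability: } x_{t+1} \sim P(x' \mid x_t, a_t), \qquad
    \text{reward function: } r_t = r(x_t, a_t)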

7 Value Function
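The definitions on this slide are not captured by the transcript; the standard discounted value functions (an assumption about what was shown) are

    V^\pi(x) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r_t \;\Big|\; x_0 = x, \pi \Big],
    \qquad
    Q^\pi(x, a) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t r_t \;\Big|\; x_0 = x, a_0 = a, \pi \Big]

where \gamma \in [0, 1) is the discount factor.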

8 Optimal Policy
– Assume one optimal action per state
– The optimal policy is unknown in advance
– With a known model it can be computed by Value Iteration
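For reference (standard material, not transcribed from the slide), value iteration repeatedly applies the Bellman optimality backup, and the optimal policy is the resulting greedy policy:

    V_{k+1}(x) = \max_a \Big[ r(x, a) + \gamma \sum_{x'} P(x' \mid x, a)\, V_k(x') \Big],
    \qquad
    \pi^*(x) = \arg\max_a \Big[ r(x, a) + \gamma \sum_{x'} P(x' \mid x, a)\, V^*(x') \Big]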

9 Reinforcement Learning
– RL problem: solve the MDP when the reward/transition models are unknown
– Basic idea: use samples obtained from the agent's interaction with the environment

10 Model-Based vs Model-Free RL
– Model-based: learn a model of the reward/transition dynamics, then derive the value function/policy from it
– Model-free: directly learn the value function/policy

11 RL Solutions

12 Value Function Algorithms (e.g. SARSA, Q-learning; a sketch follows below)
– Define a form for the value function
– Sample a state-action-reward sequence
– Update the value function
– Extract the optimal policy
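A minimal tabular sketch of this recipe, using Q-learning as the concrete update. It is not taken from the slides: the environment interface (env.reset, env.step, env.actions) and the hyper-parameters are assumptions for illustration.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: sample transitions, update Q, act greedily w.r.t. Q."""
    Q = defaultdict(float)                      # value-function form: a table over (state, action)
    for _ in range(episodes):
        x = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda b: Q[(x, b)])
            x_next, r, done = env.step(a)       # sample state, action, reward
            target = r if done else r + gamma * max(Q[(x_next, b)] for b in env.actions)
            Q[(x, a)] += alpha * (target - Q[(x, a)])   # update the value function
            x = x_next
    # extract the greedy policy from the learned value function
    visited = {s for (s, _) in Q}
    return {s: max(env.actions, key=lambda b: Q.get((s, b), 0.0)) for s in visited}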

13 RL Solutions
Actor-Critic
– Define a policy structure (the actor)
– Define a value function (the critic)
– Sample state-action-reward
– Update both actor and critic
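A common one-step form of these updates (standard material, not transcribed from the slide) uses the critic's TD error to adjust both components:

    \delta_t = r_t + \gamma V(x_{t+1}) - V(x_t)
    V(x_t) \leftarrow V(x_t) + \alpha\, \delta_t \quad \text{(critic)}
    \theta \leftarrow \theta + \beta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid x_t) \quad \text{(actor)}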

14 RL Solutions
Policy Search Algorithms
– Define a form for the policy
– Sample a state-action-reward sequence
– Update the policy
Example: PEGASUS (Policy Evaluation-of-Goodness And Search Using Scenarios)
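PEGASUS (Ng & Jordan) evaluates every candidate policy on the same fixed set of random "scenarios" (pre-drawn random seeds), so noisy policy evaluation becomes a deterministic score that a search procedure can compare reliably. A minimal sketch, with the simulate function and the seed-based scenario representation assumed for illustration:

def evaluate_policy(policy, scenarios, simulate):
    # Average return of `policy` over a fixed set of scenarios (random seeds).
    return sum(simulate(policy, seed) for seed in scenarios) / len(scenarios)

def policy_search(candidates, simulate, n_scenarios=100):
    scenarios = list(range(n_scenarios))   # fixed seeds: every policy sees the same randomness
    return max(candidates, key=lambda p: evaluate_policy(p, scenarios, simulate))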

15 Online vs Offline
Offline
– Use a simulator
– Policy fixed for each 'episode'
– Updates made at the end of the episode
Online
– Directly interact with the environment
– Learning happens step-by-step

16 Model-Free Solutions
1. Prediction: estimate V(x) or Q(x,a)
2. Control: extract the policy
Either can be done on-policy or off-policy.

17 Monte-Carlo Predictions (worked example: a drive into Cambridge)
State              Predicted value   Reward received   Updated value
Leave car park          -90              -13              -100
Get out of city         -83              -15               -87
Motorway                -55              -61               -72
Enter Cambridge         -11              -11               -11

18 Temporal Difference Predictions (same journey)
State              Predicted value   Reward received   Updated value
Leave car park          -90              -13               -96
Get out of city         -83              -15               -70
Motorway                -55              -61               -72
Enter Cambridge         -11              -11               -11
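The two tables differ only in the target each value is moved towards: Monte-Carlo uses the actual return observed from each state to the end of the journey, while TD(0) uses the immediate reward plus the current estimate of the next state's value. The sketch below (not from the slides) reproduces the "Updated value" columns, assuming a step size of 1 so each value moves all the way to its target.

states  = ["Leave car park", "Get out of city", "Motorway", "Enter Cambridge"]
values  = [-90, -83, -55, -11]   # predicted values V(x) before the journey
rewards = [-13, -15, -61, -11]   # reward received after leaving each state

# Monte-Carlo target: the actual return (sum of remaining rewards) from each state.
mc_targets = [sum(rewards[i:]) for i in range(len(states))]

# TD(0) target: immediate reward plus the current value of the next state (0 after the terminal state).
td_targets = [rewards[i] + (values[i + 1] if i + 1 < len(values) else 0)
              for i in range(len(states))]

for s, mc, td in zip(states, mc_targets, td_targets):
    print(f"{s:16s}  MC target: {mc:5d}   TD target: {td:4d}")
# MC targets: -100, -87, -72, -11   TD targets: -96, -70, -72, -11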

19 Advantages of TD
– No need for a model of the rewards/transitions
– Online, fully incremental
– Proven to converge under conditions on the step size
– "Usually" faster than MC methods
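The step-size conditions usually cited for this convergence result (standard material, not transcribed from the slide) are the Robbins-Monro conditions:

    \sum_{t} \alpha_t = \infty, \qquad \sum_{t} \alpha_t^2 < \infty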

20 From TD to TD(λ)
[Diagram: backup over a sequence of states and rewards ending in the terminal state]

21 From TD to TD(λ)
[Diagram continued]
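The equations on these slides are not captured by the transcript; the standard λ-return that interpolates between one-step TD and Monte-Carlo (an assumption about what was shown) is

    G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(x_{t+n}),
    \qquad
    G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

with λ = 0 recovering TD(0) and λ = 1 recovering the Monte-Carlo return.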

22 SARSA & Q-Learning (both TD-Learning)
– SARSA: on-policy, estimates the value function of the current policy
– Q-Learning: off-policy, estimates the value function of the optimal policy
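For reference (standard update rules, not transcribed from the slide), the two differ only in the bootstrap target:

    \text{SARSA: } Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \alpha \big[ r_t + \gamma\, Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t) \big]
    \text{Q-Learning: } Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \alpha \big[ r_t + \gamma \max_{a'} Q(x_{t+1}, a') - Q(x_t, a_t) \big]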

23 GP Temporal Difference
[Figure: scattered data points, panels 1 and 2]

24 [Figure: scattered data points, panels 1 and 2]
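The GPTD model of Engel, Mannor and Meir, which these slides appear to illustrate, places a Gaussian-process prior on the value function and ties it to the observed rewards through the temporal-difference structure:

    r_t = V(x_t) - \gamma\, V(x_{t+1}) + \epsilon_t, \qquad V \sim \mathcal{GP}\big(0, k(x, x')\big)

so the rewards form a linear-Gaussian observation model for V, and posterior inference yields both a value estimate and its uncertainty.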


