Bayesian Reinforcement Learning
Machine Learning RCC, 16th June 2011
Outline
– Introduction to Reinforcement Learning
– Overview of the field
– Model-based BRL
– Model-free RL
References
– ICML-07 Tutorial – P. Poupart, M. Ghavamzadeh, Y. Engel
– Reinforcement Learning: An Introduction – Richard S. Sutton and Andrew G. Barto
Machine Learning
– Unsupervised Learning
– Reinforcement Learning
– Supervised Learning
Definitions
– State
– Action
– Reward
– Policy
– Reward function
Markov Decision Process
[Diagram: agent-environment loop over states x0, x1, …, actions a0, a1, …, and rewards r0, r1, …; the policy chooses each action, the transition probability generates the next state, and the reward function generates each reward]
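The equations behind the diagram did not survive extraction; in the slide's notation, the standard pieces are:

    \[
    a_t \sim \pi(\cdot \mid x_t), \qquad
    x_{t+1} \sim P(\cdot \mid x_t, a_t), \qquad
    r_t = R(x_t, a_t)
    \]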
Value Function
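The value-function formulas are likewise missing from the extracted text; the standard discounted definitions in the same notation are:

    \[
    V^\pi(x) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x, \pi\Big], \qquad
    Q^\pi(x, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x, a_0 = a, \pi\Big]
    \]

with the Bellman equation V^π(x) = R(x, π(x)) + γ Σ_x' P(x' | x, π(x)) V^π(x').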
Optimal Policy
– Assume one optimal action per state
– The optimal value function is unknown; compute it by Value Iteration
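A minimal value-iteration sketch under assumed tabular dynamics (the arrays P and R are illustrative, not from the slides):

    import numpy as np

    def value_iteration(P, R, gamma=0.95, tol=1e-6):
        """P: (S, A, S) transition probabilities; R: (S, A) rewards."""
        S, A, _ = P.shape
        V = np.zeros(S)
        while True:
            # Bellman optimality backup: expected return, maximised over actions
            Q = R + gamma * (P @ V)        # shape (S, A)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        return V, Q.argmax(axis=1)         # optimal values and greedy policy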
Reinforcement Learning
– RL problem: solve an MDP when the reward/transition models are unknown
– Basic idea: use samples obtained from the agent's interaction with the environment
Model-Based vs Model-Free RL
– Model-Based: learn a model of the reward/transition dynamics, then derive the value function/policy from it
– Model-Free: directly learn the value function/policy
RL Solutions: Value Function Algorithms
– Define a form for the value function
– Sample a state-action-reward sequence
– Update the value function
– Extract the optimal policy
Examples: SARSA, Q-learning
RL Solutions: Actor-Critic
– Define a policy structure (the actor)
– Define a value function (the critic)
– Sample state-action-reward transitions
– Update both actor & critic (a one-step sketch follows below)
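A minimal one-step actor-critic sketch, assuming a tabular critic and a softmax policy over preferences (all names are illustrative, not from the slides):

    import numpy as np

    def actor_critic_step(theta, V, s, a, r, s2,
                          alpha_v=0.1, alpha_p=0.01, gamma=0.99):
        """theta: (S, A) softmax policy preferences (actor);
        V: (S,) state values (critic)."""
        # Critic: one-step TD error says how much better or worse
        # the transition was than expected
        delta = r + gamma * V[s2] - V[s]
        V[s] += alpha_v * delta
        # Actor: softmax policy at s, then a log-likelihood gradient step
        pi = np.exp(theta[s] - theta[s].max())
        pi /= pi.sum()
        grad = -pi
        grad[a] += 1.0          # gradient of log pi(a|s) w.r.t. theta[s]
        theta[s] += alpha_p * delta * grad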
RL Solutions: Policy Search Algorithms
– Define a form for the policy
– Sample a state-action-reward sequence
– Update the policy
Example: PEGASUS (Policy Evaluation-of-Goodness And Search Using Scenarios)
Online vs Offline
Offline
– Use a simulator
– Policy fixed for each 'episode'
– Updates made at the end of the episode
Online
– Directly interact with the environment
– Learning happens step-by-step
Model-Free Solutions
1. Prediction: estimate V(x) or Q(x,a)
2. Control: extract a policy
On-Policy vs Off-Policy
Monte-Carlo Predictions
[Figure: driving-to-Cambridge example (Leave car park, Get out of city, Motorway, Enter Cambridge), plotting predicted value and reward per stage; MC waits until the episode ends, then updates every visited state toward the actual observed return]
Temporal Difference Predictions
[Figure: same driving example; TD updates each state immediately, toward the current estimate of the next state, without waiting for the episode to end]
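A minimal sketch of the two prediction updates for tabular values with a constant step size (the episode layout and names are assumptions for illustration):

    def mc_update(V, episode, alpha=0.1, gamma=1.0):
        """Monte-Carlo: after the episode ends, move each visited
        state's value toward the actual return from that state.
        episode: list of (state, reward) pairs in visit order."""
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G            # return from this state onward
            V[state] += alpha * (G - V[state])

    def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
        """TD(0): update immediately, toward the bootstrapped target
        reward + gamma * V[next_state]."""
        target = reward + gamma * V[next_state]
        V[state] += alpha * (target - V[state])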
Advantages of TD
– Doesn't need a model of the rewards/transitions
– Online and fully incremental
– Proven to converge under standard (Robbins-Monro) step-size conditions
– "Usually" faster than MC methods
From TD to TD(λ)
[Figure: backup diagrams spanning from the one-step TD backup through n-step backups to the full Monte-Carlo backup ending at the terminal state]
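The equations behind the diagrams did not survive extraction; the standard n-step return and λ-return, written in the slide's notation, are:

    \[
    G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(x_{t+n}), \qquad
    G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}
    \]

so λ = 0 recovers the one-step TD target and λ → 1 the Monte-Carlo return, which is exactly what the diagrams interpolate between.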
SARSA & Q-Learning (TD-Learning)
– SARSA: on-policy; estimates the value function of the current policy
– Q-Learning: off-policy; estimates the value function of the optimal policy
(See the update-rule sketch below.)
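A minimal sketch of the two update rules for a tabular Q (the epsilon_greedy helper and all names are illustrative assumptions):

    import random
    from collections import defaultdict

    Q = defaultdict(float)   # tabular action values, keyed by (state, action)

    def epsilon_greedy(Q, state, actions, eps=0.1):
        # Behaviour policy usable by both algorithms
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
        # On-policy: bootstrap from a2, the action actually taken in s2
        Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

    def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.99):
        # Off-policy: bootstrap from the greedy action in s2,
        # regardless of what the behaviour policy does next
        best = max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])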
GP Temporal Difference
[Figure: two panels of scattered data points, presumably illustrating Gaussian-process regression over the value function; detail lost in extraction]
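The slide's content is largely lost; for completeness, the GPTD model of Engel et al. (the third author of the tutorial referenced above) places a GP prior on the value function and treats each observed reward as a noisy linear measurement of it:

    \[
    V \sim \mathcal{GP}\big(0, k(x, x')\big), \qquad
    r_t = V(x_t) - \gamma\, V(x_{t+1}) + N_t
    \]

Conditioning on an observed trajectory then yields a posterior mean and variance for V by standard GP regression.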