Chapter 5: Monte Carlo Methods


Chapter 5: Monte Carlo Methods
(Slides based on R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction)
- Monte Carlo methods learn from complete sample returns
- Only defined for episodic tasks
- Monte Carlo methods learn directly from experience
  - On-line: no model is necessary, and optimality can still be attained
  - Simulated: no need for a full model, only the ability to generate sample episodes

Monte Carlo Policy Evaluation
- Goal: learn Vπ(s)
- Given: some number of episodes under π which contain s
- Idea: average the returns observed after visits to s
- Every-visit MC: average returns for every time s is visited in an episode
- First-visit MC: average returns only for the first time s is visited in an episode
- Both converge asymptotically

First-visit Monte Carlo Policy Evaluation
Initialize:
  π ← policy to be evaluated
  V ← an arbitrary state-value function
  Returns(s) ← empty list, for all s ∈ S
Repeat forever:
  Generate an episode using π
  For each state s appearing in the episode:
    R ← return following the first occurrence of s
    Append R to Returns(s)
    V(s) ← average(Returns(s))
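The pseudocode above translates fairly directly into Python. A minimal sketch, assuming a hypothetical generate_episode() helper (not part of the slides) that returns one episode as a list of (state, reward) pairs, each reward being the one received after visiting that state:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Estimate V(s) for a fixed policy by averaging first-visit returns.

    `generate_episode` is an assumed helper returning one episode as a list
    of (state, reward) pairs, where the reward follows the visit to the state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode()
        # Index of the first occurrence of each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards, accumulating the return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:          # first-visit check
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```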

Blackjack example
- Objective: have your card sum be greater than the dealer's without exceeding 21
- States (200 of them):
  - current sum (12-21)
  - dealer's showing card (ace-10)
  - do I have a usable ace?
- Reward: +1 for winning, 0 for a draw, -1 for losing
- Actions: stick (stop receiving cards), hit (receive another card)
- Policy: stick if my sum is 20 or 21, else hit
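For concreteness, the fixed policy from this slide can be written down directly; the (player_sum, dealer_showing, usable_ace) state layout is an assumption made here for illustration:

```python
# State: (player_sum, dealer_showing, usable_ace), e.g. (15, 10, False).
HIT, STICK = 0, 1

def threshold_policy(state):
    """Policy from the slide: stick on 20 or 21, otherwise hit."""
    player_sum, dealer_showing, usable_ace = state
    return STICK if player_sum >= 20 else HIT
```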

Blackjack Value Functions
- After many MC evaluations of state visits, the state-value function is well approximated
- The quantities Dynamic Programming needs are difficult to formulate here: for instance, given a hand and a decision to stick, what is the expected return?

Backup Diagram for Monte Carlo
- The entire episode is included in the backup, while DP backs up only one-step transitions
- Only one choice (the sampled action) at each state, unlike DP, which considers all possible transitions in one step
- Estimates for each state are independent, so MC does not bootstrap (build on other estimates)
- The time required to estimate one state does not depend on the total number of states

The Power of Monte Carlo
Example: an elastic membrane (Dirichlet problem). How do we compute the shape of a membrane or bubble attached to a fixed frame?

Two Approaches
- Relaxation: iterate over the grid, replacing each interior value with the average of its neighbours (like DP iterations)
- Kakutani's algorithm, 1945: run many random walks from the point of interest and average the boundary values they hit (like the MC approach)
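A minimal sketch of the random-walk (Kakutani-style) approach, assuming hypothetical is_boundary and boundary_value helpers describing the frame, and a grid on which every walk eventually reaches the frame:

```python
import random

def membrane_height(start, boundary_value, is_boundary, num_walks=10000):
    """Estimate the Dirichlet-problem solution at `start` by averaging the
    boundary values reached by unbiased random walks (Kakutani's method).

    `is_boundary(point)` and `boundary_value(point)` are assumed helpers
    describing the fixed frame; `start` is an interior grid point (x, y).
    The frame is assumed to enclose the walk, so every walk terminates.
    """
    total = 0.0
    for _ in range(num_walks):
        x, y = start
        while not is_boundary((x, y)):
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += boundary_value((x, y))
    return total / num_walks
```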

Monte Carlo Estimation of Action Values (Q)
- Monte Carlo is most useful when a model is not available: we want to learn Q*
- Qπ(s, a): average return starting from state s, taking action a, and thereafter following π
- Converges asymptotically if every state-action pair is visited
- To assure this we must maintain exploration so that many state-action pairs are visited
- Exploring starts: every state-action pair has a non-zero probability of being the starting pair

Monte Carlo Control (to approximate an optimal policy)
- Generalized Policy Iteration (GPI)
- MC policy iteration: policy evaluation by approximating Qπ with MC methods, followed by policy improvement
- Policy improvement step: make the policy greedy with respect to the current Qπ (action-value) function; no model is needed to construct the greedy policy
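The improvement half of this GPI loop needs no model: it only looks up the current action-value estimates. A minimal sketch, assuming Q is a dict keyed by (state, action) pairs:

```python
def greedy_policy_from_q(Q, actions):
    """Model-free policy improvement: pi(s) = argmax_a Q(s, a).

    `Q` is assumed to be a dict mapping (state, action) to an estimated
    return; `actions` is the list of available actions.
    """
    policy = {}
    states = {s for (s, _) in Q}
    for s in states:
        policy[s] = max(actions, key=lambda a: Q.get((s, a), float('-inf')))
    return policy
```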

Convergence of MC Control
- The policy improvement theorem tells us that the greedy policy π_{k+1} is at least as good as π_k:
  Q_{π_k}(s, π_{k+1}(s)) = max_a Q_{π_k}(s, a) ≥ Q_{π_k}(s, π_k(s)) = V_{π_k}(s)
- This assumes exploring starts and an infinite number of episodes for MC policy evaluation
- To remove the second assumption, either:
  - update only until a given level of performance is reached, or
  - alternate between evaluation and improvement after each episode

Monte Carlo Exploring Starts (MC ES)
- The fixed point of this process is the optimal policy π*
- A formal proof of convergence is still one of the fundamental open questions of RL
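A minimal Python sketch of the Monte Carlo ES idea (exploring starts plus per-episode greedy improvement), assuming hypothetical env.random_start() and env.step(state, action) helpers rather than any fixed environment API:

```python
import random
from collections import defaultdict

def mc_exploring_starts(env, actions, num_episodes, gamma=1.0):
    """Monte Carlo ES control sketch: every episode starts from a random
    (state, action) pair, then follows the current greedy policy.

    `env.random_start()` and `env.step(state, action)` (returning
    next_state, reward, done) are assumed helpers, not a fixed API.
    """
    Q = defaultdict(float)
    N = defaultdict(int)
    policy = {}

    for _ in range(num_episodes):
        # Exploring start: a random state-action pair begins the episode.
        state, action = env.random_start()
        episode = []
        done = False
        while not done:
            next_state, reward, done = env.step(state, action)
            episode.append((state, action, reward))
            state = next_state
            if not done:
                action = policy.get(state, random.choice(actions))

        # First-visit updates, walking backwards through the episode.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]   # incremental mean
                policy[s] = max(actions, key=lambda b: Q[(s, b)])
    return policy, Q
```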

Blackjack Example continued
- Exploring starts are easy to enforce here by generating the initial state-action pairs randomly
- Initial policy as described before (stick only on 20 or 21)
- Initial action-value function: zero everywhere

On-policy Monte Carlo Control
- On-policy: learn about or improve the policy that is currently being executed
- How do we get rid of exploring starts? We need soft policies: π(s, a) > 0 for all s and a
- Example: ε-soft (ε-greedy) policies, with action-selection probabilities
  - greedy action: 1 - ε + ε/|A(s)|
  - each non-max action: ε/|A(s)|
- Similar to GPI: move the policy towards the greedy policy while keeping it ε-soft
- Converges to the best ε-soft policy
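An ε-soft policy is easy to realize with ε-greedy action selection; a minimal sketch, with Q assumed to be a dict keyed by (state, action):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """epsilon-soft selection: with probability epsilon pick uniformly at
    random, otherwise pick the greedy action. Every action therefore keeps
    selection probability of at least epsilon / len(actions)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```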

On-policy MC Control (algorithm)

Learning about π while following another policy

Off-policy Monte Carlo Control
- On-policy methods estimate the value of a policy while following it, so the policy that generates behavior and the policy being evaluated are identical
- Off-policy methods assume separate behavior and estimation policies:
  - the behavior policy generates behavior in the environment; it may be randomized so that all actions are sampled
  - the estimation policy is the policy being learned about; it may be deterministic (e.g. greedy)
- Off-policy methods estimate one policy while following another
- Returns from the behavior policy are averaged, which requires the behavior policy to assign nonzero probability to every action the estimation policy might select
- May be slow to improve, since only the portion of each episode after the last nongreedy action can be used
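One common way to average returns from the behavior policy is weighted importance sampling; the sketch below follows that standard formulation for evaluating a target policy, not necessarily the exact algorithm box from the slides, and assumes hypothetical generate_behavior_episode, target_prob, and behavior_prob helpers:

```python
from collections import defaultdict

def off_policy_mc_prediction(generate_behavior_episode, target_prob,
                             behavior_prob, num_episodes, gamma=1.0):
    """Off-policy MC evaluation of Q for a target policy, using weighted
    importance sampling over episodes from a behavior policy.

    Assumed helpers: `generate_behavior_episode()` returns a list of
    (state, action, reward) tuples; `target_prob(s, a)` and
    `behavior_prob(s, a)` give each policy's probability of taking a in s.
    """
    Q = defaultdict(float)
    C = defaultdict(float)   # cumulative importance-sampling weights

    for _ in range(num_episodes):
        episode = generate_behavior_episode()
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= target_prob(s, a) / behavior_prob(s, a)
            if W == 0.0:     # target policy would never take this action
                break
    return Q
```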

Off-policy MC Control (algorithm)

Incremental Implementation
- MC can be implemented incrementally, which saves memory
- We compute a weighted average of the observed returns
- Non-incremental form: V_n = (w_1 R_1 + ... + w_n R_n) / (w_1 + ... + w_n)
- Incremental equivalent: V_{n+1} = V_n + (w_{n+1} / W_{n+1}) (R_{n+1} - V_n), where W_{n+1} = W_n + w_{n+1} and W_0 = 0
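The incremental form keeps only the current estimate and the cumulative weight, yet yields the same result as recomputing the full weighted average. A minimal sketch of the update V ← V + (w/W)(R - V):

```python
class WeightedRunningAverage:
    """Incremental weighted average of returns: storing only the current
    estimate and the cumulative weight reproduces the non-incremental
    average of all (return, weight) pairs seen so far."""

    def __init__(self):
        self.value = 0.0          # current weighted-average estimate V
        self.total_weight = 0.0   # cumulative weight W

    def update(self, ret, weight):
        if weight == 0.0:
            return self.value
        self.total_weight += weight
        # V <- V + (w / W) * (R - V)
        self.value += (weight / self.total_weight) * (ret - self.value)
        return self.value
```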

Racetrack Exercise
- States: grid squares plus horizontal and vertical velocity components
- Rewards: -1 per step on the track, -5 for going off the track
- Only right turns are required
- Actions: add +1, -1, or 0 to each velocity component
- Velocity bounds: 0 < velocity < 5
- Stochastic dynamics: 50% of the time the car moves one extra square up or to the right
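The slide's stochastic dynamics are open to interpretation; one possible reading, with the extra drift going up or right with equal chance, is sketched below purely for illustration:

```python
import random

def next_position(x, y, vx, vy):
    """One racetrack move under one reading of the slide's dynamics:
    apply the current velocity, then 50% of the time drift one extra
    square, either up or to the right (chosen with equal probability)."""
    x, y = x + vx, y + vy
    if random.random() < 0.5:
        if random.random() < 0.5:
            x += 1   # extra square to the right
        else:
            y += 1   # extra square up
    return x, y
```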

Summary
- MC has several advantages over DP:
  - it can learn directly from interaction with the environment
  - no need for a full model
  - no need to learn about ALL states
  - less harmed by violations of the Markov property (later in the book)
- MC methods provide an alternative policy evaluation process
- One issue to watch for: maintaining sufficient exploration (exploring starts, soft policies)
- No bootstrapping (as opposed to DP)