Slide 1: Monte-Carlo Methods
Learning methods that average complete episodic returns.
Slides based on [Sutton & Barto: Reinforcement Learning: An Introduction, 1998]

Slide 2: Differences with DP/TD
Differences with DP methods:
- Real RL: a complete transition model is not necessary; they sample experience and can be used for direct learning
- They do not bootstrap: no evaluation of successor states
Differences with TD methods:
- Well, they do not bootstrap either; instead they average episodic returns
Slides prepared by Georgios Chalkiadakis

Slide 3: Overview and Advantages
Learn from experience, i.e. sample episodes:
- Sample sequences of states, actions, rewards
- Either on-line, or from simulated (model-based) interactions with the environment; no complete model required
Advantages:
- Provably learn the optimal policy without a model
- Can be used with sample / easy-to-produce models
- Can easily focus on interesting state regions
- More robust with respect to Markov property violations

Slide 4: Policy Evaluation

Slide 5: Action-value functions required
Without a model, we need Q-value estimates. MC methods average the returns observed following visits to state-action pairs.
All such pairs "need" to be visited, so sufficient exploration is required:
- Randomize episode starts ("exploring starts")
- ...or behave using a stochastic (e.g. ε-greedy) policy
- ...hence "Monte-Carlo"
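As a concrete sketch, first-visit MC prediction for Q-values averages the returns that follow the first visit to each state-action pair in every episode. The `generate_episode` callable below is an assumed helper (not from the slides) standing in for whatever produces complete (state, action, reward) trajectories:

```python
from collections import defaultdict

def mc_q_evaluation(generate_episode, num_episodes, gamma=1.0):
    """First-visit Monte-Carlo prediction for Q-values (sketch).

    `generate_episode` is an assumed helper returning one complete
    episode as a list of (state, action, reward) triples.
    """
    returns_sum = defaultdict(float)   # cumulative return per (s, a)
    returns_cnt = defaultdict(int)     # number of first visits per (s, a)
    Q = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode()
        # Record the earliest index of each (state, action) pair.
        first_visit = {}
        for i, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), i)
        # Compute the return G following each step, working backwards.
        G, returns_at = 0.0, {}
        for i in reversed(range(len(episode))):
            s, a, r = episode[i]
            G = gamma * G + r
            returns_at[i] = G
        # Average returns over first visits only.
        for (s, a), i in first_visit.items():
            returns_sum[(s, a)] += returns_at[i]
            returns_cnt[(s, a)] += 1
            Q[(s, a)] = returns_sum[(s, a)] / returns_cnt[(s, a)]
    return Q
```

With enough episodes and sufficient exploration, these sample averages converge to the true action values under the behaviour policy.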

Slide 6: Monte-Carlo Control (to generate the optimal policy)
For now, assume "exploring starts". Does "policy iteration" work? Yes!
- Evaluation of each policy is carried out over multiple episodes
- Improvement makes the policy greedy with respect to the current Q-value function

Slide 7: Monte-Carlo Control (to generate the optimal policy)
Why? Suppose π′ is greedy with respect to Q^π. Then the policy-improvement theorem applies, because for all s:

Q^π(s, π′(s)) = Q^π(s, argmax_a Q^π(s, a)) = max_a Q^π(s, a) ≥ Q^π(s, π(s)) = V^π(s)

Thus π′ is uniformly better than (or as good as) π.

Slide 8: A Monte-Carlo control algorithm
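A minimal sketch of Monte-Carlo control with exploring starts. The two-state MDP (`step`, its states, actions, and rewards) is an illustrative assumption, not from the slides; evaluation averages first-visit returns, and improvement makes the policy greedy with respect to the current Q:

```python
import random
from collections import defaultdict

ACTIONS = ['L', 'R']

def step(state, action):
    """Toy deterministic MDP (illustrative assumption): from state 0,
    action 'R' ends the episode with reward +1 and 'L' moves to state 1;
    from state 1 every action ends the episode with reward 0."""
    if state == 0:
        return (None, 1.0, True) if action == 'R' else (1, 0.0, False)
    return (None, 0.0, True)

def mc_es_control(num_episodes=2000, gamma=0.9, seed=0):
    """Monte-Carlo control with exploring starts: evaluate by averaging
    first-visit returns, improve by acting greedily w.r.t. Q."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    policy = {0: 'L', 1: 'L'}

    for _ in range(num_episodes):
        # Exploring start: random initial state-action pair.
        s, a = rng.choice([0, 1]), rng.choice(ACTIONS)
        episode, done = [], False
        while not done:
            s2, r, done = step(s, a)
            episode.append((s, a, r))
            if not done:
                s, a = s2, policy[s2]
        # Work backwards through the episode, updating first visits only.
        G = 0.0
        for i in reversed(range(len(episode))):
            s_i, a_i, r_i = episode[i]
            G = gamma * G + r_i
            if (s_i, a_i) not in [(e[0], e[1]) for e in episode[:i]]:
                returns_sum[(s_i, a_i)] += G
                returns_cnt[(s_i, a_i)] += 1
                Q[(s_i, a_i)] = returns_sum[(s_i, a_i)] / returns_cnt[(s_i, a_i)]
                # Policy improvement: greedy w.r.t. the current Q-values.
                policy[s_i] = max(ACTIONS, key=lambda act: Q[(s_i, act)])
    return policy, Q
```

On this toy MDP the learned policy picks 'R' in state 0, since that action's averaged return (+1) beats the alternative (0).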

Slide 9: ε-greedy Exploration
What about ε-greedy policies? If an action is not the greedy one, select it with probability ε/|A(s)|; otherwise (the greedy action), select it with probability 1 − ε + ε/|A(s)|.
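This selection rule can be sketched as follows. Note that drawing uniformly over A(s) with probability ε and greedily otherwise yields exactly the probabilities stated above, since the greedy action can also be drawn during the uniform step:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Select an action under an ε-greedy policy: each non-greedy action
    gets probability ε/|A(s)|, the greedy one 1 − ε + ε/|A(s)|."""
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore: uniform over A(s)
    # Exploit: the greedy action w.r.t. the current Q-value estimates.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

For example, with ε = 0 this always returns the greedy action, while ε = 1 gives uniform random exploration.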

Slide 10: Yes, policy iteration works
See the details in the book. ε-soft on-policy algorithm:
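A sketch of on-policy first-visit MC control for ε-soft policies: the agent both behaves and improves with an ε-greedy policy, so exploration never stops. The `step(s, a) -> (next_state, reward, done)` environment interface and the parameter defaults are assumptions for illustration:

```python
import random
from collections import defaultdict

def on_policy_mc_control(step, start_states, actions, num_episodes=3000,
                         gamma=0.9, epsilon=0.1, seed=0):
    """On-policy first-visit MC control for ε-soft policies (sketch).
    `step(s, a) -> (next_state, reward, done)` is an assumed environment."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    counts = defaultdict(int)

    def act(s):
        # ε-greedy behaviour: the policy stays ε-soft throughout learning.
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = rng.choice(start_states)
        episode, done = [], False
        while not done:
            a = act(s)
            s2, r, done = step(s, a)
            episode.append((s, a, r))
            s = s2
        # First-visit updates, working backwards through the episode.
        G = 0.0
        for i in reversed(range(len(episode))):
            s_i, a_i, r_i = episode[i]
            G = gamma * G + r_i
            if (s_i, a_i) not in [(e[0], e[1]) for e in episode[:i]]:
                counts[(s_i, a_i)] += 1
                # Incremental average of first-visit returns.
                Q[(s_i, a_i)] += (G - Q[(s_i, a_i)]) / counts[(s_i, a_i)]
    return Q
```

Because every action keeps probability at least ε/|A(s)|, this converges to the best ε-soft policy rather than the fully greedy optimum.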

Slide 11: ...and you can have off-policy learning as well... Why?
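One way to see why: returns generated by an exploratory behaviour policy b can be corrected toward a different target policy π using importance-sampling ratios π(a|s)/b(a|s). A sketch using weighted importance sampling with incremental updates; the `target_prob`/`behavior_prob` callables and the episode format are assumed interfaces:

```python
from collections import defaultdict

def off_policy_mc_q(episodes, target_prob, behavior_prob, gamma=1.0):
    """Off-policy MC prediction via weighted importance sampling (sketch).

    `episodes` is a list of (state, action, reward) trajectories generated
    by the behaviour policy; `target_prob(a, s)` and `behavior_prob(a, s)`
    are assumed callables giving π(a|s) and b(a|s).
    """
    Q = defaultdict(float)
    C = defaultdict(float)  # cumulative importance-sampling weights
    for episode in episodes:
        G, W = 0.0, 1.0  # return and importance weight, built backwards
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            # Weighted-average update toward the corrected return.
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= target_prob(a, s) / behavior_prob(a, s)
            if W == 0.0:  # target policy would never take this action
                break
    return Q
```

When the target and behaviour policies coincide, the ratios are 1 and this reduces to ordinary on-policy averaging.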
