
Slide 1: Monte-Carlo Methods

Learning methods that average complete episodic returns.
Slides based on [Sutton & Barto: Reinforcement Learning: An Introduction, 1998]

Slide 2: Differences with DP/TD

Differences with DP methods:
– Real RL: a complete transition model is not necessary; they sample experience, so they can be used for direct learning
– They do not bootstrap: no evaluation of successor states
Differences with TD methods:
– They do not bootstrap either; they average complete episodic returns

Slides prepared by Georgios Chalkiadakis

Slide 3: Overview and Advantages

Learn from experience, i.e. sample episodes:
– Sample sequences of states, actions, and rewards
– Either on-line, or from simulated (model-based) interactions with the environment; no complete model is required
Advantages:
– Provably learn the optimal policy without a model
– Can be used with sample or easy-to-produce models
– Can easily focus on interesting regions of the state space
– More robust with respect to violations of the Markov property

Slide 4: Policy Evaluation
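As a concrete sketch of Monte-Carlo policy evaluation, first-visit MC prediction averages the returns observed after the first visit to each state in every episode. The `generate_episode` interface below is an assumption made for illustration, not something given in the slides:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Estimate V^pi by averaging first-visit returns.

    generate_episode() is assumed to return a list of (state, reward)
    pairs produced by following the policy pi being evaluated, where
    each reward is the one received on leaving that state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = {}
    for _ in range(num_episodes):
        episode = generate_episode()
        # Record the first time step each state is visited in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk the episode backwards, accumulating the return G.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:  # only update on the first visit
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```

Note there is no bootstrapping here: each estimate is a plain average of full episodic returns, exactly the property the earlier slides contrast with DP and TD.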

Slide 5: Action-value functions required

Without a model, we need Q-value estimates. MC methods now average the returns following visits to state-action pairs. All such pairs "need" to be visited, so sufficient exploration is required:
– Randomize episode starts ("exploring starts")
– …or behave using a stochastic (e.g. ε-greedy) policy
– …thus "Monte-Carlo"

Slide 6: Monte-Carlo Control (to generate the optimal policy)

For now, assume exploring starts. Does policy iteration work? Yes, where evaluation of each policy is carried out over multiple episodes, and improvement makes the policy greedy with respect to the current Q-value function.

Slide 7: Monte-Carlo Control (to generate the optimal policy)

Why? Suppose π' is greedy with respect to Q^π. Then the policy-improvement theorem applies because, for all s:

Q^π(s, π'(s)) = max_a Q^π(s, a) ≥ Q^π(s, π(s)) = V^π(s)

Thus π' is uniformly better than (or equal to) π.

Slide 8: A Monte-Carlo control algorithm
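The algorithm figure from this slide did not survive extraction. A minimal sketch of Monte-Carlo control with exploring starts, in the spirit of Sutton & Barto's Monte Carlo ES, is below; the `env_step(s, a) -> (reward, next_state or None)` interface is an assumption for illustration:

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(env_step, states, actions, num_episodes,
                                gamma=1.0, max_steps=100):
    """First-visit Monte-Carlo control with exploring starts.

    env_step(s, a) is assumed to return (reward, next_state), with
    next_state None when the episode terminates.
    """
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {s: actions[0] for s in states}  # arbitrary initial policy
    for _ in range(num_episodes):
        # Exploring start: a random state-action pair begins the episode.
        s, a = random.choice(states), random.choice(actions)
        episode = []
        for _ in range(max_steps):
            r, s_next = env_step(s, a)
            episode.append((s, a, r))
            if s_next is None:
                break
            s = s_next
            a = policy[s]  # follow the current policy afterwards
        # Evaluation: average first-visit returns for each (s, a) pair.
        first_visit = {}
        for i, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), i)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
                # Improvement: make the policy greedy wrt the current Q.
                policy[s] = max(actions, key=lambda b: Q[(s, b)])
    return policy, Q
```

Evaluation and improvement are interleaved per episode rather than run to convergence, a common practical variant of the policy-iteration scheme described on the previous slides.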

Slide 9: ε-greedy Exploration

What about ε-greedy policies? If an action is not the greedy one, it is selected with probability ε/|A(s)|; the greedy action is selected with probability 1 − ε + ε/|A(s)|.
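A common way to realize these probabilities in code: with probability ε choose uniformly among all actions, otherwise choose the greedy action. Each non-greedy action then has probability ε/|A(s)| and the greedy action 1 − ε + ε/|A(s)|, matching the slide. A minimal sketch:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Sample an action from an epsilon-greedy policy over Q.

    With probability epsilon, pick uniformly among all actions;
    otherwise pick the greedy one. Overall, each non-greedy action
    has probability epsilon/|A(s)|, the greedy action the rest.
    """
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```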

Slide 10: Yes, policy iteration works

See the book for the details of the ε-soft on-policy algorithm.

Slide 11: …and you can have off-policy learning as well

Why? Because returns generated while following one (behavior) policy can be reweighted, via importance sampling, to evaluate a different (target) policy.
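As an illustrative sketch of off-policy Monte-Carlo evaluation with ordinary importance sampling: each return is weighted by the ratio of target-policy to behavior-policy probabilities over the actions taken after the step being evaluated. The `target_prob`/`behavior_prob` interface is an assumption for illustration, not from the slides:

```python
from collections import defaultdict

def off_policy_mc_q(episodes, target_prob, behavior_prob, gamma=1.0):
    """Off-policy MC evaluation of Q^pi via ordinary importance sampling.

    episodes: lists of (state, action, reward) generated by a behavior
    policy b; target_prob(s, a) gives pi(a|s), behavior_prob(s, a) gives
    b(a|s). b must assign positive probability to any action pi may take.
    """
    Q = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        G, W = 0.0, 1.0
        # Walk backwards so W covers only the actions *after* time t.
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            counts[(s, a)] += 1
            # Incremental ordinary-importance-sampling average of W * G.
            Q[(s, a)] += (W * G - Q[(s, a)]) / counts[(s, a)]
            W *= target_prob(s, a) / behavior_prob(s, a)
            if W == 0.0:
                break  # pi would never take this action; earlier steps get weight 0
    return Q
```

Walking the episode backwards keeps the weight W correct: when updating Q(s_t, a_t), W has accumulated ratios only for steps after t, since the first action's probability does not need correcting when estimating the value of taking that action.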
