Reinforcement Learning | Part I: Tabular Solution Methods Mini-Bootcamp
Richard S. Sutton & Andrew G. Barto, 1st ed. (1998), 2nd ed. (2018)
Presented by Nicholas Roy, Pillow Lab Meeting, June 27, 2019
RL of the tabular variety
What is special about RL? “The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit search for good behavior. Purely evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action possible.”

What is the point of Part I? “We describe almost all the core ideas of reinforcement learning algorithms in their simplest forms: that in which the state and action spaces are small enough for the approximate value functions to be represented as arrays, or tables. In this case, the methods can often find exact solutions…”
Part I: Tabular Solution Methods
Ch2: Multi-armed Bandits
Ch3: Finite Markov Decision Processes
Ch4: Dynamic Programming
Ch5: Monte Carlo Methods
Ch6: Temporal-Difference Learning
Ch7: n-step Bootstrapping
Ch8: Planning and Learning with Tabular Methods
Let’s get through the basics…
Agent/Environment, Returns, States, Discount factors, Actions, Episodic/Continuing tasks, Rewards, Policies, Markov, State-/Action-value functions, MDP, Bellman equation, Dynamics p(s’,r|s,a), Optimal policies
Agent vs. Environment
Agent: the learner and decision maker, which interacts with…
Environment: everything else
In a finite Markov decision process (MDP), the sets of states S, actions A, and rewards R each have a finite number of elements.
MDP Dynamics
The dynamics are defined completely by p(s’,r|s,a).
The dynamics have the Markov property: they depend only on the current (s,a).
We can collapse this 4-D table to get other functions of interest:
state-transition probabilities: p(s’|s,a) = Σ_r p(s’,r|s,a)
expected reward: r(s,a) = Σ_r r Σ_{s’} p(s’,r|s,a)
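For concreteness, here is a minimal sketch (not from the slides) of these two marginalizations, assuming the dynamics are stored as a NumPy array p[s_next, r_idx, s, a] over a discrete reward vector rewards; the array layout and names are illustrative assumptions.

```python
import numpy as np

def state_transition_probs(p):
    """p(s'|s,a): marginalize the joint dynamics p(s',r|s,a) over rewards."""
    # p has shape (S', R, S, A); summing out the reward axis leaves (S', S, A).
    return p.sum(axis=1)

def expected_reward(p, rewards):
    """r(s,a) = sum_r r * sum_s' p(s',r|s,a)."""
    pr = p.sum(axis=0)                         # p(r|s,a), shape (R, S, A)
    return np.tensordot(rewards, pr, axes=1)   # contract over r, shape (S, A)
```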
Rewards and Returns
Reward hypothesis: “…goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).”
The reward is a way of communicating what you want to achieve, not how to achieve it.
Return: G_t = R_{t+1} + R_{t+2} + … + R_T
Discounted return: G_t = R_{t+1} + 𝛾R_{t+2} + 𝛾²R_{t+3} + … = Σ_{k=0}^∞ 𝛾^k R_{t+k+1}
With discounting (𝛾 < 1) and bounded rewards, the return remains finite even over infinitely many time steps.
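A tiny illustrative sketch (not from the slides) of the recursive form of the return, G_t = R_{t+1} + 𝛾G_{t+1}, computed backwards over one episode’s reward list:

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t for every time step of an episode, working backwards."""
    returns = [0.0] * len(rewards)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G   # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = G
    return returns

# Three steps of reward 1 with gamma = 0.9 give returns of roughly [2.71, 1.9, 1.0].
print(discounted_returns([1, 1, 1], gamma=0.9))
```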
Policies & Value Functions
Policy: a mapping from states to probabilities of selecting each possible action, 𝜋(a|s)
State-value function: for a given policy 𝜋 and state s, the expected return G when starting in s and following 𝜋 thereafter
Action-value function: for a given policy 𝜋, state s, and action a, the expected return G when starting in s, taking a, and then following 𝜋
The existence and uniqueness of v𝜋 and q𝜋 are guaranteed as long as either 𝛾 < 1 or eventual termination is guaranteed from all states under the policy 𝜋.
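In the book’s notation, these two definitions are:

$$v_\pi(s) \doteq \mathbb{E}_\pi[\,G_t \mid S_t = s\,], \qquad q_\pi(s,a) \doteq \mathbb{E}_\pi[\,G_t \mid S_t = s,\, A_t = a\,]$$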
Bellman Property
The Bellman equation: a recursive relationship, satisfied uniquely by v𝜋 and q𝜋, between the value of a state and the values of its successor states.
An analogous equation holds for q𝜋(s,a); one can also easily convert between v𝜋 and q𝜋.
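For reference, the Bellman equation for v𝜋 as given in the book:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',\,r} p(s', r \mid s, a)\,\big[\, r + \gamma\, v_\pi(s') \,\big]$$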
Optimal Policies
There is always at least one policy that is better than or equal to all other policies; this is an optimal policy.
Although there may be more than one, we denote all of the optimal policies by 𝜋*.
They share the same state- and action-value functions, v* and q*, which satisfy the Bellman optimality equations.
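For reference, the Bellman optimality equations from the book:

$$v_*(s) = \max_a \sum_{s',\,r} p(s', r \mid s, a)\,\big[\, r + \gamma\, v_*(s') \,\big], \qquad q_*(s,a) = \sum_{s',\,r} p(s', r \mid s, a)\,\big[\, r + \gamma \max_{a'} q_*(s', a') \,\big]$$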
Dynamic Programming
DP algorithms can be used to compute optimal policies when given the complete dynamics, p(s’,r|s,a), of an MDP.
This is a strong assumption and computationally expensive, but it provides the theoretical best case that other algorithms attempt to achieve.
The chapter introduces Policy Evaluation and then Policy Iteration (sketched in code below):
1. Initialize an arbitrary value function v and a random policy 𝜋
2. Use the Bellman update to move v toward v𝜋 until convergence (policy evaluation)
3. Update 𝜋′ to be greedy w.r.t. v𝜋 (policy improvement)
4. Repeat from step 2 until the policy stops changing (v𝜋′ = v𝜋), which implies v𝜋 = v*
𝜋* is then just the greedy policy w.r.t. v*.
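A minimal sketch of steps 1–4 (policy evaluation plus greedy improvement), assuming the dynamics are supplied in the illustrative form P[s][a] = list of (prob, next_state, reward, done); this interface is an assumption, not the book’s code.

```python
import numpy as np

def policy_evaluation(P, n_states, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: sweep Bellman expectation updates until convergence."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = 0.0
            for a, action_prob in enumerate(policy[s]):
                for prob, s_next, r, done in P[s][a]:
                    v_new += action_prob * prob * (r + gamma * V[s_next] * (not done))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def policy_iteration(P, n_states, n_actions, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = np.ones((n_states, n_actions)) / n_actions   # start uniform random
    while True:
        V = policy_evaluation(P, n_states, policy, gamma)
        stable = True
        for s in range(n_states):
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, r, done in P[s][a]:
                    q[a] += prob * (r + gamma * V[s_next] * (not done))
            best = int(np.argmax(q))
            if policy[s, best] != 1.0:
                stable = False
            policy[s] = np.eye(n_actions)[best]            # greedy (deterministic) policy
        if stable:
            return policy, V
```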
Generalized Policy Iteration (GPI)
We can actually skip the strict iteration and just update the policy to be greedy in real time…
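One standard instance of this idea is value iteration (also from Ch4 of the book), which folds a sweep of evaluation and greedy improvement into a single max-backup; a minimal sketch under the same assumed P[s][a] format as the policy-iteration sketch above:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Fold evaluation and greedy improvement into one backup: V(s) <- max_a q(s,a)."""
    def q_values(V, s):
        return [sum(prob * (r + gamma * V[s2] * (not done))
                    for prob, s2, r, done in P[s][a])
                for a in range(n_actions)]

    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(q_values(V, s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # The greedy policy with respect to the converged V is (near-)optimal.
    policy = np.array([int(np.argmax(q_values(V, s))) for s in range(n_states)])
    return policy, V
```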
A Quick Example…
DP Summary
DP suffers from the “curse of dimensionality” (a term coined by Bellman, as is “dynamic programming”!), but it is exponentially better than direct search.
Modern computers can handle millions of states, and DP can be run asynchronously.
DP is essentially just the Bellman equations turned into update rules.
Generalized policy iteration is proven to converge for DP.
Bootstrapping: DP bootstraps, that is, it updates estimates of values using other estimated values.
Unlike the next set of methods…
Motivation for Monte Carlo
What is v𝜋? The expected return G from each state under 𝜋.
So why not just learn v𝜋 by averaging observed returns G?
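A minimal every-visit Monte Carlo prediction sketch (illustrative, not from the slides), assuming each episode is supplied as a list of (state, reward) pairs, where reward is the reward received on leaving that state while following 𝜋:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    """Estimate v_pi(s) as the average of returns observed from s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Work backwards so each return accumulates the discounted future rewards.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```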
The difference between Monte Carlo and DP
MC operates on sample experience, not on the full dynamics.
DP : computing v𝜋 :: MC : learning v𝜋
MC does not bootstrap; it estimates v𝜋 directly from returns G.
Advantages of MC over DP:
Can learn optimal behavior directly from interaction with the environment, with no model of the environment’s dynamics
If there is a model, can learn from simulation (e.g., Blackjack)
Easy and efficient to focus Monte Carlo methods on a small subset of states
No bootstrapping, so MC is less harmed by violations of the Markov property
Problems of Exploration
The problem is now non-stationary: the return after taking an action in one state depends on the actions taken in later states of the same episode.
If 𝜋 is a deterministic policy, then in following 𝜋 one will observe returns for only one of the actions from each state. With no returns, the Monte Carlo estimates of the other actions will not improve with experience.
We must assure continual exploration for policy evaluation to work. Solutions:
Exploring starts: every state-action pair has a nonzero probability of being selected as the start of an episode
On-policy: ε-greedy (choose a random action a fraction ε of the time)
Off-policy: importance sampling (use a distinct behavior policy b to explore while improving 𝜋)
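A minimal ε-greedy selection sketch (illustrative, not from the slides), assuming Q is a 2-D array of action-value estimates indexed as Q[state, action]:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1, rng=np.random.default_rng()):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))    # explore
    return int(np.argmax(Q[state]))            # exploit
```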
On-policy vs. Off-policy for Exploration
In on-policy methods, the agent commits to always exploring and tries to find the best policy that still explores.
In off-policy methods, the agent also explores, but learns a deterministic optimal policy (𝜋, usually greedy) that may be unrelated to the policy followed (b, the behavior policy).
Off-policy prediction learning methods are based on some form of importance sampling, that is, on weighting returns by the ratio of the probabilities of taking the observed actions under the two policies, thereby transforming their expectations from the behavior policy to the target policy.
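For reference, the importance-sampling ratio from the book, which reweights returns generated under b so that they estimate values under 𝜋:

$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

Ordinary importance sampling then estimates v𝜋(s) by averaging the weighted returns ρ G over the visits to s; weighted importance sampling normalizes by the sum of the ratios instead.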
A Comparison of Updates
Temporal-Difference Learning
“If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning.”
Like MC methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics.
Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
Advantages of TD methods:
Can be applied online, with a minimal amount of computation, using experience generated from interaction with an environment
Expressed nearly completely by single equations and implemented with small computer programs
TD Update & Error
TD Update: V(S_t) ← V(S_t) + α [R_{t+1} + 𝛾 V(S_{t+1}) − V(S_t)]
TD Error: δ_t ≐ R_{t+1} + 𝛾 V(S_{t+1}) − V(S_t)
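A minimal tabular TD(0) prediction sketch (illustrative, not from the slides); the Gym-style environment interface (reset() returning a state, step(a) returning (state, reward, done, info)) and the callable policy(state) are assumptions:

```python
from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=1.0):
    """Estimate v_pi by a one-step bootstrapped update after every transition."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])   # V(S) <- V(S) + alpha * TD error
            state = next_state
    return V
```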
Example: TD(0) vs. MC
Sarsa: on-policy TD control
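For reference, the Sarsa update from the book, applied after every transition (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}), where A_{t+1} is chosen by the same (e.g., ε-greedy) policy being evaluated and improved:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\big[\, R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \,\big]$$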
Q-learning: off-policy TD control
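For reference, the Q-learning update from the book; it bootstraps on the greedy action at S_{t+1} regardless of which action the behavior policy actually takes next, which is what makes it off-policy:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\big[\, R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \,\big]$$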
Case Study: Cliff Walking
n-step Methods (specifically, n-step TD methods)
These bridge the gap between one-step TD(0) and ∞-step Monte Carlo.
With TD(0), the same time step determines both how often the action can be changed and the time interval for bootstrapping: we want to update action values very quickly to take any changes into account, but bootstrapping works best over a length of time in which a significant and recognizable state change has occurred.
These methods are later superseded by eligibility traces (Ch12), which generalize n-step updates with a continuous weighting.
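For reference, the n-step return and the corresponding state-value update from the book (with V the current value estimate):

$$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n}\, V(S_{t+n}), \qquad V(S_t) \leftarrow V(S_t) + \alpha\,\big[\, G_{t:t+n} - V(S_t) \,\big]$$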
Model-free vs. Model-based
Planning methods (model-based): DP. Learning methods (model-free): MC, TD.
Both kinds of methods look ahead to future events, compute a backed-up value, and use it as an update target for an approximate value function:
DP target: an expected update, E[R_{t+1} + 𝛾 v(S_{t+1}) | S_t = s], computed from the full dynamics
MC target: the sample return G_t
TD target: the sample one-step return R_{t+1} + 𝛾 V(S_{t+1})
Chapter 8 now seeks to unify model-free and model-based methods.
Dyna
From experience you can:
(1) improve your value function & policy (direct RL)
(2) improve your model (model learning); planning with the learned model then improves the value function and policy indirectly (indirect RL)
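A minimal tabular Dyna-Q sketch combining both uses of experience plus planning from the learned model; the Gym-style env interface (reset/step), hashable discrete states, and all names here are illustrative assumptions, not the book’s code.

```python
import random
from collections import defaultdict

def dyna_q(env, n_actions, n_episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(lambda: [0.0] * n_actions)
    model = {}                                    # (s, a) -> (r, s', done); deterministic model
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2, r, done, _ = env.step(a)
            # (1) direct RL: Q-learning update from real experience
            target = r + gamma * max(Q[s2]) * (not done)
            Q[s][a] += alpha * (target - Q[s][a])
            # (2) model learning: remember the observed transition
            model[(s, a)] = (r, s2, done)
            # (3) planning (indirect RL): extra updates from simulated experience
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + gamma * max(Q[ps2]) * (not pdone)
                Q[ps][pa] += alpha * (ptarget - Q[ps][pa])
            s = s2
    return Q
```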
A Final Overview
[Figure: the space of tabular methods, spanned by Ch7: n-step Methods and Ch12: Eligibility Traces, plus a 3rd dimension: On- vs. Off-policy]
3 key ideas in common:
They all seek to estimate value functions.
They all operate by backing up values along actual or possible state trajectories.
They all follow the general strategy of generalized policy iteration (GPI), meaning that they maintain an approximate value function and an approximate policy, and continually try to improve each on the basis of the other.
Other method dimensions to consider…
The Rest of the Book
Part I: Tabular Solution Methods
Part II: Approximate Solution Methods
Ch9: On-policy Prediction with Approximation
Ch10: On-policy Control with Approximation
Ch11: Off-policy Methods with Approximation
Ch12: Eligibility Traces
Ch13: Policy Gradient Methods
Part III: Looking Deeper
Neuroscience, Psychology, Applications and Case Studies, Frontiers