A Crash Course in Reinforcement Learning


1 A Crash Course in Reinforcement Learning
Oliver Schulte, Simon Fraser University. Emphasis on connections with neural net learning.

2 Outline
What is Reinforcement Learning?
Key Definitions
Key Learning Tasks
Reinforcement Learning Techniques
Reinforcement Learning with Neural Nets

3 Overview

4 Learning To Act
So far: learning to predict. Now: learning to act.
In engineering this is control theory; in economics and operations research, decision and game theory.
Examples: fly a helicopter, drive a car, play Go, play soccer.

5 RL at a glance

6 Acting in Action
Autonomous helicopter: an example of imitation learning, which starts by observing human actions.
Learning to play video games: “Deep Q works best when it lives in the moment.”
Learning to flip pancakes.

7 MARKOV DECISION PROCESSES

8 Markov Decision Processes
Recall the Markov process (MP): the state is a vector x ≅ s of input variable values, which can contain hidden variables (= partially observable, POMDP), with transition probability P(s’|s).
Markov reward process (MRP) = MP + rewards r.
Markov decision process (MDP) = MRP + actions a.
Markov game = MDP with actions and rewards for more than one agent.

9 Model Parameters: Transition Probabilities
Markov process: P(s(t+1)|s(t)).
MDP: P(s(t+1)|s(t),a(t)) and the expected reward E(r(t+1)|s(t),a(t)).
Recall the basketball example, the hockey example, the grid example, and David Poole’s demo.
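To make the model parameters concrete, here is a minimal sketch of how P(s’|s,a) and E(r|s,a) might be stored as tables. The two-state, two-action toy problem below is hypothetical, not one of the examples named above.

```python
# Hypothetical two-state toy MDP: tables for the model parameters.
# P[(s, a)] is a dict mapping next state s' to P(s'|s,a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.5, "s1": 0.5},
}

# R[(s, a)] is the expected reward E(r|s,a).
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}

# Sanity check: the transition probabilities out of each (s, a) sum to 1.
for sa, dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, sa
```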

10 derived concepts

11 Returns and discounting
A trajectory is a (possibly infinite) sequence s(0),a(0),r(0),s(1),a(1),r(1),...,s(n),a(n),r(n),...
The return is the total sum of rewards, but if the trajectory is infinite we have an infinite sum!
Solution: weight by a discount factor γ between 0 and 1.
Return = r(0) + γr(1) + γ²r(2) + ...
Discounting can also be interpreted in terms of the probability that the process ends at each step.
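A minimal sketch of computing the discounted return for a finite prefix of a trajectory; the rewards and the value of γ below are illustrative.

```python
# Discounted return: r(0) + γ r(1) + γ² r(2) + ...
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards weighted by powers of the discount factor gamma."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: three rewards of 1.0 with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```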

12 RL Concepts
The three functions introduced next (the policy π, the state value function V, and the action value function Q) can all be computed by neural networks.

13 Policies and Values
A deterministic policy π is a function that maps states to actions, i.e. it tells us how to act. Policies can also be probabilistic, and can be implemented using neural nets.
Given a policy and an MDP, the expected return from using the policy at a state is denoted Vπ(s).
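A minimal sketch of the two policy flavours, using hypothetical toy states and actions.

```python
import random

# Deterministic policy: a plain mapping from state to action.
def pi_deterministic(state):
    return "move" if state == "s0" else "stay"

# Probabilistic (stochastic) policy: returns a distribution over actions;
# sampling from it tells us how to act.
def pi_stochastic(state):
    return {"move": 0.7, "stay": 0.3} if state == "s0" else {"stay": 1.0}

def sample_action(policy_dist):
    actions, probs = zip(*policy_dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(pi_deterministic("s0"), sample_action(pi_stochastic("s0")))
```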

14 Optimal Policies
A policy π* is optimal if, for any other policy π and for all states s, Vπ*(s) ≥ Vπ(s). The value of the optimal policy is written as V*(s).

15 The Action Value Function
Given a policy π, the expected return from taking action a in state s (and following π thereafter) is denoted Qπ(s,a). Similarly, Q*(s,a) is the value of an action under the optimal policy.
Grid example; show the Mitchell example.
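One standard way to relate Q and V is the one-step lookahead Qπ(s,a) = E(r|s,a) + γ Σ_s' P(s'|s,a) Vπ(s'). A minimal sketch on a hypothetical toy model follows; the state values V_pi below are assumed for illustration, not computed.

```python
# Hypothetical toy model (same shape as the earlier table sketch).
P = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0},            ("s1", "move"): {"s0": 0.5, "s1": 0.5}}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0, ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
V_pi = {"s0": 5.0, "s1": 8.0}  # assumed state values under some policy π
gamma = 0.9

def q_value(s, a):
    """Action value as expected reward plus discounted expected next-state value."""
    return R[(s, a)] + gamma * sum(p * V_pi[s2] for s2, p in P[(s, a)].items())

print(q_value("s0", "move"), q_value("s0", "stay"))
```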

16 LEARNING

17 Two Learning Problems
Prediction: for a fixed policy, learn Vπ(s).
Control: for a given MDP, learn V*(s) and the optimal policy.
There are variants of both problems for the Q-function.
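For the control problem, here is a minimal value-iteration sketch on a hypothetical two-state toy MDP (all names and numbers invented for illustration). It is just one way to obtain V* and a greedy optimal policy when the model is known.

```python
# Value iteration: V(s) ← max_a [ E(r|s,a) + γ Σ_s' P(s'|s,a) V(s') ].
STATES, ACTIONS, gamma = ["s0", "s1"], ["stay", "move"], 0.9
P = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0},            ("s1", "move"): {"s0": 0.5, "s1": 0.5}}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0, ("s1", "stay"): 2.0, ("s1", "move"): 0.0}

V = {s: 0.0 for s in STATES}
for _ in range(200):  # sweep until (approximately) converged
    V = {s: max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in ACTIONS)
         for s in STATES}

# The optimal policy is greedy with respect to V*.
pi_star = {s: max(ACTIONS, key=lambda a: R[(s, a)] +
                  gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
           for s in STATES}
print(V, pi_star)
```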

18 Model-Based Learning
[Diagram: Data → Transition Probabilities → Value Function, with the value function obtained by dynamic programming.]
Bellman equation: Vπ(s) = Σ_a π(a) [ E(r|s,a) + γ Σ_s' P(s'|s,a) Vπ(s') ]
Developed for transition probabilities that are “nice”: discrete, Gaussian, Poisson, ...
Grid example.
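A minimal sketch of iterative policy evaluation, i.e. repeatedly applying the Bellman equation above until the values stop changing, on the same kind of hypothetical two-state toy model (states, actions, and numbers invented for illustration).

```python
# Model-based prediction: evaluate a fixed policy π by dynamic programming.
STATES, ACTIONS, gamma = ["s0", "s1"], ["stay", "move"], 0.9
P = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0},            ("s1", "move"): {"s0": 0.5, "s1": 0.5}}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0, ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
pi = {"s0": {"move": 1.0}, "s1": {"stay": 1.0}}  # policy π(a|s)

V = {s: 0.0 for s in STATES}
for _ in range(200):  # sweep until (approximately) converged
    V = {s: sum(pi[s].get(a, 0.0) *
                (R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
                for a in ACTIONS)
         for s in STATES}
print(V)  # Vπ(s) for each state
```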

19 Model-free Learning
Bypass estimating the transition probabilities.
Why? With continuous state variables there may be no “nice” functional form. (How about using an LSTM/RNN dynamics model, i.e. deep dynamic programming?)

20 Model-free Learning
Directly learn the optimal policy π* (policy iteration).
Directly learn the optimal value function V*.
Directly learn the optimal action-value function Q* (see the Q-learning sketch below).
All of these functions can be implemented in a neural network, which is what connects NN learning and reinforcement learning.
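A minimal tabular Q-learning sketch, as one concrete way to learn Q* directly from experience without estimating transition probabilities. The two-state environment is a hypothetical toy; for large state spaces a neural network would replace the table.

```python
import random

# Hypothetical toy environment (the agent only gets to sample it, not read it).
P = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0},            ("s1", "move"): {"s0": 0.5, "s1": 0.5}}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0, ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
ACTIONS = ["stay", "move"]
gamma, alpha, epsilon = 0.9, 0.1, 0.1

def step(s, a):
    """Sample the next state and reward for taking action a in state s."""
    next_states = list(P[(s, a)])
    s2 = random.choices(next_states, weights=[P[(s, a)][x] for x in next_states])[0]
    return s2, R[(s, a)]

Q = {(s, a): 0.0 for s in ("s0", "s1") for a in ACTIONS}
s = "s0"
for _ in range(20000):
    # epsilon-greedy action selection
    a = random.choice(ACTIONS) if random.random() < epsilon else max(ACTIONS, key=lambda x: Q[(s, x)])
    s2, r = step(s, a)
    # Q-learning update: move Q(s,a) toward r + γ max_a' Q(s',a').
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)] for x in ACTIONS) - Q[(s, a)])
    s = s2
print(Q)
```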

21 Model-free Learning: What are the data?
The data are simply a sequence of events s(0),a(0),r(0),s(1),a(1),r(1),...; the sequence does not directly tell us expected values or optimal actions.
Monte Carlo learning: to learn V, observe the return at the end of an episode. For example, ChessBase gives the percentage of wins by White for any position.
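A minimal Monte Carlo sketch: average the returns observed at the end of complete episodes to estimate a state's value. The episode generator below is a hypothetical stand-in for real logged trajectories.

```python
import random

gamma = 0.9

def run_episode(length=20):
    """Return a list of rewards from one hypothetical episode starting at s0."""
    return [random.choice([0.0, 1.0]) for _ in range(length)]

returns = []
for _ in range(5000):
    rewards = run_episode()
    G = sum(gamma ** t * r for t, r in enumerate(rewards))  # return of this episode
    returns.append(G)

print(sum(returns) / len(returns))  # Monte Carlo estimate of V(s0)
```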

22 Temporal Difference Learning
Consistency idea: using the current model and the given data s(0),a(0),r(0),s(1),a(1),r(1),..., estimate the value V(s(t)) at the current state and the next-step value V1(s(t)) = r(t) + γV(s(t+1)).
Minimize the “error” [V1(s(t)) − V(s(t))]².
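A minimal TD(0) sketch of this update on a single observed transition; the states, reward, and step size below are hypothetical.

```python
gamma, alpha = 0.9, 0.1
V = {"s0": 0.0, "s1": 0.0}

# One observed transition from the data stream s(0),a(0),r(0),s(1),...
s, r, s_next = "s0", 1.0, "s1"

target = r + gamma * V[s_next]   # bootstrapped next-step value V1(s(t))
td_error = target - V[s]         # temporal-difference error
V[s] += alpha * td_error         # gradient-style step on the squared error
print(V)
```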

23 Model-Free Learning Example

