A Crash Course in Reinforcement Learning


1 A Crash Course in Reinforcement Learning
Oliver Schulte, Simon Fraser University. Emphasis on connections with neural net learning.

2 Outline
What is Reinforcement Learning?
Key Definitions
Key Learning Tasks
Reinforcement Learning Techniques
Reinforcement Learning with Neural Nets

3 Overview

4 Learning To Act
So far: learning to predict. Now: learning to act.
In engineering this is control theory; in economics and operations research, decision and game theory.
Examples: fly a helicopter, drive a car, play Go, play soccer.

5 RL at a glance

6 Acting in Action
Autonomous helicopter: an example of imitation learning, which starts by observing human actions.
Learning to play video games: “Deep Q works best when it lives in the moment.”
Learning to flip pancakes.

7 MARKOV DECISION PROCESSES

8 Markov Decision Processes
Recall the Markov process (MP): the state is a vector x ≅ s of input variable values, which can contain hidden variables (= partially observable, POMDP), with transition probability P(s’|s).
Markov reward process (MRP) = MP + rewards r.
Markov decision process (MDP) = MRP + actions a.
Markov game = MDP with actions and rewards for more than one agent.

9 Model Parameters: Transition Probabilities
Markov process: P(s(t+1)|s(t)).
MDP: P(s(t+1)|s(t),a(t)) and the expected reward E(r(t+1)|s(t),a(t)).
Recall the basketball example, the hockey example, the grid example, and David Poole’s demo.
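To make the model parameters concrete, here is a minimal sketch of how P(s’|s,a) and E(r|s,a) might be stored as tables. The two-state, two-action toy problem below is hypothetical, not one of the examples named above.

```python
# Hypothetical two-state toy MDP: tables for the model parameters.
# P[(s, a)] is a dict mapping next state s' to P(s'|s,a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.5, "s1": 0.5},
}

# R[(s, a)] is the expected reward E(r|s,a).
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}

# Sanity check: the transition probabilities out of each (s, a) sum to 1.
for sa, dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, sa
```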

10 derived concepts

11 Returns and discounting
A trajectory is a (possibly infinite) sequence s(0),a(0),r(0),s(1),a(1),r(1),...,s(n),a(n),r(n),...
The return is the total sum of rewards, but if the trajectory is infinite we have an infinite sum!
Solution: weight by a discount factor γ between 0 and 1.
Return = r(0) + γr(1) + γ²r(2) + ...
Discounting can also be interpreted in terms of the probability that the process ends at each step.
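A minimal sketch of computing the discounted return for a finite prefix of a trajectory; the rewards and the value of γ below are illustrative.

```python
# Discounted return: r(0) + γ r(1) + γ² r(2) + ...
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards weighted by powers of the discount factor gamma."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: three rewards of 1.0 with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```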

12 RL Concepts
The three functions introduced next (the policy π, the state value function V, and the action value function Q) can all be computed by neural networks.

13 Policies and Values
A deterministic policy π is a function that maps states to actions, i.e. it tells us how to act. Policies can also be probabilistic, and can be implemented using neural nets.
Given a policy and an MDP, the expected return from using the policy at a state is denoted Vπ(s).
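A minimal sketch of the two policy flavours, using hypothetical toy states and actions.

```python
import random

# Deterministic policy: a plain mapping from state to action.
def pi_deterministic(state):
    return "move" if state == "s0" else "stay"

# Probabilistic (stochastic) policy: returns a distribution over actions;
# sampling from it tells us how to act.
def pi_stochastic(state):
    return {"move": 0.7, "stay": 0.3} if state == "s0" else {"stay": 1.0}

def sample_action(policy_dist):
    actions, probs = zip(*policy_dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(pi_deterministic("s0"), sample_action(pi_stochastic("s0")))
```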

14 Optimal Policies
A policy π* is optimal if, for any other policy π and for all states s, Vπ*(s) ≥ Vπ(s). The value of the optimal policy is written as V*(s).

15 The Action Value Function
Given a policy π, the expected return from taking action a in state s (and following π thereafter) is denoted Qπ(s,a). Similarly, Q*(s,a) is the value of an action under the optimal policy.
Grid example; show the Mitchell example.
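One standard way to relate Q and V is the one-step lookahead Qπ(s,a) = E(r|s,a) + γ Σ_s' P(s'|s,a) Vπ(s'). A minimal sketch on a hypothetical toy model follows; the state values V_pi below are assumed for illustration, not computed.

```python
# Hypothetical toy model (same shape as the earlier table sketch).
P = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0},            ("s1", "move"): {"s0": 0.5, "s1": 0.5}}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0, ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
V_pi = {"s0": 5.0, "s1": 8.0}  # assumed state values under some policy π
gamma = 0.9

def q_value(s, a):
    """Action value as expected reward plus discounted expected next-state value."""
    return R[(s, a)] + gamma * sum(p * V_pi[s2] for s2, p in P[(s, a)].items())

print(q_value("s0", "move"), q_value("s0", "stay"))
```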

16 LEARNING

17 Two Learning Problems
Prediction: for a fixed policy, learn Vπ(s).
Control: for a given MDP, learn V*(s) and the optimal policy.
There are variants of both problems for the Q-function.
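For the control problem, here is a minimal value-iteration sketch on a hypothetical two-state toy MDP (all names and numbers invented for illustration). It is just one way to obtain V* and a greedy optimal policy when the model is known.

```python
# Value iteration: V(s) ← max_a [ E(r|s,a) + γ Σ_s' P(s'|s,a) V(s') ].
STATES, ACTIONS, gamma = ["s0", "s1"], ["stay", "move"], 0.9
P = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0},            ("s1", "move"): {"s0": 0.5, "s1": 0.5}}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0, ("s1", "stay"): 2.0, ("s1", "move"): 0.0}

V = {s: 0.0 for s in STATES}
for _ in range(200):  # sweep until (approximately) converged
    V = {s: max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in ACTIONS)
         for s in STATES}

# The optimal policy is greedy with respect to V*.
pi_star = {s: max(ACTIONS, key=lambda a: R[(s, a)] +
                  gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
           for s in STATES}
print(V, pi_star)
```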

18 Model-Based Learning
[Diagram: Data → Transition Probabilities → Value Function, with the value function obtained by dynamic programming.]
Bellman equation: Vπ(s) = Σ_a π(a) [ E(r|s,a) + γ Σ_s' P(s'|s,a) Vπ(s') ]
Developed for transition probabilities that are “nice”: discrete, Gaussian, Poisson, ...
Grid example.
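A minimal sketch of iterative policy evaluation, i.e. repeatedly applying the Bellman equation above until the values stop changing, on the same kind of hypothetical two-state toy model (states, actions, and numbers invented for illustration).

```python
# Model-based prediction: evaluate a fixed policy π by dynamic programming.
STATES, ACTIONS, gamma = ["s0", "s1"], ["stay", "move"], 0.9
P = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0},            ("s1", "move"): {"s0": 0.5, "s1": 0.5}}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0, ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
pi = {"s0": {"move": 1.0}, "s1": {"stay": 1.0}}  # policy π(a|s)

V = {s: 0.0 for s in STATES}
for _ in range(200):  # sweep until (approximately) converged
    V = {s: sum(pi[s].get(a, 0.0) *
                (R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
                for a in ACTIONS)
         for s in STATES}
print(V)  # Vπ(s) for each state
```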

19 Model-free Learning
Bypass estimating the transition probabilities.
Why? With continuous state variables there may be no “nice” functional form. (How about using an LSTM/RNN dynamics model, i.e. deep dynamic programming?)

20 Model-free Learning
Directly learn the optimal policy π* (policy iteration).
Directly learn the optimal value function V*.
Directly learn the optimal action-value function Q* (see the Q-learning sketch below).
All of these functions can be implemented in a neural network, which is what connects NN learning and reinforcement learning.
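A minimal tabular Q-learning sketch, as one concrete way to learn Q* directly from experience without estimating transition probabilities. The two-state environment is a hypothetical toy; for large state spaces a neural network would replace the table.

```python
import random

# Hypothetical toy environment (the agent only gets to sample it, not read it).
P = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s1": 1.0},            ("s1", "move"): {"s0": 0.5, "s1": 0.5}}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0, ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
ACTIONS = ["stay", "move"]
gamma, alpha, epsilon = 0.9, 0.1, 0.1

def step(s, a):
    """Sample the next state and reward for taking action a in state s."""
    next_states = list(P[(s, a)])
    s2 = random.choices(next_states, weights=[P[(s, a)][x] for x in next_states])[0]
    return s2, R[(s, a)]

Q = {(s, a): 0.0 for s in ("s0", "s1") for a in ACTIONS}
s = "s0"
for _ in range(20000):
    # epsilon-greedy action selection
    a = random.choice(ACTIONS) if random.random() < epsilon else max(ACTIONS, key=lambda x: Q[(s, x)])
    s2, r = step(s, a)
    # Q-learning update: move Q(s,a) toward r + γ max_a' Q(s',a').
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)] for x in ACTIONS) - Q[(s, a)])
    s = s2
print(Q)
```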

21 Model-free Learning: What are the data?
The data are simply a sequence of events s(0),a(0),r(0),s(1),a(1),r(1),...; the sequence does not directly tell us expected values or optimal actions.
Monte Carlo learning: to learn V, observe the return at the end of an episode. For example, ChessBase gives the percentage of wins by White for any position.
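A minimal Monte Carlo sketch: average the returns observed at the end of complete episodes to estimate a state's value. The episode generator below is a hypothetical stand-in for real logged trajectories.

```python
import random

gamma = 0.9

def run_episode(length=20):
    """Return a list of rewards from one hypothetical episode starting at s0."""
    return [random.choice([0.0, 1.0]) for _ in range(length)]

returns = []
for _ in range(5000):
    rewards = run_episode()
    G = sum(gamma ** t * r for t, r in enumerate(rewards))  # return of this episode
    returns.append(G)

print(sum(returns) / len(returns))  # Monte Carlo estimate of V(s0)
```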

22 Temporal Difference Learning
Consistency idea: using the current model and the given data s(0),a(0),r(0),s(1),a(1),r(1),..., estimate the value V(s(t)) at the current state and the next-step value V1(s(t)) = r(t) + γV(s(t+1)).
Minimize the “error” [V1(s(t)) − V(s(t))]².
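A minimal TD(0) sketch of this update on a single observed transition; the states, reward, and step size below are hypothetical.

```python
gamma, alpha = 0.9, 0.1
V = {"s0": 0.0, "s1": 0.0}

# One observed transition from the data stream s(0),a(0),r(0),s(1),...
s, r, s_next = "s0", 1.0, "s1"

target = r + gamma * V[s_next]   # bootstrapped next-step value V1(s(t))
td_error = target - V[s]         # temporal-difference error
V[s] += alpha * td_error         # gradient-style step on the squared error
print(V)
```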

23 Model-Free Learning Example

