# Lirong Xia Reinforcement Learning (2) Tue, March 21, 2014.

## Presentation on theme: "Lirong Xia Reinforcement Learning (2) Tue, March 21, 2014."— Presentation transcript:

Lirong Xia Reinforcement Learning (2) Tue, March 21, 2014

Project 2 due tonight Project 3 is online (more later) –due in two weeks 1 Reminder

Recap: MDPs 2 Markov decision processes: –States S –Start state s 0 –Actions A –Transition p(s’|s,a) (or T(s,a,s’)) –Reward R(s,a,s’) (and discount ϒ ) MDP quantities: –Policy = Choice of action for each (MAX) state –Utility (or return) = sum of discounted rewards

Optimal Utilities 3 The value of a state s: –V*(s) = expected utility starting in s and acting optimally The value of a Q-state (s,a): –Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally The optimal policy: – π *(s) = optimal action from state s

Solving MDPs 4 Value iteration –Start with V 1 (s) = 0 –Given V i, calculate the values for all states for depth i+1: –Repeat until converge –Use V i as evaluation function when computing V i+1 Policy iteration –Step 1: policy evaluation: calculate utilities for some fixed policy –Step 2: policy improvement: update policy using one-step look-ahead with resulting utilities as future values –Repeat until policy converges

Don’t know T and/or R, but can observe R –Learn by doing –can have multiple episodes (trials) 5 Reinforcement learning

The Story So Far: MDPs and RL 6 Things we know how to do: If we know the MDP –Compute V*, Q*, π* exactly –Evaluate a fixed policy π If we don’t know the MDP –If we can estimate the MDP then solve –We can estimate V for a fixed policy π –We can estimate Q*(s,a) for the optimal policy while executing an exploration policy Techniques: Computation Value and policy iteration Policy evaluation Model-based RL sampling Model-free RL: Q-learning

Model-Free Learning 7 Model-free (temporal difference) learning –Experience world through episodes (s,a,r,s’,a’,r’,s’’,a’’,r’’,s’’’…) –Update estimates each transition (s,a,r,s’) –Over time, updates will mimic Bellman updates

Detour: Q-Value Iteration 8 We’d like to do Q-value updates to each Q-state: –But can’t compute this update without knowing T,R Instead, compute average as we go –Receive a sample transition (s,a,r,s’) –This sample suggests –But we want to merge the new observation to the old ones –So keep a running average

Q-Learning Properties 9 Will converge to optimal policy –If you explore enough (i.e. visit each q-state many times) –If you make the learning rate small enough –Basically doesn’t matter how you select actions (!) Off-policy learning: learns optimal q-values, not the values of the policy you are following

Q-Learning 10 Q-learning produces tables of q-values:

Exploration / Exploitation 11 R andom actions (ε greedy) –Every time step, flip a coin –With probability ε, act randomly –With probability 1-ε, act according to current policy

Today: Q-Learning with state abstraction 12 In realistic situations, we cannot possibly learn about every single state! –Too many states to visit them all in training –Too many states to hold the Q-tables in memory Instead, we want to generalize: –Learn about some small number of training states from experience –Generalize that experience to new, similar states –This is a fundamental idea in machine learning, and we’ll see it over and over again

Example: Pacman 13 Let’s say we discover through experience that this state is bad: In naive Q-learning, we know nothing about this state or its Q-states: Or even this one!

Feature-Based Representations 14 Solution: describe a state using a vector of features (properties) –Features are functions from states to real numbers (often 0/1) that capture important properties of the state –Example features: Distance to closest ghost Distance to closest dot Number of ghosts 1/ (dist to dot) 2 Is Pacman in a tunnel? (0/1) …etc Is it the exact state on this slide? –Can also describe a Q-state (s,a) with features (e.g. action moves closer to food) Similar to a evaluation function

Linear Feature Functions 15 Using a feature representation, we can write a Q function (or value function) for any state using a few weights: Advantage: our experience is summed up in a few powerful numbers Disadvantage: states may share features but actually be very different in value!

Function Approximation 16 Q-learning with linear Q-functions: transition = (s,a,r,s’) Intuitive interpretation: –Adjust weights of active features –E.g. if something unexpectedly bad happens, disprefer all states with that state’s features Formal justification: online least squares Exact Q’s Approximate Q’s

Example: Q-Pacman 17 Q(s,a)= 4.0f DOT (s,a)-1.0f GST (s,a) f DOT (s,NORTH)=0.5 f GST (s,NORTH)=1.0 Q(s,a)=+1 R(s,a,s’)=-500 difference=-501 w DOT ←4.0+α[-501]0.5 w GST ←-1.0+α[-501]1.0 Q(s,a)= 3.0f DOT (s,a)-3.0f GST (s,a) α= North r = -500 s s’

Linear Regression 18

Ordinary Least Squares (OLS) 19

Minimizing Error 20 Imagine we had only one point x with features f(x): Approximate q update explained: “target” “prediction”

As many as possible? –computational burden –overfitting Feature selection is important –requires domain expertise 21 How many features should we use?

Overfitting 22

The elements of statistical learning. Fig 2.11. 23

MDPs –Q1: value iteration –Q2: find parameters that lead to certain optimal policy –Q3: similar to Q2 Q-learning –Q4: implement the Q-learning algorithm –Q5: implement ε greedy action selection –Q6: try the algorithm Approximate Q-learning and state abstraction –Q7: Pacman –Q8 (bonus): design and implement a state-abstraction Q- learning algorithm Hints –make your implementation general –try to use class methods as much as possible 24 Overview of Project 3