
# Reinforcement Learning

## Agenda
- Online learning
- Reinforcement learning
- Model-free vs. model-based
- Passive vs. active learning
- Exploration-exploitation tradeoff

## Incremental (Online) Function Learning
Data streams into the learner: $x_1, y_1, \dots, x_n, y_n$ with $y_i = f(x_i)$. The learner observes $x_{n+1}$ and must predict $y_{n+1}$ for the next time step.

Batch approach: store all data at step $n$, run your learner of choice on all data up to time $n$, and predict for time $n+1$. Can we do this using less memory?

## Example: Mean Estimation
$y_i = \theta + {}$ error term (no $x$'s). The current estimate after $n$ samples is the sample mean $\hat\theta_n = \frac{1}{n}\sum_{i=1}^{n} y_i$, which can be maintained incrementally:

$$\hat\theta_{n+1} = \frac{1}{n+1}\sum_{i=1}^{n+1} y_i = \frac{1}{n+1}\Big(y_{n+1} + \sum_{i=1}^{n} y_i\Big) = \frac{1}{n+1}\big(y_{n+1} + n\,\hat\theta_n\big) = \hat\theta_n + \frac{1}{n+1}\big(y_{n+1} - \hat\theta_n\big)$$

For example, $\hat\theta_6 = \frac{5}{6}\hat\theta_5 + \frac{1}{6}\,y_6$. Only $n$ and $\hat\theta_n$ need to be stored.

## Learning Rates
In fact, $\hat\theta_{n+1} = \hat\theta_n + \alpha_n\,(y_{n+1} - \hat\theta_n)$ converges to the mean for any learning-rate schedule $\alpha_n$ such that:
- $\alpha_n \to 0$ as $n \to \infty$
- $\sum_n \alpha_n = \infty$
- $\sum_n \alpha_n^2 \le C < \infty$

$\alpha_n = O(1/n)$ does the trick. If $\alpha_n$ is close to 1, the estimate shifts strongly toward recent data; close to 0, and the old estimate is preserved.
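The incremental mean update above fits in a few lines. A minimal sketch (the function name is mine):

```python
def running_mean(ys):
    """Estimate the mean of a stream with O(1) memory.

    Uses the learning rate alpha_n = 1/(n+1), which reproduces
    the exact sample mean at every step.
    """
    theta = 0.0
    for n, y in enumerate(ys):
        # theta_{n+1} = theta_n + alpha_n * (y_{n+1} - theta_n)
        theta += (y - theta) / (n + 1)
    return theta
```

Replacing $1/(n+1)$ with a constant $\alpha$ turns this into an exponentially weighted average that tracks recent data, matching the remark above.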

## Reinforcement Learning
RL problem: given only observations of actions, states, and rewards, learn a (near-)optimal policy, with no prior knowledge of the transition or reward models. We consider a fully observable, episodic environment with a finite state space and uncertainty in action outcomes (an MDP).

## What to Learn?
The options span a spectrum from less to more online deliberation:

| Learn | Online execution | Model-free / model-based | Method |
|---|---|---|---|
| Policy $\pi$ | $\pi(s)$ | Model-free | Learning from demonstration |
| Action-utility function $Q(s,a)$ | $\arg\max_a Q(s,a)$ | Model-free | Q-learning, SARSA |
| Utility function $U$ | $\arg\max_a \sum_{s'} P(s'\mid s,a)\,U(s')$ | Model-based | Direct utility estimation, TD-learning |
| Model of $R$ and $T$ | Solve MDP | Model-based | Adaptive dynamic programming |

Tradeoff: model-free methods give simpler execution; model-based methods may need fewer examples to learn.

## First Steps: Passive RL
Observe execution trials of an agent that acts according to some unobserved policy $\pi$. Problem: estimate the utility function $U^\pi$.

[Recall $U^\pi(s) = E\big[\sum_t \gamma^t R(S_t)\big]$, where $S_t$ is the random variable denoting the distribution of states at time $t$.]

## Direct Utility Estimation
1. Observe trials $\tau^{(i)} = (s_0^{(i)}, a_1^{(i)}, s_1^{(i)}, r_1^{(i)}, \dots, a_{t_i}^{(i)}, s_{t_i}^{(i)}, r_{t_i}^{(i)})$ for $i = 1,\dots,n$
2. For each state $s \in S$:
3. Find all trials $\tau^{(i)}$ that pass through $s$
4. Compute the subsequent utility $U_{\tau^{(i)}}(s) = \sum_{t=k}^{t_i} \gamma^{t-k}\, r_t^{(i)}$ (where $s$ is reached at step $k$)
5. Set $U^\pi(s)$ to the average observed utility

[4×3 grid-world illustration: estimated utilities start at 0 and converge toward values such as 0.92 adjacent to the +1 terminal.]

## Online Implementation
1. Store counts $N[s]$ and estimated utilities $U^\pi(s)$
2. After a trial $\tau$, for each state $s$ in the trial:
3. Set $N[s] \leftarrow N[s] + 1$
4. Adjust the utility: $U^\pi(s) \leftarrow U^\pi(s) + \alpha(N[s])\,(U_\tau(s) - U^\pi(s))$

This is simply supervised learning on trials. Learning is slow, because the Bellman equation is not used to pass knowledge between adjacent states.
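A minimal sketch of the online update above. The data layout (a trial as a list of `(state, reward)` pairs) and the schedule $\alpha(N[s]) = 1/N[s]$ are my assumptions:

```python
def update_from_trial(trial, U, N, gamma=1.0):
    """Fold one observed trial into the running utility averages.

    trial: list of (state, reward) pairs in visit order.
    U, N:  dicts of utility estimates and visit counts, updated in place.
    """
    # Discounted return from each step to the end of the trial.
    G, returns = 0.0, []
    for s, r in reversed(trial):
        G = r + gamma * G
        returns.append((s, G))
    # Running average with learning rate alpha(N[s]) = 1/N[s].
    for s, G in returns:
        N[s] = N.get(s, 0) + 1
        U[s] = U.get(s, 0.0) + (G - U.get(s, 0.0)) / N[s]
```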

## Temporal Difference Learning
1. Store counts $N[s]$ and estimated utilities $U^\pi(s)$
2. For each observed transition $(s, r, a, s')$:
3. Set $N[s] \leftarrow N[s] + 1$
4. Adjust the utility: $U^\pi(s) \leftarrow U^\pi(s) + \alpha(N[s])\,(r + \gamma U^\pi(s') - U^\pi(s))$

[The original slides stepped through the 4×3 grid world with learning rate $\alpha = 0.5$, showing the estimates move from all zeros toward the true utilities one transition at a time.]

For any $s$, the distribution of $s'$ approaches $P(s'\mid s, \pi(s))$. TD uses the relationships between adjacent states to adjust utilities toward equilibrium. Unlike direct estimation, it learns before the trial has terminated.
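One TD update can be sketched as follows (here with the count-based schedule $\alpha(N[s]) = 1/N[s]$; a fixed $\alpha = 0.5$ as in the grid example works too):

```python
def td_update(U, N, s, r, s_next, gamma=0.9):
    """One temporal-difference update for the observed transition
    s --(reward r)--> s_next.  U and N are dicts, updated in place."""
    N[s] = N.get(s, 0) + 1
    alpha = 1.0 / N[s]  # any schedule meeting the learning-rate conditions works
    U[s] = U.get(s, 0.0) + alpha * (r + gamma * U.get(s_next, 0.0) - U.get(s, 0.0))
    return U[s]
```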

## Offline Interpretation of TD Learning
1. Observe trials $\tau^{(i)} = (s_0^{(i)}, a_1^{(i)}, s_1^{(i)}, r_1^{(i)}, \dots, a_{t_i}^{(i)}, s_{t_i}^{(i)}, r_{t_i}^{(i)})$ for $i = 1,\dots,n$
2. For each state $s \in S$:
3. Find all trials $\tau^{(i)}$ that pass through $s$
4. Extract the local history $(s, r^{(i)}, a^{(i)}, s'^{(i)})$ from each trial
5. Set up the constraint $U^\pi(s) = r^{(i)} + \gamma\, U^\pi(s'^{(i)})$
6. Solve all constraints in least-squares fashion using stochastic gradient descent

[Recall the linear system in policy iteration: $\mathbf{u} = \mathbf{r} + \gamma\, T^\pi \mathbf{u}$]

## Adaptive Dynamic Programming
1. Store counts $N[s]$, $N[s,a]$, $N[s,a,s']$, estimated rewards $R(s)$, and the transition model $P(s'\mid s,a)$
2. For each observed transition $(s, r, a, s')$:
3. Set $N[s] \leftarrow N[s]+1$, $N[s,a] \leftarrow N[s,a]+1$, $N[s,a,s'] \leftarrow N[s,a,s']+1$
4. Adjust the reward: $R(s) \leftarrow R(s) + \alpha(N[s])\,(r - R(s))$
5. Set $P(s'\mid s,a) = N[s,a,s'] / N[s,a]$
6. Solve policy evaluation using $P$, $R$, $\gamma$

Learning is faster than TD, because the Bellman equation is exploited across all states. Modified policy evaluation algorithms make the updates faster than re-solving the full linear system ($O(n^3)$).
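The bookkeeping in steps 1-5 can be sketched as below (the dict layout and names are my assumptions; a full ADP agent would re-solve policy evaluation after each observation, which is omitted here):

```python
def adp_observe(counts, model, s, a, r, s_next):
    """Update visit counts, the running reward estimate, and the
    maximum-likelihood transition model after one observed transition."""
    N_s, N_sa, N_sas = counts['s'], counts['sa'], counts['sas']
    N_s[s] = N_s.get(s, 0) + 1
    N_sa[(s, a)] = N_sa.get((s, a), 0) + 1
    N_sas[(s, a, s_next)] = N_sas.get((s, a, s_next), 0) + 1
    # Running-average reward with alpha(N[s]) = 1/N[s].
    model['R'][s] = model['R'].get(s, 0.0) + (r - model['R'].get(s, 0.0)) / N_s[s]
    # Refresh P(s'|s,a) = N[s,a,s']/N[s,a] for every s' seen from (s,a).
    for (ss, aa, sn), c in N_sas.items():
        if (ss, aa) == (s, a):
            model['P'][(sn, ss, aa)] = c / N_sa[(s, a)]
```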

## Active RL
Rather than assume a policy is given, can we use the learned utilities to pick good actions? At each state $s$, the agent must learn the outcomes of all actions, not just the action $\pi(s)$.

## Greedy RL
Maintain current estimates $U(s)$. Idea: at state $s$, take the action $a$ that maximizes $\sum_{s'} P(s'\mid s,a)\,U(s')$. This very seldom works well! Why?

## Exploration vs. Exploitation
A greedy strategy purely exploits its current knowledge, but the quality of that knowledge improves only for the states the agent observes often. A good learner must perform exploration in order to improve its knowledge about states that are not observed often. But pure exploration is useless (and costly) if it is never exploited.

## Restaurant Problem

## Optimistic Exploration Strategy
Behave initially as if there were wonderful rewards $R^+$ scattered all over the place. Define a modified, optimistic Bellman update:

$$U^+(s) \leftarrow R(s) + \gamma \max_a f\Big(\textstyle\sum_{s'} P(s'\mid s,a)\,U^+(s'),\; N[s,a]\Big)$$

Truncated exploration function:
$$f(u, n) = \begin{cases} R^+ & \text{if } n < N_e \\ u & \text{otherwise} \end{cases}$$

[Here the agent will try each action in each state at least $N_e$ times.]
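The truncated exploration function is tiny in code (the default $R^+$ and $N_e$ values here are illustrative):

```python
def f(u, n, R_plus=1.0, N_e=5):
    """Truncated exploration function: report the optimistic reward R_plus
    until the state-action pair has been tried at least N_e times,
    then fall back to the actual utility estimate u."""
    return R_plus if n < N_e else u
```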

## Complexity
With truncation, at least $N_e \cdot |S| \cdot |A|$ steps are needed to explore every action in every state. Some costly explorations might not be necessary, or the reward from far-off explorations may be heavily discounted. Convergence to the optimal policy is guaranteed only if each action is tried in each state an infinite number of times!

This works with ADP... but how do we perform action selection in TD? We would also have to learn the transition model $P(s'\mid s,a)$.

## Q-Values
Learning $U$ is not enough for action selection, because a transition model is also needed. Solution: learn Q-values. $Q(s,a)$ is the utility of choosing action $a$ in state $s$. Shift the Bellman equation:

$$U(s) = \max_a Q(s,a)$$
$$Q(s,a) = R(s) + \gamma \sum_{s'} P(s'\mid s,a)\,\max_{a'} Q(s',a')$$

So far everything is the same... but what about the learning rule?

## Q-Learning Update
Recall TD:
- Update: $U(s) \leftarrow U(s) + \alpha(N[s])\,(r + \gamma U(s') - U(s))$
- Select action: $a \leftarrow \arg\max_a f\big(\sum_{s'} P(s'\mid s,a)\,U(s'),\; N[s,a]\big)$

Q-learning:
- Update: $Q(s,a) \leftarrow Q(s,a) + \alpha(N[s,a])\,(r + \gamma \max_{a'} Q(s',a') - Q(s,a))$
- Select action: $a \leftarrow \arg\max_a f\big(Q(s,a),\; N[s,a]\big)$

Key difference: the average over $P(s'\mid s,a)$ is baked into the $Q$ function. Q-learning is therefore a model-free active learner.
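A sketch of the Q-learning update and the exploration-aware action selection above (dict-based tables; the function names and default constants are mine):

```python
def q_update(Q, N, s, a, r, s_next, actions, gamma=0.9):
    """Q-learning update for the observed transition (s, a, r, s_next),
    with learning rate alpha(N[s,a]) = 1/N[s,a]."""
    N[(s, a)] = N.get((s, a), 0) + 1
    alpha = 1.0 / N[(s, a)]
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

def select_action(Q, N, s, actions, R_plus=1.0, N_e=5):
    """Greedy in the optimistic value f(Q(s,a), N[s,a]) -- no transition
    model needed, which is what makes Q-learning an active model-free learner."""
    def f(u, n):
        return R_plus if n < N_e else u
    return max(actions, key=lambda a: f(Q.get((s, a), 0.0), N.get((s, a), 0)))
```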

## More Issues in RL
- Model-free vs. model-based: model-based techniques are typically better at incorporating prior knowledge
- Generalization: value function approximation, policy search methods

## Large-Scale Applications
- Game playing: TD-Gammon, a neural network representation of the Q-functions, trained via self-play
- Robot control

## Recap
- Online learning: learn incrementally with low memory overhead
- Key difference between RL methods: what to learn?
  - Temporal differencing: learn $U$ through incremental updates. Cheap, but somewhat slow learning.
  - Adaptive DP: learn $P$ and $R$, derive $U$ through policy evaluation. Fast learning but computationally expensive.
  - Q-learning: learn the state-action function $Q(s,a)$, which allows model-free action selection.
- Action selection requires trading off exploration vs. exploitation
- Infinite exploration is needed to guarantee that the optimal policy is found!

## Incremental Least Squares
Recall the least-squares estimate
$$\theta = (A^T A)^{-1} A^T b$$
where $A$ is the $N \times M$ matrix whose rows are the $x^{(i)T}$, and $b$ is the $N \times 1$ vector of the $y^{(i)}$.

## Delta Rule for Linear Least Squares
Delta rule (Widrow-Hoff rule): stochastic gradient descent,
$$\theta^{(t+1)} = \theta^{(t)} + \alpha\, x^{(t)}\big(y^{(t)} - \theta^{(t)T} x^{(t)}\big)$$
using $O(M)$ time and space per step.
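A one-line NumPy sketch of the delta rule (the step size is illustrative):

```python
import numpy as np

def delta_rule_step(theta, x, y, alpha=0.01):
    """One stochastic-gradient step on the squared error (y - theta.x)^2."""
    return theta + alpha * x * (y - theta @ x)
```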

## Incremental Least Squares
Let $A^{(t)}$, $b^{(t)}$ be the $A$ matrix and $b$ vector up to time $t$, so
$$\theta^{(t)} = (A^{(t)T} A^{(t)})^{-1} A^{(t)T} b^{(t)}.$$
Appending the new row $x^{(t+1)T}$ to $A^{(t)}$ and the new entry $y^{(t+1)}$ to $b^{(t)}$ gives the rank-one updates
$$A^{(t+1)T} b^{(t+1)} = A^{(t)T} b^{(t)} + y^{(t+1)} x^{(t+1)}$$
$$A^{(t+1)T} A^{(t+1)} = A^{(t)T} A^{(t)} + x^{(t+1)} x^{(t+1)T}$$
The inverse can be maintained directly with the Sherman-Morrison update:
$$(Y + xx^T)^{-1} = Y^{-1} - \frac{Y^{-1} x x^T Y^{-1}}{1 + x^T Y^{-1} x}$$

## Incremental Least Squares
Putting it all together, store
$$p^{(t)} = A^{(t)T} b^{(t)}, \qquad Q^{(t)} = (A^{(t)T} A^{(t)})^{-1}$$
and update
$$p^{(t+1)} = p^{(t)} + y\,x$$
$$Q^{(t+1)} = Q^{(t)} - \frac{Q^{(t)} x x^T Q^{(t)}}{1 + x^T Q^{(t)} x}$$
$$\theta^{(t+1)} = Q^{(t+1)} p^{(t+1)}$$
This takes $O(M^2)$ time and space, instead of $O(M^3 + MN)$ time and $O(MN)$ space for batch least squares. It is the true least-squares estimator for any $t$ (the delta rule works only for large $t$).
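The full recursive scheme might look like the following sketch. The tiny ridge term that initializes $Q$ (so the inverse exists before $M$ examples have arrived) is my addition; with it the estimate is ridge regression that approximates ordinary least squares to within the ridge scale:

```python
import numpy as np

class IncrementalLS:
    """Recursive least squares: maintain p = A^T b and Q = (A^T A)^{-1},
    updating Q with the Sherman-Morrison formula in O(M^2) per step."""

    def __init__(self, M, ridge=1e-8):
        self.p = np.zeros(M)
        self.Q = np.eye(M) / ridge  # inverse of (ridge * I), i.e. A^T A ~ ridge*I

    def observe(self, x, y):
        self.p += y * x
        Qx = self.Q @ x
        self.Q -= np.outer(Qx, Qx) / (1.0 + x @ Qx)
        return self.Q @ self.p  # current estimate theta^(t+1)
```

Each `observe` call costs $O(M^2)$, independent of how many examples have been seen, which is the point of the derivation above.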
