1 Decision Making Under Uncertainty
Lec #9: Approximate Value Function
UIUC CS 598: Section EA
Professor: Eyal Amir
Spring Semester 2006
Some slides by Jeremy Wyatt (U Birmingham), Ron Parr (Duke), Eduardo Alonso, and Craig Boutilier (Toronto)

2 So Far: MDPs, RL
– Transition matrix (n × n)
– Reward vector (n entries)
– Value function: table (n entries)
– Policy: table (n entries)
Time to solve: linear programming with O(n) variables
Problem: the number of states (n) is too large in many systems

3 Today: Approximate Values
Motivation:
– representation of the value function is smaller
– generalization: fewer state-action pairs must be observed
– some spaces (e.g., continuous) must be approximated
Function approximation: use a parametrized representation of the value function

4 Overview: Approximate Reinforcement Learning
Consider V̂(·; θ): X → ℝ, where θ denotes the vector of parameters of the approximator
Want to learn θ so that V̂ approximates V* well
Gradient descent: find the direction Δθ in which V̂(·; θ) can change to improve performance the most
Use a step size α: θ_new = θ + α Δθ
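
To make the idea concrete, here is a minimal Python sketch, not taken from the slides: the tanh form, names, and step size are illustrative assumptions. It nudges θ in the direction that most reduces the squared error of V̂(s; θ), estimating the gradient by finite differences.

    import numpy as np

    def v_hat(s_features, theta):
        # any smooth parametric form would do; here a toy tanh of a linear form
        return np.tanh(theta @ s_features)

    def improve(theta, s_features, target, alpha=0.05, eps=1e-5):
        """One gradient-descent step on the squared error (target - V^)^2,
        using a finite-difference estimate of the gradient."""
        grad = np.zeros_like(theta)
        base_err = (target - v_hat(s_features, theta)) ** 2
        for i in range(len(theta)):
            bumped = theta.copy()
            bumped[i] += eps
            grad[i] = ((target - v_hat(s_features, bumped)) ** 2 - base_err) / eps
        return theta - alpha * grad      # i.e., theta_new = theta + alpha * delta-theta

    theta = improve(np.zeros(3), np.array([1.0, 0.0, 1.0]), target=0.5)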

5 Overview: Approximate Reinforcement Learning
In RL, at each time step the learning system uses a function approximator to select an action a = π̂(s; θ)
Should adjust θ so as to produce the maximum possible value for each s
Would like π̂ to solve the parametric global optimization problem
π̂(s; θ) = arg max_{a ∈ A} Q^π(s, a), ∀ s ∈ S
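
As a tiny illustration of the arg max above (the Q-values here are made-up numbers, not a learned function):

    q_values = {"left": 0.2, "right": 0.9, "stay": 0.5}   # Q(s, a) for one state s
    greedy_a = max(q_values, key=q_values.get)            # pi^(s) = arg max_a Q(s, a)
    print(greedy_a)                                       # -> "right"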

6 Supervised Learning and Reinforcement Learning
SL: s and a = π(s) are given
– regression uses the direction Δ = a − π̂(s; θ)
SL: we can check whether Δ = 0 and decide if the correct map is given by π̂ at s
RL: this direction is not available
RL: cannot draw conclusions about correctness without exploring the values of Q(s, a) for all a

7 How to Measure the Performance of FA Methods?
Regression SL minimizes the mean-squared error (MSE) over some distribution P of the inputs
In our value-prediction problem the inputs are states and the target function is V^π, so
MSE(θ_t) = Σ_{s ∈ S} P(s) [V^π(s) − V_t(s)]²
P is important because it is usually not possible to reduce the error to zero at all states
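
A small numerical sketch of this weighted MSE; the distribution and values below are made-up numbers:

    import numpy as np

    P        = np.array([0.5, 0.3, 0.2])     # state distribution P(s)
    v_true   = np.array([1.0, 0.0, -1.0])    # V^pi(s), assumed known for the example
    v_approx = np.array([0.8, 0.1, -0.7])    # V_t(s) from the approximator

    mse = np.sum(P * (v_true - v_approx) ** 2)
    print(mse)    # states visited more often contribute more to the error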

8 More on MSE
Convergence results are available for the on-policy distribution, i.e., the frequency with which states are encountered while interacting with the environment
Question: why minimize MSE? Our goal is to make predictions that aid in finding a better policy, and the best predictions for that purpose are not necessarily the best for minimizing MSE
Still, there is no better alternative at the moment

9 Gradient Descent for Reinforcement Learning
θ_t: parameter vector with real-valued components
V_t(s) is a smooth, differentiable function of θ_t for all s ∈ S
Minimize the error on the observed example: adjust θ_t by a small amount in the direction that reduces the error the most on that example
θ_{t+1} = θ_t + α [V^π(s_t) − V_t(s_t)] ∇_{θ_t} V_t(s_t)
where ∇_{θ_t} V_t(s_t) denotes the gradient of V_t(s_t) with respect to θ_t, i.e., the vector of its partial derivatives
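
One such update, written out for the common linear case where ∇_θ V_t(s) is just the feature vector of s (cf. slide 20); all numbers and names here are illustrative assumptions:

    import numpy as np

    theta  = np.array([0.0, 0.5, -0.2])
    phi_s  = np.array([1.0, 0.0, 1.0])     # features of the observed state s_t
    v_true = 1.5                           # V^pi(s_t), assumed available here
    alpha  = 0.1

    v_hat = theta @ phi_s                              # current estimate V_t(s_t)
    theta = theta + alpha * (v_true - v_hat) * phi_s   # theta_{t+1}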

10 Value Prediction
If v_t, the t-th training example, is not the true value V^π(s_t) but some approximation of it:
θ_{t+1} = θ_t + α [v_t − V_t(s_t)] ∇_{θ_t} V_t(s_t)
Similarly, for n-step (λ-) returns:
θ_{t+1} = θ_t + α [R^λ_t − V_t(s_t)] ∇_{θ_t} V_t(s_t), or
θ_{t+1} = θ_t + α δ_t e_t, where δ_t is the Bellman (TD) error and e_t = γλ e_{t−1} + ∇_{θ_t} V_t(s_t)

11 Gradient-Descent TD(λ) for V^π
Initialize θ arbitrarily and e ← 0
Repeat (for each episode):
  s ← initial state of episode
  Repeat (for each step of the episode):
    a ← action given by π for s
    Take action a; observe reward r and next state s'
    δ ← r + γ V(s') − V(s)
    e ← γλ e + ∇_θ V(s)
    θ ← θ + α δ e
    s ← s'
  until s is terminal
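
A hedged, runnable rendering of this pseudocode for linear value functions; the feature map, policy, and environment interface below are illustrative assumptions, not part of the slides:

    import numpy as np

    def td_lambda(phi, policy, env_step, env_reset, n_features,
                  episodes=200, alpha=0.05, gamma=0.9, lam=0.8):
        """Gradient-descent TD(lambda) for a linear V(s) = theta . phi(s).
        phi(s) -> feature vector; policy(s) -> action;
        env_reset() -> s0; env_step(s, a) -> (s2, r, done)."""
        theta = np.zeros(n_features)
        for _ in range(episodes):
            e = np.zeros(n_features)                 # eligibility-trace vector
            s, done = env_reset(), False
            while not done:
                a = policy(s)
                s2, r, done = env_step(s, a)
                v_s2 = 0.0 if done else theta @ phi(s2)
                delta = r + gamma * v_s2 - theta @ phi(s)
                e = gamma * lam * e + phi(s)         # grad of a linear V is phi(s)
                theta += alpha * delta * e
                s = s2
        return theta

    # Toy usage: a 5-state random walk, one-hot features, random policy.
    N = 5

    def walk_step(s, a):
        s2 = min(max(s + a, 0), N - 1)
        return s2, (1.0 if s2 == N - 1 else 0.0), s2 in (0, N - 1)

    theta = td_lambda(phi=lambda s: np.eye(N)[s],
                      policy=lambda s: np.random.choice([-1, 1]),
                      env_step=walk_step,
                      env_reset=lambda: N // 2,
                      n_features=N)
    print(theta)   # approximate V for the random policy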

12 Approximation Models
Most widely used (and simple) gradient-descent approximators:
– neural networks trained with the error-backpropagation algorithm
– linear forms
Factored value functions (your presentation) [GPK]
First-order value functions (your presentation) [BRP]
Restricted policies (your presentation) [KMN]

14 Function Approximation
Common approach to solving MDPs:
– find a functional form f(θ) for the VF that is tractable, e.g., not exponential in the number of variables
– attempt to find parameters θ such that f(θ) offers the "best fit" to the "true" VF
Example:
– use a neural net to approximate the VF; inputs: state features, output: value or Q-value
– generate samples of the "true" VF to train the NN, e.g., use the dynamics to sample transitions and train on Bellman backups (bootstrapping on the current approximation given by the NN)

15 Linear Function Approximation
Assume a set of basis functions B = { b_1, ..., b_k }
– each b_i: S → ℝ is generally compactly representable
A linear approximator is a linear combination of these basis functions, for some weight vector w: V(s) = Σ_i w_i b_i(s)
Several questions:
– what is the best weight vector w?
– what is a "good" basis set B?
– what does this buy us computationally?

16 Flexibility of Linear Decomposition
Assume each basis function is compact
– e.g., refers only to a few variables: b_1(X,Y), b_2(W,Z), b_3(A)
Then the VF is compact:
– V(X,Y,W,Z,A) = w_1 b_1(X,Y) + w_2 b_2(W,Z) + w_3 b_3(A)
For a given representation size (10 parameters, with binary variables), we get more value flexibility (32 distinct values) compared to a piecewise-constant representation
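
A quick check of the counting claim, with the weighted basis tables w_i·b_i folded together and filled with arbitrary illustrative numbers:

    from itertools import product

    # 4 + 4 + 2 = 10 parameters in total
    wb1 = {(x, y): 1.0 * x + 2.0 * y for x, y in product((0, 1), repeat=2)}  # over (X, Y)
    wb2 = {(w, z): 4.0 * w + 8.0 * z for w, z in product((0, 1), repeat=2)}  # over (W, Z)
    wb3 = {a: 16.0 * a for a in (0, 1)}                                      # over A

    values = {wb1[(x, y)] + wb2[(w, z)] + wb3[a]
              for x, y, w, z, a in product((0, 1), repeat=5)}
    print(len(values))   # 32 distinct values over the 2^5 = 32 states, from 10 parameters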

17 Interpretation of Averagers
Averagers interpolate [figure omitted: a query point x is interpolated among grid vertices y_1, ..., y_4, where the grid vertices form the set Y]

18 Linear Methods?
So, if we can find good basis sets (ones that allow a good fit), the VF can be more compact
Several have been proposed: coarse coding, tile coding, radial basis functions, factored bases (projections onto a few variables), feature selection, first-order representations

19 Linear Approximation: Components
Assume a basis set B = { b_1, ..., b_k }
– each b_i: S → ℝ
– we view each b_i as an n-vector
– let A be the n × k matrix [ b_1 ... b_k ]
Linear VF: V(s) = Σ_i w_i b_i(s)
Equivalently: V = Aw
– so our approximation of V must lie in the subspace spanned by B
– let B also denote that subspace
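
In code, the matrix view amounts to the following sketch; the basis values and weights are made-up numbers:

    import numpy as np

    A = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [0.0, 1.0],
                  [2.0, 1.0],
                  [0.0, 0.0]])   # n x k: column i holds b_i(s) for every state s
    w = np.array([0.5, -1.0])    # weight vector

    V = A @ w                    # V(s) = sum_i w_i b_i(s), for all s at once
    print(V)                     # the approximate VF lies in the column space of A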

20 Policy Evaluation
The approximate state-value function is given by V_t(s) = Σ_i w_i b_i(s)
The gradient of the approximate value function with respect to the weights in this case is simply ∇_{w_i} V_t(s) = b_i(s)

21 Approximate Value Iteration
We might compute an approximate V by:
– letting V^0 = A w^0 for some weight vector w^0
– performing Bellman backups to produce V^1 = A w^1, V^2 = A w^2, V^3 = A w^3, etc.
Unfortunately, even if V^0 lies in the subspace spanned by B, L*(V^0) = L*(A w^0) generally will not
So we need to find the best approximation to L*(A w^0) in B before we can proceed

22 Projection
We wish to find a projection of our VF estimates into B that minimizes some error criterion
– we'll use the max norm
Given a V lying outside B, we want a w such that ||Aw − V||_∞ is minimal

23 Projection as a Linear Program
Finding a w that minimizes ||Aw − V||_∞ can be accomplished with a simple LP
The number of variables is small (k + 1), but the number of constraints is large (2 per state)
– this defeats the purpose of function approximation
– but let's ignore that for the moment
Variables: w_1, ..., w_k, ε
Minimize: ε
Subject to: ε ≥ V(s) − Aw(s), ∀ s
            ε ≥ Aw(s) − V(s), ∀ s
ε measures the max-norm difference between V and the "best fit"
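
One way to pose this LP in code is sketched below with scipy.optimize.linprog; the basis matrix A and target V are toy numbers, and the decision vector is (w_1, ..., w_k, ε):

    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [0.0, 1.0],
                  [2.0, 1.0]])            # n x k basis matrix
    V = np.array([0.3, 1.1, 0.9, 1.8])    # value estimate lying outside span(A)
    n, k = A.shape

    c = np.zeros(k + 1)
    c[-1] = 1.0                                             # minimize eps
    A_ub = np.vstack([np.hstack([-A, -np.ones((n, 1))]),    # V(s) - Aw(s) <= eps
                      np.hstack([ A, -np.ones((n, 1))])])   # Aw(s) - V(s) <= eps
    b_ub = np.concatenate([-V, V])
    bounds = [(None, None)] * k + [(0, None)]               # w free, eps >= 0

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    w, eps = res.x[:k], res.x[-1]
    print(w, eps)    # eps is the max-norm error of the best fit Aw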

24 Approximate Value Iteration
Run value iteration, but after each Bellman backup project the result back into the subspace B
Choose an arbitrary w^0 and let V^0 = A w^0
Then iterate:
– compute Ṽ^t = L*(A w^{t−1})
– let V^t = A w^t be the projection of Ṽ^t into B
The error at each step is given by ε
– the final error and convergence are not assured
There is an analog for policy iteration as well
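
A sketch of this loop on a toy MDP follows. One deliberate simplification: for brevity it projects in the least-squares sense via np.linalg.lstsq rather than with the max-norm LP of the previous slide; the MDP, basis, and iteration count are illustrative assumptions.

    import numpy as np

    def bellman_backup(V, P, R, gamma):
        """L*(V): per state, max over actions of expected reward plus discounted value.
        P has shape (num_actions, n, n); R has shape (num_actions, n)."""
        return np.max(R + gamma * (P @ V), axis=0)

    def approx_value_iteration(A, P, R, gamma=0.9, iters=50):
        n, k = A.shape
        w = np.zeros(k)                                        # arbitrary w0, so V0 = A w0
        for _ in range(iters):
            V_backup = bellman_backup(A @ w, P, R, gamma)      # V~_t = L*(A w_{t-1})
            w, *_ = np.linalg.lstsq(A, V_backup, rcond=None)   # project back into span(A)
        return w, A @ w

    # Toy 3-state, 2-action MDP and a 2-function basis (all numbers made up).
    P = np.array([[[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]],
                  [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.9, 0.0, 0.1]]])
    R = np.array([[0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
    A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

    w, V = approx_value_iteration(A, P, R)
    print(w, V)   # convergence is not guaranteed in general, as the slide notes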

25 Convergence Guarantees
Monte Carlo: converges to the minimal MSE
||Q̂_MC − Q^π||_P ≤ min_w ||Q̂_w − Q^π||_P
TD(λ): converges close to the minimal MSE [TV]
||Q̂_TD − Q^π||_P ≤ [(1 − γλ)/(1 − γ)] min_w ||Q̂_w − Q^π||_P
DP may diverge: there exist counterexamples [B, TV]

26 VFA Convergence Results
Linear TD(λ) converges if we visit states using the on-policy distribution
Off-policy linear TD(λ) and linear Q-learning are known to diverge in some cases
Q-learning and value iteration used with some averagers (including k-nearest-neighbour and decision trees) converge almost surely if particular exploration policies are used
A special case of policy iteration with Sarsa-style updates and linear function approximation converges
Residual algorithms are guaranteed to converge, but only very slowly

27 VFA: TD-Gammon
TD(λ) learning with a backprop net with one hidden layer
1,500,000 training games (self-play)
Equivalent in skill to the top dozen human players
Backgammon has ~10^20 states, so it can't be solved using DP

28 Homework Read about POMDPs: [Littman ’96] Ch.6-7

29 THE END

30 FA: Is It the Answer?
Not all FA methods are suited to use with RL. For example, NNs and statistical methods assume a static training set over which multiple passes are made.
In RL, however, it is important that learning be able to occur on-line, while interacting with the environment (or a model of the environment).
In particular, Q-learning seems to lose its convergence properties when integrated with FA.

31 A Linear, Gradient-Descent Version of Watkins's Q(λ)
Initialize θ arbitrarily and e ← 0
Repeat (for each episode):
  s ← initial state of episode
  For all a ∈ A(s):
    F(a) ← set of features present in s, a
    Q_a ← Σ_{i ∈ F(a)} θ(i)
  Repeat (for each step of the episode):
    With probability 1 − ε:
      a ← arg max_a Q_a
      e ← γλ e
    else:
      a ← a random action ∈ A(s)
      e ← 0
    For all i ∈ F(a): e(i) ← e(i) + 1
    Take action a; observe reward r and next state s'
    δ ← r − Q_a
    For all a ∈ A(s'):
      F(a) ← set of features present in s', a
      Q_a ← Σ_{i ∈ F(a)} θ(i)
    a' ← arg max_a Q_a
    δ ← δ + γ Q_{a'}
    θ ← θ + α δ e
  until s' is terminal
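
A hedged, runnable rendering of the pseudocode above, on a tiny chain world with one-hot (state, action) features; the environment, feature map, and hyperparameters are illustrative assumptions:

    import numpy as np

    N_STATES, ACTIONS = 6, (0, 1)            # action 0 = left, 1 = right
    GOAL = N_STATES - 1

    def features(s, a):
        """Indices of the binary features present in (s, a): one-hot encoding."""
        return [s * len(ACTIONS) + a]

    def step(s, a):
        """Toy chain dynamics: reward 1 only on reaching the goal state."""
        s2 = min(max(s + (1 if a == 1 else -1), 0), GOAL)
        return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

    def q(theta, s, a):
        return sum(theta[i] for i in features(s, a))

    def watkins_q_lambda(episodes=200, alpha=0.1, gamma=0.9, lam=0.8, eps=0.1):
        theta = np.zeros(N_STATES * len(ACTIONS))
        for _ in range(episodes):
            e = np.zeros_like(theta)                 # eligibility trace
            s, done = 0, False
            while not done:
                if np.random.rand() < eps:           # exploratory action
                    a = np.random.choice(ACTIONS)
                    e[:] = 0.0                       # cut the trace (Watkins)
                else:                                # greedy action
                    a = max(ACTIONS, key=lambda b: q(theta, s, b))
                    e *= gamma * lam
                for i in features(s, a):
                    e[i] += 1.0                      # accumulate trace
                s2, r, done = step(s, a)
                delta = r - q(theta, s, a)
                if not done:
                    delta += gamma * max(q(theta, s2, b) for b in ACTIONS)
                theta += alpha * delta * e
                s = s2
        return theta

    theta = watkins_q_lambda()
    print([max(q(theta, s, b) for b in ACTIONS) for s in range(N_STATES)])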

