Presentation transcript: Reinforcement Learning and Apprenticeship Learning. Pieter Abbeel and Andrew Y. Ng, Stanford University.

Slide 1: Reinforcement Learning and Apprenticeship Learning. Pieter Abbeel and Andrew Y. Ng, Stanford University.

Slide 2: Example of a reinforcement learning (RL) problem: highway driving.

Slide 3: Reinforcement learning (RL) formalism. Diagram: a dynamics model P_sa and a reward function R feed into reinforcement learning, which outputs a control policy π.

Slide 4: RL formalism. Assume that at each time step our system is in some state s_t. Upon taking an action a, the state randomly transitions to some new state s_{t+1}. We are also given a reward function R. The goal: pick actions over time so as to maximize the expected score E[R(s_0) + R(s_1) + ... + R(s_T)]. Diagram: the system dynamics carry the state from s_0 to s_1, ..., to s_T, and the rewards R(s_0) + R(s_1) + ... + R(s_T) are summed.

Slide 5: RL formalism. Markov decision process (S, A, P_sa, s_0, R). W.l.o.g. we assume the reward is linear in known features φ: S → R^k, i.e., R(s) = w^T φ(s) with ||w||_2 ≤ 1. A policy π maps states to actions. The utility of a policy π for reward R = w^T φ is U_w(π) = E[Σ_{t=0}^T R(s_t) | π] = w^T μ(π), where μ(π) = E[Σ_{t=0}^T φ(s_t) | π] denotes the feature expectations of π.
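
To make these quantities concrete, here is a minimal Python sketch (not from the original talk) of estimating the feature expectations μ(π) and the utility U_w(π) = w^T μ(π) by Monte Carlo rollouts; the environment interface (`reset`, `step`), the `policy`, and the feature map `phi` are hypothetical placeholders.

```python
import numpy as np

def feature_expectations(env, policy, phi, T, n_rollouts=100):
    """Monte Carlo estimate of mu(pi) = E[ sum_{t=0}^T phi(s_t) | pi ].

    `env` is assumed to expose reset() -> state and step(action) -> next_state;
    `policy` maps a state to an action; `phi` maps a state to a length-k vector.
    All of these are illustrative placeholders, not the authors' code.
    """
    mu = np.zeros_like(phi(env.reset()), dtype=float)
    for _ in range(n_rollouts):
        s = env.reset()
        for _ in range(T + 1):          # visit states s_0, ..., s_T
            mu += phi(s)
            s = env.step(policy(s))
    return mu / n_rollouts

def utility(w, mu):
    """U_w(pi) = w^T mu(pi) when the reward is R(s) = w^T phi(s)."""
    return float(np.dot(w, mu))
```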

Slide 6: RL formalism. Diagram (as on slide 3): a dynamics model P_sa and a reward function R feed into reinforcement learning, which outputs a control policy π.

Slide 7: Part I. Apprenticeship learning via inverse reinforcement learning.

Slide 8: Motivation. Reinforcement learning (RL) gives powerful tools for solving MDPs, but it can be difficult to specify the reward function. Example: highway driving.

Slide 9: Apprenticeship learning: learning from observing an expert.
Previous work:
– Learn to predict the expert's actions as a function of states.
– Usually lacks strong performance guarantees.
– (E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; ...)
Our approach:
– Based on inverse reinforcement learning (Ng & Russell, 2000).
– Returns a policy whose performance is as good as the expert's, as measured according to the expert's unknown reward function.

Slide 10: Algorithm. For t = 1, 2, ...
Inverse RL step: estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {π_i}.
RL step: compute the optimal policy π_t for the estimated reward weights w.
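
A hedged sketch of this loop in Python; `solve_mdp`, `estimate_feature_expectations`, and `inverse_rl_step` are assumed helpers (an RL solver, a rollout-based estimator such as the one sketched above, and the reward-estimation step sketched after the next slide), not part of the original presentation.

```python
import numpy as np

def apprenticeship_learning(mu_expert, solve_mdp, estimate_feature_expectations,
                            inverse_rl_step, epsilon=0.1, max_iters=100):
    """Alternate the inverse RL step and the RL step until the expert can no
    longer be separated from the policies found so far by more than epsilon."""
    mus = []                                   # feature expectations of found policies
    w = np.random.randn(len(mu_expert))        # arbitrary initial reward weights
    for _ in range(max_iters):
        pi = solve_mdp(w)                      # RL step: optimal policy for R = w^T phi
        mus.append(estimate_feature_expectations(pi))
        w, margin = inverse_rl_step(mu_expert, mus)   # inverse RL step
        if margin <= epsilon:                  # expert no longer clearly better: stop
            break
    return pi, w
```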

Slide 11: Algorithm: inverse RL step.
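
The formulas on this slide were not transcribed. As a hedged illustration, the inverse RL step described on slide 10 can be posed as the max-margin problem below (find reward weights under which the expert beats every previously found policy by the largest margin); cvxpy is used here only as a convenient generic solver and is an assumption of this sketch, not part of the talk.

```python
import cvxpy as cp

def inverse_rl_step(mu_expert, mus):
    """Max-margin reward estimation:
        maximize t  subject to  w^T mu_expert >= w^T mu_i + t  for every
        previously found policy i, and ||w||_2 <= 1.
    Returns the reward weights w and the achieved margin t."""
    k = len(mu_expert)
    w = cp.Variable(k)
    t = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ mu_expert >= w @ mu_i + t for mu_i in mus]
    problem = cp.Problem(cp.Maximize(t), constraints)
    problem.solve()
    return w.value, t.value
```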

Slide 12: Feature expectation closeness and performance. If we can find a policy π such that ||μ(π_E) − μ(π)||_2 ≤ ε, then for any underlying reward R*(s) = w*^T φ(s) (with ||w*||_2 ≤ 1), we have |U_{w*}(π_E) − U_{w*}(π)| = |w*^T μ(π_E) − w*^T μ(π)| ≤ ||w*||_2 ||μ(π_E) − μ(π)||_2 ≤ ε.
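
A tiny numerical illustration (made-up numbers, not from the talk) of this Cauchy-Schwarz argument: with ||w*||_2 ≤ 1, ε-close feature expectations imply ε-close utilities.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = rng.normal(size=4)
w_star /= np.linalg.norm(w_star)              # enforce ||w*||_2 = 1
mu_E = rng.normal(size=4)
mu_pi = mu_E + 0.01 * rng.normal(size=4)      # a policy with nearby feature expectations

gap = abs(w_star @ mu_E - w_star @ mu_pi)     # |U_{w*}(pi_E) - U_{w*}(pi)|
bound = np.linalg.norm(w_star) * np.linalg.norm(mu_E - mu_pi)
assert gap <= bound + 1e-12                   # Cauchy-Schwarz
print(f"utility gap {gap:.4f} <= bound {bound:.4f}")
```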

Slide 13: Theoretical results: convergence. Theorem. Let an MDP (without reward function), a k-dimensional feature vector φ, and the expert's feature expectations μ(π_E) be given. Then after at most kT²/ε² iterations, the algorithm outputs a policy π that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*^T φ(s), i.e., U_{w*}(π) ≥ U_{w*}(π_E) − ε.

Slide 14: Gridworld experiments. The reward function is piecewise constant over small regions. The features φ for IRL are indicators of these small regions. 128x128 grid, with small regions of size 16x16.
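
A small sketch (my own illustrative code, not the authors') of the feature map this setup implies: on a 128x128 grid partitioned into 16x16 macro-cells, φ(s) is the indicator of the macro-cell containing the state, giving k = 64 features.

```python
import numpy as np

GRID, REGION = 128, 16                     # 128x128 grid, 16x16 regions
K = (GRID // REGION) ** 2                  # 64 indicator features

def phi(state):
    """Indicator feature vector: which 16x16 region contains (row, col)?"""
    row, col = state
    region = (row // REGION) * (GRID // REGION) + (col // REGION)
    features = np.zeros(K)
    features[region] = 1.0
    return features

# Example: state (17, 3) lies in region row 1, region column 0 -> feature index 8.
assert phi((17, 3))[8] == 1.0
```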

Slide 15: Gridworld experiments.

Slide 16: Gridworld experiments.

Slide 17: Gridworld experiments.

Slide 18: Case study: highway driving. The only input to the learning algorithm was the driving demonstration (left panel); no reward function was provided. Input: driving demonstration. Output: learned behavior.

Slide 19: More driving examples. In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching that demonstration.

Slide 20: Conclusions for Part I. Our algorithm returns a policy with performance as good as the expert's, as evaluated according to the expert's unknown reward function. The algorithm is guaranteed to converge in poly(k, 1/ε) iterations. The algorithm exploits reward "simplicity" (vs. policy "simplicity" in previous approaches).

Slide 21: Part II. Apprenticeship learning for learning the transition model.

Slide 22: Learning the dynamics model P_sa from data. Diagram: P_sa is estimated from data; the dynamics model P_sa and reward function R feed into reinforcement learning, which outputs a control policy π.

Slide 23: Transition model. Consider the problem of controlling a complicated system like a helicopter. No models are available that specify the dynamics accurately as a function of the helicopter's specifications, so we need to estimate the dynamics from data, and we have to collect enough data to model all relevant parts of the flight envelope.
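
As a hedged illustration of the "estimate the dynamics from data" step, here is a minimal least-squares fit of a linear model s_{t+1} ≈ A s_t + B a_t from logged transitions; the real helicopter models in this line of work are richer, so treat this only as a sketch of the idea.

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of s_{t+1} ~= A s_t + B a_t from logged transitions.

    states:      (N, n_s) array of visited states
    actions:     (N, n_a) array of applied actions
    next_states: (N, n_s) array of resulting states
    Returns (A, B); Gaussian noise around the prediction can stand in for P_sa.
    """
    X = np.hstack([states, actions])                  # (N, n_s + n_a)
    theta, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    n_s = states.shape[1]
    A, B = theta[:n_s].T, theta[n_s:].T               # so that s' ~= A s + B a
    return A, B
```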

Slide 24: Collecting data to learn the dynamical model. State of the art: the E^3 algorithm (Kearns and Singh, 2002). Diagram: at each step, ask "do we have a good model of the dynamics?"; if yes, "exploit"; if no, "explore".

Slide 25: Learning the dynamics. Diagram: an expert human pilot flight produces data (a_1, s_1, a_2, s_2, a_3, s_3, ...), from which P_sa is learned; the dynamics model P_sa and reward function R feed into reinforcement learning to produce a control policy π; autonomous flight with that policy produces more data (a_1, s_1, a_2, s_2, a_3, s_3, ...), which is again used to learn P_sa.
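
A sketch of the loop this slide diagrams, with hypothetical stand-ins (`fit_model`, `solve_mdp`, `fly_real_system`) for the model estimator, the RL solver, and the real system: fit P_sa on all data so far, compute a policy in the learned simulator, fly it, and fold the new flight data back in.

```python
def model_learning_loop(expert_trajectories, fit_model, solve_mdp,
                        fly_real_system, reward, n_rounds=10):
    """Alternate model fitting and policy testing; no explicit exploration step.
    All of the helper functions here are illustrative placeholders."""
    data = list(expert_trajectories)         # start from the human pilot's flights
    policy, P_sa = None, None
    for _ in range(n_rounds):
        P_sa = fit_model(data)               # learn dynamics from all data so far
        policy = solve_mdp(P_sa, reward)     # RL step inside the learned simulator
        trajectory, succeeded = fly_real_system(policy)
        data.append(trajectory)              # failures cover what the model got wrong
        if succeeded:                        # policy works on the real system: done
            break
    return policy, P_sa
```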

Slide 26: Apprenticeship learning of the model. Theorem. Suppose we obtain m = O(poly(S, A, T, 1/ε)) examples from a human expert demonstrating the task. Then after a polynomial number N of iterations of testing/re-learning, with high probability we obtain a policy π whose performance is comparable to the expert's: U(π) ≥ U(π_E) − ε. Thus, as long as a demonstration is available, it is not necessary to explicitly explore.

Slide 27: Proof idea. From the initial pilot demonstrations, our model/simulator P_sa will be accurate for the part of the flight envelope (s, a) visited by the pilot. Our model/simulator will therefore correctly predict the helicopter's behavior under the pilot's policy π_E. Consequently, there is at least one policy (namely π_E) that looks like it is able to fly the helicopter well in our simulation. Thus, each time we solve the MDP using the current simulator P_sa, we will find a policy that successfully flies the helicopter according to P_sa. If, on the actual helicopter, this policy fails to fly the helicopter even though the model P_sa predicts that it should, then it must be visiting parts of the flight envelope that the model fails to model accurately. Hence, this gives useful training data for modeling new parts of the flight envelope.

Slide 28: Conclusions.

Slide 29: Conclusions. Diagram: dynamics model P_sa and reward function R feed into reinforcement learning, which outputs a control policy π. Given expert demonstrations, our inverse RL algorithm returns a policy with performance as good as the expert's, as evaluated according to the expert's unknown reward function. Given an initial demonstration, there is no need to explicitly explore the state/action space: even if you repeatedly "exploit" (use your best policy), you will collect enough data to learn a sufficiently accurate dynamical model to carry out your control task.

Slide 30: Thanks for your attention.

Slide 31: Different formulation.
LP formulation of the RL problem:
  max over λ of  Σ_{s,a} λ(s,a) R(s)
  s.t. for all s:  Σ_a λ(s,a) = Σ_{s',a} P(s|s',a) λ(s',a)
QP formulation of apprenticeship learning:
  min over λ, μ of  Σ_i (μ_{E,i} − μ_i)²
  s.t. for all s:  Σ_a λ(s,a) = Σ_{s',a} P(s|s',a) λ(s',a)
       for all i:  μ_i = Σ_{s,a} φ_i(s) λ(s,a)

Slide 32: Different formulation (ctd.). Our algorithm is equivalent to iteratively linearizing the QP at the current point (inverse RL step) and solving the resulting LP (RL step). Why not solve the QP directly? That is typically only possible for very small toy problems (curse of dimensionality). [Our algorithm makes use of existing RL solvers to deal with the curse of dimensionality.]

Slide 33: Simplification of the inverse RL step: QP → Euclidean projection.
In the inverse RL step:
– set μ̄^(i−1) = the orthogonal projection of μ_E onto the line through {μ̄^(i−2), μ(π^(i−1))};
– set w^(i) = μ_E − μ̄^(i−1).
Note: the theoretical results on convergence and sample complexity hold unchanged for the simpler algorithm.
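
A hedged numpy sketch of this projection step (my own code): project μ_E orthogonally onto the line through the previous projection point and the newest policy's feature expectations, and take w as the remaining direction toward μ_E.

```python
import numpy as np

def projection_step(mu_expert, mu_bar_prev, mu_new):
    """One simplified inverse RL step (projection version).

    mu_bar_prev: previous projection point (mu-bar^(i-2) in the slide's indexing)
    mu_new:      feature expectations mu(pi^(i-1)) of the latest policy
    Returns (mu_bar, w); ||w||_2 is the current margin to the expert."""
    d = mu_new - mu_bar_prev
    # Orthogonal projection of mu_expert onto the line through mu_bar_prev and mu_new.
    alpha = np.dot(d, mu_expert - mu_bar_prev) / np.dot(d, d)
    mu_bar = mu_bar_prev + alpha * d
    w = mu_expert - mu_bar
    return mu_bar, w
```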

Slide 34: Algorithm (projection version). Figure: illustration in feature-expectation space (axes φ_1, φ_2) showing the expert's feature expectations μ(π_E), the policies' feature expectations μ(π^(0)), μ(π^(1)), μ(π^(2)), the projection points μ̄^(1), μ̄^(2), and the successive weight vectors w^(1), w^(2), w^(3).

Slide 35: More driving examples. In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching that demonstration.

Slide 36: Proof (sketch). Figure: feature-expectation space (axes φ_1, φ_2) showing μ(π_E), μ(π^(0)), μ(π^(1)), the projection point μ̄^(1), the weight vector w^(1), and the distances d_0 and d_1 used in the convergence argument.

Slide 37: Proof (sketch).

Slide 38: Proof (sketch).

Slide 39: Algorithm (projection version). Figure (build-up, step 1): feature-expectation space (axes φ_1, φ_2) with μ_E, μ(π^(0)), the weight vector w^(1), and μ(π^(1)).

Slide 40: Algorithm (projection version). Figure (build-up, step 2): adds the projection point μ̄^(1), the weight vector w^(2), and μ(π^(2)).

Slide 41: Algorithm (projection version). Figure (build-up, step 3): adds the projection point μ̄^(2) and the weight vector w^(3).

Slide 42: Appendix: different view.
Bellman LP for solving MDPs:
  min over V of  c^T V
  s.t. for all s, a:  V(s) ≥ R(s,a) + γ Σ_{s'} P(s,a,s') V(s')
Dual LP:
  max over λ of  Σ_{s,a} λ(s,a) R(s,a)
  s.t. for all s:  c(s) − Σ_a λ(s,a) + γ Σ_{s',a} P(s',a,s) λ(s',a) = 0
Apprenticeship learning as a QP:
  min over λ of  Σ_i (μ_{E,i} − Σ_{s,a} λ(s,a) φ_i(s))²
  s.t. for all s:  c(s) − Σ_a λ(s,a) + γ Σ_{s',a} P(s',a,s) λ(s',a) = 0
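
A hedged sketch of the dual (occupancy-measure) LP above for a small tabular MDP, using scipy.optimize.linprog as a generic solver; the variable is λ(s,a), the constraints are the flow equations on this slide, and the function name and array conventions are my own.

```python
import numpy as np
from scipy.optimize import linprog

def solve_dual_lp(P, R, c, gamma):
    """Dual LP: maximize sum_{s,a} lambda(s,a) R(s,a) subject to, for every s,
    c(s) - sum_a lambda(s,a) + gamma * sum_{s',a} P(s',a,s) lambda(s',a) = 0.

    P[s_prev, a, s]: probability of moving to s from s_prev under action a
    R[s, a]: reward;  c[s]: initial state distribution;  gamma: discount factor
    """
    S, A = R.shape
    obj = -R.reshape(S * A)                    # linprog minimizes, so negate
    A_eq = np.zeros((S, S * A))
    for s in range(S):
        for s_prev in range(S):
            for a in range(A):
                A_eq[s, s_prev * A + a] += gamma * P[s_prev, a, s]
                if s_prev == s:
                    A_eq[s, s_prev * A + a] -= 1.0
    res = linprog(obj, A_eq=A_eq, b_eq=-np.asarray(c, dtype=float), bounds=(0, None))
    lam = res.x.reshape(S, A)                  # occupancy measure lambda(s, a)
    policy = lam.argmax(axis=1)                # act where the occupancy concentrates
    return lam, policy
```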

Slide 43: Different view (ctd.). Our algorithm is equivalent to iteratively linearizing the QP at the current point (inverse RL step) and solving the resulting LP (RL step). Why not solve the QP directly? That is typically only possible for very small toy problems (curse of dimensionality). [Our algorithm makes use of existing RL solvers to deal with the curse of dimensionality.]

Slide 44: Car driving results (more detail). Columns: Collision, Offroad Left, Left Lane, Middle Lane, Right Lane, Offroad Right.

Style 1
  Feature distr. (expert):   0, 0, 0.1325, 0.2033, 0.5983, 0.0658
  Feature distr. (learned):  5.00e-05, 0.0004, 0.0904, 0.2286, 0.604, 0.0764
  Weights learned:           -0.0767, -0.0439, 0.0077, 0.0078, 0.0318, -0.0035

Style 2
  Feature distr. (expert):   0.1167, 0, 0.0633, 0.4667, 0.47, 0
  Feature distr. (learned):  0.1332, 0, 0.1045, 0.3196, 0.5759, 0
  Weights learned:           0.234, -0.1098, 0.0092, 0.0487, 0.0576, -0.0056

Style 3
  Feature distr. (expert):   0, 0, 0, 0.0033, 0.7058, 0.2908
  Feature distr. (learned):  0, 0, 0, 0, 0.7447, 0.2554
  Weights learned:           -0.1056, -0.0051, -0.0573, -0.0386, 0.0929, 0.0081

Style 4
  Feature distr. (expert):   0.06, 0, 0, 0.0033, 0.2908, 0.7058
  Feature distr. (learned):  0.0569, 0, 0, 0, 0.2666, 0.7334
  Weights learned:           0.1079, -0.0001, -0.0487, -0.0666, 0.059, 0.0564

Style 5
  Feature distr. (expert):   0.06, 0, 0, 1, 0, 0
  Feature distr. (learned):  0.0542, 0, 0, 1, 0, 0
  Weights learned:           0.0094, -0.0108, -0.2765, 0.8126, -0.51, -0.0153

Slide 45: Apprenticeship Learning via Inverse Reinforcement Learning. Pieter Abbeel and Andrew Y. Ng, Stanford University.

