Apprenticeship learning for robotic control. Pieter Abbeel, Stanford University. Joint work with Andrew Y. Ng, Adam Coates, and Morgan Quigley.

This talk. Pipeline: a dynamics model P_sa and a reward function R are fed to reinforcement learning, which produces a control policy π. Recurring theme: apprenticeship learning.

Motivation. In practice, reward functions are hard to specify, and people tend to tweak them a lot. Motivating example: helicopter tasks, e.g., a flip. Another motivating example: highway driving.

Apprenticeship Learning. Learning from observing an expert. Previous work: learn to predict the expert's actions as a function of the state; usually lacks strong performance guarantees (e.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002; Atkeson & Schaal, 1997; …). Our approach: based on inverse reinforcement learning (Ng & Russell, 2000); returns a policy whose performance is as good as the expert's, as measured by the expert's unknown reward function. [Most closely related work: Ratliff et al., 2005, 2006.]

Algorithm. For t = 1, 2, …: Inverse RL step: estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {π_i}. RL step: compute the optimal policy π_t for the estimated reward w. [Abbeel & Ng, 2004]
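
This alternation can be sketched in a few lines of Python. A minimal sketch under assumed interfaces: `solve_mdp`, `feature_expectations`, and `irl_step` are hypothetical placeholders for the planner, the feature-expectation estimator, and the max-margin step of the next slide, not code from the talk.

```python
import numpy as np

def apprenticeship_learning(mu_expert, solve_mdp, feature_expectations, irl_step,
                            max_iters=50, eps=1e-2):
    """Alternate RL and inverse-RL steps until no reward can separate the expert
    from the policies found so far (sketch in the spirit of Abbeel & Ng, 2004)."""
    k = mu_expert.shape[0]
    w = np.random.randn(k)                         # arbitrary initial reward weights
    policies, mus = [], []
    for t in range(max_iters):
        pi_t = solve_mdp(w)                        # RL step: optimal policy for current w
        policies.append(pi_t)
        mus.append(feature_expectations(pi_t))     # estimate mu(pi_t)
        w, margin = irl_step(mu_expert, mus)       # inverse RL step: max-margin reward
        if margin <= eps:                          # expert no longer clearly better: stop
            break
    return policies, w
```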

Maximize , w:||w|| 2 ≤ 1  s.t. U w (  E )  U w (  i ) +  i=1,…,t-1  = margin of expert’s performance over the performance of previously found policies. U w (  ) = E [  t=1 R(s t )|  ] = E [  t=1 w T  (s t )|  ] = w T E [  t=1  (s t )|  ] = w T  (  )  (  ) = E [  t=1  (s t )|  ] are the “feature expectations” Algorithm: IRL step T T T T

Feature Expectation Closeness and Performance. If we can find a policy π̃ such that ||μ(π_E) - μ(π̃)||_2 ≤ ε, then for any underlying reward R*(s) = w*^T φ(s) with ||w*||_2 ≤ 1, we have |U_w*(π_E) - U_w*(π̃)| = |w*^T μ(π_E) - w*^T μ(π̃)| ≤ ||w*||_2 · ||μ(π_E) - μ(π̃)||_2 ≤ ε.
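
In practice, μ(π_E) and μ(π) are estimated by Monte Carlo over sampled trajectories. A minimal sketch, assuming each trajectory is a list of states and `phi` maps a state to its k-dimensional feature vector (both interfaces are my assumptions):

```python
import numpy as np

def estimate_feature_expectations(trajectories, phi, k):
    """Monte Carlo estimate of mu(pi) = E[sum_{t=1}^T phi(s_t) | pi]."""
    mu = np.zeros(k)
    for traj in trajectories:                      # traj = [s_1, s_2, ..., s_T]
        mu += np.sum([phi(s) for s in traj], axis=0)
    return mu / len(trajectories)                  # average over sampled trajectories
```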

Theoretical Results: Convergence. Theorem. Let an MDP (without reward function), a k-dimensional feature vector φ, and the expert's feature expectations μ(π_E) be given. Then after at most kT²/ε² iterations, the algorithm outputs a policy π that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*^T φ(s), i.e., U_w*(π) ≥ U_w*(π_E) - ε.

Case study: highway driving. Input: a driving demonstration (left panel). Output: learned behavior (right panel). The only input to the learning algorithm was the driving demonstration; no reward function was provided.

More driving examples. In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching that demonstration.

Inverse reinforcement learning summary. Our algorithm returns a policy with performance as good as the expert's, as evaluated on the expert's unknown reward function. The algorithm is guaranteed to converge in poly(k, 1/ε) iterations. It exploits reward "simplicity" (vs. policy "simplicity" in previous approaches).

The dynamics model. Pipeline recap: the dynamics model P_sa and the reward function R are fed to reinforcement learning, which produces a control policy π.

Collecting data to learn the dynamics model

Learning the dynamics model P_sa from data. Estimate P_sa from data: for example, in discrete-state problems, estimate P_sa(s') as the fraction of times you transitioned to state s' after taking action a in state s. Challenge: collecting enough data to guarantee that you can model the entire flight envelope.
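
A minimal sketch of this count-based estimate for a discrete MDP, assuming the data arrives as (s, a, s') triples (the data format is my assumption):

```python
import numpy as np
from collections import defaultdict

def estimate_transition_model(transitions, n_states):
    """Estimate P_sa(s') as the empirical fraction of observed (s, a) -> s' transitions."""
    counts = defaultdict(lambda: np.zeros(n_states))
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    P = {}
    for (s, a), c in counts.items():
        P[(s, a)] = c / c.sum()        # normalize counts into a distribution over s'
    return P                           # dict: (s, a) -> probability vector over next states

# Usage: P = estimate_transition_model([(0, 1, 2), (0, 1, 2), (0, 1, 3)], n_states=4)
# P[(0, 1)] is approximately [0, 0, 0.67, 0.33]
```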

Collecting data to learn the dynamics model. State of the art: the E³ algorithm (Kearns and Singh, 2002). At each state, ask: do we have a good model of the dynamics here? If yes, "exploit"; if no, "explore".
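
That explore/exploit test can be rendered as a simple known-state check. A sketch in the spirit of E³ under a simplified, assumed interface (visit counts as the "known" criterion, hypothetical `exploit_policy` and `explore_policy` callables); it is not the full algorithm.

```python
def e3_style_action(state, visit_counts, known_threshold, exploit_policy, explore_policy):
    """Exploit where the dynamics are well estimated; otherwise explore to gather data."""
    if visit_counts[state] >= known_threshold:   # dynamics at this state are "known"
        return exploit_policy(state)             # plan for reward within the known MDP
    return explore_policy(state)                 # e.g. plan to reach poorly modeled states
```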

Aggressive exploration (Manual flight) Aggressively exploring the edges of the flight envelope isn’t always a good idea.

Learning the dynamics. Expert human pilot flight yields data (a_1, s_1, a_2, s_2, a_3, s_3, …), from which we learn P_sa; the dynamics model P_sa and the reward function R are fed to reinforcement learning, which produces a control policy π. Autonomous flight under that policy yields further data (a_1, s_1, a_2, s_2, a_3, s_3, …), which is used to re-learn P_sa, and the loop repeats.
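
A minimal sketch of this loop, with `learn_dynamics`, `solve_mdp`, and `fly` as hypothetical placeholders for the model learner, the planner, and an autonomous flight that returns its logged (action, state) data:

```python
def apprenticeship_model_learning(expert_data, learn_dynamics, solve_mdp, fly,
                                  reward, n_rounds=3):
    """Iterate: fit P_sa to all data so far, plan a policy in the fitted model, fly it,
    and add the new flight data (sketch in the spirit of Abbeel & Ng, 2005)."""
    data = list(expert_data)              # start from the pilot's demonstrations
    for _ in range(n_rounds):             # in practice, 1 or 2 rounds are usually enough
        P_sa = learn_dynamics(data)       # model is accurate where data has been collected
        policy = solve_mdp(P_sa, reward)  # RL step in the current simulator
        data.append(fly(policy))          # autonomous flight provides data on new regions
    return policy, P_sa
```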

Apprenticeship learning of the model. Theorem. Suppose that we obtain m = O(poly(S, A, T, 1/ε)) examples from a human expert demonstrating the task. Then after a polynomial number k of iterations of testing/re-learning, with high probability, we will obtain a policy π whose performance is comparable to the expert's: U(π) ≥ U(π_E) - ε. Thus, so long as a demonstration is available, it isn't necessary to explicitly explore. In practice, k = 1 or 2 is almost always enough. [Abbeel & Ng, 2005]

Proof idea. From the initial pilot demonstrations, our model/simulator P_sa will be accurate for the part of the flight envelope (s, a) visited by the pilot. The model/simulator will therefore correctly predict the helicopter's behavior under the pilot's policy π_E. Consequently, there is at least one policy (namely π_E) that looks like it is able to fly the helicopter in our simulation. Thus, each time we solve the MDP using the current simulator P_sa, we will find a policy that successfully flies the helicopter according to P_sa. If this policy fails to fly the actual helicopter even though the model P_sa predicts that it should, then it must be visiting parts of the flight envelope that the model fails to capture accurately. Hence, this gives useful training data to model new parts of the flight envelope.

Configurations flown (exploitation only)

Tail-in funnel

Nose-in funnel

In-place rolls

In-place flips

Acknowledgements Andrew Ng, Adam Coates, Morgan Quigley

Thank You!