Apprenticeship Learning for Robotics, with Application to Autonomous Helicopter Flight Pieter Abbeel Stanford University Joint work with: Andrew Y. Ng, Adam Coates, J. Zico Kolter and Morgan Quigley

Outline: Preliminaries: reinforcement learning. Apprenticeship learning algorithms. Experimental results on various robotic platforms.

Reinforcement learning (RL). The system dynamics P_sa take the state from s_0 to s_1 to s_2 ... to s_{T-1} to s_T under actions a_0, a_1, ..., a_{T-1}, accumulating reward R(s_0) + R(s_1) + R(s_2) + ... + R(s_{T-1}) + R(s_T). Example reward function: R(s) = -||s - s*||. Goal: pick actions over time so as to maximize the expected score: E[R(s_0) + R(s_1) + ... + R(s_T)]. Solution: a policy π which specifies an action for each possible state for all times t = 0, 1, ..., T.
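To make the objective concrete, here is a minimal Monte-Carlo sketch of the quantity being maximized, E[R(s_0) + ... + R(s_T)]. This is illustrative only; the `policy`, `dynamics`, and `reward` callbacks are hypothetical stand-ins, not code from the talk.

```python
def expected_return(policy, dynamics, reward, s0, T, n_rollouts=100):
    """Monte-Carlo estimate of E[R(s_0) + R(s_1) + ... + R(s_T)].

    policy(s, t)   -> action a_t          (hypothetical interface)
    dynamics(s, a) -> next state sampled from P_sa
    reward(s)      -> scalar, e.g. R(s) = -||s - s*||
    """
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = s0, reward(s0)
        for t in range(T):
            a = policy(s, t)        # the policy picks an action for each state and time
            s = dynamics(s, a)      # stochastic transition under P_sa
            ret += reward(s)        # accumulate R(s_1) ... R(s_T)
        total += ret
    return total / n_rollouts
```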

Model-based reinforcement learning: run the RL algorithm in a simulator to obtain a control policy π.

Apprenticeship learning algorithms use a demonstration to help us find a good reward function, a good dynamics model, and a good control policy. (Diagram: “Reinforcement Learning” takes the reward function R and the dynamics model P_sa as inputs and outputs the control policy π.)

Apprenticeship learning: reward. (Diagram as before: the reward function R and the dynamics model P_sa feed into reinforcement learning, which outputs the control policy π.)

Many reward functions: complex trade-off. The reward function trades off: height differential of terrain, gradient of terrain around each foot, height differential between feet, … (25 features total for our setup).
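As an illustration of this feature-based reward (not code from the talk), here is a sketch under the usual apprenticeship-learning assumption that the reward is linear in the features, R(s) = wᵀφ(s); the `features` callback and the weights are hypothetical.

```python
import numpy as np

def linear_reward(weights, features):
    """Reward as a weighted trade-off over hand-designed features phi(s),
    e.g. terrain height differentials, foot clearance, etc. (25 features in
    the quadruped setup on the slide).  `features(s)` is a hypothetical
    callback returning the feature vector of state s."""
    return lambda s: float(np.dot(weights, features(s)))

def feature_expectations(trajectories, features, gamma=1.0):
    """Empirical (discounted) feature counts of a set of trajectories.
    Matching these counts between learner and expert is the core idea of
    apprenticeship learning via inverse RL [ICML 2004]: if the reward is
    linear in phi, matching feature expectations matches expected reward."""
    mus = [sum((gamma ** t) * np.asarray(features(s)) for t, s in enumerate(traj))
           for traj in trajectories]
    return np.mean(mus, axis=0)
```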

Example result [ICML 2004, NIPS 2008]

Reward function for aerobatics? Compact description: the reward function ~ a trajectory (rather than a trade-off).

Reward: intended trajectory. Perfect demonstrations are extremely hard to obtain. Multiple trajectory demonstrations: every demonstration is a noisy instantiation of the intended trajectory. The noise model captures (among others) position drift and time warping. If different demonstrations are suboptimal in different ways, they can capture the “intended” trajectory implicitly. [Related work: Atkeson & Schaal, 1997.]
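A toy generative view of the noise model just described (position drift plus time warping); the functional forms and parameter values below are assumptions for illustration, not the model from the talk.

```python
import numpy as np

def simulate_demo(intended, drift_std=0.05, warp_std=0.3, rng=None):
    """Generate one synthetic demonstration from an intended trajectory
    (array of shape (T, d)): the intended states are read through a slowly
    varying monotone time warp, and a random-walk position drift is added,
    mirroring the two noise sources named on the slide."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(intended)
    # Monotone time warp: indices advance by roughly one step with jitter.
    increments = np.clip(rng.normal(1.0, warp_std, size=T), 0.0, None)
    tau = np.clip(np.cumsum(increments), 0, T - 1).astype(int)
    # Random-walk drift added on top of the warped intended states.
    drift = np.cumsum(rng.normal(0.0, drift_std, size=(T, intended.shape[1])), axis=0)
    return intended[tau] + drift
```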

Example: airshow demos

Probabilistic graphical model for multiple demonstrations

Learning algorithm. Step 1: find the time warping and the distributional parameters; we use EM and dynamic time warping to alternately optimize over the different parameters. Step 2: find the intended trajectory.
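For the time-warping part of Step 1, a standard dynamic-time-warping alignment looks roughly like the sketch below. This is illustrative only; the actual algorithm alternates this alignment with EM updates of the drift and noise parameters.

```python
import numpy as np

def dtw_alignment(demo, reference, dist=lambda x, y: np.linalg.norm(x - y)):
    """Classic dynamic-time-warping alignment between one demonstration and a
    reference trajectory.  Returns the DP cost matrix and the warping path as
    (demo_index, reference_index) pairs."""
    n, m = len(demo), len(reference)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(np.asarray(demo[i - 1]), np.asarray(reference[j - 1]))
            D[i, j] = c + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[1:, 1:], path[::-1]
```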

After time-alignment

Apprenticeship learning for the dynamics model. (Diagram as before: the reward function R and the dynamics model P_sa feed into reinforcement learning, which outputs the control policy π.)

Apprenticeship learning for the dynamics model [ICML 2005]. Algorithms such as E^3 (Kearns and Singh, 2002) learn the dynamics by using exploration policies, which are dangerous/impractical for many systems. Our algorithm initializes the model from a demonstration, repeatedly executes “exploitation policies” that try to maximize rewards, and provably achieves near-optimal performance (compared to the teacher). Machine learning theory: the sample-generating process is complicated and non-IID, so standard learning theory bounds are not applicable; the proof uses a martingale construction over relative losses.
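Written out, the loop on this slide is roughly the following sketch; `fit_model`, `plan`, and `execute` are hypothetical callbacks standing in for system identification, the RL/optimal-control solver, and the real system.

```python
def apprenticeship_dynamics_learning(demo_data, fit_model, plan, execute, n_iters=10):
    """Exploration-free apprenticeship loop for the dynamics model:
    seed the model with teacher demonstrations, then repeatedly plan a
    reward-maximizing ("exploitation") policy in the current model, run it,
    and re-fit the model on all data collected so far."""
    data = list(demo_data)
    model = fit_model(data)
    policy = None
    for _ in range(n_iters):
        policy = plan(model)          # greedy w.r.t. current model, no explicit exploration
        trajectory = execute(policy)  # roll out on the real system
        data.extend(trajectory)       # non-IID data; the theory handles this via martingales
        model = fit_model(data)
    return model, policy
```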

Learning the dynamics model. Details of the algorithm for learning the dynamics from data: exploiting structure from physics; lagged learning criterion. [NIPS 2005, 2006]

Related work: Bagnell & Schneider, 2001; LaCivita et al., 2006; Ng et al., 2004a; Roberts et al., 2003; Saripalli et al., 2003; Ng et al., 2004b; Gavrilets, Martinos, Mettler and Feron. The maneuvers presented here are significantly more difficult than those flown by any other autonomous helicopter.

Autonomous nose-in funnel

Accuracy

Non-stationary maneuvers. Modeling is extremely complex: our dynamics model state is position, orientation, velocity, and angular rate; the true state also includes the air (!), head-speed, servos, deformation, etc. Key observation: in the vicinity of a specific point along a specific trajectory, these unknown state variables tend to take on similar values.

Example: z-acceleration

Local model learning algorithm. 1. Time-align trajectories. 2. Learn locally weighted models in the vicinity of the trajectory, with weights W(t') = exp(-(t - t')^2 / σ^2).
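A minimal locally weighted least-squares sketch using the weighting W(t') from the slide; the linear-model form and the variable layout are assumptions made for illustration.

```python
import numpy as np

def local_model(t, times, X, Y, sigma=1.0):
    """Fit a locally weighted linear model Y ≈ X @ A around trajectory time t,
    weighting each sample taken at time t' by W(t') = exp(-(t - t')^2 / sigma^2).
    X holds the inputs (e.g. state/controls), Y the targets (e.g. observed
    accelerations); the least-squares form is an assumption."""
    w = np.exp(-((times - t) ** 2) / sigma ** 2)   # temporal kernel W(t')
    Xw = X * w[:, None]                            # weight each row of X
    # Weighted least squares: A = (X^T W X)^{-1} X^T W Y
    A = np.linalg.solve(X.T @ Xw, X.T @ (Y * w[:, None]))
    return A
```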

Autonomous flips

Apprenticeship learning: RL algorithm. (Diagram as before, with annotations: a (crude) model, a (sloppy) demonstration or initial trial [none of the demos is exactly equal to the intended trajectory], and only a small number of real-life trials.)

Algorithm Idea. Input to the algorithm: an approximate model. Start by computing the optimal policy according to the model. (Figure: the real-life trajectory vs. the target trajectory.) The policy is optimal according to the model, so no improvement is possible based on the model.

Algorithm Idea (2) Update the model such that it becomes exact for the current policy.

Algorithm Idea (2) The updated model perfectly predicts the state sequence obtained under the current policy. We can use the updated model to find an improved policy.

Algorithm.
1. Find the (locally) optimal policy π_θ for the model.
2. Execute the current policy π_θ and record the state trajectory.
3. Update the model such that the new model is exact for the current policy π_θ.
4. Use the new model to compute the policy gradient g and update the policy: θ := θ + α g.
5. Go back to Step 2.
Notes: the step-size parameter α is determined by a line search. Instead of the policy gradient, any algorithm that provides a local policy improvement direction can be used; in our experiments we used differential dynamic programming.
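The loop above as a rough sketch (not the talk's code): the model is corrected with time-indexed additive bias terms so that it reproduces the most recent real trajectory exactly, and the local improvement step (policy gradient here; DDP in the experiments) is taken in the corrected model. The callbacks and the additive form of the correction are assumptions consistent with the slide.

```python
def improve_with_inaccurate_model(model_step, policy_grad, execute, theta,
                                  line_search, n_iters=10):
    """Iterative policy improvement with an inaccurate model (sketch):
    (1) roll out the current policy on the real system, (2) add a
    time-indexed correction so the model is exact along that trajectory,
    (3) take a policy-gradient step in the corrected model."""
    for _ in range(n_iters):
        real_traj = execute(theta)                 # list of (s_t, a_t, s_{t+1}) under current policy
        # Additive bias terms b_t chosen so that model_step(s_t, a_t) + b_t = s_{t+1}
        # along the real trajectory; the corrected model is exact for the current policy.
        bias = [s_next - model_step(s, a) for (s, a, s_next) in real_traj]
        corrected = lambda s, a, t: model_step(s, a) + bias[t]
        d = policy_grad(theta, corrected)          # local improvement direction in the corrected model
        alpha = line_search(theta, d)              # step size via line search, as on the slide
        theta = theta + alpha * d
    return theta
```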

Performance Guarantees. Let the local policy improvement algorithm be policy gradient. Notes: these assumptions are insufficient to give the same performance guarantees for model-based RL. The constant K depends only on the dimensionality of the state, action, and policy (θ), the horizon H, and an upper bound on the 1st and 2nd derivatives of the transition model, the policy, and the reward function.

Experimental setup. Our expert pilot provides 5-10 demonstrations. Our algorithm aligns the trajectories, extracts the intended trajectory as the target, and learns local models. We repeatedly run the controller and collect model errors until satisfactory performance is obtained. We use receding-horizon differential dynamic programming (DDP) to find the controller.
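Receding-horizon DDP in this setting means replanning a short window of the target trajectory at every step and applying only the first control; a minimal sketch follows, where the `ddp_solve` callback and the horizon length are hypothetical.

```python
def receding_horizon_control(state, t, target_traj, ddp_solve, horizon=50):
    """At time t, (re)solve a finite-horizon trajectory-following problem over
    the next `horizon` steps of the target trajectory and apply only the first
    control; the solve is repeated at every step."""
    window = target_traj[t : t + horizon]
    controls = ddp_solve(state, window)   # local optimal-control solve around the window
    return controls[0]                    # execute the first action, replan at t + 1
```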

Airshow [Switch to Quicktime for HD airshow.]

Airshow accuracy

Tic-toc

Chaos [Switch to Quicktime for HD chaos.]

Conclusion. Apprenticeship learning algorithms help us find better controllers by exploiting teacher demonstrations. Algorithmic instantiations: inverse reinforcement learning (learn trade-offs in the reward; learn the “intended” trajectory); model learning (no explicit exploration; local models); control with a crude model plus a small number of trials.

Current and future work. Automate more general advice taking. Guaranteed safe exploration: safely learning to outperform the teacher. Autonomous helicopters: assist in wildland fire fighting; auto-rotation landings. Fixed-wing formation flight: potential savings for even a three-aircraft formation: 20%.

Apprenticeship Learning via Inverse Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML.
Learning First Order Markov Models for Control, Pieter Abbeel and Andrew Y. Ng. In NIPS 17.
Exploration and Apprenticeship Learning in Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML.
Modeling Vehicular Dynamics, with Application to Modeling Helicopters, Pieter Abbeel, Varun Ganapathi and Andrew Y. Ng. In NIPS 18.
Using Inaccurate Models in Reinforcement Learning, Pieter Abbeel, Morgan Quigley and Andrew Y. Ng. In Proc. ICML.
An Application of Reinforcement Learning to Aerobatic Helicopter Flight, Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng. In NIPS 19.
Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, J. Zico Kolter, Pieter Abbeel and Andrew Y. Ng. In NIPS 20, 2008.

Full multiple demonstration model