Apprenticeship Learning for Robotic Control, with Applications to Quadruped Locomotion and Autonomous Helicopter Flight Pieter Abbeel Stanford University.

Apprenticeship Learning for Robotic Control, with Applications to Quadruped Locomotion and Autonomous Helicopter Flight. Pieter Abbeel, Stanford University. (Variation hereof presented at Cornell/USC/UCSD/Michigan/UNC/Duke/UCLA/UW/EPFL/Berkeley/CMU, Winter/Spring 2008.) In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov, Sebastian Thrun.

Big picture and key challenges. The reinforcement learning / optimal control loop: a dynamics model P_sa (a probability distribution over next states given the current state and action) and a reward function R (describing the desirability of being in each state) are fed to a reinforcement learning / optimal control algorithm, which outputs a controller/policy π prescribing the action to take in each state. Key challenges: providing a formal specification of the control task; building a good dynamics model; finding closed-loop controllers.

Overview. Apprenticeship learning algorithms leverage expert demonstrations to learn to perform a desired task, with formal guarantees on running time, sample complexity, and the performance of the resulting controller. They enabled us to solve highly challenging, previously unsolved, real-world control problems in quadruped locomotion and autonomous helicopter flight.

Example task: driving

Problem setup. Input: a dynamics model / simulator P_sa(s_t+1 | s_t, a_t); no reward function; the teacher's demonstration s_0, a_0, s_1, a_1, s_2, a_2, … (= a trace of the teacher's policy π*). Desired output: a policy π which (ideally) has performance guarantees, i.e., E[Σ_t R*(s_t) | π] ≥ E[Σ_t R*(s_t) | π*] − ε. Note: R* is unknown.

Prior work: behavioral cloning. Formulate imitation as a standard machine learning problem: fix a policy class (e.g., support vector machine, neural network, decision tree, deep belief net, …) and estimate a policy from the training examples (s_0, a_0), (s_1, a_1), (s_2, a_2), … E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric.
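To make the behavioral-cloning view concrete, here is a minimal sketch that fits a policy directly to the teacher's state-action pairs; the data, feature layout, and choice of scikit-learn's SVC are assumptions for illustration, not the pipeline of any of the cited papers.

import numpy as np
from sklearn.svm import SVC

# Hypothetical demonstration data: states as feature vectors, actions as discrete labels.
states = np.array([[0.0, 1.2], [0.5, 0.9], [1.0, 0.4], [1.5, 0.1]])   # s_0, s_1, ...
actions = np.array([0, 0, 1, 1])                                      # a_0, a_1, ...

# "Fix a policy class" (here an SVM) and estimate a policy from the (s_t, a_t) pairs.
policy = SVC(kernel="rbf").fit(states, actions)

def act(state):
    # The cloned policy: map a state to the action the teacher would (hopefully) take.
    return policy.predict(np.asarray(state).reshape(1, -1))[0]

print(act([0.2, 1.1]))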

Limitations: behavioral cloning fails to provide strong performance guarantees, and it rests on an underlying assumption of policy simplicity.

Problem structure. In the same loop, the controller/policy π that prescribes the action to take for each state is typically very complex, whereas the reward function R is often fairly succinct.

Apprenticeship learning [Abbeel & Ng, 2004]. Key idea: learn through reward functions rather than directly learning policies. Assume the reward is linear in known features, R_w(s) = w·φ(s). Initialize: pick some controller π_0. Iterate for i = 1, 2, …: "Guess" the reward function: find a reward function such that the teacher maximally outperforms all previously found controllers. Find the optimal control policy π_i for the current guess of the reward function R_w. If there is no reward function for which the teacher significantly outperforms the thus-far found policies, exit the algorithm.
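A minimal sketch of the projection variant of this algorithm from Abbeel & Ng (2004), with solve_mdp and feature_expectations passed in as hypothetical stand-ins for the RL/optimal-control solver and for Monte-Carlo estimates of a policy's feature expectations.

import numpy as np

def apprenticeship_projection(mu_expert, mu_init, solve_mdp, feature_expectations,
                              eps=1e-3, max_iter=50):
    # mu_expert: empirical feature expectations of the teacher (from demonstrations).
    # mu_init: feature expectations of some initial controller pi_0.
    # solve_mdp(w): returns a policy (approximately) optimal for reward R_w(s) = w . phi(s).
    # feature_expectations(pi): estimates that policy's feature expectations.
    mu_bar = mu_init
    w = mu_expert - mu_bar
    for _ in range(max_iter):
        w = mu_expert - mu_bar              # "guess" the reward: direction in which the teacher wins most
        if np.linalg.norm(w) <= eps:        # teacher no longer significantly outperforms found policies
            break
        pi_i = solve_mdp(w)                 # optimal policy for the current reward guess R_w
        mu_i = feature_expectations(pi_i)
        d = mu_i - mu_bar                   # project mu_bar toward mu_i (closest mixture to mu_expert)
        mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / max(d @ d, 1e-12) * d
    return w, mu_bar                        # final reward direction and achieved feature expectations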

Theoretical guarantees. The guarantee is with respect to the unrecoverable reward function R* of the teacher. The sample complexity does not depend on the complexity of the teacher's policy π*.

Related work. Prior work: behavioral cloning (covered earlier); utility elicitation / inverse reinforcement learning (Ng & Russell, 2000), which gives no strong performance guarantees. Closely related later work: Ratliff et al., 2006, 2007; Neu & Szepesvari, 2007; Syed & Schapire, 2008. Work on specialized reward functions (trajectories): e.g., Atkeson & Schaal, 1997.

Highway driving. (Videos: teacher in training world; learned policy in testing world.) Input: a dynamics model / simulator P_sa(s_t+1 | s_t, a_t) and the teacher's demonstration: 1 minute in the "training world". Note: R* is unknown. Reward features: 5 features corresponding to lanes/shoulders; 10 features corresponding to the presence of another car in the current lane at different distances.
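As an illustration of how such a feature-based reward might be encoded, here is a sketch that builds the 15-dimensional feature vector and scores a state with learned weights w; the exact feature definitions and bucket sizes below are assumptions, not the paper's.

import numpy as np

N_LANE_FEATURES = 5      # assumed: one indicator per lane/shoulder the car can occupy
N_DIST_FEATURES = 10     # assumed: indicators for a car ahead in the current lane, bucketed by distance

def driving_features(lane_index, dist_to_car_ahead, bucket_size=5.0):
    # Map a driving state to a 15-dimensional reward feature vector (illustrative encoding).
    phi = np.zeros(N_LANE_FEATURES + N_DIST_FEATURES)
    phi[lane_index] = 1.0                                   # which lane/shoulder we are in
    if dist_to_car_ahead is not None:
        bucket = min(int(dist_to_car_ahead // bucket_size), N_DIST_FEATURES - 1)
        phi[N_LANE_FEATURES + bucket] = 1.0                 # other car in current lane, by distance
    return phi

# With learned weights w, the reward of a state is a linear function of its features:
w = np.zeros(N_LANE_FEATURES + N_DIST_FEATURES)             # placeholder; w comes from apprenticeship learning
reward = w @ driving_features(lane_index=2, dist_to_car_ahead=12.0)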

More driving examples. In each video, the left sub-panel ("Driving demonstration") shows a demonstration of a different driving "style", and the right sub-panel ("Learned behavior") shows the behavior learned from watching that demonstration.

Parking lot navigation [Abbeel et al., submitted]. The reward function trades off: curvature, smoothness, distance to obstacles, alignment with principal directions.

Experimental setup. Demonstrate parking lot navigation on the "train parking lot." Run our apprenticeship learning algorithm to find a set of reward weights w. Receive the "test parking lot" map plus a starting point and destination. Find a policy for navigating the test parking lot using the learned reward weights.
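The generalization step (apply the learned weights w to a map the teacher never drove on) can be sketched as follows; the per-cell feature maps and the planner_over_costmap call are hypothetical stand-ins for the actual map features and path planner.

import numpy as np

def cost_map_from_learned_weights(feature_maps, w):
    # feature_maps: array of shape (n_features, H, W), one map per reward feature
    # (e.g., distance to obstacles, alignment with principal directions) of the *test* lot.
    # Returns an (H, W) cost map; lower cost corresponds to higher learned reward.
    reward_map = np.tensordot(w, feature_maps, axes=1)      # linear reward: sum_k w_k * phi_k(cell)
    return -reward_map

# Hypothetical usage on the test parking lot:
# costs = cost_map_from_learned_weights(test_lot_features, w_learned)
# path = planner_over_costmap(costs, start, goal)           # any path planner over the cost map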

Learned controller

Quadruped [Kolter, Abbeel & Ng, 2008]. The reward function trades off 25 features.

Experimental setup. Demonstrate a path across the "training terrain." Run our apprenticeship learning algorithm to find a set of reward weights w. Receive the "testing terrain" (a height map). Find a policy for crossing the testing terrain using the learned reward weights.

Without learning

With learned reward function

Apprenticeship learning: learn R. Diagram: the teacher's flight (s_0, a_0, s_1, a_1, …) is used to learn the reward function R, which together with the dynamics model P_sa is fed to reinforcement learning / optimal control to produce the controller π.

The same diagram, emphasizing the reinforcement learning / optimal control step: given the reward function R learned from the teacher's flight (s_0, a_0, s_1, a_1, …) and the dynamics model P_sa, it produces the controller π.

Motivating example. Two routes to an accurate dynamics model P_sa: a textbook model plus the specification, or collecting flight data and learning the model from data. Questions: How to fly for data collection? How to ensure that the entire flight envelope is covered?

Aggressive (manual) exploration

Desired properties: never any explicit exploration (neither manual nor autonomous); near-optimal performance (compared to the teacher); a small number of teacher demonstrations; a small number of autonomous trials.

Apprenticeship learning of the model: learn P_sa. Diagram: the teacher's flight and the autonomous flights (s_0, a_0, s_1, a_1, …) are used to learn the dynamics model P_sa; together with the reward function R, reinforcement learning / optimal control produces the controller π, which in turn generates the next autonomous flight.
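A minimal sketch of this loop; fit_dynamics_model, solve_optimal_control and fly are hypothetical stand-ins for the actual model-learning, control and flight components.

def apprenticeship_model_learning(teacher_trajectories, reward,
                                  fit_dynamics_model, solve_optimal_control, fly,
                                  n_rounds=10):
    # No explicit exploration: every autonomous trajectory comes from flying the controller
    # that is optimal under the current model; failures visit poorly-modeled states and so
    # provide exactly the data needed to improve the model.
    data = list(teacher_trajectories)              # teacher's flights (s_0, a_0, s_1, a_1, ...)
    model, controller = None, None
    for _ in range(n_rounds):
        model = fit_dynamics_model(data)           # learn P_sa from all data so far
        controller = solve_optimal_control(model, reward)
        trajectory = fly(controller)               # autonomous flight on the real system
        data.append(trajectory)
    return model, controller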

Theoretical guarantees [Abbeel & Ng, 2005]: no explicit exploration is required.

Apprenticeship learning summary. The teacher's flight (s_0, a_0, s_1, a_1, …) is used to learn the reward function R and, together with the autonomous flights, to learn the dynamics model P_sa; reinforcement learning / optimal control then produces the controller π.

Other relevant parts of the story. Learning the dynamics model from data: locally weighted models; exploiting structure from physics; simulation accuracy at time-scales relevant for control. Reinforcement learning / optimal control: model predictive control; receding-horizon differential dynamic programming. [Abbeel et al. 2005, 2006a, 2006b, 2007]
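For the "locally weighted models" ingredient, here is a minimal locally weighted linear regression sketch for one-step dynamics prediction; it is the generic textbook form, not the exact model class used for the helicopter, and the bandwidth and ridge values are placeholders.

import numpy as np

def locally_weighted_prediction(query, X, Y, bandwidth=1.0, ridge=1e-6):
    # X: (N, d) logged inputs (e.g., state-action vectors); Y: (N, k) targets (e.g., accelerations).
    # Each logged point is weighted by its closeness to the query, and a weighted,
    # ridge-regularized linear model is fit locally around the query point.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])           # append a bias term
    qb = np.append(query, 1.0)
    w = np.exp(-np.sum((X - query) ** 2, axis=1) / (2.0 * bandwidth ** 2))   # Gaussian weights
    A = Xb.T @ (w[:, None] * Xb) + ridge * np.eye(Xb.shape[1])
    B = Xb.T @ (w[:, None] * Y)
    theta = np.linalg.solve(A, B)                           # weighted least-squares solution
    return qb @ theta

# Example: locally_weighted_prediction(np.zeros(3), np.random.randn(200, 3), np.random.randn(200, 2))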

Related work: Bagnell & Schneider, 2001; LaCivita, Papageorgiou, Messner & Kanade, 2002; Ng, Kim, Jordan & Sastry, 2004a (2001); Roberts, Corke & Buskey, 2003; Saripalli, Montgomery & Sukhatme, 2003; Shim, Chung, Kim & Sastry, 2003; Doherty et al.; Gavrilets, Martinos, Mettler & Feron, 2002; Ng et al., 2004b. The maneuvers presented here are significantly more challenging than those flown by any other autonomous helicopter.

Autonomous aerobatic flips (attempt) before apprenticeship learning. Task description: meticulously hand-engineered. Model: learned from (commonly used) frequency-sweep data.

Experimental setup for helicopter: 1. Our expert pilot demonstrates the airshow several times.

Demonstrations

Experimental setup for helicopter: 1. Our expert pilot demonstrates the airshow several times. 2. Learn a reward function (a target trajectory). 3. Learn a dynamics model.

Learned reward (trajectory)

Experimental setup for helicopter: 1. Our expert pilot demonstrates the airshow several times. 2. Learn a reward function (a target trajectory). 3. Learn a dynamics model. 4. Find the optimal control policy for the learned reward and dynamics model. 5. Autonomously fly the airshow. 6. Learn an improved dynamics model; go back to step 4.
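For step 4, the learned reward can be viewed as penalizing deviation from the learned target trajectory; a minimal sketch of such a quadratic tracking cost follows, with the weighting matrices Q and R and the state/control layout assumed for illustration.

import numpy as np

def trajectory_tracking_cost(states, controls, target_states, Q=None, R=None):
    # states, target_states: (T, n) arrays; controls: (T, m) array.
    # Cost = sum_t (s_t - s*_t)' Q (s_t - s*_t) + u_t' R u_t, which the optimal-control
    # step (e.g., receding-horizon DDP / MPC) minimizes under the learned dynamics model.
    n, m = states.shape[1], controls.shape[1]
    Q = np.eye(n) if Q is None else Q          # penalize deviation from the target trajectory
    R = np.eye(m) if R is None else R          # penalize aggressive control inputs
    err = states - target_states
    return float(np.einsum('ti,ij,tj->', err, Q, err) +
                 np.einsum('ti,ij,tj->', controls, R, controls))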

Accuracy. White: target trajectory. Black: autonomously flown trajectory.

Summary. Apprenticeship learning algorithms learn to perform a task from observing expert demonstrations of the task, with formal guarantees on running time, sample complexity, and the performance of the resulting controller. They enabled us to solve highly challenging, previously unsolved, real-world control problems.

Current and future work. Applications: autonomous helicopters to assist in wildland fire fighting; fixed-wing formation flight (estimated fuel savings for a three-aircraft formation: 20%). Learning from demonstrations only scratches the surface of the potential impact on robotics of work at the intersection of machine learning and control: safe autonomous learning; more general advice taking.

Thank you.

Chaos (short)

Chaos (long)

Flips

Auto-rotation descent

Tic-toc

Autonomous nose-in funnel

Model Learning: Proof Idea. From the initial pilot demonstrations, our model/simulator P_sa will be accurate for the part of the state space (s, a) visited by the pilot. Our model/simulator will therefore correctly predict the helicopter's behavior under the pilot's controller π*. Consequently, there is at least one controller (namely π*) that looks capable of flying the helicopter well in our simulation. Thus, each time we solve for the optimal controller using the current model/simulator P_sa, we will find a controller that successfully flies the helicopter according to P_sa. If, on the actual helicopter, this controller fails to fly the helicopter (despite the model P_sa predicting that it should), then it must be visiting parts of the state space that are inaccurately modeled. Hence, we get useful training data to improve the model. This can happen only a small number of times.

Apprenticeship Learning via Inverse Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2004.
Learning First Order Markov Models for Control, Pieter Abbeel and Andrew Y. Ng. In NIPS 17, 2005.
Exploration and Apprenticeship Learning in Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2005.
Modeling Vehicular Dynamics, with Application to Modeling Helicopters, Pieter Abbeel, Varun Ganapathi and Andrew Y. Ng. In NIPS 18, 2006.
Using Inaccurate Models in Reinforcement Learning, Pieter Abbeel, Morgan Quigley and Andrew Y. Ng. In Proc. ICML, 2006.
An Application of Reinforcement Learning to Aerobatic Helicopter Flight, Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng. In NIPS 19, 2007.
Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, J. Zico Kolter, Pieter Abbeel and Andrew Y. Ng. In NIPS 20, 2008.

Helicopter dynamics model in auto

Performance guarantee intuition. Intuition by example: suppose the unknown reward is a linear combination of two features, R*(s) = w_1 φ_1(s) + w_2 φ_2(s). If the returned controller π satisfies E[Σ_t φ_1(s_t) | π] = E[Σ_t φ_1(s_t) | π*] and E[Σ_t φ_2(s_t) | π] = E[Σ_t φ_2(s_t) | π*], then no matter what the values of w_1 and w_2 are, the controller π performs as well as the teacher's controller π*.
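A small numeric check of this intuition (a generic example, not from the slides): if two policies' feature expectations are within ε of each other in Euclidean norm, then for any weight vector of norm at most 1 their expected rewards differ by at most ε, by Cauchy-Schwarz.

import numpy as np

rng = np.random.default_rng(0)

mu_teacher = np.array([3.0, 1.5, 0.2])                      # teacher's feature expectations (assumed values)
mu_policy = mu_teacher + 1e-3 * rng.standard_normal(3)      # returned controller: nearby feature expectations
eps = np.linalg.norm(mu_policy - mu_teacher)

for _ in range(1000):
    w = rng.standard_normal(3)
    w /= max(np.linalg.norm(w), 1.0)                        # any reward weights with ||w||_2 <= 1
    gap = abs(w @ mu_policy - w @ mu_teacher)               # difference in expected reward under R = w . phi
    assert gap <= eps + 1e-12                               # |w . (mu - mu*)| <= ||w|| ||mu - mu*|| <= eps

print("value gap is bounded by", eps)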

Probabilistic graphical model for multiple demonstrations

Full model

Learning algorithm. Step 1: find the time-warping and the distributional parameters; we use EM and dynamic time warping to alternately optimize over the different parameters. Step 2: find the intended trajectory.
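As an illustration of the time-warping sub-step, here is a minimal dynamic time warping sketch that aligns one demonstration to a reference trajectory; this is textbook DTW, not the exact alternating EM formulation used for the helicopter demonstrations.

import numpy as np

def dtw_alignment_cost(demo, reference):
    # demo: (T1, d) array; reference: (T2, d) array. Returns the accumulated alignment cost;
    # the warping path itself could be recovered by backtracking through the table D.
    T1, T2 = len(demo), len(reference)
    dist = np.linalg.norm(demo[:, None, :] - reference[None, :, :], axis=2)   # pairwise distances
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            # Each demo frame matches some reference frame; time may be stretched or compressed.
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2]

# Example: dtw_alignment_cost(np.random.randn(50, 3), np.random.randn(60, 3))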

Teacher demonstration for quadruped. A full teacher demonstration = a sequence of footsteps. It is much simpler to "teach hierarchically": specify a body path, and specify the best footstep within a small area.

Hierarchical inverse RL. Quadratic programming problem (QP): quadratic objective, linear constraints. Constraint generation is used for the path constraints.
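To make the QP-with-constraint-generation idea concrete, here is a hedged sketch using cvxpy; the margin form, the slack penalty C, and the find_violated_constraint oracle are assumptions for illustration, not the exact formulation of the hierarchical inverse RL algorithm.

import numpy as np
import cvxpy as cp

def learn_reward_weights(labeled_pairs, find_violated_constraint, C=10.0, max_rounds=20):
    # labeled_pairs: list of (phi_best, phi_other) feature vectors encoding that the labeled
    # (best) footstep/path should have lower cost than the alternative.
    # find_violated_constraint(w): hypothetical oracle that plans with weights w and returns a
    # new violated (phi_best, phi_other) pair, or None if all path constraints are satisfied.
    n = len(labeled_pairs[0][0])
    pool = list(labeled_pairs)
    w_val = np.zeros(n)
    for _ in range(max_rounds):
        w = cp.Variable(n)
        xi = cp.Variable(len(pool), nonneg=True)
        # Quadratic objective (small weights, small slack) with linear margin constraints:
        # cost(best) <= cost(other) - 1 + slack, where cost = w . phi.
        cons = [w @ (other - best) >= 1 - xi[k] for k, (best, other) in enumerate(pool)]
        cp.Problem(cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi)), cons).solve()
        w_val = w.value
        violated = find_violated_constraint(w_val)          # constraint generation step
        if violated is None:
            break
        pool.append(violated)
    return w_val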

Experimental setup. Training: have the quadruped walk straight across a fairly simple board with fixed-spaced foot placements; around each foot placement, label the best foot placement (about 20 labels); label the best body path for the training board; use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels. Test on hold-out terrains: plan a path across the test board.