Inverse Reinforcement Learning. Pieter Abbeel, UC Berkeley EECS.
High-level picture. A dynamics model T gives the probability distribution over next states given the current state and action; a reward function R describes the desirability of being in a state. Reinforcement learning / optimal control takes T and R and produces a controller/policy π*, which prescribes the action to take in each state.

This picture applies well beyond robotics (load balancing, pricing, ad placement, ...), but many problems in robotics have unknown, stochastic, high-dimensional, and highly non-linear dynamics, and pose significant challenges to classical control methods. Key difficulties: (i) it is often hard to write down, in closed form, a formal specification of the control task (for example, what is the objective function for "flying well"?); (ii) it is difficult to build a good dynamics model because of both data collection and data modeling challenges (similar to the "exploration problem" in reinforcement learning); and (iii) it is expensive to find closed-loop controllers for high-dimensional, stochastic domains. This lecture presents learning algorithms with formal performance guarantees showing that these problems can be addressed efficiently in the apprenticeship learning setting---the setting where expert demonstrations of the task are available.

Car example: what are we trying to learn? A mapping from the current state to the correct action (throttle/brake, steering). Consider the fairly simple task of driving a car on a highway (much simpler than a general traffic situation); keep in mind this is just an iconic example to get the point across, and more complex, real-world problems (solved here for the first time) come later. Building up the policy's complexity: try to maintain the desired speed; if there is a car ahead, switch to the left lane; if a car is there too, switch to the right lane. What if we want to take an exit soon? Then don't switch into the left lane, except if there is a clear path to get back (and what does "a clear path" even mean?). This quickly becomes very complex. For supervised learning to work we need to (a) encounter all of these situations in the demonstrations, so we can learn the appropriate action for each of them (which means the demonstrations would need to include some fairly dangerous situations), and (b) use a learning architecture (SVM, deep belief net, neural net, decision tree, ...) capable of capturing the extreme richness of these decisions, with many possible actions rather than a 0/1 output. Direct policy learning really only works when the policy is simple, and the examples where it does work are quite restrictive; it also ignores the structure of the problem. When we drive a car we have (a) specific objectives: avoid other cars, get to our goal, maybe talk on our cell phone or curse at other drivers; and (b) an idea of how our car and other cars behave, i.e., a dynamics model. Key contribution: exploit this structure, and learn the reward function and dynamics model from demonstrations rather than directly learning a policy. The resulting controller generalizes much better because it "understands" the nature of the problem rather than blindly mimicking the teacher. The highway simulator itself is just an illustrative toy example; the idea applies generally to the problems mentioned above.

Inverse RL: given π* and T, can we recover R? More generally, given execution traces, can we recover R?
Motivation for inverse RL. Scientific inquiry: model animal and human behavior, e.g., bee foraging and songbird vocalization [see the introduction of Ng and Russell, 2000 for a brief overview]. Apprenticeship learning / imitation learning through inverse RL, under the presupposition that the reward function provides the most succinct and transferable definition of the task; this has enabled advancing the state of the art in various robotic domains. Modeling of other agents, both adversarial and cooperative.

RL and related methods have been proposed as computational models for animal and human learning (Watkins, 1989; Schmajuk and Zanutto, 1997; Touretzky and Saksida, 1997). Such models are supported both by behavioral studies and by neurophysiological evidence that RL occurs in bee foraging (Montague et al., 1995) and in songbird vocalization (Doya and Sejnowski, 1995). This literature assumes, however, that the reward function is fixed and known; for example, models of bee foraging assume that the reward at each flower is a simple saturating function of nectar content. Yet it seems clear that in examining animal and human behavior we must consider the reward function as an unknown to be ascertained through empirical investigation. This is particularly true of multiattribute reward functions: the bee might weigh nectar ingestion against flight distance, time, and risk from wind and predators, and it is hard to see how one could determine the relative weights of these terms a priori. Similar considerations apply to human economic behavior. Hence inverse RL is a fundamental problem of theoretical biology, econometrics, and other fields.
Lecture outline Examples of apprenticeship learning via inverse RL Inverse RL vs. behavioral cloning Historical sketch of inverse RL Mathematical formulations for inverse RL Case studies Trajectory-based reward, with application to autonomous helicopter flight
Examples. Simulated highway driving [Abbeel and Ng, ICML 2004; Syed and Schapire, NIPS 2007]. Aerial imagery based navigation [Ratliff, Bagnell and Zinkevich, ICML 2006]. Parking lot navigation [Abbeel, Dolgov, Ng and Thrun, IROS 2008]. Urban navigation [Ziebart, Maas, Bagnell and Dey, AAAI 2008]. Quadruped locomotion [Ratliff, Bradley, Bagnell and Chestnutt, NIPS 2007; Kolter, Abbeel and Ng, NIPS 2008].
Urban navigation [Ziebart, Maas, Bagnell and Dey, AAAI 2008]. What is the reward function for urban navigation and destination prediction? Candidate factors: distance, speed, road type, lanes, turns, context, time, fuel, safety, stress, skill, mood. Our view is that, like a fingerprint, each driver has a unique way of trading off these various factors under different contextual situations. Unfortunately, we have no way of directly observing the perceived time, safety, stress, or skill requirements a driver associates with different routes. Instead we model a driver's preferences in terms of the road-network characteristics we do have, and learn to "take a driver's fingerprint" by analyzing their previously driven routes with respect to these road network features.
Lecture outline Examples of apprenticeship learning via inverse RL Inverse RL vs. behavioral cloning Historical sketch of inverse RL Mathematical formulations for inverse RL Case studies Trajectory-based reward, with application to autonomous helicopter flight
Problem setup. Input: state space, action space; transition model Psa(st+1 | st, at); no reward function; teacher's demonstration s0, a0, s1, a1, s2, a2, ... (a trace of the teacher's policy π*). Inverse RL: can we recover R? Apprenticeship learning via inverse RL: can we then use this R to find a good policy? Behavioral cloning: can we directly learn the teacher's policy using supervised learning?
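To fix notation for the formulations that follow, here is one standard way to write the setup; this is a hedged rendering in the spirit of Ng and Russell (2000) and Abbeel and Ng (2004), and the slides' own symbols may differ slightly:

$$
\text{MDP}\backslash R = (S, A, \{P_{sa}\}, \gamma, D), \qquad
U_R(\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\Big|\, \pi\Big],
$$

where $D$ is the initial-state distribution ($s_0 \sim D$) and $U_R(\pi)$ is the expected discounted sum of rewards obtained by following $\pi$. Inverse RL asks for a reward $R$ under which the teacher's policy $\pi^*$ is (near-)optimal.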
Behavioral cloning. Formulate as a standard machine learning problem: fix a policy class (e.g., support vector machine, neural network, decision tree, deep belief net, ...) and estimate a policy (a mapping from states to actions) from the training examples (s0, a0), (s1, a1), (s2, a2), ... Two of the most notable success stories: Pomerleau, NIPS 1989 (ALVINN); Sammut et al., ICML 1992 (learning to fly in a flight simulator). See also Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002.
Inverse RL vs. behavioral cloning. Which has the more succinct description: π* or R*? Especially in planning-oriented tasks, the reward function is often much more succinct than the optimal policy.
Lecture outline Examples of apprenticeship learning via inverse RL Inverse RL vs. behavioral cloning Historical sketch of inverse RL Mathematical formulations for inverse RL Case studies Trajectory-based reward, with application to autonomous helicopter flight
Inverse RL history 1964, Kalman posed the inverse optimal control problem and solved it in the 1D input case 1994, Boyd+al.: a linear matrix inequality (LMI) characterization for the general linear quadratic setting 2000, Ng and Russell: first MDP formulation, reward function ambiguity pointed out and a few solutions suggested 2004, Abbeel and Ng: inverse RL for apprenticeship learning---reward feature matching 2006, Ratliff+al: max margin formulation
Inverse RL history. 2007, Ratliff+al: max margin with boosting, which enables a large vocabulary of reward features. 2007, Ramachandran and Amir, and Neu and Szepesvari: reward function as characterization of policy class. 2008, Kolter, Abbeel and Ng: hierarchical max margin. 2008, Syed and Schapire: feature matching + game theoretic formulation. 2008, Ziebart+al: feature matching + max entropy. 2008, Abbeel+al: feature matching, applied to learning parking lot navigation style. Open directions: active inverse RL? Inverse RL w.r.t. min-max control, partial observability, a learning-stage expert (rather than observing an optimal policy), ...?
Lecture outline Examples of apprenticeship learning via inverse RL Inverse RL vs. behavioral cloning Historical sketch of inverse RL Mathematical formulations for inverse RL Case studies Trajectory-based reward, with application to autonomous helicopter flight
Three broad categories of formalizations: (1) max margin, (2) feature expectation matching, (3) interpreting the reward function as a parameterization of a policy class.
Basic principle: find a reward function R* which explains the expert behaviour, i.e., find R* such that the expert's policy outperforms all other policies (see the constraint below). This is in fact a convex feasibility problem, but it comes with many challenges: R = 0 is a solution, and more generally the reward function is ambiguous; we typically only observe expert traces rather than the entire expert policy π*, so how do we compute the left-hand side?; it assumes the expert is indeed optimal, otherwise the problem is infeasible; and computationally it assumes we can enumerate all policies.
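The slide's "find R* such that ..." constraint was shown as an equation image; a hedged reconstruction, consistent with the setup above, is:

$$
\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\Big|\, \pi^*\Big]
\;\ge\;
\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\Big|\, \pi\Big]
\qquad \text{for all policies } \pi .
$$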
Feature based reward function. Assume the reward is a linear function of known features, R_w(s) = w⊤φ(s). Substituting this into the expected discounted sum of rewards gives a weighted combination of the expected cumulative discounted sum of feature values, the "feature expectations" (see below). In the driving domain, for instance, the features could indicate which lane the car is in and how close the nearest car in the current lane is.
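A sketch of the substitution the slide refers to, assuming the linear reward above (the notation follows Abbeel and Ng, 2004):

$$
R_w(s) = w^\top \phi(s)
\;\;\Rightarrow\;\;
\mathbb{E}\Big[\sum_{t} \gamma^t R_w(s_t) \,\Big|\, \pi\Big]
= w^\top \, \underbrace{\mathbb{E}\Big[\sum_{t} \gamma^t \phi(s_t) \,\Big|\, \pi\Big]}_{\mu(\pi)}
= w^\top \mu(\pi),
$$

so the expert-optimality constraint becomes $w^\top \mu(\pi^*) \ge w^\top \mu(\pi)$ for all $\pi$.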
Feature based reward function. Feature expectations can be readily estimated from sample trajectories. The number of expert demonstrations required scales with the number of features in the reward function; it does not depend on the complexity of the expert's optimal policy π* or on the size of the state space.
Recap of challenges. Assumes we know the entire expert policy π*, now relaxed to: assumes we can estimate the expert's feature expectations. R = 0 is a solution (now: w = 0), and more generally the reward function is ambiguous. Assumes the expert is indeed optimal, which became even more of an issue with the more limited reward function expressiveness. Computationally: assumes we can enumerate all policies.
Ambiguity. Standard max margin and "structured prediction" max margin (both formulated below). Justification for the structured version: the margin should be larger for policies that are very different from π*. Example: m(π, π*) = the number of states in which π* was observed and in which π and π* disagree.
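The two margin formulations referenced above were equation images; one standard, hedged way to write them is:

Standard max margin:
$$
\min_w \; \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad
w^\top \mu(\pi^*) \;\ge\; w^\top \mu(\pi) + 1 \qquad \forall \pi .
$$

"Structured prediction" max margin:
$$
\min_w \; \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad
w^\top \mu(\pi^*) \;\ge\; w^\top \mu(\pi) + m(\pi^*, \pi) \qquad \forall \pi .
$$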
Expert suboptimality. Structured prediction max margin with slack variables (see the complete formulation on the next slide). Can be generalized to multiple MDPs (which could also be the same MDP with different initial states).
Complete max-margin formulation [Ratliff, Zinkevich and Bagnell, 2006] (written out below). Resolved: access to π*, ambiguity, expert suboptimality. One challenge remains: a very large number of constraints. Ratliff+al use subgradient methods; in this lecture we use constraint generation.
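A hedged sketch of the complete formulation (structured margin plus slack, over several MDPs indexed by i), in the spirit of Ratliff, Zinkevich and Bagnell (2006):

$$
\min_{w,\,\xi} \;\; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi^{(i)}
\quad \text{s.t.} \quad
w^\top \mu^{(i)}(\pi^{(i)*}) \;\ge\; w^\top \mu^{(i)}(\pi) + m\big(\pi^{(i)*}, \pi\big) - \xi^{(i)}
\qquad \forall i, \;\; \forall \pi .
$$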
Constraint generation. Initialize Π(i) = {} for all i and then iterate: (1) solve the max-margin optimization subject only to the constraints in the current sets Π(i); (2) for the current value of w, find the most violated constraint for each i by finding the optimal policy for the current estimate of the reward function (with the loss augmentation m added in); (3) for all i, add the resulting π(i) to Π(i). If no violated constraints were found, we are done. [Note: many other approaches exist; this one is particularly simple and seems to work well in practice.] A code sketch follows below.
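A minimal sketch of this loop, under stated assumptions: `solve_qp` and `solve_loss_augmented_mdp` are hypothetical helper routines (not from the slides) that solve the max-margin QP over the accumulated constraints and the loss-augmented planning problem, respectively.

```python
# Hypothetical sketch of the constraint-generation loop described above.
import numpy as np

def constraint_generation(mdps, mu_expert, solve_qp, solve_loss_augmented_mdp,
                          max_iter=50, tol=1e-6):
    constraint_sets = [[] for _ in mdps]          # Pi^(i) = {} for all i
    w, xi = np.zeros_like(mu_expert[0]), np.zeros(len(mdps))
    for _ in range(max_iter):
        # Solve the QP over the constraints generated so far.
        w, xi = solve_qp(constraint_sets, mu_expert)
        found_violation = False
        for i, mdp in enumerate(mdps):
            # Most violated constraint for MDP i = optimal policy for the
            # current reward estimate w, with the loss augmentation m folded in.
            pi, mu_pi, loss = solve_loss_augmented_mdp(mdp, w)
            # Check the constraint  w.mu_expert[i] >= w.mu_pi + m - xi_i
            if w @ mu_expert[i] < w @ mu_pi + loss - xi[i] - tol:
                constraint_sets[i].append((mu_pi, loss))
                found_violation = True
        if not found_violation:
            break          # no constraint violations: done
    return w
```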
Visualization in feature expectation space. Every policy π has a corresponding feature expectation vector μ(π), which for visualization purposes we assume to be 2D. (Figure: points μ(π*) and μ(π) plotted against axes μ1 and μ2, with the max-margin weight vector w_mm and the structured max-margin weight vector w_smm shown as directions separating the expert from the other policies.)
Constraint generation, visualized. Every policy π has a corresponding feature expectation vector μ(π), which for visualization purposes we assume to be 2D. (Figure: the same feature expectation space with axes μ1 and μ2, showing μ(π*) and the sequence of generated policies π(0), π(1), π(2), ... added as constraints.)
Three broad categories of formalizations Max margin (Ratliff+al, 2006) Feature boosting [Ratliff+al, 2007] Hierarchical formulation [Kolter+al, 2008] Feature expectation matching (Abbeel+Ng, 2004) Two player game formulation of feature matching (Syed+Schapire, 2008) Max entropy formulation of feature matching (Ziebart+al,2008) Interpret reward function as parameterization of a policy class. (Neu+Szepesvari, 2007; Ramachandran+Amir, 2007)
Feature matching. Inverse RL starting point: find a reward function such that the expert outperforms other policies. Observation in Abbeel and Ng, 2004: for a policy π to be guaranteed to perform as well as the expert policy π*, it suffices that the feature expectations match (see the bound below).
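A hedged reconstruction of the implication the slide truncates (this is the standard argument in Abbeel and Ng, 2004):

$$
\big\| \mu(\pi) - \mu(\pi^*) \big\|_2 \le \epsilon
\;\;\Longrightarrow\;\;
\big| w^\top \mu(\pi) - w^\top \mu(\pi^*) \big| \le \epsilon
\qquad \text{for all } w \text{ with } \|w\|_1 \le 1 ,
$$

since $|w^\top(\mu(\pi)-\mu(\pi^*))| \le \|w\|_2\,\|\mu(\pi)-\mu(\pi^*)\|_2$ and $\|w\|_2 \le \|w\|_1 \le 1$. Matching feature expectations therefore guarantees matching expected returns for every reward in this class.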
Theoretical guarantees I don’t want to parse this, but happy to give further details offline if anyone would like me to. Our algorithm does not necessarily recover the teacher’s reward function R_w^* --- which is impossible to recover. TODO: add slide with theorem that includes the sample complexity in the statement. Guarantee w.r.t. unrecoverable reward function of teacher. Sample complexity does not depend on complexity of teacher’s policy *.
Apprenticeship learning [Abbeel & Ng, 2004]. Assume R_w(s) = w⊤φ(s). Initialize: pick some controller π0. Iterate for i = 1, 2, ...: "guess" the reward function, i.e., find a reward function such that the teacher maximally outperforms all previously found controllers; then find an optimal control policy πi for the current guess of the reward function R_w. If the teacher's margin over the previously found controllers drops below a threshold ε, exit the algorithm: there is no reward function (in the class considered) for which the teacher significantly outperforms the policies found so far. A code sketch follows below.

We learn through reward functions rather than directly learning policies. Providing the appropriate features is typically fairly straightforward, since we know the task well enough; the real challenge is to identify the trade-offs between them, and the number of possibilities is combinatorial in the number of features, making it near-impossible for the expert to specify them by hand. The intuition behind the algorithm is that the set of policies generated gets closer (in some sense) to the expert at every iteration; the RL step is treated as a black box.
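A minimal sketch of this loop, assuming two hypothetical helpers (names are placeholders, not from the slides): `solve_mdp(w)`, the RL/optimal-control black box returning the feature expectations of an optimal policy for reward w⊤φ, and `max_margin_w(mu_expert, mus)`, which solves the "guess the reward" step and returns the weight vector together with the teacher's margin.

```python
# Hypothetical sketch of apprenticeship learning via inverse RL (Abbeel & Ng, 2004).
import numpy as np

def apprenticeship_learning(mu_expert, solve_mdp, max_margin_w, eps=0.01, max_iter=100):
    # Pick some initial controller pi_0 (here: the optimal policy for a random reward).
    mus = [solve_mdp(np.random.randn(len(mu_expert)))]
    w = None
    for _ in range(max_iter):
        # "Guess" the reward: teacher maximally outperforms all controllers found so far.
        w, margin = max_margin_w(mu_expert, mus)
        if margin <= eps:
            break      # no reward under which the teacher significantly outperforms us
        # Compute an optimal policy for the current reward guess and record it.
        mus.append(solve_mdp(w))
    return w, mus
```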
Algorithm example run. (Figure: feature expectation space with axes μ1 and μ2, showing the expert's μE, the generated policies π0, π1, π2, and the successive weight vectors w(1), w(2), w(3).) At the end, if we properly mix π0, π1 and π2 (really just π1 and π2), we can get quite close to the expert's μE.
Suboptimal expert case. We can match the expert by stochastically mixing between the 3 policies. In practice: for any w*, one of π0, π1, π2 outperforms π*, so pick one of them. Generally: in a k-dimensional feature space we are left to pick between (at most) k+1 policies. (Figure: feature expectation space with μ(π*) lying inside the convex hull of μ(π0), μ(π1), μ(π2).) How do we proceed in this case?
Feature expectation matching. If the expert is suboptimal, then the resulting policy is a mixture of somewhat arbitrary policies which have the expert in their convex hull. In practice: pick the best single policy from this set and the corresponding reward function. Next: Syed and Schapire, 2008; Ziebart+al, 2008.
Min-max feature expectation matching [Syed and Schapire, 2008]. (Figure: feature expectation space with axes μ1 and μ2, the expert's μ(π*), and a shaded region of feature expectations.) Any policy in this region performs at least as well as the expert. How do we find a policy on the Pareto-optimal curve of this region, together with the corresponding reward function?
Min-max feature expectation matching [Syed and Schapire, 2008]. (Figure: the same feature expectation space with μ(π*), illustrating the set of weight vectors w that would be consistent with this formulation.)
Min-max games. Example of a standard min-max game setting: the rock-paper-scissors payoff matrix G (rows: minimizer's choice; columns: maximizer's choice; entries: payoff to the maximizer):

              rock   paper   scissors
  rock          0      1       -1
  paper        -1      0        1
  scissors      1     -1        0

The Nash equilibrium solution is the mixed strategy (1/3, 1/3, 1/3) for both players.
Min-max feature expectation matching [Syed and Schapire, 2008]. Standard min-max game and min-max inverse RL (both written out below). Solution: maximize over weights λ which weigh the contribution of all policies π1, π2, ..., πN to the mixed policy. Remaining challenge: G is very large! See the paper for an algorithm that only uses the relevant parts of G. [There is a strong similarity with the constraint generation schemes we have seen.]
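A schematic, hedged rendering of the two missing formulas, following the setup of Syed and Schapire (2008), who restrict the reward weights to a probability simplex. Standard min-max (zero-sum matrix game):

$$
v^* \;=\; \max_{y \in \Delta_n} \; \min_{x \in \Delta_m} \; x^\top G \, y .
$$

Min-max inverse RL, with $G_{ij} = \mu_i(\pi_j) - \mu_i(\pi^*)$ (rows indexed by reward features, columns by policies) and $\lambda$ the mixing weights over policies $\pi_1, \dots, \pi_N$:

$$
v^* \;=\; \max_{\lambda \in \Delta_N} \; \min_{w \in \Delta_k} \; w^\top G \, \lambda .
$$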
Maximum-entropy feature expectation matching [Ziebart+al, 2008]. Recall feature matching in the suboptimal expert case. (Figure: feature expectation space with μ(π*) inside the convex hull of the feature expectations of policies π(0), ..., π(5).) Can we actually capture a suboptimal expert?
Maximum-entropy feature expectation matching [Ziebart+al, 2008]. Maximize the entropy of the distribution over paths, subject to the constraint that feature expectations match; this turns out to imply that P is an exponential-family distribution in the path's cumulative feature values (both expressions below). See the paper for algorithmic details.
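A hedged reconstruction of the two missing expressions, following Ziebart et al. (2008), with $f_\zeta$ the cumulative feature counts along path $\zeta$ (the second expression is the deterministic-dynamics form):

$$
\max_{P} \;\; -\sum_{\zeta} P(\zeta) \log P(\zeta)
\quad \text{s.t.} \quad
\sum_{\zeta} P(\zeta)\, f_{\zeta} = \tilde{f}_{\text{expert}}, \qquad \sum_{\zeta} P(\zeta) = 1 ,
$$

which yields a distribution of the form

$$
P(\zeta \mid w) \;=\; \frac{\exp\!\big(w^\top f_{\zeta}\big)}{Z(w)} .
$$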
Feature expectation matching. If the expert is suboptimal: Abbeel and Ng, 2004: the resulting policy is a mixture of policies which have the expert in their convex hull; in practice, pick the best one of this set and the corresponding reward function. Syed and Schapire, 2008: recast the same problem in game-theoretic form, which, at the cost of adding some prior knowledge, yields a unique solution for the policy and reward function. Ziebart+al, 2008: assume the expert stochastically chooses between paths, where each path's log probability is (up to normalization) given by its expected sum of rewards.
Lecture outline Examples of apprenticeship learning via inverse RL Inverse RL vs. behavioral cloning Historical sketch of inverse RL Mathematical formulations for inverse RL Max-margin Feature matching Reward function parameterizing the policy class Case studies Trajectory-based reward, with application to autonomous helicopter flight
Reward function parameterizing the policy class. Recall the feature-based reward. Let's assume our expert acts according to a stochastic policy determined by the reward function; then for any R and α, we can evaluate the likelihood of seeing a set of state-action pairs (formulation on the next slide).
Reward function parameterizing the policy class. Assume our expert acts according to a stochastic policy determined by the reward function; then for any R and α, we can evaluate the likelihood of seeing a set of state-action pairs (see below). Ramachandran and Amir, IJCAI 2007: MCMC method to sample from this distribution. Neu and Szepesvari, UAI 2007: gradient method to optimize the likelihood.
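A common concrete form of this assumption (hedged; consistent with the setups of Ramachandran and Amir, 2007 and Neu and Szepesvari, 2007) is a Boltzmann policy in the optimal Q-values for reward R, with temperature parameter α:

$$
\pi(a \mid s) \;=\; \frac{\exp\!\big(\alpha\, Q^*_R(s,a)\big)}{\sum_{a'} \exp\!\big(\alpha\, Q^*_R(s,a')\big)},
\qquad
P\big(\{(s_i, a_i)\}_i \,\big|\, R, \alpha\big) \;=\; \prod_i \pi(a_i \mid s_i) .
$$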
Lecture outline Examples of apprenticeship learning via inverse RL Inverse RL vs. behavioral cloning History of inverse RL Mathematical formulations for inverse RL Case studies: (1) Highway driving, (2) Crusher, (3) Parking lot navigation, (4) Route inference, (5) Quadruped locomotion Trajectory-based reward, with application to autonomous helicopter flight
Simulated highway driving Abbeel and Ng, ICML 2004; Syed and Schapire, NIPS 2007
Highway driving [Abbeel and Ng 2004]. Input: a dynamics model / simulator Psa(st+1 | st, at) and the teacher's demonstration, 1 minute of driving in the "training world". Note: R* is unknown. Reward features (15 total): 0/1 features for the car being in each lane and each shoulder (5 features), and 0/1 features for the distance to the closest car in the current lane, discretized into 10 distances (10 features). Videos: teacher in the training world; learned policy in the testing world.
More driving examples [Abbeel and Ng 2004]. In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching that demonstration.
Crusher [Silver and Bagnell, RSS 2008]. Crusher outfitted for teleoperation: Dave Silver and Drew Bagnell drove this robot around to generate demonstrations. The cost function was learned using functional-gradient procedures, in particular a form of exponentiated functional gradient descent which they have been developing.
Learning movie: shows only one of the examples; the learned path readily converges to the desired trajectory.
Max margin [Ratliff + al, 2006/7/8]. Side-by-side: held-out terrain and the learned cost function. Low cost (blue): roads; high cost (reddish tints): steeply sloped mountainous regions.
Max-margin [Ratliff + al, 2006/7/8]. The learned cost function implemented on the robot; playback of actual data from the robot.
Parking lot navigation Reward function trades off: Staying “on-road,” Forward vs. reverse driving, Amount of switching between forward and reverse, Lane keeping, On-road vs. off-road, Curvature of paths. [Abbeel et al., IROS 08]
Experimental setup Demonstrate parking lot navigation on “train parking lots.” Run our apprenticeship learning algorithm to find the reward function. Receive “test parking lot” map + starting point and destination. Find the trajectory that maximizes the learned reward function for navigating the test parking lot.
Nice driving style
Sloppy driving-style
“Don’t mind reverse” driving-style
Only 35% of routes are "fastest" (Letchner, Krumm, & Horvitz 2006). Since people often care about getting to places quickly, we could assume they take the time-optimal route, but researchers at Microsoft found that this is not a very good predictor of actual behavior.
Why might that be? There are a lot of different factors that may influence our route decisions: time, money, stress, skill. How much will the route cost in terms of gas and tolls? How stressful will it be? Does the route match my skill level? Ziebart+al, 2007/8/9
Time, distance, fuel, speed, safety, road type, stress, lanes, skill, turns, mood, context: like a fingerprint, each driver has a unique way of trading off these factors under different contextual situations. We cannot observe the perceived time, safety, stress, or skill requirements a driver associates with different routes; instead we model a driver's preferences in terms of the road-network characteristics we do have, and learn to "take a driver's fingerprint" by analyzing their previously driven routes with respect to these road network features. Ziebart+al, 2007/8/9
Data collection [Ziebart+al, 2007/8/9]. A road network dataset with features describing each road segment: length, speed, road type, lanes; plus contextual features based on accident reports, construction zones, congestion, and time of day. GPS loggers were provided to 25 taxi drivers, and over 100,000 miles of their driving data were collected.
Destination prediction. Here's a video of how this works in practice: a driver starts in the northeast, and as he travels, the posterior distribution over destinations changes. It may seem like you could do something more naive here and just "block off" sections that have already been passed, but drivers may go slightly out of their way to hop on the interstate, and handling such cases requires a more sophisticated approach like this one.
Quadruped. The reward function trades off 25 features, including the height differential of the terrain, the gradient of the terrain around each foot, and the height differential between feet. Learned with hierarchical max margin [Kolter, Abbeel & Ng, 2008].
Experimental setup Demonstrate path across the “training terrain” Run our apprenticeship learning algorithm to find the reward function Receive “testing terrain”---height map. Find the optimal policy with respect to the learned reward function for crossing the testing terrain. Hierarchical max margin [Kolter, Abbeel & Ng, 2008]
Without learning
With learned reward function
Quadruped: Ratliff + al, 2007 Run footstep planner as expert (slow!) Run boosted max margin to find a reward function that explains the center of gravity path of the robot (smaller state space) At control time: use the learned reward function as a heuristic for A* search when performing footstep-level planning
Lecture outline Example of apprenticeship learning via inverse RL Inverse RL vs. behavioral cloning Sketch of history of inverse RL Mathematical formulations for inverse RL Case studies Trajectory-based reward, with application to autonomous helicopter flight
Remainder of lecture: extreme helicopter flight. How helicopter dynamics works; the autonomous helicopter setup; application of inverse RL to autonomous helicopter flight.
Helicopter dynamics 4 control inputs: Main rotor collective pitch Main rotor cyclic pitch (roll and pitch) Tail rotor collective pitch
Autonomous helicopter setup. On-board inertial measurement unit (IMU) data and position data are fed into a Kalman filter; the resulting state estimate drives a feedback controller, which sends out controls to the helicopter.
Motivating example: how do we specify a task like this? Back in December 2006, we were working on getting our autonomous helicopter to perform expert-level aerobatics. In previous work we had managed to perform some challenging aerobatic maneuvers, like flipping the helicopter over in place and flying at high speeds; it took some time to get things right, but we could achieve pretty reasonable performance on these tasks. Then we wanted to do a new maneuver, called tic-toc. We spent several weeks trying to specify this trajectory by hand and fly it using our existing approach, but we kept running into the same problem: it is very hard to articulate just how the helicopter should behave during this maneuver. Maybe, with enough time, we could have come up with a trajectory that worked, but if we expect to solve autonomous aerobatics soon, we need a much more automated procedure.
Key difficulties. It is often very difficult to specify the trajectory by hand: it is hard to articulate exactly how a task is performed, and the trajectory should obey the system dynamics (not strictly necessary, but control is more difficult otherwise). One way to satisfy these requirements is to use an expert demonstration as the trajectory; but getting perfect demonstrations is hard. So use multiple suboptimal demonstrations. Simply average them? After watching several suboptimal demonstrations, it becomes very clear to a human spectator, even one with no knowledge of helicopters, what the expert intended to do; so the question is how to find the expert's "intended" trajectory from suboptimal demonstrations.
Expert demonstrations: airshow. Here is what our demonstration data looks like: we asked our human pilot to fly our helicopter through the same aerobatic sequence several times. You will notice some clear suboptimalities; the demonstrations all follow different paths, at different speeds. But even though they are all different, you can clearly see that there is some intended trajectory that the pilot is trying to follow.
Expert demonstrations: graphical model [ICML 2008]. We propose a generative model for the expert's suboptimal demonstrations. We assume there is some "hidden" intended trajectory, represented by variables z, and that this trajectory follows a nonlinear dynamical system with Gaussian noise: the intended trajectory satisfies the dynamics. We also have a sequence of states from each expert demonstration, which we model as independent noisy observations of the hidden trajectory states: each expert state is a noisy observation of one of the hidden states, but we don't know exactly which one, because the demonstrations occur at different rates. So we assume there is some unknown set of time indices, the τ's, that match the observations to the hidden states; for instance, y1 could be an observation of z1, z2, or z3, and the variable τ1 tells us which is the case. A schematic version of the model follows below.
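A schematic, hedged version of this generative model (the full model in Coates, Abbeel and Ng, ICML 2008 includes additional terms, e.g., for drift):

$$
z_{t+1} = f(z_t) + \omega_t, \quad \omega_t \sim \mathcal{N}(0, \Sigma_z)
\qquad \text{(hidden intended trajectory follows the dynamics)}
$$

$$
y^{(k)}_j = z_{\tau^{(k)}_j} + \nu^{(k)}_j, \quad \nu^{(k)}_j \sim \mathcal{N}(0, \Sigma_y)
\qquad \text{(demonstration $k$ is a noisy observation of the hidden states, at unknown time indices $\tau$).}
$$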
Learning algorithm. Similar models appear in speech processing and genetic sequence alignment; see, e.g., Listgarten et al., 2005, whose particulars differ somewhat because they work in a different setting, but whose inference methods are closely related. As is common, we use a maximum likelihood approach: we maximize the likelihood of the demonstration data over the intended trajectory states, the time index values, the variance parameters for the Gaussian noise terms, and the distribution parameters for the time indices.
Learning algorithm. If τ is unknown, inference is hard because of the large number of dependencies. If τ is known, we have a standard HMM: for instance, if we know that y1 corresponds to the intended state z2, then we no longer have to consider the direct dependence between z1 and y1, or between z3 and y1, and we are left with a standard hidden Markov model for a nonlinear-Gaussian system. So we use an alternating procedure: make an initial guess for τ; then alternate between (1) fixing τ and running EM on the resulting HMM to find the intended trajectory and distribution parameters, and (2) choosing a new τ using a dynamic programming algorithm. Details for both steps are in the full version of the paper.
Details: Incorporating prior knowledge Might have some limited knowledge about how the trajectory should look. Flips and rolls should stay in place. Vertical loops should lie in a vertical plane. Pilot tends to “drift” away from intended trajectory. It would be nice if we could incorporate the expert’s advice into our trajectory learning. Some of this knowledge is very specific in nature: We know that loops should look flat when you view them from the edge, And we know that flips and rolls should stay roughly in place. But some of it is more dynamic. For instance, we know that the pilot tends to “drift” away from the intended trajectory slowly over time. Because we have a very rich model representation we can actually incorporate this knowledge to improve the trajectories, And we give examples of how to do that in the paper.
Results: Time-aligned demonstrations White helicopter is inferred “intended” trajectory. (start video) So, what does the result look like? The white helicopter is the resulting inferred trajectory that we think is the expert’s intended trajectory. The demonstrations here are the same as before, but now they’re being shown after time-warping them to match up with the intended trajectory, which is shown in white. You can do this using the time indices computed by the algorithm. You can see that the white trajectory really does capture what we might intuitively describe as the “typical” or “intended” behavior of these trajectories.
Results: Loops Here’s a plot of the loops that were flown by our pilot (only 3 are shown). (1) And here’s what the algorithm gives as the most likely intended path. No prior knowledge of the loop’s structure was used, but you can see that the inferred trajectory is much closer to an ideal loop.
Learning / optimal control: high-level picture. Data gives us the dynamics model; the inferred trajectory plus a penalty function give us the reward function R; reinforcement learning / optimal control then produces the controller/policy π. So, to step back: we have just presented an algorithm for extracting a good trajectory that we can use for our reward function, but if we actually want to fly this trajectory, we also need a good dynamics model.
Experimental setup for helicopter Our expert pilot demonstrates the airshow several times. Learn (by solving a joint optimization problem): Reward function---trajectory. Dynamics model---trajectory-specific local model. Fly autonomously: Inertial sensing + vision-based position sensing (extended) Kalman filter Receding horizon differential dynamic programming (DDP) feedback controller (20Hz) Learning to fly new aerobatics takes < 1 hour
Related work Bagnell & Schneider, 2001; LaCivita, Papageorgiou, Messner & Kanade, 2002; Ng, Kim, Jordan & Sastry 2004a (2001); Roberts, Corke & Buskey, 2003; Saripalli, Montgomery & Sukhatme, 2003; Shim, Chung, Kim & Sastry, 2003; Doherty et al., 2004. Gavrilets, Martinos, Mettler and Feron, 2002; Ng et al., 2004b. Maneuvers presented here are significantly more challenging and more diverse than those performed by any other autonomous helicopter. Before we show our final results, we note that there has been quite a lot of excellent work on autonomous helicopter flight. Bagnell & Schneider, etc. all developed helicopters that performed well during hover and low-speed flight. And these results were certainly ground-breaking. Since then there have been a number of other impressive applications; but we note that most of them are restricted still to low speed, non-aerobatic flight. In contrast, the work of Gavrilets et. al. in 2002, as well as Andrew Ng’s group at Stanford really pushed the envelope of helicopter control--- demonstrating that with some work, we could get autonomous helicopters to fly a limited range of very impressive aerobatic maneuvers. But we want to point out that the maneuvers we’re now able to perform are significantly more challenging than those flown by any other autonomous helicopter, and not only that, but we can fly maneuvers in a truly vast variety.
Autonomous aerobatic flips (attempt) before apprenticeship learning Task description: meticulously hand-engineered Model: learned from (commonly used) frequency sweeps data
Results: autonomous airshow. While you watch this, think about how long it would take to code the entire trajectory by hand. A common question is: how long did this take, and how long would it take to put together a new airshow? On a typical day we spend about 30 minutes collecting demonstrations from our pilot, another 30 minutes or so running the learning algorithm, and then only about a minute generating the differential dynamic programming controllers; shortly thereafter we are flying the new airshow as well as you see here. It is a very routine procedure now; see the poster for some of the other shows and maneuvers we have done.
Summary Example of apprenticeship learning via inverse RL Inverse RL vs. behavioral cloning Sketch of history of inverse RL Mathematical formulations for inverse RL Case studies Trajectory-based reward, with application to autonomous helicopter flight
Thank you.
Questions?
More materials Lecture based upon these slides (by Pieter Abbeel) to appear on VideoLectures.net [part of Robot Learning Summer School, 2009] Helicopter work: http://heli.stanford.edu for helicopter videos and video lecture (ICML 2008, Coates, Abbeel and Ng) and thesis defense (Abbeel)
More materials / References
Abbeel and Ng. Apprenticeship learning via inverse reinforcement learning. (ICML 2004)
Abbeel, Dolgov, Ng and Thrun. Apprenticeship learning for motion planning with application to parking lot navigation. (IROS 2008)
Boyd, El Ghaoui, Feron and Balakrishnan. Linear matrix inequalities in system and control theory. (SIAM, 1994)
Coates, Abbeel and Ng. Learning for control from multiple demonstrations. (ICML 2008)
Kolter, Abbeel and Ng. Hierarchical apprenticeship learning with application to quadruped locomotion. (NIPS 2008)
Neu and Szepesvari. Apprenticeship learning using inverse reinforcement learning and gradient methods. (UAI 2007)
Ng and Russell. Algorithms for inverse reinforcement learning. (ICML 2000)
Ratliff, Bagnell and Zinkevich. Maximum margin planning. (ICML 2006)
Ratliff, Bradley, Bagnell and Chestnutt. Boosting structured prediction for imitation learning. (NIPS 2007)
Ratliff, Bagnell and Srinivasa. Imitation learning for locomotion and manipulation. (Int. Conf. on Humanoid Robotics, 2007)
Ramachandran and Amir. Bayesian inverse reinforcement learning. (IJCAI 2007)
Syed and Schapire. A game-theoretic approach to apprenticeship learning. (NIPS 2008)
Ziebart, Maas, Bagnell and Dey. Maximum entropy inverse reinforcement learning. (AAAI 2008)