1. Algorithms for Inverse Reinforcement Learning 2. Apprenticeship Learning via Inverse Reinforcement Learning

Algorithms for Inverse Reinforcement Learning Andrew Ng and Stuart Russell

Motivation Given: (1) measurements of an agent's behavior over time, in a variety of circumstances; (2) if needed, measurements of the sensory inputs to that agent; (3) if available, a model of the environment. Determine: the reward function being optimized.

Why? Reason #1: Computational models for animal and human learning. “In examining animal and human behavior we must consider the reward function as an unknown to be ascertained through empirical investigation.” Particularly true of multi-attribute reward functions (e.g., bee foraging: amount of nectar vs. flight time vs. risk from wind/predators).

Why? Reason #2: Agent construction. “An agent designer [...] may only have a very rough idea of the reward function whose optimization would generate 'desirable' behavior.” e.g., “driving well.” Apprenticeship learning: is recovering the expert's underlying reward function more “parsimonious” than learning the expert's policy?

Possible applications in multi-agent systems In multi-agent adversarial games, learning the reward functions that guide opponents' actions in order to devise strategies against them. In mechanism design, learning each agent's reward function from interaction histories in order to influence its actions. And more?

Inverse Reinforcement Learning (1) – MDP Recap An MDP is represented as a tuple (S, A, {P_sa}, γ, R). Note: R is bounded by R_max. Value function for policy π: V^π(s) = E[ Σ_t γ^t R(s_t) | π, s_0 = s ]. Q-function: Q^π(s,a) = R(s) + γ E_{s'~P_sa}[ V^π(s') ].

Inverse Reinforcement Learning (1) – MDP Recap Bellman Equation: V^π(s) = R(s) + γ Σ_{s'} P_{s,π(s)}(s') V^π(s'), and similarly Q^π(s,a) = R(s) + γ Σ_{s'} P_{s,a}(s') V^π(s'). Bellman Optimality: π is optimal if and only if π(s) ∈ argmax_a Q^π(s,a) for every state s.
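The equation images did not survive the transcript, so here is a minimal numeric sketch of the same recap, assuming a tabular MDP given as per-action transition matrices P[a] (shape n_states x n_states), a reward vector R, and a deterministic policy pi; the helper name and array layout are my own choices:

```python
import numpy as np

def policy_values(P, R, pi, gamma):
    """Solve the Bellman equation V = R + gamma * P_pi V in matrix form.

    P     : array of shape (n_actions, n_states, n_states), P[a][s, s'] = P_sa(s')
    R     : reward vector of shape (n_states,)
    pi    : deterministic policy, array of shape (n_states,) with action indices
    gamma : discount factor in [0, 1)
    """
    n_states = R.shape[0]
    # Transition matrix induced by the policy: row s is P[pi[s]][s, :]
    P_pi = P[pi, np.arange(n_states), :]
    # V^pi = (I - gamma * P_pi)^{-1} R
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
    # Q^pi(s, a) = R(s) + gamma * sum_{s'} P_sa(s') V^pi(s')
    Q = R[None, :] + gamma * (P @ V)           # shape (n_actions, n_states)
    return V, Q.T                              # Q returned as (n_states, n_actions)

# Bellman optimality: pi is optimal iff pi(s) attains max_a Q^pi(s, a) for every s, e.g.
# V, Q = policy_values(P, R, pi, 0.9)
# is_optimal = np.allclose(Q[np.arange(len(pi)), pi], Q.max(axis=1))
```

The closed-form solve works because, in vector form, V^π = R + γ P_π V^π.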

Inverse Reinforcement Learning (2) Finite State Space Reward function solution set (a1 is the optimal action, i.e. π(s) ≡ a1): π is optimal if and only if (P_{a1} - P_a)(I - γ P_{a1})^{-1} R ⪰ 0 (componentwise) for every action a ≠ a1.

Inverse Reinforcement Learning (2) Finite State Space There are many solutions R that satisfy the inequality (e.g. R = 0), so which one might be the best? 1. Make deviation from π as costly as possible: maximize Σ_s ( Q^π(s, a1) - max_{a ≠ a1} Q^π(s, a) ). 2. Make the reward function as simple as possible, e.g. by adding an L1 penalty -λ||R||_1.

Inverse Reinforcement Learning (2) Finite State Space Linear Programming Formulation: maximize Σ_i min_{a ∈ {a2, …, an}} (P_{a1}(i) - P_a(i)) (I - γ P_{a1})^{-1} R - λ||R||_1, subject to (P_{a1} - P_a)(I - γ P_{a1})^{-1} R ⪰ 0 for all a ≠ a1 and |R_i| ≤ R_max. [Figure: the set of rewards R under which a1 beats the other actions a2, …, an, with the gap between the values under a1 and a2 maximized.]
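A sketch of this LP using scipy's linprog, reusing the P/pi/gamma layout from the earlier snippet; the variable packing and the λ and R_max defaults are illustrative choices, not the paper's code:

```python
import numpy as np
from scipy.optimize import linprog

def lp_irl_finite(P, pi, gamma, lam=1.0, r_max=1.0):
    """Finite-state IRL as a linear program (sketch of the formulation above).

    Decision variables x = [R (n), t (n), u (n)]:
      minimize  -sum(t) + lam * sum(u)       (i.e. maximize sum(t) - lam*||R||_1)
      t_i <= ((P_pi - P_a)(I - gamma*P_pi)^-1 R)_i   for all a != pi[i]
      0   <= ((P_pi - P_a)(I - gamma*P_pi)^-1 R)_i   for all a != pi[i]
      -u <= R <= u,   |R_i| <= r_max
    """
    n_actions, n_states, _ = P.shape
    P_pi = P[pi, np.arange(n_states), :]
    inv_term = np.linalg.inv(np.eye(n_states) - gamma * P_pi)

    eye, zero = np.eye(n_states), np.zeros((n_states, n_states))
    A_ub, b_ub = [], []
    for a in range(n_actions):
        rows = np.where(pi != a)[0]          # states where a is not the expert's action
        if rows.size == 0:
            continue
        M = ((P_pi - P[a]) @ inv_term)[rows]
        A_ub.append(np.hstack([-M, eye[rows], zero[rows]]))   # t - M R <= 0
        A_ub.append(np.hstack([-M, zero[rows], zero[rows]]))  # -M R <= 0
        b_ub.append(np.zeros(2 * rows.size))
    A_ub.append(np.hstack([eye, zero, -eye]))    #  R - u <= 0
    A_ub.append(np.hstack([-eye, zero, -eye]))   # -R - u <= 0
    b_ub.append(np.zeros(2 * n_states))

    c = np.concatenate([np.zeros(n_states), -np.ones(n_states), lam * np.ones(n_states)])
    bounds = ([(-r_max, r_max)] * n_states       # R bounded by R_max
              + [(None, None)] * n_states        # t free
              + [(0, None)] * n_states)          # u >= 0 (models |R|)
    res = linprog(c, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                  bounds=bounds, method="highs")
    return res.x[:n_states]                      # the recovered reward vector R
```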

Inverse Reinforcement Learning (3) Large State Spaces Linear approximation of the reward function: R(s) = α1 φ1(s) + … + αd φd(s) (in the driving example, basis functions can indicate collisions, staying in the right lane, …, etc.). Let V_i^π be the value function of policy π when the reward is R = φi; by linearity, V^π = α1 V_1^π + … + αd V_d^π. For R to make π optimal we need E_{s'~P_{s,a1}}[ V^π(s') ] ≥ E_{s'~P_{s,a}}[ V^π(s') ] for all states s and actions a.

Inverse Reinforcement Learning (3) Large State Spaces With an infinite or very large state space it is usually not possible to check all constraints, so choose a finite subset S0 of states. Linear Programming formulation: find the αi that maximize Σ_{s ∈ S0} min_{a ≠ a1} p( E_{s'~P_{s,a1}}[ V^π(s') ] - E_{s'~P_{s,a}}[ V^π(s') ] ), subject to |αi| ≤ 1, where p(x) = x if x ≥ 0 and p(x) = 2x otherwise (constraint violations are penalized twice as heavily).
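A sketch of how the p(x) penalty can be linearized into an ordinary LP, assuming the expectation gaps have already been estimated and packed into an array G (doing that estimation is the hard, problem-specific part); the function and argument names are hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

def lp_irl_large(G, alpha_bound=1.0):
    """Large-state-space IRL objective as an LP (sketch).

    G[s, a, i] = E_{s'~P_{s,a1}}[V_i^pi(s')] - E_{s'~P_{s,a}}[V_i^pi(s')]
    for each sampled state s in S0, each non-expert action a, and each basis i.

    p(x) = min(x, 2x) is concave and piecewise linear, so the objective
    sum_s min_a p(alpha . G[s, a]) can be written with auxiliary variables z_s:
      maximize sum(z)  s.t.  z_s <= alpha . G[s, a]  and  z_s <= 2 * alpha . G[s, a].
    """
    m, k, d = G.shape                  # sampled states, other actions, basis functions
    A_ub = []
    for s in range(m):
        e_s = np.zeros(m)
        e_s[s] = 1.0
        for a in range(k):
            for scale in (1.0, 2.0):
                # z_s - scale * (alpha . G[s, a]) <= 0
                A_ub.append(np.concatenate([-scale * G[s, a], e_s]))
    A_ub = np.vstack(A_ub)
    b_ub = np.zeros(A_ub.shape[0])
    c = np.concatenate([np.zeros(d), -np.ones(m)])           # minimize -sum(z)
    bounds = [(-alpha_bound, alpha_bound)] * d + [(None, None)] * m
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d]                                         # the fitted coefficients alpha_i
```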

Inverse Reinforcement Learning (4) IRL from Sample Trajectories If π is only accessible through a set of sampled trajectories (e.g. the driving demo in the 2nd paper), assume we start from a dummy state s0 whose next-state distribution is the initial distribution D. For the reward R = φi, estimate V̂_i^π(s0) from the trajectory state sequences (s0, s1, s2, …) as the average of Σ_t γ^t φi(s_t).

Inverse Reinforcement Learning (4) IRL from Sample Trajectories Assume we have some set of policies {π1, …, πk}. Linear Programming formulation: find the αi that maximize Σ_{j=1..k} p( V̂^{π_E}(s0) - V̂^{π_j}(s0) ), subject to |αi| ≤ 1, where V̂^π(s0) = Σ_i αi V̂_i^π(s0). The above optimization gives a new reward R; we then compute the optimal policy π_{k+1} based on R, add it to the set of policies, and reiterate.
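A sketch of the trajectory-based outer loop, under loudly labeled assumptions: rl_solver, simulate, and lp_step are hypothetical helpers standing in for the RL black box, the simulator, and the p()-penalized LP of this slide:

```python
import numpy as np

def value_estimates(trajectories, phi, gamma):
    """Monte Carlo estimate of V_hat_i^pi(s0) for each basis function phi_i.

    trajectories : list of state sequences [s0, s1, s2, ...] sampled under pi
    phi          : callable mapping a state to a feature vector of length d
    Returns an array of length d: the average of sum_t gamma^t phi_i(s_t).
    """
    totals = []
    for states in trajectories:
        discounts = gamma ** np.arange(len(states))
        feats = np.array([phi(s) for s in states])          # (T, d)
        totals.append(discounts @ feats)
    return np.mean(totals, axis=0)

def irl_from_trajectories(expert_trajs, rl_solver, simulate, phi, gamma,
                          lp_step, n_iters=20):
    """Outer loop of IRL from sampled trajectories (sketch).

    rl_solver(alpha) -> policy optimal for reward R(s) = alpha . phi(s)        [assumed]
    simulate(policy) -> list of trajectories under that policy                 [assumed]
    lp_step(v_expert, v_policies) -> new alpha solving the p()-penalized LP    [assumed]
    """
    v_expert = value_estimates(expert_trajs, phi, gamma)    # per-basis expert values
    d = v_expert.shape[0]
    alpha = np.random.uniform(-1, 1, size=d)                # arbitrary initial reward
    v_policies = []
    for _ in range(n_iters):
        policy = rl_solver(alpha)                           # RL step under current reward
        v_policies.append(value_estimates(simulate(policy), phi, gamma))
        alpha = lp_step(v_expert, np.array(v_policies))     # IRL step: re-fit the reward
    return alpha
```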

Discrete Gridworld Experiment 5x5 grid world. Agent starts in the bottom-left square. Reward of 1 in the upper-right square. Actions = N, W, S, E (with a 30% chance the action is replaced by a random one).
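For concreteness, the transition matrices for such a gridworld might be built as below (suitable as input to the earlier LP snippets); the exact noise model is my reading of the slide, not the paper's code:

```python
import numpy as np

def gridworld_dynamics(size=5, noise=0.3):
    """Build P[a][s, s'] for a size x size gridworld with N, W, S, E actions.

    With probability `noise` the chosen action is replaced by a uniformly
    random action; moves off the grid leave the agent in place.
    """
    moves = [(-1, 0), (0, -1), (1, 0), (0, 1)]               # N, W, S, E (row, col)
    n_states, n_actions = size * size, len(moves)

    def step(s, a):
        r, c = divmod(s, size)
        dr, dc = moves[a]
        nr = min(max(r + dr, 0), size - 1)
        nc = min(max(c + dc, 0), size - 1)
        return nr * size + nc

    P = np.zeros((n_actions, n_states, n_states))
    for a in range(n_actions):
        for s in range(n_states):
            # intended move with prob 1 - noise, plus noise spread over all actions
            P[a, s, step(s, a)] += 1.0 - noise
            for b in range(n_actions):
                P[a, s, step(s, b)] += noise / n_actions
    return P

# Reward of 1 in the upper-right square, 0 elsewhere (row-major state indexing):
# R = np.zeros(25); R[4] = 1.0   # assuming row 0 is the top row
```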

Discrete Gridworld Results

Mountain Car Experiment #1 Car starts in the valley; the goal is at the top of the hill. Reward is -1 per “step” until the goal is reached. State = car's x-position and velocity (continuous!). Function-approximation class: all linear combinations of 26 evenly spaced Gaussian-shaped basis functions.

Mountain Car Experiment #2 Goal is at the bottom of the valley. Where the car starts is not stated (the top of the hill, perhaps?). Reward is 1 in the goal area, 0 elsewhere; γ = 0.99. State = car's x-position and velocity (continuous!). Function-approximation class: all linear combinations of 26 evenly spaced Gaussian-shaped basis functions.

Mountain Car Results (experiments #1 and #2)

Continuous Gridworld Experiment State space is now the continuous grid [0,1] x [0,1]. Actions: a movement of 0.2 in any direction, plus noise in the x and y coordinates uniform on [-0.1, 0.1]. Reward is 1 in the region [0.8,1] x [0.8,1] and 0 elsewhere; γ = 0.9. Function-approximation class: all linear combinations of a 15x15 array of 2-D Gaussian-shaped basis functions. m = 5000 trajectories of 30 steps each per policy.
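A sketch of the kind of feature map this describes, a 15x15 grid of 2-D Gaussian bumps over [0,1]^2; the bandwidth sigma is an illustrative guess, not a value from the paper:

```python
import numpy as np

def gaussian_grid_features(n_per_axis=15, sigma=0.1):
    """Return phi(s) mapping a 2-D state in [0,1]^2 to n_per_axis**2 Gaussian features."""
    centers_1d = np.linspace(0.0, 1.0, n_per_axis)
    cx, cy = np.meshgrid(centers_1d, centers_1d)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)     # (225, 2) bump centers

    def phi(s):
        s = np.asarray(s, dtype=float)
        sq_dist = np.sum((centers - s) ** 2, axis=1)
        return np.exp(-sq_dist / (2.0 * sigma ** 2))         # one Gaussian bump per center
    return phi

# The fitted reward is then a linear combination R(s) = alpha . phi(s):
# phi = gaussian_grid_features(); reward = lambda s, alpha: alpha @ phi(s)
```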

Continuous Gridworld Results 3%-10% error when comparing the fitted reward's optimal policy with the true optimal policy. However, there is no significant difference in the quality of the policies (measured using the true reward function).

Apprenticeship Learning via Inverse Reinforcement Learning Pieter Abbeel & Andrew Y. Ng

Algorithm For t = 1, 2, … Inverse RL step: estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {π_i}. RL step: compute the optimal policy π_t for the estimated reward w. The intuition: find a reward that makes the teacher look much better than the policies found so far, then hypothesize that reward and compute its optimal policy; the set of policies generated gets closer (in some sense) to the expert at every iteration. RL is treated as a black box; the inverse RL step is explained in detail next. Courtesy of Pieter Abbeel
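A sketch of this loop in code, with rl_solver (the RL black box), feature_expectations, and irl_step as assumed helpers; an irl_step along these lines is sketched after the next slide, and mu_expert would be estimated from the expert's demonstrations:

```python
import numpy as np

def apprenticeship_learning(mu_expert, rl_solver, feature_expectations,
                            irl_step, epsilon=1e-3, max_iters=50):
    """Apprenticeship-learning-style main loop (sketch).

    mu_expert                    : expert's feature expectations mu(pi_E), shape (d,)
    rl_solver(w)                 : policy optimal for reward R(s) = w . phi(s)     [assumed]
    feature_expectations(policy) : returns mu(policy), shape (d,)                  [assumed]
    irl_step(mu_expert, mus)     : returns (w, margin) of the max-margin IRL step  [assumed]
    """
    # Start from an arbitrary policy, e.g. the optimizer of a random reward.
    policies = [rl_solver(np.random.randn(mu_expert.shape[0]))]
    mus = [feature_expectations(policies[0])]
    for t in range(max_iters):
        w, margin = irl_step(mu_expert, np.array(mus))    # reward separating the expert
        if margin <= epsilon:                             # expert no longer clearly better:
            break                                         # stop and return what we have
        policies.append(rl_solver(w))                     # RL step under the new reward
        mus.append(feature_expectations(policies[-1]))
    return policies, mus, w
```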

Algorithm: IRL step Maximize the margin δ over w with ||w||_2 ≤ 1, subject to V_w(π_E) ≥ V_w(π_i) + δ for i = 1, …, t-1. δ is the margin of the expert's performance over the performance of the previously found policies. V_w(π) = E[ Σ_t γ^t R(s_t) | π ] = E[ Σ_t γ^t w^T φ(s_t) | π ] = w^T E[ Σ_t γ^t φ(s_t) | π ] = w^T μ(π), where μ(π) = E[ Σ_t γ^t φ(s_t) | π ] are the “feature expectations.” μ(π) can be computed for any policy π from the transition dynamics. Because of the norm constraint this is a QP, the same as max-margin for SVMs. Performance guarantees are on the next slide. Courtesy of Pieter Abbeel
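A sketch of this max-margin IRL step using cvxpy (my choice of solver, not the authors'); mu_expert and mus are the feature-expectation vectors defined on this slide:

```python
import cvxpy as cp
import numpy as np

def irl_step(mu_expert, mus):
    """Max-margin IRL step: find w with ||w||_2 <= 1 separating the expert's
    feature expectations from those of all previously found policies."""
    d = mu_expert.shape[0]
    w = cp.Variable(d)
    margin = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ mu_expert >= w @ mu_i + margin for mu_i in mus]
    problem = cp.Problem(cp.Maximize(margin), constraints)
    problem.solve()
    return w.value, margin.value

# Usage with the loop above:
# w, delta = irl_step(mu_expert, np.array(mus))
```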

Feature Expectation Closeness and Performance If we can find a policy π such that ||μ(π_E) - μ(π)||_2 ≤ ε, then for any underlying reward R*(s) = w*^T φ(s) with ||w*||_2 ≤ 1, we have |V_w*(π_E) - V_w*(π)| = |w*^T μ(π_E) - w*^T μ(π)| ≤ ||w*||_2 ||μ(π_E) - μ(π)||_2 ≤ ε. So although the reward function is unknown, a policy whose feature expectations are ε-close to the expert's has performance ε-close to the expert's, measured with respect to that unknown reward. At every iteration the algorithm finds a policy closer to the expert in feature-expectation space: the inverse RL step finds the direction in feature space along which the expert is most separated from the previously found policies, and the RL step finds a policy optimal for that estimated reward, bringing the set closer to the expert. Courtesy of Pieter Abbeel
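A quick numeric sanity check of the bound (arbitrary random vectors, purely to illustrate the Cauchy-Schwarz step):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)                     # ||w*||_2 = 1
mu_expert = rng.normal(size=d)
mu_policy = mu_expert + 1e-2 * rng.normal(size=d)    # feature expectations eps-close

value_gap = abs(w_star @ mu_expert - w_star @ mu_policy)
eps = np.linalg.norm(mu_expert - mu_policy)
assert value_gap <= eps + 1e-12                      # |V_w*(pi_E) - V_w*(pi)| <= eps
print(value_gap, eps)
```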

IRL step as a Support Vector Machine The IRL step is equivalent to finding the maximum-margin hyperplane separating two sets of points: the expert's feature expectations μ(π_E) on one side and the feature expectations μ(π_i) of the previously found policies on the other. Since |w*^T μ(π_E) - w*^T μ(π)| = |V_w*(π_E) - V_w*(π)|, the margin is the largest achievable gap between the expert policy's value and the value of the best of the other policies.

2 (E) (2) w(3) (1) w(2) w(1) (0) 1 Uw() = wT() Courtesy of Pieter Abbeel

Gridworld Experiment 128 x 128 grid world divided into 64 regions, each of size 16 x 16 (“macrocells”). A small number of macrocells have positive rewards. For each macrocell i, there is one feature Φi(s) indicating whether state s is in macrocell i. The algorithm was also run on the subset of features Φi(s) that correspond to non-zero rewards.

Gridworld Results [Figures: distance to the expert vs. number of iterations; performance vs. number of sampled trajectories.]

Car Driving Experiment No explicit reward function at all! The expert demonstrates the desired policy via 2 minutes of driving time on a simulator (1200 data points). 5 different “driver types” were tried. Features: which lane the car is in, distance to the closest car in the current lane. The algorithm was run for 30 iterations, and the final policy was hand-picked. Movie time! (Expert left, IRL right)

Demo-1 Nice

Demo-2 Right Lane Nasty

Car Driving Results