1 Daniel Brown and Scott Niekum The University of Texas at Austin
Feb 2018. Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning. Daniel Brown and Scott Niekum, The University of Texas at Austin. (Speaker note: Do I want an outline slide?)

2 Learning from Demonstration (LfD)
Today I'm going to be talking about the problem of learning from demonstration (LfD). The goal of LfD is to allow robots and other intelligent agents to learn how to perform tasks from human demonstrations. The motivation is that teaching through examples and demonstrations comes naturally to humans, so allowing machines to learn from human demonstrations would enable anyone to teach a robot or software agent to perform complex tasks according to their own particular preferences.

3 Bounding Performance for LfD
Correctness. Generalizability. Safety. The work I'm presenting today focuses on obtaining performance bounds when learning from demonstration, with the goal of allowing an agent that learns from demonstrations to reason about the correctness, generalizability, and safety of what it has learned.

4 How close is my performance to optimal?
Bounding Policy Loss. How close is my performance to optimal? The policy loss is the gap between the value of the optimal policy and the value of the evaluation policy under the true reward. (Speaker note: introduce the problem of LfD at a high level.)
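The two quantities on this slide can be written out explicitly. This is my reconstruction in standard MDP notation; the slide's own formulas did not survive extraction:

```latex
\text{Value of a policy } \pi \text{ under reward } R:\quad
V^{\pi}_{R} = \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \,\Big|\, \pi\Big]
\qquad
\text{Policy loss of } \pi_{\mathrm{eval}}:\quad
V^{\pi^{*}}_{R} - V^{\pi_{\mathrm{eval}}}_{R}
```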

5 General Problem: Policy evaluation w/out R
I'm 95% confident my performance is ε-close to optimal. Given a domain (an MDP\R), demonstrations, and an evaluation policy, find an ε such that, with high confidence, the policy loss is at most ε. The robot's policy could be built-in, hand-crafted, obtained from RL on an ad hoc reward function, or, most likely, obtained through LfD or IRL. The goal is a high-confidence bound on the performance loss between the robot's policy and the optimal policy under the demonstrator's unknown reward.
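Putting the slide's problem statement into symbols (a reconstruction; δ is my name for the failure probability, e.g. 0.05 to match the "95% confident" phrasing, and R* is the demonstrator's unknown true reward):

```latex
\text{Given an MDP}\backslash R\ (S, A, T, \gamma),\ \text{demonstrations } D,\ \text{and } \pi_{\mathrm{eval}},\ 
\text{find } \varepsilon \text{ such that}\quad
P\!\left( V^{\pi^{*}}_{R^{*}} - V^{\pi_{\mathrm{eval}}}_{R^{*}} \le \varepsilon \,\middle|\, D \right) \ge 1 - \delta
```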

6 How to bound Policy Loss?
There are many ways to estimate the return of an evaluation policy given samples of the reward; however, we don't know the reward function, and typically there are infinitely many reward functions that could explain the demonstrations. We use Bayesian IRL to sample from the posterior over likely reward functions conditioned on the demonstrations. This allows us to sample from the posterior distribution over policy losses. Since we don't know the true reward, we then formulate a risk-sensitive bound that hedges against the uncertainty in the true reward function.

7 Bayesian IRL (Ramachandran 2007)
Uses MCMC to sample from the posterior. Assumes demonstrations follow a softmax policy with temperature c.
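A minimal sketch of the likelihood and sampler described on this slide, in Python. solve_q_values(weights) is a hypothetical helper that returns optimal Q-values (a state-by-action array) for a candidate linear reward; Ramachandran and Amir's PolicyWalk sampler is more efficient because it reuses policies between MCMC steps, so this only shows the shape of the computation:

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(demos, q_values, c=1.0):
    """Softmax (Boltzmann) likelihood of (state, action) demos with temperature c."""
    ll = 0.0
    for s, a in demos:
        ll += c * q_values[s, a] - logsumexp(c * q_values[s])
    return ll

def birl_mcmc(demos, solve_q_values, n_features, n_samples=2000, step=0.05, c=1.0):
    """Random-walk Metropolis sampler over linear reward weights (simplified BIRL sketch)."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-1, 1, n_features)               # current reward weights
    ll = log_likelihood(demos, solve_q_values(w), c)
    samples = []
    for _ in range(n_samples):
        w_new = np.clip(w + step * rng.standard_normal(n_features), -1, 1)
        ll_new = log_likelihood(demos, solve_q_values(w_new), c)
        if np.log(rng.random()) < ll_new - ll:       # uniform prior: accept on likelihood ratio
            w, ll = w_new, ll_new
        samples.append(w.copy())
    return np.array(samples)
```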

9 How to bound Policy Loss?
We don't know the reward function (or the optimal policy). Bayesian Inverse Reinforcement Learning. Risk-sensitive performance bound: the α-Value at Risk (the α-quantile worst-case outcome).
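For reference, the α-Value at Risk of the policy loss L (a random variable here, because the reward is drawn from the posterior P(R | D)) is just the α-quantile of its distribution. This is the textbook definition written in my notation:

```latex
\mathrm{VaR}_{\alpha}(L) = \inf\{\, \ell \in \mathbb{R} : P(L \le \ell) \ge \alpha \,\},
\qquad
L = V^{\pi^{*}}_{R} - V^{\pi_{\mathrm{eval}}}_{R}, \quad R \sim P(R \mid D)
```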

10 Calculate policy loss assuming sampled reward is true reward
High-level Approach: (1) run Bayesian IRL to sample reward functions from the posterior P(R|D); (2) for each sampled reward, calculate the policy loss assuming that sample is the true reward; (3) sort the resulting policy losses; (4) return a high-confidence bound on the α-worst-case policy loss over P(R|D). (Speaker note: maybe build this out so I have cues to talk to and not one gobbledygook slide that overwhelms people!)
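Here is how the four steps above fit together, as a sketch under stated assumptions rather than the paper's pseudocode: birl_mcmc is the sampler sketched earlier, optimal_policy_for, value_of, and solve_q_values are hypothetical planning and policy-evaluation helpers supplied by the caller, and the confidence bound on the α-VaR uses the standard binomial order-statistic argument over the sorted losses.

```python
import numpy as np
from scipy.stats import binom

def var_upper_bound(losses, alpha=0.95, delta=0.05):
    """(1 - delta)-confidence upper bound on the alpha-VaR of the sampled losses,
    via a binomial order-statistic argument (a sketch, not the paper's exact procedure)."""
    losses = np.sort(np.asarray(losses))
    n = len(losses)
    # smallest k (1-indexed) such that P[Binomial(n, alpha) <= k - 1] >= 1 - delta
    k = int(np.searchsorted(binom.cdf(np.arange(n), n, alpha), 1 - delta)) + 1
    if k > n:
        raise ValueError("not enough posterior samples for this (alpha, delta)")
    return losses[k - 1]

def policy_loss_bound(demos, pi_eval, solve_q_values, optimal_policy_for,
                      value_of, n_features, alpha=0.95, delta=0.05):
    """End-to-end sketch: BIRL samples -> per-sample policy losses -> VaR bound."""
    reward_samples = birl_mcmc(demos, solve_q_values, n_features)  # sampler sketched earlier
    losses = []
    for w in reward_samples:
        v_opt = value_of(optimal_policy_for(w), w)   # value of the optimal policy for this sample
        v_eval = value_of(pi_eval, w)                # value of the evaluation policy for this sample
        losses.append(v_opt - v_eval)                # loss if this sample were the true reward
    return var_upper_bound(losses, alpha, delta)
```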

11 Experiments Grid world Driving

12 Assumptions on Reward Functions
The reward is assumed to be a linear combination of features, so we can rewrite the expected return of a policy in terms of expected feature counts.
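The rewrite mentioned here is the standard feature-expectation identity for linear rewards (my notation: w are the reward weights, φ the feature map, μ(π) the expected discounted feature counts):

```latex
R(s) = w^{\top}\phi(s)
\;\Longrightarrow\;
V^{\pi}_{R} = \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty}\gamma^{t} w^{\top}\phi(s_t)\,\Big|\,\pi\Big]
 = w^{\top}\mu(\pi),
\qquad
\mu(\pi) = \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty}\gamma^{t}\phi(s_t)\,\Big|\,\pi\Big]
```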

13 Baseline Worst-case feature count bound (WFCB)
Penalize the largest difference in state-visitation (feature) counts between the demonstrations and the evaluation policy: compare the empirical expected feature counts of the demonstrations with the expected feature counts of the evaluation policy.
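A rough sketch of how such a baseline can be computed; the exact normalization used in the paper may differ, and phi is an assumed (n_states, n_features) array of state features:

```python
import numpy as np

def empirical_feature_counts(trajectories, phi, gamma=0.95):
    """Discounted feature counts averaged over demonstrated state trajectories."""
    mu = np.zeros(phi.shape[1])
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += gamma ** t * phi[s]
    return mu / len(trajectories)

def worst_case_feature_count_bound(mu_demo, mu_eval):
    """Largest per-feature gap between demonstration and evaluation-policy
    feature counts (sketch of the WFCB idea; the paper's scaling may differ)."""
    return np.max(np.abs(mu_demo - mu_eval))
```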

14 Grid World Results 200 random grid worlds.
The evaluation policy is the optimal policy for the MAP reward given the demonstrations. The results showed the efficiency gain of our bound over the baseline.

15 Theoretical IRL performance bounds
Based on Hoeffding-style concentration inequalities (Abbeel & Ng 2004; Syed & Schapire 2008). Extremely loose in practice: they require vastly more demonstrations to achieve a bound comparable to the bound our method obtains after a single demonstration.
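To give a flavor of why such bounds are loose, here is a generic reconstruction (not the exact inequality from either cited paper): with k features taking values in [0, 1], each discounted feature count lies in [0, 1/(1−γ)], so Hoeffding's inequality plus a union bound over features gives, with probability at least 1 − δ over m demonstrations,

```latex
\big\| \hat{\mu}_{E} - \mu_{E} \big\|_{\infty}
\;\le\;
\frac{1}{1-\gamma}\sqrt{\frac{\ln(2k/\delta)}{2m}}
```

The 1/(1−γ) factor and the slow 1/√m decay are what make the resulting policy-loss bounds so loose in practice.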

16 Based on the demonstrations should I use policy A or B?
Policy Selection: Based on the demonstrations, should I use policy A or B? Rank a set of evaluation policies based on high-confidence performance bounds.
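As a hypothetical usage of the earlier policy_loss_bound sketch, ranking two candidate policies amounts to comparing their bounds (pi_A, pi_B, demos, and the helper functions are placeholders):

```python
candidates = {"A": pi_A, "B": pi_B}
bounds = {name: policy_loss_bound(demos, pi, solve_q_values, optimal_policy_for,
                                  value_of, n_features)
          for name, pi in candidates.items()}
chosen = min(bounds, key=bounds.get)   # pick the policy with the smallest bounded loss
```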

17 Driving Experiment Actions = left, right, straight
State features: distances to other cars, lane #. Reward features: lane #, in collision.
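One plausible, purely hypothetical encoding of the reward features named on this slide, assuming a three-lane road and a state object exposing .lane and .in_collision:

```python
import numpy as np

def reward_features(state, n_lanes=3):
    """Indicator per lane plus an in-collision flag (assumed encoding)."""
    phi = np.zeros(n_lanes + 1)
    phi[state.lane] = 1.0                          # lane # indicator
    phi[-1] = 1.0 if state.in_collision else 0.0   # collision indicator
    return phi
```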

18 Demonstration that avoids collisions
Right-safe: avoids cars but prefers the right lane. On-road: stays on the road but ignores other cars. Nasty: seeks collisions.

19 Policy Ranking Feature count bound is misled by state-occupancies
Our method reasons over reward likelihoods. (Speaker note: I wonder if a bar chart would be better here?)

20 Future Work Scalability:
Estimating the amount of noise in human demonstrations. Active learning: query the demonstrator to reduce the VaR. (Speaker notes: discuss limitations and future applications. Can a robot actively query for demonstrations in areas where the VaR is high, and know when it has received enough demonstrations? Can we infer the posterior over rewards without repeatedly solving for optimal policies? BIRL has a Boltzmann temperature parameter in its likelihood function, but how do you set it?)

21 Conclusion
First practical method for policy evaluation in an MDP\R, when the reward function is unknown. Based on probabilistic worst-case performance over likely reward functions. Applications: policy selection, policy improvement, demonstration sufficiency. (Speaker note: the first practical method that allows agents that learn from demonstration to express confidence in the quality of their learned policy.)

