Algorithms For Inverse Reinforcement Learning Presented by Alp Sardağ.


Goal Given observed optimal behaviour, extract a reward function. This may be useful: in apprenticeship learning; for ascertaining the reward function being optimized by a natural system.

The problem Given: measurements of an agent's behaviour over time, in a variety of circumstances; measurements of the sensory inputs to that agent; if available, a model of the environment. Determine the reward function.

Sources of Motivation The use of reinforcement learning and related methods as computational models for animal and human learning. The task of constructing an intelligent agent that can behave successfully in a particular domain.

First Motivation Reinforcement learning is supported by behavioural studies and neurophysiological evidence that reinforcement learning occurs. Assumption: the reward function is fixed and known. In animal and human behaviour, however, the reward function is an unknown to be ascertained through empirical investigation. Example: bee foraging. The literature assumes the reward is a simple saturating function of nectar content, but the bee might weigh nectar ingestion against flight distance, time, and risk from wind and predators.

Second Motivation The task of constructing an intelligent agent. An agent designer may have only a rough idea of the reward function. The entire field of reinforcement learning is founded on the assumption that the reward function is the most robust and transferable definition of the task.

Notation A (finite) MDP is a tuple (S, A, {P_sa}, γ, R), where: S is a finite set of N states; A = {a_1, ..., a_k} is a set of k actions; P_sa(·) are the state transition probabilities upon taking action a in state s; γ ∈ [0, 1) is the discount factor; R: S → ℝ is the reinforcement (reward) function. The value function of a policy π is V^π(s_1) = E[R(s_1) + γR(s_2) + γ²R(s_3) + ... | π]. The Q-function is Q^π(s, a) = R(s) + γ E_{s' ~ P_sa(·)}[V^π(s')].
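As an illustration of this notation, here is a minimal sketch (not from the slides; the numbers are invented for the example) that represents a small finite MDP in NumPy and computes V^π and Q^π for a fixed policy.

import numpy as np

N, k = 3, 2                      # N states, k actions
gamma = 0.9                      # discount factor in [0, 1)
R = np.array([0.0, 0.0, 1.0])    # reward vector: R[i] is the reward of state i

# P[a] is the N-by-N transition matrix of action a: P[a][i, j] = P_{s_i a}(s_j)
P = np.array([
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]],   # action a1
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]],   # action a2
])

# For the stationary policy "always take a1", the value function solves
# V = R + gamma * P_pi V, i.e. V = (I - gamma * P_pi)^{-1} R.
P_pi = P[0]
V = np.linalg.solve(np.eye(N) - gamma * P_pi, R)

# Q^pi(s, a) = R(s) + gamma * sum_j P_{s a}(j) V(j)
Q = R[:, None] + gamma * np.einsum('aij,j->ia', P, V)
print(V, Q)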

Notation Cont. For discrete finite spaces, all the functions R, V, and P can be represented as vectors or matrices indexed by state: R is the vector whose i-th element is the reward of the i-th state; V is the vector whose i-th element is the value function at state i; P_a is the N-by-N matrix whose (i, j) element gives the probability of transitioning to state j upon taking action a in state i.

Basic Properties of MDPs In vector notation, the Bellman equations are V^π = R + γP_π V^π, so V^π = (I − γP_π)^{-1} R, and Q^π(s, a) = R(s) + γ P_sa · V^π. Bellman optimality: π is optimal if and only if π(s) ∈ argmax_{a ∈ A} Q^π(s, a) for every state s.
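The Bellman optimality property can be checked numerically. Below is a minimal sketch (an illustration added in this rewrite, not taken from the slides), assuming the same (k, N, N) transition array as in the previous sketch.

import numpy as np

def is_optimal(pi, P, R, gamma, tol=1e-9):
    """pi: length-N array of action indices; P: (k, N, N) transition matrices."""
    N = len(R)
    P_pi = P[pi, np.arange(N), :]                      # row i is P_{s_i, pi(i)}(.)
    V = np.linalg.solve(np.eye(N) - gamma * P_pi, R)   # V^pi = (I - gamma P_pi)^{-1} R
    Q = R[:, None] + gamma * np.einsum('aij,j->ia', P, V)
    # pi is optimal iff pi(s) attains the maximum of Q^pi(s, .) at every state
    return bool(np.all(Q[np.arange(N), pi] >= Q.max(axis=1) - tol))

# Example, with P, R, gamma from the previous sketch (the all-a1 policy):
# is_optimal(np.zeros(3, dtype=int), P, R, 0.9)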

Inverse Reinforcement Learning The inverse reinforcement learning problem is to find a reward function that can explain observed behaviour. We are given: a finite state space S; a set of k actions, A = {a_1, ..., a_k}; transition probabilities {P_sa}; a discount factor γ; and a policy π. The problem is to find the set of possible reward functions R such that π is an optimal policy.

The steps of IRL in finite state spaces: 1. Find the set of all reward functions for which the given policy is optimal. 2. This set contains many degenerate solutions. 3. Apply a simple heuristic to remove the degeneracy, resulting in a linear programming solution to the IRL problem.

The Solution Assume, by renaming actions if necessary, that the observed policy is π(s) ≡ a_1. Then the condition (P_{a_1} − P_a)(I − γP_{a_1})^{-1} R ≥ 0 (componentwise) for all a ∈ A \ {a_1} is necessary and sufficient for π ≡ a_1 to be an optimal policy; with strict inequality, it is the unique optimal policy.
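A short sketch of how this condition can be tested for a candidate R, assuming the same (k, N, N) transition array as in the earlier sketches and the policy fixed to action index 0 (i.e. a_1):

import numpy as np

def reward_is_consistent(P, R, gamma, tol=1e-9):
    """True iff R makes the all-a1 policy optimal; P has shape (k, N, N)."""
    N = P.shape[1]
    inv = np.linalg.inv(np.eye(N) - gamma * P[0])
    for a in range(1, P.shape[0]):
        # (P_{a1} - P_a)(I - gamma P_{a1})^{-1} R must be componentwise >= 0
        if np.any((P[0] - P[a]) @ inv @ R < -tol):
            return False
    return True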

The problems R = 0 is always a solution. For most MDPs it also seems likely that there are many other choices of R that meet the criteria. How do we decide which of these many reinforcement functions to choose?

LP Formulation Linear programming can be used to find a feasible point of the constraints above. Among the feasible reward functions, favour solutions that make any single-step deviation from π as costly as possible: maximize Σ_{i=1}^{N} min_{a ∈ {a_2,...,a_k}} (P_{a_1}(i) − P_a(i))(I − γP_{a_1})^{-1} R, where P_a(i) denotes the i-th row of P_a.

Penalty Terms Small rewards are "simpler" and preferable. Optionally, add to the objective function a weight-decay-like penalty −λ||R||_1, where λ is an adjustable penalty coefficient that balances the twin goals of having small reinforcements and of maximizing the objective above.

Putting it all together The resulting optimization is: maximize Σ_{i=1}^{N} min_{a ∈ {a_2,...,a_k}} (P_{a_1}(i) − P_a(i))(I − γP_{a_1})^{-1} R − λ||R||_1, subject to (P_{a_1} − P_a)(I − γP_{a_1})^{-1} R ≥ 0 (componentwise) for all a ≠ a_1 and |R_i| ≤ R_max for i = 1, ..., N. Clearly, this may easily be formulated as a linear program and solved efficiently.
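One possible way to set this up with an off-the-shelf LP solver is sketched below. This is an illustrative reformulation: the auxiliary variables t and u and the use of scipy.optimize.linprog are choices of this sketch, not prescribed by the slides. The min terms are linearized with variables t_i, and the 1-norm penalty with variables u_i ≥ |R_i|.

import numpy as np
from scipy.optimize import linprog

def irl_lp(P, gamma, lam=1.0, r_max=1.0):
    """Solve the IRL LP for the policy pi(s) = action 0; P has shape (k, N, N), k >= 2."""
    k, N, _ = P.shape
    inv = np.linalg.inv(np.eye(N) - gamma * P[0])
    M = [(P[0] - P[a]) @ inv for a in range(1, k)]   # one N-by-N matrix per a != a1

    # Variables x = [R, t, u]; objective: minimize -sum(t) + lam * sum(u)
    c = np.concatenate([np.zeros(N), -np.ones(N), lam * np.ones(N)])

    I, Z = np.eye(N), np.zeros((N, N))
    A_ub = []
    for Ma in M:
        A_ub.append(np.hstack([-Ma, I, Z]))   # t_i <= (Ma R)_i  (so t_i becomes the min)
        A_ub.append(np.hstack([-Ma, Z, Z]))   # Ma R >= 0        (optimality constraints)
    A_ub.append(np.hstack([I, Z, -I]))        #  R_i <= u_i
    A_ub.append(np.hstack([-I, Z, -I]))       # -R_i <= u_i
    A_ub = np.vstack(A_ub)
    b_ub = np.zeros(A_ub.shape[0])

    bounds = [(-r_max, r_max)] * N + [(None, None)] * N + [(0, None)] * N
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:N]    # the recovered reward vector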

Linear Function Approximation in Large State Spaces The reward function is now R: ℝ^n → ℝ. Use a linear approximation for the reward function, R(s) = α_1 φ_1(s) + α_2 φ_2(s) + ... + α_d φ_d(s), where φ_1, ..., φ_d are known basis functions mapping S into the reals and the α_i are unknown parameters.

Linear Function Approximation in Large State Spaces (cont.) Our final linear programming formulation: maximize Σ_{s ∈ S_0} min_{a ∈ {a_2,...,a_k}} p( E_{s' ~ P_{s a_1}}[V^π(s')] − E_{s' ~ P_{s a}}[V^π(s')] ), subject to |α_i| ≤ 1 for i = 1, ..., d, where S_0 is a subsample of states and p is given by p(x) = x if x ≥ 0, p(x) = 2x otherwise (violated constraints are penalized rather than making the problem infeasible).
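A hedged sketch of this LP is given below. It assumes the differences E_{s'~P_{s a_1}}[V_i^π(s')] − E_{s'~P_{s a}}[V_i^π(s')] have already been estimated for each subsampled state, each non-policy action, and each basis reward φ_i, and collected into an array D; the name D and its layout are assumptions of this sketch, not given on the slide. Since p(x) = min(x, 2x), the objective linearizes directly.

import numpy as np
from scipy.optimize import linprog

def irl_lp_linear_approx(D):
    """D has shape (m, k-1, d): m subsampled states, k-1 non-policy actions, d bases."""
    m, n_act, d = D.shape
    # Variables x = [alpha (d), t (m)]; maximize sum(t)  <=>  minimize -sum(t).
    c = np.concatenate([np.zeros(d), -np.ones(m)])
    A_ub = []
    for s in range(m):
        e = np.zeros(m); e[s] = 1.0
        for a in range(n_act):
            A_ub.append(np.concatenate([-D[s, a], e]))        # t_s <= D[s,a] . alpha
            A_ub.append(np.concatenate([-2.0 * D[s, a], e]))  # t_s <= 2 D[s,a] . alpha
    A_ub = np.array(A_ub)
    b_ub = np.zeros(len(A_ub))
    bounds = [(-1.0, 1.0)] * d + [(None, None)] * m           # |alpha_i| <= 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d]    # estimated coefficients alpha_i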

IRL from Sampled Trajectories In the more realistic case, the policy π is given only through a set of actual trajectories in state space, and the transition probabilities P_a are unknown; instead, we assume we are able to simulate trajectories. The algorithm: 1. For each policy π_k found so far, run the simulation starting from s_0. 2. Calculate a Monte-Carlo estimate of V^{π_k}(s_0) (one estimate per basis reward) for each π_k. 3. Find the α_i that make the observed policy π* look better than every π_k, i.e. maximize Σ_k p( V^{π*}(s_0) − V^{π_k}(s_0) ) over the estimated values. 4. The new setting of the α_i gives a new reward function, for which a new (approximately) optimal policy is computed and added to the set. 5. Repeat for a large number of iterations, or until we find an R with which we are satisfied. A sketch of this loop follows.
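The loop can be organized roughly as follows. This is a high-level sketch only; the three callables (mc_basis_values, solve_lp, best_response) are placeholders for problem-specific pieces and are not defined on the slides.

def irl_from_trajectories(expert_policy, initial_policy, n_iters,
                          mc_basis_values, solve_lp, best_response):
    """mc_basis_values(pi): length-d Monte-Carlo estimates of V_i^pi(s_0), one per basis phi_i.
    solve_lp(expert_vals, other_vals): new alpha maximizing sum_k p(V^pi*(s_0) - V^pi_k(s_0)).
    best_response(alpha): a policy (approximately) optimal for R = sum_i alpha_i phi_i."""
    policies = [initial_policy]
    expert_vals = mc_basis_values(expert_policy)                # steps 1-2 for the observed policy
    alpha = None
    for _ in range(n_iters):
        other_vals = [mc_basis_values(pi) for pi in policies]   # steps 1-2 for each pi_k
        alpha = solve_lp(expert_vals, other_vals)               # step 3: new alpha_i
        policies.append(best_response(alpha))                   # step 4: new reward -> new policy
    return alpha                                                # step 5: stop after n_iters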

Experiments