# Kshitij Judah, Alan Fern, Tom Dietterich (School of EECS, Oregon State University)



 A Markov Decision Process (MDP) is a tuple (S, A, T, R, s0), where:
 S is the set of states
 A is the set of actions
 T(s' | s, a) is the transition function, denoting the probability of transitioning to state s' after taking action a in state s
 R(s) is the reward function, giving the reward received in state s
 s0 is the initial state
 A stationary policy π is a mapping from states to actions
 The H-horizon value of a policy π is the expected total reward of trajectories that start at s0 and follow π for H steps
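The H-horizon value defined above can be estimated by simple Monte-Carlo rollouts. The sketch below is illustrative, not from the paper; the 3-state chain MDP, the function names, and all constants are made up for the example.

```python
import random

def estimate_value(policy, transition, reward, s0, horizon, n_rollouts=1000, seed=0):
    """Monte-Carlo estimate of the H-horizon value of `policy`:
    the average total reward over rollouts that start at s0 and
    follow the policy for `horizon` steps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s = s0
        for _ in range(horizon):
            a = policy(s)
            s = transition(s, a, rng)  # sample s' ~ T(s' | s, a)
            total += reward(s)
    return total / n_rollouts

# Hypothetical 3-state chain: action 1 moves right with probability 0.9;
# only the last state yields reward.
def chain_transition(s, a, rng):
    return min(s + 1, 2) if (a == 1 and rng.random() < 0.9) else s

chain_reward = lambda s: 1.0 if s == 2 else 0.0
go_right = lambda s: 1
value = estimate_value(go_right, chain_transition, chain_reward, s0=0, horizon=5)
```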

Passive imitation learning: the teacher provides trajectory data, and a supervised learning algorithm turns that data into a classifier (the learner's policy). GOAL: to learn a policy whose H-horizon value is not much worse than that of the teacher's policy.

Teacher, trajectory data, supervised learning algorithm, classifier: the same passive pipeline. DRAWBACK: generating such trajectories can be tedious and may even be impractical, e.g., real-time low-level control of multiple game agents.

Active imitation learning via state queries: using the current training data of (s, a) pairs and a simulator, the learner selects the best state s to query; the teacher responds with the correct action a to take in s.

The teacher may instead refuse: "This is a bad state which I would never visit! I choose not to suggest any action." The query then returns only Bad State(s), and the learner gains no action label.

Example: a Wargus agent selects a state to query, and the Wargus expert declares it a bad state query, responding Bad State(s).

Example: a helicopter flying agent selects a state to query, and the expert pilot declares it a bad state query, responding Bad State(s).

 It is important to minimize bad-state queries. Challenge: how to combine action uncertainty with bad-state likelihood when selecting queries. We provide a principled approach based on noiseless Bayesian active learning.

 It is possible to simulate passive imitation learning via state queries: querying the teacher at each state along the current trajectory reproduces the trajectory data that the supervised learning algorithm would otherwise receive.
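The simulation idea can be sketched in a few lines: query the teacher at the current state, execute the returned action in the simulator, and repeat. The deterministic chain and the always-move-right teacher below are hypothetical stand-ins, not from the paper.

```python
def simulate_passive(teacher, transition, s0, horizon):
    """Simulate passive imitation learning with state queries:
    the collected (s, a) pairs match the trajectory data that
    passive imitation learning would have received."""
    data, s = [], s0
    for _ in range(horizon):
        a = teacher(s)          # state query answered by the teacher
        data.append((s, a))
        s = transition(s, a)    # simulator step
    return data

# Hypothetical deterministic chain with a teacher that always moves right.
teacher = lambda s: 1
step = lambda s, a: s + a
trajectory = simulate_passive(teacher, step, s0=0, horizon=3)
```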

In standard i.i.d. active learning, examples come from a single known target distribution: the learner selects the best state to query, and the teacher returns the correct action to take in it.

 Applying i.i.d. active learning uniformly over the entire state space leads to poor performance: queries land in uncertain states that are also bad.

 Noiseless Bayesian active learning: given a set of hypotheses, a set of tests, and the test outcomes, the goal is to identify the true hypothesis with as few tests as possible
 We employ a form of generalized binary search (GBS) in this work
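A minimal GBS sketch, assuming binary (+1/-1) test outcomes: pick the test whose two outcomes most evenly split the current posterior mass, observe the outcome, and discard inconsistent hypotheses (the noiseless Bayesian update). The four hypotheses and three tests below are invented for illustration.

```python
def gbs_select(tests, hypotheses, prior):
    """Generalized binary search: choose the test whose +1/-1 outcome
    most evenly bisects the posterior mass over hypotheses.
    hypotheses[h][x] in {+1, -1} is hypothesis h's outcome on test x."""
    def imbalance(x):
        return abs(sum(prior[h] * hypotheses[h][x] for h in hypotheses))
    return min(tests, key=imbalance)

def update_posterior(prior, hypotheses, x, outcome):
    """Noiseless Bayesian update: zero out hypotheses inconsistent
    with the observed outcome, then renormalize."""
    post = {h: (p if hypotheses[h][x] == outcome else 0.0) for h, p in prior.items()}
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

# Hypothetical threshold hypotheses over three tests.
H = {
    "h1": {0: +1, 1: +1, 2: +1},
    "h2": {0: -1, 1: +1, 2: +1},
    "h3": {0: -1, 1: -1, 2: +1},
    "h4": {0: -1, 1: -1, 2: -1},
}
prior = {h: 0.25 for h in H}
best_test = gbs_select([0, 1, 2], H, prior)  # test 1 splits the mass 2/2
```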

GOAL: determine the path corresponding to the target policy by performing tests (state queries) whose outcomes are the teacher's responses.

Constructing the hypothesis set: draw K bootstrap samples from the labeled (s, a) data, train a supervised learner on each to obtain K policies, and run each policy in the simulator to produce paths 1 through K. Generalized binary search then selects queries that distinguish among these paths.
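The bootstrap-committee step can be sketched as follows. This is a toy stand-in: the paper's supervised learners and simulator rollouts are replaced here by crude majority-vote lookup tables, and the data, K, and seed are made up.

```python
import random
from collections import Counter

def bootstrap_committee(data, k=5, seed=0):
    """Train a committee of K lookup-table policies, each on a
    bootstrap resample (with replacement) of the labeled (s, a) pairs."""
    rng = random.Random(seed)
    committee = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in data]  # resample with replacement
        by_state = {}
        for s, a in sample:
            by_state.setdefault(s, []).append(a)
        committee.append({s: Counter(acts).most_common(1)[0][0]
                          for s, acts in by_state.items()})
    return committee

def vote_distribution(committee, s, default_action=0):
    """Fraction of committee members choosing each action in state s."""
    votes = Counter(m.get(s, default_action) for m in committee)
    return {a: c / len(committee) for a, c in votes.items()}

data = [(0, 1), (0, 1), (1, 0), (1, 1)]   # hypothetical labeled pairs
committee = bootstrap_committee(data, k=7, seed=1)
dist = vote_distribution(committee, 1)
```

Committee disagreement at a state (a spread-out `dist`) is what makes that state a candidate query.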

 The GBS objective can be rewritten in the following form: the posterior probability mass of hypotheses whose paths go through s (i.e., the posterior probability of the target policy visiting s) times the entropy of the multinomial distribution over actions at s (i.e., the uncertainty over action choices at s), plus a small bonus term.
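A sketch of that decomposition, omitting the slide's small bonus term; the function name and the example numbers are assumptions for illustration. It shows why uncertain-but-likely-visited states outscore uncertain-but-probably-bad ones.

```python
import math

def query_score(visit_mass, action_dist):
    """Score state s as (posterior probability that the target policy
    visits s) times (entropy of the committee's action distribution
    at s). The slide's small bonus term is omitted in this sketch."""
    entropy = -sum(p * math.log(p) for p in action_dist.values() if p > 0.0)
    return visit_mass * entropy

# A likely-visited state with uncertain actions outscores an
# equally uncertain state that is probably bad.
good_uncertain = query_score(0.9, {0: 0.5, 1: 0.5})
bad_uncertain = query_score(0.1, {0: 0.5, 1: 0.5})
certain = query_score(0.9, {0: 1.0})
```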

 We use a Pegasus-style determinization approach to handle stochastic MDPs (Ng & Jordan, UAI 2000)
 Details are in the paper
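The determinization idea can be illustrated as below: pre-draw the randomness once so each stochastic transition becomes a deterministic function of (state, action, time step). This is a hedged sketch of the general Pegasus idea, not the paper's implementation; all names and constants are assumptions.

```python
import random

def determinize(transition, n_scenarios=10, master_seed=0):
    """Pegasus-style determinization sketch (Ng & Jordan, UAI 2000):
    fix random seeds up front so the stochastic MDP behaves like a
    deterministic one given the seeds."""
    master = random.Random(master_seed)
    seeds = [master.getrandbits(32) for _ in range(n_scenarios)]
    def det_transition(s, a, t):
        # The same (s, a, t) always sees the same pre-drawn randomness.
        rng = random.Random(hash((seeds[t % n_scenarios], s, a)))
        return transition(s, a, rng)
    return det_transition

# Hypothetical noisy chain: move right with probability 0.5.
noisy = lambda s, a, rng: s + 1 if rng.random() < 0.5 else s
det = determinize(noisy)
```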

 We performed experiments in two domains:
 A grid world with pits
 Cart pole
 We compared IQBC against the following baselines:
 Random: selects states to query uniformly at random
 Standard QBC (SQBC): treats all states as i.i.d. and applies standard uncertainty-based QBC
 Passive imitation learning (Passive): simulates standard passive imitation learning
 Confidence-based autonomy (CBA) (Chernova & Veloso, JAIR 2009): executes the policy until confidence falls below an automatically adjusted threshold, at which point the learner queries the teacher for an action, updates its policy and threshold, and resumes execution; performance can be quite sensitive to the threshold adjustment

[Figure: grid world with pits and a goal state]

We simulate two kinds of teacher:
 Generous: always responds with an action
 Strict: declares states far away from the states visited by the teacher to be bad states

“Generous” teacher

“Strict” teacher

 State = (cart position, cart velocity, pole angle, pole angular velocity)
 Actions = left or right
 Bounds on the cart position and pole angle are [-2.4, 2.4] and [-90°, 90°], respectively

“Generous” teacher

“Strict” teacher

 Develop policy optimization algorithms that take teacher responses and other forms of teacher input into account
 Query short sequences of states rather than single states
 Consider more application areas, such as structured prediction and other RL domains
 Conduct studies with human teachers

