Slide 1
Kshitij Judah, Alan Fern, Tom Dietterich
School of EECS, Oregon State University

Slide 2
A Markov Decision Process (MDP) is a tuple (S, A, T, R, s0) where:
- S is the set of states
- A is the set of actions
- T(s' | s, a) is the transition function, giving the probability of transitioning to state s' after taking action a in state s
- R(s) is the reward function, giving the reward received in state s
- s0 is the initial state
A stationary policy π is a mapping from states to actions. The H-horizon value of a policy π is the expected total reward of trajectories that start at s0 and follow π for H steps.
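The MDP tuple and the H-horizon value above can be sketched in code. This is a minimal illustration, not the paper's implementation; all names (`MDP`, `h_horizon_value`, the Monte Carlo estimator) are assumptions chosen for clarity, and transitions are represented as explicit (next state, probability) lists.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    """The tuple (S, A, T, R, s0) from the slide, in illustrative form."""
    states: List[str]
    actions: List[str]
    # T[(s, a)] is a list of (next_state, probability) pairs
    T: Dict[Tuple[str, str], List[Tuple[str, float]]] = field(default_factory=dict)
    R: Dict[str, float] = field(default_factory=dict)  # reward received in a state
    s0: str = ""                                       # initial state

def step(mdp: MDP, s: str, a: str, rng: random.Random) -> str:
    """Sample a next state from the transition distribution T(s' | s, a)."""
    next_states, probs = zip(*mdp.T[(s, a)])
    return rng.choices(next_states, weights=probs)[0]

def h_horizon_value(mdp: MDP, policy: Callable[[str], str],
                    H: int, n_rollouts: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of the H-horizon value: the expected total
    reward of trajectories that start at s0 and follow the policy for H steps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s = mdp.s0
        for _ in range(H):
            total += mdp.R[s]
            s = step(mdp, s, policy(s), rng)
    return total / n_rollouts
```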

Slide 3
[Diagram: Teacher → Trajectory Data → Supervised Learning Algorithm → Classifier (Learner)]
GOAL: To learn a policy whose H-horizon value is not much worse than that of the teacher.

Slide 4
[Diagram: the same passive imitation learning pipeline]
DRAWBACK: Generating such trajectories can be tedious and may even be impractical, e.g., real-time low-level control of multiple game agents!

Slide 5
Active imitation learning via state queries: using a simulator, the learner selects the best state s to query; the teacher responds with the correct action a to take in s; the pair (s, a) is added to the current training data.
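The query protocol above can be sketched as a short loop. This is an illustrative skeleton only: `fit`, `select_best`, and `teacher` are hypothetical callables standing in for the supervised learner, the query-selection strategy, and the (possibly refusing) teacher.

```python
def active_imitation_loop(candidate_states, teacher, fit, select_best, n_queries):
    """One run of the state-query protocol: the learner picks the most
    informative state, asks the teacher for the correct action there, and
    retrains on the growing (state, action) dataset.
    teacher(s) may return None for a bad state it would never visit."""
    data = []                      # current training data: (s, a) pairs
    policy = fit(data)
    for _ in range(n_queries):
        s = select_best(candidate_states, policy, data)
        a = teacher(s)             # "the correct action to take in s is a"
        if a is not None:          # a bad-state response yields no label
            data.append((s, a))
            policy = fit(data)
    return policy, data
```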

Slide 6
The teacher may instead give a bad-state response: "This is a bad state which I would never visit! I choose not to suggest any action."

Slide 7
Example: a Wargus agent poses a bad-state query to a Wargus expert, who declines to suggest an action.

Slide 8
Example: a helicopter-flying agent poses a bad-state query to an expert pilot.

Slide 9
It is important to minimize bad-state queries! Challenge: how to combine action uncertainty and bad-state likelihood when selecting queries. We provide a principled approach based on noiseless Bayesian active learning.

Slide 10
It is possible to simulate passive imitation learning via state queries, feeding the resulting trajectory data to a supervised learning algorithm.

Slide 11
A naive reduction treats the state-query loop as i.i.d. active learning against a single known target distribution over states.

Slide 12
However, applying i.i.d. active learning uniformly over the entire state space leads to poor performance: queries land in uncertain states that are also bad!

Slide 13
Noiseless Bayesian active learning: given a set of hypotheses, a set of tests, and the tests' outcomes, the goal is to identify the true hypothesis with as few tests as possible. We employ a form of generalized binary search (GBS) in this work.
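The GBS idea can be sketched as follows. This is a generic noiseless version-space search, not the paper's exact algorithm: hypotheses are represented as dicts mapping each test to its outcome, and the next test is the one whose worst-case surviving version space is smallest.

```python
from collections import Counter

def generalized_binary_search(hypotheses, tests, oracle):
    """Noiseless generalized binary search sketch: repeatedly run the test
    whose outcomes split the remaining version space most evenly, then
    discard every hypothesis inconsistent with the observed outcome."""
    version_space = list(hypotheses)
    queries = []
    while len(version_space) > 1:
        def worst_case(t):
            # Size of the largest outcome class = worst-case survivors.
            counts = Counter(h[t] for h in version_space)
            return max(counts.values())
        t = min(tests, key=worst_case)
        outcome = oracle(t)                 # e.g., a teacher response
        queries.append(t)
        version_space = [h for h in version_space if h[t] == outcome]
    return version_space[0], queries
```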

Slide 17
GOAL: Determine the path corresponding to the target policy by performing tests (state queries) whose outcomes are teacher responses.

Slide 27
The committee construction: draw K bootstrap samples of the labeled (s, a) data, train a supervised learner on each sample, and roll each learned policy out in the simulator to obtain paths 1 through K; generalized binary search then operates over these paths.
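The pipeline on this slide can be sketched as below. Everything here is an illustrative assumption rather than the paper's code: `learn` stands for the supervised learning algorithm, and the `simulator` is assumed to expose `reset()` and `step(s, a)`.

```python
import random

def bootstrap_committee(data, simulator, learn, K, horizon, seed=0):
    """Draw K bootstrap samples of the labeled (s, a) data, train one
    supervised learner per sample, and roll each resulting policy out in
    the simulator to obtain the K paths used by generalized binary search."""
    rng = random.Random(seed)
    committee, paths = [], []
    for _ in range(K):
        sample = [rng.choice(data) for _ in data]   # bootstrap resample
        policy = learn(sample)
        committee.append(policy)
        # Roll the policy out from the initial state to record its path.
        s, path = simulator.reset(), []
        for _ in range(horizon):
            path.append(s)
            s = simulator.step(s, policy(s))
        paths.append(path)
    return committee, paths
```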

Slide 28
The query selection criterion can be rewritten as the product of two factors:
- the posterior probability mass of hypotheses whose paths go through s, i.e., the posterior probability of the target policy visiting s
- the entropy of the multinomial distribution over actions at s, i.e., the uncertainty over action choices at s
plus a small bonus term.
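The two-factor score (omitting the small bonus term) can be sketched with a committee standing in for the hypothesis posterior. A uniform posterior over committee members is an assumption made here for illustration; `query_score` and its arguments are hypothetical names.

```python
import math
from collections import Counter

def query_score(s, committee, paths):
    """Illustrative two-factor score for querying state s:
    (fraction of committee members whose paths visit s) times
    (entropy of those members' action votes at s)."""
    visitors = [p for p, path in zip(committee, paths) if s in path]
    visit_mass = len(visitors) / len(committee)    # prob. of visiting s
    if not visitors:
        return 0.0
    votes = Counter(p(s) for p in visitors)        # action choices at s
    n = sum(votes.values())
    entropy = -sum((c / n) * math.log(c / n) for c in votes.values())
    return visit_mass * entropy
```

A state visited by many members that disagree on the action scores highest; a state no member visits scores zero, which is how bad states get suppressed.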

Slide 29
We use a Pegasus-style determinization approach to handle stochastic MDPs (Ng & Jordan, UAI 2000). Details are in the paper!
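The core of a Pegasus-style determinization is to fix a finite set of random seeds up front, so each stochastic rollout becomes a deterministic function of (policy, scenario). The sketch below is a minimal illustration under that assumption; the class and parameter names are not from the paper.

```python
import random

class DeterminizedSimulator:
    """Pegasus-style determinization sketch: pre-draw scenario seeds so
    that the same (policy, scenario) pair always yields the same trajectory."""
    def __init__(self, stochastic_step, n_scenarios=10, master_seed=0):
        self.step_fn = stochastic_step        # step_fn(s, a, rng) -> s'
        master = random.Random(master_seed)
        self.seeds = [master.randrange(2**31) for _ in range(n_scenarios)]

    def rollout(self, policy, s0, horizon, scenario):
        """Deterministic rollout: the scenario index selects a fixed seed."""
        rng = random.Random(self.seeds[scenario])
        s, traj = s0, [s0]
        for _ in range(horizon):
            s = self.step_fn(s, policy(s), rng)
            traj.append(s)
        return traj
```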

Slide 30
We performed experiments in two domains: a grid world with pits, and cart-pole. We compared IQBC against the following baselines:
- Random: selects states to query uniformly at random
- Standard QBC (SQBC): treats all states as i.i.d. and applies standard uncertainty-based QBC
- Passive imitation learning (Passive): simulates standard passive imitation learning
- Confidence-based autonomy (CBA) (Chernova & Veloso, JAIR 2009): executes the policy until its confidence falls below an automatically adjusted threshold, at which point the learner queries the teacher for an action, updates its policy and threshold, and resumes execution; performance can be quite sensitive to the threshold adjustment

Slide 31
[Figure: grid-world domain showing pits and the goal]

Slide 32
Two simulated teachers:
- Generous: always responds with an action
- Strict: declares states far away from the states visited by the teacher as bad states
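The two teacher types can be sketched as simple functions. The `distance` metric, `max_distance` threshold, and `action_for` oracle are illustrative assumptions; the source does not specify how "far away" is measured.

```python
def make_teachers(demo_states, action_for, max_distance, distance):
    """Build the two simulated teachers: the generous teacher always
    answers with an action; the strict teacher declares a state bad
    (returns None) when it is far from every state the teacher visited."""
    def generous(s):
        return action_for(s)

    def strict(s):
        if min(distance(s, d) for d in demo_states) > max_distance:
            return None                      # bad-state response
        return action_for(s)

    return generous, strict
```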

Slide 33
“Generous” teacher

Slide 34
“Strict” teacher

Slide 36
Cart-pole: the state consists of the cart position, cart velocity, pole angle, and pole angular velocity; the actions are left and right. The bounds on the cart position and the pole angle are [-2.4, 2.4] and [-90, 90] degrees, respectively.

Slide 37
“Generous” teacher

Slide 38
“Strict” teacher

Slide 39
Future work:
- Develop policy optimization algorithms that take teacher responses and other forms of teacher input into account
- Query short sequences of states rather than single states
- Consider more application areas, such as structured prediction and other RL domains
- Conduct studies with human teachers
