1 Active Imitation Learning via Reduction to I.I.D. Active Learning. Kshitij Judah, Alan Fern, Tom Dietterich. UAI 2012, Catalina Island, CA, USA
2 Imitation learning: a teacher demonstrates a policy by providing trajectories, and a machine learning algorithm learns a policy from them. GOAL: the teacher wants to teach the policy to the learner. Applications include car driving [Pomerleau, 1989], flight simulation [Sammut et al., 1992], and electronic games [Ross and Bagnell, 2010].
3 Reduction to passive i.i.d. learning: the teacher's trajectories are flattened into labeled (state, action) pairs, and a passive i.i.d. learner trains a classifier on them [Ross and Bagnell, 2010; Syed and Schapire, 2010]. If the classifier's error rate on the teacher's state distribution is ε, then over horizon H the learned policy's regret is bounded by H²ε.
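To make the reduction concrete, here is a minimal sketch, assuming a generic scikit-learn-style classifier; the trajectory format is an illustrative stand-in, not anything specified in the talk:

```python
# Minimal sketch of the passive reduction: flatten teacher trajectories
# into (state, action) pairs and fit any off-the-shelf classifier.
from sklearn.linear_model import LogisticRegression

def behavioral_cloning(teacher_trajectories):
    """teacher_trajectories: list of trajectories, each a list of
    (state_vector, action) pairs demonstrated by the teacher."""
    states, actions = [], []
    for trajectory in teacher_trajectories:
        for state, action in trajectory:
            states.append(state)
            actions.append(action)
    policy = LogisticRegression(max_iter=1000)  # any classifier works here
    policy.fit(states, actions)  # learn the mapping s -> a
    return policy  # act greedily via policy.predict([s])[0]
```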
4 DRAWBACK: generating such trajectories can be tedious and may even be impractical, e.g., real-time low-level control of multiple game agents, simulated snake control, or controlling a quadruped robot on rough terrain [Ratliff et al., 2006].
5 Active imitation learning setup: the learner has a dynamics simulator (no rewards) and a current training set of (s, a) pairs. It selects the best action query and asks the teacher for the correct action to take in a chosen state s; the teacher answers that the correct action to take in s is a. GOAL: learn using as few queries as possible.
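In code, one round of this interaction might look like the sketch below; the simulator, teacher, and learner objects and their methods are hypothetical stand-ins for the components in the slide's diagram:

```python
# One round of the active imitation learning protocol: the learner uses
# the simulator (no rewards) to find a state worth asking about, the
# teacher labels it, and the learner retrains. All interfaces here are
# illustrative assumptions.
def one_query_round(simulator, teacher, learner, training_data):
    query_state = learner.select_best_query(simulator, training_data)
    correct_action = teacher.query_action(query_state)  # "correct action in s is a"
    training_data.append((query_state, correct_action))
    learner.retrain(training_data)
    return training_data
```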
6 Two questions: how does active compare to passive in empirical performance? And in theoretical label complexity (# of queries posed vs. # of actions given by the teacher)? In the i.i.d. setting, active learning has significant empirical and theoretical advantages. Our approach: "reduce" active imitation learning to i.i.d. active learning.
7 Goal: train a classifier such that, with probability at least 1 − δ, the error rate is at most ε. A passive i.i.d. learner receives labeled (x, y) pairs drawn from the unlabeled data distribution. An i.i.d. active learner, given the unlabeled data distribution and its current training data of (x, y) pairs, selects the best query x, and the oracle/teacher returns the true label y. In favorable cases the active label complexity is exponentially smaller than the passive label complexity.
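A standard illustration of this gap (a textbook fact about learning a 1-D threshold in the realizable case, not a result from this talk): passive learning needs labels at a rate governed by 1/ε, while active learning can binary-search the unlabeled pool:

```latex
N_{\mathrm{passive}} = O\!\left(\frac{1}{\epsilon}\,\log\frac{1}{\delta}\right)
\qquad \text{vs.} \qquad
N_{\mathrm{active}} = O\!\left(\log\frac{1}{\epsilon}\right)
```

Each actively chosen label halves the interval where the threshold can lie, giving the exponential improvement in the dependence on 1/ε.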
8 Applying this to imitation learning: the i.i.d. active learner, given an unlabeled data distribution and the current training data of (s, a) pairs, selects the best action query, and the teacher answers that the correct action to take in s is a.
9 Key Challenge: what unlabeled data distribution do we give to the i.i.d. active learner?
10 Ideal distribution: the state distribution of the teacher's policy π*. But this is unknown, since we don't know π* itself: a chicken-and-egg problem!
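Written out, with μ the initial-state distribution and H the horizon (standard notation, introduced here for concreteness), the ideal unlabeled distribution is the teacher policy's state distribution:

```latex
d_{\pi^{*}}(s) \;=\; \frac{1}{H}\sum_{t=1}^{H}\Pr\left(s_{t}=s \;\middle|\; s_{1}\sim\mu,\; a_{1},\dots,a_{t-1}\sim\pi^{*}\right)
```

Sampling from d_{π*} requires executing π*, which is exactly what the learner does not yet have.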
11 Naïve approach: use an arbitrary state distribution. Learning on an arbitrary distribution often leads to bad performance.
12 Worse, queries drawn from an arbitrary state distribution often land in states the target policy never visits: wasted queries.
13 So the key challenge remains: what unlabeled data distribution do we give to the i.i.d. active learner?
14 Our two reductions: for non-stationary policies, Active Forward Training, based on the forward training algorithm of Ross and Bagnell (2010); for stationary policies, RAIL ("Reduction-based Active Imitation Learning"). Both reductions achieve exponentially better label complexity than any known passive imitation learning algorithm.
15 Iteration t = 1: generate trajectories with the current policy, extract unlabeled states from them, and pass the unlabeled states to the i.i.d. active learner, which poses k queries to the teacher (the correct action to take in each queried state) and adds the answers to the current training data of (s, a) pairs.
16 Iteration t = 2: repeat with the newly learned policy: generate trajectories, extract unlabeled states, pose k queries to the teacher, and retrain.
17 Iteration t = H: after the final round of trajectory generation and k teacher queries, the policy learned at iteration H is returned as the final learned policy!
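A compact sketch of the loop just illustrated; all object interfaces (simulator, teacher, active_learner) are hypothetical stand-ins, and, matching the practical variant discussed later in the talk, this version keeps the training data across iterations:

```python
def iterative_active_imitation(simulator, teacher, active_learner,
                               H, k, rollouts_per_iter=20):
    """Sketch: at each iteration, roll out the current policy to collect
    unlabeled states, let the i.i.d. active learner pick k of them to
    query, add the teacher's answers, and retrain."""
    policy = active_learner.initial_policy()
    training_data = []  # accumulated (s, a) pairs
    for t in range(1, H + 1):
        # Generate trajectories with the current policy (no rewards used).
        unlabeled_states = []
        for _ in range(rollouts_per_iter):
            s = simulator.reset()
            for _ in range(H):
                unlabeled_states.append(s)
                s = simulator.step(s, policy.act(s))
        # The i.i.d. active learner decides which states deserve a label.
        for q in active_learner.select_queries(unlabeled_states, k):
            training_data.append((q, teacher.query_action(q)))
        policy = active_learner.train(training_data)
    return policy  # the policy from iteration t = H is returned
```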
18 Why this works, iteration t = 1: the k queries let the i.i.d. active learner produce an accurate classifier on the unlabeled state distribution it was given.
19 Iteration t = 2: again an accurate classifier is learned, now on the state distribution induced by the improved policy.
20 Iteration t = H: the accurate classifier from the final iteration is returned as the final learned policy!
21 What are the label complexities of RAIL and of passive imitation learning in order to achieve a regret of ε? Assume an active i.i.d. PAC learner L_a with parameters ε and δ. Theorem. If L_a is run at each iteration with parameters ε and δ/H, then with probability at least 1 − δ the learned policy satisfies a regret bound linear in ε (with polynomial dependence on the horizon H). The corresponding bound for passive imitation learning is due to [Ross and Bagnell, 2010].
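The form of such bounds comes from compounding per-step errors over the horizon. A one-line version of the standard decomposition behind [Ross and Bagnell, 2010] (the talk's exact constants are in the paper):

```latex
\mathrm{regret}(\hat{\pi}) \;=\; V(\pi^{*}) - V(\hat{\pi})
\;\le\; \sum_{t=1}^{H} H \cdot \Pr\left[\hat{\pi}(s_{t}) \neq \pi^{*}(s_{t})\right]
\;\le\; H^{2}\epsilon
```

Each of the H steps is misclassified with probability at most ε on average, and a single wrong action can cost up to H in total value.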
22 In the realizable setting we get exponentially better label complexity: RAIL's query count scales with log(1/ε), whereas the passive bound scales with 1/ε [Ross and Bagnell, 2010].
23 In the theoretical version of the algorithm, the training data gathered at iteration t is discarded after that iteration's policy is learned.
24 Practical variant, RAIL-DW: each iteration poses only k = 1 query, selected by density-weighted QBC [McCallum & Nigam, 1998], and keeps the training data from all previous iterations.
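A minimal sketch of density-weighted query selection in the spirit of [McCallum & Nigam, 1998]; the committee interface, the vote-entropy disagreement measure, and the Gaussian-kernel density are illustrative choices, not necessarily the exact ones used in the talk:

```python
import numpy as np

def density_weighted_qbc(committee, unlabeled_states):
    """Pick the state with the highest disagreement x density score.
    committee: list of classifiers with predict(state_batch) -> actions.
    unlabeled_states: array-like of shape (n, d)."""
    X = np.asarray(unlabeled_states)
    votes = np.array([[clf.predict([s])[0] for clf in committee] for s in X])

    def vote_entropy(row):
        # Entropy of the committee's votes: high when members disagree.
        _, counts = np.unique(row, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log(p)).sum()

    disagreement = np.apply_along_axis(vote_entropy, 1, votes)
    # Density: average similarity of each state to all unlabeled states,
    # here a Gaussian kernel on squared Euclidean distance.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    density = np.exp(-sq_dists).mean(axis=1)
    return X[np.argmax(disagreement * density)]  # the single k = 1 query
```

Weighting by density steers the single query toward states that are both informative and representative, rather than rare outliers the policy will seldom visit.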
25 We performed experiments in the following domains: cart-pole, bicycle balancing, Wargus, and structured prediction. We compared RAIL-DW against the following baselines:
unif-RAND: selects states to query uniformly at random.
unif-QBC: treats all states as i.i.d. and applies standard QBC.
Passive: simulates standard passive imitation learning.
Confidence-based autonomy (CBA) [Chernova & Veloso, 2009]: queries the teacher if confidence < threshold (automatically computed); performance can be quite sensitive to threshold adjustment.
26–33 [Experimental results figures for the domains listed above]
34 We did not run experiments for unif-QBC and unif-RAND due to the difficulty of defining a space of feasible states over which to sample uniformly.
35 Summary: active imitation learning beats passive in both theoretical label complexity and empirical performance. Future work: active querying in the RL setting, theoretical analysis of RAIL-DW, and multiple query types.
37 Backup: stress prediction in the NETtalk data set [Dietterich et al., 2008]. Input: a word as a sequence of letters (e.g., q u i c k); output: a stress label for each letter.
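One common way to reduce such sequence labeling to per-letter classification is a sliding window over the letters; a hypothetical sketch (window radius L, as on the next slide; the padding character and window encoding are assumptions for illustration):

```python
def letter_windows(word, L, pad="_"):
    """Reduce stress prediction to one classification per letter: the
    input for letter i is the window of L letters on each side of it."""
    padded = pad * L + word + pad * L
    return [tuple(padded[i:i + 2 * L + 1]) for i in range(len(word))]

# e.g. letter_windows("quick", L=2)[0] == ('_', '_', 'q', 'u', 'i')
```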
38 We present results for L = 2; the results were qualitatively similar for L = 1.