1 Active Imitation Learning via Reduction to I.I.D. Active Learning. Kshitij Judah, Alan Fern, Tom Dietterich. UAI 2012, Catalina Island, CA, USA
2 Imitation learning: a teacher demonstrates a policy by providing trajectories, and a machine learning algorithm learns a policy from them. GOAL: the teacher wants to teach the policy to the learner. Applications include car driving [Pomerleau, 1989], flight simulation [Sammut et al., 1992], and electronic games [Ross and Bagnell, 2010].
3 Reduction to passive i.i.d. learning: the teacher's trajectories are flattened into labeled (state, action) pairs, and a passive i.i.d. learner trains a classifier on them [Ross and Bagnell, 2010; Syed and Schapire, 2010]. If the classifier's error rate on the teacher's state distribution is ε, then over horizon H the learned policy's regret is bounded by H²ε.
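To make the reduction concrete, here is a minimal sketch, assuming a generic scikit-learn-style classifier; the trajectory format is an illustrative stand-in, not anything specified in the talk:

```python
# Minimal sketch of the passive reduction: flatten teacher trajectories
# into (state, action) pairs and fit any off-the-shelf classifier.
from sklearn.linear_model import LogisticRegression

def behavioral_cloning(teacher_trajectories):
    """teacher_trajectories: list of trajectories, each a list of
    (state_vector, action) pairs demonstrated by the teacher."""
    states, actions = [], []
    for trajectory in teacher_trajectories:
        for state, action in trajectory:
            states.append(state)
            actions.append(action)
    policy = LogisticRegression(max_iter=1000)  # any classifier works here
    policy.fit(states, actions)  # learn the mapping s -> a
    return policy  # act greedily via policy.predict([s])[0]
```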
4 DRAWBACK: generating such trajectories can be tedious and may even be impractical, e.g., real-time low-level control of multiple game agents, simulated snake control, or controlling a quadruped robot on rough terrain [Ratliff et al., 2006].
5 Active imitation learning setup: the learner has a dynamics simulator (no rewards) and a current training set of (s, a) pairs. It selects the best action query and asks the teacher for the correct action to take in a chosen state s; the teacher answers that the correct action to take in s is a. GOAL: learn using as few queries as possible.
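In code, one round of this interaction might look like the sketch below; the simulator, teacher, and learner objects and their methods are hypothetical stand-ins for the components in the slide's diagram:

```python
# One round of the active imitation learning protocol: the learner uses
# the simulator (no rewards) to find a state worth asking about, the
# teacher labels it, and the learner retrains. All interfaces here are
# illustrative assumptions.
def one_query_round(simulator, teacher, learner, training_data):
    query_state = learner.select_best_query(simulator, training_data)
    correct_action = teacher.query_action(query_state)  # "correct action in s is a"
    training_data.append((query_state, correct_action))
    learner.retrain(training_data)
    return training_data
```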
6 Two questions: how does active compare to passive in empirical performance? And in theoretical label complexity (# of queries posed vs. # of actions given by the teacher)? In the i.i.d. setting, active learning has significant empirical and theoretical advantages. Our approach: "reduce" active imitation learning to i.i.d. active learning.
7 Goal: train a classifier such that, with probability at least 1 − δ, the error rate is at most ε. A passive i.i.d. learner receives labeled (x, y) pairs drawn from the unlabeled data distribution. An i.i.d. active learner, given the unlabeled data distribution and its current training data of (x, y) pairs, selects the best query x, and the oracle/teacher returns the true label y. In favorable cases the active label complexity is exponentially smaller than the passive label complexity.
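A standard illustration of this gap (a textbook fact about learning a 1-D threshold in the realizable case, not a result from this talk): passive learning needs labels at a rate governed by 1/ε, while active learning can binary-search the unlabeled pool:

```latex
N_{\mathrm{passive}} = O\!\left(\frac{1}{\epsilon}\,\log\frac{1}{\delta}\right)
\qquad \text{vs.} \qquad
N_{\mathrm{active}} = O\!\left(\log\frac{1}{\epsilon}\right)
```

Each actively chosen label halves the interval where the threshold can lie, giving the exponential improvement in the dependence on 1/ε.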
8 Applying this to imitation learning: the i.i.d. active learner, given an unlabeled data distribution and the current training data of (s, a) pairs, selects the best action query, and the teacher answers that the correct action to take in s is a.
9 Key Challenge: what unlabeled data distribution do we give to the i.i.d. active learner?
10 Ideal distribution: the state distribution of the teacher's policy π*. But this is unknown, since we don't know π* itself: a chicken-and-egg problem!
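Written out, with μ the initial-state distribution and H the horizon (standard notation, introduced here for concreteness), the ideal unlabeled distribution is the teacher policy's state distribution:

```latex
d_{\pi^{*}}(s) \;=\; \frac{1}{H}\sum_{t=1}^{H}\Pr\left(s_{t}=s \;\middle|\; s_{1}\sim\mu,\; a_{1},\dots,a_{t-1}\sim\pi^{*}\right)
```

Sampling from d_{π*} requires executing π*, which is exactly what the learner does not yet have.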
11 Naïve approach: use an arbitrary state distribution. Learning on an arbitrary distribution often leads to bad performance.
12 Worse, queries drawn from an arbitrary state distribution often land in states the target policy never visits: wasted queries.
13 So the key challenge remains: what unlabeled data distribution do we give to the i.i.d. active learner?
14 Our two reductions: for non-stationary policies, Active Forward Training, based on the forward training algorithm of Ross and Bagnell (2010); for stationary policies, RAIL ("Reduction-based Active Imitation Learning"). Both reductions achieve exponentially better label complexity than any known passive imitation learning algorithm.
15 Iteration t = 1: generate trajectories with the current policy, extract unlabeled states from them, and pass the unlabeled states to the i.i.d. active learner, which poses k queries to the teacher (the correct action to take in each queried state) and adds the answers to the current training data of (s, a) pairs.
16 Iteration t = 2: repeat with the newly learned policy: generate trajectories, extract unlabeled states, pose k queries to the teacher, and retrain.
17 Iteration t = H: after the final round of trajectory generation and k teacher queries, the policy learned at iteration H is returned as the final learned policy!
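A compact sketch of the loop just illustrated; all object interfaces (simulator, teacher, active_learner) are hypothetical stand-ins, and, matching the practical variant discussed later in the talk, this version keeps the training data across iterations:

```python
def iterative_active_imitation(simulator, teacher, active_learner,
                               H, k, rollouts_per_iter=20):
    """Sketch: at each iteration, roll out the current policy to collect
    unlabeled states, let the i.i.d. active learner pick k of them to
    query, add the teacher's answers, and retrain."""
    policy = active_learner.initial_policy()
    training_data = []  # accumulated (s, a) pairs
    for t in range(1, H + 1):
        # Generate trajectories with the current policy (no rewards used).
        unlabeled_states = []
        for _ in range(rollouts_per_iter):
            s = simulator.reset()
            for _ in range(H):
                unlabeled_states.append(s)
                s = simulator.step(s, policy.act(s))
        # The i.i.d. active learner decides which states deserve a label.
        for q in active_learner.select_queries(unlabeled_states, k):
            training_data.append((q, teacher.query_action(q)))
        policy = active_learner.train(training_data)
    return policy  # the policy from iteration t = H is returned
```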
18 Why this works, iteration t = 1: the k queries let the i.i.d. active learner produce an accurate classifier on the unlabeled state distribution it was given.
19 Iteration t = 2: again an accurate classifier is learned, now on the state distribution induced by the improved policy.
20 Iteration t = H: the accurate classifier from the final iteration is returned as the final learned policy!
21 What are the label complexities of RAIL and of passive imitation learning in order to achieve a regret of ε? Assume an active i.i.d. PAC learner L_a with parameters ε and δ. Theorem. If L_a is run at each iteration with parameters ε and δ/H, then with probability at least 1 − δ the learned policy satisfies a regret bound linear in ε (with polynomial dependence on the horizon H). The corresponding bound for passive imitation learning is due to [Ross and Bagnell, 2010].
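The form of such bounds comes from compounding per-step errors over the horizon. A one-line version of the standard decomposition behind [Ross and Bagnell, 2010] (the talk's exact constants are in the paper):

```latex
\mathrm{regret}(\hat{\pi}) \;=\; V(\pi^{*}) - V(\hat{\pi})
\;\le\; \sum_{t=1}^{H} H \cdot \Pr\left[\hat{\pi}(s_{t}) \neq \pi^{*}(s_{t})\right]
\;\le\; H^{2}\epsilon
```

Each of the H steps is misclassified with probability at most ε on average, and a single wrong action can cost up to H in total value.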
22 In the realizable setting we get exponentially better label complexity: RAIL's query count scales with log(1/ε), whereas the passive bound scales with 1/ε [Ross and Bagnell, 2010].
23 In the theoretical version of the algorithm, the training data gathered at iteration t is discarded after that iteration's policy is learned.
24 Practical variant, RAIL-DW: each iteration poses only k = 1 query, selected by density-weighted QBC [McCallum & Nigam, 1998], and keeps the training data from all previous iterations.
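A minimal sketch of density-weighted query selection in the spirit of [McCallum & Nigam, 1998]; the committee interface, the vote-entropy disagreement measure, and the Gaussian-kernel density are illustrative choices, not necessarily the exact ones used in the talk:

```python
import numpy as np

def density_weighted_qbc(committee, unlabeled_states):
    """Pick the state with the highest disagreement x density score.
    committee: list of classifiers with predict(state_batch) -> actions.
    unlabeled_states: array-like of shape (n, d)."""
    X = np.asarray(unlabeled_states)
    votes = np.array([[clf.predict([s])[0] for clf in committee] for s in X])

    def vote_entropy(row):
        # Entropy of the committee's votes: high when members disagree.
        _, counts = np.unique(row, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log(p)).sum()

    disagreement = np.apply_along_axis(vote_entropy, 1, votes)
    # Density: average similarity of each state to all unlabeled states,
    # here a Gaussian kernel on squared Euclidean distance.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    density = np.exp(-sq_dists).mean(axis=1)
    return X[np.argmax(disagreement * density)]  # the single k = 1 query
```

Weighting by density steers the single query toward states that are both informative and representative, rather than rare outliers the policy will seldom visit.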
25 We performed experiments in the following domains: cart-pole, bicycle balancing, Wargus, and structured prediction. We compared RAIL-DW against the following baselines:
unif-RAND: selects states to query uniformly at random.
unif-QBC: treats all states as i.i.d. and applies standard QBC.
Passive: simulates standard passive imitation learning.
Confidence-based autonomy (CBA) [Chernova & Veloso, 2009]: queries the teacher if confidence < threshold (automatically computed); performance can be quite sensitive to threshold adjustment.
26–33 [Experimental results figures for the domains listed above]
34 We did not run experiments for unif-QBC and unif-RAND due to the difficulty of defining a space of feasible states over which to sample uniformly.
35 Summary: active imitation learning beats passive in both theoretical label complexity and empirical performance. Future work: active querying in the RL setting, theoretical analysis of RAIL-DW, and multiple query types.
37 Backup: stress prediction in the NETtalk data set [Dietterich et al., 2008]. Input: a word as a sequence of letters (e.g., q u i c k); output: a stress label for each letter.
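One common way to reduce such sequence labeling to per-letter classification is a sliding window over the letters; a hypothetical sketch (window radius L, as on the next slide; the padding character and window encoding are assumptions for illustration):

```python
def letter_windows(word, L, pad="_"):
    """Reduce stress prediction to one classification per letter: the
    input for letter i is the window of L letters on each side of it."""
    padded = pad * L + word + pad * L
    return [tuple(padded[i:i + 2 * L + 1]) for i in range(len(word))]

# e.g. letter_windows("quick", L=2)[0] == ('_', '_', 'q', 'u', 'i')
```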
38 We present results for L = 2; the results were qualitatively similar for L = 1.