
1 Coarse sample complexity bounds for active learning Sanjoy Dasgupta UC San Diego

2 Supervised learning Given access to labeled data (drawn i.i.d. from an unknown underlying distribution P), we want to learn a classifier chosen from a hypothesis class H with misclassification rate < ε. Sample complexity is characterized by d = VC dimension of H. If the data is separable, roughly d/ε labeled samples are needed.
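
For reference (a standard textbook bound, not stated on the slide): the realizable-case PAC sample complexity is m = O((1/ε)·(d·log(1/ε) + log(1/δ))) for confidence 1 − δ; the d/ε on this slide suppresses the log factors.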

3 Active learning In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate?

4 Our result A parameter which coarsely characterizes the label complexity of active learning in the separable setting

5 Can adaptive querying really help? [CAL92, D04]: Threshold functions on the real line: h_w(x) = 1(x ≥ w), H = {h_w : w ∈ R}. Start with 1/ε unlabeled points. Binary search needs just log(1/ε) labels, from which the labels of the rest can be inferred! An exponential improvement in sample complexity. [Figure: threshold w on the line, − to the left, + to the right.]
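
A minimal sketch of this binary-search learner (illustrative; the function names and pool setup are ours, not from the talk):

```python
import random

def binary_search_threshold(xs, label):
    """Actively learn a threshold h_w(x) = 1(x >= w) from a pool xs of
    unlabeled points, using O(log |xs|) calls to the labeling oracle.
    Returns a threshold consistent with every point in the pool."""
    xs = sorted(xs)
    lo, hi = 0, len(xs)            # smallest positive point lies in xs[lo:hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if label(xs[mid]) == 1:    # xs[mid] is positive: boundary at or before mid
            hi = mid
        else:                      # xs[mid] is negative: boundary after mid
            lo = mid + 1
    # Every point below index lo is negative, every point at or above it is
    # positive -- those labels are now inferred without querying them.
    return xs[lo] if lo < len(xs) else float("inf")

if __name__ == "__main__":
    w_true = 0.37
    pool = [random.random() for _ in range(1000)]   # ~1/epsilon unlabeled points
    queried = []
    def oracle(x):
        queried.append(x)
        return 1 if x >= w_true else 0
    w_hat = binary_search_threshold(pool, oracle)
    print(f"threshold ~ {w_hat:.3f} after {len(queried)} label queries")
```

On a pool of 1,000 points this makes about log₂(1000) ≈ 10 queries rather than 1,000.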

6 More general hypothesis classes For a general hypothesis class with VC dimension d, is a "generalized binary search" possible?
Random choice of queries: d/ε labels.
Perfect binary search: d·log(1/ε) labels.
Where in this large range does the label complexity of active learning lie? We've already handled linear separators in 1-d…
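
As a concrete comparison (numbers ours, purely illustrative): with d = 10 and ε = 0.01, random querying costs about d/ε = 1,000 labels, while perfect binary search would cost about d·log₂(1/ε) ≈ 66.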

7 Linear separators in R^2 For linear separators in R^1, just log(1/ε) labels are needed. But when H = {linear separators in R^2}, some target hypotheses require 1/ε labels to be queried! Consider any distribution over the circle in R^2, and hypotheses h_0, h_1, h_2, …, h_{1/ε}, each positive on a different ε fraction of the distribution: 1/ε labels are needed to distinguish between h_0, h_1, h_2, …, h_{1/ε}! [Figure: circle with hypotheses h_0, h_1, h_2, h_3 marked, each cutting off an ε fraction of the distribution.]

8 A fuller picture For linear separators in R^2: some bad target hypotheses require 1/ε labels, but "most" require just O(log 1/ε) labels… [Figure: hypothesis space divided into good and bad regions.]

9 A view of the hypothesis space H = {linear separators in R^2}. [Figure: map of the hypothesis space, with labels: all-positive hypothesis, all-negative hypothesis, good region, bad regions.]

10 Geometry of hypothesis space H = any hypothesis class of VC dimension d < ∞. P = underlying distribution of data. (i) Non-Bayesian setting: no probability measure on H. (ii) But there is a natural (pseudo-)metric: d(h,h’) = P(h(x) ≠ h’(x)). (iii) Each point x defines a cut through H. [Figure: hypotheses h, h’ in H; a point x cuts H in two.]
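
A sketch of how this pseudo-metric can be estimated from an unlabeled sample (illustrative; not part of the slides):

```python
import random

def empirical_distance(h1, h2, unlabeled):
    """Estimate the pseudo-metric d(h1, h2) = P(h1(x) != h2(x)) by the
    disagreement rate on an unlabeled sample drawn from P.
    Note: no labels are needed to measure distances between hypotheses."""
    return sum(h1(x) != h2(x) for x in unlabeled) / len(unlabeled)

# Example with two thresholds on the line: their distance is the
# probability mass falling between the two thresholds.
sample = [random.random() for _ in range(10_000)]
h1 = lambda x: x >= 0.3
h2 = lambda x: x >= 0.5
print(empirical_distance(h1, h2, sample))   # ~0.2 under the uniform distribution
```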

11 The learning process (h_0 = target hypothesis.) Keep asking for labels until the diameter of the remaining version space is at most ε. [Figure: the version space, a shrinking region of H around h_0.]
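
A minimal sketch of this loop for a finite hypothesis class and pool (illustrative; the greedy "most split" query rule is our stand-in, not the scheme analyzed in the talk):

```python
def active_learn(hypotheses, pool, label, eps):
    """Query labels until the version space has diameter <= eps, where the
    diameter is the maximum pairwise disagreement rate on the pool."""
    V = list(hypotheses)

    def dist(h1, h2):
        return sum(h1(x) != h2(x) for x in pool) / len(pool)

    def diameter():
        return max((dist(a, b) for a in V for b in V), default=0.0)

    queries = 0
    while diameter() > eps:
        # Greedy stand-in rule: query the pool point on which the
        # current version space is most evenly split.
        x = max(pool, key=lambda p: min(sum(h(p) for h in V),
                                        sum(1 - h(p) for h in V)))
        V = [h for h in V if h(x) == label(x)]   # discard inconsistent hypotheses
        queries += 1
    return V[0], queries
```

With threshold hypotheses on the line, this "most split" choice picks the median of the disputed region, recovering the binary search of slide 5.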

12 Searchability index (Inputs: accuracy ε, data distribution P, amount of unlabeled data.) Each hypothesis h ∈ H has a "searchability index" ρ(h): ρ(h) ∝ min(pos mass of h, neg mass of h), but never < ε. It satisfies ε ≤ ρ(h) ≤ 1; bigger is better. Example: linear separators in R^2, data on a circle. [Figure: hypothesis space H with sample values of ρ marked: 1/2, 1/3, 1/4, 1/5 at various hypotheses, and ε at the all-positive hypothesis.]
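
A toy computation of the quantity ρ(h) is said to be proportional to (a proxy only; the slide does not give the full definition of the searchability index):

```python
def searchability_proxy(h, pool, eps):
    """The quantity the slide says rho(h) is proportional to:
    min(positive mass, negative mass) under h, never below eps.
    (A proxy only -- the exact definition of rho involves more.)"""
    pos = sum(h(x) for x in pool) / len(pool)
    return max(eps, min(pos, 1.0 - pos))
```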

13 Searchability index (Inputs: accuracy ε, data distribution P, amount of unlabeled data.) Each hypothesis h ∈ H has a "searchability index" ρ(h), lying in the range ε ≤ ρ(h) ≤ 1. Upper bound: there is an active learning scheme which identifies any target hypothesis h ∈ H (to accuracy ≤ ε) with a label complexity of at most [bound given on the slide]. Lower bound: for any h ∈ H, any active learning scheme for the neighborhood B(h, ρ(h)) has a label complexity of at least [bound given on the slide]. [When ρ(h) ≫ ε: active learning helps a lot.] Consistent with the examples in these slides: a constant ρ(h) gives binary-search-like log(1/ε) behavior, while ρ(h) ≈ ε gives no improvement over 1/ε.

14 Linear separators in R^d Previous sample complexity results for active learning have focused on the following case: H = homogeneous (through the origin) linear separators in R^d, data distributed uniformly over the unit sphere.
[1] Query by committee [SOS92, FSST97] — Bayesian setting: average-case over target hypotheses picked uniformly from the unit sphere.
[2] Perceptron-based active learner [DKM05] — non-Bayesian setting: worst-case over target hypotheses.
In either case: just O(d log 1/ε) labels needed!

15 Example: linear separators in R^d H: {homogeneous linear separators in R^d}, P: uniform distribution (as before). ρ(h) is the same for all h, and is ≥ 1/8. This sample complexity is realized by many schemes: [SOS92, FSST97] query by committee; [DKM05] perceptron-based active learner; simplest of all, [CAL92]: pick a random point whose label is not completely certain (with respect to the current version space).
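
A toy sketch of the [CAL92] rule in this setting (illustrative only; the rejection-sampling approximation of the version space and all names are ours, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# Target: a homogeneous linear separator; data uniform on the unit sphere.
w_true = rng.standard_normal(d)
w_true /= np.linalg.norm(w_true)
pool = rng.standard_normal((2000, d))
pool /= np.linalg.norm(pool, axis=1, keepdims=True)

def oracle(x):
    return int(x @ w_true >= 0)

labeled_x, labeled_y = [], []

def sample_version_space(n=50):
    """Approximate the version space by rejection-sampling unit vectors
    consistent with all labels so far (a crude stand-in for maintaining
    the exact version space; fine for a handful of queries)."""
    kept = []
    while len(kept) < n:
        cand = rng.standard_normal((2000, d))
        cand /= np.linalg.norm(cand, axis=1, keepdims=True)
        if labeled_x:
            X, y = np.array(labeled_x), np.array(labeled_y)
            cand = cand[((cand @ X.T >= 0).astype(int) == y).all(axis=1)]
        kept.extend(cand)
    return np.array(kept[:n])

queries = 0
for _ in range(8):
    committee = sample_version_space()
    frac_pos = (pool @ committee.T >= 0).mean(axis=1)
    uncertain = np.flatnonzero((frac_pos > 0) & (frac_pos < 1))
    if uncertain.size == 0:          # committee agrees on every pool point
        break
    i = rng.choice(uncertain)        # CAL rule: a random uncertain point
    labeled_x.append(pool[i])
    labeled_y.append(oracle(pool[i]))
    queries += 1

print(f"{queries} label queries made")
```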

16 Linear separators in R^d Uniform distribution: concentrated near the equator (any equator). [Figure: sphere with + and − regions.]

17 Linear separators in R^d Instead: a distribution P with a different vertical marginal. Say that, for some λ ≥ 1, U(x)/λ ≤ P(x) ≤ λ·U(x) (U = uniform). Result: ρ ≥ 1/32, provided the amount of unlabeled data grows by … Do the schemes [CAL92, SOS92, FSST97, DKM05] achieve this label complexity? [Figure: sphere with skewed + and − regions.]

18 What next 1. Make this algorithmic! Linear separators: is some kind of "querying near the current boundary" a reasonable approximation? 2. Nonseparable data: need a robust base learner! [Figure: noisy points on either side of the true boundary, marked + and −.]

19 Thanks For helpful discussions: Peter Bartlett, Yoav Freund, Adam Kalai, John Langford, Claire Monteleoni.

20 Star-shaped configurations In the vicinity of the "bad" hypothesis h_0, we find a star structure. [Figure: left, the hypothesis space with h_1, h_2, h_3, …, h_{1/ε} arranged in a star around h_0; right, the corresponding picture in data space.]

21 Example: the 1-d line Example: threshold functions on the line. Searchability index lies in the range ε ≤ ρ(h) ≤ 1. Theorem: [lower bound on slide] ≤ # labels needed ≤ [upper bound on slide]. Result: ρ = 1/2 for any target hypothesis and any input distribution, matching the log(1/ε) binary-search behavior of slide 5. [Figure: threshold w on the line, − to the left, + to the right.]

22 Linear separators in R^d Data lies on the rim of two slabs, distributed uniformly. Result: ρ = Ω(1) for most target hypotheses, but is ε for the hypothesis that makes one slab +, the other −… the most "natural" one! [Figure: two slabs around the origin, data on their rims.]

