Active Learning of Binary Classifiers

Presentation on theme: "Active Learning of Binary Classifiers"— Presentation transcript:

1 Active Learning of Binary Classifiers
Presenters: Nina Balcan and Steve Hanneke Maria-Florina Balcan

2 Outline
What is Active Learning? Active Learning of Linear Separators. General Theories of Active Learning. Open Problems.

3 Supervised Passive Learning
Diagram: Data Source, Expert / Oracle, Learning Algorithm. The data source supplies unlabeled examples, the expert labels them, and the learning algorithm receives the labeled examples and outputs a classifier.

4 Incorporating Unlabeled Data in the Learning process
In many settings, unlabeled data is cheap and easy to obtain, while labeled data is much more expensive. Examples: web page and document classification; OCR and image classification.

5 Semi-Supervised Passive Learning
Diagram: Data Source, Learning Algorithm, Expert / Oracle. The learning algorithm receives unlabeled examples directly from the data source, plus labeled examples from the expert, and outputs a classifier.

6 Semi-Supervised Passive Learning
Several methods have been developed to try to use unlabeled data to improve performance, e.g.: Transductive SVM [Joachims ’98]; Co-training [Blum & Mitchell ’98], [BBY04]; Graph-based methods [Blum & Chawla01], [ZGL03].

7 Semi-Supervised Passive Learning
[Figure: + and − labeled points; the separator found by an SVM from labeled data only, compared with the separator found by a Transductive SVM.]

8 Semi-Supervised Passive Learning
Workshops: ICML ’03, ICML ’05. Books: Semi-Supervised Learning, MIT 2006, O. Chapelle, B. Schölkopf and A. Zien (eds). Theoretical models: Balcan-Blum ’05.

9 Active Learning
Diagram: Data Source, Expert / Oracle, Learning Algorithm. The algorithm receives unlabeled examples, then repeatedly sends a request for the label of an example and receives a label for that example. Finally, the algorithm outputs a classifier.

10 What Makes a Good Algorithm?
Guaranteed to output a relatively good classifier for most learning problems. Doesn’t make too many label requests. Chooses the label requests carefully, to get informative labels.

11 Can It Really Do Better Than Passive?
YES! (sometimes) We often need far fewer labels for active learning than for passive. This is predicted by theory and has been observed in practice.

12 Active Learning in Practice
Active SVM (Tong & Koller, ICML 2000) seems to be quite useful in practice. At any time during the algorithm, we have a “current guess” of the separator: the max-margin separator of all labeled points so far. E.g., strategy 1: request the label of the example closest to the current separator.
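A minimal sketch of the "closest to the current separator" selection step, assuming the current max-margin separator is already available as a weight vector `w` and offset `b` (fitting the SVM itself is not shown). The function name, the example separator, and the pool of points are all hypothetical and chosen only for illustration.

```python
import numpy as np

def closest_to_separator(w, b, X_unlabeled):
    """Return the index of the unlabeled point nearest the hyperplane w.x + b = 0."""
    # The distance from x to the hyperplane is |w.x + b| / ||w||.
    distances = np.abs(X_unlabeled @ w + b) / np.linalg.norm(w)
    return int(np.argmin(distances))

# Hypothetical current separator and unlabeled pool, for illustration only.
w = np.array([1.0, -1.0])
b = 0.0
pool = np.array([[3.0, 0.0], [1.0, 0.9], [-2.0, 2.0]])
query_index = closest_to_separator(w, b, pool)  # index of the next label request
```

In a full loop, the chosen point would be labeled, added to the labeled set, and the separator refit before the next selection.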

13 When Does it Work? And Why?
The algorithms currently used in practice are not well understood theoretically. We don’t know if or when they output a good classifier, nor can we say how many labels they will need. So we seek algorithms that we can understand and state formal guarantees for. The rest of this talk surveys recent theoretical results.

14 Standard Supervised Learning Setting
S = {(x, l)}: a set of labeled examples drawn i.i.d. from some distribution D over X and labeled by some target concept $c^* \in C$. We want to do optimization over S to find some hypothesis h, but we want h to have small error over D: $\mathrm{err}(h) = \Pr_{x \sim D}(h(x) \neq c^*(x))$. Sample complexity: finite hypothesis space, realizable case.

15 Sample Complexity: Uniform Convergence Bounds
Infinite hypothesis case. E.g., if C is the class of linear separators in $\mathbb{R}^d$, then we need roughly $O(d/\epsilon)$ examples to achieve generalization error $\epsilon$. Non-realizable case: replace $\epsilon$ with $\epsilon^2$.

16 Active Learning
We get to see unlabeled data first, and there is a charge for every label. The learner has the ability to choose specific examples to be labeled: the learner works harder, in order to use fewer labeled examples. Or alternatively, the learner has two abilities: draw an unlabeled sample from the distribution, and ask for a label of one of these samples. How many labels can we save by querying adaptively?

17 Can adaptive querying help? [CAL92, Dasgupta04]
Consider threshold functions on the real line: $h_w(x) = 1(x \geq w)$, $C = \{h_w : w \in \mathbb{R}\}$. Sample $1/\epsilon$ unlabeled examples, then binary search over them: we need just $O(\log 1/\epsilon)$ labels. Active setting: $O(\log 1/\epsilon)$ labels to find an $\epsilon$-accurate threshold. Supervised learning needs $O(1/\epsilon)$ labels. Exponential improvement in sample complexity.
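The binary search argument can be sketched as follows, assuming noise-free labels. The function name, the target threshold 0.37, and the pool size of 1000 are hypothetical values picked for illustration; the oracle stands in for the human labeler.

```python
import math
import random

def learn_threshold(points, oracle):
    """Binary-search a pool for its leftmost positive point.

    Returns (estimated threshold, number of label queries)."""
    xs = sorted(points)
    lo, hi = 0, len(xs)      # invariant: xs[:lo] are negative, xs[hi:] are positive
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]):  # label 1: the threshold is at or left of xs[mid]
            hi = mid
        else:                # label 0: the threshold is right of xs[mid]
            lo = mid + 1
    estimate = xs[lo] if lo < len(xs) else math.inf
    return estimate, queries

random.seed(0)
true_w = 0.37                                  # hypothetical target threshold
pool = [random.random() for _ in range(1000)]  # roughly 1/epsilon unlabeled points
estimate, queries = learn_threshold(pool, lambda x: x >= true_w)
```

With a pool of 1000 points the loop asks for at most 10 labels, while the returned threshold classifies every pool point exactly as the target does.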

18 Active Learning might not help [Dasgupta04]
In general, the number of queries needed depends on C and also on D. C = {linear separators in $\mathbb{R}^1$}: active learning reduces sample complexity substantially. C = {linear separators in $\mathbb{R}^2$}: there are some target hypotheses for which no improvement can be achieved, no matter how benign the input distribution. In this case, learning to accuracy $\epsilon$ requires $1/\epsilon$ labels. [Figure: separators h0, h1, h2, h3.]

19 Examples where Active Learning helps
In general, the number of queries needed depends on C and also on D. C = {linear separators in $\mathbb{R}^1$}: active learning reduces sample complexity substantially, no matter what the input distribution is. C = homogeneous linear separators in $\mathbb{R}^d$, D = uniform distribution over the unit sphere: need only $d \log 1/\epsilon$ labels to find a hypothesis with error rate $< \epsilon$ [Freund et al., ’97; Dasgupta, Kalai, Monteleoni, COLT 2005; Balcan-Broder-Zhang, COLT 07].

20 Region of uncertainty [CAL92]
Current version space: the part of C consistent with the labels so far. “Region of uncertainty”: the part of the data space about which there is still some uncertainty (i.e., disagreement within the version space). Example: data lies on a circle in $\mathbb{R}^2$ and hypotheses are homogeneous linear separators. [Figure: current version space; region of uncertainty in data space.]

21 Region of uncertainty [CAL92]
Algorithm: pick a few points at random from the current region of uncertainty and query their labels. [Figure: current version space; region of uncertainty.]
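A minimal sketch of this idea for a finite hypothesis class, assuming noise-free labels. It is a streaming variant rather than the batch version described on the slide: points arrive one at a time, and a label is requested only when the surviving hypotheses disagree on the point. The grid of thresholds, the target, and the function name are hypothetical.

```python
import random

def cal_active_learn(hypotheses, unlabeled, oracle):
    """Query a label only when the surviving hypotheses disagree on the point."""
    version_space = list(hypotheses)
    queries = 0
    for x in unlabeled:
        predictions = {h(x) for h in version_space}
        if len(predictions) > 1:  # x lies in the region of uncertainty
            y = oracle(x)
            queries += 1
            # Keep only the hypotheses consistent with the new label.
            version_space = [h for h in version_space if h(x) == y]
    return version_space, queries

# Hypothetical setup: thresholds on a grid, with target threshold 0.5.
thresholds = [i / 100 for i in range(101)]
hypotheses = [lambda x, w=w: x >= w for w in thresholds]
target = hypotheses[50]
random.seed(1)
stream = [random.random() for _ in range(500)]
version_space, queries = cal_active_learn(hypotheses, stream, target)
```

Points outside the region of uncertainty cost nothing, because every hypothesis still in the version space already agrees on their label.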

22 Region of uncertainty [CAL92]
[Figure: current version space and region of uncertainty in data space, with a newly labeled + point.]

23 Region of uncertainty [CAL92]
[Figure: the new version space and the new region of uncertainty in data space after the + labels.]

24 Region of uncertainty [CAL92], Guarantees
Algorithm: pick a few points at random from the current region of uncertainty and query their labels. [Balcan, Beygelzimer, Langford, ICML ’06] analyze a version of this algorithm which is robust to noise. C = linear separators on the line, low noise: exponential improvement. C = homogeneous linear separators in $\mathbb{R}^d$, D = uniform distribution over the unit sphere: with low noise, need only $d^2 \log 1/\epsilon$ labels to find a hypothesis with error rate $< \epsilon$; in the realizable case, $d^{3/2} \log 1/\epsilon$ labels; supervised learning needs $d/\epsilon$ labels.

25 Margin Based Active-Learning Algorithm
[Balcan-Broder-Zhang, COLT 07] Use O(d) examples to find $w_1$ of error 1/8. Iterate for $k = 2, \ldots, \log(1/\epsilon)$: rejection sample $m_k$ samples x from D satisfying $|w_{k-1}^T \cdot x| \leq \gamma_k$; label them; find $w_k \in B(w_{k-1}, 1/2^k)$ consistent with all these examples. End iterate. [Figure: $w_k$, $w_{k+1}$, $\gamma_k$, $w^*$.]
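The rejection-sampling step of each round can be sketched as below, assuming D is uniform over the unit sphere. Only the sampling is shown; the search for a consistent separator in the ball around the previous one is left out. The function name, the band width 0.25, and the dimension are hypothetical choices for illustration.

```python
import numpy as np

def sample_in_band(w_prev, gamma, m, rng):
    """Rejection-sample m points, uniform on the unit sphere, with |w_prev.x| <= gamma."""
    accepted = []
    draws = 0
    while len(accepted) < m:
        x = rng.standard_normal(w_prev.shape[0])
        x /= np.linalg.norm(x)           # a normalized Gaussian is uniform on the sphere
        draws += 1
        if abs(w_prev @ x) <= gamma:     # keep only points inside the margin band
            accepted.append(x)
    return np.array(accepted), draws

rng = np.random.default_rng(0)
w_prev = np.array([1.0, 0.0, 0.0])       # hypothetical previous separator (unit vector)
band, draws = sample_in_band(w_prev, gamma=0.25, m=20, rng=rng)
```

Only the accepted points are sent for labeling, so the discarded draws cost unlabeled examples but no labels.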

26 Margin Based Active-Learning [BBZ’07]
[Figure: $w_k$ and the region of uncertainty in data space.]

27 BBZ’07, Proof Idea
Iterate for $k = 2, \ldots, \log(1/\epsilon)$: rejection sample $m_k$ samples x from D satisfying $|w_{k-1}^T \cdot x| \leq \gamma_k$; ask for labels and find $w_k \in B(w_{k-1}, 1/2^k)$ consistent with all these examples. End iterate. Assume $w_k$ has error $\leq \alpha$. We are done if $\exists \gamma_k$ s.t. $w_{k+1}$ has error $\leq \alpha/2$ and we only need $O(d \log(1/\epsilon))$ labels in round k. [Figure: $w_k$, $w_{k+1}$, $\gamma_k$, $w^*$.]

29 BBZ’07, Proof Idea
Key point: under the uniform distribution assumption, for [choice of $\gamma_k$ missing] we have [expression missing] $\leq \alpha/4$. Well separated concepts have most of their disagreement here… Each example is in the error region of many concepts… [Figure: $w_k$, $w_{k+1}$, $\gamma_k$, $w^*$.]

30 BBZ’07, Proof Idea
Key point: under the uniform distribution assumption, for [choice of $\gamma_k$ missing] we have [expression missing] $\leq \alpha/4$. Key point: so it’s enough to ensure that [expression missing]. The region is “small enough” (small in terms of the original probability measure) so that, if we look at the conditional distribution given that region, the error under the conditional distribution gets very large… We can do so by only using $O(d \log(1/\epsilon))$ labels in round k. [Figure: $w_k$, $w_{k+1}$, $\gamma_k$, $w^*$.]

31 Our Algorithm: Extensions
A robust version: add a testing step. Deals with certain types of noise and a more general class of distributions.

32 General Theories of Active Learning

33 General Concept Spaces
In the general learning problem, there is a concept space C, and we want to find an $\epsilon$-optimal classifier $h \in C$ with high probability $1-\delta$.

34 How Many Labels Do We Need?
In passive learning, we know of an algorithm (empirical risk minimization) that needs only [bound missing] labels (for realizable learning), and [bound missing] if there is noise. We also know this is close to the best we can expect from any passive algorithm. Here VC dimension completely specifies the sample complexity.

35 How Many Labels Do We Need?
As before, we want to explore the analogous idea for Active Learning (but now for a general concept space C). How many label requests are necessary and sufficient for Active Learning? What are the relevant complexity measures (i.e., the Active Learning analogue of VC dimension)?

36 What ARE the Interesting Quantities?
Generally speaking, we want examples whose labels are highly controversial among the set of remaining concepts. The likelihood of drawing such an informative example is an important quantity to consider. But there are many ways to define “informative” in general.

37 What Do You Mean By “Informative”?
Want examples that reduce the version space. But how do we measure progress? A problem-specific measure P on C? The diameter? The measure of the region of disagreement? Cover size? (See e.g. Hanneke, COLT 2007.) All of these seem to have interesting theories associated with them. As an example, let’s take a look at diameter in detail.

38 Diameter (Dasgupta, NIPS 2005)
Define distance $d(g,h) = \Pr(g(X) \neq h(X))$. One way to guarantee our classifier is within $\epsilon$ of the target classifier is to (safely) reduce the diameter to size $\epsilon$. Imagine each pair of concepts separated by distance $> \epsilon$ has an edge between them. We have to rule out at least one of the two concepts for each edge. Each unlabeled example X partitions the concepts into two sets, and guarantees some fraction of the edges will have at least one concept contradicted, no matter which label it has.
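To make the distance and diameter concrete, here is a small sketch that estimates both on a finite unlabeled sample, assuming a finite set of concepts. The function names, the integer sample, and the three threshold concepts are hypothetical.

```python
from itertools import combinations

def distance(g, h, sample):
    """Empirical estimate of d(g, h) = Pr(g(X) != h(X)) over an unlabeled sample."""
    return sum(g(x) != h(x) for x in sample) / len(sample)

def diameter(concepts, sample):
    """Largest pairwise distance within a finite set of concepts."""
    pairs = combinations(concepts, 2)
    return max((distance(g, h, sample) for g, h in pairs), default=0.0)

sample = list(range(10))                                      # hypothetical unlabeled points
concepts = [lambda x, w=w: x >= w for w in (2, 5, 7)]         # threshold concepts
current_diameter = diameter(concepts, sample)
```

Here the thresholds at 2 and 7 disagree on the five points 2 through 6, so the estimated diameter of this set is 0.5.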

39 Diameter Theorem: (Dasgupta, NIPS 2005)
If, for any finite subset $V \subseteq C$, $\Pr_X(X \text{ eliminates a } \rho \text{ fraction of the edges}) \geq \tau$, then (assuming no noise) we can reduce the diameter to $\epsilon$ using a number of label requests at most [bound missing]. Furthermore, there is an algorithm that does this, which with high probability requires a number of unlabeled examples at most [bound missing]. The algorithm is just what you’d expect. Suppose we have a finite C. Draw unlabeled examples until we get a good one, query its label, throw away inconsistent hypotheses, and repeat. If we don’t have a finite C, we can use a cover.
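The "draw until we get a good one" loop can be sketched for a finite C as follows. This is a simplified illustration under stated assumptions, not the algorithm from the paper: labels are noise-free, distances are estimated on the same finite sample that candidates are drawn from, and the loop stops early if no sample point splits the edges well enough. All names and the values of `eps` and `rho` are hypothetical.

```python
from itertools import combinations

def disagreement(g, h, sample):
    return sum(g(x) != h(x) for x in sample) / len(sample)

def long_edges(V, sample, eps):
    """Pairs of concepts whose estimated distance exceeds eps."""
    return [(g, h) for g, h in combinations(V, 2) if disagreement(g, h, sample) > eps]

def split_fraction(x, edges):
    """Worst-case fraction of edges that lose an endpoint once x is labeled."""
    if not edges:
        return 1.0
    both_pos = sum(1 for g, h in edges if g(x) and h(x))
    both_neg = sum(1 for g, h in edges if not g(x) and not h(x))
    return 1.0 - max(both_pos, both_neg) / len(edges)

def reduce_diameter(V, sample, oracle, eps, rho):
    assert rho > 0                 # each query must then remove at least one concept
    V, queries = list(V), 0
    while True:
        edges = long_edges(V, sample, eps)
        if not edges:
            return V, queries      # estimated diameter is now at most eps
        good = next((x for x in sample if split_fraction(x, edges) >= rho), None)
        if good is None:
            return V, queries      # no sufficiently splitting point in this sample
        y = oracle(good)
        queries += 1
        V = [h for h in V if h(good) == y]   # throw away inconsistent hypotheses

sample = list(range(10))
C = [lambda x, w=w: x >= w for w in range(11)]   # hypothetical finite concept class
target = C[6]
V, queries = reduce_diameter(C, sample, target, eps=0.15, rho=0.3)
```

Because the target is never contradicted by its own labels, it always survives the pruning step.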

40 Open Problems in Active Learning
Efficient (correct) learning algorithms for linear separators provably achieving significant improvements on many distributions. What about binary feature spaces? Tight general-purpose sample complexity bounds, for both the realizable and agnostic cases. An optimal active learning algorithm? The first was posed as an Open Problem at COLT 2006, which ties a little more prestige to a solution.

