Active Learning of Binary Classifiers

Presentation transcript:

Active Learning of Binary Classifiers Presenters: Nina Balcan and Steve Hanneke

Outline What is Active Learning? Active Learning of Linear Separators. General Theories of Active Learning. Open Problems.

Supervised Passive Learning The data source provides unlabeled examples; the expert/oracle labels all of them; the learning algorithm receives the labeled examples and outputs a classifier.

Incorporating Unlabeled Data in the Learning Process In many settings, unlabeled data is cheap & easy to obtain, while labeled data is much more expensive. Examples: web page and document classification, OCR, image classification.

Semi-Supervised Passive Learning The data source provides many unlabeled examples; the expert/oracle labels only some of them; the learning algorithm uses both the labeled examples and the remaining unlabeled examples and outputs a classifier.

Semi-Supervised Passive Learning Several methods have been developed to try to use unlabeled data to improve performance, e.g.: Transductive SVM [Joachims ’98]; Co-training [Blum & Mitchell ’98], [BBY04]; Graph-based methods [Blum & Chawla ’01], [ZGL03]. (Figure: SVM trained on the labeled data only vs. the transductive SVM.) Workshops [ICML ’03, ICML ’05]. Book: Semi-Supervised Learning, MIT Press 2006, O. Chapelle, B. Scholkopf and A. Zien (eds). Theoretical models: Balcan-Blum ’05.

Active Learning The data source provides unlabeled examples; the learning algorithm repeatedly requests the label of a chosen example and the expert/oracle returns a label for that example; after a number of such rounds, the algorithm outputs a classifier.

What Makes a Good Algorithm? Guaranteed to output a relatively good classifier for most learning problems. Doesn’t make too many label requests. Chooses the label requests carefully, to get informative labels.

Can It Really Do Better Than Passive? YES! (sometimes) We often need far fewer labels for active learning than for passive. This is predicted by theory and has been observed in practice.

Active Learning in Practice Active SVM (Tong & Koller, ICML 2000) seems to be quite useful in practice. At any time during the algorithm, we have a “current guess” of the separator: the max-margin separator of all labeled points so far. E.g., strategy 1: request the label of the example closest to the current separator.
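For concreteness, a minimal sketch of strategy 1 (query the pool point closest to the current max-margin separator), assuming scikit-learn and NumPy are available; X_pool, oracle_label, seed_idx, and budget are illustrative placeholders, not anything from the original slides.

```python
# Minimal sketch of "strategy 1": query the unlabeled point closest to the
# current max-margin separator. X_pool (array of feature vectors), oracle_label,
# seed_idx, and budget are illustrative placeholders. seed_idx must contain
# points from both classes and budget must be >= 1.
import numpy as np
from sklearn.svm import SVC

def active_svm(X_pool, oracle_label, seed_idx, budget):
    labeled = list(seed_idx)
    y = {i: oracle_label(X_pool[i]) for i in labeled}
    for _ in range(budget):
        clf = SVC(kernel="linear", C=1e6)             # (near) hard-margin linear SVM
        clf.fit(X_pool[labeled], [y[i] for i in labeled])
        dist = np.abs(clf.decision_function(X_pool))  # distance-like score to separator
        dist[labeled] = np.inf                        # never re-query a labeled point
        i = int(np.argmin(dist))                      # most "uncertain" pool point
        y[i] = oracle_label(X_pool[i])                # one label request
        labeled.append(i)
    return clf
```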

When Does It Work? And Why? The algorithms currently used in practice are not well understood theoretically. We don’t know if/when they output a good classifier, nor can we say how many labels they will need. So we seek algorithms that we can understand and state formal guarantees for. The rest of this talk surveys recent theoretical results.

Standard Supervised Learning Setting S = {(x, l)}: a set of labeled examples drawn i.i.d. from some distribution D over X and labeled by some target concept c* ∈ C. We want to do optimization over S to find some hypothesis h, but we want h to have small error over D: err(h) = Pr_{x ∼ D}(h(x) ≠ c*(x)). Sample complexity, finite hypothesis space, realizable case:
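Presumably the bound intended here is the standard PAC bound for a finite hypothesis space in the realizable case:

```latex
% Finite hypothesis space, realizable case: with probability at least 1 - \delta,
% every h \in C that is consistent with S has true error at most \epsilon, provided
|S| \;\ge\; \frac{1}{\epsilon}\left(\ln |C| + \ln\frac{1}{\delta}\right).
```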

Sample Complexity: Uniform Convergence Bounds, Infinite Hypothesis Case E.g., if C is the class of linear separators in R^d, then we need roughly O(d/ε) examples to achieve generalization error ε. Non-realizable case – replace ε with ε².

Active Learning How many labels can we save by querying adaptively? We get to see unlabeled data first, and there is a charge for every label. The learner has the ability to choose specific examples to be labeled: the learner works harder, in order to use fewer labeled examples. Alternatively, the learner has two abilities: draw an unlabeled sample from the distribution, and ask for the label of one of these samples.

Can adaptive querying help? [CAL92, Dasgupta04] Consider threshold functions on the real line: h_w(x) = 1(x ≥ w), C = {h_w : w ∈ R}. Sample 1/ε unlabeled examples; binary search needs just O(log 1/ε) labels. Active setting: O(log 1/ε) labels to find an ε-accurate threshold. Supervised learning needs O(1/ε) labels. Exponential improvement in sample complexity.
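A sketch of the threshold example in code; the unlabeled sample and the labeling oracle are illustrative placeholders, and the number of label requests is roughly log2 of the pool size.

```python
# Sketch of active learning of a 1-D threshold h_w(x) = 1(x >= w):
# draw ~1/eps unlabeled points, then binary-search for the +/- boundary
# among the sorted points, using only O(log 1/eps) label requests.
# oracle_label is an illustrative placeholder for the labeling oracle.
import numpy as np

def learn_threshold(unlabeled, oracle_label):
    xs = np.sort(np.asarray(unlabeled))
    lo, hi = -1, len(xs)                 # invariant: xs[:lo+1] are -, xs[hi:] are +
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if oracle_label(xs[mid]) == 1:   # one label request per iteration
            hi = mid
        else:
            lo = mid
    if hi == len(xs):
        return np.inf                    # every sampled point is negative
    return xs[hi]                        # smallest sampled point labeled positive
```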

Active Learning might not help [Dasgupta04] In general, the number of queries needed depends on C and also on D. C = {linear separators in R^1}: active learning reduces sample complexity substantially. C = {linear separators in R^2}: there are some target hypotheses for which no improvement can be achieved, no matter how benign the input distribution. In this case, learning to accuracy ε requires 1/ε labels… (Figure: hypotheses h_0, h_1, h_2, h_3.)

Examples where Active Learning helps In general, the number of queries needed depends on C and also on D. C = {linear separators in R^1}: active learning reduces sample complexity substantially, no matter what the input distribution is. C = homogeneous linear separators in R^d, D = uniform distribution over the unit sphere: need only O(d log 1/ε) labels to find a hypothesis with error rate < ε. [Freund et al. ’97; Dasgupta, Kalai, Monteleoni, COLT 2005; Balcan-Broder-Zhang, COLT ’07]

Region of uncertainty [CAL92] Current version space: the part of C consistent with the labels so far. “Region of uncertainty” = the part of data space about which there is still some uncertainty (i.e., disagreement within the version space). Example: data lies on a circle in R^2 and the hypotheses are homogeneous linear separators. (Figure: current version space and the corresponding region of uncertainty in data space.)

Region of uncertainty [CAL92] Algorithm: pick a few points at random from the current region of uncertainty and query their labels.
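A sketch of this loop for a finite hypothesis class; hypotheses, pool, oracle_label, and budget are illustrative placeholders, and a real instantiation would sample the region of uncertainty rather than enumerate it.

```python
# Sketch of the CAL-style loop: keep the version space (hypotheses consistent
# with the labels so far), query only points on which it still disagrees, prune.
# Assumes the realizable case (the target concept is among `hypotheses`).
import random

def cal_active_learning(hypotheses, pool, oracle_label, budget):
    version_space = list(hypotheses)               # each h maps x -> {0, 1}
    for _ in range(budget):
        # region of uncertainty: pool points the version space disagrees on
        uncertain = [x for x in pool
                     if len({h(x) for h in version_space}) > 1]
        if not uncertain:
            break                                  # all remaining points are determined
        x = random.choice(uncertain)
        y = oracle_label(x)                        # one label request
        version_space = [h for h in version_space if h(x) == y]
    return version_space
```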

Region of uncertainty [CAL92] After these queries, hypotheses inconsistent with the new labels are eliminated, leaving a new (smaller) version space and a new (smaller) region of uncertainty in data space; the process then repeats.

Region of uncertainty [CAL92], Guarantees Algorithm: pick a few points at random from the current region of uncertainty and query their labels. [Balcan, Beygelzimer, Langford, ICML ’06] analyze a version of this algorithm which is robust to noise. C = linear separators on the line: low noise, exponential improvement. C = homogeneous linear separators in R^d, D = uniform distribution over the unit sphere: with low noise, need only O(d² log 1/ε) labels to find a hypothesis with error rate < ε; in the realizable case, O(d^{3/2} log 1/ε) labels; supervised learning needs O(d/ε) labels.

Margin Based Active-Learning Algorithm [Balcan-Broder-Zhang, COLT ’07] Use O(d) examples to find w_1 of error at most 1/8. Iterate k = 2, …, log(1/ε): rejection sample m_k samples x from D satisfying |w_{k-1} · x| ≤ γ_k; label them; find w_k ∈ B(w_{k-1}, 1/2^k) consistent with all these examples. End iterate. (Figure: w_k, w_{k+1}, margin γ_k, target w*.)
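A rough sketch of this loop in code. sample_unlabeled, oracle_label, fit_in_ball, and the γ_k / m_k schedules are illustrative placeholders rather than the constants from the paper; fit_in_ball stands in for the step "find w_k ∈ B(w_{k-1}, 1/2^k) consistent with the labeled band points".

```python
# Rough sketch of the margin-based active learning loop (realizable case,
# uniform distribution). gamma_k, m_k, and the helper fit_in_ball (find a
# consistent separator within a ball around the previous one) are illustrative
# placeholders, not the exact quantities from BBZ'07.
import numpy as np

def margin_based_active(sample_unlabeled, oracle_label, d, eps, fit_in_ball):
    X0 = sample_unlabeled(10 * d)                    # round 1: passive, O(d) labels
    w = fit_in_ball(X0, [oracle_label(x) for x in X0], center=None, radius=None)
    for k in range(2, int(np.ceil(np.log2(1.0 / eps))) + 1):
        gamma_k = 1.0 / (2 ** k * np.sqrt(d))        # width of the sampling band
        m_k = 10 * d                                 # label budget for this round
        band = []
        while len(band) < m_k:                       # rejection sample inside the band
            x = sample_unlabeled(1)[0]
            if abs(np.dot(w, x)) <= gamma_k:
                band.append(x)
        labels = [oracle_label(x) for x in band]     # m_k label requests
        w = fit_in_ball(band, labels,                # stay within B(w_{k-1}, 1/2^k)
                        center=w, radius=1.0 / (2 ** k))
    return w
```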

Margin Based Active-Learning [BBZ’07] (Figure: the band around the current separator w_k is the region of uncertainty in data space.)

BBZ’07, Proof Idea Iterate k = 2, …, log(1/ε): rejection sample m_k samples x from D satisfying |w_{k-1} · x| ≤ γ_k; ask for their labels and find w_k ∈ B(w_{k-1}, 1/2^k) consistent with all these examples. End iterate. Induction step: assume w_k has error ≤ α; we are done if we can show that w_{k+1} has error ≤ α/2 while using only O(d log(1/ε)) labels in round k. (Figure: w_k, w_{k+1}, γ_k, w*.)

BBZ’07, Proof Idea · /4 iterate k=2, … , log(1/) Rejection sample mk samples x from D satisfying |wk-1T ¢ x| · k ; ask for labels and find wk 2 B(wk-1, 1/2k ) consistent with all these examples. end iterate Assume wk has error · . We are done if 9 k s.t. wk+1 has error · /2 and only need O(d log( 1/)) labels in round k. wk wk+1 γk w* Key Point Under the uniform distr. assumption for we have Well separated concepts have most of their disagreement here… … Each example in the error region of many concepts …. · /4 Maria-Florina Balcan

BBZ’07, Proof Idea Key point (continued): so it is enough to ensure that w_{k+1} has small error under the conditional distribution given the band |w_k · x| ≤ γ_k. The band is “small enough” (small in terms of the original probability measure), so the error we can tolerate under this conditional distribution becomes comparatively large (a constant), and we can achieve it using only O(d log(1/ε)) labels in round k.

Our Algorithm: Extensions A robust version: add a testing step. Deals with certain types of noise and a more general class of distributions.

General Theories of Active Learning

General Concept Spaces In the general learning problem, there is a concept space C, and we want to find an ε-optimal classifier h ∈ C with high probability 1 − δ.

How Many Labels Do We Need? In passive learning, we know of an algorithm (empirical risk minimization) that needs only roughly O(d/ε) labels (for realizable learning), and O(d/ε²) if there is noise, where d is the VC dimension. We also know this is close to the best we can expect from any passive algorithm. Here the VC dimension essentially determines the sample complexity.
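Presumably the two label counts intended here are the standard VC-based bounds, with d the VC dimension of C:

```latex
% Passive sample complexity (d = VC dimension of C):
m_{\mathrm{realizable}} = O\!\left(\frac{1}{\epsilon}\left(d \ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right)\right),
\qquad
m_{\mathrm{agnostic}} = O\!\left(\frac{1}{\epsilon^{2}}\left(d + \ln\frac{1}{\delta}\right)\right).
```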

How Many Labels Do We Need? As before, we want to explore the analogous idea for Active Learning (but now for a general concept space C). How many label requests are necessary and sufficient for Active Learning? What are the relevant complexity measures (i.e., the Active Learning analogue of VC dimension)?

What ARE the Interesting Quantities? Generally speaking, we want examples whose labels are highly controversial among the set of remaining concepts. The likelihood of drawing such an informative example is an important quantity to consider. But there are many ways to define “informative” in general. Maria-Florina Balcan

What Do You Mean By “Informative”? Want examples that reduce the version space. But how do we measure progress? A problem-specific measure P on C? The Diameter? Measure of the region of disagreement? Cover size? (see e.g., Hanneke, COLT 2007) All of these seem to have interesting theories associated with them. As an example, let’s take a look at Diameter in detail. Maria-Florina Balcan

Diameter (Dasgupta, NIPS 2005) Define the distance d(g, h) = Pr(g(X) ≠ h(X)). Imagine each pair of concepts separated by distance > ε has an edge between them; we have to rule out at least one of the two concepts for each edge. Each unlabeled example X partitions the concepts into two sets, and guarantees that some fraction of the edges will have at least one endpoint contradicted, no matter which label it has. One way to guarantee our classifier is within ε of the target classifier is to (safely) reduce the diameter to size ε.

Diameter Theorem (Dasgupta, NIPS 2005): If, for any finite subset V ⊆ C, Pr_X(X eliminates a ρ fraction of the edges) ≥ τ, then (assuming no noise) we can reduce the diameter to ε using a bounded number of label requests (each good query removes a ρ fraction of the remaining edges, so on the order of (1/ρ) log(number of edges) queries suffice); furthermore, there is an algorithm that does this which, with high probability, also requires only a bounded number of unlabeled examples. The algorithm is just what you’d expect: suppose we have a finite C; draw unlabeled examples until we get a good one, query its label, throw away inconsistent hypotheses, and repeat. If C is not finite, we can use a cover.
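A sketch of that greedy loop for a finite class; C, sample_unlabeled, oracle_label, dist, rho, and eps are illustrative placeholders, and "good" is interpreted here as an example on which at least a ρ fraction of the remaining long edges disagree.

```python
# Sketch of the greedy diameter-reduction loop for a finite class C:
# draw unlabeled points until one splits a rho fraction of the "long" edges
# (pairs of concepts more than eps apart), query it, discard inconsistent
# hypotheses, and repeat until the surviving version space has diameter <= eps.
import itertools

def greedy_diameter_reduction(C, sample_unlabeled, oracle_label, dist, rho, eps):
    V = list(C)
    while True:
        edges = [(g, h) for g, h in itertools.combinations(V, 2) if dist(g, h) > eps]
        if not edges:
            return V                       # diameter is now at most eps
        while True:                        # wait for a "good" unlabeled example
            x = sample_unlabeled()
            split = sum(1 for g, h in edges if g(x) != h(x))
            if split >= rho * len(edges):
                break
        y = oracle_label(x)                # one label request
        V = [h for h in V if h(x) == y]    # no noise: the target always survives
```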

Open Problems in Active Learning Efficient (correct) learning algorithms for linear separators provably achieving significant improvements on many distributions. What about binary feature spaces? Tight general-purpose sample complexity bounds, for both the realizable and agnostic cases. An optimal active learning algorithm? The first of these was posted as an Open Problem at COLT 2006, which ties a little more prestige to a solution.