Incorporating Unlabeled Data in the Learning Process

Incorporating Unlabeled Data in the Learning Process. Lecture 25. Maria-Florina Balcan.

Supervised Passive Learning. The data source supplies unlabeled examples; an expert/oracle labels them; the learning algorithm receives the labeled examples and outputs a classifier.

Standard Passive Supervised Learning. X is the feature space. S = {(x, l)} is a set of labeled examples drawn i.i.d. from a distribution D over X and labeled by the target concept c*. Do optimization over S to find a hypothesis h ∈ C. Goal: h has small error over D, where err(h) = Pr_{x ∼ D}[h(x) ≠ c*(x)]. In binary classification, if c* ∈ C we are in the realizable case; otherwise we are in the agnostic case. Classic models: PAC (Valiant), SLT (Vapnik).

Standard Supervised Learning Setting. Sample complexity is well understood; the classic case is a finite hypothesis space in the realizable setting.
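The bound displayed on this slide did not survive the transcript; the standard finite-class, realizable-case statement it refers to has the following form (constants vary by textbook):

```latex
% With probability at least 1 - \delta, every h \in C that is consistent with a
% labeled sample of this size has err(h) \le \epsilon.
m \;\ge\; \frac{1}{\epsilon}\Big(\ln|C| + \ln\frac{1}{\delta}\Big)
```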

Sample Complexity: Uniform Convergence Bounds, Infinite Hypothesis Case. E.g., if C is the class of linear separators in R^d, then we need roughly O(d/ε) examples to achieve generalization error ε. Non-realizable case: replace ε with ε². In PAC, we can also talk about efficient algorithms.
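The epsilons on this slide were dropped by the transcript; spelled out, the standard uniform-convergence bounds being referenced are, up to constants and log factors:

```latex
% Realizable case (VC dimension d) vs. agnostic case, where \epsilon is
% effectively replaced by \epsilon^2.
m = O\!\Big(\tfrac{1}{\epsilon}\big(d\ln\tfrac{1}{\epsilon} + \ln\tfrac{1}{\delta}\big)\Big)
\qquad\text{vs.}\qquad
m = O\!\Big(\tfrac{1}{\epsilon^{2}}\big(d + \ln\tfrac{1}{\delta}\big)\Big)
```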

Incorporating Unlabeled Data in the Learning Process. Modern applications have lots of unlabeled data, while labeled data is rare or expensive: web page and document classification; OCR and image classification; classification problems in computational biology.

Incorporating Unlabeled Data & Interaction. Areas of significant activity in modern ML. Semi-Supervised Learning: using cheap unlabeled data in addition to labeled data. Active Learning: the algorithm interactively asks for labels of informative examples. Does unlabeled data help? Does interaction help? Why, and by how much? Foundations were lacking a few years ago; there has been significant progress recently, triggered in part by my work, mostly on understanding sample complexity.

Semi-Supervised Learning. S_u = {x_i}: unlabeled examples drawn i.i.d. from D. S_l = {(x_i, y_i)}: labeled examples drawn i.i.d. from D and labeled by the target c*. The data source supplies unlabeled examples; the expert/oracle labels some of them; the learning algorithm receives both the unlabeled and the labeled examples and outputs a classifier.

Semi-Supervised Learning. Variety of methods and experimental results, e.g.: Transductive SVM [Joachims ’98]; co-training [Blum & Mitchell ’98]; graph-based methods [Blum & Chawla ’01], [Zhu, Lafferty & Ghahramani ’03]. Theoretical results were scattered and very specific (prior to 2005). A general discriminative (PAC/SLT-style) framework for SSL [Balcan & Blum, COLT 2005; JACM 2010; book chapter, 2006]. Challenge: capture many of the assumptions typically used, since different SSL algorithms are based on very different assumptions.

Example of a “typical” assumption: Margins. Belief: the target goes through low-density regions (large margin). (Figure: the same data set shown with labeled data only, the SVM separator, and the Transductive SVM separator.) Due to Joachims (see his talk tomorrow!).
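The slide only gestures at the objective through the figure; one common way to write the soft-margin transductive SVM it refers to, as a reconstruction rather than anything shown on the slide, is:

```latex
% Choose w, b, slack variables, and labels y*_j for the unlabeled points so that
% the separator has large margin on labeled AND unlabeled data.
\min_{w,\,b,\,y^{*},\,\xi,\,\xi^{*}}\;
  \tfrac{1}{2}\|w\|^{2} + C\sum_{i=1}^{\ell}\xi_i + C^{*}\sum_{j=1}^{u}\xi^{*}_j
\quad\text{s.t.}\quad
  y_i(w\cdot x_i + b) \ge 1-\xi_i,\;\;
  y^{*}_j(w\cdot x^{*}_j + b) \ge 1-\xi^{*}_j,\;\;
  \xi_i,\,\xi^{*}_j \ge 0.
```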

Another Example: Self-consistency. Agreement between two parts: co-training [Blum & Mitchell ’98]. Examples contain two sufficient sets of features, x = ⟨x1, x2⟩; belief: the parts are consistent, i.e., ∃ c1, c2 s.t. c1(x1) = c2(x2) = c*(x). For example, if we want to classify web pages, x = ⟨x1, x2⟩ where x1 is the text info and x2 is the link info. (Figure: a web page reading “My Advisor: Prof. Avrim Blum”, with its text view and its link view.)
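A minimal sketch of the co-training loop this slide describes, assuming two numeric feature views and integer labels, and using scikit-learn's Gaussian Naive Bayes as the per-view learner purely for illustration (the lecture does not fix a base learner or a confidence rule):

```python
# Co-training sketch in the spirit of Blum & Mitchell '98: each example has two
# views x = <x1, x2>, and a classifier trained on each view confidently labels
# unlabeled points, which are then added to the shared labeled pool.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, U1, U2, rounds=10, k=5):
    """X1, X2: the two views of the labeled data; y: integer labels;
    U1, U2: the two views of the unlabeled data."""
    X1, X2, y, U1, U2 = list(X1), list(X2), list(y), list(U1), list(U2)
    for _ in range(rounds):
        for view in (0, 1):
            if not U1:
                break
            Xv = X1 if view == 0 else X2
            Uv = U1 if view == 0 else U2
            h = GaussianNB().fit(np.array(Xv), np.array(y))
            conf = h.predict_proba(np.array(Uv)).max(axis=1)
            picks = set(np.argsort(-conf)[:k].tolist())   # most confident unlabeled points
            for i in picks:                               # add them with predicted labels
                X1.append(U1[i]); X2.append(U2[i])
                y.append(int(h.predict(np.array([Uv[i]]))[0]))
            U1 = [u for i, u in enumerate(U1) if i not in picks]
            U2 = [u for i, u in enumerate(U2) if i not in picks]
    # final per-view classifiers
    return (GaussianNB().fit(np.array(X1), np.array(y)),
            GaussianNB().fit(np.array(X2), np.array(y)))
```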

New discriminative model for SSL. Problems with thinking about SSL in the standard models, PAC or SLT: we learn a class C under a (known or unknown) distribution D, so there is a complete disconnect between the target and D, and unlabeled data doesn’t give any information about which c ∈ C is the target. Key insight: unlabeled data is useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution. Under what conditions will unlabeled data help, and by how much? How much data should I expect to need in order to perform well?

BB Model, Main Ideas. Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution; “learn C” becomes “learn (C, χ)” (learn class C under χ). Compatibility expresses relationships that the target and the underlying distribution are believed to possess. Idea I: use unlabeled data and the belief that the target is compatible to reduce C down to just the highly compatible functions in C. (Figure: an abstract prior χ and unlabeled data cut the class of functions C, e.g. linear separators, down to the compatible functions in C, e.g. large-margin linear separators.) Idea II: the degree of compatibility is estimated from a finite sample.
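The compatibility symbol was dropped throughout the transcript; written out as I recall it from the Balcan–Blum framework (so treat the exact form as a reconstruction), compatibility and the reduced class used in the bounds below are:

```latex
% Compatibility of a hypothesis with an example and with the distribution,
% its "unlabeled error rate", and the set of highly compatible hypotheses.
\chi : C \times X \to [0,1], \qquad
\chi(h, D) = \mathbb{E}_{x \sim D}\big[\chi(h, x)\big], \qquad
\mathrm{err}_{\mathrm{unl}}(h) = 1 - \chi(h, D),
\qquad
C_{D,\chi}(\epsilon) = \{\, h \in C : \mathrm{err}_{\mathrm{unl}}(h) \le \epsilon \,\}.
```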

Sample Complexity, Uniform Convergence Bounds. (Figure: the compatible functions in C.) Bound the number of labeled examples using a measure of the helpfulness of D with respect to χ: in the finite hypothesis space, doubly realizable case, a helpful distribution is one in which C_{D,χ}(ε) is small.

Sample Complexity, Uniform Convergence Bounds (continued). (Figure: the compatible functions in C under a helpful distribution, which concentrates its mass on highly compatible regions, versus a non-helpful one.) Finite hypothesis spaces, doubly realizable case: a helpful distribution is one in which C_{D,χ}(ε) is small.
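The bound itself is only hinted at on these two slides; up to constants, the finite-class, doubly realizable statement has roughly the following shape (a reconstruction from memory, so the constants and the exact argument of C_{D,χ}(·) should not be trusted):

```latex
% Enough unlabeled data to estimate compatibilities, and then labeled data
% scaling with the (hopefully much smaller) set of compatible hypotheses.
m_u = O\!\Big(\tfrac{1}{\epsilon}\big(\ln|C| + \ln\tfrac{1}{\delta}\big)\Big),
\qquad
m_\ell = O\!\Big(\tfrac{1}{\epsilon}\big(\ln|C_{D,\chi}(\epsilon)| + \ln\tfrac{1}{\delta}\big)\Big)
```

suffice so that, with probability at least 1 − δ, every h ∈ C that is consistent with the labeled sample and has zero estimated unlabeled error rate has err(h) ≤ ε.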

Key Aspects of the Model. Fundamental sample complexity aspects: how much unlabeled data is needed depends on both the complexity of C and the complexity of the compatibility notion; the ability of unlabeled data to reduce the number of labeled examples depends on the compatibility of the target and the helpfulness of the distribution. Our analysis suggests better ways to do regularization based on unlabeled data. The epsilon-cover-based bounds are very natural in our setting. Subsequent work using our framework: P. Bartlett & D. Rosenberg, AISTATS 2007; Kakade et al., COLT 2008; J. Shawe-Taylor et al., Neurocomputing 2007; Zhu, survey 2009.

Active Learning. The data source supplies unlabeled examples to the learning algorithm; the algorithm repeatedly requests the label of an example and receives that label from the expert/oracle, and finally outputs a classifier. This talk focuses on linear classifiers. The learner can choose specific examples to be labeled; it works harder in order to use fewer labeled examples.

What Makes a Good Algorithm? It is guaranteed to output a relatively good classifier for most learning problems, it doesn’t make too many label requests, and it chooses its label requests carefully, to get informative labels.

Can It Really Do Better Than Passive? YES! (sometimes). We often need far fewer labels for active learning than for passive learning. This is predicted by theory and has been observed in practice.

Active Learning in Practice. Active SVM (Tong & Koller, ICML 2000) seems to be quite useful in practice. At any time during the algorithm, we have a “current guess” of the separator: the max-margin separator of all labeled points so far. E.g., strategy 1: request the label of the example closest to the current separator.
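A short sketch of that query strategy, using scikit-learn's SVC as the max-margin learner; the oracle callback, the seed indices, and the query budget are illustrative assumptions rather than part of the original algorithm's specification:

```python
# Keep a max-margin separator of the labeled points seen so far and repeatedly
# request the label of the unlabeled example closest to it.
import numpy as np
from sklearn.svm import SVC

def active_svm(X, oracle, seed_idx, budget=20):
    """X: (n, d) array of points; oracle(i) returns the label of X[i];
    seed_idx: indices of a few initially labeled points covering both classes."""
    labeled = {int(i): oracle(i) for i in seed_idx}
    clf = None
    for _ in range(budget):
        idx = np.array(sorted(labeled))
        clf = SVC(kernel="linear", C=1e3).fit(X[idx], [labeled[i] for i in idx])
        rest = np.array([i for i in range(len(X)) if i not in labeled])
        if len(rest) == 0:
            break
        # distance of each still-unlabeled point to the current separator
        dist = np.abs(clf.decision_function(X[rest]))
        query = int(rest[np.argmin(dist)])     # closest point = most informative query
        labeled[query] = oracle(query)         # ask the oracle for its label
    return clf
```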

Can adaptive querying help? [CAL92, Dasgupta04]. Threshold functions on the real line: h_w(x) = 1(x ≥ w), C = {h_w : w ∈ R}. Active algorithm: sample roughly 1/ε unlabeled examples and do binary search over them; binary search needs just O(log 1/ε) labels. Passive supervised learning needs on the order of 1/ε labels to find an ε-accurate threshold, while active learning needs only O(log 1/ε) labels: an exponential improvement. There are other interesting results as well.
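A minimal sketch of that active algorithm, with a callable oracle standing in for the labeler (the function and parameter names are mine, not the lecture's):

```python
# Draw roughly 1/eps unlabeled points, then binary-search for the boundary
# between the -1 and +1 labels, spending only O(log 1/eps) label requests.
def learn_threshold(xs, oracle):
    """xs: unlabeled sample of about 1/eps points; oracle(x) in {-1, +1} is a
    threshold function 1(x >= w). Returns an estimate of the threshold w."""
    xs = sorted(xs)
    if oracle(xs[0]) == 1:            # every sampled point is positive
        return xs[0]
    if oracle(xs[-1]) == -1:          # every sampled point is negative
        return float("inf")
    lo, hi = 0, len(xs) - 1           # invariant: xs[lo] is -, xs[hi] is +
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if oracle(xs[mid]) == 1:
            hi = mid
        else:
            lo = mid
    return xs[hi]                     # first sampled point labeled +

# Example: with about 1000 random points, roughly 10 label requests recover w = 0.3.
# import random
# w_hat = learn_threshold([random.random() for _ in range(1000)],
#                         lambda x: 1 if x >= 0.3 else -1)
```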

Active Learning might not help [Dasgupta04]. In general, the number of queries needed depends on C and also on D. C = {linear separators in R^1}: active learning reduces sample complexity substantially. C = {linear separators in R^2}: there are some target hypotheses for which no improvement can be achieved, no matter how benign the input distribution. (Figure: example target hypotheses h0, h1, h2, h3.) In this case, learning to accuracy ε requires on the order of 1/ε labels.

Examples where Active Learning helps. In general, the number of queries needed depends on C and also on D. C = {linear separators in R^1}: active learning reduces sample complexity substantially, no matter what the input distribution is. C = homogeneous linear separators in R^d, D = uniform distribution over the unit sphere: we need only d log(1/ε) labels to find a hypothesis with error rate < ε. [Dasgupta, Kalai & Monteleoni, COLT 2005; Freund et al., ’97; Balcan, Broder & Zhang, COLT ’07]

Region of uncertainty [CAL92]. Current version space: the part of C consistent with the labels so far. “Region of uncertainty” = the part of data space about which there is still some uncertainty (i.e., disagreement within the version space). Example: data lies on a circle in R^2 and the hypotheses are homogeneous linear separators. (Figure: the current version space and the corresponding region of uncertainty in data space.)

Region of uncertainty [CAL92]. (Figure: current version space and region of uncertainty.) Algorithm: pick a few points at random from the current region of uncertainty and query their labels.
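A toy sketch of that loop, with an explicit finite hypothesis pool standing in for the version space so the region of uncertainty can be computed directly; the helper names, the batch size, and the round limit are illustrative choices, not part of CAL's statement:

```python
# Disagreement-based (CAL-style) active learning: query only where the current
# version space still disagrees, then prune inconsistent hypotheses.
import random

def cal_active_learning(hypotheses, pool, oracle, batch=5, rounds=20):
    """hypotheses: list of callables h(x) -> {-1, +1}; pool: list of points
    (e.g., tuples); oracle(x): true label of x. Returns the surviving version space."""
    version_space = list(hypotheses)
    unlabeled = list(pool)
    for _ in range(rounds):
        # region of uncertainty: points on which the version space still disagrees
        region = [x for x in unlabeled
                  if len({h(x) for h in version_space}) > 1]
        if not region:
            break
        queries = random.sample(region, min(batch, len(region)))
        for x in queries:                      # query labels, prune inconsistent hypotheses
            y = oracle(x)
            version_space = [h for h in version_space if h(x) == y]
            unlabeled.remove(x)
    return version_space
```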

Region of uncertainty [CAL92], continued. Once the queried labels are incorporated, the version space shrinks to the hypotheses consistent with them, and the region of uncertainty in data space shrinks correspondingly. (Figures: the current version space and region of uncertainty, followed by the new version space and the new region of uncertainty.)

Region of uncertainty [CAL92], Guarantees. Algorithm: pick a few points at random from the current region of uncertainty and query their labels. [Balcan, Beygelzimer & Langford, ICML ’06] analyze a version of this algorithm which is robust to noise. C = linear separators on the line, low noise: exponential improvement. C = homogeneous linear separators in R^d, D = uniform distribution over the unit sphere: with low noise, only d² log(1/ε) labels are needed to find a hypothesis with error rate < ε; in the realizable case, d^{3/2} log(1/ε) labels; supervised learning needs d/ε labels.