Semi-Supervised Learning Avrim Blum Carnegie Mellon University [USC CS Distinguished Lecture Series, 2008]

Semi-Supervised Learning Supervised Learning = learning from labeled data. Dominant paradigm in Machine Learning. E.g., say you want to train an email classifier to distinguish spam from important messages. Take a sample S of email data, labeled according to whether or not each message was spam. Train a classifier (like an SVM, decision tree, etc.) on S. Make sure it's not overfitting. Use it to classify new emails.

The basic paradigm has many successes: recognize speech, steer a car, classify documents, classify proteins, recognize faces and objects in images, ...

However, for many problems, labeled data can be rare or expensive: you need to pay someone to do the labeling, or it requires special testing. Unlabeled data is much cheaper. Examples: speech, images, medical outcomes, customer modeling, protein sequences, web pages. [Image from Jerry Zhu] Can we make use of cheap unlabeled data?

Semi-Supervised Learning Can we use unlabeled data to augment a small labeled sample to improve learning? But unlabeled data is missing the most important info!! But maybe it still has useful regularities that we can use. But... but... but...

Semi-Supervised Learning Substantial recent work in ML; a number of interesting methods have been developed. This talk: discuss several diverse methods for taking advantage of unlabeled data, and a general framework to understand when unlabeled data can help and to make sense of what's going on. [Joint work with Nina Balcan]

Method 1: Co-Training

Co-training [Blum&Mitchell98] Many problems have two different sources of info you can use to determine the label. E.g., classifying webpages: can use the words on the page, or the words on the links pointing to the page. [Figure: example webpage "My Advisor: Prof. Avrim Blum"; x1 = link info, x2 = text info; x = (link info, text info)]

Co-training Idea: Use the small labeled sample to learn initial rules. E.g., "my advisor" pointing to a page is a good indicator that it is a faculty home page. E.g., "I am teaching" on a page is a good indicator that it is a faculty home page. Then look for unlabeled examples ⟨x1, x2⟩ where one rule is confident and the other is not, and have it label the example for the other. I.e., train two classifiers, one on each type of info, and use each to help train the other.

Co-training Turns out a number of problems can be set up this way. E.g., [Levin-Viola-Freund03]: identifying objects in images, using two different kinds of preprocessing. E.g., [Collins&Singer99]: named-entity extraction ("I arrived in London yesterday...").

Co-training Setting: each example is x = ⟨x1, x2⟩, where x1, x2 are two "views" of the data. Have separate algorithms running on each view, and use each to help train the other. The basic hope is that the two views are consistent, so agreement can be used as a proxy for labeled data.
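A minimal sketch of this loop (not the exact [BM98] procedure): it assumes NumPy and scikit-learn are available, and the base learner (Gaussian Naive Bayes), `n_rounds`, and `k` are illustrative choices, not from the talk. The original paper also labels from a small random pool of unlabeled examples each round; this sketch simply scans them all.

```python
# Sketch of co-training under the assumptions above.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, labeled_idx, n_rounds=10, k=5):
    """X1, X2: the two views (n x d arrays). y: labels; only entries in
    labeled_idx are trusted. Returns the two trained classifiers."""
    y = np.array(y)
    labeled = set(labeled_idx)
    h1, h2 = GaussianNB(), GaussianNB()
    for _ in range(n_rounds):
        idx = sorted(labeled)
        h1.fit(X1[idx], y[idx])
        h2.fit(X2[idx], y[idx])
        # Each view labels the k unlabeled points it is most confident about;
        # those pseudo-labels feed the other view's next round of training.
        for h, X in ((h1, X1), (h2, X2)):
            unlabeled = [i for i in range(len(y)) if i not in labeled]
            if not unlabeled:
                return h1, h2
            conf = h.predict_proba(X[unlabeled]).max(axis=1)
            for t in np.argsort(-conf)[:k]:
                i = unlabeled[t]
                y[i] = h.predict(X[i:i + 1])[0]
                labeled.add(i)
    return h1, h2
```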

Toy example: intervals. As a simple example, suppose x1, x2 ∈ R and the target function on each view is some interval: [a1, b1] on the first view and [a2, b2] on the second.

Results: webpages 12 labeled examples, 1000 unlabeled (sample run)

Results: images [Levin-Viola-Freund 03]: visual detectors with different kinds of processing. Images with 50 labeled cars, 22,000 unlabeled images. Factor 2-3+ improvement. (From [LVF03].)

Co-Training Theorems [BM98]: if x1, x2 are independent given the label, and if you have an algorithm that is robust to noise, then you can learn from an initial weakly-useful rule (e.g., "my advisor") plus unlabeled data. [Figure: diagram relating faculty home pages, faculty with advisees, and pages pointed to by "my advisor".] [BB05]: in some cases (e.g., LTFs), you can use this to learn from a single labeled example! [BBY04]: if the algorithms are correct when they are confident, then it suffices for the distribution to have "expansion". ...

Method 2: Semi-Supervised (Transductive) SVM

S3VM [Joachims98] Suppose we believe the target separator goes through low-density regions of the space (i.e., has large margin). Aim for a separator with large margin with respect to the labeled and unlabeled data (L + U). [Figure: labeled data alone gives one SVM separator; adding unlabeled data moves the S3VM separator into a low-density region.] Unfortunately, the optimization problem is now NP-hard. The algorithm instead does local optimization: start with a large-margin separator over the labeled data, which induces labels on U; then try flipping labels in a greedy fashion. (Or use branch-and-bound and other methods [Chapelle et al. 06].) Quite successful on text data.
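A rough sketch of that local-search idea (not Joachims' actual TSVM implementation): it assumes scikit-learn, a simplified hinge-loss-plus-regularizer surrogate for the margin objective, and +1/-1 labels; `n_passes` and the refit-per-flip strategy are illustrative only.

```python
# Illustrative S3VM-style local search under the assumptions above.
import numpy as np
from sklearn.svm import LinearSVC

def s3vm_local_search(X_lab, y_lab, X_unl, n_passes=3, C=1.0):
    """y_lab must use +1/-1 labels (an assumption of this sketch)."""
    X_all = np.vstack([X_lab, X_unl])

    def objective(y_unl):
        y_all = np.concatenate([y_lab, y_unl])
        clf = LinearSVC(C=C).fit(X_all, y_all)
        margins = y_all * clf.decision_function(X_all)
        # Hinge loss over L+U plus regularizer: smaller value ~ larger margin.
        return C * np.maximum(0.0, 1.0 - margins).sum() + 0.5 * (clf.coef_ ** 2).sum()

    # Step 1: a large-margin separator over the labeled data induces labels on U.
    y_unl = LinearSVC(C=C).fit(X_lab, y_lab).predict(X_unl)
    best = objective(y_unl)

    # Step 2: greedily flip induced labels while the objective keeps improving.
    # (Refitting per flip is slow; a real implementation would be smarter.)
    for _ in range(n_passes):
        improved = False
        for i in range(len(y_unl)):
            y_try = y_unl.copy()
            y_try[i] = -y_try[i]
            val = objective(y_try)
            if val < best:
                best, y_unl, improved = val, y_try, True
        if not improved:
            break

    final = LinearSVC(C=C).fit(X_all, np.concatenate([y_lab, y_unl]))
    return final, y_unl
```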

Method 3: Graph-based methods

Graph-based methods Suppose we believe that very similar examples probably have the same label. If you have a lot of labeled data, this suggests a Nearest-Neighbor type of algorithm. If you have a lot of unlabeled data, perhaps you can use them as "stepping stones". E.g., handwritten digits [Zhu07].

Graph-based methods Idea: construct a graph with edges between very similar examples. Unlabeled data can help "glue" the objects of the same class together. Solve for: minimum cut [BC, BLRR]; minimum "soft" cut [ZGL], minimizing Σ_{e=(u,v)} (f(u) - f(v))²; spectral partitioning [J]; ...

Graph-based methods Suppose there are just two labels: 0 & 1. Solve for labels 0 ≤ f(x) ≤ 1 on the unlabeled examples x to minimize: Σ_{e=(u,v)} |f(u) - f(v)| [soln = minimum cut], or Σ_{e=(u,v)} (f(u) - f(v))² [soln = electric potentials], or ...
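A compact sketch of the squared ("soft cut") version, in the spirit of [ZGL]: minimize the quadratic objective with f fixed to 0/1 on labeled nodes. The kNN graph construction, scikit-learn/NumPy usage, and the name `harmonic_labels` are choices made for illustration.

```python
# Harmonic-function / soft-cut sketch under the assumptions above.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def harmonic_labels(X, y, labeled_idx, k=10):
    """y: array of 0/1 labels, trusted only at labeled_idx.
    Returns soft labels f in [0, 1] for all points."""
    n = len(X)
    W = kneighbors_graph(X, k, mode='connectivity', include_self=False)
    W = 0.5 * (W + W.T)                 # symmetrize the kNN graph
    W = W.toarray()
    L = np.diag(W.sum(axis=1)) - W      # graph Laplacian D - W
    l = np.asarray(labeled_idx)
    u = np.array([i for i in range(n) if i not in set(labeled_idx)])
    # Minimizing sum over edges of w_uv (f(u)-f(v))^2 with f fixed on labeled
    # nodes gives the harmonic solution f_u = -L_uu^{-1} L_ul f_l.
    f = np.zeros(n)
    f[l] = y[l]
    f[u] = np.linalg.solve(L[np.ix_(u, u)], -L[np.ix_(u, l)] @ f[l])
    return f                            # threshold at 0.5 for hard 0/1 labels
```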

Is there some underlying principle here? What should be true about the world for unlabeled data to help?

Detour back to standard supervised learning, then the new model. [Joint work with Nina Balcan]

Standard formulation (PAC) for supervised learning We are given a training set S = {(x, f(x))}. Assume the x's are a random sample from an underlying distribution D over the instance space, labeled by a target function f. The algorithm does optimization over S to produce some hypothesis (prediction rule) h. The goal is for h to do well on new examples also drawn from D, i.e., Pr_{x∼D}[h(x) ≠ f(x)] < ε.

Standard formulation (PAC) for supervised learning Question: why should doing well on S have anything to do with doing well on D? Say our algorithm is choosing rules from some class C. E.g., say the data is represented by n boolean features, and we are looking for a good decision tree of size O(n). [Figure: small decision tree over features x3, x5, x2 with ±1 leaves.] How big does S have to be so that we can hope performance carries over?

Confidence/sample-complexity Consider a rule h with err(h) > ε that we're worried might fool us. The chance that h survives m examples is at most (1 - ε)^m. So, Pr[some rule h with err(h) > ε is consistent with S] < |C|(1 - ε)^m. This is < δ for m > (1/ε)(ln|C| + ln(1/δ)). View ln|C| as just the number of bits to write h down!
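In LaTeX, the calculation behind this slide (a sketch of the usual union-bound argument, with the standard 1 - ε ≤ e^{-ε} step):

```latex
\Pr\bigl[\exists\, h \in C:\ \mathrm{err}(h) > \epsilon \text{ and } h \text{ consistent with } S\bigr]
  \;\le\; |C|\,(1-\epsilon)^m \;\le\; |C|\,e^{-\epsilon m} \;\le\; \delta
\quad\Longleftarrow\quad
m \;\ge\; \frac{1}{\epsilon}\left(\ln|C| + \ln\frac{1}{\delta}\right).
```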

Occam's razor William of Occam (~1320 AD): "entities should not be multiplied unnecessarily" (in Latin). Which we interpret as: in general, prefer simpler explanations. Why? Is this a good policy? What if we have different notions of what's simpler?

Occam's razor (cont'd) A computer-science-ish way of looking at it: say "simple" = "short description". At most 2^s explanations can be < s bits long. So, if the number of examples satisfies m > (1/ε)(s ln 2 + ln(1/δ)), then it's unlikely a bad simple (< s bits) explanation will fool you just by chance. Think of m as roughly 10× the number of bits needed to write down h.

Semi-supervised model High-level idea: intrinsically use a notion of "simplicity" that is a function of the unlabeled data. (Formally, a notion of how the proposed rule relates to the underlying distribution; use unlabeled data to estimate it.) E.g., large-margin separator, small cut, self-consistent rules (h1(x1) = h2(x2)).

Formally Convert the belief about the world into an unlabeled loss function ℓ_unl(h, x) ∈ [0, 1]. This defines an unlabeled error rate (incompatibility score): Err_unl(h) = E_{x∼D}[ℓ_unl(h, x)]. Co-training: the fraction of data points ⟨x1, x2⟩ where h1(x1) ≠ h2(x2). S3VM: the fraction of data points x near the separator h (ℓ_unl(h, x) = 0 for points far from it). We use the unlabeled data to estimate this score.
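For concreteness, a tiny sketch of estimating this score from an unlabeled sample for the two examples above. The helper names are hypothetical, and scikit-learn-style classifiers (with `predict` / `decision_function`) and a margin threshold `gamma` are assumed.

```python
# Hypothetical helpers for estimating Err_unl(h) from an unlabeled sample.
import numpy as np

def err_unl_cotrain(h1, h2, X1_unl, X2_unl):
    # Co-training: fraction of unlabeled points where the two views disagree.
    return np.mean(h1.predict(X1_unl) != h2.predict(X2_unl))

def err_unl_margin(clf, X_unl, gamma=1.0):
    # S3VM-style: fraction of unlabeled points within margin gamma of the separator.
    return np.mean(np.abs(clf.decision_function(X_unl)) < gamma)
```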

Can use this to prove sample bounds. Simple example theorem (believe the target is fully compatible): Define C(ε) = {h ∈ C : err_unl(h) ≤ ε}. Bound the number of labeled examples needed in terms of the "helpfulness" of D with respect to our incompatibility score: a helpful distribution is one in which C(ε) is small.
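Roughly, the kind of bound this gives (a sketch in LaTeX with constants omitted, assuming the target f* is fully compatible, i.e. err_unl(f*) = 0, and enough unlabeled data to estimate err_unl): it is just the Occam bound with C replaced by the compatible set C(ε).

```latex
m_{\text{labeled}} \;\ge\; \frac{1}{\epsilon}\left(\ln\bigl|C(\epsilon)\bigr| + \ln\frac{1}{\delta}\right)
\;\;\Longrightarrow\;\;
\Pr\Bigl[\exists\, h \in C(\epsilon):\ h \text{ consistent with } S,\ \mathrm{err}(h) > \epsilon\Bigr] \;\le\; \delta .
```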

Can use this to prove sample bounds. Extend to the case where the target is not fully compatible: then what matters is {h ∈ C : err_unl(h) ≤ err_unl(f*)}.

When does unlabeled data help? When: the target agrees with our beliefs (low unlabeled error rate / incompatibility score); the space of rules nearly as compatible as the target is small (in size, VC-dimension, or ε-cover size, ...); and we have an algorithm that can optimize.

When does unlabeled data help? Interesting implication of the analysis: the best bounds are for algorithms that first use the unlabeled data to generate a set of candidates (a small ε-cover of the compatible rules), and then use the labeled data to select among these. Unfortunately, this is often hard to do algorithmically; an interesting challenge. (It can be done for linear separators if the views are independent given the label ⇒ learn from a single labeled example.)

Conclusions Semi-supervised learning is an area of increasing importance in Machine Learning. Automatic methods of collecting data make it more important than ever to develop methods to make use of unlabeled data. Several promising algorithms (only discussed a few). Also a new theoretical framework to help guide further development.