New Horizons in Machine Learning
Avrim Blum, CMU
This is mostly a survey, but the last part is joint work with Nina Balcan and Santosh Vempala. [Workshop on New Horizons in Computing, Kyoto 2005]

What is Machine Learning?
• Design of programs that adapt from experience, identify patterns in data.
• Used to:
  – recognize speech, faces, images
  – steer a car,
  – play games,
  – categorize documents, info retrieval, ...
• Goals of ML theory: develop models, analyze the algorithmic and statistical issues involved.

Plan for this talk
• Discuss some of the current challenges and "hot topics".
• Focus on the topic of "kernel methods", and connections to random projection and embeddings.
• Start with a quick orientation…

The concept learning setting
• Imagine you want a computer program to help you decide which email messages are spam and which are important.
• Might represent each message by n features (e.g., return address, keywords, spelling, etc.).
• Take a sample S of data, labeled according to whether they were/weren't spam.
• Goal of the algorithm is to use the data seen so far to produce a good prediction rule (a "hypothesis") h(x) for future data.

The concept learning setting
[Table of labeled example messages: features per example, plus a spam/not-spam label]
E.g., given the data, some reasonable rules might be:
• Predict SPAM if unknown AND (money OR pills)
• Predict SPAM if (money + pills – known) > 0
• ...
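To make the setting concrete, here is a minimal sketch (not from the talk) of how such hand-written rules look as code over the slide's features (unknown, money, pills, known), treated as boolean/count values:

```python
# Minimal sketch: a message represented by a few boolean/count features,
# and the two hand-written prediction rules from the slide.

def rule_boolean(msg):
    # Predict SPAM if unknown AND (money OR pills)
    return bool(msg["unknown"] and (msg["money"] or msg["pills"]))

def rule_linear(msg):
    # Predict SPAM if money + pills - known > 0  (a simple linear separator)
    return msg["money"] + msg["pills"] - msg["known"] > 0

example = {"unknown": True, "money": 1, "pills": 0, "known": 0}
print(rule_boolean(example), rule_linear(example))   # True True
```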

Big questions
(A) How to optimize?
• How might we automatically generate rules like this that do well on observed data? [Algorithm design]
(B) What to optimize?
• Our real goal is to do well on new data.
• What kind of confidence do we have that rules that do well on the sample will do well in the future?
  – Statistics, sample complexity, SRM: for a given learning algorithm, how much data do we need...

To be a little more formal…
PAC model setup:
• Alg is given a sample S = {(x, ℓ)} drawn from some distribution D over examples x, labeled by some target function f.
• Alg does optimization over S to produce some hypothesis h ∈ H. [e.g., H = linear separators]
• Goal is for h to be close to f over D:
  – Pr_{x∼D}(h(x) ≠ f(x)) ≤ ε.
• Allow failure with small probability δ (to allow for the chance that S is not representative).

The issue of sample-complexity
• We want to do well on D, but all we have is S.
  – Are we in trouble?
  – How big does S have to be so that low error on S ⇒ low error on D?
• Luckily, there are simple sample-complexity bounds:
  – If |S| ≥ (1/ε)[ log|H| + log(1/δ) ] [think of log|H| as the number of bits to write down h], then whp (1−δ), all h ∈ H that agree with S have true error ≤ ε.
  – In fact, with an extra factor of O(1/ε), enough so that whp all h have |true error − empirical error| ≤ ε.
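As a quick illustration of the bound above, a small helper (my own sketch, using natural logs and ignoring constants) that computes the required sample size:

```python
import math

def sample_size(eps, delta, hypothesis_space_size):
    """Occam-style bound from the slide: |S| >= (1/eps) * (ln|H| + ln(1/delta))
    examples suffice so that, whp (1 - delta), every h in H consistent with S
    has true error <= eps.  (Natural log; constants not optimized.)"""
    return math.ceil((1.0 / eps) * (math.log(hypothesis_space_size) + math.log(1.0 / delta)))

# e.g., a hypothesis class describable with 50 bits: |H| = 2^50
print(sample_size(eps=0.1, delta=0.05, hypothesis_space_size=2**50))
```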

The issue of sample-complexity
• We want to do well on D, but all we have is S.
  – Are we in trouble?
  – How big does S have to be so that low error on S ⇒ low error on D?
• Implication:
  – If we view the cost of examples as comparable to the cost of computation, then we don't have to worry about data cost, since it is just ~1/ε per bit of output.
  – But, in practice, the costs are often wildly different, so sample-complexity issues are crucial.

Some current hot topics in ML
• More precise confidence bounds, as a function of observable quantities.
  – Replace log|H| with log(# of ways of splitting S using functions in H).
  – Bounds based on margins: how well-separated the data is.
  – Bounds based on other observable properties of S and the relation of S to H; other complexity measures…

Some current hot topics in ML
• More precise confidence bounds, as a function of observable quantities.
• Kernel methods.
  – Allow one to implicitly map data into a higher-dimensional space, without paying for it if the algorithm can be "kernelized".
  – Get back to this in a few minutes…
  – Point is: if, say, the data is not linearly separable in the original space, it could be in the new space.

Some current hot topics in ML
• More precise confidence bounds, as a function of observable quantities.
• Kernel methods.
• Semi-supervised learning.
  – Using labeled and unlabeled data together (often unlabeled data is much more plentiful).
  – Useful if we have beliefs about not just the form of the target but also its relationship to the underlying distribution.
  – Co-training, graph-based methods, transductive SVM, …

Some current hot topics in ML
• More precise confidence bounds, as a function of observable quantities.
• Kernel methods.
• Semi-supervised learning.
• Online learning / adaptive game playing.
  – Classic strategies with excellent regret bounds (from Hannan in the 1950s to weighted-majority in the 80s-90s).
  – New work on strategies that can efficiently handle large implicit choice spaces. [KV][Z]…
  – Connections to game-theoretic equilibria.

Some current hot topics in ML
• More precise confidence bounds, as a function of observable quantities.
• Kernel methods.
• Semi-supervised learning.
• Online learning / adaptive game playing.
Could give a full talk on any one of these. Focus on #2, with connection to random projection and metric embeddings…

Kernel Methods
• One of the most natural approaches to learning is to try to learn a linear separator.
• But what if the data is not linearly separable, yet you still want to use the same algorithm?
• One idea: kernel functions.

Kernel Methods
• A kernel function K(x,y) is a function on pairs of examples such that, for some implicit function φ(x) into a possibly high-dimensional space, K(x,y) = φ(x)·φ(y).
• E.g., K(x,y) = (1 + x·y)^m.
  – If x ∈ R^n, then φ(x) ∈ R^(n^m).
  – K is easy to compute, even though you can't even efficiently write down φ(x).
• The point: many linear-separator algorithms can be kernelized – made to use K and act as if their input was the φ(x)'s.
  – E.g., Perceptron, SVM.
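A small sanity-check sketch (not part of the talk): for the degree-2 polynomial kernel, the O(n)-time kernel value agrees with the dot product of the explicit, roughly n²-dimensional feature map φ, which we only build for tiny n:

```python
import numpy as np

def poly_kernel(x, y, m=2):
    # K(x, y) = (1 + x·y)^m, computed in O(n) time
    return (1.0 + np.dot(x, y)) ** m

def phi_degree2(x):
    # Explicit feature map for m = 2: (1, sqrt(2)*x_i for all i, x_i*x_j for all i,j).
    # Lives in ~n^2 dimensions, so we only ever write it down for tiny n.
    return np.concatenate(([1.0], np.sqrt(2.0) * x, np.outer(x, x).ravel()))

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)
print(poly_kernel(x, y), np.dot(phi_degree2(x), phi_degree2(y)))   # equal up to float error
```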

Typical application for kernels
• Given a set of images (represented as pixels), want to distinguish men from women.
• But pixels are not a great representation for image classification.
• Use a kernel K(x, y) = φ(x)·φ(y), where φ is an implicit, high-dimensional mapping. Choose K appropriate for the type of data.

What about sample-complexity?
• Use a kernel K(x, y) = φ(x)·φ(y), where φ is an implicit, high-dimensional mapping.
• What about the number of samples needed?
  – Don't have to pay for the dimensionality of φ-space if the data is separable by a large margin γ.
  – E.g., Perceptron and SVM need sample size only Õ(1/γ²).
[Figure: margin condition in φ-space: |w·φ(x)| / |φ(x)| ≥ γ, with |w| = 1]
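For concreteness, a minimal kernelized-Perceptron sketch (my own illustration of what "kernelizing" the Perceptron means; any black-box K(x,y) can be plugged in). The weight vector in φ-space is never built explicitly; it is kept as a sum of mistake examples, so predictions need only kernel evaluations:

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=10):
    """Kernelized Perceptron (sketch): w = sum_i alpha_i * y_i * phi(x_i) is kept
    implicitly via the mistake counts alpha, so only K(x_i, x_j) is ever computed."""
    n = len(X)
    alpha = np.zeros(n)
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])   # Gram matrix
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            pred = np.sign(np.sum(alpha * y * G[:, i]))
            if pred != y[i]:           # mistake: add phi(x_i) with label y_i to w
                alpha[i] += 1.0
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(alpha, X_train, y_train, K, x):
    return np.sign(sum(a * yi * K(xi, x) for a, yi, xi in zip(alpha, y_train, X_train)))

# Usage sketch:
#   K = lambda a, b: (1.0 + np.dot(a, b)) ** 2
#   alpha = kernel_perceptron(X, np.asarray(labels), K)
```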

So, with that background…

Question  Are kernels really allowing you to magically use power of implicit high-dimensional  - space without paying for it?  What’s going on?  Claim: [BBV] Given a kernel [as a black-box program K(x,y)] and access to typical inputs [samples from D], –Can run K and reverse-engineer an explicit (small) set of features, such that if K is good [ 9 large-margin separator in  -space for f,D], then this is a good feature set [ 9 almost-as- good separator in this explicit space].

contd  Claim: [BBV] Given a kernel [as a black-box program K(x,y)] & access to typical inputs [samples from D] –Can run K and reverse-engineer an explicit (small) set of features, such that if K is good [ 9 large-margin separator in  -space], then this is a good feature set [ 9 almost-as- good separator in this explicit space].  Eg, sample z 1,...,z d from D. Given x, define x i =K(x,z i ).  Implications: –Practical: alternative to kernelizing the algorithm. –Conceptual: View choosing a kernel like choosing a (distrib dependent) set of features, rather than “magic power of implicit high dimensional space”. [though argument needs existence of  functions]

Why is this a plausible goal in principle?
• JL lemma: If data is separable with margin γ in φ-space, then with prob 1−δ, a random linear projection down to a space of dimension d = O((1/γ²) log[1/(εδ)]) will have a linear separator of error < ε.
[Figure: + and − points in φ-space projected down to R^d]
• If the projection vectors are r_1, r_2,...,r_d, then we can view the coordinates as features x_i = φ(x)·r_i.
• Problem: this uses φ. Can we do it directly, using K as a black box, without computing φ?
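A sketch of the JL-style projection itself (illustrative; note that it needs the explicit φ(x)'s, which is exactly the requirement the rest of the talk works to remove):

```python
import numpy as np

def random_projection(Phi, d, seed=0):
    """JL-style sketch: project explicit vectors Phi (one row per point, in R^N)
    onto d random Gaussian directions; coordinate i of the image is phi(x)·r_i."""
    rng = np.random.default_rng(seed)
    N = Phi.shape[1]
    R = rng.normal(size=(N, d)) / np.sqrt(d)   # columns r_1,...,r_d (scaled)
    return Phi @ R

# Whp, pairwise geometry (and hence large margins) is approximately preserved.
Phi = np.random.default_rng(1).normal(size=(100, 1000))   # 100 points in R^1000
print(random_projection(Phi, d=50).shape)                 # (100, 50)
```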

3 methods (from simplest to best)
1. Draw d examples z_1,...,z_d from D. Use: F(x) = (K(x,z_1),..., K(x,z_d)). [So, "x_i" = K(x,z_i).] For d = (8/ε)[1/γ² + ln(1/δ)], if separable with margin γ in φ-space, then whp this will be separable with error ε. (But this method doesn't preserve the margin.)
2. Same d, but a little more complicated. Separable with error ε at margin γ/2.
3. Combine (2) with a further projection as in the JL lemma. Get d with log dependence on 1/ε, rather than linear. So, can set ε ≪ 1/d.
All these methods need access to D, unlike JL. Can this be removed? We show NO for generic K, but it may be possible for natural K.

Actually, the argument is not too hard... (though we did try a lot of things first that didn’t work...)

Key fact
Claim: If ∃ a perfect w of margin γ in φ-space, then if we draw z_1,...,z_d ∈ D for d ≥ (8/ε)[1/γ² + ln(1/δ)], whp (1−δ) there exists w′ in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
Proof: Let S = examples drawn so far. Assume |w| = 1 and |φ(z)| = 1 for all z.
• w_in = proj(w, span(S)), w_out = w − w_in.
• Say w_out is large if Pr_z( |w_out·φ(z)| ≥ γ/2 ) ≥ ε; else small.
• If small, then done: w′ = w_in.
• Else, the next z has at least ε probability of improving S: |w_out|² ← |w_out|² − (γ/2)².
• Can happen at most 4/γ² times. □
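For completeness, a slightly expanded version of the potential argument (my own rendering; it just spells out why one useful draw shrinks |w_out|² by at least (γ/2)²):

```latex
% Write \phi(z) = p + u with p \in \mathrm{span}(S) and u \perp \mathrm{span}(S), \|u\| \le 1.
% Since w_{\mathrm{out}} \perp \mathrm{span}(S),
\[
  \langle w_{\mathrm{out}}, \phi(z)\rangle = \langle w_{\mathrm{out}}, u\rangle ,
\]
% so if |\langle w_{\mathrm{out}}, \phi(z)\rangle| \ge \gamma/2, the component of
% w_{\mathrm{out}} along the new direction u/\|u\| has length at least \gamma/2, and
% adding z to S removes it:
\[
  \|w_{\mathrm{out}}^{\,\mathrm{new}}\|^2 \;\le\; \|w_{\mathrm{out}}\|^2 - (\gamma/2)^2 .
\]
% Since \|w_{\mathrm{out}}\|^2 \le \|w\|^2 = 1 to start, this can happen at most
% 4/\gamma^2 times; while w_{\mathrm{out}} is still ``large'', each fresh draw from D
% triggers it with probability at least \varepsilon, so (by a standard concentration
% argument) d = (8/\varepsilon)[\,1/\gamma^2 + \ln(1/\delta)\,] draws suffice whp, as
% stated in the claim above.
```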

So....
If we draw z_1,...,z_d ∈ D for d = (8/ε)[1/γ² + ln(1/δ)], then whp there exists w′ in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
• So, for some w′ = α_1 φ(z_1) + ... + α_d φ(z_d), Pr_{(x,ℓ)∼P}[sign(w′·φ(x)) ≠ ℓ] ≤ ε.
• But notice that w′·φ(x) = α_1 K(x,z_1) + ... + α_d K(x,z_d).
  ⇒ the vector (α_1,...,α_d) is an ε-good separator in the feature space x_i = K(x,z_i).
• But the margin is not preserved, because the lengths of the target and of the examples are not preserved.

What if we want to preserve the margin? (mapping #2)
• Problem with the last mapping is that the φ(z)'s might be highly correlated. So the dot-product mapping doesn't preserve the margin.
• Instead, given a new x, we want to do an orthogonal projection of φ(x) into that span. (Preserves the dot-product, decreases |φ(x)|, so only increases the margin.)
• Run K(z_i, z_j) for all i, j = 1,...,d. Get matrix M. Decompose M = U^T U.
  (Mapping #2) = (mapping #1) U^{-1}. □
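A sketch of mapping #2 as code (illustrative; the tiny `jitter` term is my addition to keep the Cholesky factorization numerically stable):

```python
import numpy as np

def mapping2(K, landmarks, jitter=1e-10):
    """Mapping #2 (sketch): coordinates of the orthogonal projection of phi(x) onto
    span(phi(z_1),...,phi(z_d)), computed only through kernel calls.
    M[i,j] = K(z_i, z_j) = U^T U (Cholesky), and F2(x) = F1(x) @ U^{-1}."""
    d = len(landmarks)
    M = np.array([[K(zi, zj) for zj in landmarks] for zi in landmarks])
    U = np.linalg.cholesky(M + jitter * np.eye(d)).T    # upper-triangular, M = U^T U
    U_inv = np.linalg.inv(U)
    def F2(x):
        F1 = np.array([K(x, z) for z in landmarks])     # mapping #1
        return F1 @ U_inv
    return F2

# Sanity check: for a landmark z_i itself, |F2(z_i)|^2 == K(z_i, z_i) == |phi(z_i)|^2,
# i.e. lengths inside the span are preserved, as the slide's argument needs.
```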

Use this to improve dimension
• The current mapping gives d = (8/ε)[1/γ² + ln(1/δ)].
• Johnson-Lindenstrauss gives d = O((1/γ²) log[1/(εδ)]). Nice because we can have d ≪ 1/ε. [So we can set ε small enough that whp a sample of size O(d) is perfectly separable.]
• Can we achieve that efficiently?
• Answer: just combine the two... Run mapping #2, then do a random projection down from that (using the fact that mapping #2 preserved a margin). Gives us the desired dimension (# of features), though the sample-complexity remains as in mapping #2.
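A sketch of this combination as code (illustrative; `F` would be mapping #2 from the previous sketch, and `target_dim` plays the role of the JL dimension O((1/γ²) log[1/(εδ)])):

```python
import numpy as np

def compose_with_random_projection(F, input_dim, target_dim, seed=0):
    """Mapping #3 (sketch): take an explicit, margin-preserving feature map F
    (e.g. mapping #2) of dimension input_dim and follow it with a JL-style
    random projection down to target_dim features."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(input_dim, target_dim)) / np.sqrt(target_dim)
    def F3(x):
        return F(x) @ R
    return F3
```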

[Figure: labeled data points (X's and O's) mapped by F_1 into R^N and by F into R^d, then projected via JL down to R^{d_1}]

Lower bound (on necessity of access to D)
For an arbitrary black-box kernel K, we can't hope to convert to a small feature space without access to D.
• Consider X = {0,1}^n, a random X′ ⊂ X of size 2^{n/2}, and D = uniform over X′.
• c = arbitrary function (so learning is hopeless).
• But we have this magic kernel K(x,y) = φ(x)·φ(y):
  – φ(x) = (1, 0) if x ∉ X′.
  – φ(x) = (−½, √3/2) if x ∈ X′, c(x) = pos.
  – φ(x) = (−½, −√3/2) if x ∈ X′, c(x) = neg.
• P is separable with margin √3/2 in φ-space.
• But, without access to D, all attempts at running K(x,y) will give an answer of 1.
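A toy, scaled-down rendering of this construction (my own illustration; the hidden support is made far smaller than 2^{n/2} so it actually runs, but the phenomenon is the same: queries you invent yourself almost surely miss X′, so the kernel looks constant):

```python
import numpy as np

n = 30
rng = np.random.default_rng(0)
# Stand-in for the random X' (much smaller than 2^{n/2} for runnability):
hidden_support = {tuple(rng.integers(0, 2, n)) for _ in range(32)}
label = {x: rng.choice([+1, -1]) for x in hidden_support}     # arbitrary target c

def phi(x):
    x = tuple(x)
    if x not in hidden_support:
        return np.array([1.0, 0.0])
    return np.array([-0.5, np.sqrt(3) / 2 * label[x]])

def K(x, y):
    return float(np.dot(phi(x), phi(y)))

# Without samples from D, self-made queries land outside X' whp, so K returns 1.0:
print(K(rng.integers(0, 2, n), rng.integers(0, 2, n)))
```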

Open Problems
• For specific natural kernels, like the polynomial kernel K(x,y) = (1 + x·y)^m, is there an efficient analog to JL without needing access to D? Or can one at least reduce the sample-complexity (use fewer accesses to D)? This would increase the practicality of this approach.
• Can one extend these results (e.g., mapping #1: x ↦ [K(x,z_1),..., K(x,z_d)]) to more general similarity functions K? It is not exactly clear what the theorem statement would look like.