Kernels, Margins, and Low-dimensional Mappings [NIPS 2007 Workshop on Topology Learning] Maria-Florina Balcan, Avrim Blum, Santosh Vempala.

Presentation transcript:

Kernels, Margins, and Low-dimensional Mappings [NIPS 2007 Workshop on Topology Learning] Maria-Florina Balcan, Avrim Blum, Santosh Vempala

Generic problem
- Given a set of images, we want to learn a linear separator to distinguish men from women.
- Problem: the pixel representation is no good.
- Old-style advice: pick a better set of features! But this seems ad hoc, not scientific.
- New-style advice: use a kernel! K(x,y) = φ(x)·φ(y), where φ is an implicit, high-dimensional mapping. Feels more scientific. Many algorithms can be "kernelized": use the "magic" of the implicit high-dimensional space, and don't pay for it if a large-margin separator exists.

Generic problem (cont.)
- E.g., K(x,y) = (x·y + 1)^m, with φ: (n-dim'l space) → (n^m-dim'l space).
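To make the polynomial-kernel example concrete, here is a minimal sketch (not from the slides) checking numerically that K(x,y) = (x·y + 1)^2 equals an ordinary dot product in an explicit higher-dimensional feature space; the feature map phi_degree2 is a standard construction included purely as an illustration.

```python
import numpy as np

def poly_kernel(x, y, m=2):
    """K(x, y) = (x . y + 1)^m, computed without ever building phi."""
    return (np.dot(x, y) + 1.0) ** m

def phi_degree2(x):
    """Explicit feature map for m = 2: all degree-<=2 monomials, suitably scaled."""
    n = len(x)
    feats = [1.0]                                   # constant term
    feats += [np.sqrt(2.0) * xi for xi in x]        # linear terms
    feats += [xi * xi for xi in x]                  # squared terms
    feats += [np.sqrt(2.0) * x[i] * x[j]            # cross terms
              for i in range(n) for j in range(i + 1, n)]
    return np.array(feats)

x, y = np.random.randn(5), np.random.randn(5)
assert np.isclose(poly_kernel(x, y), np.dot(phi_degree2(x), phi_degree2(y)))
```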

Claim: Can view the new method as a way of conducting the old method.
- Given a kernel [as a black-box program K(x,y)] and access to typical inputs [samples from D],
- Claim: we can run K and reverse-engineer an explicit (small) set of features, such that if K is good [∃ a large-margin separator in φ-space for D,c], then this is a good feature set [∃ an almost-as-good separator].
- "You give me a kernel, I give you a set of features."
- Do this using the idea of random projection...

Claim (cont.)
- E.g., sample z_1,...,z_d from D. Given x, define x_i = K(x,z_i).
- Implications:
  - Practical: an alternative to kernelizing the algorithm.
  - Conceptual: view the kernel as a (principled) way of doing feature generation; view it as a similarity function, rather than as the "magic power of an implicit high-dimensional space".
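A minimal sketch of the feature construction just described (later called mapping #1), assuming K is any kernel given as a callable and Z is a list of samples drawn from D:

```python
import numpy as np

def mapping1(K, Z):
    """Return F with F(x) = (K(x, z_1), ..., K(x, z_d)) for the landmarks Z."""
    def F(x):
        return np.array([K(x, z) for z in Z])
    return F

# Usage sketch, with a polynomial kernel and a Gaussian stand-in for D:
K = lambda x, y: (np.dot(x, y) + 1.0) ** 2
Z = [np.random.randn(10) for _ in range(50)]   # z_1,...,z_d drawn from D
F = mapping1(K, Z)
x_features = F(np.random.randn(10))            # explicit 50-dimensional feature vector
```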

Basic setup, definitions
- Instance space X; distribution D, target c. Use P = (D,c).
- K(x,y) = φ(x)·φ(y).
- P is separable with margin γ in φ-space if ∃ w with |w| = 1 such that Pr_{(x,ℓ)~P}[ ℓ·(w·φ(x)/|φ(x)|) < γ ] = 0.
- Error ε at margin γ: replace the "0" with "ε".
- Goal: use K to get a mapping to a low-dimensional space.

One idea: the Johnson-Lindenstrauss lemma
- If P is separable with margin γ in φ-space, then with probability 1-δ, a random linear projection down to a space of dimension d = O((1/γ^2) log[1/(εδ)]) will have a linear separator of error < ε. [Arriaga-Vempala]
- If the projection vectors are r_1, r_2,..., r_d, then we can view them as features: x_i = φ(x)·r_i.
- Problem: this uses φ. Can we do it directly, using K as a black box, without computing φ?
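A small sketch of the JL-as-features idea on this slide, under the assumption that φ is available explicitly (which is exactly the limitation the kernel-only mappings below are meant to remove); d and dim_phi_space are user-supplied:

```python
import numpy as np

def jl_features(phi, d, dim_phi_space, rng=np.random.default_rng(0)):
    """Features x_i = phi(x) . r_i for d random directions r_1,...,r_d."""
    R = rng.normal(size=(d, dim_phi_space)) / np.sqrt(d)   # random directions r_i
    def F(x):
        return R @ phi(x)
    return F
```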

3 methods (from simplest to best)
1. Draw d examples z_1,...,z_d from D. Use F(x) = (K(x,z_1),..., K(x,z_d)). [So "x_i" = K(x,z_i).] For d = (8/ε)[1/γ^2 + ln 1/δ], if P was separable with margin γ in φ-space, then whp this will be separable with error ε. (But this method doesn't preserve the margin.)
2. Same d, but a little more complicated: separable with error ε at margin γ/2.
3. Combine (2) with a further projection as in the JL lemma. Get d with a log dependence on 1/ε, rather than linear. So we can set ε ≪ 1/d.
All these methods need access to D, unlike JL. Can this be removed? We show NO for a generic K, but it may be possible for natural K.

Key fact
Claim: If ∃ a perfect w of margin γ in φ-space, then if we draw z_1,...,z_d ∈ D for d ≥ (8/ε)[1/γ^2 + ln 1/δ], whp (1-δ) there exists w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
Proof: Let S = the examples drawn so far. Assume |w| = 1 and |φ(z)| = 1 for all z.
- w_in = proj(w, span(S)), w_out = w - w_in.
- Say w_out is large if Pr_z( |w_out·φ(z)| ≥ γ/2 ) ≥ ε; else small.
- If small, then we are done: w' = w_in.
- Else, the next z has probability at least ε of improving S: |w_out|^2 ← |w_out|^2 - (γ/2)^2.
- This can happen at most 4/γ^2 times. □
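Restating the last step of the proof as a potential argument, just to spell out where the 4/γ^2 bound comes from:

```latex
% Each improving draw z removes at least (gamma/2)^2 of squared length from w_out,
% and w_out starts with squared length at most |w|^2 = 1.
\[
\|w_{\mathrm{out}}^{(t+1)}\|^2 \;\le\; \|w_{\mathrm{out}}^{(t)}\|^2 - (\gamma/2)^2,
\qquad
\|w_{\mathrm{out}}^{(0)}\|^2 \le \|w\|^2 = 1
\;\Longrightarrow\;
\#\{\text{improvements}\} \;\le\; \frac{1}{(\gamma/2)^2} \;=\; \frac{4}{\gamma^2}.
\]
```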

So....
- If we draw z_1,...,z_d ∈ D for d = (8/ε)[1/γ^2 + ln 1/δ], then whp there exists w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
- So, for some w' = α_1 φ(z_1) + ... + α_d φ(z_d), we have Pr_{(x,ℓ)~P}[sign(w'·φ(x)) ≠ ℓ] ≤ ε.
- But notice that w'·φ(x) = α_1 K(x,z_1) + ... + α_d K(x,z_d).
  ⇒ the vector (α_1,...,α_d) is an ε-good separator in the feature space x_i = K(x,z_i).
- But the margin is not preserved, because the lengths of the target and of the examples are not preserved.

How to preserve the margin? (mapping #2)
- We know ∃ w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
- So, given a new x, we just want to do an orthogonal projection of φ(x) into that span. (This preserves the dot product and decreases |φ(x)|, so it only increases the margin.)
- Run K(z_i,z_j) for all i,j = 1,...,d; get the matrix M. Decompose M = U^T U. Then (mapping #2) = (mapping #1) U^{-1}. □
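A minimal sketch of mapping #2, under the same assumptions as before (K a kernel callable, Z a list of samples from D); it uses a Cholesky factorization for M = U^T U and ignores numerical issues such as an ill-conditioned or rank-deficient Gram matrix, for which one would regularize or use a pseudo-inverse:

```python
import numpy as np

def mapping2(K, Z):
    d = len(Z)
    M = np.array([[K(zi, zj) for zj in Z] for zi in Z])   # Gram matrix M(S)
    U = np.linalg.cholesky(M).T                           # M = U^T U, U upper-triangular
    U_inv = np.linalg.inv(U)
    def F2(x):
        F1x = np.array([K(x, z) for z in Z])              # mapping #1
        return F1x @ U_inv                                 # (mapping #2) = (mapping #1) U^{-1}
    return F2

# Sanity check: F2(z_i) . F2(z_j) recovers K(z_i, z_j) for the landmarks.
```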

Mapping #2, details
- Draw a set S = {z_1,..., z_d} of d = (8/ε)[1/γ^2 + ln 1/δ] unlabeled examples from D.
- Run K(x,y) for all x,y ∈ S; get M(S) = (K(z_i,z_j))_{z_i,z_j ∈ S}.
- Place S into d-dimensional space based on K (i.e., on M(S)).
[Figure: z_1, z_2, z_3 placed in R^d so that F_2(z_i)·F_2(z_j) = K(z_i,z_j); e.g., K(z_1,z_1) = |F_2(z_1)|^2.]

Mapping #2, details, cont.
- What to do with new points?
- Extend the embedding F_1 to all of X: consider F_2: X → R^d defined as follows: for x ∈ X, let F_2(x) ∈ R^d be the point of smallest length such that F_2(x)·F_2(z_i) = K(x,z_i) for all i ∈ {1,..., d}.
- This mapping is equivalent to orthogonally projecting φ(x) down to span(φ(z_1),...,φ(z_d)).
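The "point of smallest length" condition is a minimum-norm linear solve, which is what numpy's lstsq returns; a small sketch, where F2_Z is assumed to be the d x d matrix whose rows are F_2(z_1),...,F_2(z_d) and k_x = (K(x,z_1),...,K(x,z_d)):

```python
import numpy as np

def F2_new_point(F2_Z, k_x):
    """Minimum-norm v with F2_Z @ v = k_x, i.e. F_2(z_i) . v = K(x, z_i) for all i."""
    v, *_ = np.linalg.lstsq(F2_Z, k_x, rcond=None)
    return v
```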

How to improve the dimension?
- The current mapping (F_2) gives d = (8/ε)[1/γ^2 + ln 1/δ].
- Johnson-Lindenstrauss gives d_1 = O((1/γ^2) log 1/(εδ)). Nice because we can have d_1 ≪ 1/ε.
- Answer: just combine the two... Run mapping #2, then do a random projection down from that. This gives the desired dimension (# of features), though the sample complexity remains as in mapping #2.

[Figure: labeled points (x's and o's) in X are mapped by φ into R^N and by F_2 into R^d; a JL projection then takes R^d down to R^{d_1}.]

Mapping #3
- Do JL(mapping #2(x)).
- JL says: fix y, w. A random projection M down to a space of dimension O((1/γ^2) log 1/δ') will, with probability 1-δ', preserve the margin of y up to ±γ/4.
- Use δ' = εδ. ⇒ For all y, Pr_M[failure on y] < εδ, ⇒ Pr_{D,M}[failure on y] < εδ, ⇒ Pr_M[failure on more than an ε fraction of the probability mass] < δ.
- So we get the desired dimension (# of features), though the sample complexity remains as in mapping #2.
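A minimal sketch of mapping #3: compose any F2: X → R^d (e.g., the mapping2 sketch above) with a random Gaussian, JL-style projection down to d1 dimensions. Here d1 is a user-chosen target playing the role of O((1/γ^2) log 1/(εδ)) from the slide:

```python
import numpy as np

def mapping3(F2, d, d1, rng=np.random.default_rng(0)):
    """Compose a given F2: X -> R^d with a random projection R^d -> R^{d1}."""
    A = rng.normal(size=(d1, d)) / np.sqrt(d1)   # random JL projection matrix
    def F3(x):
        return A @ F2(x)                          # F3(x) = JL(mapping2(x))
    return F3
```

Note that the projection only reduces the number of features; the number of unlabeled samples needed from D stays the same as for mapping #2.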

Lower bound (on the necessity of access to D)
For an arbitrary black-box kernel K, we can't hope to convert to a small feature space without access to D.
- Consider X = {0,1}^n, a random X' ⊂ X of size 2^{n/2}, and D = uniform over X'.
- c = an arbitrary function (so learning is hopeless).
- But we have this magic kernel K(x,y) = φ(x)·φ(y):
  - φ(x) = (1, 0) if x ∉ X'.
  - φ(x) = (-1/2, √3/2) if x ∈ X', c(x) = pos.
  - φ(x) = (-1/2, -√3/2) if x ∈ X', c(x) = neg.
- P is separable with margin √3/2 in φ-space.
- But, without access to D, all attempts at running K(x,y) will give an answer of 1.
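An illustrative sketch of this construction, where X_prime (the hidden random set, stored as tuples) and c (mapping points of X' to +1 or -1) are hypothetical stand-ins: the kernel is perfectly informative on X', yet a query on points chosen without access to D almost surely misses X' and just returns 1.

```python
import numpy as np

def make_magic_kernel(X_prime, c):
    def phi(x):
        if tuple(x) not in X_prime:
            return np.array([1.0, 0.0])
        return (np.array([-0.5,  np.sqrt(3) / 2]) if c(tuple(x)) > 0
                else np.array([-0.5, -np.sqrt(3) / 2]))
    def K(x, y):
        return float(np.dot(phi(x), phi(y)))       # K(x, y) = phi(x) . phi(y)
    return K

# Whp two random points of {0,1}^n both miss X', so the black-box answer is 1:
# K(np.random.randint(0, 2, n), np.random.randint(0, 2, n)) == 1.0
```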

Open problems
- For specific natural kernels, like K(x,y) = (1 + x·y)^m, is there an efficient analog of JL that does not need access to D? Or can one at least reduce the sample complexity (use fewer accesses to D)?
- Can one extend the results (e.g., mapping #1: x → [K(x,z_1),..., K(x,z_d)]) to more general similarity functions K? It is not exactly clear what the theorem statement would look like.