Kernels and Margins Maria Florina Balcan 10/13/2011

Kernel Methods. An amazingly popular method in ML in recent years: a significant percentage of ICML, NIPS, and COLT papers, plus lots of books and workshops. [ICML 2007 business meeting]

Linear Separators. Instance space: X = R^n. Hypothesis class: linear decision surfaces in R^n, h(x) = w · x; if h(x) > 0, label x as +, otherwise label it as −. [Figure: positive (X) and negative (O) points separated by the hyperplane defined by w.]

Linear Separators: The Perceptron Algorithm
- Start with the all-zeroes weight vector w.
- Given example x, predict positive iff w · x ≥ 0.
- On a mistake, update as follows: mistake on a positive example, update w ← w + x; mistake on a negative example, update w ← w − x.
Note: w is a weighted sum of the incorrectly classified examples.
Guarantee: if the data is linearly separable by margin γ (with examples of norm at most 1), the mistake bound is 1/γ².
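A minimal sketch of the update rule above, assuming examples are numpy rows with labels in {+1, −1}; the function and parameter names are illustrative, not from the slides.

    import numpy as np

    def perceptron(X, y, epochs=10):
        """Online perceptron: X is an (m, n) array of examples, y holds labels in {+1, -1}."""
        w = np.zeros(X.shape[1])                  # start with the all-zeroes weight vector
        for _ in range(epochs):
            for x, label in zip(X, y):
                prediction = 1 if np.dot(w, x) >= 0 else -1   # predict positive iff w·x >= 0
                if prediction != label:                        # on a mistake...
                    w = w + label * x   # ...w <- w + x on positives, w <- w - x on negatives
        return w                        # w is a weighted sum of the misclassified examples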

Geometric Margin. If S is a set of labeled examples, then a unit vector w has margin γ with respect to S if every labeled example (x, l) in S satisfies l (w · x) ≥ γ; geometrically, every example lies at distance at least γ from the separating hyperplane h: w · x = 0.

What if the Data Is Not Linearly Separable? Example: images for which there is no good linear separator in the pixel representation. Problem: the data is not linearly separable in the most natural feature representation. Solutions: Classic: "learn a more complex class of functions." Modern: "use a kernel" (the prominent method today).

Overview of Kernel Methods
What is a kernel? A kernel K is a legal definition of a dot-product: i.e., there exists an implicit mapping φ such that K(x, y) = φ(x) · φ(y).
Why do kernels matter? Many algorithms interact with data only via dot-products. So if we replace x · y with K(x, y), they act implicitly as if the data were in the higher-dimensional φ-space. If the data is linearly separable by a large margin in the φ-space, then we get good sample complexity.

Kernels. K(·, ·) is a kernel if it can be viewed as a legal definition of an inner product: there exists φ: X → R^N such that K(x, y) = φ(x) · φ(y). The range of φ is called the "φ-space". N can be very large, but think of φ as implicit, not explicit!

Example. For n = 2, d = 2, the kernel K(x, y) = (x · y)^d corresponds to mapping the data into a 3-dimensional φ-space. [Figure: data that is not linearly separable in the original (x1, x2) space becomes linearly separable in the (z1, z2, z3) φ-space.]
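A quick numerical check of this correspondence, using the explicit map φ(x) = (x1², x2², √2·x1·x2), one standard choice realizing this kernel (other maps work too, as noted on the next slide):

    import numpy as np

    def poly_kernel(x, y, d=2):
        return np.dot(x, y) ** d                    # K(x, y) = (x·y)^d

    def phi(x):
        # one explicit feature map for n = 2, d = 2
        return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

    x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(poly_kernel(x, y), phi(x) @ phi(y))       # equal, up to floating-point rounding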

Example (continued). [Figure: the same data shown in the original (x1, x2) space and in a (z1, z2, z3) φ-space.] Note: the feature space need not be unique; different mappings φ can realize the same kernel K.

Kernels.
Theorem: K is a kernel iff K is symmetric and, for any set of training points x1, x2, ..., xm and any a1, a2, ..., am ∈ R, we have Σ_{i,j} a_i a_j K(x_i, x_j) ≥ 0 (i.e., the Gram matrix is positive semi-definite).
More examples:
- Linear: K(x, y) = x · y
- Polynomial: K(x, y) = (x · y)^d or K(x, y) = (1 + x · y)^d
- Gaussian: K(x, y) = exp(−||x − y||² / (2σ²))
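The theorem's condition can be checked numerically on any finite sample: build the Gram matrix and verify it is symmetric and positive semi-definite. A small sketch (the tolerance and helper names are my own assumptions):

    import numpy as np

    def gram_matrix(K, X):
        """Pairwise kernel values K(x_i, x_j) for the rows of X."""
        return np.array([[K(xi, xj) for xj in X] for xi in X])

    def looks_like_kernel(K, X, tol=1e-9):
        """Necessary condition on the sample X: symmetric, PSD Gram matrix."""
        G = gram_matrix(K, X)
        symmetric = np.allclose(G, G.T)
        psd = np.linalg.eigvalsh(G).min() >= -tol   # all eigenvalues >= 0, up to tolerance
        return symmetric and psd

    X = np.random.randn(20, 5)
    gaussian = lambda u, v, s=1.0: np.exp(-np.linalg.norm(u - v)**2 / (2 * s**2))
    print(looks_like_kernel(gaussian, X))           # True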

Kernelizing a Learning Algorithm. If all computations involving instances are in terms of inner products, then:
- Conceptually, we work in a very high-dimensional space, and the algorithm's performance depends only on linear separability in that extended space.
- Computationally, we only need to modify the algorithm by replacing each x · y with K(x, y).
Examples of kernelizable algorithms: Perceptron, SVM.

Linear Separators: The Perceptron Algorithm, Kernelized
- Start with the all-zeroes weight vector w.
- Given example x, predict positive iff w · x ≥ 0.
- On a mistake, update: mistake on a positive example, w ← w + x; mistake on a negative example, w ← w − x.
Easy to kernelize, since w is a weighted sum of the mistaken examples: replace w · x = Σ_i l_i (x_i · x) with Σ_i l_i K(x_i, x), where the sum runs over the mistakes (x_i, l_i) made so far.
Note: we need to store all the mistakes made so far.
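A minimal sketch of this kernelized perceptron: instead of maintaining w explicitly, store the mistakes with their labels and evaluate Σ_i l_i K(x_i, x). Names are illustrative, not from the slides.

    import numpy as np

    def kernel_perceptron(X, y, K, epochs=10):
        """Kernelized perceptron: X is (m, n) data, y holds labels in {+1, -1}, K is a kernel."""
        mistakes, mistake_labels = [], []           # store every mistake made so far
        for _ in range(epochs):
            for x, label in zip(X, y):
                # implicit w·φ(x) = Σ_i l_i K(x_i, x), summed over the stored mistakes
                score = sum(l * K(xi, x) for xi, l in zip(mistakes, mistake_labels))
                if (1 if score >= 0 else -1) != label:   # on a mistake, remember the example
                    mistakes.append(x)
                    mistake_labels.append(label)
        return mistakes, mistake_labels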

Generalize Well if There Is a Good Margin. If the data is linearly separable by margin γ in the φ-space (with |φ(x)| ≤ 1), then we get good sample complexity: a sample of size only Õ(1/γ²) is needed to be confident in generalization. We cannot rely on standard VC bounds, since the dimension of the φ-space might be very large: the VC-dimension of linear separators in R^m is m + 1.

Kernels & Large Margins.
- If S is a set of labeled examples, then a unit vector w in the φ-space has margin γ if, for every (x, l) in S: l (w · φ(x)) ≥ γ.
- A vector w in the φ-space has margin γ with respect to the distribution P if: Pr_{(x,l)~P}[ l (w · φ(x)) < γ ] = 0.
- A vector w in the φ-space has error ε at margin γ if: Pr_{(x,l)~P}[ l (w · φ(x)) < γ ] ≤ ε.
A kernel admitting such a w is called an (ε, γ)-good kernel.
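These quantities are easy to compute empirically. A sketch under the assumption that we have an explicit feature matrix Phi whose rows are φ(x) (the helper names are hypothetical):

    import numpy as np

    def margin_on_sample(w, Phi, y):
        """Smallest value of l·(w·φ(x)) over the sample; w is a unit vector, Phi stacks φ(x)."""
        return np.min(y * (Phi @ w))

    def error_at_margin(w, Phi, y, gamma):
        """Fraction of examples whose margin l·(w·φ(x)) falls below gamma."""
        return np.mean(y * (Phi @ w) < gamma)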

Large Margin Classifiers.
- If there is a large margin, the amount of data we need depends only on 1/γ and is independent of the dimension of the space!
- If there is a large margin γ and our algorithm produces a large-margin classifier, then the amount of data we need depends only on 1/γ [Bartlett & Shawe-Taylor '99].
- If there is a large margin, then the Perceptron also behaves well (mistake bound 1/γ²).
- Another nice justification is based on random projection [Arriaga & Vempala '99].

Kernels & Large Margins. A powerful combination in ML in recent years!
- A kernel implicitly allows mapping data into a high-dimensional space and performing certain operations there without paying a high computational price.
- If the data indeed has a large-margin linear separator in that space, then one can avoid paying a high price in terms of sample size as well.

Kernel Methods Offer Great Modularity.
- No need to change the underlying learning algorithm to accommodate a particular choice of kernel function.
- Also, we can substitute a different learning algorithm while maintaining the same kernel (see the sketch below).
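For instance, reusing the kernel_perceptron sketch from the kernelized-perceptron slide, switching feature spaces means changing only the kernel argument; the toy data and kernel choices here are my own illustration.

    import numpy as np

    X = np.random.randn(50, 2)
    y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)   # toy labels, not linearly separable

    poly = lambda u, v: (1 + np.dot(u, v)) ** 3
    gaussian = lambda u, v: np.exp(-np.linalg.norm(u - v)**2 / 2)

    # same learning algorithm, different implicit feature spaces: only the kernel changes
    # (kernel_perceptron is defined in the earlier sketch)
    model_poly = kernel_perceptron(X, y, K=poly)
    model_rbf = kernel_perceptron(X, y, K=gaussian)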

Kernels: Closure Properties. It is easy to create new kernels using basic ones! For example, sums of kernels with nonnegative coefficients and products of kernels are kernels again.
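A small illustration of these closure properties; the particular coefficients and kernel choices are arbitrary.

    import numpy as np

    linear = lambda u, v: np.dot(u, v)
    gaussian = lambda u, v: np.exp(-np.linalg.norm(u - v)**2 / 2)

    # nonnegative sums and products of kernels are kernels again,
    # so this combination is itself a legal kernel
    combined = lambda u, v: 0.5 * linear(u, v) + 2.0 * gaussian(u, v) * (1 + linear(u, v))**2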

What we really care about are good kernels, not merely legal kernels!

Good Kernels, Margins, and Low-Dimensional Mappings [Balcan-Blum-Vempala, MLJ '06].
- Designing a kernel function is much like designing a feature space.
- Given a good kernel K, we can reinterpret K as defining a new set of features.
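One simple way to realize this reinterpretation, in the spirit of the Balcan-Blum-Vempala result (the details below are only a sketch under my own assumptions, not the paper's exact construction): draw a few unlabeled landmark points and use the kernel values against them as explicit features, then run any linear learner on those features.

    import numpy as np

    def kernel_features(K, landmarks):
        """Map x to the explicit feature vector (K(x, x_1), ..., K(x, x_d))."""
        def F(x):
            return np.array([K(x, z) for z in landmarks])
        return F

    # e.g., draw d landmark points from unlabeled data, then run a linear
    # learner on F(x) instead of using K implicitly
    X_unlabeled = np.random.randn(100, 5)
    landmarks = X_unlabeled[np.random.choice(len(X_unlabeled), size=20, replace=False)]
    gaussian = lambda u, v: np.exp(-np.linalg.norm(u - v)**2 / 2)
    F = kernel_features(gaussian, landmarks)
    print(F(X_unlabeled[0]).shape)                  # (20,)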