Support Vector Machines
Rong Jin
Linear Classifiers: How do we find a linear decision boundary that separates the data points of the two classes?
Linear Classifiers: Any of these boundaries would be fine... but which is best?
Copyright © 2001, 2003, Andrew W. Moore
Classifier Margin: Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.
Maximum Margin: The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM).
Why Maximum Margin? If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing. Empirically it works very, very well.
Estimate the Margin: What is the distance from a point x to the line wx + b = 0?
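The distance formula on the original slide did not survive extraction; the standard point-to-hyperplane distance is:

```latex
d(\mathbf{x}) = \frac{|\mathbf{w}^\top \mathbf{x} + b|}{\|\mathbf{w}\|_2}
```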
Estimate the Margin: What is the classification margin for wx + b = 0?
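The margin expression itself is missing from this transcript; with labels y_i in {-1, +1}, the usual definition is:

```latex
\gamma_i = \frac{y_i\,(\mathbf{w}^\top \mathbf{x}_i + b)}{\|\mathbf{w}\|_2},
\qquad
\gamma = \min_i \gamma_i
```

Note that gamma_i is positive exactly when x_i is classified correctly.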
Maximize the Classification Margin
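The optimization problem shown on this slide was lost in extraction; the standard derivation fixes the scale of w so that min_i y_i(w·x_i + b) = 1, turning margin maximization into a constrained minimization:

```latex
\max_{\mathbf{w},b}\; \min_i \frac{y_i(\mathbf{w}^\top \mathbf{x}_i + b)}{\|\mathbf{w}\|_2}
\;\Longleftrightarrow\;
\min_{\mathbf{w},b}\; \frac{1}{2}\|\mathbf{w}\|_2^2
\quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1,\;\; i = 1,\dots,n
```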
Maximum Margin: This is a quadratic programming (QP) problem: a quadratic objective function with linear equality and inequality constraints. QP is a well-studied problem in operations research (OR).
Quadratic Programming: Find the optimum of a quadratic criterion, subject to n additional linear inequality constraints and to e additional linear equality constraints.
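The general QP template the slide refers to (reconstructed; the symbols c, d, R, a_i, b_i follow the usual textbook form, not the lost slide image) is:

```latex
\mathbf{u}^* = \arg\max_{\mathbf{u}}\; c + \mathbf{d}^\top \mathbf{u} + \tfrac{1}{2}\,\mathbf{u}^\top R\,\mathbf{u}
\quad \text{s.t.} \quad
\mathbf{a}_i^\top \mathbf{u} \le b_i \;\ (i = 1,\dots,n), \qquad
\mathbf{a}_j^\top \mathbf{u} = b_j \;\ (j = n{+}1,\dots,n{+}e)
```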
Linearly Inseparable Case: This is going to be a problem! What should we do?
Linearly Inseparable Case: Relax the constraints, but penalize the relaxation.
Linearly Inseparable Case: Still a quadratic programming problem.
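The relaxed ("soft-margin") formulation these slides build, with slack variables xi_i and penalty parameter C, is the standard one:

```latex
\min_{\mathbf{w},b,\boldsymbol{\xi}}\; \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.} \quad
y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0
```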
Linearly Inseparable Case: Support Vector Machine vs. regularized logistic regression.
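The side-by-side equations were lost; the standard comparison writes both methods as regularized loss minimization, differing only in the loss function:

```latex
\text{SVM:}\quad
\min_{\mathbf{w},b}\; \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_i \max\bigl(0,\; 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\bigr)
```

```latex
\text{Regularized logistic regression:}\quad
\min_{\mathbf{w},b}\; \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \sum_i \log\bigl(1 + e^{-y_i(\mathbf{w}^\top\mathbf{x}_i + b)}\bigr)
```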
Dual Form of SVM: How do we decide b?
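The dual problem and the rule for b appeared as images on the slide; the standard soft-margin dual is:

```latex
\max_{\boldsymbol{\alpha}}\; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\, \mathbf{x}_i^\top\mathbf{x}_j
\quad \text{s.t.} \quad 0 \le \alpha_i \le C,\;\; \sum_i \alpha_i y_i = 0
```

with w = sum_i alpha_i y_i x_i, and b recovered from any support vector with 0 < alpha_k < C via b = y_k - w·x_k (such points lie exactly on the margin).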
Suppose we're in one dimension: What would SVMs do with this data?
Suppose we're in one dimension: Not a big surprise: the maximum-margin boundary sits midway between the two classes, with a positive "plane" on one side and a negative "plane" on the other.
Harder 1-dimensional Dataset: Here the positive and negative points interleave on the line, so no single threshold separates them. The remedy in the original slides: map each input x to the pair z = (x, x^2); in this two-dimensional space the classes become linearly separable.
Common SVM Basis Functions: polynomial terms of x of degree 1 to q; radial (Gaussian) basis functions.
Quadratic Basis Functions: the constant term, linear terms, pure quadratic terms, and quadratic cross-terms. Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2, which is approximately m^2/2 for large m.
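The explicit quadratic feature map (with the sqrt(2) scaling used on the original slides so that feature-space dot products simplify) is:

```latex
\Phi(\mathbf{x}) = \bigl(1,\;
\sqrt{2}\,x_1, \dots, \sqrt{2}\,x_m,\;
x_1^2, \dots, x_m^2,\;
\sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1 x_3, \dots, \sqrt{2}\,x_{m-1} x_m\bigr)^\top
```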
Quadratic Dot Products: Just out of casual, innocent interest, let's look at another function of a and b.
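The algebra on the slide was lost in extraction; the punchline is that the explicit quadratic feature map satisfies Phi(a)·Phi(b) = (a·b + 1)^2. A quick numeric check in plain Python (a minimal sketch for m = 2; the vectors a and b are arbitrary example values):

```python
import math

def phi(v):
    """Quadratic feature map for a 2-D vector v = (v1, v2):
    constant, sqrt(2)-scaled linear terms, pure squares, sqrt(2)-scaled cross-term."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * v[0], r2 * v[1], v[0] ** 2, v[1] ** 2, r2 * v[0] * v[1]]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

a, b = [1.0, 2.0], [3.0, -1.0]
lhs = dot(phi(a), phi(b))     # dot product computed in the 6-D feature space
rhs = (dot(a, b) + 1.0) ** 2  # same quantity computed directly in input space
```

The right-hand side costs O(m) to evaluate, while the explicit feature map costs O(m^2); this gap is exactly what the kernel trick exploits.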
Kernel Trick
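The equations for this slide were lost; the trick itself is standard: wherever the dual form uses the dot product x_i·x_j, substitute a kernel K, so that training and prediction never touch the feature map explicitly:

```latex
\max_{\boldsymbol{\alpha}}\; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\, K(\mathbf{x}_i,\mathbf{x}_j),
\qquad
f(\mathbf{x}) = \operatorname{sign}\Bigl(\sum_i \alpha_i y_i\, K(\mathbf{x}_i,\mathbf{x}) + b\Bigr)
```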
SVM Kernel Functions: the polynomial kernel function, and the radial basis kernel function (a universal kernel).
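The kernel formulas were images on the slide; the standard forms are:

```latex
K_{\text{poly}}(\mathbf{x},\mathbf{y}) = (\mathbf{x}^\top\mathbf{y} + 1)^q,
\qquad
K_{\text{RBF}}(\mathbf{x},\mathbf{y}) = \exp\!\Bigl(-\frac{\|\mathbf{x}-\mathbf{y}\|_2^2}{2\sigma^2}\Bigr)
```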
Kernel Tricks: Replace the dot product with a kernel function. Not all functions are kernel functions. How can we tell whether a given candidate function is one?
Kernel Tricks: Mercer's condition. To expand a kernel function k(x, y) into a dot product, i.e. k(x, y) = Φ(x)·Φ(y), k(x, y) must be a positive semi-definite function: for any function f(x) for which ∫ f^2(x) dx is finite, the inequality ∫∫ f(x) k(x, y) f(y) dx dy ≥ 0 holds.
Kernel Tricks: they introduce nonlinearity into the model while remaining computationally efficient.
Nonlinear Kernel (I) and (II): figures showing nonlinear SVM decision boundaries.
Reproducing Kernel Hilbert Space (RKHS): the eigendecomposition of the kernel, the elements of the space H, and the reproducing property.
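The RKHS equations were lost in extraction; the standard construction via the Mercer eigendecomposition is:

```latex
k(\mathbf{x},\mathbf{y}) = \sum_i \lambda_i\,\phi_i(\mathbf{x})\,\phi_i(\mathbf{y}),
\qquad
\mathcal{H} = \Bigl\{ f = \sum_i f_i\,\phi_i \;:\; \sum_i f_i^2/\lambda_i < \infty \Bigr\},
\qquad
\langle f,\, k(\mathbf{x},\cdot) \rangle_{\mathcal{H}} = f(\mathbf{x})
```

The last identity is the reproducing property: evaluating f at x is an inner product with the kernel section k(x, ·).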
Reproducing Kernel Hilbert Space (RKHS): the representer theorem.
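The theorem statement is missing from the transcript; in its usual form, any minimizer over H of a regularized empirical loss admits a finite kernel expansion over the training points:

```latex
\min_{f \in \mathcal{H}}\; \sum_{i=1}^{n} \ell\bigl(y_i, f(\mathbf{x}_i)\bigr) + \lambda \|f\|_{\mathcal{H}}^2
\quad\Longrightarrow\quad
f^*(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i\, k(\mathbf{x}_i, \mathbf{x})
```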
Kernelize Logistic Regression: How can we introduce nonlinearity into the logistic regression model?
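One standard answer, sketched here rather than taken from the lost slide: replace the linear score w·x with a kernel expansion and fit the coefficients alpha_i by (regularized) maximum likelihood:

```latex
f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i\, K(\mathbf{x}_i, \mathbf{x}),
\qquad
p(y \mid \mathbf{x}) = \frac{1}{1 + e^{-y\, f(\mathbf{x})}}
```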
Diffusion Kernel: A kernel function describes the correlation or similarity between two data points. Suppose we are given a similarity function s(x, y) that is non-negative and symmetric but does not obey Mercer's condition. How can we generate a kernel function from this similarity function? A graph-theoretic approach...
Diffusion Kernel: Create a graph over all the training examples. Each vertex corresponds to a data point, and the weight of each edge (x, y) is the similarity s(x, y). The graph Laplacian, in the form used here, is negative semi-definite.
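The Laplacian definition on the slide was lost; the sign convention that makes it negative semi-definite (similarities off the diagonal, negated degrees on it) is:

```latex
L_{ij} =
\begin{cases}
s(\mathbf{x}_i, \mathbf{x}_j) & i \ne j \\[2pt]
-\sum_{k \ne i} s(\mathbf{x}_i, \mathbf{x}_k) & i = j
\end{cases}
```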
Diffusion Kernel: Consider a simple Laplacian L, and then consider L^2, L^3, ... What do these matrices represent? They lead to a diffusion kernel.
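The definition lost from this slide is, in the standard form: the diffusion kernel is the matrix exponential of the Laplacian,

```latex
K = e^{\beta L}
= \lim_{n \to \infty} \Bigl(I + \frac{\beta L}{n}\Bigr)^{n}
= \sum_{k=0}^{\infty} \frac{\beta^k L^k}{k!}
```

The entries of L^k aggregate weighted walks of length k between vertices, which is what the powers L^2, L^3, ... represent: similarity diffusing along paths in the graph.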
Diffusion Kernel Properties: it is positive definite. How do we compute the diffusion kernel?
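A minimal computational sketch in plain Python, using a truncated Taylor series for the matrix exponential (adequate for small graphs; larger ones would use an eigendecomposition of L). The 3-node path graph and beta = 0.5 are made-up illustration values:

```python
def diffusion_kernel(W, beta, terms=30):
    """Diffusion kernel K = exp(beta * L) for the Laplacian L = W - D
    (similarities off-diagonal, negated degrees on the diagonal),
    computed with a truncated Taylor series."""
    n = len(W)
    L = [[W[i][j] - (sum(W[i]) if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in K]  # running term (beta*L)^k / k!
    for k in range(1, terms + 1):
        term = [[sum(term[i][m] * L[m][j] for m in range(n)) * beta / k
                 for j in range(n)] for i in range(n)]
        K = [[K[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return K

# 3-node path graph 0 -- 1 -- 2 with unit similarities
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
K = diffusion_kernel(W, beta=0.5)
```

Because the rows of L sum to zero, each row of K sums to one, and K[0][2] is strictly positive even though vertices 0 and 2 share no edge: similarity has diffused through vertex 1.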
Doing Multi-class Classification: SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done? Answer: with output arity N, learn N SVMs: SVM 1 learns "Output == 1" vs. "Output != 1", SVM 2 learns "Output == 2" vs. "Output != 2", ..., SVM N learns "Output == N" vs. "Output != N". Then, to predict the output for a new input, predict with each SVM and choose the one that puts the prediction furthest into the positive region.
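The one-versus-rest recipe above can be sketched in plain Python. The base learner here is a linear soft-margin SVM trained by hinge-loss sub-gradient descent, a stand-in for a real QP solver; the toy clusters, learning rate, regularization strength, and epoch count are all made-up illustration values:

```python
def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=500):
    """Binary linear SVM (hinge loss + L2 penalty) via sub-gradient descent.
    Labels y must be +1.0 or -1.0. A minimal sketch, not production code."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # inside the margin: hinge sub-gradient is active
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:             # only the regularizer contributes
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def one_vs_rest_fit(X, labels, n_classes):
    """Learn one SVM per class: class k vs. the rest."""
    return [train_linear_svm(X, [1.0 if lab == k else -1.0 for lab in labels])
            for k in range(n_classes)]

def one_vs_rest_predict(models, x):
    """Pick the class whose SVM pushes x furthest into the positive region."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for w, b in models]
    return max(range(len(scores)), key=lambda k: scores[k])

# Toy data: three well-separated 2-D clusters
X = [(0.0, 4.0), (1.0, 4.0), (0.0, 3.0),
     (4.0, 0.0), (4.0, 1.0), (3.0, 0.0),
     (-4.0, 0.0), (-4.0, -1.0), (-3.0, 0.0)]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
models = one_vs_rest_fit(X, labels, 3)
preds = [one_vs_rest_predict(models, x) for x in X]
```

In practice one would use a packaged solver (e.g. SVM-light, referenced at the end of these slides) for each binary problem; only the decompose-and-argmax scheme is the point here.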
Kernel Learning: What if we have multiple kernel functions? Which one is the best? How can we combine multiple kernels?
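One common answer, stated here as a standard construction rather than recovered slide content: any non-negative combination of kernels is again a kernel,

```latex
K(\mathbf{x},\mathbf{y}) = \sum_{k} \mu_k\, K_k(\mathbf{x},\mathbf{y}),
\qquad \mu_k \ge 0
```

and multiple kernel learning optimizes the weights mu_k jointly with the SVM coefficients.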
References
An excellent tutorial on VC-dimension and Support Vector Machines: C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
The VC/SRM/SVM bible (not for beginners, including myself): Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
Software: SVM-light, http://svmlight.joachims.org/, free download.