
1 Support Vector Machine
Rong Jin
(Slides adapted from Andrew W. Moore, copyright © 2001, 2003.)

2 Linear Classifiers
How do we find a linear decision boundary that separates the data points of two classes (labeled +1 and -1)?

3 Linear Classifiers
[Figure: data points from the two classes, +1 and -1.]

4 Linear Classifiers
Any of these would be fine.. ..but which is best?

5 Classifier Margin
Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

6 Maximum Margin
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM).

7 Why Maximum Margin?
If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing. Empirically it works very, very well.

8 Estimate the Margin
What is the distance from a point x to the line w·x + b = 0?
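The distance expression itself was an image on the slide; the standard formula for the distance from a point x to the hyperplane w·x + b = 0 is

    d(x) = |w·x + b| / ||w||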

9 Estimate the Margin
What is the classification margin for w·x + b = 0?
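Again the formula was an image; in standard notation, for a training point (x_i, y_i) with y_i in {+1, -1}, the classification margin is

    γ_i = y_i (w·x_i + b) / ||w||

and the margin of the classifier is the smallest of these, γ = min_i γ_i.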

10 Maximize the Classification Margin
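The optimization on the slide was an image; the standard derivation: maximize γ subject to y_i (w·x_i + b) / ||w|| ≥ γ for every i. Since (w, b) can be rescaled freely, fix the scale so that min_i y_i (w·x_i + b) = 1; maximizing γ = 1/||w|| is then equivalent to

    min_{w,b} (1/2)||w||²   s.t.   y_i (w·x_i + b) ≥ 1,   i = 1, …, n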

11 Maximum Margin

12 Maximum Margin
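Slides 11-12 carried the derivation as images; the key facts, in standard form: the two margin hyperplanes are w·x + b = +1 and w·x + b = -1, the training points that touch them are the support vectors, and the distance between the two hyperplanes is 2/||w||, which is exactly why minimizing (1/2)||w||² maximizes the margin.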

13 Maximum Margin
This is a quadratic programming (QP) problem: a quadratic objective function with linear equality and inequality constraints. QP is a well-studied problem in operations research.

14 Quadratic Programming
Find the vector that optimizes a quadratic criterion, subject to n additional linear inequality constraints and e additional linear equality constraints.
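The matrices on the slide were images; in the usual notation (as on Andrew Moore's original slides), quadratic programming means: find

    u* = argmax_u  c + dᵀu + (1/2) uᵀ R u

subject to n linear inequality constraints a_i·u ≤ b_i (i = 1, …, n) and e additional linear equality constraints a_j·u = b_j (j = n+1, …, n+e).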

15 Linearly Inseparable Case
This is going to be a problem! What should we do?

16 Linearly Inseparable Case
Relax the constraints; penalize the relaxation.

17 Linearly Inseparable Case
Relax the constraints; penalize the relaxation (see the formulation below).
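The relaxed ("soft-margin") problem, reconstructed in standard notation since the slide's equations were images: introduce a slack variable ξ_i ≥ 0 for each point and solve

    min_{w,b,ξ} (1/2)||w||² + C Σ_i ξ_i   s.t.   y_i (w·x_i + b) ≥ 1 - ξ_i,   ξ_i ≥ 0

where the constant C controls how heavily the relaxation is penalized.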

18 Linearly Inseparable Case
Still a quadratic programming problem.

19 Linearly Inseparable Case

20 Linearly Inseparable Case
Compare the Support Vector Machine with regularized logistic regression (see the loss-function comparison below).
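The comparison on the slide was graphical; in loss-function terms (standard, not preserved in the transcript), both methods minimize a regularized loss of the form (λ/2)||w||² + Σ_i ℓ(y_i (w·x_i + b)): the SVM uses the hinge loss ℓ(z) = max(0, 1 - z), while regularized logistic regression uses the logistic loss ℓ(z) = log(1 + e^(-z)), a smooth approximation to the hinge.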

21 Dual Form of SVM
How do we decide b?
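The dual problem appeared as an image; its standard form for the soft-margin SVM is

    max_α  Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i·x_j)
    s.t.   0 ≤ α_i ≤ C,   Σ_i α_i y_i = 0

with w = Σ_i α_i y_i x_i. As for b: any support vector with 0 < α_i < C lies exactly on its margin plane, y_i (w·x_i + b) = 1, so b = y_i - w·x_i for any such point (in practice, averaged over all of them).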

22 Dual Form of SVM

23 Suppose we're in 1-dimension
What would SVMs do with this data? [Figure: positive and negative points on the x-axis.]

24 Suppose we're in 1-dimension
Not a big surprise: the boundary is a single point, with a positive "plane" and a negative "plane" on either side of it.

25 Harder 1-dimensional Dataset
[Figure: a 1-D dataset that no single threshold on x can separate.]

26 Harder 1-dimensional Dataset

27 Harder 1-dimensional Dataset
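Slides 25-27 were figures. The classic resolution (standard, and consistent with the basis-function slide that follows): map each point to z = (x, x²). Points that interleave on the line, say negatives at the ends and positives in the middle, become linearly separable in this two-dimensional feature space, because a threshold on x² separates them.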

28 Common SVM Basis Functions
Polynomial terms of x of degree 1 to q
Radial (Gaussian) basis functions

29 Quadratic Basis Functions
The quadratic basis expansion contains a constant term, the linear terms, the pure quadratic terms, and the quadratic cross-terms. Number of terms (assuming m input dimensions): (m+2)-choose-2 = (m+2)(m+1)/2 ≈ m²/2 for large m.
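The feature vector itself was an image; in standard notation (with the √2 scaling used on Moore's original slides, chosen so the dot products below work out):

    Φ(x) = (1, √2·x_1, …, √2·x_m, x_1², …, x_m², √2·x_1 x_2, …, √2·x_{m-1} x_m)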

30 Dual Form of SVM

31 Dual Form of SVM
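These two slides were equations; their point, in standard form: with basis functions, the dual problem is unchanged except that every dot product x_i·x_j becomes Φ(x_i)·Φ(x_j). The whole algorithm only ever touches the data through dot products in feature space.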

32 Quadratic Dot Products

33 Quadratic Dot Products
Just out of casual, innocent interest, let's look at another function of a and b:
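The function (an image on the slide; this is the standard identity for the √2-scaled quadratic features above) is

    Φ(a)·Φ(b) = (a·b + 1)²

so the feature-space dot product costs O(m) to evaluate rather than the O(m²) of building Φ explicitly.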

34 Kernel Trick

35 Kernel Trick
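As a concrete illustration of the trick (not from the slides, just a minimal scikit-learn sketch, assuming sklearn is installed; gamma=1 makes the kernel exactly (a·b + 1)²):

    from sklearn.svm import SVC

    # A 1-D dataset that no single threshold separates: negatives flank
    # positives, as in the "harder 1-dimensional dataset" slides.
    X = [[-3.0], [-2.0], [-1.0], [0.0], [1.0], [2.0], [3.0]]
    y = [-1, -1, 1, 1, 1, -1, -1]

    # Degree-2 polynomial kernel k(a, b) = (a.b + 1)^2: equivalent to the
    # quadratic feature map, but Phi(x) is never built explicitly.
    clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=10.0)
    clf.fit(X, y)
    print(clf.predict([[-0.5], [2.5]]))   # expected: [ 1 -1]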

36 SVM Kernel Functions
Polynomial kernel function
Radial basis kernel function (a universal kernel)
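The formulas were images; the standard forms:

    Polynomial (degree q):     k(x, y) = (x·y + 1)^q
    Radial basis (Gaussian):   k(x, y) = exp(-||x - y||² / (2σ²))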

37 Kernel Tricks
Replace the dot product with a kernel function. But not all functions are kernel functions: are the candidate functions shown on the slide (not preserved in this transcript) kernel functions?

38 Kernel Tricks
Mercer's condition: to expand a kernel function k(x, y) into a dot product, i.e. k(x, y) = Φ(x)·Φ(y), k(x, y) has to be a positive semi-definite function; i.e., for any function f(x) for which ∫ f(x)² dx is finite, the following inequality holds:

    ∫∫ k(x, y) f(x) f(y) dx dy ≥ 0

39 Kernel Tricks
Introduce nonlinearity into the model
Computationally efficient

40 Nonlinear Kernel (I)

41 Nonlinear Kernel (II)

42 Reproducing Kernel Hilbert Space (RKHS)
A kernel k induces a Hilbert space H via its eigendecomposition; elements of H are expansions in the eigenfunctions, and k has the reproducing property in H.
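The three formulas were images; the standard statements (an assumption about exactly what the slide showed):

    Eigendecomposition:     k(x, y) = Σ_i λ_i φ_i(x) φ_i(y),   λ_i ≥ 0
    Elements of H:          f(x) = Σ_i a_i φ_i(x),   with   ||f||²_H = Σ_i a_i² / λ_i < ∞
    Reproducing property:   ⟨f, k(x, ·)⟩_H = f(x)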

43 Reproducing Kernel Hilbert Space (RKHS)
Representer theorem
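The theorem was an image; its standard statement: for any loss L and any strictly increasing regularizer Ω, every minimizer over H of

    Σ_i L(y_i, f(x_i)) + Ω(||f||_H)

can be written as a finite expansion over the training points,

    f(x) = Σ_i α_i k(x_i, x)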

44 Kernelize Logistic Regression
How can we introduce nonlinearity into the logistic regression model?
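By the representer theorem, the standard kernelization (the model on the slide was an image) replaces the linear score w·x + b with a kernel expansion:

    p(y | x) = 1 / (1 + exp(-y (Σ_i α_i k(x_i, x) + b)))

and the α_i are learned by regularized maximum likelihood.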

45 Diffusion Kernel
A kernel function describes the correlation or similarity between two data points. Suppose we are given a similarity function s(x, y) that is non-negative and symmetric but does not obey Mercer's condition. How can we generate a kernel function based on this similarity function? A graph-theory approach …

46 Diffusion Kernel
Create a graph over all the training examples: each vertex corresponds to a data point, and the weight of each edge is the similarity s(x, y). The graph Laplacian of this graph is negative semi-definite.
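The Laplacian's definition was an image; with the sign convention that makes it negative semi-definite (an assumption consistent with the slide's claim):

    L_ij = s(x_i, x_j)   for i ≠ j,      L_ii = -Σ_{j≠i} s(x_i, x_j)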

47 Diffusion Kernel
Consider a simple Laplacian L, and then consider L², L³, … What do these matrices represent? (Powers of L aggregate similarity along paths of length 2, 3, … through the graph.) Together they yield a diffusion kernel.
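The kernel itself was an image; the standard diffusion kernel is the matrix exponential of the Laplacian,

    K = e^{βL} = Σ_{k=0}^∞ (β^k / k!) L^k,   β > 0

where β controls how far similarity diffuses through the graph.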

48 Diffusion Kernel: Properties
The diffusion kernel is positive definite. How do we compute it?
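The computation was shown as an equation; the standard route uses the eigendecomposition of the symmetric Laplacian: if L = UΛUᵀ then e^{βL} = U e^{βΛ} Uᵀ, i.e. exponentiate the eigenvalues. A small numerical sketch (assumes NumPy and SciPy; the 3-point graph is illustrative):

    import numpy as np
    from scipy.linalg import expm

    # Symmetric similarity (edge-weight) matrix for 3 example points.
    S = np.array([[0.0, 0.9, 0.1],
                  [0.9, 0.0, 0.5],
                  [0.1, 0.5, 0.0]])

    # Laplacian with the slide's sign convention: off-diagonal s(i, j),
    # diagonal minus the row sum (negative semi-definite).
    L = S - np.diag(S.sum(axis=1))

    beta = 0.5
    K = expm(beta * L)                          # diffusion kernel K = e^{beta L}
    print(np.all(np.linalg.eigvalsh(K) > 0))    # positive definite -> True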

49 Doing Multi-class Classification
SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done? Answer: with output arity N, learn N SVMs:
SVM 1 learns "Output == 1" vs "Output != 1"
SVM 2 learns "Output == 2" vs "Output != 2"
:
SVM N learns "Output == N" vs "Output != N"
Then, to predict the output for a new input, predict with each SVM and find out which one puts the prediction the furthest into the positive region (the decision rule is given below).
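In symbols (the standard one-vs-all decision rule, not preserved in the transcript): train (w_c, b_c) for each class c and predict

    ŷ = argmax_c  (w_c·x + b_c)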

50 Kernel Learning
What if we have multiple kernel functions? Which one is the best? How can we combine multiple kernels?
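A common answer (standard multiple kernel learning, not spelled out in the transcript): learn a non-negative combination

    k(x, y) = Σ_j μ_j k_j(x, y),   μ_j ≥ 0

which is again a valid kernel, and optimize the weights μ_j jointly with the SVM.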

51 Kernel Learning

52 References
An excellent tutorial on VC-dimension and Support Vector Machines: C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
The VC/SRM/SVM Bible (not for beginners, including myself): Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
Software: SVM-light, free download.

