
1 Support Vector Machine
Rong Jin
(Slides adapted from Andrew W. Moore, copyright © 2001, 2003.)

2 Linear Classifiers
How do we find a linear decision boundary that separates the data points of two classes (labeled +1 and -1)?

3 Linear Classifiers
[Figure: data points from the two classes, +1 and -1.]

4 Linear Classifiers
Any of these would be fine.. ..but which is best?

5 Classifier Margin
Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

6 Maximum Margin
The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM).

7 Why Maximum Margin?
If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing. Empirically it works very, very well.

8 Estimate the Margin
What is the distance from a point x to the line w·x + b = 0?
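The distance expression itself was an image on the slide; the standard formula for the distance from a point x to the hyperplane w·x + b = 0 is

    d(x) = |w·x + b| / ||w||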

9 Estimate the Margin
What is the classification margin for w·x + b = 0?
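Again the formula was an image; in standard notation, for a training point (x_i, y_i) with y_i in {+1, -1}, the classification margin is

    γ_i = y_i (w·x_i + b) / ||w||

and the margin of the classifier is the smallest of these, γ = min_i γ_i.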

10 Maximize the Classification Margin
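The optimization on the slide was an image; the standard derivation: maximize γ subject to y_i (w·x_i + b) / ||w|| ≥ γ for every i. Since (w, b) can be rescaled freely, fix the scale so that min_i y_i (w·x_i + b) = 1; maximizing γ = 1/||w|| is then equivalent to

    min_{w,b} (1/2)||w||²   s.t.   y_i (w·x_i + b) ≥ 1,   i = 1, …, n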

11 Maximum Margin

12 Maximum Margin
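Slides 11-12 carried the derivation as images; the key facts, in standard form: the two margin hyperplanes are w·x + b = +1 and w·x + b = -1, the training points that touch them are the support vectors, and the distance between the two hyperplanes is 2/||w||, which is exactly why minimizing (1/2)||w||² maximizes the margin.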

13 Maximum Margin
This is a quadratic programming (QP) problem: a quadratic objective function with linear equality and inequality constraints. QP is a well-studied problem in operations research.

14 Quadratic Programming
Find the vector that optimizes a quadratic criterion, subject to n additional linear inequality constraints and e additional linear equality constraints.
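The matrices on the slide were images; in the usual notation (as on Andrew Moore's original slides), quadratic programming means: find

    u* = argmax_u  c + dᵀu + (1/2) uᵀ R u

subject to n linear inequality constraints a_i·u ≤ b_i (i = 1, …, n) and e additional linear equality constraints a_j·u = b_j (j = n+1, …, n+e).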

15 Linearly Inseparable Case
This is going to be a problem! What should we do?

16 Linearly Inseparable Case
Relax the constraints; penalize the relaxation.

17 Linearly Inseparable Case
Relax the constraints; penalize the relaxation (see the formulation below).
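The relaxed ("soft-margin") problem, reconstructed in standard notation since the slide's equations were images: introduce a slack variable ξ_i ≥ 0 for each point and solve

    min_{w,b,ξ} (1/2)||w||² + C Σ_i ξ_i   s.t.   y_i (w·x_i + b) ≥ 1 - ξ_i,   ξ_i ≥ 0

where the constant C controls how heavily the relaxation is penalized.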

18 Linearly Inseparable Case
Still a quadratic programming problem.

19 Linearly Inseparable Case

20 Linearly Inseparable Case
Compare the Support Vector Machine with regularized logistic regression (see the loss-function comparison below).
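The comparison on the slide was graphical; in loss-function terms (standard, not preserved in the transcript), both methods minimize a regularized loss of the form (λ/2)||w||² + Σ_i ℓ(y_i (w·x_i + b)): the SVM uses the hinge loss ℓ(z) = max(0, 1 - z), while regularized logistic regression uses the logistic loss ℓ(z) = log(1 + e^(-z)), a smooth approximation to the hinge.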

21 Dual Form of SVM
How do we decide b?
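The dual problem appeared as an image; its standard form for the soft-margin SVM is

    max_α  Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i·x_j)
    s.t.   0 ≤ α_i ≤ C,   Σ_i α_i y_i = 0

with w = Σ_i α_i y_i x_i. As for b: any support vector with 0 < α_i < C lies exactly on its margin plane, y_i (w·x_i + b) = 1, so b = y_i - w·x_i for any such point (in practice, averaged over all of them).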

22 Dual Form of SVM

23 Suppose we're in 1-dimension
What would SVMs do with this data? [Figure: positive and negative points on the x-axis.]

24 Suppose we're in 1-dimension
Not a big surprise: the boundary is a single point, with a positive "plane" and a negative "plane" on either side of it.

25 Harder 1-dimensional Dataset
[Figure: a 1-D dataset that no single threshold on x can separate.]

26 Harder 1-dimensional Dataset

27 Harder 1-dimensional Dataset
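Slides 25-27 were figures. The classic resolution (standard, and consistent with the basis-function slide that follows): map each point to z = (x, x²). Points that interleave on the line, say negatives at the ends and positives in the middle, become linearly separable in this two-dimensional feature space, because a threshold on x² separates them.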

28 Common SVM Basis Functions
Polynomial terms of x of degree 1 to q
Radial (Gaussian) basis functions

29 Quadratic Basis Functions
The quadratic basis expansion contains a constant term, the linear terms, the pure quadratic terms, and the quadratic cross-terms. Number of terms (assuming m input dimensions): (m+2)-choose-2 = (m+2)(m+1)/2 ≈ m²/2 for large m.
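The feature vector itself was an image; in standard notation (with the √2 scaling used on Moore's original slides, chosen so the dot products below work out):

    Φ(x) = (1, √2·x_1, …, √2·x_m, x_1², …, x_m², √2·x_1 x_2, …, √2·x_{m-1} x_m)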

30 Dual Form of SVM

31 Dual Form of SVM
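These two slides were equations; their point, in standard form: with basis functions, the dual problem is unchanged except that every dot product x_i·x_j becomes Φ(x_i)·Φ(x_j). The whole algorithm only ever touches the data through dot products in feature space.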

32 Quadratic Dot Products

33 Quadratic Dot Products
Just out of casual, innocent interest, let's look at another function of a and b:
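The function (an image on the slide; this is the standard identity for the √2-scaled quadratic features above) is

    Φ(a)·Φ(b) = (a·b + 1)²

so the feature-space dot product costs O(m) to evaluate rather than the O(m²) of building Φ explicitly.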

34 Kernel Trick

35 Kernel Trick
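As a concrete illustration of the trick (not from the slides, just a minimal scikit-learn sketch, assuming sklearn is installed; gamma=1 makes the kernel exactly (a·b + 1)²):

    from sklearn.svm import SVC

    # A 1-D dataset that no single threshold separates: negatives flank
    # positives, as in the "harder 1-dimensional dataset" slides.
    X = [[-3.0], [-2.0], [-1.0], [0.0], [1.0], [2.0], [3.0]]
    y = [-1, -1, 1, 1, 1, -1, -1]

    # Degree-2 polynomial kernel k(a, b) = (a.b + 1)^2: equivalent to the
    # quadratic feature map, but Phi(x) is never built explicitly.
    clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=10.0)
    clf.fit(X, y)
    print(clf.predict([[-0.5], [2.5]]))   # expected: [ 1 -1]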

36 SVM Kernel Functions
Polynomial kernel function
Radial basis kernel function (a universal kernel)
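The formulas were images; the standard forms:

    Polynomial (degree q):     k(x, y) = (x·y + 1)^q
    Radial basis (Gaussian):   k(x, y) = exp(-||x - y||² / (2σ²))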

37 Kernel Tricks
Replace the dot product with a kernel function. But not all functions are kernel functions: are the candidate functions shown on the slide (not preserved in this transcript) kernel functions?

38 Kernel Tricks
Mercer's condition: to expand a kernel function k(x, y) into a dot product, i.e. k(x, y) = Φ(x)·Φ(y), k(x, y) has to be a positive semi-definite function; i.e., for any function f(x) for which ∫ f(x)² dx is finite, the following inequality holds:

    ∫∫ k(x, y) f(x) f(y) dx dy ≥ 0

39 Kernel Tricks
Introduce nonlinearity into the model
Computationally efficient

40 Nonlinear Kernel (I)

41 Nonlinear Kernel (II)

42 Reproducing Kernel Hilbert Space (RKHS)
A kernel k induces a Hilbert space H via its eigendecomposition; elements of H are expansions in the eigenfunctions, and k has the reproducing property in H.
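The three formulas were images; the standard statements (an assumption about exactly what the slide showed):

    Eigendecomposition:     k(x, y) = Σ_i λ_i φ_i(x) φ_i(y),   λ_i ≥ 0
    Elements of H:          f(x) = Σ_i a_i φ_i(x),   with   ||f||²_H = Σ_i a_i² / λ_i < ∞
    Reproducing property:   ⟨f, k(x, ·)⟩_H = f(x)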

43 Reproducing Kernel Hilbert Space (RKHS)
Representer theorem
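The theorem was an image; its standard statement: for any loss L and any strictly increasing regularizer Ω, every minimizer over H of

    Σ_i L(y_i, f(x_i)) + Ω(||f||_H)

can be written as a finite expansion over the training points,

    f(x) = Σ_i α_i k(x_i, x)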

44 Kernelize Logistic Regression
How can we introduce nonlinearity into the logistic regression model?
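By the representer theorem, the standard kernelization (the model on the slide was an image) replaces the linear score w·x + b with a kernel expansion:

    p(y | x) = 1 / (1 + exp(-y (Σ_i α_i k(x_i, x) + b)))

and the α_i are learned by regularized maximum likelihood.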

45 Diffusion Kernel
A kernel function describes the correlation or similarity between two data points. Suppose we are given a similarity function s(x, y) that is non-negative and symmetric but does not obey Mercer's condition. How can we generate a kernel function based on this similarity function? A graph-theory approach …

46 Diffusion Kernel
Create a graph over all the training examples: each vertex corresponds to a data point, and the weight of each edge is the similarity s(x, y). The graph Laplacian of this graph is negative semi-definite.
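The Laplacian's definition was an image; with the sign convention that makes it negative semi-definite (an assumption consistent with the slide's claim):

    L_ij = s(x_i, x_j)   for i ≠ j,      L_ii = -Σ_{j≠i} s(x_i, x_j)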

47 Diffusion Kernel
Consider a simple Laplacian L, and then consider L², L³, … What do these matrices represent? (Powers of L aggregate similarity along paths of length 2, 3, … through the graph.) Together they yield a diffusion kernel.
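The kernel itself was an image; the standard diffusion kernel is the matrix exponential of the Laplacian,

    K = e^{βL} = Σ_{k=0}^∞ (β^k / k!) L^k,   β > 0

where β controls how far similarity diffuses through the graph.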

48 Diffusion Kernel: Properties
The diffusion kernel is positive definite. How do we compute it?
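The computation was shown as an equation; the standard route uses the eigendecomposition of the symmetric Laplacian: if L = UΛUᵀ then e^{βL} = U e^{βΛ} Uᵀ, i.e. exponentiate the eigenvalues. A small numerical sketch (assumes NumPy and SciPy; the 3-point graph is illustrative):

    import numpy as np
    from scipy.linalg import expm

    # Symmetric similarity (edge-weight) matrix for 3 example points.
    S = np.array([[0.0, 0.9, 0.1],
                  [0.9, 0.0, 0.5],
                  [0.1, 0.5, 0.0]])

    # Laplacian with the slide's sign convention: off-diagonal s(i, j),
    # diagonal minus the row sum (negative semi-definite).
    L = S - np.diag(S.sum(axis=1))

    beta = 0.5
    K = expm(beta * L)                          # diffusion kernel K = e^{beta L}
    print(np.all(np.linalg.eigvalsh(K) > 0))    # positive definite -> True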

49 Doing Multi-class Classification
SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done? Answer: with output arity N, learn N SVMs:
SVM 1 learns "Output == 1" vs "Output != 1"
SVM 2 learns "Output == 2" vs "Output != 2"
:
SVM N learns "Output == N" vs "Output != N"
Then, to predict the output for a new input, predict with each SVM and find out which one puts the prediction the furthest into the positive region (the decision rule is given below).
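In symbols (the standard one-vs-all decision rule, not preserved in the transcript): train (w_c, b_c) for each class c and predict

    ŷ = argmax_c  (w_c·x + b_c)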

50 Kernel Learning
What if we have multiple kernel functions? Which one is the best? How can we combine multiple kernels?
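A common answer (standard multiple kernel learning, not spelled out in the transcript): learn a non-negative combination

    k(x, y) = Σ_j μ_j k_j(x, y),   μ_j ≥ 0

which is again a valid kernel, and optimize the weights μ_j jointly with the SVM.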

51 Kernel Learning

52 References
An excellent tutorial on VC-dimension and Support Vector Machines: C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
The VC/SRM/SVM Bible (not for beginners, including myself): Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
Software: SVM-light, free download.

