An Introduction to Support Vector Machines


1 An Introduction to Support Vector Machines

2 Outline
What is a good decision boundary for a binary classification problem?
From minimizing the misclassification error to maximizing the margin
Two classes, not linearly separable: how to deal with some noisy data
How to make the SVM non-linear: kernels
Conclusion

3 Two-Class Problem: Linearly Separable Case
Minimizing the misclassification error: many decision boundaries can separate these two classes without any misclassification. Which one should we choose?
The perceptron learning rule can be used to find any decision boundary between Class 1 and Class 2 (a sketch of the rule follows below).
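For reference, a minimal sketch of the perceptron learning rule mentioned above (the NumPy layout of X and y, and the stopping criterion, are assumptions; labels are taken to be +1/-1):

```python
import numpy as np

def perceptron(X, y, lr=1.0, max_epochs=100):
    """Perceptron learning rule: finds *some* separating boundary
    (not necessarily the maximum-margin one) if the data are separable."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # point is misclassified
                w += lr * yi * xi        # nudge the boundary toward it
                b += lr * yi
                errors += 1
        if errors == 0:                  # all points classified correctly
            break
    return w, b
```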

4 Maximizing the Margin
The decision boundary should be as far away from the data of both classes as possible.
In other words, we should maximize the margin, m.
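With the usual normalization (assumed here) that the points closest to the boundary satisfy |w^T x + b| = 1, the margin is

```latex
m = \frac{2}{\|\mathbf{w}\|}
```

so maximizing the margin is equivalent to minimizing \|\mathbf{w}\|^2.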

5 The Optimization Problem
Let {x1, ..., xn} be our data set and let yi ∈ {1, -1} be the class label of xi.
The decision boundary should classify all points correctly ⇒ a constrained optimization problem (written out below).
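Under the same normalization as above, the standard form of this constrained optimization problem is:

```latex
\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\|\mathbf{w}\|^2
\qquad \text{subject to} \qquad
y_i\,(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, n
```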

6 The Dual Problem
We can transform the problem into its dual (given below).
This is a quadratic programming (QP) problem: a global maximum over the αi can always be found.
w can be recovered from the αi.
Let x(1) and x(-1) be two support vectors, one from each class; then b = -1/2 (w^T x(1) + w^T x(-1)).
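The standard hard-margin dual, which is what the recovery of w and b above relies on, reads:

```latex
\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
      \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j
\qquad \text{subject to} \qquad
\alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
```

with the weight vector recovered as

```latex
\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i\, \mathbf{x}_i
```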

7 A Geometrical Interpretation
[Figure: the two classes with the Lagrange multiplier of each point; only the points on the margin have non-zero values (α1 = 0.8, α6 = 1.4, α8 = 0.6), while the rest have αi = 0.]
So if we move the interior points (those with αi = 0), there is no effect on the decision boundary.

8 Characteristics of the Solution
Many of the αi are zero: w is a linear combination of a small number of data points (a sparse representation).
The xi with non-zero αi are called support vectors (SV); the decision boundary is determined only by the SVs.
Let tj (j = 1, ..., s) be the indices of the s support vectors. We can then write w as a sum over the SVs alone (see below).
For testing with new data z: compute the decision value and classify z as class 1 if it is positive, and class 2 otherwise.
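Written out in the standard notation (assumed here), the support-vector expansion and the test-time score are:

```latex
\mathbf{w} = \sum_{j=1}^{s} \alpha_{t_j}\, y_{t_j}\, \mathbf{x}_{t_j},
\qquad
f(\mathbf{z}) = \mathbf{w}^\top \mathbf{z} + b
             = \sum_{j=1}^{s} \alpha_{t_j}\, y_{t_j}\, \mathbf{x}_{t_j}^\top \mathbf{z} + b
```

with z assigned to class 1 if f(z) > 0 and to class 2 otherwise.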

9 Some Notes
There are theoretical upper bounds on the error of an SVM on unseen data:
The larger the margin, the smaller the bound.
The smaller the number of support vectors, the smaller the bound.
Note that in both training and testing, the data enter only through inner products x^T y.
This is important for generalizing to the non-linear case.

10 What If the Data Are Not Linearly Separable?
We allow an “error” ξi in the classification of each point, so that some noisy data can be tolerated.

11 Soft Margin Hyperplane
Define ξi = 0 if there is no error for xi; the ξi are just “slack variables” in optimization theory.
We want to minimize the margin term together with the total slack, where C is a trade-off parameter between error and margin.
The optimization problem becomes the soft-margin problem below.
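In its usual form (assumed here), the soft-margin problem is:

```latex
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i\,(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
```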

12 The Optimization Problem
The dual of this problem (given below) has the same form as before, and w is also recovered in the same way.
The only difference from the linearly separable case is that there is now an upper bound C on each αi.
Once again, a QP solver can be used to find the αi.
Note also that everything is still expressed through inner products.
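The corresponding standard dual (assumed here) differs from the separable case only in the box constraint on the αi:

```latex
\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{n} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
      \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i^\top \mathbf{x}_j
\qquad \text{subject to} \qquad
0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
```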

13 Extension to Non-linear Decision Boundary
In most situations, the decision boundary we are looking for should NOT be a straight line.
[Figure: a transformation f(·) maps the points from the input space into a feature space.]

14 Extension to Non-linear Decision Boundary
Key idea: use a function f(x) to transform xi to a higher-dimensional space, to “make life easier”.
Input space: the space the xi live in.
Feature space: the space of the f(xi) after the transformation.
We search for a hyperplane in feature space that maximizes the margin; this hyperplane corresponds to a curved boundary in the input space.
Why transform? We still like the idea of maximizing the margin, and the classifier becomes more powerful and more flexible.
XOR example: from x1, x2 we transform to x1^2, x2^2, x1 x2 (see the sketch below).
It can also be viewed as feature extraction from the feature vector x, except that we now extract more features than x originally has.
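A quick check of the XOR example (assuming the four points (±1, ±1), with the label given by the sign of x1·x2): under the transformation

```latex
f(x_1, x_2) = \left(x_1^2,\; x_2^2,\; x_1 x_2\right)
```

the first two coordinates equal 1 for every point, while the third equals the label itself, so a single hyperplane in the feature space separates the two classes even though no line in the input space can.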

15 Transformation and Kernel

16 Kernel: Efficient computation
Define the kernel function K(x, y) as the inner product of the transformed points (written out below).
Consider the following transformation as an example.
In practice we don’t need to worry about the transformation function f(x); all we have to do is select a good kernel for our problem.
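As a minimal sketch (the degree-2 polynomial map here is an illustrative assumption, not necessarily the transformation on the original slide), the kernel is defined through the transformation as

```latex
K(\mathbf{x}, \mathbf{y}) = f(\mathbf{x})^\top f(\mathbf{y}),
\qquad \text{e.g.} \qquad
f(x_1, x_2) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2\right)
\;\Longrightarrow\;
K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^\top \mathbf{y})^2
```

so K can be evaluated directly from the inner product in the input space, without ever constructing f(x) or f(y).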

17 Examples of Kernel Functions
Polynomial kernel with degree d.
Radial basis function (RBF) kernel with width σ; closely related to radial basis function neural networks.
Research on different kernel functions for different applications is very active.
Despite violating Mercer's condition, the sigmoid kernel can still work in practice.
(The usual forms of these kernels are written out below.)
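The usual forms of these kernels (the exact parameterizations on the original slide may differ slightly) are:

```latex
\begin{aligned}
K(\mathbf{x}, \mathbf{y}) &= (\mathbf{x}^\top \mathbf{y} + 1)^d
    && \text{polynomial of degree } d \\
K(\mathbf{x}, \mathbf{y}) &= \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right)
    && \text{radial basis function of width } \sigma \\
K(\mathbf{x}, \mathbf{y}) &= \tanh\!\left(\kappa\, \mathbf{x}^\top \mathbf{y} + \theta\right)
    && \text{sigmoid}
\end{aligned}
```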

18 Summary: Steps for Classification
Prepare the data matrix.
Select the kernel function to use.
Select the parameters of the kernel function and the value of C. You can use the values suggested by the SVM software, or you can set aside a validation set to determine them.
Run the training algorithm and obtain the αi.
Unseen data can then be classified using the αi and the support vectors.
(A sketch of these steps with an off-the-shelf SVM library follows below.)
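A minimal sketch of these steps with scikit-learn (the library choice, the toy data, and the specific parameter values are assumptions, not part of the original slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Prepare the data matrix (toy example: two noisy Gaussian blobs).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Select the kernel, its parameter (gamma for the RBF width), and the
# trade-off C, e.g. by checking a few candidates on the validation set.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)

# Run the training algorithm; the alpha_i and support vectors are stored
# internally (clf.dual_coef_ holds alpha_i * y_i, clf.support_vectors_ the SVs).
clf.fit(X_train, y_train)

# Classify unseen data using the learned support vectors.
print("validation accuracy:", clf.score(X_val, y_val))
```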

19 Classification result of SVM

20 Conclusion
SVMs are among the most popular tools for binary classification of numeric data.
Key ideas of the SVM:
Maximizing the margin can lead to a “good” classifier.
Transformation to a higher-dimensional space makes the classifier more flexible.
The kernel trick keeps the computation efficient.
Weakness of the SVM: it needs a “good” kernel function.

21 Resources

