Support Vector Machines

Support Vector Machines Outline: reminder of the perceptron; large-margin linear classifier; the non-separable case.

Linearly separable case Every weight vector in the grey region is a solution vector; the region is called the “solution region”. A vector in the middle of the region looks better, and we can impose additional conditions to select it.

Gradient descent procedure
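As a reminder, the generic gradient descent iteration that this procedure follows (η(k) is the learning rate and J the criterion function being minimized) can be sketched as:

```latex
a(k+1) \;=\; a(k) \;-\; \eta(k)\,\nabla J\!\left(a(k)\right).
```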

Perceptron Y(a) is the set of samples misclassified by a; when Y(a) is empty, define J(a) = 0. The perceptron criterion is J(a) = Σ_{y ∈ Y(a)} (−aᵗy). Because aᵗy < 0 when y is misclassified, J(a) is non-negative. The gradient is simply ∇J(a) = Σ_{y ∈ Y(a)} (−y), so the update rule is a(k+1) = a(k) + η(k) Σ_{y ∈ Y(a(k))} y, where η(k) is the learning rate.
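A minimal Python sketch of the batch perceptron update above. It assumes the samples have already been "normalized" (class-ω2 samples negated and augmented with a constant feature), so that a sample y counts as misclassified when aᵗy ≤ 0; the step size and iteration cap are illustrative.

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron. Y is an (n, d) array of 'normalized' augmented
    samples (class-2 points negated), so y is misclassified when a @ y <= 0."""
    n, d = Y.shape
    a = np.zeros(d)
    for k in range(max_iter):
        mis = Y[Y @ a <= 0]            # Y(a): currently misclassified samples
        if len(mis) == 0:              # J(a) = 0: all samples classified correctly
            break
        a = a + eta * mis.sum(axis=0)  # a(k+1) = a(k) + eta * sum of misclassified y
    return a
```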

Perceptron

Perceptron

Perceptron The perceptron adjusts a only according to misclassified samples; correctly classified samples are ignored. The final a is a linear combination of the training points. Good test-sample performance requires a large set of training samples; however, a large training set is almost certainly not linearly separable. When the data are not linearly separable, the iteration never stops. To make sure it converges, we can let η(k) → 0 as k → ∞. However, how should the rate of decrease be chosen?

Large-margin linear classifier Assume the linearly separable case. The optimal separating hyperplane f(x) = wᵗx + w0 separates the two classes and maximizes the distance to the closest point. This gives a unique solution and better test-sample performance.

Large-margin linear classifier {x1, ..., xn}: our training data in d dimensions; yi ∈ {1, −1}: the class labels. Our goal: among all f(x) that separate the data, find the optimal separating hyperplane, i.e. the one with the largest margin M; a standard way to write this is sketched below.
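A sketch of the standard formulation of this goal, in the β, β0 notation used on the following slides:

```latex
\max_{\beta,\,\beta_0,\,\|\beta\|=1} \; M
\quad \text{subject to} \quad
y_i\,(x_i^{T}\beta + \beta_0) \;\ge\; M, \qquad i = 1,\dots,n.
```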

Large-margin linear classifier The border is a distance M away from the hyperplane; M is called the “margin”. Drop the ||β|| = 1 requirement and let M = 1/||β||; the easier, equivalent version is sketched below.
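With M = 1/||β||, the problem becomes the usual convex quadratic program (a standard form, equivalent to minimizing ||β||):

```latex
\min_{\beta,\,\beta_0} \; \tfrac{1}{2}\,\|\beta\|^{2}
\quad \text{subject to} \quad
y_i\,(x_i^{T}\beta + \beta_0) \;\ge\; 1, \qquad i = 1,\dots,n.
```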

Large-margin linear classifier

Non-separable case When the two classes are not linearly separable, allow slack variables ξi ≥ 0 for the points on the wrong side of the border.

Non-separable case The optimization problem becomes the slack-variable form sketched below. ξi = 0 when the point is on the correct side of the margin; ξi > 1 when the point passes the hyperplane to the wrong side; 0 < ξi < 1 when the point is inside the margin but still on the correct side.
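A sketch of the slack-variable formulation referred to above; the bound K on the total slack is the "constant" that C replaces on the next slide:

```latex
\max_{\beta,\,\beta_0,\,\|\beta\|=1} \; M
\quad \text{subject to} \quad
y_i\,(x_i^{T}\beta + \beta_0) \;\ge\; M\,(1 - \xi_i),
\qquad \xi_i \ge 0, \qquad \sum_i \xi_i \le K, \qquad i = 1,\dots,n.
```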

Non-separable case When a point is outside the boundary, ξi = 0 and it does not play a big role in determining the boundary: the method is not forcing any special class of distribution.

Computation An equivalent form replaces the constant (the bound on Σ ξi) with a cost parameter C; see the sketch below. For the separable case, C = ∞.
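A sketch of this equivalent "cost" form:

```latex
\min_{\beta,\,\beta_0} \; \tfrac{1}{2}\,\|\beta\|^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
\xi_i \ge 0, \qquad y_i\,(x_i^{T}\beta + \beta_0) \;\ge\; 1 - \xi_i, \qquad i = 1,\dots,n.
```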

Computation This is a quadratic programming problem. Form the Lagrange function, take derivatives with respect to β, β0 and ξi and set them to zero, and add the positivity constraints; a sketch is given below.
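A sketch of the standard Lagrange (primal) function and the conditions obtained from it:

```latex
L_P = \tfrac{1}{2}\,\|\beta\|^{2} + C\sum_i \xi_i
      - \sum_i \alpha_i\left[\,y_i(x_i^{T}\beta + \beta_0) - (1-\xi_i)\,\right]
      - \sum_i \mu_i\,\xi_i .
```

Setting the derivatives with respect to β, β0 and ξi to zero gives

```latex
\beta = \sum_i \alpha_i y_i x_i, \qquad
\sum_i \alpha_i y_i = 0, \qquad
\alpha_i = C - \mu_i \quad \forall i,
```

together with the positivity constraints αi ≥ 0, μi ≥ 0, ξi ≥ 0.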

Computation Substituting the three lower (stationarity) equations into the top one gives the Lagrangian dual objective function; it and the Karush-Kuhn-Tucker conditions are sketched below.
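A sketch of the resulting dual objective and the Karush-Kuhn-Tucker conditions:

```latex
L_D = \sum_i \alpha_i
      - \tfrac{1}{2}\sum_i \sum_{i'} \alpha_i \alpha_{i'}\, y_i y_{i'}\, x_i^{T} x_{i'},
\qquad \text{maximized subject to } 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0,
```

and the KKT conditions include

```latex
\alpha_i\left[\,y_i(x_i^{T}\beta + \beta_0) - (1-\xi_i)\,\right] = 0, \qquad
\mu_i\,\xi_i = 0, \qquad
y_i(x_i^{T}\beta + \beta_0) - (1-\xi_i) \ge 0 .
```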

Computation From the stationarity conditions, the solution for β has the form β = Σi αi yi xi, with non-zero coefficients αi only for those points i for which yi(xiᵀβ + β0) = 1 − ξi. These points are called “support vectors”. Some lie on the edge of the margin (ξi = 0, with 0 < αi ≤ C); the remainder have ξi > 0 and αi = C, and are on the wrong side of the margin.

Computation

Computation With a smaller C, 85% of the points are support points.

Support Vector Machines Enlarge the feature space to make the procedure more flexible, using basis functions h(x) = (h1(x), ..., hM(x)). Use the same procedure to construct the SV classifier in the enlarged space; the decision is made by the sign of f(x).

SVM Recall the decision function in the original (linear) space, and its form with the new basis; both are sketched below.
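A sketch of the two forms referred to above, in the notation of the earlier slides:

```latex
\text{Linear space: } f(x) = x^{T}\beta + \beta_0,
\qquad
\text{with new basis: } f(x) = h(x)^{T}\beta + \beta_0 .
```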

SVM When domain knowledge is available, we can sometimes use explicit transformations; often we cannot.

SVM h(x) is involved ONLY in the form of inner products! So as long as we define the kernel function K(x, x'), which computes the inner product in the transformed space, we don't need to know what h(x) itself is: the “kernel trick”. Some commonly used kernels are listed below.
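Standard examples of such kernels (the degree d and the parameters γ, κ1, κ2 are user-chosen): the dth-degree polynomial, radial basis, and sigmoid kernels,

```latex
K(x, x') = (1 + \langle x, x'\rangle)^{d}, \qquad
K(x, x') = \exp\!\left(-\gamma\,\|x - x'\|^{2}\right), \qquad
K(x, x') = \tanh\!\left(\kappa_1 \langle x, x'\rangle + \kappa_2\right).
```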

SVM Recall that αi = 0 for non-support vectors, so f(x) depends only on the support vectors; a sketch is given below.
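A minimal Python sketch of this decision function, f(x) = Σi αi yi K(x, xi) + β0, assuming the dual coefficients, labels, support vectors and intercept are given; the names (rbf_kernel, sv, alpha, b0) are illustrative.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def decision_function(x, sv, y, alpha, b0, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i * y_i * K(x, x_i) + b0.
    Only support vectors need to be passed in, since alpha_i = 0 elsewhere."""
    return sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, y, sv)) + b0

# The class label is then the sign of f(x):
# label = np.sign(decision_function(x_new, sv, y_sv, alpha_sv, b0))
```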

SVM K(x, x') can be seen as a similarity measure between x and x'. The decision is essentially made by a weighted sum of the similarities between the object and the support vectors.

SVM

SVM Bayes error: 0.029. When noise features are present, the SVM suffers from not being able to concentrate on a subspace: all terms of the form 2XjXj′ are given equal weight.

SVM How do we select the kernel and its parameters? Domain knowledge: how complex should the space partition be? Should the surface be smooth? Compare models by their approximate test error rate, e.g. by cross-validation: fit the data with multiple kernels/parameter settings, estimate the error rate for each setting, and select the best-performing one. Parameter optimization methods can automate this search; a sketch is given below.
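A minimal sketch of such a cross-validated search using scikit-learn's SVC and GridSearchCV; the grid values are illustrative, and X_train, y_train stand for your training data.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Candidate kernels and parameter values (illustrative choices).
param_grid = [
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1.0], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
]

# 5-fold cross-validation: fit each setting, estimate its error rate,
# and keep the best-performing one.
search = GridSearchCV(SVC(), param_grid, cv=5)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```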