Support Vector Machines (and Kernel Methods in general)


Support Vector Machines (and Kernel Methods in general). Machine Learning, March 23, 2010

Last Time: Multilayer Perceptron / Logistic Regression Networks; Neural Networks; Error Backpropagation

Today: Support Vector Machines. Note: we'll rely on some math from constrained optimization theory that we won't derive.

Maximum Margin. The perceptron (and other linear classifiers) can lead to many equally valid choices for the decision boundary. Are these really "equally valid"?

Max Margin. How can we pick which is best? Maximize the size of the margin. (Figure: a small-margin boundary vs. a large-margin boundary; are these really "equally valid"?)

Support Vectors. Support vectors are those input points (vectors) closest to the decision boundary. 1. They are vectors. 2. They "support" the decision hyperplane.

Support Vectors. Define this as a decision problem. The decision hyperplane: $\vec{w}^T\vec{x} + b = 0$. No fancy math, just the equation of a hyperplane.

Support Vectors. Aside: why do some classifiers use labels in $\{0,1\}$ while others use $\{-1,+1\}$? Simplicity of the math and interpretation. For probability density function estimation, $\{0,1\}$ has a clear correlate. For classification, a decision boundary of 0 is more easily interpretable than 0.5.

Support Vectors. Define this as a decision problem. The decision hyperplane: $\vec{w}^T\vec{x} + b = 0$. Decision function: $D(\vec{x}) = \operatorname{sign}(\vec{w}^T\vec{x} + b)$.

Support Vectors. Define this as a decision problem. The decision hyperplane: $\vec{w}^T\vec{x} + b = 0$. Margin hyperplanes: $\vec{w}^T\vec{x} + b = 1$ and $\vec{w}^T\vec{x} + b = -1$.

Support Vectors. The decision hyperplane: $\vec{w}^T\vec{x} + b = 0$. Scale invariance: we can rescale $\vec{w}$ and $b$ by any positive constant without changing the hyperplane, so we choose the scale at which the nearest points satisfy $\vec{w}^T\vec{x} + b = \pm 1$. This scaling does not change the decision hyperplane or the support vector hyperplanes, but it eliminates a variable from the optimization.
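A brief elaboration of the scaling step (my wording, not on the slide): for any $\kappa > 0$, $(\kappa\vec{w}, \kappa b)$ defines the same hyperplane $\vec{w}^T\vec{x} + b = 0$, so we are free to choose
$$\kappa = \frac{1}{\min_i |\vec{w}^T\vec{x}_i + b|},\qquad (\vec{w}, b) \leftarrow (\kappa\vec{w}, \kappa b),$$
which puts the nearest points exactly on the margin hyperplanes $\vec{w}^T\vec{x} + b = \pm 1$.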

What are we optimizing? We will represent the size of the margin in terms of $\vec{w}$. This will allow us to simultaneously identify a decision boundary and maximize the margin.

How do we represent the size of the margin in terms of $\vec{w}$? There must be at least one point that lies on each support hyperplane. Proof outline: if not, we could define a larger-margin pair of support hyperplanes that does touch the nearest point(s).

How do we represent the size of the margin in terms of $\vec{w}$? There must be at least one point that lies on each support hyperplane. Thus: $\vec{w}^T\vec{x}_1 + b = 1$ for some point $\vec{x}_1$. And: $\vec{w}^T\vec{x}_2 + b = -1$ for some point $\vec{x}_2$.

How do we represent the size of the margin in terms of $\vec{w}$? The vector $\vec{w}$ is perpendicular to the decision hyperplane: for any two points $\vec{x}_a$, $\vec{x}_b$ on the hyperplane, $\vec{w}^T(\vec{x}_a - \vec{x}_b) = 0$, and if the dot product of two vectors equals zero, the two vectors are perpendicular.

How do we represent the size of the margin in terms of w? The margin is the projection of x1 – x2 onto w, the normal of the hyperplane.

Aside: Vector Projection. The projection of $\vec{v}$ onto $\vec{w}$ is $\frac{\vec{v}\cdot\vec{w}}{\|\vec{w}\|^2}\vec{w}$; its length is $\frac{\vec{v}\cdot\vec{w}}{\|\vec{w}\|}$.

How do we represent the size of the margin in terms of $\vec{w}$? The margin is the projection of $\vec{x}_1 - \vec{x}_2$ onto $\vec{w}$, the normal of the hyperplane. Projection: $\frac{\vec{w}\cdot(\vec{x}_1 - \vec{x}_2)}{\|\vec{w}\|}$. Size of the margin: $\frac{2}{\|\vec{w}\|}$.
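A short worked derivation of the margin size (not spelled out on the original slide), combining the two support hyperplane equations above:
$$\vec{w}^T\vec{x}_1 + b = 1,\qquad \vec{w}^T\vec{x}_2 + b = -1 \;\Rightarrow\; \vec{w}\cdot(\vec{x}_1 - \vec{x}_2) = 2 \;\Rightarrow\; \text{margin} = \frac{\vec{w}\cdot(\vec{x}_1 - \vec{x}_2)}{\|\vec{w}\|} = \frac{2}{\|\vec{w}\|}$$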

Maximizing the margin. Goal: maximize the margin $\frac{2}{\|\vec{w}\|}$, which is equivalent to minimizing $\frac{1}{2}\|\vec{w}\|^2$, subject to linear separability of the data by the decision boundary: $t_i(\vec{w}^T\vec{x}_i + b) \ge 1$ for all $i$.

Max Margin Loss Function. Since this is constrained optimization, use Lagrange multipliers. Optimize the "primal": $L(\vec{w},b) = \frac{1}{2}\vec{w}\cdot\vec{w} - \sum_{i=0}^{N-1}\alpha_i\left(t_i(\vec{w}^T\vec{x}_i + b) - 1\right)$.

Max Margin Loss Function. Optimize the "primal". Partial with respect to $b$: $\frac{\partial L(\vec{w},b)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=0}^{N-1}\alpha_i t_i = 0$.

Max Margin Loss Function. Optimize the "primal". Partial with respect to $\vec{w}$:
$$\frac{\partial L(\vec{w},b)}{\partial \vec{w}} = 0 \;\Rightarrow\; \vec{w} - \sum_{i=0}^{N-1}\alpha_i t_i \vec{x}_i = 0 \;\Rightarrow\; \vec{w} = \sum_{i=0}^{N-1}\alpha_i t_i \vec{x}_i$$
Now we have to find the $\alpha_i$: substitute back into the loss function.

Max Margin Loss Function. Construct the "dual":
$$W(\alpha) = \sum_{i=0}^{N-1}\alpha_i - \frac{1}{2}\sum_{i,j=0}^{N-1}\alpha_i\alpha_j t_i t_j(\vec{x}_i\cdot\vec{x}_j)$$

Dual formulation of the error. Optimize this quadratic program to identify the Lagrange multipliers, and thus the weights. There exist (extremely) fast approaches to quadratic optimization in C, C++, Python, Java, and R.
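A minimal sketch of solving this dual with an off-the-shelf QP solver (cvxopt is my choice here, not the lecture's; function and variable names are mine). It assumes linearly separable data `X` of shape (N, d) and labels `t` in {-1, +1}, and recovers `w` and `b` from the multipliers as in the slides that follow.

```python
import numpy as np
from cvxopt import matrix, solvers

def fit_hard_margin_svm(X, t):
    """Solve the hard-margin SVM dual: max sum(alpha) - 0.5 * alpha'P alpha,
    subject to alpha_i >= 0 and sum_i alpha_i t_i = 0."""
    N = X.shape[0]
    t = t.astype(float)
    K = X @ X.T                                   # Gram matrix of dot products x_i . x_j
    P = matrix(np.outer(t, t) * K)                # P_ij = t_i t_j (x_i . x_j)
    q = matrix(-np.ones((N, 1)))                  # maximizing sum(alpha) == minimizing -sum(alpha)
    G = matrix(-np.eye(N))                        # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros((N, 1)))
    A = matrix(t.reshape(1, -1))                  # equality constraint: sum_i alpha_i t_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

    sv = alpha > 1e-6                             # support vectors have non-zero alpha
    w = ((alpha * t)[:, None] * X).sum(axis=0)    # w = sum_i alpha_i t_i x_i
    b0 = np.mean(t[sv] - X[sv] @ w)               # b from support vectors: t_i = w.x_i + b
    return w, b0, alpha
```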

Quadratic Programming. The dual is a quadratic program: optimize $f(\vec{x}) = \frac{1}{2}\vec{x}^TQ\vec{x} + \vec{c}^T\vec{x}$ subject to linear constraints. If $Q$ is positive semidefinite, then $f(\vec{x})$ is convex. If $f(\vec{x})$ is convex, then there is a single global optimum.

Support Vector Expansion. New decision function: $D(\vec{x}) = \operatorname{sign}(\vec{w}^T\vec{x} + b) = \operatorname{sign}\left(\sum_{i=0}^{N-1}\alpha_i t_i(\vec{x}_i\cdot\vec{x}) + b\right)$. Independent of the dimension of $\vec{x}$! When $\alpha_i$ is non-zero, $\vec{x}_i$ is a support vector; when $\alpha_i$ is zero, $\vec{x}_i$ is not a support vector.
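A small sketch of that decision function (helper names are my own), using the `alpha` and `b` recovered from the quadratic program above:

```python
import numpy as np

def decide(x_new, X, t, alpha, b):
    """Support vector expansion: sign(sum_i alpha_i t_i (x_i . x_new) + b)."""
    # Only the support vectors (alpha_i > 0) contribute; the other terms vanish.
    return np.sign(np.sum(alpha * t * (X @ x_new)) + b)
```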

Kuhn-Tucker Conditions. In constrained optimization, at the optimal solution: constraint $\times$ Lagrange multiplier $= 0$:
$$\alpha_i\left(1 - t_i(\vec{w}^T\vec{x}_i + b)\right) = 0$$
Only points on the margin hyperplanes (the support vectors) contribute to the solution!

Visualization of Support Vectors

Interpretability of SVM parameters. What else can we tell from the alphas? If $\alpha_i$ is large, then the associated data point strongly constrains the decision boundary: it is either an outlier or a particularly important example. But this only gives us the best solution for linearly separable data sets…

Basis of Kernel Methods. The decision process doesn't depend on the dimensionality of the data, so we can map the data to a higher-dimensional space. Note: data points only appear within a dot product; the error is based on the dot products of data points, not the data points themselves.

Basis of Kernel Methods. Since data points only appear within a dot product, we can map to another space through a replacement: $\vec{x}_i\cdot\vec{x}_j \rightarrow \phi(\vec{x}_i)\cdot\phi(\vec{x}_j) = K(\vec{x}_i,\vec{x}_j)$. The error is still based on the dot product of data points, not the data points themselves.
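A sketch of that replacement in code (the RBF kernel is an illustrative choice, not one the slides commit to): the Gram matrix `K = rbf_kernel(X, X)` takes the place of `X @ X.T` in the dual, and the decision function uses kernel evaluations instead of dot products.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs of rows."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_decide(x_new, X, t, alpha, b, gamma=1.0):
    """Kernelized decision: sign(sum_i alpha_i t_i K(x_i, x_new) + b)."""
    k = rbf_kernel(X, x_new[None, :], gamma).ravel()
    return np.sign(np.sum(alpha * t * k) + b)
```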

Learning Theory basis of SVMs. Theoretical bounds on testing error: the upper bound doesn't depend on the dimensionality of the space, and the bound is tightened by maximizing the margin, γ, associated with the decision boundary.

Why we like SVMs. They work: good generalization. Easily interpreted: the decision boundary is based on the data, in the form of the support vectors (not so in multilayer perceptron networks). Principled bounds on testing error from learning theory (VC dimension).

SVM vs. MLP. SVMs have many fewer parameters. SVM: maybe just a kernel parameter. MLP: number and arrangement of nodes, plus the learning rate η. SVM: convex optimization task. MLP: the likelihood is non-convex, so there are local minima:
$$R(\theta)=\frac{1}{N}\sum_{n=0}^{N}\frac{1}{2}\left(y_n-g\left(\sum_k w_{kl}\,g\left(\sum_j w_{jk}\,g\left(\sum_i w_{ij}x_{n,i}\right)\right)\right)\right)^2$$

Soft margin classification. There can be outliers on the other side of the decision boundary, or points that lead to a small margin. Solution: introduce a penalty term into the constraint function.

Soft Max Dual. Still quadratic programming!
$$W(\alpha) = \sum_{i=0}^{N-1}\alpha_i - \frac{1}{2}\sum_{i,j=0}^{N-1}t_it_j\alpha_i\alpha_j(\vec{x}_i\cdot \vec{x}_j)$$
(In the standard soft-margin formulation, the only change from the hard-margin dual is the box constraint $0 \le \alpha_i \le C$ on the multipliers.)

Soft margin example. Points are allowed within the margin, but a cost is introduced: the hinge loss $\max\left(0,\ 1 - t_i(\vec{w}^T\vec{x}_i + b)\right)$.
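A minimal sketch of the hinge loss on a training set, assuming data `X` of shape (N, d), labels `t` in {-1, +1}, and a candidate `(w, b)`:

```python
import numpy as np

def hinge_loss(X, t, w, b):
    # Zero for points outside the margin on the correct side; grows linearly
    # for points inside the margin or misclassified.
    return np.maximum(0.0, 1.0 - t * (X @ w + b)).sum()
```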

Probabilities from SVMs. Support Vector Machines are discriminant functions. Discriminant functions: $f(x) = c$. Discriminative models: $f(x) = \arg\max_c p(c|x)$. Generative models: $f(x) = \arg\max_c p(x|c)p(c)/p(x)$. No (principled) probabilities from SVMs: SVMs are not based on probability distribution functions of class instances.

Efficiency of SVMs. Not especially fast. Training: $O(n^3)$, limited by quadratic programming efficiency. Evaluation: $O(n)$, since we need to evaluate against each support vector (potentially all $n$ points).

Good Bye. Next time: the Kernel "Trick" -> Kernel Methods, or: how can we use SVMs on data that are not linearly separable?