Support Vector Machines

Presentation transcript:

Support Vector Machines
- Elegant combination of statistical learning theory and machine learning (Vapnik)
- Good empirical results
- Non-trivial implementation
- Can be slow and memory intensive
- Binary classifier
- Much current work
- Note: do the Quadric review first (including how each layer is just a linear divider); it will help in seeing the polynomial kernel issues

SVM Comparisons
- To have a natural way to compare and gain intuition on SVMs, we first briefly review:
  - Quadric/higher-order machines
  - Radial basis function (RBF) networks
- Note that BP follows a similar strategy: a non-linear map of the input features into a new feature space which is then linearly separable. The difference is that BP learns the non-linear mapping.
- References: Perceptron slides 38-42 (higher order), BP slides 8-10 (multi-layer learning, Rosenblatt's simple perceptron), BP slide 3 (linearly separable for top layer), IBL slides 14-16 (RBF)

SVM Overview
- Non-linear mapping from input space into a higher-dimensional feature space
- The mapping is non-adaptive: it is defined by the user's choice of kernel function
- A linear decision surface (hyperplane) is sufficient in the high-dimensional feature space (note that this is the same thing we do with a standard MLP/BP)
- Avoid the complexity of the high-dimensional feature space with kernel functions, which allow computations to take place in the input space while giving the power of being in the feature space
- Get improved generalization by placing the hyperplane at the maximum margin

Maximum Margin and Support Vectors

Standard (Primal) Perceptron Algorithm
- Target minus output is not used; just add (or subtract) a portion (multiplied by the learning rate) of the current pattern to the weight vector
- If the weight vector starts at 0 then the learning rate can just be 1
- R could also be 1 for this discussion
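
A minimal sketch of this primal update (my own code, not from the slides; labels are assumed to be in {-1, +1} and a bias term is included):

```python
import numpy as np

def train_primal_perceptron(X, y, lr=1.0, epochs=100):
    """Primal perceptron: on a mistake, add (or subtract) lr * pattern to w."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (np.dot(w, X[i]) + b) <= 0:   # misclassified
                w += lr * y[i] * X[i]               # no (target - output) scaling
                b += lr * y[i]
                mistakes += 1
        if mistakes == 0:                           # converged on separable data
            break
    return w, b
```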

Dual and Primal Equivalence
- Note that the final weight vector is a linear combination of the training patterns: w = Σi αi yi xi
- The basic decision function (primal and dual) is f(x) = sign(w·x + b) = sign(Σi αi yi (xi·x) + b)
- How do we obtain the coefficients αi?

Dual Perceptron Training Algorithm
- Assume an initial 0 weight vector (all αi = 0); whenever pattern i is misclassified, increment αi
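
A possible sketch of this dual training loop (my own code; the kernel here is just the dot product, matching the linear case):

```python
import numpy as np

def train_dual_perceptron(X, y, epochs=100):
    """Dual perceptron: alpha_i counts how often pattern i was misclassified."""
    n = X.shape[0]
    gram = X @ X.T                  # all (xi . xj) pairs, computed once and stored
    alpha, b = np.zeros(n), 0.0     # zero alphas == zero weight vector
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            # f(xi) = sum_j alpha_j * y_j * (xj . xi) + b
            if y[i] * (np.sum(alpha * y * gram[:, i]) + b) <= 0:
                alpha[i] += 1.0     # embedding strength grows on each mistake
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b                 # w = sum_i alpha_i y_i x_i recovers the primal form
```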

Dual vs. Primal Form
- Gram matrix: all (xi·xj) pairs – computed once and stored (can be large)
- αi: one for each pattern in the training set; incremented each time the pattern is misclassified, which would have led to a weight change in primal form
- The magnitude of αi is an indicator of the pattern's effect on the weights (embedding strength)
- Note that patterns on the borders have large αi while easy patterns never affect the weights
- Could have trained with just the subset of patterns with αi > 0 (the support vectors) and ignored the others
- Can train in the dual. How about execution? Either way (dual could be efficient if support vectors are few)
- What if the transformed feature space is still not linearly separable? The αi would keep growing. Could do early stopping or bound the αi with some maximum C, thus allowing outliers.

Feature Space and Kernel Functions
- Since most problems require a non-linear decision surface, we do a non-linear map Φ(x) = (Φ1(x), Φ2(x), …, ΦN(x)) from input space to feature space
- The feature space can be of very high (even infinite) dimensionality
- By choosing a proper kernel function/feature space, the high dimensionality can be avoided in computation but effectively used for the decision surface to solve complex problems – the "kernel trick"
- A kernel is appropriate if the matrix of all K(xi, xj) values is positive semi-definite (has non-negative eigenvalues). Even when this is not satisfied, many kernels still work in practice (e.g. the sigmoid kernel).
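
As an illustration (my own code, not from the slides), the positive semi-definiteness check can be run directly on a Gram matrix; the kernel functions and tolerance are illustrative choices:

```python
import numpy as np

def poly_kernel(x, z, d=2):
    return np.dot(x, z) ** d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def gram_matrix(X, kernel):
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def looks_like_valid_kernel(K, tol=1e-8):
    """Appropriate kernel: Gram matrix is positive semi-definite (non-negative eigenvalues)."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = np.array([[0.5, 0.3], [-0.2, 0.7], [0.4, 0.8]])
print(looks_like_valid_kernel(gram_matrix(X, poly_kernel)))      # True
print(looks_like_valid_kernel(gram_matrix(X, gaussian_kernel)))  # True
```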

Basic Kernel Execution
- Primal: f(x) = w·Φ(x) + b
- Dual: f(x) = Σi αi yi (Φ(xi)·Φ(x)) + b
- Kernel version: f(x) = Σi αi yi K(xi, x) + b – now we see the real advantage of working in the dual form
- F is a space made from inner products of the original feature space X
- We just show f(x) for the output rather than sign(f(x)); we can always take the sign, but f(x) carries more information
- Note the intuition of execution: the Gaussian (and other) kernels behave like a reduced, weighted K-nearest-neighbor model

(Figure note: replace the middle two layers of the network with just K(x, xi).)

Polynomial Kernels
- For greater dimensionality we can use K(x, z) = (x·z)^d, which implicitly contains all monomials of degree d
- Walk through with two vectors x = (x1, x2) and z = (z1, z2) (next slide)

Polynomial Kernel Example
- Assume a simple 2-d feature vector x: (x1, x2)
- Note that a new instance x will be paired with training vectors xi from the training set using K(x, xi). We'll call these vectors x and z for this example.
- K(x, z) = (x·z)² = (x1z1 + x2z2)² = x1²z1² + 2x1x2z1z2 + x2²z2²
- Note that in the input space x we are getting the 2nd-order terms: x1², x2², and x1x2

Polynomial Kernel Example
- Following is the 3rd-degree polynomial kernel: K(x, z) = (x·z)³ = (x1z1 + x2z2)³ = x1³z1³ + 3x1²x2z1²z2 + 3x1x2²z1z2² + x2³z2³, giving the 3rd-order terms x1³, x1²x2, x1x2², and x2³

Polynomial Kernel Example
- Note that for the 2nd-degree polynomial with two variables we get the 2nd-order terms x1², x2², and x1x2
- Compare with the Quadric machine; we also add a bias weight with SVM
- For the 2nd-degree polynomial with three variables we would get the 2nd-order terms x1², x2², x3², x1x2, x1x3, and x2x3
- Note that we only get the dth-degree terms. However, with some kernel creation/manipulation we could also include the lower-order terms (e.g. K(x, z) = (x·z + 1)^d).
- Only a subset of the Quadric terms, but we can get the others.

SVM Execution
- Assume a novel instance x = <.4, .8>
- Assume training set vectors (with bias = 0):
  <.5, .3>, y = -1, α = 1
  <-.2, .7>, y = 1, α = 2
- What is the output for this case? (Show both the higher-order and the kernel computation.)

SVM Execution
- Assume a novel instance x = <.4, .8>
- Assume training set vectors (with bias = 0):
  <.5, .3>, y = -1, α = 1
  <-.2, .7>, y = 1, α = 2
- Kernel version:
  1·(-1)·(<.5,.3>·<.4,.8>)² + 2·1·(<-.2,.7>·<.4,.8>)² = -.1936 + .4608 = .2672
- This is the kernel version; what about the higher-order version?

SVM Execution
- Assume a novel instance x = <.4, .8>
- Assume training set vectors (with bias = 0):
  <.5, .3>, y = -1, α = 1
  <-.2, .7>, y = 1, α = 2
- Kernel version:
  1·(-1)·(<.5,.3>·<.4,.8>)² + 2·1·(<-.2,.7>·<.4,.8>)² = -.1936 + .4608 = .2672
- Φ-space (higher-order) version, showing each term separately:
  1·(-1)·(.5²·.4² + 2·.5·.3·.4·.8 + .3²·.8²) = -.04 + -.096 + -.0576 = -.1936
  2·1·((-.2)²·.4² + 2·(-.2)·.7·.4·.8 + .7²·.8²) = 2·(.0064 - .0896 + .3136) = .4608
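
The two computations above can be checked numerically; this small script (my own, using the slide's numbers) evaluates both the kernel form and an explicit degree-2 feature map and gets the same .2672:

```python
import numpy as np

x = np.array([0.4, 0.8])                            # novel instance
svs = [(np.array([0.5, 0.3]), -1, 1.0),             # (support vector, y, alpha)
       (np.array([-0.2, 0.7]), 1, 2.0)]

def k_poly2(a, b):                                  # kernel form: (a . b)^2
    return np.dot(a, b) ** 2

def phi2(a):                                        # explicit degree-2 feature map
    return np.array([a[0] ** 2, np.sqrt(2) * a[0] * a[1], a[1] ** 2])

kernel_out  = sum(alpha * y * k_poly2(sv, x) for sv, y, alpha in svs)
feature_out = sum(alpha * y * np.dot(phi2(sv), phi2(x)) for sv, y, alpha in svs)
print(round(kernel_out, 4), round(feature_out, 4))  # 0.2672 0.2672
```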

Kernel Trick
- So are we getting the same power as the Quadric machine without having to directly calculate the 2nd-order terms?

Kernel Trick
- So are we getting the same power as the Quadric machine without having to directly calculate the 2nd-order terms?
- No. With SVM we weight the scalar result of the kernel, which has constant coefficients in the 2nd-order equation. With Quadric we can have a separate learned weight (coefficient) for each term.
- Although we do get to individually weight each support vector.
- Assume that the 3rd term above was really irrelevant. How would Quadric/SVM deal with that? Quadric would weight the term as 0. SVM can't – the term comes with the entire equation. SVM can weight the equation, but not individual terms within it.
- But combinations of weightings of the different equations still let us separate lots of things (as with RBF – are there enough support vectors to work from to get our desired classification?).

Kernel Trick Realities
- Polynomial kernel: all monomials of degree 2, e.g. x1x3y1y3 + x3x3y3y3 + … (all 2nd-order terms)
- K(x, y) = Φ(x)·Φ(y) = … + (x1x3)(y1y3) + (x3x3)(y3y3) + …
- A lot of structure is represented with just one (x·y)²
- However, in a full higher-order solution we would like adaptive coefficients for each of these higher-order terms (e.g. -2x1 + 3·x1x2 + …)
- SVM does a weighted (embedding coefficients) sum of them all, with individual constant internal coefficients
- Thus it is not as powerful as a higher-order system with arbitrary weighting
- The more desirable arbitrary weighting can be done in an MLP because learning is done in the layers between the inputs and the hidden nodes
- The SVM map from features to higher-order features is a fixed mapping; there is no learning at that level
- Of course, individual weighting requires a theoretically exponential increase in terms/hidden nodes for which we need to find weights as the polynomial degree increases. We also need learning algorithms which can actually find these most salient higher-order features.
- Remember, we don't learn weights (per se) even for the combined kernel values (just embedding strengths), and those strengths are applied to terms with fixed internal coefficients.

SVM vs. RBF Comparison
- SVM commonly uses a Gaussian kernel
- The kernel is a distance metric (as in K-nearest neighbor)
- How does the SVM differ from RBF?

SVM vs. RBF Comparison
- SVM commonly uses a Gaussian kernel. How does the SVM differ from RBF?
- SVM automatically discovers which training instances to use as prototypes during the quadratic optimization
- SVM works only in the kernel space, while RBF calculates values in the much larger exploded space
- Both weight the different prototypes
- RBF uses a perceptron-style learning algorithm to create weights between prototypes and output classes
- RBF supports multiple output classes, and nodes contribute a vote for each output class, whereas SVM support vectors can only vote for two classes: 1 or -1
- SVM creates a maximum-margin hyperplane decision surface
- Since the internal feature coefficients are constants in the Gaussian distance kernel (for both SVM and RBF), SVM suffers from fixed/irrelevant features just like RBF/k-nearest neighbor
- Same kernel-trick weakness: no learning in the kernel map from the input to the higher-order feature space
- (Discuss the irrelevant-feature weakness of k-NN and the similar issue of SVM not being able to weight individual terms.)

Choosing a Kernel
- Can start from a desired feature space and try to construct a kernel
- More often one starts from a reasonable kernel and may not analyze the feature space
- Some kernels are a better fit for certain problems; domain knowledge can be helpful
- Common kernels: polynomial, Gaussian, sigmoidal, application-specific
- A simple empirical approach is to compare candidate kernels by cross-validation, as in the sketch below.
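
One way to act on this in practice (a sketch using scikit-learn on a synthetic data set; the kernels, parameters, and data are illustrative, not from the slides):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3}),
                       ("rbf", {"gamma": "scale"}),
                       ("sigmoid", {"gamma": "scale"})]:
    clf = SVC(kernel=kernel, C=1.0, **params)
    scores = cross_val_score(clf, X, y, cv=5)       # 5-fold cross-validation accuracy
    print(f"{kernel:8s} mean accuracy = {scores.mean():.3f}")
```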

Maximum Margin
- Maximum margin can lead to overfit due to noise
- The problem may not be linearly separable even in the transformed feature space
- A soft margin is a common solution: slack variables allow some violations, and in the dual the αi are constrained to be ≥ 0 and at most C. The C bound allows outliers.
- How to pick C? Can try different values for the particular application to see which works best (see the sketch below).
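
A sketch of that trial-and-error process (my own code, scikit-learn, illustrative values of C): smaller C gives a softer margin and typically more support vectors, and cross-validation accuracy can guide the choice.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = SVC(kernel="rbf", gamma="scale", C=C)
    cv_acc = cross_val_score(clf, X, y, cv=5).mean()
    n_sv = clf.fit(X, y).n_support_.sum()           # support vector count at this C
    print(f"C={C:>6}: cv accuracy {cv_acc:.3f}, {n_sv} support vectors")
```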

Soft Margins

Quadratic Optimization
- Optimizing the margin in the higher-order feature space is convex, and thus there is one guaranteed solution at the minimum (or maximum)
- SVM optimizes the dual representation (avoiding the higher-order feature space) with variations on the following:
  maximize  Σi αi - ½ Σi Σj αi αj yi yj K(xi, xj)
  subject to  Σi αi yi = 0  and  0 ≤ αi ≤ C
- Maximizing Σαi tends towards larger α, subject to Σαiyi = 0 and α ≤ C (which both tend towards smaller α). Without this term, α = 0 would suffice.
- The 2nd term minimizes the number of support vectors, since two positive (or two negative) instances that are similar (high kernel result) would increase the size of the term; note that both (or either) of such instances are usually not needed
- Two non-matched instances which are similar should have larger α, since they are likely support vectors at the decision boundary (the negative term helps the maximization)
- The optimization is quadratic in the αi terms and linear in the constraints; the C maximum can be dropped for the non-soft-margin case
- While quite solvable, it requires complex code and is usually done with a numerical-methods software package – quadratic programming
- (This differs from straight training-set accuracy due to the αj term.)

Execution
- Typically use the dual form, which can take advantage of kernel efficiency
- If the number of support vectors is small then the dual is fast
- In cases of low-dimensional feature spaces, could derive the weights from the αi and use normal primal execution
- Can get a speed-up by dropping support vectors with embedding coefficients below some threshold

Standard SVM Approach
- Select a two-class training set, a kernel function (calculate the Gram matrix), and choose the C value (soft-margin parameter)
- Pass these to a quadratic optimization package, which will return an α for each training pattern based on a variation of the dual objective above (non-bias version)
- Patterns with non-zero α are the support vectors for the maximum-margin SVM classifier
- Execute by using the support vectors
- (Key slide.)
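
A sketch of the "pass to a quadratic optimization package" step, here using the cvxopt QP solver (a library choice of mine, not named in the slides), with the bias-induced equality constraint from the earlier slide included:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_qp(X, y, kernel, C):
    """Solve: max sum(alpha) - 0.5 * sum_ij alpha_i alpha_j y_i y_j K(xi, xj)
       s.t. 0 <= alpha_i <= C and sum_i alpha_i y_i = 0,
       cast as the cvxopt form: min 0.5 a'Pa + q'a, Ga <= h, Aa = b."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))      # -alpha <= 0 and alpha <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    return alpha        # patterns with alpha above a small threshold are the support vectors
```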

A Simple On-Line Alternative
- Stochastic on-line gradient ascent on the dual variables – can be effective
- This version assumes no bias
- Sensitive to the learning rate
- The stopping criterion tests whether we have an appropriate solution: can just go until little change is occurring, or can test the optimization conditions directly
- Can be quite slow, and usually quadratic programming is used to get an exact solution
- Newton and conjugate-gradient techniques are also used, and can work well since it is a guaranteed convex (bowl-shaped) surface
- (Could skip these next two slides.)

- Maintains a margin of 1 (typical in standard SVM implementations), which can always be done by scaling α, or equivalently w and b
- This is done with the (1 - actual) term below, which can update even when the instance is correct, as it tries to make the distance of support vectors to the decision surface be exactly 1
- If the parenthetical term is < 0 (i.e. the current instance is correct and beyond the margin), then don't update
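
A sketch of this on-line update (my own code, in the kernel-adatron style; no bias, as the earlier slide assumes; the learning rate, epoch count, and skip rule follow the description above):

```python
import numpy as np

def stochastic_dual_ascent(X, y, kernel, C, lr=0.1, epochs=200):
    """On-line gradient ascent on the dual alphas, no bias term."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            f_i = np.sum(alpha * y * K[:, i])   # current (un-thresholded) output on xi
            gap = 1.0 - y[i] * f_i              # the "(1 - actual)" term
            if gap > 0:                         # if <= 0: correct and beyond margin, skip
                alpha[i] = np.clip(alpha[i] + lr * gap, 0.0, C)
    return alpha
```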

Large Training Sets
- Big problem, since the Gram matrix (all (xi·xj) pairs) is O(n²) for n data patterns
- 10⁶ data patterns require 10¹² memory items – can't keep them in memory
- Also makes for a huge inner loop in dual training
- Key insight: most of the data patterns will not be support vectors, so they are not needed

Chunking
- Start with a reasonably sized subset of the data set (one that fits in memory and does not take too long during training)
- Train on this subset and keep just the support vectors, or the m patterns with the highest αi values
- Grab another subset, add the current support vectors to it, and continue training
- Note that this training may allow previous support vectors to be dropped as better ones are discovered
- Repeat until all data has been used and no new support vectors are added, or some other stopping criterion is fulfilled
- A sketch of this loop follows.
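
A sketch of the chunking loop (my own code; `solve_dual` stands for any routine that returns the dual α's for a working set, e.g. the QP sketch earlier with a kernel and C fixed):

```python
import numpy as np

def chunked_training(X, y, solve_dual, chunk_size=1000, sv_threshold=1e-6):
    """Train on a chunk plus current support vectors, keep only the support vectors, repeat."""
    order = np.random.permutation(X.shape[0])
    sv_X = np.empty((0, X.shape[1]))
    sv_y = np.empty(0)
    for start in range(0, X.shape[0], chunk_size):
        idx = order[start:start + chunk_size]
        work_X = np.vstack([sv_X, X[idx]])              # previous SVs + fresh chunk
        work_y = np.concatenate([sv_y, y[idx]])
        alpha = solve_dual(work_X, work_y)
        keep = alpha > sv_threshold                     # earlier SVs may be dropped here
        sv_X, sv_y = work_X[keep], work_y[keep]
    return sv_X, sv_y
```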

SVM Issues
- Excellent empirical and theoretical potential
- Multi-class problems are not handled naturally; the basic model classifies into just two classes. Can train one model for each class (class i is 1 and all else 0) and then decide between conflicting models using confidence, etc. (see the sketch below).
- How to choose the kernel – the main learning parameter other than the margin penalty C. The kernel choice will include other parameters to be defined (degree of polynomials, variance of Gaussians, etc.)
- Speed and size: both training and testing; how to handle very large training sets (millions of patterns and/or support vectors) is not yet solved
- Adaptive kernels: trained during learning?
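
For the multi-class point, one common workaround is the one-vs-rest construction; a minimal scikit-learn sketch (my own, with an illustrative data set and parameters):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One binary SVM per class; conflicts are resolved by the per-class decision values.
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale", C=1.0))
print(ovr.fit(X, y).score(X, y))   # training accuracy of the combined multi-class model
```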