Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods Machine Learning March 25, 2010

Last Time Basics of Support Vector Machines

Review: Max Margin Several separating hyperplanes classify the training data equally well, so how can we pick which is best? Are they really "equally valid"? Answer: maximize the size of the margin. (The slide contrasts a small-margin and a large-margin separator.)

Review: Max Margin Optimization The margin is the projection of x1 – x2 onto w, the normal of the hyperplane, where x1 and x2 are support vectors on opposite margin boundaries.
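Filling in the two formulas the slide names, under the usual convention that x_1 and x_2 lie on the margin hyperplanes (w \cdot x_1 + b = 1 and w \cdot x_2 + b = -1):

Projection: \frac{w^T(x_1 - x_2)}{\|w\|}

Size of the Margin: \frac{w^T(x_1 - x_2)}{\|w\|} = \frac{(1 - b) - (-1 - b)}{\|w\|} = \frac{2}{\|w\|}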

Review: Maximizing the margin Goal: maximize the margin, subject to the decision boundary linearly separating the data.

Review: Max Margin Loss Function Primal Dual
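For reference, the standard forms these headings point to, in the lecture's notation (labels t_i \in \{+1, -1\}, multipliers \alpha_i):

Primal: \min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad t_i(w \cdot x_i + b) \ge 1,\ i = 0, \dots, N-1

Dual: W(\alpha) = \sum_{i=0}^{N-1}\alpha_i - \frac{1}{2}\sum_{i,j=0}^{N-1}t_it_j\alpha_i\alpha_j(x_i\cdot x_j) \quad \text{s.t.} \quad \alpha_i \ge 0,\ \sum_i \alpha_i t_i = 0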

Review: Support Vector Expansion New decision function, independent of the dimension of x! When αi is non-zero, xi is a support vector. When αi is zero, xi is not a support vector.
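The expanded decision function the slide refers to (standard form, same notation):

f(x) = \mathrm{sign}\left(\sum_{i=0}^{N-1}\alpha_i t_i (x_i \cdot x) + b\right)

Only dot products x_i \cdot x appear, so evaluation cost scales with the number of support vectors (nonzero \alpha_i), not with the dimension of x.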

Review: Visualization of Support Vectors

Today How support vector machines deal with data that are not linearly separable: soft margins and kernels!

Why we like SVMs They work. Good generalization. Easily interpreted: the decision boundary is based on the data, in the form of the support vectors. Not so in multilayer perceptron networks. Principled bounds on test error from learning theory (VC dimension).

SVM vs. MLP SVMs have many fewer parameters. SVM: maybe just a kernel parameter. MLP: number and arrangement of nodes plus the learning rate η. SVM: convex optimization task. MLP: the objective is non-convex -- local minima:
R(\theta) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{2}\left(y_n - g\left(\sum_k w_{k}\, g\left(\sum_j w_{jk}\, g\left(\sum_i w_{ij} x_{n,i}\right)\right)\right)\right)^2

Linear Separability So far, support vector machines can only handle linearly separable data. But most data isn't.

Soft margin example Points are allowed within the margin, but each violation incurs a cost: the hinge loss.
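Written out, the hinge loss for a training point (x_i, t_i) is:

\ell_{\mathrm{hinge}}(x_i, t_i) = \max\left(0,\ 1 - t_i(w \cdot x_i + b)\right)

It is zero for points outside the margin on the correct side, and grows linearly for points inside the margin or misclassified.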

Soft margin classification There can be outliers on the wrong side of the decision boundary, or points that force a small margin. Solution: introduce a penalty term into the constrained objective.
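A standard way to write that penalized problem, with slack variables \xi_i and cost parameter C (a reconstruction; the slide's exact notation is not preserved in the transcript):

\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=0}^{N-1}\xi_i \quad \text{s.t.} \quad t_i(w \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0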

Soft Margin Dual Still Quadratic Programming!
W(\alpha) = \sum_{i=0}^{N-1}\alpha_i - \frac{1}{2}\sum_{i,j=0}^{N-1}t_it_j\alpha_i\alpha_j(x_i\cdot x_j)
The objective is the same as in the hard-margin case; the soft margin adds the box constraint 0 \le \alpha_i \le C on the multipliers.

Probabilities from SVMs Support Vector Machines are discriminant functions. Discriminant function: f(x) = c. Discriminative model: f(x) = argmax_c p(c|x). Generative model: f(x) = argmax_c p(x|c)p(c)/p(x). No (principled) probabilities from SVMs: SVMs are not based on probability distribution functions of class instances.

Efficiency of SVMs Not especially fast. Training: on the order of n^3, governed by the Quadratic Programming solver. Evaluation: on the order of n, since we may need to evaluate against each support vector (potentially all n).

Kernel Methods Points that are not linearly separable in two dimensions might be linearly separable in three.

Kernel Methods We will look at a way to add dimensionality to the data in order to make it linearly separable. In the extreme, we can construct a dimension for each data point, which may lead to overfitting.
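A minimal sketch of the idea in Python (NumPy only; the data and the feature map are illustrative, not from the slides): points on two concentric circles are not linearly separable in 2-D, but become separable after adding x1^2 + x2^2 as a third coordinate.

    import numpy as np

    rng = np.random.default_rng(0)
    angles = rng.uniform(0, 2 * np.pi, 100)
    inner = np.c_[np.cos(angles[:50]), np.sin(angles[:50])]      # radius 1, class -1
    outer = 3 * np.c_[np.cos(angles[50:]), np.sin(angles[50:])]  # radius 3, class +1
    X = np.vstack([inner, outer])

    def lift(X):
        # map (x1, x2) -> (x1, x2, x1^2 + x2^2); the new coordinate separates the circles
        return np.c_[X, np.sum(X ** 2, axis=1)]

    X3 = lift(X)
    # in 3-D the plane z = 5 separates the classes (inner points have z ~ 1, outer z ~ 9)
    print(np.all(X3[:50, 2] < 5), np.all(X3[50:, 2] > 5))        # True True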

Remember the Dual? Primal Dual
W(\alpha) = \sum_{i=0}^{N-1}\alpha_i - \frac{1}{2}\sum_{i,j=0}^{N-1}t_it_j\alpha_i\alpha_j(x_i\cdot x_j)

Basis of Kernel Methods The decision process doesn't depend on the dimensionality of the data, so we can map the data to a higher-dimensional space. Note: data points only appear within a dot product. The objective function is based on the dot products of data points -- not the data points themselves.

Basis of Kernel Methods Since data points only appear within a dot product, we can map to another space simply by replacing that dot product. The objective function depends only on the dot products of data points -- not the data points themselves.
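Concretely, the substitution x_i \cdot x_j \rightarrow \phi(x_i) \cdot \phi(x_j) = K(x_i, x_j) turns the dual into the same expression with kernel evaluations in place of dot products:

W(\alpha) = \sum_{i=0}^{N-1}\alpha_i - \frac{1}{2}\sum_{i,j=0}^{N-1}t_it_j\alpha_i\alpha_jK(x_i, x_j)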

Kernels The objective function is based on dot products of data points, rather than the data points themselves. We can represent this dot product as a kernel: a kernel function K(xi, xj), collected into a kernel (Gram) matrix. The kernel matrix is finite (if large) in size, and that size is unrelated to the dimensionality of x.

Kernels A kernel corresponds to a mapping ϕ of the inputs into a feature space: K(xi, xj) = ϕ(xi) · ϕ(xj).

Kernels Gram Matrix: the matrix of kernel evaluations over all pairs of training points, K_{ij} = K(x_i, x_j). Consider the following kernel:

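The specific kernel shown on the original slide is not preserved in the transcript; a common choice for this example is the quadratic kernel K(x, z) = (x · z)^2, which in 2-D corresponds to the explicit map ϕ(x) = (x1^2, √2 x1 x2, x2^2). A small NumPy check with illustrative values:

    import numpy as np

    def phi(x):
        # explicit feature map for the quadratic kernel K(x, z) = (x . z)^2 in 2-D
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 0.5])

    k_direct = np.dot(x, z) ** 2        # kernel evaluated directly
    k_mapped = np.dot(phi(x), phi(z))   # dot product after the explicit mapping
    print(k_direct, k_mapped)           # both print 16.0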

Kernels In general we don't need to know the form of ϕ; just specifying the kernel function is sufficient. A good kernel: computing K(xi, xj) is cheaper than computing ϕ(xi) and ϕ(xj) explicitly and taking their dot product.

Kernels Valid kernels: Symmetric. Must be decomposable into ϕ functions -- harder to show. The Gram matrix is positive semi-definite (psd). Note: having all positive entries does not by itself make a matrix psd, and a matrix with negative entries may still be psd.
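A quick numeric check of the psd condition (a sketch in NumPy; the example matrices are illustrative): symmetrize and inspect the eigenvalues of the Gram matrix.

    import numpy as np

    def is_psd(gram, tol=1e-10):
        # a valid Gram matrix is symmetric with all eigenvalues >= 0 (up to tolerance)
        if not np.allclose(gram, gram.T):
            return False
        return np.all(np.linalg.eigvalsh(gram) >= -tol)

    X = np.random.randn(5, 3)
    gram_linear = X @ X.T                   # Gram matrix of the linear kernel
    print(is_psd(gram_linear))              # True: the linear kernel is valid
    print(is_psd(np.array([[1.0, 2.0],      # all-positive entries, eigenvalues 3 and -1
                           [2.0, 1.0]])))   # False: positive entries alone don't imply psd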

Kernels Given valid kernels K(x,z) and K'(x,z), more kernels can be made from them: cK(x,z) (for c > 0), K(x,z) + K'(x,z), K(x,z)K'(x,z), exp(K(x,z)), ...and more.
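A minimal sketch of these closure rules as Python functions (the kernel names are illustrative, not from the slides):

    import numpy as np

    def k_linear(x, z):
        return float(np.dot(x, z))

    def k_quad(x, z):
        return (np.dot(x, z) + 1.0) ** 2

    # each of these combinations of valid kernels is again a valid kernel
    def k_scaled(x, z, c=3.0): return c * k_linear(x, z)             # cK(x,z), c > 0
    def k_sum(x, z):           return k_linear(x, z) + k_quad(x, z)  # K + K'
    def k_product(x, z):       return k_linear(x, z) * k_quad(x, z)  # K * K'
    def k_exp(x, z):           return np.exp(k_linear(x, z))         # exp(K)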

Incorporating Kernels in SVMs Optimize the αi's and the bias with respect to the kernel. Decision function:
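The kernelized decision function the slide refers to (standard form):

f(x) = \mathrm{sign}\left(\sum_{i=0}^{N-1}\alpha_i t_i K(x_i, x) + b\right)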

Some popular kernels Polynomial Kernel Radial Basis Functions String Kernels Graph Kernels

Polynomial Kernels The dot product is related to a polynomial power of the original dot product. If c is large, the focus is on the linear terms; if c is small, the focus is on the higher-order terms. Very fast to calculate.
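The usual form of the polynomial kernel, with degree d and offset c:

K(x, z) = (x \cdot z + c)^d

Expanding the power shows the trade-off: the binomial expansion weights the k-th order term by c^{d-k}, so larger c puts relatively more weight on the low-order (near-linear) terms.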

Radial Basis Functions The inner product of two points is related to the distance between the two points in input space. Equivalent to placing a Gaussian bump on each data point.
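The usual form of the RBF (Gaussian) kernel, with bandwidth \sigma:

K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)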

String kernels Not a Gaussian, but still a legitimate kernel: K(s,s') = difference in length, K(s,s') = count of differing letters, K(s,s') = minimum edit distance. Kernels allow for infinite-dimensional inputs. The kernel is a FUNCTION defined over the input space -- we don't need to specify the input space exactly, and we don't need to manually encode the input.

Graph Kernels Define the kernel function based on graph properties. These properties must be computable in poly-time: walks of length < k, paths, spanning trees, cycles. Kernels allow us to incorporate knowledge about the input without direct "feature extraction" -- just similarity in some space.

Where else can we apply Kernels? Anywhere that the dot product of x is used in an optimization. Perceptron:
D(x) = \mathrm{sign}\left(\left(\sum_j t_j x_j\right)^T x + b\right) = \mathrm{sign}\left(\sum_j t_j \left(x_j^T x\right) + b\right)
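A hedged sketch of a kernelized perceptron in Python (NumPy), assuming t is an array of ±1 labels; the rbf helper, alpha, and b are illustrative names, not from the slides. Each example's multiplier counts the mistakes made on it, so the implicit weight vector is sum_i alpha[i] * t[i] * phi(x_i).

    import numpy as np

    def rbf(x, z, gamma=1.0):
        return np.exp(-gamma * np.sum((x - z) ** 2))

    def train_kernel_perceptron(X, t, kernel, epochs=10):
        n = len(X)
        alpha = np.zeros(n)
        b = 0.0
        K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # precomputed Gram matrix
        for _ in range(epochs):
            for i in range(n):
                score = np.sum(alpha * t * K[:, i]) + b
                if t[i] * score <= 0:    # mistake (or on the boundary): strengthen this example
                    alpha[i] += 1.0
                    b += t[i]
        return alpha, b

    def predict(x, X, t, alpha, b, kernel):
        score = sum(a * ti * kernel(xi, x) for a, ti, xi in zip(alpha, t, X)) + b
        return 1.0 if score >= 0 else -1.0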

Kernels in Clustering In clustering, it's very common to define cluster similarity by the distance between points, e.g. in k-nearest neighbors or k-means. This distance can be replaced by a kernel. We'll return to this in the section on unsupervised techniques.
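The replacement works because squared distances can be written entirely in terms of kernel evaluations. A small sketch (NumPy; the rbf helper is illustrative):

    import numpy as np

    def rbf(x, z, gamma=1.0):
        return np.exp(-gamma * np.sum((x - z) ** 2))

    def kernel_distance_sq(x, z, kernel):
        # squared distance in the implicit feature space:
        # ||phi(x) - phi(z)||^2 = K(x,x) - 2 K(x,z) + K(z,z)
        return kernel(x, x) - 2.0 * kernel(x, z) + kernel(z, z)

    x = np.array([0.0, 1.0])
    z = np.array([2.0, 3.0])
    print(kernel_distance_sq(x, z, rbf))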

Bye Next time: Supervised Learning Review, Clustering.