Lecture 5: Support Vector Machines. Large-margin linear classifier, the non-separable case, and the kernel trick.

Large-margin linear classifier. Assume the linearly separable case. The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point. It gives a unique solution and better test-sample performance. The linear classifier is f(x) = w^T x + w_0.

Large-margin linear classifier. For a point x at signed distance r from the hyperplane {x : f(x) = 0}, the value of the classifier is f(x) = w^T x + w_0 = r ||w||, so the geometric distance is r = f(x) / ||w||.
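A quick numerical check of this identity; a minimal sketch with a made-up w, w_0, and point x (not values from the slides):

```python
import numpy as np

# Check that f(x) / ||w|| equals the geometric distance from x to the
# hyperplane {z : w^T z + w_0 = 0}. w, w_0, and x are made-up values.
w = np.array([3.0, 4.0])
w0 = -5.0
x = np.array([2.0, 1.0])

f_x = w @ x + w0                        # functional value f(x) = w^T x + w_0
r = f_x / np.linalg.norm(w)             # claimed signed distance

# Geometric check: project x onto the hyperplane and measure the distance.
x_proj = x - r * w / np.linalg.norm(w)  # foot of the perpendicular from x
print(np.isclose(w @ x_proj + w0, 0.0))                 # True: x_proj lies on the hyperplane
print(np.isclose(np.linalg.norm(x - x_proj), abs(r)))   # True: distance equals |r|
```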

Large-margin linear classifier. {x_1, ..., x_n}: our training dataset in d dimensions; y_i ∈ {1, -1}: the class labels. Our goal: among all hyperplanes f(x) = β^T x + β_0 with ||β|| = 1, find the optimal separating hyperplane, i.e. find the largest margin M such that y_i (x_i^T β + β_0) ≥ M for i = 1, ..., n.

Large-margin linear classifier. The border is M away from the hyperplane; M is called the "margin". Dropping the ||β|| = 1 requirement and setting M = 1 / ||β||, an easier, equivalent version of the problem is: minimize ||β|| over β, β_0 subject to y_i (x_i^T β + β_0) ≥ 1 for i = 1, ..., n.
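As a minimal sketch of this hard-margin problem, assuming scikit-learn is available: on made-up separable toy data, a linear SVC with a very large C approximates "minimize ||β|| subject to y_i (x_i^T β + β_0) ≥ 1", and the margin can then be read off as M = 1 / ||β||.

```python
import numpy as np
from sklearn.svm import SVC

# Hard-margin sketch: linearly separable toy data, huge C ~ separable case.
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 0.5],          # class +1 (made-up points)
              [-1.0, -1.0], [-2.0, -1.5], [-0.5, -2.0]])    # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
beta, beta0 = clf.coef_[0], clf.intercept_[0]

margin = 1.0 / np.linalg.norm(beta)            # M = 1 / ||beta||
print("beta =", beta, "beta_0 =", beta0)
print("margin M =", margin)
print("all constraints y_i * f(x_i) >= 1 (up to tolerance):",
      np.all(y * (X @ beta + beta0) >= 1 - 1e-6))
```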

Large-margin linear classifier

Non-separable case. When the two classes are not linearly separable, allow slack variables ξ_i ≥ 0 for the points on the wrong side of their margin border: y_i (x_i^T β + β_0) ≥ M(1 - ξ_i), with Σ ξ_i bounded by a constant.

Non-separable case. The optimization problem becomes: minimize ||β|| subject to y_i (x_i^T β + β_0) ≥ 1 - ξ_i, ξ_i ≥ 0, and Σ ξ_i ≤ constant. Here ξ_i = 0 when the point is on the correct side of the margin; ξ_i > 1 when the point passes the hyperplane to the wrong side; 0 < ξ_i < 1 when the point is inside the margin but still on the correct side.

Non-separable case. When a point is outside the margin boundary, ξ_i = 0 and it plays little role in determining the boundary, so the method does not rely on any assumed class distribution.

Computation. An equivalent formulation is: minimize (1/2)||β||^2 + C Σ ξ_i over β, β_0 subject to ξ_i ≥ 0 and y_i (x_i^T β + β_0) ≥ 1 - ξ_i. The cost parameter C replaces the constant bound on Σ ξ_i; the separable case corresponds to C = ∞.
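A small sketch of how the slacks behave in practice, assuming scikit-learn; the overlapping data and the choice C = 1 are made up. The slacks are recovered from the fitted decision function as ξ_i = max(0, 1 - y_i f(x_i)), matching the three cases described above.

```python
import numpy as np
from sklearn.svm import SVC

# Fit a soft-margin linear SVM on overlapping toy data and recover the slacks
# xi_i = max(0, 1 - y_i * f(x_i)) to see the three cases described above.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[1.5, 1.5], size=(20, 2)),
               rng.normal(loc=[-1.5, -1.5], size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
f = clf.decision_function(X)                 # f(x_i) = x_i^T beta + beta_0
xi = np.maximum(0.0, 1.0 - y * f)            # slack variables

print("points with xi = 0 (on or outside the margin, correct side):", np.sum(xi <= 1e-8))
print("points with 0 < xi <= 1 (inside the margin, correct side):",
      np.sum((xi > 1e-8) & (xi <= 1)))
print("points with xi > 1 (wrong side of the hyperplane):", np.sum(xi > 1))
```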

Computation. The Lagrange (primal) function is: L_P = (1/2)||β||^2 + C Σ ξ_i - Σ α_i [y_i (x_i^T β + β_0) - (1 - ξ_i)] - Σ μ_i ξ_i (12.9). Taking derivatives with respect to β, β_0, and ξ_i and setting them to zero gives: β = Σ α_i y_i x_i (12.10), 0 = Σ α_i y_i (12.11), α_i = C - μ_i (12.12), together with the positivity constraints α_i, μ_i, ξ_i ≥ 0.

Computation. Substituting 12.10~12.12 into 12.9 gives the Lagrangian dual objective function: L_D = Σ α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j, to be maximized subject to 0 ≤ α_i ≤ C and Σ α_i y_i = 0. The Karush-Kuhn-Tucker conditions include: α_i [y_i (x_i^T β + β_0) - (1 - ξ_i)] = 0, μ_i ξ_i = 0, and y_i (x_i^T β + β_0) - (1 - ξ_i) ≥ 0.
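As a rough illustration only, the dual above can be handed to a generic optimizer for a tiny made-up dataset and compared with scikit-learn; this is a sketch assuming scipy and scikit-learn, not the specialized (SMO-type) solver real SVM libraries use.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.svm import SVC

# Solve  max sum(alpha) - 1/2 alpha^T Q alpha  s.t. 0 <= alpha_i <= C and
# sum(alpha_i y_i) = 0  with a generic optimizer, then recover beta via 12.10.
X = np.array([[2.0, 2.0], [1.0, 3.0], [2.5, 1.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
C = 10.0

Q = (y[:, None] * X) @ (y[:, None] * X).T          # Q_ij = y_i y_j x_i^T x_j

def neg_dual(a):                                   # minimize the negative dual
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(len(y)), jac=lambda a: Q @ a - 1.0,
               bounds=[(0.0, C)] * len(y),
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
beta = ((alpha * y)[:, None] * X).sum(axis=0)      # 12.10: beta = sum alpha_i y_i x_i

print("beta from the dual QP:   ", beta)
print("beta from sklearn's SVC: ", SVC(kernel="linear", C=C).fit(X, y).coef_[0])
```

The two printed vectors should agree approximately; the generic optimizer is only practical for toy problems like this one.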

Computation. From 12.10, the solution for β has the form β = Σ α_i y_i x_i, with non-zero coefficients α_i only for those points i for which the constraint y_i (x_i^T β + β_0) = 1 - ξ_i is exactly met. These points are called "support vectors". Some lie on the edge of the margin (ξ_i = 0) and have 0 < α_i < C; the remainder have ξ_i > 0 and α_i = C, and are on the wrong side of the margin.
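This structure can be read off a fitted scikit-learn model; a sketch with made-up data. In sklearn's SVC, dual_coef_ stores y_i α_i for the support vectors only, so β can be rebuilt from the support vectors alone and compared with coef_.

```python
import numpy as np
from sklearn.svm import SVC

# beta = sum_i alpha_i y_i x_i, with non-zero alpha_i only for support vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 1.0, (30, 2)), rng.normal([-2, -2], 1.0, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

beta_from_svs = clf.dual_coef_[0] @ clf.support_vectors_   # sum (y_i alpha_i) x_i
print("rebuilt beta:  ", beta_from_svs)
print("sklearn coef_: ", clf.coef_[0])
print("match:", np.allclose(beta_from_svs, clf.coef_[0]))
print("support vectors: %d of %d points" % (len(clf.support_vectors_), len(X)))
```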

Computation

(Figure) With a smaller C the margin is wider and more points become support points; here 85% of the points are support points.
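A quick check of this effect with scikit-learn on made-up overlapping data; the exact percentages will differ from the figure.

```python
import numpy as np
from sklearn.svm import SVC

# Smaller C -> wider margin -> a larger fraction of the points are support points.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([1, 1], 1.5, (100, 2)), rng.normal([-1, -1], 1.5, (100, 2))])
y = np.array([1] * 100 + [-1] * 100)

for C in (100.0, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    frac = clf.n_support_.sum() / len(X)
    print("C = %g: %.0f%% of the points are support points" % (C, 100 * frac))
```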

Support Vector Machines. Enlarge the feature space to make the procedure more flexible, using basis functions h(x) = (h_1(x), ..., h_M(x)). Use the same procedure on the transformed features h(x_i) to construct the SV classifier; the decision is made by sign(f(x)), where f(x) = h(x)^T β + β_0.

SVM. Recall that in the original (linear) space the solution can be written f(x) = Σ α_i y_i x^T x_i + β_0. With the new basis, f(x) = Σ α_i y_i <h(x), h(x_i)> + β_0, and the dual objective becomes L_D = Σ α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j <h(x_i), h(x_j)>.
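To see why the basis expansion helps, here is a sketch with made-up 1-D data where the class depends on |x|: no linear threshold separates it, but after the explicit (and hand-picked) expansion h(x) = (x, x^2) the same linear SV procedure separates it perfectly. Assumes scikit-learn.

```python
import numpy as np
from sklearn.svm import SVC

# A 1-D problem where the class depends on |x| is not linearly separable in x,
# but becomes separable after the explicit basis expansion h(x) = (x, x^2).
x = np.array([-3.0, -2.5, -2.0, 2.0, 2.5, 3.0,       # class +1: |x| large
              -0.8, -0.5, -0.1, 0.2, 0.6, 0.9])      # class -1: |x| small
y = np.array([1] * 6 + [-1] * 6)

linear_acc = SVC(kernel="linear", C=10.0).fit(x[:, None], y).score(x[:, None], y)

H = np.column_stack([x, x ** 2])                      # explicit basis h(x) = (x, x^2)
expanded_acc = SVC(kernel="linear", C=10.0).fit(H, y).score(H, y)

print("training accuracy in the original space:", linear_acc)     # < 1.0
print("training accuracy with basis (x, x^2):  ", expanded_acc)   # 1.0
```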

SVM. h(x) is involved ONLY in the form of inner products! So as long as we define the kernel function K(x, x') = <h(x), h(x')>, which computes the inner product in the transformed space, we don't need to know what h(x) itself is. This is the "kernel trick". Some commonly used kernels: the d-th degree polynomial K(x, x') = (1 + <x, x'>)^d, the radial basis (Gaussian) kernel K(x, x') = exp(-γ ||x - x'||^2), and the neural network (sigmoid) kernel K(x, x') = tanh(κ_1 <x, x'> + κ_2).
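A numerical sanity check of the trick for the degree-2 polynomial kernel in two dimensions, where the feature map can still be written out by hand; the particular 6-dimensional map below is one standard choice, not something from the slides.

```python
import numpy as np

# For x in R^2, the degree-2 polynomial kernel K(x, z) = (1 + x.z)^2 equals the
# ordinary inner product <h(x), h(z)> with the explicit 6-dimensional map
#   h(x) = (1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2).
def h(x):
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

rng = np.random.default_rng(3)
for _ in range(5):
    x, z = rng.normal(size=2), rng.normal(size=2)
    kernel_value = (1.0 + x @ z) ** 2           # computed without ever forming h
    explicit_value = h(x) @ h(z)                # inner product in the enlarged space
    print(np.isclose(kernel_value, explicit_value))   # True every time
```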

SVM. Recall that α_i = 0 for non-support vectors, so f(x) = Σ α_i y_i K(x, x_i) + β_0 depends only on the support vectors.

SVM

K(x,x’) can be seen as a similarity measure between x and x’. The decision is made essentially by a weighted sum of similarity of the object to all the support vectors.

SVM. Using the kernel trick brings the feature space to a very high dimension, hence many, many parameters. Why doesn't the method suffer from the curse of dimensionality or from overfitting? Vapnik argues that the number of parameters alone, or the dimension alone, is not a true reflection of how flexible a classifier is. Compare two functions in one dimension: f(x) = α + βx and g(x) = sin(αx).

SVM. g(x) = sin(αx) is a really flexible classifier in one dimension, although it has only one parameter. f(x) = α + βx can only promise to separate two points every time, although it has one more parameter.

SVM. Vapnik-Chervonenkis dimension: the VC dimension of a class of classifiers {f(x, α)} is defined to be the largest number of points that can be shattered by members of {f(x, α)}. A set of points is said to be shattered by a class of functions if, no matter how the class labels are assigned, a member of the class can separate them perfectly.

SVM. A linear classifier is rigid: a hyperplane classifier has VC dimension d + 1, where d is the feature dimension.
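The classic XOR configuration makes this limited capacity concrete for d = 2: four points with the XOR labeling cannot be realized by any line, while an RBF-kernel SVM fits them easily. A sketch assuming scikit-learn; the points, labels, and kernel comparison are the usual toy example, not content from the slides.

```python
import numpy as np
from sklearn.svm import SVC

# A hyperplane in d = 2 cannot realize the XOR labeling of four points,
# whereas the much richer RBF-kernel class can.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])                      # XOR labeling

linear_acc = SVC(kernel="linear", C=1e6).fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf", gamma=2.0, C=1e6).fit(X, y).score(X, y)

print("linear SVM training accuracy on XOR:", linear_acc)  # < 1.0: cannot be shattered
print("RBF SVM training accuracy on XOR:   ", rbf_acc)     # 1.0
```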

SVM. The class sin(αx) has infinite VC dimension: by appropriate choice of α, any number of points can be shattered. The VC dimension of the nearest-neighbor classifier is also infinite --- you can always get perfect classification on the training data. For many classifiers it is difficult to compute the VC dimension exactly, but this doesn't diminish its value for theoretical arguments. The VC dimension measures the complexity of a class of functions by assessing how wiggly its members can be.
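A numerical illustration of the sin(αx) claim, using a classical construction often attributed to Burges' SVM tutorial; the specific point placement x_i = 10^(-i) and the formula for α below are restated from memory, so treat them as an assumption rather than material from the slides.

```python
import numpy as np

# Check numerically that sign(sin(alpha * x)) can realize any labeling of the
# points x_i = 10^(-i), i = 1..l, using alpha = pi * (1 + sum_i (1-y_i)/2 * 10^i).
def shattering_alpha(y):
    return np.pi * (1.0 + sum((1 - yi) / 2.0 * 10.0 ** (i + 1)
                              for i, yi in enumerate(y)))

l = 6
x = 10.0 ** -np.arange(1, l + 1)             # points 0.1, 0.01, ..., 1e-6
rng = np.random.default_rng(5)
ok = True
for _ in range(200):                          # many random labelings of the l points
    y = rng.choice([-1, 1], size=l)
    pred = np.sign(np.sin(shattering_alpha(y) * x))
    ok &= bool(np.array_equal(pred, y))
print("every tried labeling was realized:", ok)   # expected: True
```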

SVM. Strengths of SVM: flexibility; scales well to high-dimensional data; the complexity vs. error trade-off can be controlled explicitly; and as long as a kernel can be defined, non-traditional (non-vector) data such as strings and trees can be used as input. Weakness: how to choose a good kernel? (A low-degree polynomial or a radial basis function can be a good start.)
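One common way to act on that advice is to compare a low-degree polynomial and an RBF kernel by cross-validation. A minimal scikit-learn sketch; the parameter grid and the toy data are made up.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Compare a low-degree polynomial kernel and an RBF kernel (plus a few values
# of C) by 5-fold cross-validation, as a starting point for kernel choice.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal([1, 1], 1.0, (60, 2)), rng.normal([-1, -1], 1.0, (60, 2))])
y = np.array([1] * 60 + [-1] * 60)

param_grid = [
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": ["scale", 0.1, 1.0], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print("best kernel/parameters:", search.best_params_)
print("cross-validated accuracy: %.3f" % search.best_score_)
```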