1
Support vector machines
Predominant application: classification. Regression is possible but seldom used. Advantage: built-in generalization (wide margins).
2
Support Vector Machines: Background
Binary linear classification: predicting class membership with a linear discriminant y = w1 x + w0. [Figure: linearly separable data in 1D with the decision point and margins.]
3
Support Vector Machines: Background
Recall the family-car classification problem: the discriminant is defined in terms of support vectors, the hypothesis with the greatest distance between S and G. The filled circles are the subset of training examples required to define the version space and the hypothesis with maximum margins.
4
Support Vector Machines: Background
Linear discriminants in d > 2 dimensions are called "hyperplanes". In SVM, the optimal hyperplane is found by solving a quadratic program: minimize f(x) = ½ x^T Q x + c^T x subject to Ax ≤ b, where Q is a positive definite matrix. Since Q is positive definite, f(x) is convex. If f(x) is convex and bounded from below, a unique global minimum exists in the solution space.
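As an aside, a QP of this form can be handed to an off-the-shelf solver. Below is a minimal sketch using the cvxopt package; the solver choice and the toy values of Q, c, A, and b are illustrative assumptions, not part of the slides.

```python
# Sketch of the generic QP  min 1/2 x^T Q x + c^T x  s.t.  A x <= b,
# solved with cvxopt (an assumed library choice; any QP solver would do).
import numpy as np
from cvxopt import matrix, solvers

Q = matrix(np.array([[2.0, 0.0], [0.0, 2.0]]))          # positive definite -> convex objective
c = matrix(np.array([[-2.0], [-5.0]]))
A = matrix(np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]))  # three linear constraints Ax <= b
b = matrix(np.array([[2.0], [2.0], [-1.0]]))

solvers.options['show_progress'] = False
sol = solvers.qp(Q, c, A, b)            # cvxopt solves min 1/2 x'Px + q'x  s.t.  Gx <= h
print(np.array(sol['x']).ravel())       # the unique global minimum of the convex problem
```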
5
Support Vector Machines: Background
The quadratic programming problem, min f(x) = ½ x^T Q x + c^T x subject to Ax ≤ b, becomes a linear programming problem if Q is zero (not the case for SVM). Linear programs are solved by the simplex algorithm; the optimal x is at a vertex of the convex feasible region.
6
Active set in the quadratic programming problem
The active set determines which constraints influence the final result of the optimization. In the linear programming problem, the active set defines the allowed convex solution space and the vertex that is the solution. In a convex quadratic programming problem, the solution is not necessarily at a vertex, but the active set still guides the search for a solution. In the SVM quadratic programming problem, the support vectors define the active set.
7
Optimum Separating Hyperplane: 2 classes
8
Linearly-separable 2-class problem: find weights such that
r^t (w^T x^t + w0) ≥ +1 for all instances. Those with r^t (w^T x^t + w0) = +1 are support vectors that determine the margins. [Figure: separating hyperplane, margins, and support vectors.]
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
9
Distance of x^t from the hyperplane w^T x + w0 = 0
Let x_a and x_b be points on the hyperplane: w^T x_a + w0 = w^T x_b + w0 = 0, so w^T (x_a – x_b) = 0. Since x_a – x_b lies in the hyperplane, w must be normal to the hyperplane. [Figure: the two classes, labeled r^t = +1 and r^t = –1, on either side of the hyperplane.]
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
10
Distance of x^t from the hyperplane w^T x + w0 = 0
Decompose any point x into components parallel and perpendicular to the hyperplane: x = x_p + d w/‖w‖, where x_p is the projection of x onto the hyperplane, d is the distance of x from the hyperplane, and w/‖w‖ is the unit normal of the hyperplane.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
11
Distance of x^t from the hyperplane w^T x + w0 = 0
g(x_p) = 0 = g(x – d w/‖w‖)
0 = w^T (x – d w/‖w‖) + w0; using w^T w = ‖w‖², 0 = w^T x + w0 – d ‖w‖
d = |w^T x + w0| / ‖w‖, so d^t = |g(x^t)| / ‖w‖ = r^t (w^T x^t + w0) / ‖w‖
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
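A quick numerical sanity check of the distance formula d^t = r^t (w^T x^t + w0) / ‖w‖; the hyperplane and the two points below are made-up values.

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0       # made-up hyperplane w^T x + w0 = 0, with ||w|| = 5
X = np.array([[3.0, 1.0], [0.0, 0.0]])   # one point on each side of the hyperplane
r = np.array([1, -1])                    # labels matching the side each point lies on

g = X @ w + w0                           # g(x^t) = w^T x^t + w0
d = r * g / np.linalg.norm(w)            # positive when the point is on its labeled side
print(d)                                 # [1.6 1. ] -> distances of each x^t from the hyperplane
```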
12
Maximizing margins
The distance of x^t to the hyperplane is r^t (w^T x^t + w0) / ‖w‖ ≥ ρ. To maximize the margin ρ, fix ρ ‖w‖ = 1 and minimize ½ ‖w‖² subject to r^t (w^T x^t + w0) ≥ +1 for all t. The result is a linear discriminant, w^T x + w0 = 0, for binary classification with maximum margins (equal distance of the closest x^t from the decision boundary for both classes). The number of constraints equals the size of the training set. Solve by Lagrange multipliers.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
13
Duality in constrained optimization
In constrained optimization it is often possible to convert the primal problem (i.e. the original form of the optimization problem) to a dual form. In the Lagrangian dual problem, minimizing the Lagrangian yields the primal variables expressed as functions of the Lagrange multipliers, which are called dual variables. Maximizing the dual form with respect to the Lagrange multipliers, under their derived constraints (at least non-negativity), gives the optimal values of those expressions for the primal variables. In general, the solution of the dual problem provides only a lower bound on the solution of the (minimization) primal problem; the difference is called the duality gap. For convex optimization problems with linear constraints, the duality gap is zero, which is the case here.
14
Review: Distance of x^t from the hyperplane w^T x + w0 = 0
g(x_p) = 0 = g(x – d w/‖w‖)
0 = w^T (x – d w/‖w‖) + w0 = w^T x + w0 – d ‖w‖
d = |w^T x + w0| / ‖w‖, so d^t = |g(x^t)| / ‖w‖ = r^t (w^T x^t + w0) / ‖w‖
Goal of SVM: find the linear discriminant that maximizes the smallest d^t over all instances.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
15
Solution by Lagrange multipliers
L_p = ½ ‖w‖² – Σ_t α^t [r^t (w^T x^t + w0) – 1], where the subscript p denotes the Lagrangian of the "primal" optimization problem. Setting ∂L_p/∂w = 0 gives w = Σ_t α^t r^t x^t; setting ∂L_p/∂w0 = 0 gives Σ_t α^t r^t = 0. Substitute these conditions for a stationary point back into the Lagrangian.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
16
Maximize L_d: eliminate w using the conditions for a stationary point
L_d = Σ_t α^t – ½ Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s, where the subscript d denotes the dual optimization problem with variables α^t. Maximize L_d subject to Σ_t α^t r^t = 0 and α^t ≥ 0.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
17
Given the α^t > 0 that maximize L_d, calculate w; still need to find w0
Set α^t = 0 for data points sufficiently far from the discriminant to be ignored in the search for the hyperplane with maximum margins. Find the remaining α^t > 0 by quadratic programming. Given the α^t > 0 that maximize L_d, calculate w = Σ_t α^t r^t x^t; w0 is still needed. For support vectors (x^t on the margin), r^t (w^T x^t + w0) = 1 and (r^t)² = 1; therefore w0 = r^t – w^T x^t. Usually w0 is averaged over all support vectors.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
18
Review: SVM for the linearly-separable 2-class problem
In-sample error is zero; focus on generalization by maximizing the margins. [Figure: separating hyperplane, margins, and support vectors.]
19
Review: SVM for the linearly-separable 2-class problem
Maximize L_d = Σ_t α^t – ½ Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s subject to Σ_t α^t r^t = 0 and α^t ≥ 0. Typical quiz question: in this expression, what is the meaning of the variables x^t, r^t and α^t?
20
Review: What are the 3 steps to finding the decision boundary with maximum margins?
1. Set α^t = 0 for data points sufficiently far from the boundary to be ignored in the search for the hyperplane with maximum margins.
2. Apply quadratic programming to find the non-zero α^t by maximizing L_d = Σ_t α^t – ½ Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s.
3. Evaluate w = Σ_t α^t r^t x^t and w0 = <r^t – w^T x^t>, where < > denotes the average over data points on the margins.
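A worked sketch of these three steps on a tiny linearly separable toy set, again using cvxopt; the data, the solver choice, and the 1e-5 support-vector threshold are arbitrary illustrative choices, not from the slides.

```python
import numpy as np
from cvxopt import matrix, solvers

# Tiny linearly separable toy set (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
r = np.array([1.0, 1.0, -1.0, -1.0])
N = len(r)

# Dual: maximize sum_t a_t - 1/2 sum_ts a_t a_s r_t r_s x_t.x_s
# cvxopt minimizes 1/2 a'Pa + q'a, so P = (r r^T) * (X X^T) and q = -1
P = matrix(np.outer(r, r) * (X @ X.T))
q = matrix(-np.ones((N, 1)))
G = matrix(-np.eye(N))                   # -a_t <= 0, i.e. a_t >= 0
h = matrix(np.zeros((N, 1)))
A = matrix(r.reshape(1, -1))             # equality constraint sum_t a_t r_t = 0
b = matrix(np.zeros((1, 1)))

solvers.options['show_progress'] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()

sv = alpha > 1e-5                        # step 1: a_t ~ 0 for points far from the boundary
w = (alpha[sv] * r[sv]) @ X[sv]          # step 3: w = sum_t a_t r_t x_t
w0 = np.mean(r[sv] - X[sv] @ w)          # w0 = <r_t - w^T x_t> over the support vectors
print(np.sign(X @ w + w0))               # reproduces the labels r
```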
21
Application of SVM to binary classification when classes are not linearly separable
Introduce "slack variables" to allow for incorrectly classified instances and for correctly classified instances inside the margins. The result is called a "soft-margin hyperplane": instances in the margins or misclassified are allowed but penalized by a function called the "soft error".
22
Slack variables in constraints
The constraints become r^t (w^T x^t + w0) ≥ 1 – ξ^t with ξ^t ≥ 0; the ξ^t are new primal variables in L_p, and the soft error is Σ_t ξ^t. In the figure: (a) and (b) are classified correctly, ξ^t = 0; (c) is correct but in the margin, 0 < ξ^t < 1; (d) is misclassified, ξ^t > 1.
23
Soft error also called “hinge” loss function
y^t = w^T x^t + w0 is the discriminant output and r^t is the label. If y^t r^t ≥ 1, x^t is correctly classified and not in the margin, so L_hinge(y^t, r^t) = 0; otherwise L_hinge(y^t, r^t) = 1 – y^t r^t. The penalty is non-zero for correctly classified data in the margin and increases linearly for misclassified data.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
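A one-line version of the hinge loss, assuming y^t is the discriminant output and r^t ∈ {–1, +1} is the label; the numerical values below are made up.

```python
import numpy as np

def hinge_loss(y, r):
    """Hinge loss: 0 if r*y >= 1, else 1 - r*y (grows linearly with misclassification)."""
    return np.maximum(0.0, 1.0 - r * y)

y = np.array([2.3, 0.4, -0.7])   # discriminant outputs w^T x^t + w0 (made up)
r = np.array([1,   1,    1])     # true labels
print(hinge_loss(y, r))          # [0.  0.6 1.7]: outside margin, inside margin, misclassified
```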
24
Typical quiz question: Add hinge loss for examples with rt = -1
to the figure below
25
Typical quiz question: Given values of y^t and r^t, calculate
the soft error in the classification of x^t. Contribution to the soft error = L_hinge(y^t, r^t).
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
26
SVM with slack variables is called “C-SVM”
In C-SVM the Lagrangian is L_p = ½ ‖w‖² + C Σ_t ξ^t – Σ_t α^t [r^t (w^T x^t + w0) – 1 + ξ^t] – Σ_t μ^t ξ^t. Typical quiz questions: What are the primal variables in L_p? What are the Lagrange multipliers? What role does the constant C play? How is a value of C determined?
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
27
Lp with new primal variables, constraints and soft error
The primal variables are w, w0, and ξ^t; α^t and μ^t are Lagrange multipliers; C is the regularization parameter. A smaller C penalizes the soft error less, so ‖w‖² becomes more important in minimizing L_p. This gives larger margins but more data points in the margins. Use a validation set to find the best choice of C.
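The advice to pick C with a validation set might look like this; the use of scikit-learn, the WDBC data it bundles, and the candidate C grid are assumptions for illustration, not part of the slides.

```python
# Cross-validated choice of the C-SVM regularization parameter (a sketch).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC(kernel='linear'))
grid = GridSearchCV(pipe, {'svc__C': [0.01, 0.1, 1, 10, 100]}, cv=5)   # C grid is arbitrary
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 4))
```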
28
The C-SVM dual: maximize L_d subject to Σ_t α^t r^t = 0 and 0 ≤ α^t ≤ C for all t
Setting the derivatives of L_p with respect to the primal variables w, w0, and ξ^t equal to zero gives w = Σ_t α^t r^t x^t, Σ_t α^t r^t = 0, and C – α^t – μ^t = 0, which implies 0 ≤ α^t ≤ C because Lagrange multipliers are non-negative. Substitution into L_p gives the same dual as in the linearly separable case, to be maximized subject to Σ_t α^t r^t = 0 and 0 ≤ α^t ≤ C for all t. α^t = 0 for instances correctly classified with sufficient margin; 0 < α^t < C for instances on the margins; α^t = C for instances in the margins and/or misclassified. Given the values α^t, how do we find w and w0?
29
Given the values α^t of C-SVM, how do we find w and w0?
Since ξ^t = 0 for points on the margins, the process is the same as in the separable case: evaluate w = Σ_t α^t r^t x^t and w0 = <r^t – w^T x^t>, where < > denotes the average over data points on the margins.
30
ν-SVM: another approach to soft margins
ν is a regularization parameter, shown to be an upper bound on the fraction of instances in the margin. ρ is a new primal variable related to the margin: the margin width is 2ρ/‖w‖. The other primal variables are w, w0, and the slack variables ξ^t. Minimize ½ ‖w‖² – νρ + (1/N) Σ_t ξ^t subject to r^t (w^T x^t + w0) ≥ ρ – ξ^t, ξ^t ≥ 0, and ρ ≥ 0; the corresponding dual is maximized as before, now subject to 0 ≤ α^t ≤ 1/N, Σ_t α^t r^t = 0, and Σ_t α^t ≥ ν.
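A sketch of ν-SVM using scikit-learn's NuSVC, comparing two values of ν; the data set, the two ν values, and the library choice are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# nu upper-bounds the fraction of margin errors and lower-bounds the
# fraction of support vectors; compare the two runs below.
for nu in (0.05, 0.3):
    clf = NuSVC(nu=nu, kernel='linear').fit(X, y)
    frac_sv = clf.support_.size / len(y)
    err = 1 - clf.score(X, y)
    print(f"nu={nu}: fraction of support vectors={frac_sv:.2f}, training error={err:.3f}")
```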
31
In the Lagrangian of ν-SVM with added constraints,
what are the primal variables? What are the Lagrange multipliers? What is the purpose of the constant ν? What is the meaning of the constant ν? If ν increases, what is the likely effect on the width of the margins?
32
C-SVM versus ν-SVM. Both regularization parameters serve the same purpose: a tradeoff between small soft error and wide margins. Small C penalizes soft error less and favors wide margins. Large ν penalizes soft error less and favors wide margins. A validation set is used in both cases to get the best tradeoff. Unlike C, the value of ν has an important interpretation.
33
Application of Support Vector Machines using Weka software
Find the package manager under Setup and Tools. Find libsvm in the list of packages. In Explorer, open Classify; under Choose, find libsvm under functions. Data set: breast cancer diagnostics, UCI Machine Learning Repository.
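For readers without Weka, here is a rough scikit-learn equivalent of this workflow; the WDBC data ships with scikit-learn, and the three attribute names, the linear kernel, and C = 0.9 echo the Weka settings on the following slides but remain assumptions of this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()                       # UCI WDBC: 569 cases, 30 attributes
cols = [list(data.feature_names).index(f)         # the three selected attributes
        for f in ('worst area', 'mean texture', 'worst smoothness')]
X, y = data.data[:, cols], data.target

clf = make_pipeline(StandardScaler(), SVC(kernel='linear', C=0.9))
scores = cross_val_score(clf, X, y, cv=10)        # 10-fold cross-validation
print(f"mean CV accuracy: {scores.mean():.3f}")
```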
36
Classification with selected attributes
Selected attributes, 3 out of 5 repeats of 10-fold CV: worst area, mean texture, and worst smoothness. Single separating plane, accuracy 97.54%, 10-fold cross-validation repeated 100 times; selected attributes: worst radius, worst texture, and worst concave points.
37
worst area, mean texture, and worst smoothness attributes
MLP with Weka default settings
38
worst area, mean texture, and worst smoothness attributes
MLP with Weka default settings except normalization. Normalization helps.
39
worst area, mean texture, and worst smoothness attributes
SVM with Weka default settings. Not as good as MLP: many malignant cases classified as benign.
40
worst area, mean texture, and worst smoothness attributes
Best SVM with Weka: linear u'*v kernel, C-SVM classification, Cost = 0.9, Coeff0 = 0.01, soft labels (setting under probability estimates).
Confusion matrix (rows = actual, columns = classified as):
        M    B
  M   203    9
  B     2  355
Accuracy = 98.07%, slightly better than MLP and the published result. Significant reduction in malignant cases classified as benign.
41
SVM in feature space
42
Review: transformation to feature space
When data are not linearly separable, a transformation to feature space, z = Φ(x) = (1, x1², x2²) with the leading 1 as the bias term Φ0, might lead to linearly separable features. [Figure: attribute space vs. feature space.] In this case both attribute and feature space are 2D.
43
C-SVM in feature space and slack variables
Minimize L_p = ½ ‖w‖² + C Σ_t ξ^t subject to the constraints ξ^t ≥ 0 and r^t w^T Φ(x^t) ≥ 1 – ξ^t for all t (note: w0 is in w and Φ0 = 1). Setting derivatives equal to zero yields w = Σ_t α^t r^t Φ(x^t) and C – α^t – μ^t = 0, which implies 0 ≤ α^t ≤ C. Substitute into L_p to get the dual: the same as C-SVM except with features rather than attributes.
44
Maximize the dual in feature space
Maximize L_d = Σ_t α^t – ½ Σ_t Σ_s α^t α^s r^t r^s Φ(x^t)^T Φ(x^s) subject to the constraints Σ_t α^t r^t = 0 and 0 ≤ α^t ≤ C for all t. As in attribute space: α^t = 0 for instances correctly classified with sufficient margin; 0 < α^t < C for support vectors (find these α^t by quadratic programming); α^t = C for instances in the margins and/or misclassified.
45
SVM has a distinct advantage over other feature-space methods
Recall: beer-bottle glass had 9 attributes; a full quadratic extension of the linear model has 81 features. The complexity of SVM depends on the number of support vectors and is independent of the dimension of the feature space.
46
Kernel machines
In feature space the discriminant is g(x) = w^T Φ(x) = Σ_t α^t r^t Φ(x^t)^T Φ(x). The discriminant is sufficient for classification; we do not need the weights explicitly. The discriminant can be written g(x) = Σ_t α^t r^t K(x^t, x), where K(x^t, x) = Φ(x^t)^T Φ(x) is the "kernel".
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
47
The kernel trick
The dual can be written L_d = Σ_t α^t – ½ Σ_t Σ_s α^t α^s r^t r^s K(x^t, x^s). Maximizing L_d determines the parameters α^t and hence the discriminant g(x) = Σ_t α^t r^t K(x^t, x). It is not essential that the kernel be related to a transformation of attributes: the kernel can be any function of 2 vectors, each of dimension d+1. Kernel machines still contain the regularization parameter C through the constraint 0 ≤ α^t ≤ C for all t, which must be optimized for each choice of kernel with a validation set.
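The two kernel-trick equations can be checked numerically with a fitted classifier: its decision function should equal Σ_t α^t r^t K(x^t, x) + w0, computed from the stored support vectors alone. The sketch below uses scikit-learn and an RBF kernel, both assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

clf = SVC(kernel='rbf', C=1.0, gamma=0.01).fit(X, y)

# dual_coef_ stores a^t r^t for the support vectors only; all other a^t are 0
K = rbf_kernel(X, clf.support_vectors_, gamma=0.01)     # K(x, x^t) for each support vector
g = K @ clf.dual_coef_.ravel() + clf.intercept_         # sum_t a^t r^t K(x^t, x) + w0
print(np.allclose(g, clf.decision_function(X)))         # True: explicit weights never needed
```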
48
Possible quiz questions about Kernel machines
What 2 equations in feature space enable the “kernel trick”? Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 48
49
What 2 equations in feature space enable the "kernel trick"?
The dual L_d = Σ_t α^t – ½ Σ_t Σ_s α^t α^s r^t r^s Φ(x^t)^T Φ(x^s) and the discriminant g(x) = Σ_t α^t r^t Φ(x^t)^T Φ(x). Explicit weights, which cannot be written as a dot product of features Φ(x^t)^T Φ(x), are not needed.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
50
Possible quiz questions about Kernel machines
Do kernel machines include regularization? If so, how?
51
Do kernel machines include regularization? If so, how?
Kernel machines still contain the regularization parameter C through the constraints 0 < α^t < C for support vectors on the margins and α^t = C for instances in the margins and/or misclassified. C must be optimized with a validation set for each choice of kernel.
52
Kernel machines absorbed C-SVM
Attribute-space C-SVM is considered a special case of kernel machines. C-SVM in attribute space is a linear-kernel machine.
53
C-SVM as a linear kernel machine, K(x^t, x) = (x^t)^T x
The dual becomes L_d = Σ_t α^t – ½ Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s and the discriminant becomes g(x) = Σ_t α^t r^t (x^t)^T x + w0. The weight vector w = Σ_t α^t r^t x^t cannot be written in terms of the kernel, but we need w0 = <r^t – w^T x^t> to complete the discriminant. To avoid explicit dependence on w, w0 can be treated as a parameter in linear kernel machines.
54
worst area, mean texture, and worst smoothness attributes
Best SVM with Weka: linear u'*v kernel, C-SVM classification, Cost = 0.9, Coeff0 = 0.01 (probably Weka's notation for w0), soft labels (setting under probability estimates).
Confusion matrix (rows = actual, columns = classified as):
        M    B
  M   203    9
  B     2  355
Significant improvement in the M class. Accuracy = 98.07%, slightly better than MLP and the published result.
55
Some commonly used non-linear kernels
56
Polynomial kernel of degree q. Example: q = 2 in 2D
K(x, y) = (x^T y + 1)^q. For q = 2 in 2D attribute space, the quadratic kernel is K(x, y) = (x1 y1 + x2 y2 + 1)² = 1 + 2 x1 y1 + 2 x2 y2 + 2 x1 x2 y1 y2 + x1² y1² + x2² y2².
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
57
2D quadratic kernel as product of basis functions
K(x, y) = Φ(x)^T Φ(y) with Φ(x) = [1, √2 x1, √2 x2, √2 x1 x2, x1², x2²], and similarly for Φ(y). This 6D feature space is a way to interpret the kernel but does not play any role in applications of the 2D quadratic kernel machine.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
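A quick numerical check that the 6D basis reproduces the 2D quadratic kernel; the two test vectors are arbitrary.

```python
import numpy as np

def phi(v):
    """6D basis for the 2D quadratic kernel: (1 + x^T y)^2 = phi(x) . phi(y)."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1**2, x2**2])

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])    # arbitrary 2D points
print(np.isclose((1 + x @ y)**2, phi(x) @ phi(y)))     # True
```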
58
Radial-basis function kernel
K(x^t, x) = exp(–‖x – x^t‖² / (2 s²)). As s decreases, x must be more like x^t to have a large value of K. More data points influence the boundary, its shape becomes more complex, and it is less likely to generalize.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
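The effect of the kernel width can be seen by counting support vectors. In scikit-learn the RBF kernel is parameterized as exp(–γ ‖x – x^t‖²), so a small s corresponds to a large γ; the data set and the γ values below are arbitrary choices for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

for gamma in (0.001, 0.1, 10.0):          # gamma = 1/(2 s^2): larger gamma means smaller s
    clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X, y)
    print(f"gamma={gamma}: {clf.support_.size} support vectors "
          f"(more SVs -> more complex, less general boundary)")
```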
59
Kernel engineering
K(x, y) should get larger as x and y get more similar; this is true for the linear kernel x^T y. Application-specific kernels start with a good measure of similarity, and the similarity measure depends on the data type. Kernels have been developed for data such as strings, graphs, and images.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
60
Kernel engineering for genomic analysis is responsible for the increased use of SVM after 2000.
61
Example of kernel engineering:
Empirical kernel map: define a set of M templates m_i and get scores s(x, m_i) = similarity of x to template m_i. Then φ(x^t) = [s(x^t, m_1), s(x^t, m_2), ..., s(x^t, m_M)] is an M-dimensional representation of x^t. Define K(x, x^t) = φ(x)^T φ(x^t). Example: select DNA sequences like those from a particular organism out of a genome database.
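A generic sketch of the empirical kernel map. The similarity score s(x, m) here is just an RBF similarity standing in for a real, application-specific score (e.g. a sequence-alignment score); all names, templates, and values are illustrative.

```python
import numpy as np

def similarity(x, m, s=1.0):
    """Stand-in similarity score s(x, m); a real application would use an
    application-specific measure (e.g. sequence alignment) instead."""
    return np.exp(-np.sum((x - m) ** 2) / (2 * s ** 2))

def empirical_map(x, templates):
    """phi(x) = [s(x, m_1), ..., s(x, m_M)]: an M-dimensional representation of x."""
    return np.array([similarity(x, m) for m in templates])

def empirical_kernel(x, z, templates):
    """K(x, z) = phi(x)^T phi(z) built from the template scores."""
    return empirical_map(x, templates) @ empirical_map(z, templates)

rng = np.random.default_rng(0)
templates = rng.normal(size=(5, 3))            # M = 5 made-up templates in 3D
x, z = rng.normal(size=3), rng.normal(size=3)  # two made-up instances
print(empirical_kernel(x, z, templates))
```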
62
Advantages of kernel machines for classification
Generalization is built into the theory. Regularization is part of the theory. Sparsity of the solution is preserved. High dimensionality of the feature space is not relevant. Facilitates kernel engineering.
Disadvantage of kernel machines for classification: one must choose a kernel and evaluate its parameters.
63
Beyond binary classification
Generate K "1 vs all" discriminants g_i(x), i = 1, ..., K, and assign x to the class with the largest g_i(x). Example: 2 attributes, multiple classes, axis-aligned rectangle hypothesis class; same type of data (price, power), but the label is a Boolean vector (all zeros but one). Treat the K-class classification problem as K 2-class problems and hence train K hypotheses: examples belonging to C_i are positive for h_i, and examples belonging to all other classes are negative for h_i. Training minimizes the sum of errors over all classes. Generalization includes doubt if a (price, power) datum does not fall in one and only one boundary.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
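A sketch of the "K one-vs-all discriminants, pick the largest g_i(x)" scheme; the iris data, linear kernel, and scikit-learn wrapper are arbitrary choices, not from the slides.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                        # K = 3 classes
ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)

g = ovr.decision_function(X)                             # one discriminant g_i(x) per class
pred = np.argmax(g, axis=1)                              # assign x to the class with largest g_i(x)
print((pred == ovr.predict(X)).all())                    # True: matches the classifier's own rule
```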
64
One-class SVM: find the optimum boundary that separates high-density data (similar) from outliers (different).
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
65
Attribute space formulation
Consider a hyper-sphere with center a, the centroid of the similar data, and radius R, which defines a soft boundary on the similar data. In the figure, (a) is data involved in finding a but not R; (b) is data on the sphere (ξ^t = 0) used to find R given a; (c) is data close to the boundary that reflects the tolerance for soft error (ξ^t > 0).
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
66
One-class SVM: Set up the Lagrangian
Minimize R² + C Σ_t ξ^t subject to ‖x^t – a‖² ≤ R² + ξ^t and ξ^t ≥ 0. As in C-SVM, C determines the importance of the soft error in minimizing the Lagrangian. A small C means R² is most important, favoring a smaller but fuzzier boundary on the similar data. A large C penalizes soft error more, favoring a larger but less fuzzy boundary on the similar data.
67
Add Lagrange multipliers α^t ≥ 0 and γ^t ≥ 0 for the constraints
L_p = R² + C Σ_t ξ^t – Σ_t α^t [R² + ξ^t – ‖x^t – a‖²] – Σ_t γ^t ξ^t. Possible quiz questions: What are the primal variables? What are the constraints on the primal variables and the associated Lagrange multipliers?
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
68
Add Lagrange multipliers α^t ≥ 0 and γ^t ≥ 0 for the constraints
Setting the derivatives with respect to the primal variables R, a, and ξ^t to zero gives Σ_t α^t = 1, a = Σ_t α^t x^t, and C – α^t – γ^t = 0, so 0 ≤ α^t ≤ C. Substituting back into L_p we get the dual to be maximized: L_d = Σ_t α^t (x^t)^T x^t – Σ_t Σ_s α^t α^s (x^t)^T x^s. Given the α^t, find the radius R from any support vector on the sphere (0 < α^t < C): R = ‖x^t – a‖.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
69
One-Class Kernel Machines
Replace the dot products in L_d by a kernel: L_d = Σ_t α^t K(x^t, x^t) – Σ_t Σ_s α^t α^s K(x^t, x^s).
70
One-Class Gaussian Kernel Machine
Consider x a point in outlier space if Σ_t α^t K(x, x^t) = Σ_t α^t exp(–‖x – x^t‖² / (2 s²)) falls below a threshold R_c. If the variance s² is small, the sum drops below the threshold R_c unless x is very close to some x^t; the data become more important in determining the shape of the decision boundary.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
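A sketch of a one-class Gaussian kernel machine using scikit-learn's OneClassSVM, where the ν parameter bounds the fraction of training points treated as outliers; the data and parameter values are made up for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_similar = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # high-density "similar" data
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])                 # one inlier, one obvious outlier

# Gaussian (RBF) kernel; nu bounds the fraction of training points treated as outliers
oc = OneClassSVM(kernel='rbf', gamma=0.5, nu=0.05).fit(X_similar)
print(oc.predict(X_new))     # +1 = inside the boundary (similar), -1 = outlier
```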