Presentation on theme: "An Introduction of Support Vector Machine"— Presentation transcript:
1 An Introduction of Support Vector Machine. Courtesy of Jinwei Gu.
2 Today: Support Vector Machine (SVM). A classifier derived from statistical learning theory by Vapnik et al. in 1992. SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features on a handwriting recognition task. Currently, SVMs are widely used in object detection and recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.
3 Discriminant Function. A discriminant function g(x) can be an arbitrary function of x, such as a decision tree, a linear function, or a nonlinear function.
4 Linear Discriminant Function. g(x) is a linear function: g(x) = wT x + b. The decision boundary wT x + b = 0 is a hyperplane in the n-dimensional feature space (in 2-D: w1x1 + w2x2 + b = 0), where w and x are vectors. The (unit-length) normal vector of the hyperplane is n = w / ||w||. Points with wT x + b > 0 lie on one side of the hyperplane and points with wT x + b < 0 on the other.
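A minimal numerical sketch of this slide, assuming a hypothetical 2-D weight vector w and bias b (not taken from the slides): it evaluates g(x) = wT x + b, classifies by its sign, and computes the unit-length normal of the hyperplane.

```python
import numpy as np

# Hypothetical hyperplane parameters (illustrative assumption): g(x) = w^T x + b
w = np.array([2.0, 1.0])
b = -4.0

def g(x):
    """Linear discriminant function g(x) = w^T x + b."""
    return w @ x + b

def classify(x):
    """Predict +1 if g(x) > 0, otherwise -1."""
    return 1 if g(x) > 0 else -1

# Unit-length normal vector of the hyperplane w^T x + b = 0
n = w / np.linalg.norm(w)

print(g(np.array([3.0, 1.0])), classify(np.array([3.0, 1.0])))  # positive side
print(g(np.array([0.0, 0.0])), classify(np.array([0.0, 0.0])))  # negative side
print("unit normal:", n)
```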
5-8 Linear Discriminant Function (one class of points denotes +1, the other denotes -1). How would you classify these points using a linear discriminant function in order to minimize the error rate? There are an infinite number of answers. Which one is the best?
9 Large Margin Linear Classifier. The linear discriminant function (classifier) with the maximum margin is the best. The margin (the "safe zone") is defined as the width by which the boundary could be increased before hitting a data point. Why is it the best? Because it is robust to outliers and thus has strong generalization ability.
10 Large Margin Linear Classifier. Given a set of data points {(xi, yi)}, we want wT xi + b > 0 for yi = +1 and wT xi + b < 0 for yi = -1. (Note: the class labels are +1/-1, NOT 1/0.) With a scale transformation on both w and b, the above is equivalent to wT xi + b >= 1 for yi = +1 and wT xi + b <= -1 for yi = -1.
11 Large Margin Linear Classifier. For support vectors x+ and x- lying on the two margin hyperplanes we know that wT x+ + b = 1 and wT x- + b = -1 (the decision boundary is wT x + b = 0). The margin width is therefore M = (x+ - x-) · n = 2 / ||w||.
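A small numerical check of the margin formula, again using a hypothetical weight vector (an illustrative assumption, not from the slides):

```python
import numpy as np

# Hypothetical weight vector of a trained linear classifier (illustrative assumption)
w = np.array([2.0, 1.0])

# Margin width between the hyperplanes w^T x + b = +1 and w^T x + b = -1
margin = 2.0 / np.linalg.norm(w)
print("margin width:", margin)
```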
12 Large Margin Linear Classifier. Formulation: maximize the margin 2 / ||w|| such that yi (wT xi + b) >= 1 for all training points (xi, yi). Remember: our objective is to find w and b such that the margin is maximized.
13-14 Large Margin Linear Classifier. Equivalent formulation: minimize (1/2) ||w||^2 such that yi (wT xi + b) >= 1 for all i.
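A minimal sketch of solving this primal problem directly with a general-purpose solver, on a tiny hypothetical 2-D dataset (the data and the use of scipy's SLSQP solver are illustrative assumptions, not part of the slides):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (illustrative assumption)
X = np.array([[1.0, 1.0], [2.0, 0.0], [2.0, 3.0], [3.0, 4.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables packed as z = [w1, w2, b]
def objective(z):
    w = z[:2]
    return 0.5 * w @ w          # (1/2) ||w||^2

def constraint(z):
    w, b = z[:2], z[2]
    return y * (X @ w + b) - 1  # y_i (w^T x_i + b) - 1 >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraint}])
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```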
15 Solving the Optimization Problem. The problem "minimize (1/2) ||w||^2 s.t. yi (wT xi + b) >= 1" is quadratic programming with linear constraints. Introducing a Lagrange multiplier αi >= 0 for each constraint gives the Lagrangian function L(w, b, α) = (1/2) ||w||^2 - Σi αi [yi (wT xi + b) - 1].
16 Lagrangian. Given a function f(x) and a set of constraints c1..cn, a Lagrangian is a function L(f, c1, .., cn, α1, .., αn) that "incorporates" the constraints into the optimization problem. The optimum is at a point where the Karush-Kuhn-Tucker (KKT) conditions hold: (1) the gradient of L with respect to the primal variables is zero, (2) αi >= 0 for all i, and (3) αi ci(x) = 0 for all i. The third condition is known as the complementarity condition.
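Written out for the SVM Lagrangian of the previous slide (a standard restatement, sketched here because the slide's formulas did not survive the transcript):

```latex
\begin{aligned}
&\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_i \alpha_i y_i \mathbf{x}_i = 0,
\qquad
\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0, \\
&\alpha_i \ge 0,
\qquad
\alpha_i \bigl[\, y_i(\mathbf{w}^{\mathsf T}\mathbf{x}_i + b) - 1 \,\bigr] = 0
\quad \text{(complementarity condition)}.
\end{aligned}
```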
17 Solving the Optimization Problem. In the Lagrangian L(w, b, α) = (1/2) ||w||^2 - Σi αi [yi (wT xi + b) - 1], the constraint term is subtracted, so to minimize L we need to maximize that subtracted term (the expression highlighted in the red box on the slide).
18 Dual Problem Formulation (2). Setting the derivatives of L to zero gives w = Σi αi yi xi and Σi αi yi = 0. Substituting these back into L, and using complementarity condition 3, yields the Lagrangian dual problem: maximize Σi αi - (1/2) ΣΣ αi αj yi yj xiT xj such that Σi αi yi = 0 and αi >= 0. Each non-zero αi indicates that the corresponding xi is a support vector (SV).
19 Why only SVs? The complementarity condition (the third KKT condition, see the previous definition of the Lagrangian) implies that one of the two factors, αi or [yi (wT xi + b) - 1], is zero. For non-SV xi we have yi (wT xi + b) > 1, so αi must be 0. Therefore only data points on the margin (the SVs) can have a non-zero αi, because they are the only ones for which yi (wT xi + b) = 1.
20 Final Formulation. To summarize: find α1…αn such that Q(α) = Σ αi - (1/2) ΣΣ αi αj yi yj xiT xj is maximized and (1) Σ αi yi = 0, (2) αi >= 0 for all αi. Again, remember that constraint (2) holds as a strict inequality (αi > 0) only for the SVs!
21 The Optimization Problem Solution. Given a solution α1…αn to the dual problem Q(α), the solution to the primal is: w = Σ αi yi xi and b = yk - Σ αi yi xiT xk for any αk > 0. Each non-zero αi indicates that the corresponding xi is a support vector. The classifying function for a new point x is then g(x) = Σ αi yi xiT x + b (where the xi are the SVs); note that we do not need w explicitly. Notice that it relies on an inner product xiT x between the test point x and the support vectors xi: only the SVs are needed to classify new instances. Also keep in mind that solving the optimization problem involved computing the inner products xiT xj between all training points.
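A minimal sketch of the whole pipeline on toy data: solve the dual with a generic solver, then recover w, b, and g(x) from the α's. The dataset and the choice of scipy's SLSQP solver are illustrative assumptions; real SVM packages use specialized QP solvers.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative assumption)
X = np.array([[1.0, 1.0], [2.0, 0.0], [2.0, 3.0], [3.0, 4.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j x_i^T x_j

# Dual: maximize Q(a) = sum(a) - 1/2 a^T G a  ->  minimize -Q(a)
def neg_Q(a):
    return -(a.sum() - 0.5 * a @ G @ a)

res = minimize(neg_Q, x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y),                          # a_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i a_i y_i = 0
alpha = res.x

# Recover the primal solution: w = sum_i a_i y_i x_i, b = y_k - w^T x_k for any a_k > 0
w = (alpha * y) @ X
k = np.argmax(alpha)
b = y[k] - w @ X[k]

def g(x):
    """Classifying function g(x) = sum_i a_i y_i x_i^T x + b (only SVs contribute)."""
    return (alpha * y) @ (X @ x) + b

print("support vectors:", X[alpha > 1e-6])
print("prediction for (0, 0):", np.sign(g(np.array([0.0, 0.0]))))
```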
22 Working Example. A 3-point training set: (1,1) labeled +, (2,0) labeled +, (2,3) labeled -. The maximum-margin weight vector w will be parallel to the shortest line connecting points of the two classes, that is, the line between (1,1) and (2,3) (the two support vectors).
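As a sanity check of this example, one can let an off-the-shelf solver compute the separator (using scikit-learn here is an assumption; any linear SVM with a very large C, i.e. an effectively hard margin, would do):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 0.0], [2.0, 3.0]])
y = np.array([1, 1, -1])

# A very large C approximates the hard-margin classifier
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", clf.coef_[0])                  # should be parallel to (2,3) - (1,1) = (1,2)
print("b =", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)
```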
24 Soft Margin Classification. What if the training set is not linearly separable? Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margins are called soft. The points xi with non-zero slack ξi are the margin-violating or misclassified examples.
25 Large Margin Linear Classifier. Formulation: minimize (1/2) ||w||^2 + C Σi ξi such that yi (wT xi + b) >= 1 - ξi and ξi >= 0. The parameter C can be viewed as a way to control over-fitting.
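A short sketch of the effect of C, using scikit-learn's linear SVC on noisy toy data (the dataset and the parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not perfectly linearly separable
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wide margin, many support vectors (more slack allowed).
    # Large C: narrow margin, fewer support vectors (slack heavily penalized).
    print(f"C={C:>6}: {len(clf.support_)} support vectors, "
          f"margin width = {2 / np.linalg.norm(clf.coef_):.3f}")
```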
26 Large Margin Linear Classifier. Formulation (Lagrangian dual problem): maximize Σi αi - (1/2) ΣΣ αi αj yi yj xiT xj such that Σi αi yi = 0 and 0 <= αi <= C.
27 Non-linear SVMs. Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space, e.g. x → (x, x^2)?
28 Non-linear SVMs: Feature Space. General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x).
29 The original objects (left side of the schematic) are mapped, i.e., rearranged, using a set of mathematical functions known as kernels; the mapped objects are denoted φ(x).
30 Nonlinear SVMs: The Kernel Trick. With this mapping, our discriminant function becomes g(x) = Σ αi yi φ(xi)T φ(x) + b (the original formulation was g(x) = Σ αi yi xiT x + b). There is no need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: K(xi, xj) = φ(xi)T φ(xj).
31 Nonlinear SVMs: The Kernel Trick. An example: for 2-dimensional vectors x = [x1 x2], let K(xi, xj) = (1 + xiT xj)^2. We need to show that K(xi, xj) = φ(xi)T φ(xj). Expanding: K(xi, xj) = (1 + xiT xj)^2 = 1 + xi1^2 xj1^2 + 2 xi1 xj1 xi2 xj2 + xi2^2 xj2^2 + 2 xi1 xj1 + 2 xi2 xj2 = [1, xi1^2, √2 xi1 xi2, xi2^2, √2 xi1, √2 xi2]T [1, xj1^2, √2 xj1 xj2, xj2^2, √2 xj1, √2 xj2] = φ(xi)T φ(xj), where φ(x) = [1, x1^2, √2 x1 x2, x2^2, √2 x1, √2 x2].
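A quick numerical check of this identity (the random test vectors are an illustrative assumption):

```python
import numpy as np

def K(xi, xj):
    """Polynomial kernel (1 + xi^T xj)^2."""
    return (1 + xi @ xj) ** 2

def phi(x):
    """Explicit feature map for the kernel above (2-D input -> 6-D feature space)."""
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)
print(K(xi, xj), phi(xi) @ phi(xj))  # the two numbers agree
```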
32 Nonlinear SVMs: The Kernel Trick. Examples of commonly used kernel functions: linear kernel K(xi, xj) = xiT xj; polynomial kernel K(xi, xj) = (1 + xiT xj)^p; Gaussian (Radial Basis Function, RBF) kernel K(xi, xj) = exp(-||xi - xj||^2 / (2σ^2)); sigmoid kernel K(xi, xj) = tanh(β0 xiT xj + β1). In general, functions that satisfy Mercer's condition can be kernel functions.
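The same kernels written as plain functions (the parameter names degree, sigma, beta0, and beta1 are illustrative assumptions):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, degree=2):
    return (1 + xi @ xj) ** degree

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=-1.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)
```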
33 Nonlinear SVM: Optimization. Formulation (Lagrangian dual problem): maximize Σi αi - (1/2) ΣΣ αi αj yi yj K(xi, xj) such that Σi αi yi = 0 and 0 <= αi <= C. The solution of the discriminant function is g(x) = Σ αi yi K(xi, x) + b (sum over the SVs). The optimization technique is the same.
34 Support Vector Machine: Algorithm. 1. Choose a kernel function. 2. Choose a value for C. 3. Solve the quadratic programming problem (many software packages are available). 4. Construct the discriminant function from the support vectors.
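These four steps, sketched with scikit-learn on one of its built-in datasets (the dataset, kernel choice, and parameter values are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-3: choose a kernel and C, then let the library solve the QP
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

# Step 4: the discriminant function is built from the support vectors
print("number of support vectors:", clf.n_support_.sum())
print("test accuracy:", clf.score(X_test, y_test))
```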
35 Some Issues. Choice of kernel: a Gaussian or polynomial kernel is the default; if it is ineffective, more elaborate kernels are needed; domain experts can give assistance in formulating appropriate similarity measures. Choice of kernel parameters: e.g. σ in the Gaussian kernel; σ is roughly the distance between the closest points with different classifications; in the absence of reliable criteria, applications rely on a validation set or cross-validation to set such parameters. Optimization criterion (hard margin vs. soft margin): typically a lengthy series of experiments in which various parameters are tested.
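A brief sketch of setting kernel parameters by cross-validation, as the slide suggests (scikit-learn's GridSearchCV, the dataset, and the parameter grid are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Search over C and the RBF width (gamma corresponds to 1 / (2 * sigma^2)) by cross-validation
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```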
36 Summary: Support Vector Machine. 1. Large margin classifier: better generalization ability and less over-fitting. 2. The kernel trick: map data points to a higher-dimensional space in order to make them linearly separable; since only the dot product is used, we do not need to represent the mapping explicitly.