1 Support Vector Machines. CMPUT 466/551. Nilanjan Ray
2 Agenda: Linear support vector classifier (separable case, non-separable case); Non-linear support vector classifier; Kernels for classification; SVM as a penalized method; Support vector regression
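3 Linear Support Vector Classifier: Separable Case. Primal problem: $\min_{\beta,\beta_0} \tfrac{1}{2}\lVert\beta\rVert^2$ subject to $y_i(x_i^T\beta+\beta_0) \ge 1$, $i=1,\dots,N$. Dual problem (simpler optimization): $\max_{\alpha} \sum_{i=1}^{N}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j x_i^T x_j$ subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{N}\alpha_i y_i = 0$. Dual problem in matrix-vector form: $\max_{\alpha} \mathbf{1}^T\alpha - \tfrac{1}{2}\alpha^T H\alpha$ subject to $\alpha \ge 0$ and $y^T\alpha = 0$, where $H_{ij} = y_i y_j x_i^T x_j$. Compare the implementation simple_svm.m.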
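As a rough illustration of the matrix-vector form of the dual, the sketch below builds the matrix with entries y_i y_j x_i^T x_j and evaluates the dual objective on a two-point toy problem. The data, the function names (`build_H`, `dual_objective`), and the grid scan are assumptions made for this example; this is not the simple_svm.m implementation, and a real solver would use a quadratic programming routine rather than a scan.

```python
# Toy data (assumed for illustration): one training point per class.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1.0, -1.0]

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def build_H(X, y):
    """H[i][j] = y_i * y_j * <x_i, x_j>, the matrix in the dual's quadratic term."""
    n = len(X)
    return [[y[i] * y[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]

def dual_objective(alpha, H):
    """1^T alpha - 0.5 * alpha^T H alpha."""
    n = len(alpha)
    quad = sum(alpha[i] * H[i][j] * alpha[j] for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

H = build_H(X, y)

# With two points, the equality constraint sum_i alpha_i y_i = 0 forces
# alpha_1 = alpha_2 = a, so a one-dimensional scan over a >= 0 is enough here.
grid = [k / 100 for k in range(0, 101)]
best_a = max(grid, key=lambda a: dual_objective([a, a], H))
best_value = dual_objective([best_a, best_a], H)
print(best_a, best_value)
```

On this toy problem the objective reduces to 2a - 4a^2, so the scan peaks at a = 0.25, which matches the primal optimum of 0.25 for the hyperplane with coefficients (0.5, 0.5).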
4 Linear SVC (AKA Optimal Hyperplane)… After solving the dual problem we obtain the $\alpha_i$'s; how do we construct the hyperplane from here? To obtain $\beta$, use the equation $\beta = \sum_{i=1}^{N}\alpha_i y_i x_i$. How do we obtain $\beta_0$? We need the complementary slackness criteria, which follow from the Karush-Kuhn-Tucker (KKT) conditions for the primal optimization problem. Complementary slackness means: $\alpha_i\,[\,y_i(x_i^T\beta+\beta_0) - 1\,] = 0$ for all $i$. Training points corresponding to strictly positive $\alpha_i$'s are support vectors. $\beta_0$ is computed from $y_i(x_i^T\beta+\beta_0) = 1$ for those $i$ with $\alpha_i > 0$.
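The recovery step can be sketched in a few lines. The notation is assumed here: `alpha` holds the dual variables, `beta` the hyperplane normal, and `beta0` the offset. The toy data and the dual solution `alpha` are supplied by hand for illustration, and `recover_hyperplane` is a hypothetical helper name. Support vectors are taken to be the points with strictly positive dual variable.

```python
# Toy data (assumed): the third point lies well outside the margin.
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
y = [1.0, -1.0, 1.0]
alpha = [0.25, 0.25, 0.0]  # dual solution for this toy set, supplied by hand

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def recover_hyperplane(X, y, alpha, tol=1e-9):
    """Return (beta, beta0, support) from the dual variables."""
    dim = len(X[0])
    # beta = sum_i alpha_i * y_i * x_i
    beta = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(dim)]
    # Support vectors: strictly positive alpha_i.
    support = [i for i, a in enumerate(alpha) if a > tol]
    # Complementary slackness gives y_i (x_i . beta + beta0) = 1 on support
    # vectors; since y_i is +1 or -1, beta0 = y_i - x_i . beta.
    beta0 = sum(y[i] - dot(X[i], beta) for i in support) / len(support)
    return beta, beta0, support

beta, beta0, support = recover_hyperplane(X, y, alpha)
print(beta, beta0, support)
```

The third point has a zero dual variable, so it plays no role in either `beta` or `beta0`; only the two points sitting on the margin determine the hyperplane.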
5 Optimal Hyperplane/Support Vector Classifier. An interesting interpretation of the equality constraint $\sum_{i}\alpha_i y_i = 0$ in the dual problem is as follows: the $\alpha_i$ are forces on both sides of the hyperplane, and the net force on the hyperplane is zero.
6 Linear Support Vector Classifier: Non-separable Case
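7 From Separable to Non-separable. In the separable case the margin width is $1/\lVert\beta\rVert$, and if in addition $\lVert\beta\rVert = 1$, then the margin width is 1. This is the reason that in the primal problem we have the following inequality constraints: $y_i(x_i^T\beta+\beta_0) \ge 1$, $i=1,\dots,N$ (1). These inequality constraints ensure that there is no point in the margin area. In the non-separable case, such constraints must be violated by some points, so (1) is modified to: $y_i(x_i^T\beta+\beta_0) \ge 1-\xi_i$, $\xi_i \ge 0$. So, the primal optimization problem becomes: $\min_{\beta,\beta_0,\xi} \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_{i=1}^{N}\xi_i$ subject to $y_i(x_i^T\beta+\beta_0) \ge 1-\xi_i$ and $\xi_i \ge 0$ for all $i$. The positive parameter $C$ controls the extent to which points are allowed to violate (1).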
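To make the role of the slack variables concrete, here is a minimal sketch that computes the smallest feasible slack for each point and the resulting soft-margin primal objective. The symbols are assumptions for this example: `xi` for the slack variables and `C` for the positive penalty parameter. The one-dimensional toy data, with two deliberately mislabeled points near the origin, is also made up for illustration.

```python
# One-dimensional toy data (assumed): the last two points sit on the
# wrong side of any sensible boundary, so the classes are not separable.
X = [2.0, -2.0, -0.5, 0.5]
y = [1.0, -1.0, 1.0, -1.0]

def slacks(X, y, beta, beta0):
    """Smallest xi_i >= 0 satisfying y_i (x_i * beta + beta0) >= 1 - xi_i."""
    return [max(0.0, 1.0 - yi * (xi * beta + beta0)) for xi, yi in zip(X, y)]

def primal_objective(X, y, beta, beta0, C):
    """0.5 * beta^2 + C * sum of slacks (scalar beta because the data is 1-D)."""
    return 0.5 * beta * beta + C * sum(slacks(X, y, beta, beta0))

C = 1.0
xi_at_half = slacks(X, y, 0.5, 0.0)
objective_at_half = primal_objective(X, y, 0.5, 0.0, C)
print(xi_at_half, objective_at_half)

# Nearby slopes give a larger objective, consistent with beta = 0.5 being
# the minimiser for this data when C = 1.
print(primal_objective(X, y, 0.4, 0.0, C), primal_objective(X, y, 0.6, 0.0, C))
```

The two correctly classified points get zero slack, while each mislabeled point pays a slack of 1.25; raising `C` makes those violations more expensive relative to the margin term.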
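8 Non-separable Case: Finding Dual Function. Lagrangian function minimization: $L = \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_i \xi_i - \sum_i \alpha_i\,[\,y_i(x_i^T\beta+\beta_0) - (1-\xi_i)\,] - \sum_i \mu_i\xi_i$, with multipliers $\alpha_i \ge 0$ and $\mu_i \ge 0$. Solve: $\partial L/\partial\beta = 0 \Rightarrow \beta = \sum_i \alpha_i y_i x_i$ (1); $\partial L/\partial\beta_0 = 0 \Rightarrow \sum_i \alpha_i y_i = 0$ (2); $\partial L/\partial\xi_i = 0 \Rightarrow \alpha_i = C - \mu_i$ (3). Substitute (1), (2) and (3) in $L$ to form the dual function: $L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j x_i^T x_j$, to be maximized subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.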
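The substitution step can be written out explicitly. The notation below is assumed (the standard soft-margin setup): $\alpha_i \ge 0$ and $\mu_i \ge 0$ are the Lagrange multipliers, $\xi_i$ the slack variables, $C$ the penalty parameter, and the three stationarity conditions are $\beta = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$, and $\alpha_i = C - \mu_i$.

```latex
\begin{aligned}
L &= \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_i \xi_i
     - \sum_i \alpha_i\bigl[y_i(x_i^T\beta+\beta_0) - (1-\xi_i)\bigr]
     - \sum_i \mu_i\xi_i \\
  &= \tfrac{1}{2}\lVert\beta\rVert^2
     - \beta^T\sum_i \alpha_i y_i x_i
     - \beta_0\sum_i \alpha_i y_i
     + \sum_i \alpha_i
     + \sum_i (C - \alpha_i - \mu_i)\,\xi_i \\
  &= \tfrac{1}{2}\lVert\beta\rVert^2 - \lVert\beta\rVert^2 + \sum_i \alpha_i \\
  &= \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j x_i^T x_j .
\end{aligned}
```

The third line uses all three conditions at once: the $\beta_0$ term and the $\xi_i$ terms vanish, and $\beta^T\sum_i \alpha_i y_i x_i = \lVert\beta\rVert^2$. The box constraint $0 \le \alpha_i \le C$ follows from $\alpha_i = C - \mu_i$ together with $\alpha_i \ge 0$ and $\mu_i \ge 0$.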
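9 Dual optimization: dual variables to primal variables. After solving the dual problem we obtain the $\alpha_i$'s; how do we construct the hyperplane from here? To obtain $\beta$, use the equation $\beta = \sum_i \alpha_i y_i x_i$. How do we obtain $\beta_0$? Use the complementary slackness conditions for the primal optimization problem: $\alpha_i\,[\,y_i(x_i^T\beta+\beta_0) - (1-\xi_i)\,] = 0$ and $\mu_i\xi_i = 0$. Training points corresponding to strictly positive $\alpha_i$'s are support vectors. $\beta_0$ is computed from $y_i(x_i^T\beta+\beta_0) = 1$ for those $i$ for which $0 < \alpha_i < C$ (the average is taken over such points). $C$ is chosen by cross-validation and should typically be greater than $1/N$.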
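A minimal sketch of this recovery for the non-separable case follows. The notation is assumed: `alpha` for the dual variables, `C` for the penalty parameter, `beta` and `beta0` for the hyperplane. The one-dimensional toy data and its dual solution are supplied by hand for illustration, and `soft_margin_hyperplane` is a hypothetical helper name. The offset is averaged only over support vectors whose dual variable lies strictly between 0 and `C`, since those are the points with zero slack that sit exactly on the margin.

```python
# One-dimensional toy data (assumed) with two points that violate the margin.
X = [2.0, -2.0, -0.5, 0.5]
y = [1.0, -1.0, 1.0, -1.0]
C = 1.0
alpha = [0.375, 0.375, 1.0, 1.0]  # dual solution for this toy set, supplied by hand

def soft_margin_hyperplane(X, y, alpha, C, tol=1e-9):
    """Return (beta, beta0, support, on_margin) from the dual variables."""
    # beta = sum_i alpha_i * y_i * x_i (scalar because the data is 1-D).
    beta = sum(a * yi * xi for a, yi, xi in zip(alpha, y, X))
    # Support vectors: strictly positive alpha_i.
    support = [i for i, a in enumerate(alpha) if a > tol]
    # Points with 0 < alpha_i < C have zero slack and lie exactly on the margin.
    on_margin = [i for i in support if alpha[i] < C - tol]
    # On those points y_i (x_i * beta + beta0) = 1, so beta0 = y_i - x_i * beta.
    beta0 = sum(y[i] - X[i] * beta for i in on_margin) / len(on_margin)
    return beta, beta0, support, on_margin

beta, beta0, support, on_margin = soft_margin_hyperplane(X, y, alpha, C)
print(beta, beta0, support, on_margin)
```

Here all four points are support vectors, but only the first two are used for the offset; the last two have dual variables equal to `C` and positive slack, so the margin equality does not hold for them.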
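11 Non-linear support vector classifier. Let's take a look at the dual cost function for the optimal separating hyperplane: $L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j x_i^T x_j$. Let's take a look at the solution of the optimal separating hyperplane in terms of dual variables: $f(x) = x^T\beta + \beta_0 = \sum_i \alpha_i y_i\, x_i^T x + \beta_0$. An invaluable observation: all these equations involve "feature points" only through "inner products".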
12 Non-linear support vector classifier… An invaluable observation: all these equations involve "feature points" only through "inner products". This property is particularly convenient when the feature space has a large dimension. For example, suppose we want a classifier that is additive in the feature components rather than linear: $f(x) = \sum_{m=1}^{M}\beta_m h_m(x) + \beta_0$. Such a classifier is expected to perform better on problems with a non-linear classification boundary. The $h_m$ are non-linear functions of the input features. Ex. input space: $x=(x_1, x_2)$, and the $h$'s are second-order polynomials: $h_1(x)=1$, $h_2(x)=\sqrt{2}\,x_1$, $h_3(x)=\sqrt{2}\,x_2$, $h_4(x)=x_1^2$, $h_5(x)=x_2^2$, $h_6(x)=\sqrt{2}\,x_1x_2$. So the classifier is now non-linear (quadratic) in $x$. Because of the inner product property, this non-linear classifier can still be computed by the methods for finding the linear optimal hyperplane.
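13 Non-linear support vector classifier… Denote: $h(x) = (h_1(x),\dots,h_M(x))^T$. The non-linear classifier: $f(x) = h(x)^T\beta + \beta_0$. The dual cost function: $L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \langle h(x_i), h(x_j)\rangle$. The non-linear classifier in dual variables: $f(x) = \sum_i \alpha_i y_i \langle h(x), h(x_i)\rangle + \beta_0$. Thus, in the dual variable space the non-linear classifier is expressed just with inner products!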
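The equivalence between the primal and dual forms of the classifier can be checked numerically. In the sketch below, the six-component second-order feature map `h`, the toy points, and the dual variables are all assumptions chosen for illustration; the dual variables are arbitrary values that satisfy the equality constraint, not the solution of an optimization. The identity holds for any such values, which is the point of the check.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def h(x):
    """Assumed second-order polynomial feature map for x = (x1, x2)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

def f_primal(x, X, y, alpha, beta0):
    """f(x) = h(x)^T beta + beta0, with beta = sum_i alpha_i y_i h(x_i)."""
    feats = [h(xi) for xi in X]
    beta = [sum(a * yi * fi[k] for a, yi, fi in zip(alpha, y, feats))
            for k in range(len(feats[0]))]
    return dot(h(x), beta) + beta0

def f_dual(x, X, y, alpha, beta0):
    """f(x) = sum_i alpha_i y_i <h(x), h(x_i)> + beta0 (inner products only)."""
    hx = h(x)
    return sum(a * yi * dot(hx, h(xi)) for a, yi, xi in zip(alpha, y, X)) + beta0

# Illustrative values (assumed): sum_i alpha_i y_i = 0.5 - 0.7 + 0.2 = 0.
X = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.5)]
y = [1.0, -1.0, 1.0]
alpha = [0.5, 0.7, 0.2]
beta0 = 0.1

query = (0.3, -0.8)
print(f_primal(query, X, y, alpha, beta0), f_dual(query, X, y, alpha, beta0))
```

`f_dual` never forms the coefficient vector in feature space; it touches the feature map only through inner products, which is what allows those inner products to be replaced by a kernel.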
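14 Non-linear support vector classifier… With the previous non-linear feature vector, the inner product takes a particularly interesting form: $\langle h(x), h(x')\rangle = 1 + 2x_1x_1' + 2x_2x_2' + x_1^2x_1'^2 + x_2^2x_2'^2 + 2x_1x_2x_1'x_2' = (1 + \langle x, x'\rangle)^2 = K(x, x')$, the kernel function. Computational savings: instead of 6 products, we compute 3 products.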
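A quick numerical check of this identity, assuming the six-component second-order feature map with square-root-of-two scaling (the exact scaling is an assumption, chosen so that the inner product collapses to a squared form):

```python
import math

def h(x):
    """Assumed second-order polynomial feature map for x = (x1, x2)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

def explicit_inner(x, z):
    """<h(x), h(z)> computed in the six-dimensional feature space."""
    return sum(a * b for a, b in zip(h(x), h(z)))

def kernel(x, z):
    """(1 + <x, z>)^2 computed directly in the two-dimensional input space."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
print(explicit_inner(x, z), kernel(x, z))
```

For these two points the input-space inner product is 1, so both routes give 4 up to floating-point rounding; the kernel route needs two multiplications for the inner product and one for the square.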
15 Kernel Functions. So, if the inner product can be expressed in terms of a symmetric function $K$: $K(x, x') = \langle h(x), h(x')\rangle$, then we can apply the SV tool. Well, not quite! We need another property of $K$ called positive (semi-)definiteness. Why? The dual function has the answer to this question: the maximization of the dual is a convex problem when the matrix $K$, with entries $K_{ij} = K(x_i, x_j)$, is positive semi-definite. Thus the kernel function $K$ must satisfy two properties: symmetry and positive (semi-)definiteness.
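The link between positive semi-definiteness and convexity can be spelled out in one line. The notation is assumed: $\alpha$ is the vector of dual variables, $y$ the vector of labels in $\{-1,+1\}$, $K$ the matrix with entries $K_{ij} = K(x_i, x_j)$, and $\alpha \circ y$ the elementwise product.

```latex
\begin{aligned}
L_D(\alpha) &= \mathbf{1}^T\alpha - \tfrac{1}{2}\,\alpha^T H \alpha,
  \qquad H = \operatorname{diag}(y)\,K\,\operatorname{diag}(y), \\
\alpha^T H \alpha &= (\alpha \circ y)^T K\,(\alpha \circ y) \;\ge\; 0
  \quad \text{for all } \alpha \text{ whenever } K \succeq 0 .
\end{aligned}
```

So $H$ is positive semi-definite whenever $K$ is, which makes $L_D$ a concave quadratic; maximizing a concave function over the linear constraints on $\alpha$ is a convex problem, and any local maximum is global. If $K$ had a negative eigenvalue, the quadratic term could have a direction of positive curvature and this guarantee would be lost.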
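16 Kernel Functions… Thus we need $h(x)$'s that define a kernel function. In practice we don't even need to define $h(x)$! All we need is the kernel function! Example kernel functions: $d$th-degree polynomial: $K(x, x') = (1 + \langle x, x'\rangle)^d$; radial kernel: $K(x, x') = \exp(-\gamma\lVert x - x'\rVert^2)$; neural network: $K(x, x') = \tanh(\kappa_1\langle x, x'\rangle + \kappa_2)$. The real question now is how to design a kernel function.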
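The three example kernels can be sketched as plain functions. The parameter names (`d`, `gamma`, `kappa1`, `kappa2`), their default values, and the exact parameterization of each kernel are assumptions for this example, since several equivalent conventions are in use; `gram_matrix` is a hypothetical helper that builds the matrix of pairwise kernel values used in the dual.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, z, d=2):
    """d-th degree polynomial kernel: (1 + <x, z>)^d."""
    return (1.0 + dot(x, z)) ** d

def radial_kernel(x, z, gamma=1.0):
    """Radial kernel: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def neural_network_kernel(x, z, kappa1=1.0, kappa2=0.0):
    """Neural-network (sigmoid) kernel: tanh(kappa1 * <x, z> + kappa2)."""
    return math.tanh(kappa1 * dot(x, z) + kappa2)

def gram_matrix(X, kernel):
    """Matrix of pairwise kernel values K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in X] for xi in X]

X = [(1.0, 2.0), (3.0, -1.0), (0.0, 0.5)]
K_poly = gram_matrix(X, polynomial_kernel)
K_rbf = gram_matrix(X, radial_kernel)
print(K_poly[0][1], K_rbf[0][0])
```

Each function is symmetric in its two arguments, so the resulting Gram matrices are symmetric; symmetry alone does not establish positive semi-definiteness, which has to be verified separately for a given kernel and parameter choice.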