
1 Support Vector Machine: An Introduction
(C) 2001-2005 by Yu Hen Hu

2 Linear Hyper-plane Classifier
Given: {(x_i, d_i); i = 1 to N}, d_i ∈ {+1, -1}.
A linear hyper-plane classifier is a hyper-plane H consisting of points x such that H = {x | g(x) = w^T x + b = 0}; g(x) is a discriminant function.
For x on one side of H (the class marked o): w^T x + b ≥ 0, d = +1. For x on the other side: w^T x + b < 0, d = -1.
Distance from x to H: r = w^T x/|w| - (-b/|w|) = g(x)/|w|.
[Figure: the hyper-plane H in the (x_1, x_2) plane, its normal vector w, the offset -b/|w| of H from the origin, and a point x at distance r from H.]

3 Distance from a Point to a Hyper-plane
The hyper-plane H is characterized by w^T x + b = 0 (*), where w is the normal vector perpendicular to H.
(*) says that any vector x on H, projected onto w, has length OA = -b/|w| (O is the origin, A the foot of the projection).
Consider a special point C corresponding to a vector x* off the hyper-plane. The magnitude of its projection onto w is w^T x*/|w| = OA + BC, or equivalently w^T x*/|w| = -b/|w| + r, where r = BC is the distance from x* to H.
Hence r = (w^T x* + b)/|w| = g(x*)/|w|.
If x* is on the other side of H (the same side as the origin), then r = -(w^T x* + b)/|w| = -g(x*)/|w|.
[Figure: point x* at distance r from H, with its projection onto the direction of w decomposed into the segments OA and BC.]
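
As a quick sanity check of this formula, the following MATLAB sketch (the values of w, b, and x* are made up for illustration) computes g(x*), the signed distance r = g(x*)/|w|, and the class label the hyper-plane assigns:

% Signed distance from a point to the hyper-plane H = {x : w'*x + b = 0}.
% w, b, and xstar are illustrative values, not taken from the slides.
w = [3; 4];          % normal vector of H
b = -5;              % offset
xstar = [4; 2];      % query point

g = w'*xstar + b;    % discriminant function g(x*)
r = g / norm(w);     % signed distance: positive on the side w points to
d = sign(g);         % class label assigned by the hyper-plane classifier
fprintf('g(x*) = %g, r = %g, d = %+d\n', g, r, d);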

4 Optimal Hyper-plane: Linearly Separable Case
For d_i = +1, g(x_i) = w^T x_i + b ≥ ρ|w|, equivalently w_o^T x_i + b_o ≥ 1.
For d_i = -1, g(x_i) = w^T x_i + b ≤ -ρ|w|, equivalently w_o^T x_i + b_o ≤ -1.
The optimal hyper-plane should lie in the center of the gap between the two classes.
Support vectors: the samples on the boundaries of the gap. The support vectors alone determine the optimal hyper-plane.
Question: how do we find the optimal hyper-plane?
[Figure: two linearly separable classes in the (x_1, x_2) plane, with the separating hyper-plane centered in the gap between them.]

5 Separation Gap
For x_i being a support vector:
For d_i = +1, g(x_i) = w^T x_i + b = ρ|w|, equivalently w_o^T x_i + b_o = 1.
For d_i = -1, g(x_i) = w^T x_i + b = -ρ|w|, equivalently w_o^T x_i + b_o = -1.
Hence w_o = w/(ρ|w|) and b_o = b/(ρ|w|). But the distance from x_i to the hyper-plane is ρ = g(x_i)/|w|. Thus w_o = w/g(x_i), and ρ = 1/|w_o|.
The maximum separation between the two classes is 2ρ = 2/|w_o|.
The objective is therefore to find w_o, b_o that minimize |w_o| (so that ρ is maximized), subject to the constraints w_o^T x_i + b_o ≥ 1 for d_i = +1 and w_o^T x_i + b_o ≤ -1 for d_i = -1. Combining these constraints, one has: d_i(w_o^T x_i + b_o) ≥ 1.
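
To make the normalization concrete, this MATLAB sketch checks, for the 1-D example solved later in these slides (w_o = 2, b_o = -3), that d_i(w_o^T x_i + b_o) ≥ 1 for every sample, that the support vectors meet the constraint with equality, and that the separation gap is 2/|w_o|:

% Margin check for the canonical (normalized) hyper-plane w_o, b_o.
% Data and solution are the 1-D example used later in the slides.
x  = [1 2 3];            % training inputs
d  = [-1 1 1];           % labels
wo = 2;  bo = -3;        % canonical hyper-plane (solved later)

m   = d .* (wo*x + bo);  % d_i (w_o' x_i + b_o), must be >= 1
rho = 1/abs(wo);         % half-gap rho = 1/|w_o|
fprintf('constraint values: %s\n', mat2str(m));
fprintf('support vectors: x = %s (constraint value == 1)\n', mat2str(x(abs(m-1) < 1e-9)));
fprintf('separation gap 2*rho = %g\n', 2*rho);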

6 Quadratic Optimization Problem Formulation
Given {(x_i, d_i); i = 1 to N}, find w and b such that Φ(w) = w^T w/2 is minimized subject to the N constraints
d_i(w^T x_i + b) - 1 ≥ 0, 1 ≤ i ≤ N.
Method of Lagrange multipliers: form the Lagrangian
J(w, b, α) = w^T w/2 - Σ_{i=1}^{N} α_i [d_i(w^T x_i + b) - 1], with multipliers α_i ≥ 0.
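
Because the cost is quadratic and the constraints are linear, a small instance can also be handed directly to a QP solver instead of working through the Lagrangian by hand. The MATLAB sketch below (it assumes the Optimization Toolbox function quadprog is available) solves the primal for the 1-D example of the next slide, stacking z = [w; b] and rewriting d_i(w x_i + b) ≥ 1 as -d_i(w x_i + b) ≤ -1:

% Hard-margin primal SVM as a QP in z = [w; b], for 1-D inputs.
% Requires quadprog (Optimization Toolbox) or a compatible QP solver.
x = [1; 2; 3];  d = [-1; 1; 1];

H = [1 0; 0 0];                 % (1/2)*z'*H*z = w^2/2 (b is not penalized)
f = [0; 0];
A = -[d.*x, d];                 % -d_i*(w*x_i + b) <= -1
B = -ones(3,1);

z = quadprog(H, f, A, B);
fprintf('w = %g, b = %g, boundary at x = %g\n', z(1), z(2), -z(2)/z(1));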

7 Optimization (continued)
The solution of the Lagrange multiplier problem is at a saddle point: the minimum is sought w.r.t. w and b, while the maximum is sought w.r.t. α_i.
Kuhn-Tucker condition: at the saddle point, α_i [d_i(w^T x_i + b) - 1] = 0 for 1 ≤ i ≤ N.
If x_i is NOT a support vector, the corresponding α_i = 0! Hence only the support vectors affect the result of the optimization.

8 A Numerical Example
Training samples (1-D): (x, d) = (1, -1), (2, +1), (3, +1).
Three inequalities: 1·w + b ≤ -1; 2w + b ≥ +1; 3w + b ≥ +1.
J = w^2/2 - α_1(-w - b - 1) - α_2(2w + b - 1) - α_3(3w + b - 1)
∂J/∂w = 0 gives w = -α_1 + 2α_2 + 3α_3
∂J/∂b = 0 gives 0 = α_1 - α_2 - α_3
The Kuhn-Tucker condition implies: (a) α_1(-w - b - 1) = 0; (b) α_2(2w + b - 1) = 0; (c) α_3(3w + b - 1) = 0.
Later, we will see that the solution is α_1 = α_2 = 2 and α_3 = 0. This yields w = 2, b = -3. Hence the decision boundary is 2x - 3 = 0, i.e., x = 1.5. This is shown as the dashed line in the figure.
[Figure: the three training points (1, -1), (2, +1), (3, +1) on the x axis, with the decision boundary x = 1.5 drawn as a dashed line.]
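
The claimed solution is easy to verify numerically. The MATLAB sketch below plugs in α = [2 2 0] (taken from the slide), recovers w from the stationarity condition and b from the first active constraint, and checks primal feasibility and complementary slackness:

% Verify the numerical example: alpha = [2 2 0] => w = 2, b = -3.
x = [1 2 3];  d = [-1 1 1];  alpha = [2 2 0];

w = sum(alpha .* d .* x);        % dJ/dw = 0  =>  w = sum_i alpha_i d_i x_i
% b from the first active constraint (alpha_1 > 0):  -w - b - 1 = 0
b = -w - 1;

slack = d .* (w*x + b) - 1;      % d_i(w x_i + b) - 1, should be >= 0
kkt   = alpha .* slack;          % complementary slackness, should be all 0
fprintf('w = %g, b = %g\n', w, b);
fprintf('constraint slacks: %s\n', mat2str(slack));
fprintf('alpha_i * slack_i: %s\n', mat2str(kkt));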

9 Primal/Dual Problem Formulation
Given a constrained optimization problem with a convex cost function and linear constraints, a dual problem in the Lagrange multipliers can be formulated whose solution provides the solution of the primal.
Duality Theorem (Bertsekas, 1995):
(a) If the primal problem has an optimal solution, then the dual problem has an optimal solution with the same optimal value.
(b) In order for w_o to be an optimal primal solution and α_o to be an optimal dual solution, it is necessary and sufficient that w_o is feasible for the primal problem and Φ(w_o) = J(w_o, b_o, α_o) = min_w J(w, b_o, α_o).

10 Formulating the Dual Problem
At the saddle point, we have ∂J/∂w = 0, which gives w = Σ_{i=1}^{N} α_i d_i x_i, and ∂J/∂b = 0, which gives Σ_{i=1}^{N} α_i d_i = 0. Substituting these relations into J(w, b, α), we have the Dual Problem:
Maximize Q(α) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_i^T x_j
subject to Σ_{i=1}^{N} α_i d_i = 0 and α_i ≥ 0 for i = 1, 2, …, N.
Note that Q(α) depends on the training samples only through the inner products x_i^T x_j.

11 Numerical Example (cont'd)
In expanded form, the dual objective is
Q(α) = α_1 + α_2 + α_3 - [0.5 α_1^2 + 2 α_2^2 + 4.5 α_3^2 - 2 α_1 α_2 - 3 α_1 α_3 + 6 α_2 α_3]
subject to the constraints -α_1 + α_2 + α_3 = 0, and α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0.
Use the Matlab Optimization Toolbox command x = fmincon('qalpha', X0, A, B, Aeq, Beq), where 'qalpha' is a user-written function returning -Q(α) (fmincon minimizes) and the matrices encode the constraints above.
The solution is [α_1 α_2 α_3] = [2 2 0], as expected.
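
An alternative to fmincon, sketched below under the assumption that quadprog from the Optimization Toolbox is available, is to pose the same dual as a standard QP: maximizing Q(α) is the same as minimizing (1/2)α^T H α - 1^T α with H_ij = d_i d_j x_i^T x_j, subject to Σ_i α_i d_i = 0 and α ≥ 0:

% Dual of the 1-D example as a QP: min (1/2)*a'*H*a - 1'*a
% s.t. d'*a = 0, a >= 0.   Requires quadprog.
x = [1; 2; 3];  d = [-1; 1; 1];

H   = (d*d') .* (x*x');          % H_ij = d_i d_j x_i' x_j
f   = -ones(3,1);                % so the objective equals -Q(alpha)
Aeq = d';  Beq = 0;              % sum_i alpha_i d_i = 0
lb  = zeros(3,1);                % alpha_i >= 0

alpha = quadprog(H, f, [], [], Aeq, Beq, lb);
fprintf('alpha = %s\n', mat2str(alpha', 4));   % expect approximately [2 2 0]
w = sum(alpha .* d .* x);
fprintf('w = %g\n', w);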

12 Implication of Minimizing ||w||
Let D denote the diameter of the smallest hyper-ball that encloses all the input training vectors {x_1, x_2, …, x_N}. The set of optimal hyper-planes described by the equation w_o^T x + b_o = 0 has a VC-dimension h bounded from above as
h ≤ min{⌈D^2/ρ^2⌉, m_0} + 1
where m_0 is the dimension of the input vectors and ρ = 2/||w_o|| is the margin of separation of the hyper-planes.
The VC-dimension determines the complexity of the classifier structure; usually, the smaller, the better.
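
For a rough feel of the bound, here is a small MATLAB calculation with made-up values of D, ρ, and m_0 (they are illustrative only, not taken from the slides):

% Illustrative evaluation of the bound h <= min(ceil(D^2/rho^2), m0) + 1.
D   = 4;      % diameter of the smallest enclosing hyper-ball (made-up)
rho = 0.5;    % margin of separation (made-up)
m0  = 100;    % input dimension (made-up)

h_bound = min(ceil(D^2 / rho^2), m0) + 1;
fprintf('VC-dimension bound: h <= %d\n', h_bound);   % a large margin keeps h small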

13 Non-separable Cases
Recall that in the linearly separable case, each training sample pair (x_i, d_i) represents a linear inequality constraint
d_i(w^T x_i + b) ≥ 1, i = 1, 2, …, N. (*)
If the training samples are not linearly separable, the constraint can be modified to yield a soft constraint:
d_i(w^T x_i + b) ≥ 1 - ξ_i, i = 1, 2, …, N. (**)
{ξ_i; 1 ≤ i ≤ N} are known as slack variables. Note that (*) is a normalized version of d_i g(x_i)/|w| ≥ ρ. With the slack variable ξ_i, that equation becomes d_i g(x_i)/|w| ≥ ρ(1 - ξ_i). Hence, with the slack variables, we allow some samples x_i to fall within the gap. Moreover, if ξ_i > 1, then the corresponding (x_i, d_i) is mis-classified because the sample falls on the wrong side of the hyper-plane H.

14 Non-Separable Case
Since ξ_i > 1 implies mis-classification, the cost function must include a term that counts the mis-classified samples, e.g.
Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^{N} I(ξ_i > 1),
where I(·) equals 1 when its argument is true and 0 otherwise, and the constant C plays the role of a Lagrange multiplier. But this formulation is non-convex, and a solution is difficult to find using existing nonlinear optimization algorithms. Hence, we may instead use the approximated cost function
Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^{N} ξ_i.
With this approximated cost function, the goal is to maximize ρ (minimize ||w||) while minimizing Σ ξ_i (ξ_i ≥ 0).
ξ_i = 0 (not counted): x_i outside the gap and on the correct side.
0 < ξ_i < 1: x_i inside the gap, but on the correct side.
ξ_i > 1: x_i on the wrong side of H (inside or outside the gap).
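
For a fixed hyper-plane, the slack values and the approximated cost are easy to evaluate. The MATLAB sketch below uses a made-up, non-separable 1-D data set, the hyper-plane w = 2, b = -3, and a made-up C, and places each ξ_i in one of the three categories above:

% Slack variables xi_i = max(0, 1 - d_i*(w*x_i + b)) and the approximated cost
% Phi = (1/2)*w^2 + C*sum(xi).  Data, w, b, and C are made up for illustration.
x = [0.5 1 2 1.3 2.2];  d = [-1 -1 1 -1 -1];   % 4th point lies inside the gap, 5th is mis-classified
w = 2;  b = -3;  C = 1;

xi  = max(0, 1 - d .* (w*x + b));              % slack variables
Phi = 0.5*w^2 + C*sum(xi);                     % approximated cost
for i = 1:numel(x)
    if xi(i) == 0
        s = 'outside gap, correct side';
    elseif xi(i) < 1
        s = 'inside gap, correct side';
    else
        s = 'wrong side of H';
    end
    fprintf('x = %4.1f  xi = %4.2f  (%s)\n', x(i), xi(i), s);
end
fprintf('approximated cost Phi = %g\n', Phi);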

15 Primal Problem Formulation
Primal optimization problem: given {(x_i, d_i); 1 ≤ i ≤ N}, find w, b, and {ξ_i} such that
Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^{N} ξ_i
is minimized subject to the constraints (i) ξ_i ≥ 0 and (ii) d_i(w^T x_i + b) ≥ 1 - ξ_i for i = 1, 2, …, N.
Using α_i and β_i as Lagrange multipliers, the unconstrained cost function becomes
J(w, b, ξ, α, β) = (1/2) w^T w + C Σ_i ξ_i - Σ_i α_i [d_i(w^T x_i + b) - 1 + ξ_i] - Σ_i β_i ξ_i.

16 Dual Problem Formulation
Note that at the saddle point ∂J/∂w = 0 gives w = Σ_i α_i d_i x_i, ∂J/∂b = 0 gives Σ_i α_i d_i = 0, and ∂J/∂ξ_i = 0 gives α_i + β_i = C, so the β_i (and the ξ_i) drop out of the dual.
Dual optimization problem: given {(x_i, d_i); 1 ≤ i ≤ N}, find Lagrange multipliers {α_i; 1 ≤ i ≤ N} such that
Q(α) = Σ_{i=1}^{N} α_i - (1/2) Σ_i Σ_j α_i α_j d_i d_j x_i^T x_j
is maximized subject to the constraints (i) 0 ≤ α_i ≤ C (a user-specified positive number) and (ii) Σ_{i=1}^{N} α_i d_i = 0.
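
Compared with the separable case, the only change to the QP is the upper bound C on each α_i. A MATLAB sketch, again assuming quadprog is available; the 1-D data are made up and include one sample that makes the set non-separable:

% Soft-margin dual: max Q(alpha) s.t. 0 <= alpha_i <= C, sum_i alpha_i d_i = 0.
% Posed as min (1/2)*a'*H*a - 1'*a for quadprog.  Data are made-up 1-D samples.
x = [0.5 1 2 2.5 2.2]';           % the last sample makes the set non-separable
d = [-1 -1 1 1 -1]';
C = 10;

H   = (d*d') .* (x*x');
f   = -ones(size(x));
Aeq = d';  Beq = 0;
lb  = zeros(size(x));  ub = C*ones(size(x));

alpha = quadprog(H, f, [], [], Aeq, Beq, lb, ub);
fprintf('alpha = %s\n', mat2str(alpha', 3));
fprintf('samples with alpha_i = C: %s\n', mat2str(find(abs(alpha - C) < 1e-6)'));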

17 Solution to the Dual Problem
By the Karush-Kuhn-Tucker condition, for i = 1, 2, …, N:
(i) α_i [d_i(w^T x_i + b) - 1 + ξ_i] = 0 (*)
(ii) β_i ξ_i = 0
At the optimal point, α_i + β_i = C. Thus, one may deduce that:
if 0 < α_i < C, then ξ_i = 0 and d_i(w^T x_i + b) = 1;
if α_i = C, then ξ_i ≥ 0 and d_i(w^T x_i + b) = 1 - ξ_i ≤ 1;
if α_i = 0, then d_i(w^T x_i + b) ≥ 1: x_i is not a support vector.
Finally, the optimal solutions are
w_o = Σ_{i=1}^{N} α_i d_i x_i, and b_o obtained from d_i(w_o^T x_i + b_o) = 1 for i ∈ I_o (in practice, averaged over I_o), where I_o = {i; 0 < α_i < C}.
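
Given a dual solution, w_o and b_o follow directly from the formulas above. The MATLAB sketch below re-solves the small non-separable example of the previous sketch and then recovers w_o and b_o from the support vectors with 0 < α_i < C:

% Recover w_o and b_o from the soft-margin dual solution (1-D linear kernel).
x = [0.5 1 2 2.5 2.2]';  d = [-1 -1 1 1 -1]';  C = 10;
H = (d*d') .* (x*x');
alpha = quadprog(H, -ones(size(x)), [], [], d', 0, zeros(size(x)), C*ones(size(x)));

tol = 1e-6;
wo  = sum(alpha .* d .* x);                    % w_o = sum_i alpha_i d_i x_i
Io  = find(alpha > tol & alpha < C - tol);     % support vectors with 0 < alpha_i < C
bo  = mean(d(Io) - wo*x(Io));                  % from d_i(w_o x_i + b_o) = 1 (d_i = +/-1)
fprintf('w_o = %g, b_o = %g, boundary at x = %g\n', wo, bo, -bo/wo);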

18 Inner Product Kernels
In general, the input x may first be transformed via a set of nonlinear functions {φ_j(x); j = 0, 1, …, p}, with φ_0(x) = 1, and then subjected to the hyper-plane classifier
g(x) = Σ_{j=0}^{p} w_j φ_j(x) = w^T φ(x)
(the weight w_0 plays the role of the bias b). Define the inner product kernel as
K(x, y) = φ^T(x) φ(y) = Σ_{j=0}^{p} φ_j(x) φ_j(y).
One then obtains a dual optimization problem of the same form as before, with x_i^T x_j replaced by K(x_i, x_j):
maximize Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j d_i d_j K(x_i, x_j), subject to the same constraints.
Often, dim of φ (= p + 1) >> dim of x!
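
With a kernel, both training and testing touch the data only through K(x_i, x_j). The MATLAB sketch below shows the resulting decision function; the training set, the multipliers alpha, and the bias b are made-up placeholders (not the result of an optimization), and the kernel is the polynomial kernel of the next slide:

% Decision function of a kernel SVM: y(x) = sign( sum_i alpha_i d_i K(x_i, x) + b ).
% Everything below is illustrative only.
K = @(x, y) (1 + x'*y)^2;                     % inner product kernel K(x, y)
X = [1 2; -1 0; 0 3]';                        % training inputs, one column per sample (made-up)
d = [1; -1; 1];                               % labels
alpha = [0.5; 0.7; 0.2];                      % Lagrange multipliers (made-up, not optimized)
b = -0.1;

f = @(x) sum(arrayfun(@(i) alpha(i)*d(i)*K(X(:,i), x), 1:numel(d))) + b;
xq = [0.5; 1.5];                              % a query point
fprintf('f(xq) = %g, predicted label = %+d\n', f(xq), sign(f(xq)));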

19 Polynomial Kernel
Consider the polynomial kernel K(x, y) = (1 + x^T y)^2.
Let K(x, y) = φ^T(x) φ(y); then
φ(x) = [1, x_1^2, …, x_m^2, √2 x_1, …, √2 x_m, √2 x_1 x_2, …, √2 x_1 x_m, √2 x_2 x_3, …, √2 x_2 x_m, …, √2 x_{m-1} x_m] = [1, φ_1(x), …, φ_p(x)]
where p + 1 = 1 + m + m + (m-1) + (m-2) + … + 1 = (m+2)(m+1)/2.
Hence, using a kernel, a low-dimensional pattern classification problem (with dimension m) is solved in a higher-dimensional space (dimension p+1). But only the φ_j(x) corresponding to support vectors are used for pattern classification!
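
The identity K(x, y) = φ^T(x) φ(y) is easy to confirm numerically. The MATLAB sketch below does so for m = 2, the mapping used in the XOR example that follows (the test vectors are arbitrary):

% Check that K(x,y) = (1 + x'*y)^2 equals phi(x)'*phi(y) for m = 2.
phi = @(x) [1; x(1)^2; x(2)^2; sqrt(2)*x(1); sqrt(2)*x(2); sqrt(2)*x(1)*x(2)];

x = [0.3; -1.2];  y = [2; 0.5];               % arbitrary test vectors
k_direct  = (1 + x'*y)^2;
k_feature = phi(x)' * phi(y);
fprintf('kernel: %g   feature-space inner product: %g\n', k_direct, k_feature);
fprintf('dimension of phi(x): %d  (= (m+1)(m+2)/2 for m = 2)\n', numel(phi(x)));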

20 Numerical Example: XOR Problem
Training samples: (-1, -1; -1), (-1, 1; +1), (1, -1; +1), (1, 1; -1), with x = [x_1, x_2]^T.
Using K(x, y) = (1 + x^T y)^2, one has
φ(x) = [1, x_1^2, x_2^2, √2 x_1, √2 x_2, √2 x_1 x_2]^T.
Note dim[φ(x)] = 6 > dim[x] = 2! Dim(K) = N_s = the number of support vectors.
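
The 4 x 4 kernel (Gram) matrix of the XOR training set, K_ij = K(x_i, x_j), can be computed directly without ever forming φ, as in the MATLAB sketch below:

% Gram matrix of the XOR training set under K(x,y) = (1 + x'*y)^2.
X = [-1 -1; -1 1; 1 -1; 1 1]';     % columns are the four training inputs

N = size(X, 2);
K = zeros(N);
for i = 1:N
    for j = 1:N
        K(i,j) = (1 + X(:,i)'*X(:,j))^2;
    end
end
disp(K)     % expect 9 on the diagonal and 1 everywhere else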

21 XOR Problem (Continued)
Note that K(x_i, x_j) can be calculated directly without using φ!
The corresponding Lagrange multipliers are α = (1/8)[1 1 1 1]^T. Hence the hyper-plane is
y = w^T φ(x) = -x_1 x_2.
(x_1, x_2):     (-1, -1)  (-1, +1)  (+1, -1)  (+1, +1)
y = -x_1 x_2:      -1        +1        +1        -1

22 Other Types of Kernels
Type of SVM / K(x, y) / Comments:
- Polynomial learning machine: K(x, y) = (x^T y + 1)^p; p is selected a priori.
- Radial basis function: K(x, y) = exp(-|x - y|^2/(2σ^2)); σ^2 is selected a priori.
- Two-layer perceptron: K(x, y) = tanh(β_0 x^T y + β_1); only some β_0 and β_1 values are feasible.
Which kernels are feasible? A kernel must satisfy Mercer's theorem!

23 Mercer's Theorem
Let K(x, y) be a continuous, symmetric kernel defined on a ≤ x, y ≤ b. K(x, y) admits an eigen-function expansion
K(x, y) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(y)
with λ_i > 0 for each i. This expansion converges absolutely and uniformly if and only if
∫∫ K(x, y) ψ(x) ψ(y) dx dy ≥ 0
for all ψ(x) such that ∫ ψ^2(x) dx < ∞.
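
On a finite training set, the practical counterpart of Mercer's condition is that the kernel (Gram) matrix must be symmetric and positive semidefinite. The MATLAB sketch below checks this for the polynomial kernel on made-up random data (the smallest eigenvalue may be very slightly negative due to round-off):

% Finite-sample analogue of Mercer's condition: the Gram matrix K_ij = K(x_i, x_j)
% of a valid kernel is symmetric positive semidefinite (all eigenvalues >= 0).
rng(0);                              % reproducible random data (made-up)
X = randn(2, 20);                    % 20 random 2-D points, one per column
K = (1 + X'*X).^2;                   % polynomial kernel evaluated on all pairs

e = eig((K + K')/2);                 % symmetrize to suppress round-off asymmetry
fprintf('smallest eigenvalue of K: %g\n', min(e));   % should be >= 0 (up to round-off)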

24 Testing with Kernels
For many types of kernels, φ(x) cannot be explicitly represented or even found. However,
y = w^T φ(x) = Σ_{i=1}^{N} α_i d_i φ^T(x_i) φ(x) = Σ_{i=1}^{N} α_i d_i K(x_i, x).
Hence there is no need to know φ(x) explicitly!
For example, in the XOR problem, f_i = α_i d_i, i.e., f = (1/8)[-1 +1 +1 -1]^T. Suppose that x = (-1, +1). Then
y = (1/8)[-K(x_1, x) + K(x_2, x) + K(x_3, x) - K(x_4, x)] = (1/8)(-1 + 9 + 1 - 1) = 1.
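
This evaluation is easy to reproduce. The MATLAB sketch below computes y(x) = Σ_i f_i K(x_i, x) at all four XOR corners using only kernel evaluations, recovering y = -x_1 x_2:

% Testing the XOR SVM with the kernel only: y(x) = sum_i f_i * K(x_i, x),
% where f = alpha .* d = (1/8)*[-1 1 1 -1]'.
X = [-1 -1; -1 1; 1 -1; 1 1]';          % training inputs (columns)
f = (1/8) * [-1; 1; 1; -1];             % f_i = alpha_i * d_i

Kfun = @(a, b) (1 + a'*b)^2;
for q = 1:size(X, 2)
    xq = X(:, q);
    y  = sum(arrayfun(@(i) f(i) * Kfun(X(:,i), xq), 1:numel(f)));
    fprintf('x = (%+d, %+d)  ->  y = %+g\n', xq(1), xq(2), y);
end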

25 SVM Using Nonlinear Kernels
Using a kernel, low-dimensional feature vectors are mapped into a high-dimensional (possibly infinite-dimensional) kernel feature space where the data are likely to be linearly separable.
[Figure: two equivalent block diagrams. Inputs x_1, …, x_N pass through a nonlinear transform (φ_0, …, φ_P) followed by a weighted sum with W; this is equivalent to kernel evaluations K(x, x_j) followed by a weighted sum with f.]

