
1 2806 Neural Computation: Support Vector Machines, Lecture 6, 2005, Ari Visa

2 Agenda
- Some historical notes
- Some theory
- Support Vector Machines
- Conclusions

3 Some Historical Notes
- Linear discriminant functions (Fisher 1936) -> one should know the underlying distributions
- Smith 1969: a multicategory classifier using two-category procedures
- Linear machines have been applied to larger and larger data sets: linear programming (Block & Levin 1970) and stochastic approximation methods (Yau & Schumpert 1968)
- Neural network direction: Minsky & Papert, Perceptrons, 1969

4 Some Historical Notes
- Boser, Guyon, Vapnik 1992 and Schölkopf, Burges, Vapnik 1995 gave the key ideas.
- The Kuhn-Tucker conditions (1951)

5 Some Theory
- A multicategory classifier using two-category procedures:
- a) reduce the problem to c two-class problems (each class against the rest)
- b) use c(c-1)/2 linear discriminants, one for every pair of classes
- Both a) and b) can lead to unclassified regions (a pairwise-voting sketch follows below).
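
To make option b) concrete, here is a minimal sketch (not from the lecture) of pairwise classification with majority voting. `train_binary` is a hypothetical callback that returns any two-class discriminant g(x), e.g. a linear one; a tie in the vote corresponds to the unclassified regions mentioned above.

```python
import numpy as np
from itertools import combinations

def train_pairwise(X, labels, train_binary):
    """Train c(c-1)/2 two-class discriminants, one for every pair of classes."""
    models = {}
    for a, b in combinations(np.unique(labels), 2):
        mask = (labels == a) | (labels == b)
        d = np.where(labels[mask] == a, 1, -1)   # +1 for class a, -1 for class b
        models[(a, b)] = train_binary(X[mask], d)
    return models

def classify_pairwise(x, models):
    """Majority vote over the pairwise discriminants; a tie means 'unclassified'."""
    votes = {}
    for (a, b), g in models.items():
        winner = a if g(x) > 0 else b
        votes[winner] = votes.get(winner, 0) + 1
    ranked = sorted(votes.items(), key=lambda kv: -kv[1])
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None                              # unclassified region
    return ranked[0][0]
```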

6 Some Theory
- Consider the training sample {(x_i, d_i)}, i = 1, ..., N, where x_i is the input pattern for the ith example and d_i is the corresponding desired response. The patterns represented by the subset d_i = +1 and the patterns represented by the subset d_i = -1 are assumed to be linearly separable.
- c) Define a linear machine g_i(x) = w_i^T x + w_i0 and assign x to class ω_i if g_i(x) > g_j(x) for all j ≠ i.
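
A minimal numerical illustration of the linear machine in c); the weight vectors and biases below are made up for the example, not taken from the lecture.

```python
import numpy as np

# Hypothetical parameters of a 3-class linear machine g_i(x) = w_i^T x + w_i0
W = np.array([[ 1.0,  2.0],
              [-1.0,  0.5],
              [ 0.0, -1.5]])          # one weight vector w_i per row
w0 = np.array([0.1, -0.2, 0.3])       # biases w_i0

def linear_machine(x):
    g = W @ x + w0                    # g_i(x) for every class i
    return int(np.argmax(g))          # assign x to the class omega_i with the largest g_i(x)

print(linear_machine(np.array([0.5, 1.0])))
```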

7 Some Theory
- A discriminant function g(x) = w^T x + b, where w is the weight vector and b the bias (or threshold).
- We may write: w^T x_i + b ≥ 0 for d_i = +1 and w^T x_i + b < 0 for d_i = -1.
- The margin of separation ρ = the separation between the hyperplane and the closest data point.

8 Some Theory
- The goal in training a support vector machine is to find the separating hyperplane with the largest margin.
- g(x) = w_o^T x + b_o gives an algebraic measure of the distance from x to the hyperplane (w_o and b_o denote the optimum values).
- Writing x = x_p + r w_o / ||w_o||, where x_p is the normal projection of x onto the hyperplane, gives r = g(x) / ||w_o||.
- Support vectors = the data points that lie closest to the decision surface.

9 Some Theory
- The algebraic distance from a support vector x^(s) to the optimal hyperplane is r = g(x^(s)) / ||w_o||.
- This equals 1/||w_o|| if d^(s) = +1 and -1/||w_o|| if d^(s) = -1.
- The margin of separation ρ = 2r = 2/||w_o||.
- The optimal hyperplane is unique: it gives the maximum possible separation between positive and negative examples.
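
A small sketch of the geometry on slides 7-9; the values of w_o and b_o are made up for the illustration.

```python
import numpy as np

w_o = np.array([2.0, -1.0])           # assumed optimal weight vector (illustrative only)
b_o = 0.5                             # assumed optimal bias

def g(x):
    return w_o @ x + b_o              # algebraic measure of distance (up to 1/||w_o||)

x = np.array([1.0, 3.0])
r = g(x) / np.linalg.norm(w_o)        # signed distance from x to the hyperplane
rho = 2.0 / np.linalg.norm(w_o)       # margin of separation rho = 2/||w_o||
print(r, rho)
```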

10 Some Theory
- Finding the optimal hyperplane.
- Problem: given the training sample {(x_i, d_i)}, i = 1, ..., N, find the optimum values of the weight vector w and bias b such that they satisfy the constraints d_i(w^T x_i + b) ≥ 1 for i = 1, 2, ..., N and the weight vector w minimizes the cost function Φ(w) = ½ w^T w.
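
A sketch of this primal problem on a tiny separable toy set (the data are made up), solved with a general-purpose constrained optimizer (scipy's SLSQP). This is only for illustration; the lecture itself solves the problem via the dual.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [0.0, 0.0], [0.5, -0.5], [-1.0, 0.5]])
d = np.array([1, 1, 1, -1, -1, -1])

def cost(z):                           # Phi(w) = 1/2 w^T w, with z = (w1, w2, b)
    w = z[:2]
    return 0.5 * (w @ w)

constraints = [{'type': 'ineq',        # d_i (w^T x_i + b) - 1 >= 0
                'fun': (lambda z, i=i: d[i] * (z[:2] @ X[i] + z[2]) - 1.0)}
               for i in range(len(d))]

res = minimize(cost, x0=np.zeros(3), method='SLSQP', constraints=constraints)
w, b = res.x[:2], res.x[2]
print('w =', w, 'b =', b, 'margin =', 2.0 / np.linalg.norm(w))
```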

11 Some Theory
- The cost function Φ(w) is a convex function of w.
- The constraints are linear in w.
- The constrained optimization problem may therefore be solved by the method of Lagrange multipliers.
- J(w, b, α) = ½ w^T w - Σ_{i=1}^N α_i [d_i(w^T x_i + b) - 1]
- The solution to the constrained optimization problem is determined by the saddle point of J(w, b, α): J has to be minimized with respect to w and b, and maximized with respect to α.

12 Some Theory
- Kuhn-Tucker conditions and the solution of the dual problem.
- Duality theorem: a) if the primal problem has an optimal solution, the dual problem also has an optimal solution and the corresponding optimal values are equal; b) in order for w_o to be an optimal primal solution and α_o an optimal dual solution, it is necessary and sufficient that w_o is feasible for the primal problem and Φ(w_o) = J(w_o, b_o, α_o) = min_w J(w, b_o, α_o).
- Expanding the Lagrangian term by term: J(w, b, α) = ½ w^T w - Σ_{i=1}^N α_i d_i w^T x_i - b Σ_{i=1}^N α_i d_i + Σ_{i=1}^N α_i (the saddle-point conditions that follow from it are written out below).
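
Written out, the two saddle-point (stationarity) conditions that lead from this Lagrangian to the dual on the next slide are:

```latex
\frac{\partial J(w, b, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i d_i x_i,
\qquad
\frac{\partial J(w, b, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i d_i = 0 .
```

Substituting these back into J(w, b, α) eliminates w and b and yields the objective function Q(α) of the dual problem.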

13 Some Theory
- The dual problem: given the training sample {(x_i, d_i)}, i = 1, ..., N, find the Lagrange multipliers {α_i}, i = 1, ..., N, that maximize the objective function Q(α) = Σ_{i=1}^N α_i - ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j subject to the constraints 1) Σ_{i=1}^N α_i d_i = 0 and 2) α_i ≥ 0 for i = 1, 2, ..., N.
- w_o = Σ_{i=1}^N α_{o,i} d_i x_i
- b_o = 1 - w_o^T x^(s) for a support vector x^(s) with d^(s) = +1
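
A sketch of solving this dual numerically with scipy (same toy data as in the primal sketch above), then recovering w_o and b_o exactly as on the slide; the support vectors are the points with α_i > 0.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [0.0, 0.0], [0.5, -0.5], [-1.0, 0.5]])
d = np.array([1, 1, 1, -1, -1, -1], dtype=float)
N = len(d)
H = (d[:, None] * d[None, :]) * (X @ X.T)       # H_ij = d_i d_j x_i^T x_j

def neg_Q(a):                                    # maximize Q(alpha) = minimize -Q(alpha)
    return -(a.sum() - 0.5 * a @ H @ a)

res = minimize(neg_Q, x0=np.zeros(N), method='SLSQP',
               bounds=[(0.0, None)] * N,                              # alpha_i >= 0
               constraints=[{'type': 'eq', 'fun': lambda a: a @ d}])  # sum_i alpha_i d_i = 0
alpha = res.x
w_o = (alpha * d) @ X                            # w_o = sum_i alpha_i d_i x_i
s = np.argmax(alpha * (d > 0))                   # a support vector with d^(s) = +1
b_o = 1.0 - w_o @ X[s]
print('support vectors:', np.where(alpha > 1e-6)[0], 'w_o =', w_o, 'b_o =', b_o)
```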

14 Some Theory
- Optimal hyperplane for nonseparable patterns.
- The margin of separation between classes is said to be soft if a data point (x_i, d_i) violates the condition d_i(w^T x_i + b) ≥ 1, i = 1, ..., N.
- The slack variables {ξ_i}, i = 1, ..., N, measure the deviation of a data point from the ideal condition of pattern separability: d_i(w^T x_i + b) ≥ 1 - ξ_i for i = 1, 2, ..., N.

15 Some Theory
- Our goal is to find a separating hyperplane for which the misclassification error, averaged over the training set, is minimized.
- We may minimize the functional Φ(ξ) = Σ_{i=1}^N I(ξ_i - 1) with respect to the weight vector w, subject to the constraint d_i(w^T x_i + b) ≥ 1 - ξ_i and a constraint on ||w||².
- Minimization of Φ(ξ) with respect to w is a nonconvex optimization problem (NP-complete).

16 Some Theory
- We approximate the functional Φ(ξ) by writing Φ(w, ξ) = ½ w^T w + C Σ_{i=1}^N ξ_i.
- The first term is related to minimizing the VC dimension and the second term is an upper bound on the number of test errors.
- C is determined either experimentally or analytically by estimating the VC dimension.

17 Some Theory
- Problem: given the training sample {(x_i, d_i)}, i = 1, ..., N, find the optimum values of the weight vector w and bias b such that they satisfy the constraints d_i(w^T x_i + b) ≥ 1 - ξ_i for i = 1, 2, ..., N and ξ_i ≥ 0 for all i, and such that the weight vector w and the slack variables ξ_i minimize the cost function Φ(w, ξ) = ½ w^T w + C Σ_{i=1}^N ξ_i, where C is a user-specified positive parameter.
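
At the optimum each slack variable equals ξ_i = max(0, 1 - d_i(w^T x_i + b)), so the soft-margin cost can be minimized directly as a hinge-loss objective. A minimal subgradient-descent sketch (the learning rate and iteration count are arbitrary choices, not from the lecture):

```python
import numpy as np

def soft_margin_sgd(X, d, C=1.0, lr=0.001, epochs=2000):
    """Minimize 1/2 ||w||^2 + C * sum_i max(0, 1 - d_i (w^T x_i + b)) by subgradient descent."""
    N, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        margins = d * (X @ w + b)
        viol = margins < 1.0                              # points with nonzero slack xi_i
        grad_w = w - C * (d[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * d[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```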

18 Some Theory
- The dual problem for nonseparable patterns: given the training sample {(x_i, d_i)}, i = 1, ..., N, find the Lagrange multipliers {α_i}, i = 1, ..., N, that maximize the objective function Q(α) = Σ_{i=1}^N α_i - ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_i^T x_j subject to the constraints 1) Σ_{i=1}^N α_i d_i = 0 and 2) 0 ≤ α_i ≤ C for i = 1, 2, ..., N, where C is a user-specified positive parameter.
- The optimum solution: w_o = Σ_{i=1}^{N_s} α_{o,i} d_i x_i, where N_s is the number of support vectors.
- The Kuhn-Tucker condition α_i [d_i(w^T x_i + b) - 1 + ξ_i] = 0, i = 1, 2, ..., N, determines b_o: take the mean value of b_o over all data points (x_i, d_i) in the training set for which 0 < α_{o,i} < C.
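
In practice this box-constrained dual is handled by a dedicated QP/SMO solver. A sketch using scikit-learn's SVC (an assumption about tooling, not part of the lecture), whose fitted attributes correspond directly to the quantities on the slide:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [0.0, 0.0], [0.5, -0.5], [-1.0, 0.5]])
d = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1.0).fit(X, d)
# dual_coef_ stores alpha_{o,i} * d_i for the support vectors only
print('support vectors:', clf.support_vectors_)
print('alpha_i * d_i  :', clf.dual_coef_)
print('w_o =', clf.coef_[0], ' b_o =', clf.intercept_[0])
```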

19 Support Vector Machines
- The goal of a support vector machine is to find the particular hyperplane for which the margin of separation ρ is maximized.
- The support vectors are a small subset of the training data extracted by the algorithm. Depending on how the inner-product kernel is generated, we may construct different learning machines characterized by nonlinear decision surfaces of their own:
- Polynomial learning machines
- Radial-basis function networks
- Two-layer perceptrons

20 Support Vector Machines
- The idea:
- 1. Nonlinear mapping of an input vector into a high-dimensional feature space that is hidden from both the input and the output.
- 2. Construction of an optimal hyperplane for separating the features discovered in step 1.

21 Support Vector Machines
- Let x denote a vector drawn from the input space (dimension m_0). Let {φ_j(x)}, j = 1, ..., m_1, denote a set of nonlinear transformations from the input space to the feature space (dimension m_1); φ_j(x) is defined a priori for all j. We may define a hyperplane Σ_{j=1}^{m_1} w_j φ_j(x) + b = 0, or equivalently Σ_{j=0}^{m_1} w_j φ_j(x) = 0, where it is assumed that φ_0(x) = 1 for all x so that w_0 denotes b.
- The decision surface: w^T φ(x) = 0.
- w = Σ_{i=1}^N α_i d_i φ(x_i), so the decision surface becomes Σ_{i=1}^N α_i d_i φ^T(x_i) φ(x) = 0.
- K(x, x_i) = φ^T(x) φ(x_i) = Σ_{j=0}^{m_1} φ_j(x) φ_j(x_i) for i = 1, 2, ..., N.
- The optimal hyperplane: Σ_{i=1}^N α_i d_i K(x, x_i) = 0.
- Mercer's theorem tells us whether or not a candidate kernel is actually an inner-product kernel in some space.
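
A sketch of two standard inner-product kernels, plus a finite-sample check related to Mercer's condition: on any finite data set, a valid kernel must produce a symmetric positive semidefinite Gram matrix (all eigenvalues nonnegative, up to rounding). The data points are made up for the example.

```python
import numpy as np

def poly_kernel(x, y, degree=2):
    # polynomial learning machine: K(x, x_i) = (1 + x^T x_i)^degree
    return (1.0 + x @ y) ** degree

def rbf_kernel(x, y, gamma=0.5):
    # radial-basis function kernel: K(x, x_i) = exp(-gamma ||x - x_i||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def gram_matrix(X, kernel):
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

X = np.array([[2.0, 2.0], [0.0, 0.0], [-1.0, 0.5]])
K = gram_matrix(X, rbf_kernel)
print('smallest eigenvalue of the Gram matrix:', np.linalg.eigvalsh(K).min())
```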

22 Support Vector Machines
- The expansion of the inner-product kernel K(x, x_i) permits us to construct a decision surface that is nonlinear in the input space but whose image in the feature space is linear.
- Given the training sample {(x_i, d_i)}, i = 1, ..., N, find the Lagrange multipliers {α_i}, i = 1, ..., N, that maximize the objective function Q(α) = Σ_{i=1}^N α_i - ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j K(x_i, x_j) subject to the constraints 1) Σ_{i=1}^N α_i d_i = 0 and 2) 0 ≤ α_i ≤ C for i = 1, 2, ..., N, where C is a user-specified positive parameter.
- K = {K(x_i, x_j)}, i, j = 1, ..., N, is the kernel (Gram) matrix; w_o = Σ_{i=1}^{N_s} α_{o,i} d_i φ(x_i), where φ(x_i) is the image induced in the feature space by x_i. The first component of w_o represents the optimum bias b_0.
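
A sketch of solving this kernelized dual by handing the Gram matrix K directly to a solver (here scikit-learn's SVC with kernel='precomputed'; the data and kernel choice are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

def poly_kernel(x, y):
    return (1.0 + x @ y) ** 2                   # K(x, x_i) = (1 + x^T x_i)^2

X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [0.0, 0.0], [0.5, -0.5], [-1.0, 0.5]])
d = np.array([1, 1, 1, -1, -1, -1])
K = np.array([[poly_kernel(xi, xj) for xj in X] for xi in X])   # Gram matrix {K(x_i, x_j)}

clf = SVC(kernel='precomputed', C=10.0).fit(K, d)

# Predicting for new points needs their kernel values against the training points.
X_new = np.array([[2.2, 2.1], [0.1, -0.3]])
K_new = np.array([[poly_kernel(x, xi) for xi in X] for x in X_new])
print(clf.predict(K_new))
```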

23 Support Vector Machines
- The requirement on the kernel K(x, x_i) is that it satisfy Mercer's theorem.
- The inner-product kernels of the polynomial and radial-basis function types always satisfy Mercer's theorem.
- The dimensionality of the feature space is determined by the number of support vectors extracted from the training data by the solution to the constrained optimization problem.
- The underlying theory of an SVM avoids the need for the heuristics often used in the design of conventional RBF networks and MLPs.

24 Support Vector Machines
- In the RBF type of SVM, the number of radial-basis functions and their centers are determined automatically by the number of support vectors and their values, respectively.
- In the two-layer perceptron type of SVM, the number of hidden neurons and their weight vectors are determined automatically by the number of support vectors and their values, respectively.
- Conceptual problem: the dimensionality of the feature space is made very large.
- Computational problem: the curse of dimensionality is avoided by using the notion of an inner-product kernel and solving the dual form of the constrained optimization problem, formulated in the input space.

25 Support Vector Machines
- The XOR problem: (x_1 OR x_2) AND NOT (x_1 AND x_2)
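
The classical treatment of this example uses the polynomial kernel K(x, x_i) = (1 + x^T x_i)^2 with inputs coded as ±1, under which the four XOR patterns become linearly separable in the feature space. A sketch that reproduces the result numerically (using scikit-learn rather than the analytic solution):

```python
import numpy as np
from sklearn.svm import SVC

# XOR with bipolar coding: the four corners of the square and their labels
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1, 1, 1, -1])

# K(x, x_i) = (1 + x^T x_i)^2  <=>  polynomial kernel with gamma=1, coef0=1, degree=2
clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, d)

print(clf.predict(X))              # expected: [-1  1  1 -1]
print(clf.support_.size)           # in the classical solution all four points are support vectors
```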

26 Support Vector Machines for Nonlinear Regression
- Consider a nonlinear regressive model in which the dependence of a scalar d on a vector x is described by d = f(x) + v, where v is additive noise.
- We are given a set of training data {(x_i, d_i)}, i = 1, ..., N, where x_i is a sample value of the input vector x and d_i is the corresponding value of the model output d. The problem is to provide an estimate of the dependence of d on x.
- The estimate: y = Σ_{j=0}^{m_1} w_j φ_j(x) = w^T φ(x).
- Minimize the empirical risk R_emp = (1/N) Σ_{i=1}^N L_ε(d_i, y_i) subject to the inequality ||w||² ≤ c_0, where L_ε is the ε-insensitive loss (shown below).
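
A one-line sketch of the ε-insensitive loss L_ε referred to above:

```python
def eps_insensitive_loss(d, y, eps=0.1):
    """L_eps(d, y) = |d - y| - eps if |d - y| >= eps, and 0 otherwise."""
    return max(abs(d - y) - eps, 0.0)
```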

27 Support Vector Machines for Nonlinear Regression
- Introduce two sets of nonnegative slack variables {ξ_i}, i = 1, ..., N, and {ξ'_i}, i = 1, ..., N:
  d_i - w^T φ(x_i) ≤ ε + ξ_i, i = 1, 2, ..., N
  w^T φ(x_i) - d_i ≤ ε + ξ'_i, i = 1, 2, ..., N
  ξ_i ≥ 0, ξ'_i ≥ 0, i = 1, 2, ..., N
- The cost function: Φ(w, ξ, ξ') = ½ w^T w + C Σ_{i=1}^N (ξ_i + ξ'_i)
- The Lagrangian: J(w, ξ, ξ', α, α', γ, γ') = C Σ_{i=1}^N (ξ_i + ξ'_i) + ½ w^T w - Σ_{i=1}^N α_i [w^T φ(x_i) - d_i + ε + ξ_i] - Σ_{i=1}^N α'_i [d_i - w^T φ(x_i) + ε + ξ'_i] - Σ_{i=1}^N (γ_i ξ_i + γ'_i ξ'_i)
- Setting the derivatives to zero gives w = Σ_{i=1}^N (α_i - α'_i) φ(x_i), γ_i = C - α_i and γ'_i = C - α'_i.
- K(x_i, x_j) = φ^T(x_i) φ(x_j)

28 Support Vector Machines for Nonlinear Regression
- The dual problem: given the training sample {(x_i, d_i)}, i = 1, ..., N, find the Lagrange multipliers {α_i}, i = 1, ..., N, and {α'_i}, i = 1, ..., N, that maximize the objective function Q(α_i, α'_i) = Σ_{i=1}^N d_i (α_i - α'_i) - ε Σ_{i=1}^N (α_i + α'_i) - ½ Σ_{i=1}^N Σ_{j=1}^N (α_i - α'_i)(α_j - α'_j) K(x_i, x_j)
- subject to the constraints: 1) Σ_{i=1}^N (α_i - α'_i) = 0; 2) 0 ≤ α_i ≤ C and 0 ≤ α'_i ≤ C for i = 1, 2, ..., N, where C is a user-specified constant.
- The parameters ε and C are free parameters selected by the user; they must be tuned simultaneously.
- Regression is intrinsically more difficult than pattern classification.
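
A sketch of ε-SVR on synthetic data with scikit-learn; the data, the RBF kernel, and the values of ε and C are made up for the illustration.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 100).reshape(-1, 1)
d = np.sinc(x).ravel() + 0.1 * rng.standard_normal(100)   # d = f(x) + v

# epsilon and C are the two free parameters mentioned on the slide
svr = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(x, d)
y = svr.predict(x)
print('number of support vectors:', svr.support_.size)
```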

29 Summary
- The SVM is an elegant and highly principled learning method for the design of a feedforward network with a single layer of nonlinear units.
- The SVM includes the polynomial learning machine, the radial-basis function network, and the two-layer perceptron as special cases.
- The SVM provides a method for controlling model complexity independently of dimensionality.
- The SVM learning algorithm operates only in a batch mode.
- By using a suitable inner-product kernel, the SVM automatically computes all the important parameters pertaining to that choice of kernel.
- In terms of running time, SVMs are currently slower than other neural networks.

