Published by Yasmine Greely. Modified over 2 years ago.

1
7. Support Vector Machines (SVMs)
Basic idea:
1. Transform the data with a non-linear mapping so that it is linearly separable. Cf. Cover’s theorem: non-linearly separable data can be transformed into a new feature space in which it is linearly separable if 1) the mapping is non-linear and 2) the dimensionality of the feature space is high enough.
2. Construct the ‘optimal’ hyperplane (a linear weighted sum of the outputs of the first layer) which maximises the degree of separation (the margin of separation, denoted by $\rho$) between the 2 classes.

2
MLPs and RBFNs stop training when all points are classified correctly. Thus the decision surfaces are not optimised, in the sense that the generalisation error is not minimised.
[Figure: decision surfaces for MLP, RBF and SVM]

3
Input: $m_0$-dimensional vector $x = (x_1, x_2, \ldots, x_{m_0})^T$
First layer: a mapping is performed from the input space into a feature space of higher dimension, where the data is now linearly separable, using a set of $m_1$ non-linear functions $\varphi_1(x), \ldots, \varphi_{m_1}(x)$ (cf. RBFNs).
Output: $y = \sum_{i=1}^{m_1} w_i \varphi_i(x) + b$, or $y = w^T \varphi(x)$ with the bias absorbed as $b = w_0$.

4
1. After learning, both RBFN and MLP decision surfaces might not be at the optimal position. For example, as shown in the figure, both learning rules will not perform further iterations (learning) since the error criterion is satisfied (cf. perceptron).
2. In contrast, the SVM algorithm generates the optimal decision boundary (the dotted line) by maximising the distance between the classes, which is specified by the distance between the decision boundary and the nearest data points.
3. Points which lie exactly this distance away from the decision boundary are known as support vectors.
4. The intuition is that these are the most important points, since moving them moves the decision boundary.

5
Moving a support vector moves the decision boundary. Moving the other vectors has no effect.
The algorithm to generate the weights proceeds in such a way that only the support vectors determine the weights and thus the boundary.

6
Input: $m_0$-dimensional vector $x = (x_1, x_2, \ldots, x_{m_0})^T$
Hidden units: $K(x, x_1), \ldots, K(x, x_N)$, where $K(x, x_N) = \varphi^T(x)\, \varphi(x_N)$, with output weights $a_1 d_1, \ldots, a_N d_N$
Output: $y = \sum_i a_i d_i K(x, x_i) + b$, where $b$ is the bias.
However, we shall see that the output of the SVM can also be interpreted as a weighted sum of the inner (dot) products of the images of the input $x$ and the support vectors $x_i$ in the feature space, which is computed by an inner product kernel function $K(x, x_i)$.
Where: $\varphi(x) = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_{m_1}(x)]^T$, i.e. the image of $x$ in feature space, and $d_i = \pm 1$ depending on the class of $x_i$.

7
Why should inner product kernels be involved in pattern recognition?
The intuition is that they provide some measure of similarity. Cf. the inner product in 2D between 2 vectors of unit length, which returns the cosine of the angle between them.
E.g. $x = [1, 0]^T$, $y = [0, 1]^T$.
If they are parallel, the inner product is 1: $x^T x = x \cdot x = 1$.
If they are perpendicular, the inner product is 0: $x^T y = x \cdot y = 0$.
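As a quick numerical check of this intuition, the sketch below computes the same inner products in plain Python. The helper names `dot` and `cosine_similarity` are illustrative and do not come from the slides.

```python
import math

def dot(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Inner product after scaling both vectors to unit length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

x = [1.0, 0.0]
y = [0.0, 1.0]

parallel = dot(x, x)        # same direction, unit length
perpendicular = dot(x, y)   # orthogonal directions
# Parallel vectors that are not of unit length: normalise first.
scaled = cosine_similarity([1.0, 1.0], [3.0, 3.0])

print(parallel, perpendicular, scaled)
```

For vectors that are not of unit length the raw inner product also grows with their lengths, which is why the cosine form divides by both norms.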

8
Differs from MLP (etc.) approaches in a fundamental way.
In MLPs, complexity is controlled by keeping the number of hidden nodes small. Here, complexity is controlled independently of dimensionality.
The mapping means that the decision surface is constructed in a very high (often infinite) dimensional space.
However, the curse of dimensionality (which makes finding the optimal weights difficult) is avoided by using the notion of an inner product kernel (see the kernel trick, later) and optimising the weights in the input space.

9
SVMs are a superclass of network containing both MLPs and RBFNs (and both can be generated using the SV algorithm).
Strengths:
Previous slide: i.e. complexity/capacity is independent of the dimensionality of the data, thus avoiding the curse of dimensionality.
Statistically motivated => can get bounds on the error, and can use the theory of VC dimension and structural risk minimisation (theory which characterises the generalisation abilities of learning machines).
Finding the weights is a quadratic programming problem guaranteed to find a minimum of the error surface. Thus the algorithm is efficient, and SVMs generate near optimal classification and are insensitive to overtraining.
Obtain good generalisation performance due to the high dimension of the feature space.

10
Most important (?): by using a suitable kernel, the SVM automatically computes all network parameters for that kernel. E.g. RBF SVM: automatically selects the number and position of hidden nodes (and weights and bias).
Weaknesses:
Scale (metric) dependent.
Slow training (compared to RBFNs/MLPs) due to the computationally intensive solution of the QP problem, especially for large amounts of training data => need special algorithms.
Generates complex solutions (normally > 60% of training points are used as support vectors), especially for large amounts of training data. E.g. from Haykin: increase in performance of 1.5% over MLP. However, the MLP used 2 hidden nodes, the SVM used 285.
Difficult to incorporate prior knowledge.

11
The SVM was proposed by Vapnik and colleagues in the 70’s but has only recently become popular (early 90’s). It (and other kernel techniques) is currently a very active (and trendy) topic of research.
See for example (book): An Introduction to Support Vector Machines (and other kernel-based learning methods). N. Cristianini and J. Shawe-Taylor, Cambridge University Press, for recent developments.

12
First consider a linearly separable problem where the decision boundary is given by $g(x) = w^T x + b = 0$, and a set of training data $X = \{(x_i, d_i) : i = 1, \ldots, N\}$, where $d_i = +1$ if $x_i$ is in class 1 and $-1$ if it is in class 2. Let the optimal weight-bias combination be $w_0$ and $b_0$.
[Figure: a point $x$, its projection $x_p$ onto the hyperplane, the normal component $x_n$ and the weight vector $w$]
Now: $x = x_p + x_n = x_p + r\, w_0 / \|w_0\|$, where $|r| = \|x_n\|$ and $r$ is positive on the side of the hyperplane that $w_0$ points towards.
Since $g(x_p) = 0$:
$g(x) = w_0^T (x_p + r\, w_0 / \|w_0\|) + b_0 = r\, w_0^T w_0 / \|w_0\| = r \|w_0\|$
or: $r = g(x) / \|w_0\|$
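The relation $r = g(x) / \|w_0\|$ can be checked numerically. The following is a minimal sketch in plain Python; the hyperplane and the point are hypothetical values chosen to keep the arithmetic simple, and the function names are illustrative.

```python
import math

def g(w, b, x):
    """Discriminant g(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def algebraic_distance(w, b, x):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane g(x) = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return g(w, b, x) / norm_w

# Hypothetical hyperplane and point (not taken from the slides).
w = [3.0, 4.0]
b = -5.0
x = [3.0, 4.0]

r = algebraic_distance(w, b, x)

# Projection of x onto the hyperplane: x_p = x - r * w / ||w||.
norm_w = math.sqrt(sum(wi * wi for wi in w))
x_p = [xi - r * wi / norm_w for xi, wi in zip(x, w)]

# g(x_p) should vanish (up to rounding), confirming that x_p lies on the boundary.
print(r, x_p, g(w, b, x_p))
```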

13
Thus, as $g(x)$ gives us the algebraic distance to the hyperplane (up to the factor $1 / \|w_0\|$), we want:
$g(x_i) = w_0^T x_i + b_0 \ge 1$ for $d_i = +1$
$g(x_i) = w_0^T x_i + b_0 \le -1$ for $d_i = -1$
(remembering that $w_0$ and $b_0$ can be rescaled without changing the boundary), with equality for the support vectors $x_s$.
Thus, considering the support vectors (the points on the edge of the margin) and that $r = g(x) / \|w_0\|$, we have:
$r = 1 / \|w_0\|$ for $d_s = +1$ and $r = -1 / \|w_0\|$ for $d_s = -1$
and so the margin of separation is: $\rho = 2 / \|w_0\|$
Thus, the solution $w_0$ maximises the margin of separation. Maximising this margin is equivalent to minimising $\|w\|$.
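To make the rescaling remark concrete, the sketch below takes a separating hyperplane that is not yet in canonical form, rescales it so that the closest points give $d_i\, g(x_i) = 1$, and then reads off the margin as $2 / \|w\|$. The data set and the starting $(w, b)$ are hypothetical; the code illustrates the rescaling only and does not perform the optimisation.

```python
import math

def g(w, b, x):
    """Discriminant g(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Hypothetical linearly separable data: (point, label d_i).
data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# A separating hyperplane that is not yet in canonical form.
w = [1.0, 1.0]
b = -2.0

# Rescale so that the closest points satisfy d_i * g(x_i) = 1.
# The boundary g(x) = 0 is unchanged by this common scaling of w and b.
scale = min(d * g(w, b, x) for x, d in data)
w_c = [wi / scale for wi in w]
b_c = b / scale

margin = 2.0 / math.sqrt(sum(wi * wi for wi in w_c))
functional_margins = [d * g(w_c, b_c, x) for x, d in data]

print(w_c, b_c, margin, functional_margins)
```

Here the two closest points, (2, 2) and (0, 0), end up with $d_i\, g(x_i) = 1$, and the computed margin equals the distance between them.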

14
We now need a computationally efficient algorithm to find $w_0$ and $b_0$ using the training data $(x_i, d_i)$. That is, we want to minimise:
$F(w) = \frac{1}{2} w^T w$
subject to: $d_i (w^T x_i + b) \ge 1$ for $i = 1, \ldots, N$
which is known as the primal problem.
Note that the cost function $F$ is convex in $w$ (=> a unique solution) and that the constraints are linear in $w$. Thus we can solve for $w$ using the technique of Lagrange multipliers (for non-mathematicians: a technique for solving constrained optimisation problems). For a geometrical interpretation of Lagrange multipliers see Bishop, 95, Appendix C.

15
First we construct the Lagrangian function:
$L(w, b, a) = \frac{1}{2} w^T w - \sum_i a_i [d_i (w^T x_i + b) - 1]$
where the $a_i \ge 0$ are the Lagrange multipliers. $L$ must be minimised with respect to $w$ and $b$ and maximised with respect to the $a_i$ (it can be shown that such problems have a saddle-point at the optimal solution).
Note that the Karush-Kuhn-Tucker (or, intuitively, the maximisation/constraint) conditions mean that at the optimum:
$a_i [d_i (w^T x_i + b) - 1] = 0$
This means that unless the data point is a support vector, $a_i = 0$ and the respective points are not involved in the optimisation.
We then set the partial derivatives of $L$ with respect to $b$ and $w$ to zero to obtain the conditions of optimality:
$w = \sum_i a_i d_i x_i$ and $\sum_i a_i d_i = 0$

16
Given such a constrained convex problem, we can reformulate the primal problem using the optimality conditions to get the equivalent dual problem:
Given the training data sample $\{(x_i, d_i),\ i = 1, \ldots, N\}$, find the Lagrange multipliers $a_i$ which maximise:
$Q(a) = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j d_i d_j\, x_i^T x_j$
subject to the constraints:
$\sum_i a_i d_i = 0$ and $a_i \ge 0$
Notice that the input vectors are only involved as an inner product.
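For readers who want the intermediate step, the dual objective follows from substituting the two optimality conditions, $w = \sum_i a_i d_i x_i$ and $\sum_i a_i d_i = 0$, into the Lagrangian. The derivation below is a sketch in the slides' notation; writing the Lagrangian as $L(w, b, a)$ and naming the dual objective $Q(a)$ (as in the later XOR slides) are notational assumptions.

```latex
\begin{align*}
L(w, b, a) &= \tfrac{1}{2} w^T w
  - \sum_{i=1}^{N} a_i d_i\, w^T x_i
  - b \sum_{i=1}^{N} a_i d_i
  + \sum_{i=1}^{N} a_i \\[4pt]
\text{using } w = \sum_i a_i d_i x_i: \quad
w^T w &= \sum_{i=1}^{N} a_i d_i\, w^T x_i
  = \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j d_i d_j\, x_i^T x_j \\[4pt]
\text{using } \sum_i a_i d_i = 0: \quad
Q(a) &= \sum_{i=1}^{N} a_i
  - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} a_i a_j d_i d_j\, x_i^T x_j
\end{align*}
```

The bias $b$ drops out because its coefficient is exactly the equality constraint, and only the inner products $x_i^T x_j$ remain.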

17
Once the optimal Lagrange multipliers $a_{0,i}$ have been found, we can use them to find the optimal $w$:
$w_0 = \sum_i a_{0,i} d_i x_i$
and the optimal bias from the fact that for a positive support vector:
$w_0^T x_i + b_0 = 1 \Rightarrow b_0 = 1 - w_0^T x_i$
[However, from a numerical perspective it is better to take the mean value of $b_0$ resulting from all such data points in the sample.]
Since $a_{0,i} = 0$ if $x_i$ is not a support vector, ONLY the support vectors determine the optimal hyperplane, which was our intuition.
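The recovery of $w_0$ and $b_0$ can be sketched on a tiny problem. The two-point data set below is hypothetical; for it the dual optimum works out by hand to $a_1 = a_2 = 1/4$. As an assumption beyond the slide, the bias is averaged over support vectors of both classes using $b = d_i - w_0^T x_i$, which reduces to $1 - w_0^T x_i$ for a positive support vector.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))

# Hypothetical two-point problem; its dual optimum is a_1 = a_2 = 1/4.
X = [(1.0, 1.0), (-1.0, -1.0)]
d = [1, -1]
a = [0.25, 0.25]

# w_0 = sum_i a_i d_i x_i
w0 = [sum(a[i] * d[i] * X[i][k] for i in range(len(X))) for k in range(2)]

# Support vectors are the points with non-zero multipliers.
support = [i for i in range(len(X)) if a[i] > 1e-12]

# For a support vector d_i (w^T x_i + b) = 1, so b = d_i - w^T x_i (d_i = +/-1).
# Averaging over all support vectors follows the numerical advice above.
b0 = sum(d[i] - dot(w0, X[i]) for i in support) / len(support)

# Both points are support vectors, so the outputs should be +1 and -1.
outputs = [dot(w0, x) + b0 for x in X]
print(w0, b0, outputs)
```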

18
For a non-linearly separable problem we first have to map the data onto a feature space so that they are linearly separable:
$x_i \rightarrow \varphi(x_i)$
with the procedure for determining $w$ the same, except that $x_i$ is replaced by $\varphi(x_i)$. That is:
Given the training data sample $\{(x_i, d_i),\ i = 1, \ldots, N\}$, find the optimum values of the weight vector $w$ and bias $b$:
$w = \sum_i a_{0,i} d_i \varphi(x_i)$
where the $a_{0,i}$ are the optimal Lagrange multipliers determined by maximising the following objective function:
$Q(a) = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j d_i d_j\, \varphi(x_i)^T \varphi(x_j)$
subject to the constraints $\sum_i a_i d_i = 0$; $a_i \ge 0$

19
Example: XOR problem revisited.
Let the nonlinear mapping be:
$\varphi(x) = (1, x_1^2, \sqrt{2}\, x_1 x_2, x_2^2, \sqrt{2}\, x_1, \sqrt{2}\, x_2)^T$
and: $\varphi(x_i) = (1, x_{i1}^2, \sqrt{2}\, x_{i1} x_{i2}, x_{i2}^2, \sqrt{2}\, x_{i1}, \sqrt{2}\, x_{i2})^T$
Therefore the feature space is in 6D with input data in 2D:
$x_1 = (-1, -1)$, $d_1 = -1$
$x_2 = (-1, 1)$, $d_2 = +1$
$x_3 = (1, -1)$, $d_3 = +1$
$x_4 = (1, 1)$, $d_4 = -1$

20
$Q(a) = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j d_i d_j\, \varphi(x_i)^T \varphi(x_j)$
Here $\varphi(x_i)^T \varphi(x_i) = 9$ and $\varphi(x_i)^T \varphi(x_j) = 1$ for $i \ne j$, so:
$Q(a) = a_1 + a_2 + a_3 + a_4 - \frac{1}{2}(9 a_1^2 - 2 a_1 a_2 - 2 a_1 a_3 + 2 a_1 a_4 + 9 a_2^2 + 2 a_2 a_3 - 2 a_2 a_4 + 9 a_3^2 - 2 a_3 a_4 + 9 a_4^2)$
To maximise $Q$, we only need to set $\partial Q / \partial a_i = 0$ (due to the optimality conditions), which gives:
$1 = 9 a_1 - a_2 - a_3 + a_4$
$1 = -a_1 + 9 a_2 + a_3 - a_4$
$1 = -a_1 + a_2 + 9 a_3 - a_4$
$1 = a_1 - a_2 - a_3 + 9 a_4$

21
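The solution of which gives the optimal values:
$a_{0,1} = a_{0,2} = a_{0,3} = a_{0,4} = 1/8$
$w_0 = \sum_i a_{0,i} d_i \varphi(x_i) = \frac{1}{8} [-\varphi(x_1) + \varphi(x_2) + \varphi(x_3) - \varphi(x_4)] = (0, 0, -1/\sqrt{2}, 0, 0, 0)^T$
where the first element of $w_0$ gives the bias $b$ (here $b = 0$).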
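The XOR solution can be verified numerically. The sketch below, in plain Python, checks that $a_i = 1/8$ satisfies the four stationarity equations and the equality constraint, and then forms $w_0$ from the explicit feature map. It assumes the fourth training point is $(1, 1)$, as the XOR labels require; the variable names are illustrative.

```python
import math

def phi(x):
    """Feature map of the XOR example: 2-D input -> 6-D feature space."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))

X = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
d = [-1, 1, 1, -1]
a = [1.0 / 8.0] * 4
n = len(X)

# Stationarity of Q: for every i, sum_j a_j d_i d_j phi(x_i)^T phi(x_j) = 1.
residuals = [
    sum(a[j] * d[i] * d[j] * dot(phi(X[i]), phi(X[j])) for j in range(n)) - 1.0
    for i in range(n)
]

# The equality constraint sum_i a_i d_i = 0 must hold as well.
constraint = sum(a[i] * d[i] for i in range(n))

# w_0 = sum_i a_i d_i phi(x_i); its first element plays the role of the bias.
w0 = [sum(a[i] * d[i] * phi(X[i])[k] for i in range(n)) for k in range(6)]

print(residuals, constraint, w0)
```

Only the third component of `w0` is non-zero (up to rounding), which is the component that multiplies $\sqrt{2}\, x_1 x_2$ in the feature map.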

22
From earlier we have that the optimal hyperplane is defined by:
$w_0^T \varphi(x) = 0$
That is:
$w_0^T \varphi(x) = -x_1 x_2 = 0$
which is the optimal decision boundary for the XOR problem. Furthermore, we note that the solution is unique since the optimal decision boundary is unique.

23
[Figure: SVM output for polynomial and RBF kernels]

24
SVM building procedure:
1. Pick a nonlinear mapping $\varphi$.
2. Solve for the optimal weight vector.
However: how do we pick the function $\varphi$? In practical applications, if it is not totally impossible to find $\varphi$, it is very hard. In the previous example, the function $\varphi$ is quite complex: how would we find it?
Answer: the kernel trick.

25
Notice that in the dual problem the images of the input vectors are only involved as an inner product, meaning that the optimisation can be performed in the (lower dimensional) input space and that the inner product can be replaced by an inner-product kernel:
$Q(a) = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j d_i d_j\, \varphi(x_i)^T \varphi(x_j) = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j d_i d_j\, K(x_i, x_j)$
How do we relate the output of the SVM to the kernel $K$? Look at the equation of the boundary in the feature space and use the optimality conditions derived from the Lagrangian formulation.

26

27
In the XOR problem, we chose to use the kernel function:
$K(x, x_i) = (x^T x_i + 1)^2 = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}$
which implied the form of our nonlinear functions:
$\varphi(x) = (1, x_1^2, \sqrt{2}\, x_1 x_2, x_2^2, \sqrt{2}\, x_1, \sqrt{2}\, x_2)^T$
and: $\varphi(x_i) = (1, x_{i1}^2, \sqrt{2}\, x_{i1} x_{i2}, x_{i2}^2, \sqrt{2}\, x_{i1}, \sqrt{2}\, x_{i2})^T$
However, we did not need to calculate $\varphi$ at all and could simply have used the kernel to calculate:
$Q(a) = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j d_i d_j\, K(x_i, x_j)$
maximised and solved for the $a_i$, and derived the hyperplane via:
$\sum_i a_{0,i} d_i K(x, x_i) = 0$
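A minimal sketch of the kernel trick on the XOR data follows. It checks that the kernel value equals the inner product of the feature images, and then evaluates the decision function from kernel values alone. It assumes the multipliers $a_i = 1/8$ found earlier and the fourth training point $(1, 1)$; the function names and the two test vectors `u`, `v` are illustrative.

```python
import math

def kernel(x, y):
    """Polynomial kernel K(x, y) = (x^T y + 1)^2 used in the XOR example."""
    return (sum(p * q for p, q in zip(x, y)) + 1.0) ** 2

def phi(x):
    """Explicit feature map implied by the kernel (only needed for the check)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2]

X = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
d = [-1, 1, 1, -1]
a = [1.0 / 8.0] * 4  # multipliers found for the XOR problem

def decision(x):
    """y(x) = sum_i a_i d_i K(x, x_i), evaluated without ever forming phi."""
    return sum(a[i] * d[i] * kernel(x, X[i]) for i in range(len(X)))

# Check 1: the kernel equals the inner product of the feature images.
u, v = (0.3, -1.2), (2.0, 1.5)
kernel_gap = abs(kernel(u, v) - sum(p * q for p, q in zip(phi(u), phi(v))))

# Check 2: the kernel expansion reproduces the XOR labels at the training points.
outputs = [decision(x) for x in X]
print(kernel_gap, outputs)
```

The cost of `decision` depends on the number of training points with non-zero multipliers, not on the dimension of the feature space, which is the practical benefit of working with $K$ directly.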

28
We therefore only need a suitable choice of kernel function, cf. Mercer’s theorem:
Let $K(x, y)$ be a continuous symmetric kernel that is defined in the closed interval $[a, b]$. The kernel $K$ can be expanded in the form $K(x, y) = \varphi(x)^T \varphi(y)$ provided it is positive definite.
Some of the usual choices for $K$ are:
Polynomial SVM: $(x^T x_i + 1)^p$, with $p$ specified by the user
RBF SVM: $\exp(-\frac{1}{2\sigma^2} \|x - x_i\|^2)$, with $\sigma^2$ specified by the user
MLP SVM: $\tanh(s_0\, x^T x_i + s_1)$, for which Mercer’s theorem is not satisfied for all $s_0$ and $s_1$
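The three kernel choices can be written as small Python functions. This is a sketch with illustrative function and parameter names. The last lines show one simple way the tanh kernel can fail Mercer's condition: a positive (semi-)definite kernel must give $K(x, x) \ge 0$ for every $x$, and with a negative $s_1$ the tanh kernel is negative at the origin.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(x, y))

def polynomial_kernel(x, y, p):
    """(x^T y + 1)^p, with the degree p chosen by the user."""
    return (dot(x, y) + 1.0) ** p

def rbf_kernel(x, y, sigma2):
    """exp(-||x - y||^2 / (2 sigma^2)), with the width sigma^2 chosen by the user."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma2))

def mlp_kernel(x, y, s0, s1):
    """tanh(s0 x^T y + s1); a valid Mercer kernel only for some s0, s1."""
    return math.tanh(s0 * dot(x, y) + s1)

# K(x, x) >= 0 is necessary for positive (semi-)definiteness.
# With s1 < 0 the tanh kernel violates it at the origin.
origin = (0.0, 0.0)
violating_value = mlp_kernel(origin, origin, s0=1.0, s1=-1.0)
print(violating_value)
```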

29
How do we recover $\varphi$ from a given $K$? It is not essential that we do.
Further development:
1. In practical applications, it is found that the support vector machine can outperform other learning machines.
2. How to choose the kernel?
3. How much better is the SVM compared with traditional machines?
Feng J. and Williams P. M. (2001) The generalization error of the symmetric and scaled support vector machine. IEEE T. Neural Networks, Vol. 12.

30
What about regularisation?
It is important that we don’t allow noise to spoil our generalisation: we want a soft margin of separation.
Introduce slack variables $e_i \ge 0$ such that:
$d_i (w^T x_i + b) \ge 1 - e_i$ for $i = 1, \ldots, N$
rather than: $d_i (w^T x_i + b) \ge 1$
[Figure: three example points, with $e_i = 0$, $0 < e_i \le 1$ and $e_i > 1$]
But all 3 are support vectors since $d_i (w^T x_i + b) = 1 - e_i$.

31
Thus the slack variables measure our deviation from the ideal pattern separability and also allow us some freedom in specifying the hyperplane.
Therefore formulate a new problem to minimise:
$F(w, e) = \frac{1}{2} w^T w + C \sum_i e_i$
subject to: $d_i (w^T x_i + b) \ge 1 - e_i$ for $i = 1, \ldots, N$
and: $e_i \ge 0$
where $C$ acts as an (inverse) regularisation parameter which can be determined experimentally or analytically.
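The role of the slack variables can be illustrated with a few lines of Python. The sketch assumes each $e_i$ takes the smallest value that satisfies its constraint, i.e. $e_i = \max(0, 1 - d_i (w^T x_i + b))$; the hyperplane, the three points and the value of $C$ are hypothetical.

```python
def g(w, b, x):
    """Discriminant g(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def slack(w, b, x, d):
    """Smallest e_i >= 0 satisfying d_i (w^T x_i + b) >= 1 - e_i."""
    return max(0.0, 1.0 - d * g(w, b, x))

def soft_margin_objective(w, b, data, C):
    """F(w, e) = 1/2 w^T w + C * sum_i e_i, with each e_i at its smallest value."""
    return 0.5 * sum(wi * wi for wi in w) + C * sum(slack(w, b, x, d) for x, d in data)

# Hypothetical hyperplane and three labelled points, one per slack case.
w, b = [1.0, 0.0], 0.0
data = [
    ((1.0, 0.0), 1),   # exactly on the margin: e_i = 0
    ((0.5, 0.0), 1),   # inside the margin, correct side: 0 < e_i <= 1
    ((0.5, 0.0), -1),  # wrong side of the boundary: e_i > 1
]

slacks = [slack(w, b, x, d) for x, d in data]
F = soft_margin_objective(w, b, data, C=10.0)
print(slacks, F)
```

A larger $C$ penalises the slack terms more heavily relative to $\frac{1}{2} w^T w$, so it pushes the solution towards fewer margin violations at the cost of a narrower margin.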

32
The solution proceeds in the same way as before (Lagrangian, formulate dual and maximise) to obtain the optimal $a_i$ for:
$Q(a) = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j d_i d_j\, K(x_i, x_j)$
subject to the constraints $\sum_i a_i d_i = 0$; $0 \le a_i \le C$
Thus, the nonseparable problem differs from the separable one only in that the second constraint is more stringent. Again the optimal solution is:
$w_0 = \sum_i a_{0,i} d_i \varphi(x_i)$
However, this time the KKT conditions imply that: $e_i = 0$ if $a_i < C$
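The box constraint and its KKT consequence can be read off a vector of multipliers. The sketch below checks dual feasibility and labels each training point by its multiplier; the multipliers, labels, tolerance and function names are hypothetical values for illustration, not the output of a real solver.

```python
def check_dual_feasible(a, d, C, tol=1e-9):
    """True if sum_i a_i d_i = 0 and 0 <= a_i <= C for all i."""
    balanced = abs(sum(ai * di for ai, di in zip(a, d))) <= tol
    boxed = all(-tol <= ai <= C + tol for ai in a)
    return balanced and boxed

def categorise(ai, C, tol=1e-9):
    """Role of a training point implied by its multiplier."""
    if ai <= tol:
        return "not a support vector"
    if ai < C - tol:
        return "support vector on the margin (e_i = 0)"
    return "support vector at the bound (e_i may be > 0)"

# Hypothetical multipliers for four training points with C = 1.
C = 1.0
d = [1, 1, -1, -1]
a = [0.0, 1.0, 0.4, 0.6]

feasible = check_dual_feasible(a, d, C)
roles = [categorise(ai, C) for ai in a]
print(feasible, roles)
```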

33
SVMs for non-linear regression
SVMs can also be used for non-linear regression. However, unlike MLPs and RBFs, the formulation does not follow directly from the classification case.
Starting point: we have input data $X = \{(x_1, d_1), \ldots, (x_N, d_N)\}$, where $x_i$ is in $D$ dimensions and $d_i$ is a scalar. We want to find a robust function $f(x)$ that has at most $\varepsilon$ deviation from the targets $d$, while at the same time being as flat (in the regularisation sense of a smooth boundary) as possible.

34
Thus setting:
$f(x) = w^T \varphi(x) + b$
the problem becomes: minimise $\frac{1}{2} w^T w$ (for flatness) [think of the gradient between (0,0) and (1,1) if the weights are (1,1) vs (1000, 1000)]
subject to:
$d_i - w^T \varphi(x_i) - b \le \varepsilon$
$w^T \varphi(x_i) + b - d_i \le \varepsilon$

35
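This formalisation is called $\varepsilon$-insensitive regression, as it is equivalent to minimising the empirical risk (the amount you might be wrong) using an $\varepsilon$-insensitive loss function:
$L(f, d, x) = 0$ for $|f(x) - d| < \varepsilon$, else $|f(x) - d| - \varepsilon$
[Figure: plot of the $\varepsilon$-insensitive loss $L(f, y)$]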
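The loss is simple to state in code. The following sketch implements it alongside the least-squares loss for comparison; the value of `eps` and the sample numbers are hypothetical.

```python
def eps_insensitive_loss(f_x, d, eps):
    """Zero inside the tube |f(x) - d| < eps, linear in the excess outside it."""
    return max(0.0, abs(f_x - d) - eps)

def squared_loss(f_x, d):
    """Least-squares loss, shown only for comparison."""
    return (f_x - d) ** 2

eps = 0.5  # hypothetical tube half-width

inside = eps_insensitive_loss(1.25, 1.0, eps)   # deviation 0.25 < eps
outside = eps_insensitive_loss(3.0, 1.0, eps)   # deviation 2.0 > eps
outlier_sq = squared_loss(3.0, 1.0)             # same deviation, squared

print(inside, outside, outlier_sq)
```

Outside the tube the loss grows linearly with the deviation, whereas the squared loss grows quadratically, so a single large deviation contributes less under the $\varepsilon$-insensitive loss.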

36
Comparing the $\varepsilon$-insensitive loss function to the least squares loss function (used for MLP/RBFN):
More robust (robust to small changes in data/model)
Less sensitive to outliers
Non-continuous derivative
The cost function is: $C \sum_i L(f, d_i, x_i)$, where $C$ can be viewed as a regularisation parameter.

37
[Figure: regression for different $\varepsilon$; the function selected is the flattest. Original (O)]

38
We now introduce 2 slack variables, $e_i$ and $e_i^*$, as in the case of nonlinearly separable data, and write:
$d_i - w^T \varphi(x_i) - b \le \varepsilon + e_i$
$w^T \varphi(x_i) + b - d_i \le \varepsilon + e_i^*$
where: $e_i, e_i^* \ge 0$
Thus: $C \sum_i L(f, d_i, x_i) = C \sum_i (e_i + e_i^*)$
and the problem becomes to minimise:
$F(w, e) = \frac{1}{2} w^T w + C \sum_i (e_i + e_i^*)$
subject to the two constraints above and $e_i, e_i^* \ge 0$.

39
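We now form the Lagrangian and find the dual. Note that this time there will be 2 sets of Lagrange multipliers, $a_i$ and $a_i^*$, as there are 2 constraints. The dual to be maximised is:
$Q(a, a^*) = \sum_i d_i (a_i - a_i^*) - \varepsilon \sum_i (a_i + a_i^*) - \frac{1}{2} \sum_i \sum_j (a_i - a_i^*)(a_j - a_j^*)\, K(x_i, x_j)$
subject to $\sum_i (a_i - a_i^*) = 0$ and $0 \le a_i, a_i^* \le C$
where $\varepsilon$ and $C$ are free parameters that control the approximating function:
$f(x) = w^T \varphi(x) + b = \sum_i (a_i - a_i^*)\, K(x, x_i) + b$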

40
From the KKT conditions we now have:
$a_i (\varepsilon + e_i - d_i + w^T \varphi(x_i) + b) = 0$
$a_i^* (\varepsilon + e_i^* + d_i - w^T \varphi(x_i) - b) = 0$
This means that the Lagrange multipliers will only be non-zero for points where:
$|f(x_i) - d_i| \ge \varepsilon$
That is, only for points on or outside the tube. Thus these points are the support vectors and we have a sparse expansion of $w$ in terms of $x$.
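The condition $|f(x_i) - d_i| \ge \varepsilon$ can be used to list the points that are allowed to carry non-zero multipliers. The sketch below does this for hypothetical fitted values and targets. As a simplification it keeps the fitted values fixed while $\varepsilon$ changes, whereas a real fit would itself change with $\varepsilon$.

```python
def support_vector_indices(predictions, targets, eps):
    """Indices of points on or outside the eps-tube, i.e. |f(x_i) - d_i| >= eps."""
    return [
        i
        for i, (f_i, d_i) in enumerate(zip(predictions, targets))
        if abs(f_i - d_i) >= eps
    ]

# Hypothetical fitted values f(x_i) and targets d_i.
predictions = [1.0, 2.0, 3.0, 4.0]
targets = [1.25, 2.5, 2.0, 4.0]

# A narrow tube leaves more points on or outside it than a wide one.
narrow = support_vector_indices(predictions, targets, eps=0.25)
wide = support_vector_indices(predictions, targets, eps=0.75)
print(narrow, wide)
```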

41
[Figure: data points and SVs. $\varepsilon$ controls the number of SVs selected.]

42
Only non-zero a’s can contribute: the Lagrange multipliers act like forces on the regression. However, they can only be applied at points outside or touching the tube.
[Figure: points where forces act]

43
One note of warning: regression is much harder than classification for 2 reasons:
1. Regression is intrinsically more difficult than classification.
2. $\varepsilon$ and $C$ must be tuned simultaneously.

44
Research issues:
Incorporation of prior knowledge, e.g. 1. train a machine, 2. add in virtual support vectors which incorporate known invariances of the SVs found in 1, 3. retrain.
Speeding up training time? Various techniques, mainly to deal with reducing the size of the data set. Chunking: use subsets of the data at a time and only keep the SVs. Also, more sophisticated versions which use linear combinations of the training points as inputs.

45
Optimisation packages/techniques? Off-the-shelf ones are not brilliant (including the MATLAB one). Sequential Minimal Optimisation (SMO) is widely used. For details of that and others see: A. J. Smola and B. Schölkopf. A Tutorial on Support Vector Regression. NeuroCOLT Technical Report NC-TR, Royal Holloway College, University of London, UK, 1998.
Selection of $\varepsilon$ and $C$


