# 7. Support Vector Machines (SVMs)

Basic idea: transform the data with a non-linear mapping $\varphi$ so that it becomes linearly separable. Cf. Cover’s theorem: data that are not linearly separable can be transformed into a new feature space in which they are linearly separable, provided that 1) the mapping is non-linear and 2) the dimensionality of the feature space is high enough. Then construct the ‘optimal’ hyperplane (a linear weighted sum of the outputs of the first layer) which maximises the degree of separation between the 2 classes (the margin of separation, denoted by $\rho$).

MLPs and RBFNs stop training when all points are classified correctly. Thus the decision surfaces are not optimised, in the sense that the generalisation error is not minimised.

[Figure: decision boundaries labelled MLP, RBF and SVM, with the SVM margin $\rho$.]

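[Figure: the SVM as a two-layer network. Input: $m_0$-dimensional vector $x = (x_1, x_2, \dots, x_{m_0})$; hidden units $\varphi_1(x), \dots, \varphi_{m_1}(x)$ with weights $w_1, \dots, w_{m_1}$; bias $b = w_0$; output $y$.]

Output: $y = \sum_{i=1}^{m_1} w_i \varphi_i(x) + b$, or $y = w^T \varphi(x)$ with the bias absorbed as $w_0 = b$.

First layer: a mapping is performed from the input space into a feature space of higher dimension, where the data are now linearly separable, using a set of $m_1$ non-linear functions (cf. RBFNs).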

After learning, both RBFN and MLP decision surfaces might not be at the optimal position. For example, as shown in the figure, both learning rules will not perform further iterations (learning) once the error criterion is satisfied (cf. the perceptron). In contrast, the SVM algorithm generates the optimal decision boundary (the dotted line) by maximising the distance $\rho$ between the classes, which is specified by the distance between the decision boundary and the nearest data points. Points which lie exactly $\rho/2$ away from the decision boundary are known as support vectors. The intuition is that these are the most important points, since moving them moves the decision boundary.

Moving a support vector moves the decision boundary; moving the other vectors has no effect. The algorithm to generate the weights proceeds in such a way that only the support vectors determine the weights and thus the boundary.

However, we shall see that the output of the SVM can also be interpreted as a weighted sum of the inner (dot) products of the images of the input $x$ and the support vectors $x_i$ in the feature space, which is computed by an inner-product kernel function $K(x, x_i)$:

$$y = \sum_{i=1}^{N} \alpha_i d_i K(x, x_i) + b, \qquad K(x, x_i) = \varphi^T(x)\,\varphi(x_i)$$

where $\varphi(x) = [\varphi_1(x), \varphi_2(x), \dots, \varphi_{m_1}(x)]^T$ is the image of $x$ in feature space and $d_i = \pm 1$ depending on the class of $x_i$.

[Figure: kernel network. Input: $m_0$-dimensional vector $x$; hidden units $K(x, x_1), \dots, K(x, x_N)$ weighted by $\alpha_i d_i$; bias $b$; output $y$.]

Why should inner product kernels be involved in pattern recognition?
The intuition is that they provide some measure of similarity. Cf. the inner product in 2D between 2 vectors of unit length, which returns the cosine of the angle between them. E.g. take $x = [1, 0]^T$ and $y = [0, 1]^T$. If two unit vectors are parallel the inner product is 1: $x^T x = x \cdot x = 1$. If they are perpendicular the inner product is 0: $x^T y = x \cdot y = 0$.
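The similarity reading of the inner product can be checked numerically. The following is a minimal sketch using only the standard library; the helper names `dot` and `cosine_similarity` and the extra vector `z` are illustrative assumptions, not part of the slides.

```python
import math

def dot(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Inner product after normalising both vectors to unit length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

x = [1.0, 0.0]
y = [0.0, 1.0]
# Hypothetical third vector: unit length, at 45 degrees to x.
z = [math.cos(math.pi / 4), math.sin(math.pi / 4)]

print("x.x =", dot(x, x))                  # parallel unit vectors
print("x.y =", dot(x, y))                  # perpendicular unit vectors
print("cos(x, z) =", cosine_similarity(x, z))
```

The closer two unit vectors point in the same direction, the closer their inner product is to 1, which is the sense in which a kernel acts as a similarity measure.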

SVMs differ from MLP (etc.) approaches in a fundamental way
In MLPs, complexity is controlled by keeping the number of hidden nodes small. Here, complexity is controlled independently of dimensionality. The mapping means that the decision surface is constructed in a very high (often infinite) dimensional space. However, the curse of dimensionality (which makes finding the optimal weights difficult) is avoided by using the notion of an inner product kernel (see the kernel trick, later) and optimising the weights in the input space.

SVMs are a superclass of network containing both MLPs and RBFNs (and both can be generated using the SV algorithm).

Strengths:
- As on the previous slide: complexity/capacity is independent of the dimensionality of the data, thus avoiding the curse of dimensionality.
- Statistically motivated: we can get bounds on the error and can use the theory of VC dimension and structural risk minimisation (theory which characterises the generalisation abilities of learning machines).
- Finding the weights is a quadratic programming problem that is guaranteed to find a minimum of the error surface. Thus the algorithm is efficient, and SVMs generate near-optimal classification and are insensitive to overtraining.
- Good generalisation performance is obtained due to the high dimension of the feature space.
- Most important (?): by using a suitable kernel, the SVM automatically computes all network parameters for that kernel. E.g. an RBF SVM automatically selects the number and position of hidden nodes (and the weights and bias).

Weaknesses:
- Scale (metric) dependent.
- Slow training (compared to RBFNs/MLPs) due to the computationally intensive solution of the QP problem, especially for large amounts of training data, so special algorithms are needed.
- Generates complex solutions (normally > 60% of training points are used as support vectors), especially for large amounts of training data. E.g. from Haykin: an increase in performance of 1.5% over an MLP; however, the MLP used 2 hidden nodes while the SVM used 285.
- Difficult to incorporate prior knowledge.

The SVM was proposed by Vapnik and colleagues in the 70’s but only became popular recently (early 90’s). It (and other kernel techniques) is currently a very active (and trendy) topic of research. For recent developments see, for example, the book: An Introduction to Support Vector Machines (and other kernel-based learning methods), N. Cristianini and J. Shawe-Taylor, Cambridge University Press.

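First consider a linearly separable problem where the decision boundary is given by
$g(x) = w^T x + b = 0$, and a set of training data $X = \{(x_i, d_i) : i = 1, \dots, N\}$, where $d_i = +1$ if $x_i$ is in class 1 and $-1$ if it is in class 2. Let the optimal weight-bias combination be $w_0$ and $b_0$.

[Figure: a point $x$ decomposed into its projection $x_p$ onto the hyperplane and the normal component $x_n$, which is parallel to $w$.]

Now $x = x_p + x_n = x_p + r\, w_0 / \lVert w_0 \rVert$, where $r = \lVert x_n \rVert$. Since $g(x_p) = 0$:

$$g(x) = w_0^T\left(x_p + r \frac{w_0}{\lVert w_0 \rVert}\right) + b_0 = r\,\frac{w_0^T w_0}{\lVert w_0 \rVert} = r \lVert w_0 \rVert, \qquad \text{or} \qquad r = \frac{g(x)}{\lVert w_0 \rVert}$$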

Thus, as $g(x)$ gives us a measure of the algebraic distance to the hyperplane, we want:

$$g(x_i) = w_0^T x_i + b_0 \ge 1 \ \text{ for } d_i = +1, \qquad g(x_i) = w_0^T x_i + b_0 \le -1 \ \text{ for } d_i = -1$$

(remembering that $w_0$ and $b_0$ can be rescaled without changing the boundary), with equality for the support vectors $x_s$. Thus, considering the support vectors and that $r = g(x)/\lVert w_0 \rVert$, we have $r = 1/\lVert w_0 \rVert$ for $d_s = +1$ and $r = -1/\lVert w_0 \rVert$ for $d_s = -1$, and so the margin of separation is $\rho = 2/\lVert w_0 \rVert$. Thus the solution $w_0$ maximises the margin of separation, and maximising this margin is equivalent to minimising $\lVert w \rVert$.
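To make the distance and margin formulas concrete, here is a small sketch. The hyperplane `w = (3, 4)`, `b = -5` and the two points are hypothetical values, chosen so that $g = \pm 1$ at the points (that is, they play the role of support vectors in the canonical scaling).

```python
import math

def g(w, b, x):
    """Discriminant g(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

def algebraic_distance(w, b, x):
    """r = g(x) / ||w||: signed distance of x from the hyperplane g(x) = 0."""
    return g(w, b, x) / norm(w)

# Hypothetical hyperplane in canonical form (||w|| = 5).
w, b = [3.0, 4.0], -5.0
x_pos = [2.0, 0.0]   # g(x_pos) = +1
x_neg = [0.0, 1.0]   # g(x_neg) = -1

r_pos = algebraic_distance(w, b, x_pos)
r_neg = algebraic_distance(w, b, x_neg)
margin = 2.0 / norm(w)
print(r_pos, r_neg, margin)

# Rescaling (w, b) leaves the boundary, and hence the distances, unchanged.
w10, b10 = [10 * wi for wi in w], 10 * b
r_pos_rescaled = algebraic_distance(w10, b10, x_pos)
```

The margin equals `r_pos - r_neg`, which is why fixing $g = \pm 1$ at the support vectors turns margin maximisation into minimisation of $\lVert w \rVert$.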

We now need a computationally efficient algorithm to find $w_0$ and $b_0$ using the training data $(x_i, d_i)$. That is, we want to minimise

$$F(w) = \tfrac{1}{2} w^T w \quad \text{subject to} \quad d_i(w^T x_i + b) \ge 1 \ \text{ for } i = 1, \dots, N$$

which is known as the primal problem. Note that the cost function $F$ is convex in $w$ (so the solution is unique) and that the constraints are linear in $w$. Thus we can solve for $w$ using the technique of Lagrange multipliers (for non-mathematicians: a technique for solving constrained optimisation problems). For a geometrical interpretation of Lagrange multipliers see Bishop, 95, Appendix C.

First we construct the Lagrangian function:

$$L(w, b, \alpha) = \tfrac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left[ d_i (w^T x_i + b) - 1 \right]$$

where the $\alpha_i$ are the Lagrange multipliers. $L$ must be minimised with respect to $w$ and $b$ and maximised with respect to the $\alpha_i$ (it can be shown that such problems have a saddle point at the optimal solution). Note that the Karush-Kuhn-Tucker (or, intuitively, the maximisation/constraint) conditions mean that at the optimum:

$$\alpha_i \left[ d_i (w^T x_i + b) - 1 \right] = 0$$

This means that unless the data point is a support vector, $\alpha_i = 0$ and the respective point is not involved in the optimisation. We then set the partial derivatives of $L$ with respect to $b$ and $w$ to zero to obtain the conditions of optimality:

$$w = \sum_{i=1}^{N} \alpha_i d_i x_i \qquad \text{and} \qquad \sum_{i=1}^{N} \alpha_i d_i = 0$$

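Given such a constrained convex problem, we can reformulate the primal problem using the optimality conditions to get the equivalent dual problem: given the training data sample $\{(x_i, d_i),\ i = 1, \dots, N\}$, find the Lagrange multipliers $\alpha_i$ which maximise

$$Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j\, x_i^T x_j$$

subject to the constraints $\sum_{i=1}^{N} \alpha_i d_i = 0$ and $\alpha_i \ge 0$. Notice that the input vectors are only involved as an inner product.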
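The step from the primal to the dual can be sketched as follows, writing $\alpha_i$ for the Lagrange multipliers and $Q(\alpha)$ for the resulting dual objective. This is a worked substitution under the two optimality conditions stated earlier, not an additional result. Expanding the Lagrangian term by term:

$$L = \tfrac{1}{2} w^T w - \sum_{i} \alpha_i d_i\, w^T x_i - b \sum_{i} \alpha_i d_i + \sum_{i} \alpha_i$$

The third term vanishes because $\sum_i \alpha_i d_i = 0$. Using $w = \sum_i \alpha_i d_i x_i$:

$$w^T w = \sum_{i} \alpha_i d_i\, w^T x_i = \sum_{i} \sum_{j} \alpha_i \alpha_j d_i d_j\, x_i^T x_j$$

so the first two terms combine to $-\tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j d_i d_j\, x_i^T x_j$, leaving

$$Q(\alpha) = \sum_{i} \alpha_i - \tfrac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j d_i d_j\, x_i^T x_j$$

which depends on the training inputs only through the inner products $x_i^T x_j$.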

Once the optimal Lagrange multipliers $\alpha_{0,i}$ have been found, we can use them to find the optimal $w$:

$$w_0 = \sum_{i=1}^{N} \alpha_{0,i}\, d_i\, x_i$$

and the optimal bias from the fact that for a positive support vector $x_i$: $w_0^T x_i + b_0 = 1 \Rightarrow b_0 = 1 - w_0^T x_i$. [However, from a numerical perspective it is better to take the mean value of $b_0$ resulting from all such data points in the sample.] Since $\alpha_{0,i} = 0$ if $x_i$ is not a support vector, only the support vectors determine the optimal hyperplane, which was our intuition.
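As an illustration of recovering the hyperplane from the multipliers, consider a hypothetical two-point training set, $x_1 = (1, 0)$ with $d_1 = +1$ and $x_2 = (-1, 0)$ with $d_2 = -1$. The constraint $\sum_i \alpha_i d_i = 0$ forces both multipliers to a common value $a$, the dual objective reduces to $2a - 2a^2$, and its maximum is at $a = 1/2$. The sketch below takes those multipliers as given and applies the two formulas above.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy data set (not from the slides).
X = [[1.0, 0.0], [-1.0, 0.0]]
d = [1, -1]
alpha = [0.5, 0.5]   # maximiser of the dual for this toy set (see lead-in)

# w0 = sum_i alpha_i d_i x_i
w0 = [sum(a * di * x[k] for a, di, x in zip(alpha, d, X)) for k in range(2)]

# b0 = 1 - w0^T x_i, averaged over the positive support vectors (alpha_i > 0, d_i = +1).
b_candidates = [1.0 - dot(w0, x) for a, di, x in zip(alpha, d, X) if a > 0 and di == 1]
b0 = sum(b_candidates) / len(b_candidates)

# Both points should sit exactly on the margin: d_i (w0^T x_i + b0) = 1.
margins = [di * (dot(w0, x) + b0) for di, x in zip(d, X)]
print(w0, b0, margins)
```

Here the resulting margin $2/\lVert w_0 \rVert = 2$ is simply the distance between the two points, as expected when both are support vectors.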

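For a non-linearly separable problem we first have to map the data onto a feature space so that they are linearly separable, $x_i \to \varphi(x_i)$, with the procedure for determining $w$ the same except that $x_i$ is replaced by $\varphi(x_i)$. That is: given the training data sample $\{(x_i, d_i),\ i = 1, \dots, N\}$, find the optimum values of the weight vector $w$ and bias $b$,

$$w = \sum_{i=1}^{N} \alpha_{0,i}\, d_i\, \varphi(x_i)$$

where the $\alpha_{0,i}$ are the optimal Lagrange multipliers determined by maximising the objective function

$$Q(\alpha) = \sum_{i} \alpha_i - \tfrac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j d_i d_j\, \varphi(x_i)^T \varphi(x_j)$$

subject to the constraints $\sum_i \alpha_i d_i = 0$ and $\alpha_i \ge 0$.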

Example: XOR problem revisited
Let the nonlinear mapping be $\varphi(x) = (1,\ x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2)^T$, and $\varphi(x_i) = (1,\ x_{i1}^2,\ \sqrt{2}\,x_{i1} x_{i2},\ x_{i2}^2,\ \sqrt{2}\,x_{i1},\ \sqrt{2}\,x_{i2})^T$. The feature space is therefore 6-dimensional, with input data in 2D:

$x_1 = (-1, -1),\ d_1 = -1$; $x_2 = (-1, 1),\ d_2 = 1$; $x_3 = (1, -1),\ d_3 = 1$; $x_4 = (1, 1),\ d_4 = -1$.

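$$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j d_i d_j\, \varphi(x_i)^T \varphi(x_j)$$

Since $\varphi(x_i)^T \varphi(x_i) = 9$ and $\varphi(x_i)^T \varphi(x_j) = 1$ for $i \ne j$, this becomes

$$Q(\alpha) = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \tfrac{1}{2}\left(9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2\right)$$

To maximise $Q$, we only need to set $\partial Q / \partial \alpha_i = 0$ (due to the optimality conditions), which gives

$$\begin{aligned} 9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 &= 1 \\ -\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 &= 1 \\ -\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 &= 1 \\ \alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 &= 1 \end{aligned}$$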

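The solution of which gives the optimal values:
$\alpha_{0,1} = \alpha_{0,2} = \alpha_{0,3} = \alpha_{0,4} = 1/8$, so that

$$w_0 = \sum_i \alpha_{0,i}\, d_i\, \varphi(x_i) = \tfrac{1}{8}\left[-\varphi(x_1) + \varphi(x_2) + \varphi(x_3) - \varphi(x_4)\right] = (0,\ 0,\ -1/\sqrt{2},\ 0,\ 0,\ 0)^T$$

where the first element of $w_0$ gives the bias $b$ (here $b = 0$).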

From earlier we have that the optimal hyperplane is defined by:
$w_0^T \varphi(x) = 0$. That is, $w_0^T \varphi(x) = -x_1 x_2 = 0$, which is the optimal decision boundary for the XOR problem. Furthermore, we note that the solution is unique since the optimal decision boundary is unique.
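The XOR example can be reproduced numerically. The sketch below takes the fourth XOR pattern as $(1, 1)$ with target $-1$, builds the matrix $H_{ij} = d_i d_j\, \varphi(x_i)^T \varphi(x_j)$, and maximises the dual objective by plain gradient ascent. The step size and iteration count are illustrative assumptions, and, as in the worked example, the constraints on the multipliers are not enforced during the ascent but only checked afterwards.

```python
import math

X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
d = [-1, 1, 1, -1]
N = len(X)

def phi(x):
    """Feature map of the XOR example (6-dimensional)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# H[i][j] = d_i d_j phi(x_i)^T phi(x_j), so Q(alpha) = sum(alpha) - 0.5 * alpha^T H alpha.
H = [[d[i] * d[j] * dot(phi(X[i]), phi(X[j])) for j in range(N)] for i in range(N)]

# Gradient ascent on Q: dQ/dalpha_i = 1 - sum_j H[i][j] * alpha_j.
alpha = [0.0] * N
eta = 0.1            # assumed step size, small enough for this H
for _ in range(200):
    grad = [1.0 - sum(H[i][j] * alpha[j] for j in range(N)) for i in range(N)]
    alpha = [a + eta * gi for a, gi in zip(alpha, grad)]

# Check the constraints that were not enforced explicitly.
balance = sum(a * di for a, di in zip(alpha, d))   # should be ~0
all_nonnegative = all(a >= 0 for a in alpha)

# w0 = sum_i alpha_i d_i phi(x_i); the output is w0^T phi(x).
w0 = [sum(alpha[i] * d[i] * phi(X[i])[k] for i in range(N)) for k in range(6)]

def output(x):
    return dot(w0, phi(x))

for x, di in zip(X, d):
    print(x, di, round(output(x), 6))
```

Only the third component of `w0` is non-zero, so the output reduces to a multiple of the cross term $x_1 x_2$, which is what separates the XOR classes.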

[Figure: output for polynomial RBF.]

SVM building procedure:
1. Pick a nonlinear mapping $\varphi$.
2. Solve for the optimal weight vector.

However, how do we pick the function $\varphi$? In practical applications, if it is not totally impossible to find $\varphi$, it is very hard. In the previous example, the function $\varphi$ is quite complex: how would we find it? Answer: the kernel trick.

Notice that in the dual problem the images of the input vectors are only involved as an inner product, meaning that the optimisation can be performed in the (lower dimensional) input space and that the inner product can be replaced by an inner-product kernel:

$$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j d_i d_j\, \varphi(x_i)^T \varphi(x_j) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j d_i d_j\, K(x_i, x_j)$$

How do we relate the output of the SVM to the kernel $K$? Look at the equation of the boundary in the feature space and use the optimality conditions derived from the Lagrangian formulation.

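In the XOR problem, we chose to use the kernel function:

$$K(x, x_i) = (x^T x_i + 1)^2 = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}$$

which implied the form of our nonlinear functions: $\varphi(x) = (1,\ x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2)^T$ and $\varphi(x_i) = (1,\ x_{i1}^2,\ \sqrt{2}\,x_{i1} x_{i2},\ x_{i2}^2,\ \sqrt{2}\,x_{i1},\ \sqrt{2}\,x_{i2})^T$. However, we did not need to calculate $\varphi$ at all and could simply have used the kernel to calculate

$$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j d_i d_j\, K(x_i, x_j)$$

maximised it, solved for the $\alpha_i$, and derived the hyperplane via:

$$w_0^T \varphi(x) = \sum_i \alpha_{0,i}\, d_i\, K(x_i, x) = 0$$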
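That the degree-2 polynomial kernel and the explicit 6-dimensional feature map give the same inner products can be checked directly. In this sketch the test points are arbitrary illustrative values, and `poly2_kernel` is a hypothetical helper name.

```python
import math

def phi(x):
    """Explicit feature map for the kernel (x^T y + 1)^2 in 2D."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """K(x, y) = (x^T y + 1)^2, computed in the 2D input space."""
    return (dot(x, y) + 1.0) ** 2

# Arbitrary illustrative points.
points = [(-1.0, -1.0), (0.5, 2.0), (3.0, -0.25), (1.0, 1.0)]

# Largest disagreement between the kernel and the feature-space inner product.
max_gap = max(abs(poly2_kernel(x, y) - dot(phi(x), phi(y)))
              for x in points for y in points)
print("largest difference:", max_gap)
```

The kernel needs one 2D inner product per pair of points, whereas the explicit route needs two 6D feature vectors first; for higher-degree or RBF kernels the feature space grows far larger while the kernel cost stays the same.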

We therefore only need a suitable choice of kernel function, cf.
Mercer’s theorem: let $K(x, y)$ be a continuous symmetric kernel that is defined in the closed interval $[a, b]$. The kernel $K$ can be expanded in the form $K(x, y) = \varphi(x)^T \varphi(y)$ provided it is positive definite. Some of the usual choices for $K$ are:

| SVM type | Kernel $K(x, x_i)$ | Comment |
| --- | --- | --- |
| Polynomial SVM | $(x^T x_i + 1)^p$ | $p$ specified by user |
| RBF SVM | $\exp\left(-\frac{1}{2\sigma^2} \lVert x - x_i \rVert^2\right)$ | $\sigma$ specified by user |
| MLP SVM | $\tanh(s_0\, x^T x_i + s_1)$ | Mercer’s theorem not satisfied for all $s_0$ and $s_1$ |
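The three kernel choices can be written down in a few lines. This is a minimal sketch: the function names and default parameter values are assumptions made for illustration. It also shows one simple way the MLP kernel can fail Mercer’s condition: a positive definite kernel must satisfy $K(x, x) \ge 0$, but with a negative offset $s_1$ the tanh kernel is negative at the origin.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, xi, p=2):
    """(x^T xi + 1)^p, with the degree p chosen by the user."""
    return (dot(x, xi) + 1.0) ** p

def rbf_kernel(x, xi, sigma=1.0):
    """exp(-||x - xi||^2 / (2 sigma^2)), with the width sigma chosen by the user."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mlp_kernel(x, xi, s0=1.0, s1=-1.0):
    """tanh(s0 x^T xi + s1); not a Mercer kernel for every (s0, s1)."""
    return math.tanh(s0 * dot(x, xi) + s1)

a = [1.0, 0.0]
b = [0.0, 1.0]
origin = [0.0, 0.0]

print(polynomial_kernel(a, b))   # orthogonal inputs
print(rbf_kernel(a, a))          # identical inputs
print(rbf_kernel(a, b))          # inputs a distance sqrt(2) apart
print(mlp_kernel(origin, origin))  # negative "self-similarity" for s1 < 0
```

A negative value of $K(x, x)$ cannot be an inner product $\varphi(x)^T \varphi(x)$, so for these assumed parameters the tanh kernel has no feature-space expansion of the Mercer form.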

How to recover $\varphi$ from a given $K$? It is not essential that we do.
Further development:
1. In practical applications, it is found that the support vector machine can outperform other learning machines.
2. How to choose the kernel?
3. How much better is the SVM compared with traditional machines? See Feng J. and Williams P. M. (2001), The generalization error of the symmetric and scaled support vector machine, IEEE Transactions on Neural Networks, Vol. 12, No. 5.

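It is important that we don’t allow noise to spoil our generalisation: we want a soft margin of separation. Introduce slack variables $\epsilon_i \ge 0$ such that $d_i(w^T x_i + b) \ge 1 - \epsilon_i$ for $i = 1, \dots, N$, rather than $d_i(w^T x_i + b) \ge 1$.

[Figure: three points relative to the margin, with $\epsilon_i = 0$ (on the margin), $0 < \epsilon_i \le 1$ (inside the margin but correctly classified) and $\epsilon_i > 1$ (misclassified).]

All 3 are support vectors, since for each of them $d_i(w^T x_i + b) = 1 - \epsilon_i$.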

Thus the slack variables measure our deviation from the ideal pattern separability and also allow us some freedom in specifying the hyperplane. We therefore formulate a new problem, to minimise

$$F(w, \epsilon) = \tfrac{1}{2} w^T w + C \sum_{i} \epsilon_i$$

subject to $d_i(w^T x_i + b) \ge 1 - \epsilon_i$ for $i = 1, \dots, N$ and $\epsilon_i \ge 0$, where $C$ acts as an (inverse) regularisation parameter which can be determined experimentally or analytically.
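For a fixed hyperplane, the smallest slack that satisfies both soft-margin constraints for a point is $\max(0,\ 1 - d_i(w^T x_i + b))$, so the soft-margin cost is easy to evaluate. The sketch below does this for a hypothetical hyperplane and four hypothetical points; it evaluates the cost only and does not optimise it.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slack(w, b, x, d):
    """Smallest slack >= 0 with d (w^T x + b) >= 1 - slack."""
    return max(0.0, 1.0 - d * (dot(w, x) + b))

def soft_margin_cost(w, b, X, D, C):
    """F(w, slack) = 0.5 w^T w + C * sum of slacks."""
    return 0.5 * dot(w, w) + C * sum(slack(w, b, x, d) for x, d in zip(X, D))

# Hypothetical hyperplane and data (not from the slides).
w, b = [1.0, 0.0], 0.0
X = [[2.0, 0.0],    # well outside the margin      -> slack 0
     [0.5, 1.0],    # inside the margin, correct   -> slack between 0 and 1
     [-0.5, 0.0],   # wrong side of the boundary   -> slack above 1
     [-1.0, 1.0]]   # exactly on the margin        -> slack 0
D = [1, 1, 1, -1]

slacks = [slack(w, b, x, d) for x, d in zip(X, D)]
cost = soft_margin_cost(w, b, X, D, C=10.0)
print(slacks, cost)
```

A larger `C` makes each unit of slack more expensive, pushing the solution towards fewer margin violations at the price of a larger $\lVert w \rVert$ (a narrower margin).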

The solution proceeds in the same way as before (Lagrangian, formulate the dual and maximise) to obtain the optimal $\alpha_i$ for

$$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j d_i d_j\, K(x_i, x_j)$$

subject to the constraints $\sum_i \alpha_i d_i = 0$ and $0 \le \alpha_i \le C$. Thus the nonseparable problem differs from the separable one only in that the second constraint is more stringent. Again the optimal solution is $w_0 = \sum_i \alpha_{0,i}\, d_i\, \varphi(x_i)$. However, this time the KKT conditions imply that $\epsilon_i = 0$ if $\alpha_i < C$.

SVMs for non-linear regression
SVMs can also be used for non-linear regression. However, unlike MLPs and RBFNs, the formulation does not follow directly from the classification case. Starting point: we have input data $X = \{(x_1, d_1), \dots, (x_N, d_N)\}$, where $x_i$ is in $D$ dimensions and $d_i$ is a scalar. We want to find a robust function $f(x)$ that has at most $\epsilon$ deviation from the targets $d$, while at the same time being as flat (in the regularisation sense of a smooth boundary) as possible.

Thus, setting $f(x) = w^T \varphi(x) + b$, the problem becomes: minimise $\tfrac{1}{2} w^T w$ (for flatness) [think of the gradient between (0,0) and (1,1) if the weights are (1,1) vs (1000, 1000)] subject to:

$$d_i - w^T \varphi(x_i) - b \le \epsilon, \qquad w^T \varphi(x_i) + b - d_i \le \epsilon$$

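[Figure: the $\epsilon$-insensitive loss $L(f, y)$, which is zero between $-\epsilon$ and $+\epsilon$.]

This formulation is called $\epsilon$-insensitive regression, as it is equivalent to minimising the empirical risk (the amount you might be wrong) using an $\epsilon$-insensitive loss function:

$$L(f, d, x) = \begin{cases} |f(x) - d| - \epsilon & \text{for } |f(x) - d| \ge \epsilon \\ 0 & \text{otherwise} \end{cases}$$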
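The loss can be written in one line as $\max(0,\ |f(x) - d| - \epsilon)$. The sketch below compares it with the squared-error loss on a few hypothetical residuals; the function names and the numbers are illustrative assumptions.

```python
def eps_insensitive_loss(f_x, d, eps):
    """Zero inside the tube |f(x) - d| < eps, linear outside it."""
    return max(0.0, abs(f_x - d) - eps)

def squared_loss(f_x, d):
    """Least-squares loss, for comparison."""
    return (f_x - d) ** 2

eps = 0.1
target = 0.0
predictions = [0.05, 0.5, 3.0]   # hypothetical outputs f(x) for the same target

eps_losses = [eps_insensitive_loss(f, target, eps) for f in predictions]
sq_losses = [squared_loss(f, target) for f in predictions]

for f, le, ls in zip(predictions, eps_losses, sq_losses):
    print(f"f(x) = {f}: eps-insensitive = {le:.3f}, squared = {ls:.3f}")
```

Small errors inside the tube cost nothing, and large errors grow linearly rather than quadratically, which is the sense in which this loss is less sensitive to outliers.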

Comparing the $\epsilon$-insensitive loss function to the least squares loss function (used for MLP/RBFN):
- More robust (robust to small changes in data/model)
- Less sensitive to outliers
- Non-continuous derivative

The cost function is $C \sum_i L(f, d_i, x_i)$, where $C$ can be viewed as a regularisation parameter.

[Figure: regression for different $\epsilon$ ($\epsilon$ = 0.1, 0.2, 0.5), showing the original function (O) and the bounds O + $\epsilon$ and O - $\epsilon$. The function selected is the flattest.]

We now introduce 2 slack variables, $\epsilon_i$ and $\epsilon_i^*$, as in the case of nonlinearly separable data, and write:

$$d_i - w^T \varphi(x_i) - b \le \epsilon + \epsilon_i, \qquad w^T \varphi(x_i) + b - d_i \le \epsilon + \epsilon_i^*$$

where $\epsilon_i, \epsilon_i^* \ge 0$. Thus $C \sum_i L(f, d_i, x_i) = C \sum_i (\epsilon_i + \epsilon_i^*)$, and the problem becomes to minimise

$$F(w, \epsilon) = \tfrac{1}{2} w^T w + C \sum_i (\epsilon_i + \epsilon_i^*)$$

subject to the two constraints above and $\epsilon_i, \epsilon_i^* \ge 0$.

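We now form the Lagrangian and find the dual. Note that this time there will be 2 sets of Lagrange multipliers, $\alpha_i$ and $\alpha_i^*$, as there are 2 constraints. The dual to be maximised is:

$$Q(\alpha, \alpha^*) = \sum_i d_i (\alpha_i - \alpha_i^*) - \epsilon \sum_i (\alpha_i + \alpha_i^*) - \tfrac{1}{2} \sum_i \sum_j (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j)$$

subject to $\sum_i (\alpha_i - \alpha_i^*) = 0$ and $0 \le \alpha_i, \alpha_i^* \le C$, where $\epsilon$ and $C$ are free parameters that control the approximating function:

$$f(x) = w^T \varphi(x) + b = \sum_i (\alpha_i - \alpha_i^*)\, K(x, x_i) + b$$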

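From the KKT conditions we now have:

$$\alpha_i \left(\epsilon + \epsilon_i - d_i + w^T \varphi(x_i) + b\right) = 0, \qquad \alpha_i^* \left(\epsilon + \epsilon_i^* + d_i - w^T \varphi(x_i) - b\right) = 0$$

This means that the Lagrange multipliers will only be non-zero for points where $|f(x_i) - d_i| \ge \epsilon$, that is, only for points on or outside the $\epsilon$ tube. Thus these points are the support vectors, and we have a sparse expansion of $w$ in terms of the $x_i$.

[Figure: the $\epsilon$ tube ($+\epsilon$, $-\epsilon$) around the regression function.]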
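The condition $|f(x_i) - d_i| \ge \epsilon$ can be illustrated by flagging which points lie on or outside the tube for several tube widths. This is only a sketch under a strong simplification: the function values `f_values` are fixed, hypothetical numbers, whereas in a trained SVM the fitted function itself changes with $\epsilon$.

```python
def on_or_outside_tube(f_values, targets, eps):
    """Indices i with |f(x_i) - d_i| >= eps, i.e. candidate support vectors."""
    return [i for i, (f, d) in enumerate(zip(f_values, targets))
            if abs(f - d) >= eps]

# Hypothetical function outputs and targets (absolute residuals 0.05, 0.3, 0.15, 0.0, 0.6).
f_values = [0.0, 1.0, 2.0, 3.0, 4.0]
targets = [0.05, 1.3, 1.85, 3.0, 3.4]

candidates = {eps: on_or_outside_tube(f_values, targets, eps)
              for eps in (0.02, 0.1, 0.2, 0.5)}

for eps, idx in candidates.items():
    print(f"eps = {eps}: {len(idx)} candidate support vectors {idx}")
```

Widening the tube leaves fewer points on or outside it, which is the mechanism by which $\epsilon$ controls the sparsity of the expansion.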

[Figure: data points and support vectors (SVs) for $\epsilon$ = 0.02, 0.1, 0.2 and 0.5.]

$\epsilon$ controls the number of SVs selected.

Only non-zero $\alpha$’s can contribute: the Lagrange multipliers act like forces on the regression. However, they can only be applied at points outside or touching the $\epsilon$ tube.

[Figure: points where forces act.]

One note of warning: regression is much harder than classification, for 2 reasons:
1. Regression is intrinsically more difficult than classification.
2. $\epsilon$ and $C$ must be tuned simultaneously.

Research issues:
- Incorporation of prior knowledge, e.g. 1) train a machine, 2) add in virtual support vectors, which incorporate known invariances of the SVs found in 1, 3) retrain.
- Speeding up training time? Various techniques exist, mainly dealing with reducing the size of the data set. Chunking: use subsets of the data at a time and only keep the SVs. There are also more sophisticated versions which use linear combinations of the training points as inputs.

Optimisation packages/techniques?
Off-the-shelf ones are not brilliant (including the MATLAB one). Sequential Minimal Optimisation (SMO) is widely used. For details of that and others see: A. J. Smola and B. Schölkopf, A Tutorial on Support Vector Regression, NeuroCOLT Technical Report NC-TR , Royal Holloway College, University of London, UK, 1998.

Selection of $\epsilon$ and $C$
