1
**Support Vector Machines and Kernel Machines**

Ata Kaban The University of Birmingham

2
**Remember the XOR problem?**

Today we learn how to solve it.

3
**Remember the XOR problem?**

4
**Support Vector Machines (SVM)**

A method for supervised learning problems: classification and regression.

Two key ideas: 1) assuming linearly separable classes, learn the separating hyperplane with maximum margin; 2) expand the input into a high-dimensional space to deal with linearly non-separable cases (such as the XOR).

5
**Separating hyperplane**

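Training set: $(x_i, y_i)$, $i = 1, 2, \dots, N$; $y_i \in \{+1, -1\}$.

Hyperplane: $w \cdot x + b = 0$. This is fully determined by $(w, b)$, where $x = (x_1, x_2, \dots, x_d)$, $w = (w_1, w_2, \dots, w_d)$, and $w \cdot x = w_1 x_1 + w_2 x_2 + \dots + w_d x_d$ is the dot product.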
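As a minimal sketch of the decision rule that such a hyperplane implies, the following Python evaluates $w \cdot x + b$ and takes its sign. The weight vector, bias and test points are made-up illustrative values, not taken from the slides.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def decision_value(w, b, x):
    """Value of w.x + b; it is zero exactly on the hyperplane."""
    return dot(w, x) + b

def classify(w, b, x):
    """Predicted label: +1 on the positive side, -1 otherwise."""
    return 1 if decision_value(w, b, x) > 0 else -1

# Hypothetical hyperplane x1 + x2 - 1 = 0 in 2D.
w, b = (1.0, 1.0), -1.0
print(classify(w, b, (2.0, 2.0)))   # point on the positive side
print(classify(w, b, (0.0, 0.0)))   # point on the negative side
```

Points that lie exactly on the hyperplane are assigned to the negative class here; that tie-breaking rule is an arbitrary choice of this sketch.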

6
**Maximum margin**

According to a theorem from Learning Theory, from all possible linear decision functions the one that maximises the margin of the training set will minimise the generalisation error. [Of course, given enough data points and assuming that the data is not noisy!]

7
**Maximum margin**

[Figure: the separating hyperplane $w \cdot x + b = 0$ with normal vector $w$; $w \cdot x + b > 0$ on one side and $w \cdot x + b < 0$ on the other.]

Note 1: with $c > 0$ being a constant, the decision functions $(w, b)$ and $(cw, cb)$ are the same.

Note 2: but the margins, as measured by the outputs of the function $x \mapsto w \cdot x + b$, are not the same if we take $(cw, cb)$.

Definition: the geometric margin is the margin given by the canonical decision function, which is when $c = 1/\|w\|$.

Strategy: 1) we need to maximise the geometric margin (cf. the result from learning theory), 2) subject to the constraint that the training examples are classified correctly.
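The two notes can be checked numerically. The sketch below, with a hypothetical hyperplane and example point chosen so that the arithmetic is exact, shows that rescaling $(w, b)$ by $c$ changes the raw output but leaves the geometric margin unchanged.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def functional_margin(w, b, x, y):
    """y * (w.x + b): depends on the scale of (w, b)."""
    return y * (dot(w, x) + b)

def geometric_margin(w, b, x, y):
    """Functional margin of the rescaled pair (w/||w||, b/||w||)."""
    return functional_margin(w, b, x, y) / math.sqrt(dot(w, w))

w, b = (3.0, 4.0), -5.0          # hypothetical hyperplane, ||w|| = 5
x, y = (3.0, 4.0), 1             # hypothetical positive example
c = 10.0
cw, cb = tuple(c * wi for wi in w), c * b

print(functional_margin(w, b, x, y), functional_margin(cw, cb, x, y))
print(geometric_margin(w, b, x, y), geometric_margin(cw, cb, x, y))
```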

8
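**Maximum margin**

According to Note 1, we can demand the function output for the nearest points to be +1 and -1 on the two sides of the decision function. This removes the scaling freedom.

Denoting a nearest positive example $x_+$ and a nearest negative example $x_-$, this is

$$w \cdot x_+ + b = +1, \qquad w \cdot x_- + b = -1.$$

Computing the geometric margin (that has to be maximised):

$$\frac{1}{2}\left(\frac{w}{\|w\|} \cdot x_+ - \frac{w}{\|w\|} \cdot x_-\right) = \frac{1}{2\|w\|}\big[(1 - b) - (-1 - b)\big] = \frac{1}{\|w\|}.$$

And here are the constraints:

$$w \cdot x_i + b \ge +1 \ \text{ for } y_i = +1, \qquad w \cdot x_i + b \le -1 \ \text{ for } y_i = -1.$$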

9
**Maximum margin – summing up**

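Given a linearly separable training set $(x_i, y_i)$, $i = 1, 2, \dots, N$; $y_i \in \{+1, -1\}$:

Minimise $\|w\|^2$

Subject to $y_i (w \cdot x_i + b) \ge 1$, $i = 1, 2, \dots, N$.

This is a quadratic programming problem with linear inequality constraints. There are well-known procedures for solving it.

[Figure: the separating hyperplane $w \cdot x + b = 0$ with the margin boundaries $w \cdot x + b = 1$ and $w \cdot x + b = -1$; the regions beyond them satisfy $w \cdot x + b > 1$ and $w \cdot x + b < -1$.]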
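To make the optimisation problem concrete, here is a small sketch that checks the constraints $y_i (w \cdot x_i + b) \ge 1$ and compares the objective $\|w\|^2$ for a few candidate hyperplanes. The toy data set and the candidates are invented for illustration; no solver is involved.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical linearly separable data: (point, label).
data = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((-2.0, 0.0), -1), ((-3.0, -1.0), -1)]

def is_feasible(w, b, points):
    """True if every training example satisfies y * (w.x + b) >= 1."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in points)

def objective(w):
    """The quantity to minimise: ||w||^2."""
    return dot(w, w)

candidates = [((1.0, 0.0), 0.0), ((0.5, 0.0), 0.0), ((0.25, 0.0), 0.0)]
for w, b in candidates:
    print(w, b, is_feasible(w, b, data), objective(w))
```

Among the feasible candidates, the one with the smaller $\|w\|^2$ has the wider margin; the third candidate has an even smaller norm but violates the constraints.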

10
**Support vectors**

The training points that are nearest to the separating function are called support vectors. What is the output of our decision function for these points?
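Under the scaling that fixes the output at the nearest points to +1 and -1, the support vectors are exactly the training points whose output is +1 or -1. A minimal sketch, using an invented toy data set and a hyperplane that is in canonical form for it:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy data and a canonical-form hyperplane for it.
data = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((-2.0, 0.0), -1), ((-3.0, -1.0), -1)]
w, b = (0.5, 0.0), 0.0

def output(x):
    """Decision function value w.x + b."""
    return dot(w, x) + b

def support_vectors(points, tol=1e-9):
    """Points that sit on the margin, i.e. y * (w.x + b) equals 1."""
    return [(x, y) for x, y in points if abs(y * output(x) - 1.0) <= tol]

for x, y in support_vectors(data):
    print(x, y, output(x))
```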

11
**Solving (not req for exam)**

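Construct the Lagrangian and minimise it with respect to $w$ and $b$ (minimising $\frac{1}{2}\|w\|^2$ is equivalent to minimising $\|w\|^2$):

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right], \qquad \alpha_i \ge 0.$$

Take derivatives with respect to $w$ and $b$, and equate them to 0:

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0.$$

The Lagrange multipliers $\alpha_i$ are called 'dual variables'. Each training point has an associated dual variable. The parameters are expressed as a linear combination of training points. Only SVs will have non-zero $\alpha_i$.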
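As an illustration of the parameters being a linear combination of training points, the sketch below recovers $w$ and $b$ from given dual variables, assuming the standard hard-margin expansion $w = \sum_i \alpha_i y_i x_i$ and the margin condition $y_s (w \cdot x_s + b) = 1$ at a support vector. The two-point data set and its dual values (worked out by hand for this toy case) are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical two-point problem; both points are support vectors.
X = [(1.0, 0.0), (-1.0, 0.0)]
Y = [1, -1]
alpha = [0.5, 0.5]   # dual variables for this toy problem

def weights(X, Y, alpha):
    """w = sum_i alpha_i * y_i * x_i, computed coordinate by coordinate."""
    dim = len(X[0])
    return tuple(sum(a * y * x[k] for a, y, x in zip(alpha, Y, X))
                 for k in range(dim))

def bias(w, X, Y, alpha, tol=1e-12):
    """b = y_s - w.x_s for any support vector s (alpha_s > 0)."""
    s = next(i for i, a in enumerate(alpha) if a > tol)
    return Y[s] - dot(w, X[s])

w = weights(X, Y, alpha)
b = bias(w, X, Y, alpha)
print(w, b)
```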

12
(not req for exam)

13
**Solving (not req for exam)**

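Plug this back into the Lagrangian to obtain the dual formulation (homework). The resulting dual, which is solved using a QP solver:

$$\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to } \alpha_i \ge 0, \; \sum_{i=1}^{N} \alpha_i y_i = 0.$$

The $b$ does not appear in the dual, so it is determined separately from the initial constraints (homework). Data enters only in the form of dot products!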
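For readers who want to check their homework, here is one way the substitution can go. It is a sketch that assumes the standard hard-margin Lagrangian written with $\frac{1}{2}\|w\|^2$ and the stationarity conditions $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$; the index $s$ for a support vector is notation introduced here.

```latex
\begin{align*}
L(w, b, \alpha)
  &= \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] \\
  &= \tfrac{1}{2}\, w \cdot w \;-\; w \cdot \sum_i \alpha_i y_i x_i
     \;-\; b \sum_i \alpha_i y_i \;+\; \sum_i \alpha_i \\
  &= \tfrac{1}{2}\, w \cdot w \;-\; w \cdot w \;-\; 0 \;+\; \sum_i \alpha_i \\
  &= \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j)
   \;=\; W(\alpha). \\[4pt]
\text{For any support vector } x_s:\quad
  & y_s (w \cdot x_s + b) = 1 \;\Longrightarrow\; b = y_s - w \cdot x_s .
\end{align*}
```

The last line uses $y_s \in \{+1, -1\}$, so that $1 / y_s = y_s$.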

14
**Classifying new data points (not req for exam)**

Once the parameters $(\alpha^*, b^*)$ are found by solving the required quadratic optimisation on the training set of points, the SVM is ready to be used for classifying new points. Given a new point $x$, its class membership is $\mathrm{sign}[f(x, \alpha^*, b^*)]$, where

$$f(x, \alpha^*, b^*) = w^* \cdot x + b^* = \sum_{i=1}^{N} \alpha_i^* y_i (x_i \cdot x) + b^*.$$

Data enters only in the form of dot products!
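A minimal sketch of this classification step, assuming the usual dual form $f(x) = \sum_i \alpha_i y_i (x_i \cdot x) + b$. Note that the training points appear only inside dot products. The training points, dual variables and bias are hypothetical values for a two-point toy problem.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical trained model: training points, labels, dual variables, bias.
X = [(1.0, 0.0), (-1.0, 0.0)]
Y = [1, -1]
alpha = [0.5, 0.5]
b = 0.0

def f(x):
    """Dual-form decision function: sum_i alpha_i y_i (x_i . x) + b."""
    return sum(a * y * dot(xi, x) for a, y, xi in zip(alpha, Y, X)) + b

def classify(x):
    """Class membership is the sign of f(x)."""
    return 1 if f(x) > 0 else -1

print(classify((3.0, 2.0)), classify((-0.5, 4.0)))
```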

15
**Solution**

The solution of the SVM, i.e. of the quadratic programming problem with linear inequality constraints, has the nice property that the data enters only in the form of dot products!

Dot product (notation and memory refresher): given $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$, the dot product of $x$ and $y$ is $x \cdot y = x_1 y_1 + x_2 y_2 + \dots + x_n y_n$.

This is nice because it allows us to make SVMs non-linear without complicating the algorithm (see the next slide).

If you want to use SVMs in practice, many software packages are available; see e.g. the Resources page at the back of this handout. If you want to understand what the software does, then you need to master the previous slides marked as 'not req for exam'.

16
**Non-linear SVMs**

Transform $x \rightarrow \phi(x)$.

The linear algorithm depends only on $x \cdot x_i$, hence the transformed algorithm depends only on $\phi(x) \cdot \phi(x_i)$.

Use a kernel function $K(x, y)$ such that $K(x, y) = \phi(x) \cdot \phi(y)$.
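To see why such a transformation helps with XOR, the sketch below maps the four XOR points into a 3D feature space where a plane separates them. The ±1 encoding of the inputs, the particular map `phi` (the usual one for a squared dot product) and the hand-picked hyperplane are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Hypothetical feature map from 2D to 3D: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

# XOR with inputs coded as -1/+1: label +1 when the two inputs differ.
xor_data = [((1, 1), -1), ((-1, -1), -1), ((1, -1), 1), ((-1, 1), 1)]

# Hand-picked hyperplane in feature space; it only looks at the x1*x2 feature.
w_feat, b_feat = (0.0, -1.0 / math.sqrt(2), 0.0), 0.0

def classify(x):
    return 1 if dot(w_feat, phi(x)) + b_feat > 0 else -1

print([classify(x) for x, _ in xor_data])
```

No line in the original 2D input space classifies all four points correctly, whereas in the feature space the single coordinate $\sqrt{2}\, x_1 x_2$ already does.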

17
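**Examples of kernels**

Example 1: $x$, $y$ points in 2D input, 3D feature space:

$$K(x, y) = (x \cdot y)^2 \ \text{(square of dot product)}, \qquad \phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2).$$

Example 2: a Gaussian kernel, e.g. $K(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$; the $\phi$ that corresponds to this kernel has infinite dimension.

Note: not every function is a proper kernel. There is a theorem, called Mercer's Theorem, that characterises proper kernels.

To test a new input $x$ when working with kernels:

$$f(x) = \mathrm{sign}\left( \sum_{i} \alpha_i y_i K(x_i, x) + b \right).$$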
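The squared dot product can be checked against an explicit feature map. This sketch assumes the standard map $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ for 2D inputs and verifies numerically that the kernel value equals the dot product in the 3D feature space; the test points are arbitrary.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k_square(x, y):
    """Kernel computed in input space: square of the dot product."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit 3D feature map assumed to correspond to k_square."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

pairs = [((1.0, 2.0), (3.0, -1.0)), ((0.5, -2.0), (4.0, 1.5))]
for x, y in pairs:
    # The two numbers on each line should agree up to rounding.
    print(k_square(x, y), dot(phi(x), phi(y)))
```

The kernel needs one dot product in 2D and one squaring, while the explicit route builds both 3D vectors first; for higher-degree or infinite-dimensional maps only the kernel route stays practical.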

19
**Making new kernels from the old**

New kernels can be made from valid kernels by allowed operations: e.g. addition, multiplication and rescaling (by a positive constant) of kernels give a proper kernel, i.e. one whose Gram matrix is positive semi-definite. Also, given a real-valued function $f(x)$ over inputs $x$, the following is a valid kernel: $K(x, y) = f(x) f(y)$.
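A rough numerical spot check of these closure rules: build the Gram matrix of each combined kernel on a few points and test that the quadratic form $z^\top G z$ is non-negative for many random vectors $z$. This checks a necessary condition only and proves nothing; the base kernels, the function `f`, the sample points and the helper names are all illustrative choices.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k_lin(x, y):    return dot(x, y)                 # base kernel 1
def k_sq(x, y):     return dot(x, y) ** 2            # base kernel 2
def k_sum(x, y):    return k_lin(x, y) + k_sq(x, y)  # addition
def k_prod(x, y):   return k_lin(x, y) * k_sq(x, y)  # multiplication
def k_scaled(x, y): return 3.0 * k_lin(x, y)         # positive rescaling
def f(x):           return 1.0 + x[0]                # any real-valued function
def k_f(x, y):      return f(x) * f(y)               # K(x, y) = f(x) f(y)
def k_neg(x, y):    return -k_lin(x, y)              # NOT a valid kernel

def gram(kernel, points):
    return [[kernel(p, q) for q in points] for p in points]

def looks_psd(G, trials=200, seed=0, tol=1e-9):
    """Spot check: z^T G z >= 0 for random z (necessary, not sufficient)."""
    rng = random.Random(seed)
    n = len(G)
    for _ in range(trials):
        z = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        q = sum(z[i] * G[i][j] * z[j] for i in range(n) for j in range(n))
        if q < -tol:
            return False
    return True

points = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 2.0)]
for k in (k_sum, k_prod, k_scaled, k_f, k_neg):
    print(k.__name__, looks_psd(gram(k, points)))
```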

20
**Using SVM for classification**

1) Prepare the data matrix. 2) Select the kernel function to use. 3) Execute the training algorithm using a QP solver to obtain the $\alpha_i$ values. 4) Unseen data can be classified using the $\alpha_i$ values and the support vectors.
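The four steps can be sketched end to end on the XOR problem. In place of a general-purpose QP solver, this sketch uses a simple coordinate-ascent loop on the dual, and it absorbs $b$ into the kernel by adding 1 so that the equality constraint $\sum_i \alpha_i y_i = 0$ disappears. Both are simplifications chosen to keep the example short, not the method of the slides, and the ±1 input coding is likewise an assumption.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Step 2: kernel choice. Squared dot product, plus 1 to absorb the bias
# (an assumption of this sketch).
def kernel(x, y):
    return dot(x, y) ** 2 + 1.0

# Step 3: stand-in for the QP solver. Coordinate ascent on the dual
#   maximise sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j K_ij,  a_i >= 0.
def train(X, Y, sweeps=200):
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Partial derivative of the dual objective w.r.t. alpha_i.
            grad = 1.0 - Y[i] * sum(alpha[j] * Y[j] * K[i][j] for j in range(n))
            # Exact 1D maximiser, clipped to keep alpha_i >= 0.
            alpha[i] = max(0.0, alpha[i] + grad / K[i][i])
    return alpha

# Step 4: classify using the alpha values and the training points.
def predict(x, X, Y, alpha):
    f = sum(a * y * kernel(xi, x) for a, y, xi in zip(alpha, Y, X))
    return 1 if f > 0 else -1

# Step 1: data matrix (XOR with -1/+1 coding) and labels.
X = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
Y = [-1, -1, 1, 1]

alpha = train(X, Y)
print(alpha)
print([predict(x, X, Y, alpha) for x in X])
```

In this symmetric toy problem the dual solution is not unique, so which of the tied points end up with non-zero $\alpha_i$ depends on the update order; the resulting decision function is the same either way.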

21
**Applications**

Handwritten digit recognition: of interest to the US Postal Service. A 4% error was obtained, and only about 4% of the training data were SVs.

Other applications: text categorisation, face detection, DNA analysis, and many others.

22
**Discriminative versus generative classification methods**

SVMs learn the discrimination boundary. They are called discriminative approaches. This is in contrast to learning a model for each class, as e.g. Bayesian classification does. The latter is called the generative approach. SVMs try to avoid overfitting in high-dimensional spaces (cf. regularisation).

23
**Conclusions**

SVMs learn linear decision boundaries (cf. perceptrons). They pick the hyperplane that maximises the margin. The optimal hyperplane turns out to be a linear combination of support vectors.

Transform nonlinear problems to a higher-dimensional space using kernel functions; then there is more chance that the classes will be linearly separable in the transformed space.

24
**Resources**

Software and a practical guide to SVM for beginners

Kernel machines website:

Burges, C. J. C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, Vol. 2, No. 2, pp. 121-167. Available from

Cristianini & Shawe-Taylor: SVM book (in the School library)
