


1 Support Vector Machines (Podpůrné vektorové stroje). Babak Mahdian, June 2009

2 Most of the slides are taken from presentations provided by: 1. Chih-Jen Lin (National Taiwan University), 2. Colin Campbell (Bristol University), 3. Andrew W. Moore (Carnegie Mellon University), 4. Jan Flusser (AS CR, ÚTIA).

3 Outline: 1. SVMs for binary classification. 2. Soft margins and multi-class classification.

4 A classifier derived from statistical learning theory by Vladimir Vapnik et al. in 1992. SVMs are currently widely used in object detection and recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.

5 Preliminaries: Consider a binary classification problem: the input vectors are x_i and y_i ∈ {+1, −1} are the targets or labels. The index i labels the pattern pairs (i = 1, …, m). The x_i define a space of labelled points called the input space.

6 Binary classification:

7 Let us separate the input data with a hyperplane.



10 Which of the linear separators is optimal?

11 SVMs search for the separating hyperplane that maximizes the margin between the two data sets, i.e. the hyperplane with the maximum distance from the nearest points on both sides.

12 Such a hyperplane exhibits the best generalization: it creates a “safe zone”. The closest points are called support vectors (they directly support where the hyperplane should be). Any change in the support vectors shifts the hyperplane; changes in the non-support vectors do not.
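
A minimal sketch of this idea in code, assuming scikit-learn and NumPy (the slides do not prescribe any library): fit a linear SVM on a toy dataset and inspect which training points become support vectors.

    # Sketch only: scikit-learn is an assumption, not part of the original slides.
    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical, linearly separable toy data.
    X = np.array([[1.0, 1.0], [2.0, 1.5], [2.5, 3.0],
                  [-1.0, -1.0], [-2.0, -1.5], [-2.5, -3.0]])
    y = np.array([1, 1, 1, -1, -1, -1])

    clf = SVC(kernel="linear", C=1e6)   # a very large C approximates a hard margin
    clf.fit(X, y)

    print("support vectors:\n", clf.support_vectors_)
    print("w =", clf.coef_[0], " b =", clf.intercept_[0])
    # Moving a non-support vector (while keeping it outside the margin) leaves
    # w and b unchanged; moving a support vector changes the hyperplane.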

13 In an arbitrary-dimensional space a separating hyperplane can be written as w·x + b = 0, where b is the bias, w is the weight vector, and x is a data point. Thus we will consider a decision function of the form f(x) = sign(w·x + b).

14 The hyperplane w·x + b = 0 separates the region where w·x + b > 0 from the region where w·x + b < 0.

15 Note that the sign of (w·x + b) is invariant under a positive rescaling (w, b) → (λw, λb), λ > 0, so infinitely many equations describe the same hyperplane. We implicitly fix a scale by requiring |w·x_i + b| = 1 for the support vectors.

16 Now w·x_i + b equals +1 or −1 for the support vectors, and is > 1 or < −1 for all other vectors.

17 Now we can easily show that w·(x_1 − x_2) = 2, where x_1 and x_2 are support vectors on opposite sides of the separating hyperplane.

18 The margin is given by the projection of the vector (x_1 − x_2) onto the unit normal of the hyperplane, w/‖w‖, from which we deduce that the width of the margin equals 2/‖w‖.
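
The one-line derivation behind this statement, written out (standard algebra, consistent with the two previous slides):

    \text{margin} \;=\; \frac{\mathbf{w}}{\|\mathbf{w}\|}\cdot(\mathbf{x}_1-\mathbf{x}_2)
    \;=\; \frac{\mathbf{w}\cdot\mathbf{x}_1-\mathbf{w}\cdot\mathbf{x}_2}{\|\mathbf{w}\|}
    \;=\; \frac{(1-b)-(-1-b)}{\|\mathbf{w}\|}
    \;=\; \frac{2}{\|\mathbf{w}\|} .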

19 Lagrange multipliers: In optimization, the method of Lagrange multipliers provides a strategy for finding the maximum or minimum of a function subject to constraints. For example: maximize f(x,y) subject to g(x,y) = c. We introduce a new variable λ, called a Lagrange multiplier, and study the Lagrange function Λ(x,y,λ) = f(x,y) + λ(g(x,y) − c). If (x,y) is a maximum of the original constrained problem, then there exists a λ such that (x,y,λ) is a stationary point of the Lagrange function.

20 Example 1/2: Suppose you wish to maximize f(x,y) = x + y subject to the constraint x² + y² = 1.

21 Example 2/2: Formally, set g(x,y) − c = x² + y² − 1 and Λ(x,y,λ) = f(x,y) + λ(g(x,y) − c) = x + y + λ(x² + y² − 1). Setting the derivatives of Λ to zero yields the system 1 + 2λx = 0, 1 + 2λy = 0, x² + y² − 1 = 0. Solving it gives the stationary points (√2/2, √2/2) and (−√2/2, −√2/2), the first of which gives the maximum f = √2.
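
A quick numerical check of this example (a sketch, assuming SciPy; neither the library nor this code appears on the slides):

    # Verify: maximize x + y subject to x^2 + y^2 = 1.
    from scipy.optimize import minimize

    objective = lambda v: -(v[0] + v[1])                    # maximize x+y <=> minimize -(x+y)
    constraint = {"type": "eq", "fun": lambda v: v[0]**2 + v[1]**2 - 1}

    res = minimize(objective, x0=[0.5, 0.5], method="SLSQP", constraints=[constraint])
    print(res.x)      # approx [0.7071, 0.7071], i.e. (sqrt(2)/2, sqrt(2)/2)
    print(-res.fun)   # approx 1.4142, i.e. sqrt(2)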

22 Maximization of the margin is thus equivalent to minimization of a functional subject to constraints (written out below). Note that y_i ∈ {+1, −1}. We need to optimize a quadratic function subject to linear constraints; quadratic optimization problems are well known. The Karush-Kuhn-Tucker conditions are satisfied, so we can apply the Lagrange-multiplier machinery.
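
The primal problem the slide refers to, reconstructed in the standard form (the formula itself was an image on the slide):

    \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\,\|\mathbf{w}\|^{2}
    \qquad\text{subject to}\qquad
    y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ \ge\ 1, \quad i = 1,\dots,m .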

23 Thus the task is to find an optimum of the primal objective function L_P, built from the function to be minimized and the constraints, each constraint weighted by a Lagrange multiplier α_i ≥ 0: L_P = ½‖w‖² − Σ_i α_i [y_i(w·x_i + b) − 1]. Setting the saddle-point derivative with respect to b to zero gives the constraint Σ_i α_i y_i = 0.

24 and the derivative with respect to w gives w = Σ_i α_i y_i x_i, which, when substituted back into L_P, shows that we should maximize the Wolfe dual (reconstructed below, together with its constraints). We now optimize (maximize) over the α_i alone (the rest is data), and the objective is quadratic, so this is again quadratic programming; the primal was also quadratic (in w). Optimization theory: every primal has a dual, and here the solution of the primal is the same as the solution of the dual.

25 subject to the constraints on the Lagrange multipliers α_i shown below. Important note: there is only one solution (our hyperplane); different starting points give the same solution (unlike ANNs).
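
The Wolfe dual referred to on the last two slides, in its standard form (the formulas were images on the slides):

    \max_{\boldsymbol{\alpha}}\ W(\boldsymbol{\alpha})
    = \sum_{i=1}^{m}\alpha_i
    - \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}
      \alpha_i\,\alpha_j\,y_i\,y_j\,(\mathbf{x}_i\cdot\mathbf{x}_j)
    \qquad\text{subject to}\qquad
    \alpha_i \ge 0, \quad \sum_{i=1}^{m}\alpha_i\,y_i = 0 .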

26 After finding the alphas, the decision function is the one shown below, where z is a new data point.
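
The decision function in its standard linear-SVM form (the slide showed it as an image):

    f(\mathbf{z}) \;=\; \operatorname{sign}\!\Bigl(\sum_{i=1}^{m}\alpha_i\,y_i\,(\mathbf{x}_i\cdot\mathbf{z}) + b\Bigr) .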

27 Example of the solved multipliers for ten training points: α_1 = 0.8, α_2 = 0, α_3 = 0, α_4 = 0, α_5 = 0, α_6 = 1.4, α_7 = 0, α_8 = 0.6, α_9 = 0, α_10 = 0. After finding the alphas, some are 0 (non-support vectors) and some are not; as mentioned before, shifting the non-support vectors does not shift the hyperplane.

28 Some alphas are zero, some are non-zero, and sometimes some are very large. There are two reasons for a very large alpha: 1. a correct but unusual data point, or 2. an outlier. Very large alphas have a big influence on the hyperplane's position.

29 Example (find the SVM): We see that the optimal hyperplane is g(x): x_1 = 0. All four points are support vectors. The margin size is 2.

30 Example (find the SVM): The equation of a hyperplane is w_1 x_1 + w_2 x_2 + w_0 = 0. We need to minimize ½‖w‖² subject to the constraints y_i (wᵀx_i + w_0) ≥ 1 for i = 1, …, 4. Substituting the four points gives four linear inequality constraints.

31 Example (find the SVM): Form the Lagrangian of this problem (the objective plus the four constraints, each weighted by a multiplier α_i).

32 Example (find the SVM): Setting the derivatives to zero gives 7 equations in 7 unknowns. Solving them admits more than one set of multipliers, but every solution ends up with w_1 = 1, w_2 = 0, w_0 = 0.
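
A quick check of this worked example (sketch only, assuming scikit-learn; the four data points are not legible in this transcript, so points consistent with the stated solution, (±1, ±1), are assumed):

    # Hypothetical reconstruction of the four-point example: (1,1) and (1,-1)
    # labelled +1, (-1,1) and (-1,-1) labelled -1 (an assumption, not from the slides).
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
    y = np.array([1, 1, -1, -1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
    print(clf.coef_, clf.intercept_)              # approx [[1. 0.]] and [0.]
    print(clf.support_vectors_)                   # points on the margin planes x1 = +/-1 (solver may report a subset)
    print(2 / np.linalg.norm(clf.coef_))          # margin width, approx 2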

33 Recap: 1. The classifier is a separating hyperplane. 2. The most “important” training points are the support vectors; they define the hyperplane. 3. Quadratic optimization algorithms can identify which training points x_i are support vectors, namely those with non-zero Lagrange multipliers α_i.

34 So far we don't know how to handle non-separable datasets. Rather than fitting nonlinear curves to the data, the SVM handles this by using a kernel function to map the data into a different space, where a hyperplane can be used to do the separation.

35 The kernel function may transform the data into a higher-dimensional space to make the separation possible.

36 In the dual objective function we notice that the data points x_i only appear inside an inner product. To get a better representation of the data we can therefore map the data points into an alternative, higher-dimensional space through the replacement x_i·x_j → Φ(x_i)·Φ(x_j), i.e. we have used a mapping x → Φ(x). This higher-dimensional space must be a Hilbert space.

37 The function K(x_i, x_j) = Φ(x_i)·Φ(x_j) will be called a kernel: a kernel function is a function that is equivalent to an inner product in some feature space.

38 Examples of kernels: Linear: K(x_i, x_j) = x_iᵀx_j. Polynomial of degree p: K(x_i, x_j) = (1 + x_iᵀx_j)^p. Gaussian (radial basis function): K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)).

39 The kernel function implicitly maps data to a high-dimensional space (without the need to compute each Φ(x) explicitly). Example: for 2-dimensional vectors x = [x_1, x_2], let K(x_i, x_j) = (1 + x_iᵀx_j)². We need to show that K(x_i, x_j) = Φ(x_i)ᵀΦ(x_j):
K(x_i, x_j) = (1 + x_iᵀx_j)² = 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2
= [1, x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2]ᵀ [1, x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2]
= Φ(x_i)ᵀΦ(x_j), where Φ(x) = [1, x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2].
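
A numerical sanity check of this identity (a sketch assuming NumPy; not part of the original slides):

    # Check that the implicit kernel (1 + x.z)^2 equals the explicit inner product phi(x).phi(z).
    import numpy as np

    def phi(x):
        # Explicit feature map for the degree-2 polynomial kernel above.
        x1, x2 = x
        return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2,
                         np.sqrt(2)*x1, np.sqrt(2)*x2])

    x = np.array([0.3, -1.2])
    z = np.array([2.0, 0.5])

    print((1.0 + x @ z) ** 2)   # implicit: kernel evaluation in the input space
    print(phi(x) @ phi(z))      # explicit: inner product in the feature space (same value)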

40 Which functions are kernels? Kernel functions should satisfy Mercer's condition: for any g(x) with finite square integral, the double integral of K against g(x)g(y) must be non-negative (written out below).
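
Mercer's condition in its usual form (the slide showed the two formulas as images; this is the standard statement):

    \text{For every } g \text{ such that } \int g(x)^{2}\,dx < \infty :
    \qquad
    \iint K(x,y)\,g(x)\,g(y)\,dx\,dy \;\ge\; 0 .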

41 A simple criterion is that the kernel should be positive semi-definite. If a kernel is positive semi-definite, i.e. the condition below holds, where {c_1, …, c_n} are real numbers, then there exists a function Φ(x) defining an inner product of possibly higher dimension, i.e. K(x_i, x_j) = Φ(x_i)·Φ(x_j).
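
The positive semi-definiteness condition, written out (standard form; it appeared as an image on the slide):

    \sum_{i=1}^{n}\sum_{j=1}^{n} c_i\,c_j\,K(x_i, x_j) \;\ge\; 0
    \quad \text{for all real } c_1,\dots,c_n \text{ and all points } x_1,\dots,x_n .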

42 Thus the following steps are used to train an SVM: 1. Choose a kernel function K(x_i, x_j). 2. Maximize the dual Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) subject to α_i ≥ 0 and Σ_i α_i y_i = 0.

43 3. The bias b is found from any support vector x_s, e.g. b = y_s − Σ_i α_i y_i K(x_i, x_s). 4. The optimal α_i go into the decision function f(z) = sign(Σ_i α_i y_i K(x_i, z) + b).
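
In practice these steps collapse to a few lines with an off-the-shelf solver. A minimal sketch assuming scikit-learn (which is not prescribed by the slides):

    # Train an SVM with an RBF kernel and classify new points.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=-2, size=(50, 2)), rng.normal(loc=+2, size=(50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # step 1: choose the kernel
    clf.fit(X, y)                                   # steps 2-3: solve the dual, find b

    z = np.array([[0.5, 1.5]])                      # step 4: evaluate the decision function
    print(clf.decision_function(z), clf.predict(z))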

44 Choosing the kernel function and its parameters: probably the trickiest part of using an SVM. Many principles have been proposed. In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good first try for most applications. For text classification, the linear kernel is often said to be the best choice because of the high feature dimension.
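
In practice the kernel parameters (and the soft-margin parameter C, introduced later) are usually tuned by cross-validated grid search. A sketch assuming scikit-learn (the slides name no tool for this):

    # Cross-validated search over the RBF kernel width (gamma) and the penalty C.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)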

45 Multi-class problems: The SVM is constructed for binary classification, but many problems involve multi-class classification. A number of schemes have been outlined. One of the simplest is to use a directed acyclic graph (DAG), with the learning task reduced to binary classification at each node.

46 We see that this approach is good for a small number of classes (for example, 7). What about a large number of classes (for example, 100)? One way is to use the “one against all” approach.
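
A sketch of the one-against-all scheme with a library multi-class wrapper, assuming scikit-learn (OneVsRestClassifier trains one binary SVM per class):

    # One-vs-rest multi-class classification built from binary SVMs.
    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)                 # 3 classes

    ovr = OneVsRestClassifier(SVC(kernel="linear"))   # one binary SVM per class
    ovr.fit(X, y)
    print(len(ovr.estimators_))                       # 3 binary classifiers
    print(ovr.predict(X[:5]))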

47 Soft margins: Most real-life datasets contain noise. An SVM can fit this noise, leading to poor generalization (the problem of overfitting). Instead, we allow some misclassifications, which can be readily viewed as a relaxation of the hard-margin constraint.

48 The effect of outliers and noise can be reduced by introducing a soft margin. Two schemes are commonly used. L1 error norm: we bound the alphas (the Lagrange multipliers); this is reasonable, since abnormal points have large alpha values. L2 error norm: a small element is added to the diagonal when the kernel matrix is constructed.

49 Effect of soft margins (L1 error norm).

50 For the L1 error norm (prior to introducing kernels) we introduce positive slack variables ξ_i, so some points are allowed to be within the margin bound. We then minimize the sum of the errors and, at the same time, ‖w‖ (the condition that maximizes the margin); see the formulation below. C is a trade-off: if C is infinite we recover the normal hard margin; if C is finite, some of the slack variables will be non-zero.
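
The soft-margin primal the slide describes, in its standard form (the formula was an image):

    \min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \tfrac{1}{2}\,\|\mathbf{w}\|^{2} + C\sum_{i=1}^{m}\xi_i
    \qquad\text{subject to}\qquad
    y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ \ge\ 1-\xi_i, \quad \xi_i \ge 0 .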


52 This is readily formulated as a primal objective function, with Lagrange multipliers α_i ≥ 0 for the margin constraints and r_i ≥ 0 for the constraints ξ_i ≥ 0.

53 The derivatives with respect to w, b and ξ give the stationarity conditions (including C − α_i − r_i = 0). Resubstituting back into the primal objective function, we obtain the same dual objective function as before. However, r_i ≥ 0 and C − α_i − r_i = 0, hence α_i ≤ C, and the constraint 0 ≤ α_i is replaced by 0 ≤ α_i ≤ C.

54 Patterns with values 0 < α_i < C will be referred to as non-bound, and those with α_i = 0 or α_i = C will be said to be at bound. There is some theory about the optimal value of C, but in practice it must be found by experimentation using a validation set.

55 In an alternative approach, ν-SVM, it can be shown that solutions for an L1 error norm are the same as those obtained from maximizing a modified dual, where the parameter ν lies in the range 0 to 1. This gives a better and more transparent interpretation than the soft-margin parameter C: the fraction of training errors is upper-bounded by ν, and ν also provides a lower bound on the fraction of points that are support vectors.
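
A sketch of the ν-parameterization with a library implementation, assuming scikit-learn's NuSVC (an implementation of ν-SVM):

    # nu bounds the fraction of margin errors from above and the fraction of support vectors from below.
    from sklearn.datasets import make_classification
    from sklearn.svm import NuSVC

    X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=0)

    clf = NuSVC(nu=0.2, kernel="rbf")
    clf.fit(X, y)
    print(len(clf.support_) / len(X))   # fraction of support vectors, roughly >= nu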

56 For many real-life datasets there is an imbalance between the amount of data in different classes, or the significance of the data in the two classes can be quite different. The relative balance between the detection rates for the different classes can easily be shifted by introducing asymmetric soft-margin parameters: for binary classification with an L1 error norm, one penalty C+ is used for points with y_i = +1 and another penalty C− for points with y_i = −1. This allows control of the false-positive rate; we effectively favour one class.
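
A sketch of asymmetric penalties with a library implementation, assuming scikit-learn (class_weight rescales C per class):

    # Penalize mistakes on the rare positive class 10x more than on the negative class.
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

    clf = SVC(kernel="rbf", C=1.0, class_weight={1: 10.0})   # effective C is 10 for class 1
    clf.fit(X, y)
    print(clf.score(X, y))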

57 Some SVM software:
LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
SVM Light: http://svmlight.joachims.org/
TinySVM: http://chasen.org/~taku/software/TinySVM/
WEKA: http://www.cs.waikato.ac.nz/ml/weka/

58 Conclusion: SVMs are currently among the better performers for a number of classification tasks. SVM techniques have been extended to other tasks such as regression. Tuning SVMs remains a black art: selecting a specific kernel and its parameters is usually done in a try-and-see manner (a lengthy series of experiments in which various parameters are tested).

