
1 Kernel Methods: Support Vector Machines Maximum Margin Classifiers and Support Vector Machines

2 Generalized Linear Discriminant Functions. A linear discriminant function g(x) can be written as: g(x) = w_0 + Σ_i w_i x_i, i = 1, …, d (d is the number of features). We can add additional terms to obtain a quadratic discriminant function: g(x) = w_0 + Σ_i w_i x_i + Σ_i Σ_j w_ij x_i x_j. The quadratic terms introduce d(d+1)/2 additional coefficients, one for each product of attributes x_i x_j. The resulting decision surfaces are more complicated (hyperquadric surfaces).
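A minimal sketch of evaluating such a quadratic discriminant (the data, weights, and function name below are illustrative, not from the slides):

```python
import numpy as np

def quadratic_discriminant(x, w0, w, W):
    """Evaluate g(x) = w0 + sum_i w_i x_i + sum_i sum_j W_ij x_i x_j."""
    return w0 + w @ x + x @ W @ x

# Illustrative 2-D example (d = 2 features)
x = np.array([1.0, -2.0])
w0 = 0.5
w = np.array([1.0, 0.3])          # linear coefficients
W = np.array([[0.2, 0.1],          # symmetric matrix of quadratic coefficients:
              [0.1, -0.4]])        # d(d+1)/2 = 3 independent entries

print(quadratic_discriminant(x, w0, w, W))  # classify by the sign of g(x)
```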

3 Generalized Linear Discriminant Functions. We could add even more terms of the form w_ijk x_i x_j x_k and obtain the class of polynomial discriminant functions. The generalized form is g(x) = Σ_i a_i φ_i(x), or in vector form g(x) = a^t φ(x), where the summation goes over all functions φ_i(x). The φ_i(x) are called the phi or φ functions. The discriminant is now linear in the φ_i(x). The phi functions map the d-dimensional x-space into a d'-dimensional y-space. Example: g(x) = a_1 + a_2 x + a_3 x^2, with φ(x) = (1, x, x^2)^t.
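A quick sketch of the 1-D example above (the coefficient values are arbitrary, chosen only for illustration): the discriminant is quadratic in x but linear in φ(x):

```python
import numpy as np

def phi(x):
    """Map a scalar x into the 3-D phi-space (1, x, x^2)."""
    return np.array([1.0, x, x**2])

a = np.array([-1.0, 0.0, 2.0])   # illustrative weights: g(x) = -1 + 2x^2

for x in [-1.0, 0.0, 1.0]:
    g = a @ phi(x)               # linear in phi(x)
    print(x, g, "positive" if g >= 0 else "negative")
```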

4 Figure 5.5

5 Support Vector Machines. What are support vector machines (SVMs)? A very popular classifier built on the linear discriminant concepts discussed above and the new concept of margins. To begin, SVMs preprocess the data by representing all examples in a higher-dimensional space. With a sufficiently high-dimensional mapping, the classes can be separated by a hyperplane.

6 The Margin

7 The Goal in Support Vector Machines. Let t be +1 or -1 depending on whether the example x belongs to the positive or the negative class. A separating hyperplane then ensures that t g(x) >= 0 for every training example. The goal in support vector machines is to find the separating hyperplane with the largest margin, where the margin is the distance between the hyperplane and the closest example to it.

8 The Support Vectors. The distance from a pattern x to the hyperplane is g(x) / ||w||. So let us restate the objective as finding a vector w that maximizes the margin m in: t g(x) / ||w|| >= m. Because we can rescale w and leave the hyperplane in the same place, we can fix the scale so that t g(x) = 1 for the patterns closest to the hyperplane; these patterns are the support vectors. Support vectors are equally close to the hyperplane, they are the patterns that are most difficult to separate, and they are the most "informative" patterns.
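A small sketch of these quantities (the data and the separating hyperplane below are made up for illustration): the signed distance t g(x)/||w|| for every pattern, and the closest patterns as the candidate support vectors:

```python
import numpy as np

# Illustrative 2-D data, labels t in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
t = np.array([+1, +1, -1, -1])

# An assumed separating hyperplane g(x) = w . x + w0
w  = np.array([1.0, 1.0])
w0 = 0.0

g = X @ w + w0                        # g(x) for every pattern
margins = t * g / np.linalg.norm(w)   # signed distance to the hyperplane

print(margins)                        # all positive => the hyperplane separates the data
print(X[margins == margins.min()])    # closest pattern(s): the support vectors
```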

9 The Support Vectors. With the rescaling above, the constraint becomes t g(x) >= 1 and the margin is 1 / ||w||, so all we really need to do is maximize ||w||^-1 under these constraints. This gives the following optimization problem: argmin_w ½ ||w||^2 subject to t_i g(x_i) >= 1 for every training example x_i. This can be solved using Lagrange multipliers.
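A minimal sketch of solving this primal problem numerically on toy data, using a general-purpose solver (scipy is assumed to be available; a dedicated QP solver would normally be used instead):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data, labels t in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
t = np.array([+1, +1, -1, -1])

def objective(params):
    w = params[:-1]
    return 0.5 * np.dot(w, w)              # (1/2) ||w||^2

def constraint(params, i):
    w, w0 = params[:-1], params[-1]
    return t[i] * (X[i] @ w + w0) - 1.0    # t_i g(x_i) - 1 >= 0

cons = [{"type": "ineq", "fun": constraint, "args": (i,)} for i in range(len(X))]
res = minimize(objective, x0=np.zeros(X.shape[1] + 1), constraints=cons)

w, w0 = res.x[:-1], res.x[-1]
print(w, w0, "margin =", 1.0 / np.linalg.norm(w))
```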

10 The Support Vectors. What happens when there are unavoidable errors? We add a penalty term: argmin_w ½ ||w||^2 + λ Σ_i e_i subject to t_i g(x_i) >= 1 - e_i and e_i >= 0, where e_i is the error incurred by example x_i. The e_i are known as slack variables.
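This soft-margin problem is what off-the-shelf SVM implementations solve. A sketch with scikit-learn (assumed available), where the parameter C plays the role of the λ trade-off above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with one point on the "wrong" side, so some slack is unavoidable
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0], [1.5, 1.5]])
t = np.array([+1, +1, -1, -1, -1])

# Small C tolerates more slack (larger e_i); large C penalizes errors heavily
for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    print(C, clf.support_vectors_.shape[0], clf.score(X, t))
```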

11 The Support Vectors. We can write this in its dual form (via the Karush-Kuhn-Tucker conditions): max_α Σ_i α_i - ½ Σ_i Σ_j α_i α_j t_i t_j (x_i . x_j) subject to 0 <= α_i <= λ and Σ_i α_i t_i = 0.

12 The Support Vectors. The final result is a set of α_i, one for each training example. The optimal hyperplane can be expressed in the dual representation as: f(x) = Σ_i α_i t_i (x_i . x) + b, where w = Σ_i α_i t_i x_i.
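A sketch of recovering w from the dual coefficients of a fitted linear SVM (scikit-learn assumed; its dual_coef_ attribute stores the products α_i t_i for the support vectors):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
t = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=10.0).fit(X, t)

# w = sum_i alpha_i t_i x_i, summed over the support vectors only
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
b = clf.intercept_[0]

x_new = np.array([1.0, 0.5])
print(w @ x_new + b, clf.decision_function([x_new])[0])  # the two values should agree
```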

13 The Support Vectors. We can use the φ functions to map from the original space to a new space. The dual problem becomes: max_α Σ_i α_i - ½ Σ_i Σ_j α_i α_j t_i t_j ( φ(x_i) . φ(x_j) ) subject to 0 <= α_i <= λ and Σ_i α_i t_i = 0.

14 The Support Vectors. Computing the dot product in the new space can be simplified. For a degree-2 polynomial mapping: φ(x_i) . φ(x_j) = 1 + 2 Σ_k x_ik x_jk + Σ_k x_ik^2 x_jk^2 + … But fortunately that is equal to (1 + x_i . x_j)^2 = K(x_i, x_j). In general, all we need is the dot product of every pair of examples in the original space. This results in the Gram matrix K.
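A quick numerical check of this identity; the explicit degree-2 feature map below (with its √2 scalings) is the standard one, spelled out here as an assumption:

```python
import numpy as np
from itertools import combinations

def phi2(x):
    """Explicit degree-2 polynomial map: (1, sqrt(2) x_k, x_k^2, sqrt(2) x_k x_l)."""
    feats = [1.0]
    feats += [np.sqrt(2) * v for v in x]
    feats += [v**2 for v in x]
    feats += [np.sqrt(2) * x[k] * x[l] for k, l in combinations(range(len(x)), 2)]
    return np.array(feats)

def K(xi, xj):
    """Degree-2 polynomial kernel."""
    return (1.0 + np.dot(xi, xj)) ** 2

xi = np.array([1.0, -2.0, 0.5])
xj = np.array([0.3, 4.0, -1.0])

print(np.dot(phi2(xi), phi2(xj)))   # explicit mapping, then dot product
print(K(xi, xj))                    # same value without ever computing phi
```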

15 The Support Vectors. The final formulation is as follows: max_α Σ_i α_i - ½ Σ_i Σ_j α_i α_j t_i t_j K(x_i, x_j) subject to 0 <= α_i <= λ and Σ_i α_i t_i = 0.
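A sketch of solving this kernelized dual directly with a general-purpose solver (again scipy, purely for illustration; real SVM packages use specialized QP or SMO solvers):

```python
import numpy as np
from scipy.optimize import minimize

# Toy data, labels t in {+1, -1}, and a degree-2 polynomial kernel
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
t = np.array([+1.0, +1.0, -1.0, -1.0])
lam = 10.0                               # upper bound on alpha (the slides' lambda)

K = (1.0 + X @ X.T) ** 2                 # Gram matrix K(x_i, x_j)
P = np.outer(t, t) * K                   # P_ij = t_i t_j K(x_i, x_j)

def neg_dual(alpha):                     # minimize the negative of the dual objective
    return 0.5 * alpha @ P @ alpha - alpha.sum()

res = minimize(neg_dual,
               x0=np.zeros(len(X)),
               bounds=[(0.0, lam)] * len(X),
               constraints=[{"type": "eq", "fun": lambda a: a @ t}])

alpha = res.x
print(alpha)                             # nonzero alpha_i mark the support vectors
```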

16 Historical Background. Vladimir Vapnik: published 6 books and over a hundred research papers, developed a theory of expected risk minimization, and invented Support Vector Machines.

17 Historical Background. Alexey Chervonenkis: together with Vladimir Vapnik, developed the concept of the Vapnik-Chervonenkis (VC) dimension.

18 An Example. The XOR problem is known to be non-separable in the original x1-x2 space. [Figure: the four XOR points plotted in the x1-x2 plane, axes ranging over -1, 0, 1.] We use the phi functions (1, √2 x1, √2 x2, √2 x1x2, x1^2, x2^2), with √2 ≈ 1.41, hidden in the kernel function.

19 An Example. The optimal hyperplane is found to be g(x1, x2) = x1 x2 = 0, and the margin is √2 ≈ 1.41. [Figure: the four points plotted in the √2 x1 versus √2 x1x2 plane, where the hyperplane g = 0 separates the two classes with margin b = √2.]
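As a check, a degree-2 polynomial-kernel SVM recovers this kind of solution on the XOR data. A sketch with scikit-learn (assumed available), with labels chosen so that t = sign(x1 x2):

```python
import numpy as np
from sklearn.svm import SVC

# The four XOR points; label +1 when x1*x2 > 0, -1 otherwise
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
t = np.array([+1, -1, -1, +1])

# Degree-2 polynomial kernel K(x, z) = (1 + x.z)^2, i.e. coef0=1, gamma=1
clf = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0, C=1e6).fit(X, t)

print(clf.predict(X))            # all four points classified correctly
print(clf.support_vectors_)      # every point is a support vector here
print(clf.decision_function(X))  # should be close to +-1: all points lie on the margin
```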

20 Benefits of SVMs. The complexity of the classifier is based on the number of support vectors rather than the dimensionality of the feature space. This makes the algorithm less prone to overfitting.

