# Support Vector Machines (SVMs) Chapter 5 (Duda et al.)

CS479/679 Pattern Recognition Dr. George Bebis

Learning through “empirical risk” minimization
Estimate g(x) from a finite set of n observations by minimizing an error function, for example the training error (also called the empirical risk). Each observation $x_k$ carries a class label $z_k$: +1 if $x_k$ belongs to ω1 and -1 if it belongs to ω2.
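As a rough illustration of the training error, the sketch below assumes a squared-error form, $\frac{1}{n}\sum_k (z_k - g(x_k))^2$, a linear g(x), and labels in {+1, -1}. The function names and the toy data are hypothetical and are not taken from the slides.

```python
def g(x, w, w0):
    """Linear discriminant g(x) = w . x + w0 (assumed form, for illustration)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def empirical_risk(samples, labels, w, w0):
    """Mean squared difference between the labels z_k and the outputs g(x_k)."""
    n = len(samples)
    return sum((z - g(x, w, w0)) ** 2 for x, z in zip(samples, labels)) / n


# Hypothetical toy data: two points per class, labels in {+1, -1}.
samples = [(2.0, 2.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [+1, +1, -1, -1]

# Two candidate discriminants evaluated on the same training set.
risk_a = empirical_risk(samples, labels, w=(0.25, 0.25), w0=0.0)
risk_b = empirical_risk(samples, labels, w=(0.0, 0.0), w0=0.0)
print(risk_a, risk_b)
```

A lower value only says that a candidate fits the training set better; it says nothing yet about performance on unseen data.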

Learning through “empirical risk” minimization (cont’d)
Conventional empirical risk minimization does not imply good generalization performance. Several different functions g(x) may all approximate the training data set well, and it is difficult to determine which of them would generalize best.

Learning through “empirical risk” minimization (cont’d)
[Figure: two candidate solutions, labeled Solution 1 and Solution 2.] Which solution is better?

Statistical Learning: Capacity and VC dimension
To guarantee good generalization performance, the capacity (i.e., complexity) of the learned functions must be controlled. Functions with high capacity are more complicated (i.e., have many degrees of freedom). [Figure: a low-capacity function next to a high-capacity function.]

Statistical Learning: Capacity and VC dimension (cont’d)
How do we measure capacity? In statistical learning, the Vapnik-Chervonenkis (VC) dimension is a popular measure of capacity. The VC dimension yields a probabilistic upper bound on the generalization error of a classifier.

Statistical Learning: Capacity and VC dimension (cont’d)
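A function that (1) minimizes the empirical risk and (2) has low VC dimension will generalize well regardless of the dimensionality of the input space. With probability (1-δ), the true risk $R$ is bounded in terms of the empirical risk $R_{emp}$ and the VC dimension $d_{VC}$:

$$R \le R_{emp} + \sqrt{\frac{d_{VC}\left(\ln\frac{2n}{d_{VC}} + 1\right) - \ln\frac{\delta}{4}}{n}}$$

where n is the number of training examples (Vapnik, 1995, “Structural Risk Minimization Principle”). Minimizing the right-hand side as a whole, rather than the empirical risk alone, is structural risk minimization.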
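To get a feel for how capacity enters the bound, the sketch below evaluates the capacity term, assuming the commonly quoted form $\sqrt{(d_{VC}(\ln(2n/d_{VC}) + 1) - \ln(\delta/4))/n}$. The function name `vc_confidence` and the sample values are illustrative choices, not part of the slides.

```python
import math


def vc_confidence(d_vc, n, delta):
    """Capacity term of the bound: grows with d_vc, shrinks with n (assumed form)."""
    return math.sqrt((d_vc * (math.log(2 * n / d_vc) + 1) - math.log(delta / 4)) / n)


# Same training-set size, increasing capacity.
low_capacity = vc_confidence(d_vc=10, n=1000, delta=0.05)
high_capacity = vc_confidence(d_vc=100, n=1000, delta=0.05)

# Same capacity, ten times more training examples.
more_data = vc_confidence(d_vc=10, n=10000, delta=0.05)

print(low_capacity, high_capacity, more_data)
```

With the training error held fixed, the bound is tighter for the lower-capacity function and tightens further as n grows.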

VC dimension and margin of separation
Vapnik has shown that maximizing the margin of separation (i.e., the empty space between classes) lowers an upper bound on the VC dimension, so a large margin acts as capacity control. The optimal hyperplane is the one giving the largest margin of separation between the classes.

Margin of separation and support vectors
How is the margin defined? The margin is defined by the distance of the nearest training samples from the hyperplane. We refer to these samples as support vectors. Intuitively speaking, these are the most difficult samples to classify.
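The definition can be checked numerically. The sketch below assumes a hyperplane $w \cdot x + w_0 = 0$ that is already given, measures each sample's distance $|w \cdot x + w_0| / \|w\|$, and reports the nearest samples; the helper names and the data are hypothetical.

```python
import math


def distance_to_hyperplane(x, w, w0):
    """Geometric distance |w . x + w0| / ||w|| of a point from the hyperplane."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + w0) / norm_w


def margin_and_support(samples, w, w0, tol=1e-9):
    """Return the smallest distance and the samples that attain it."""
    dists = [distance_to_hyperplane(x, w, w0) for x in samples]
    b = min(dists)
    support = [x for x, d in zip(samples, dists) if d - b <= tol]
    return b, support


# Hypothetical samples and a given hyperplane x1 + x2 = 0.
samples = [(1.0, 1.0), (3.0, 2.0), (-1.0, -1.0), (-2.0, -4.0)]
b, support = margin_and_support(samples, w=(1.0, 1.0), w0=0.0)
print(b, support)
```

Only the two samples closest to the hyperplane determine the margin; moving the others farther away would not change it.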

Margin of separation and support vectors (cont’d)
[Figure: different solutions and their corresponding margins.]

SVM Overview
SVMs are primarily two-class classifiers but can be extended to multiple classes. They perform structural risk minimization to achieve good generalization performance. The optimization criterion is the margin of separation between classes. Training is equivalent to solving a quadratic programming problem with linear constraints.

Linear SVM: separable case
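Linear discriminant: $g(x) = w^T x + w_0$. Decide ω1 if g(x) > 0 and ω2 if g(x) < 0. Class labels: $z_k = +1$ if $x_k \in \omega_1$ and $z_k = -1$ if $x_k \in \omega_2$. Consider the equivalent problem: $z_k g(x_k) > 0$, i.e. $z_k (w^T x_k + w_0) > 0$, for k = 1, ..., n.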

Linear SVM: separable case (cont’d)
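The distance of a point $x_k$ from the separating hyperplane should satisfy the constraint $\frac{z_k g(x_k)}{\|w\|} \ge b$ with $b > 0$, where $z_k = \pm 1$ is the class label of $x_k$. To constrain the length of w (uniqueness), we impose $b\,\|w\| = 1$. Using the above constraint, the requirement becomes $z_k g(x_k) \ge 1$, i.e. $z_k (w^T x_k + w_0) \ge 1$, and the margin between the two classes is $2/\|w\|$.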

Linear SVM: separable case (cont’d)
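A standard way to state the resulting optimization problem, assuming the normalization $z_k (w^T x_k + w_0) \ge 1$ and a margin of $2/\|w\|$, is the following quadratic program with linear constraints (the notation follows the rest of the transcript):

$$\min_{w,\, w_0} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad z_k (w^T x_k + w_0) \ge 1, \quad k = 1, \dots, n$$

Maximizing $2/\|w\|$ and minimizing $\frac{1}{2}\|w\|^2$ have the same solution; the squared form is used because it is differentiable and convex.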

Linear SVM: separable case (cont’d)
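Using Lagrange optimization, minimize with respect to $w$ and $w_0$:

$$L(w, w_0, \lambda) = \frac{1}{2}\|w\|^2 - \sum_{k=1}^{n} \lambda_k \left[ z_k (w^T x_k + w_0) - 1 \right], \quad \lambda_k \ge 0$$

where $z_k = \pm 1$ are the class labels. It is easier to solve the “dual” problem (Kuhn-Tucker construction): maximize

$$\sum_{k=1}^{n} \lambda_k - \frac{1}{2} \sum_{k=1}^{n} \sum_{j=1}^{n} \lambda_k \lambda_j z_k z_j \, x_j^T x_k$$

subject to $\sum_{k} z_k \lambda_k = 0$ and $\lambda_k \ge 0$ for all k.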
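The dual can be explored on a problem small enough to solve by hand. The sketch below assumes the standard dual objective $\sum_k \lambda_k - \frac{1}{2}\sum_k\sum_j \lambda_k \lambda_j z_k z_j (x_j \cdot x_k)$ and a hypothetical two-point training set; the equality constraint $\sum_k z_k \lambda_k = 0$ then forces both multipliers to share one value t.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def dual_objective(lambdas, xs, zs):
    """sum_k lambda_k - 1/2 sum_k sum_j lambda_k lambda_j z_k z_j (x_j . x_k)."""
    linear = sum(lambdas)
    quad = sum(
        lk * lj * zk * zj * dot(xk, xj)
        for lk, zk, xk in zip(lambdas, zs, xs)
        for lj, zj, xj in zip(lambdas, zs, xs)
    )
    return linear - 0.5 * quad


# Hypothetical training set: one point per class.
xs = [(1.0, 1.0), (-1.0, -1.0)]
zs = [+1, -1]

# Feasible multipliers have lambda_1 = lambda_2 = t; scan a few values of t.
values = {t: dual_objective([t, t], xs, zs) for t in (0.20, 0.25, 0.30)}
print(values)
```

For this data the objective reduces to $2t - 4t^2$, which peaks at t = 0.25; a real solver would search over all multipliers rather than a hand-picked grid.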

Linear SVM: separable case (cont’d)
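The solution is given by $w = \sum_{k} z_k \lambda_k x_k$, with $w_0$ obtained from any support vector through the condition $\lambda_k \left[ z_k (w^T x_k + w_0) - 1 \right] = 0$. The discriminant then becomes $g(x) = \sum_{k} z_k \lambda_k (x_k \cdot x) + w_0$, a weighted sum of dot products with the training samples.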

Linear SVM: separable case (cont’d)
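In $g(x) = \sum_{k} z_k \lambda_k (x_k \cdot x) + w_0$, the data enter only through the dot product $x_k \cdot x$. It can be shown that if $x_k$ is not a support vector, then the corresponding $\lambda_k = 0$. Only the support vectors contribute to the solution!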
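The relationship between the multipliers, w, w0 and g(x) can be traced on a tiny case. The sketch below assumes a hypothetical two-point training set whose multipliers (0.25 each) were worked out by hand; it is not an SVM solver, only the bookkeeping that follows once the multipliers are known.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Hypothetical training set; both points are support vectors.
xs = [(1.0, 1.0), (-1.0, -1.0)]
zs = [+1, -1]
lambdas = [0.25, 0.25]  # assumed dual solution for this toy set

# w = sum_k z_k * lambda_k * x_k
w = [sum(z * lam * x[i] for x, z, lam in zip(xs, zs, lambdas)) for i in range(2)]

# For a support vector, z_k (w . x_k + w0) = 1, so w0 = z_k - w . x_k.
w0 = zs[0] - dot(w, xs[0])


def g(x):
    """Discriminant written purely in terms of dot products with the samples."""
    return sum(z * lam * dot(xk, x) for xk, z, lam in zip(xs, zs, lambdas)) + w0


margin_width = 2 / math.sqrt(dot(w, w))
print(w, w0, g((2.0, 0.0)), margin_width)
```

Here the margin width equals the distance between the two points, as expected when the hyperplane is their perpendicular bisector.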

Linear SVM: non-separable case
Allow misclassifications (i.e., a soft margin classifier) by introducing non-negative error (slack) variables $\psi_k$: $z_k (w^T x_k + w_0) \ge 1 - \psi_k$, with $\psi_k \ge 0$, for k = 1, ..., n.

Linear SVM: non-separable case (cont’d)
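The soft-margin problem minimizes $\frac{1}{2}\|w\|^2 + c \sum_{k} \psi_k$ subject to the slack-relaxed margin constraints. The constant c > 0 controls the trade-off between margin and misclassification errors. The slack variables aim to prevent outliers from dictating the optimal hyperplane.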
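The role of c can be seen by evaluating the soft-margin objective for a fixed hyperplane. The sketch below assumes the objective $\frac{1}{2}\|w\|^2 + c\sum_k \psi_k$ and takes each slack as the smallest value satisfying its constraint, $\psi_k = \max(0, 1 - z_k g(x_k))$; the data, including the deliberate outlier, are hypothetical.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def soft_margin_objective(samples, labels, w, w0, c):
    """Return (objective, slacks) for a given hyperplane and trade-off constant c."""
    # Smallest slack satisfying z_k (w . x_k + w0) >= 1 - psi_k with psi_k >= 0.
    slacks = [max(0.0, 1.0 - z * (dot(w, x) + w0)) for x, z in zip(samples, labels)]
    objective = 0.5 * dot(w, w) + c * sum(slacks)
    return objective, slacks


# Two well-separated points plus one outlier labeled +1 on the wrong side.
samples = [(2.0, 2.0), (-2.0, -2.0), (-0.5, -0.5)]
labels = [+1, -1, +1]
w, w0 = (0.5, 0.5), 0.0

obj_small_c, slacks = soft_margin_objective(samples, labels, w, w0, c=1.0)
obj_large_c, _ = soft_margin_objective(samples, labels, w, w0, c=10.0)
print(slacks, obj_small_c, obj_large_c)
```

A slack above 1 marks a misclassified sample. With a larger c the same outlier costs more, so an optimizer would be pushed toward hyperplanes that accommodate it at the expense of margin.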

Linear SVM: non-separable case (cont’d)
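It is easier to solve the “dual” problem (Kuhn-Tucker construction): maximize

$$\sum_{k=1}^{n} \lambda_k - \frac{1}{2} \sum_{k=1}^{n} \sum_{j=1}^{n} \lambda_k \lambda_j z_k z_j \, x_j^T x_k$$

subject to $\sum_{k} z_k \lambda_k = 0$ and $0 \le \lambda_k \le c$ for all k. The slack variables do not appear in the dual; their only trace is the upper bound c on each multiplier.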

Nonlinear SVM
Extending these concepts to the nonlinear case involves mapping the data to a high-dimensional space of dimension h: $x \to \Phi(x) \in \mathbb{R}^h$. Mapping the data to a sufficiently high-dimensional space is likely to make the data linearly separable in that space.

Nonlinear SVM (cont’d)
Example:
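One small illustration of the idea (a hypothetical case, not necessarily the one shown on the original slide): four 1-D points that no single threshold can separate become linearly separable after the assumed mapping $\Phi(x) = (x, x^2)$.

```python
def phi(x):
    """Assumed mapping from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)


# Hypothetical data: class +1 lies on both sides of class -1.
xs = [-2.0, -0.5, 0.5, 2.0]
zs = [+1, -1, -1, +1]


def separates_1d(threshold, sign):
    """True if sign * (x - threshold) has the right sign for every sample."""
    return all(z * sign * (x - threshold) > 0 for x, z in zip(xs, zs))


# One candidate threshold per gap between sorted points, both orientations.
candidates = [-3.0, -1.25, 0.0, 1.25, 3.0]
separable_1d = any(separates_1d(t, s) for t in candidates for s in (+1, -1))

# In the mapped space the line x^2 = 2 separates the classes.
w, w0 = (0.0, 1.0), -2.0


def g(x):
    f = phi(x)
    return w[0] * f[0] + w[1] * f[1] + w0


separable_2d = all(z * g(x) > 0 for x, z in zip(xs, zs))
print(separable_1d, separable_2d)
```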

Nonlinear SVM (cont’d)
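The nonlinear SVM discriminant is $g(x) = \sum_{k} z_k \lambda_k \left( \Phi(x) \cdot \Phi(x_k) \right) + w_0$. The disadvantage of this approach is that the mapping $\Phi$ might be very expensive to compute. Is there an efficient way to compute the dot products $\Phi(x) \cdot \Phi(x_k)$?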

The kernel trick
Compute dot products using a kernel function: $K(x, x_k) = \Phi(x) \cdot \Phi(x_k)$, so that $g(x) = \sum_{k} z_k \lambda_k K(x, x_k) + w_0$.

The kernel trick (cont’d)
Comments: Kernel functions that can be expressed as a dot product in some space satisfy Mercer’s condition (see Burges’ paper). Mercer’s condition does not tell us how to construct Φ() or even what the high-dimensional space is.

Advantages of the kernel trick: there is no need to know Φ(), and computations remain feasible even if the feature space has high dimensionality.

Polynomial Kernel
$K(x, y) = (x \cdot y)^d$

Polynomial Kernel - Example
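A quick numerical check of the polynomial kernel for d = 2 in two dimensions, assuming the explicit mapping $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The sample vectors are arbitrary, and this may differ from the example on the original slide.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def poly_kernel(x, y, d=2):
    """K(x, y) = (x . y)^d, computed in the original input space."""
    return dot(x, y) ** d


def phi(x):
    """Explicit feature map whose dot product reproduces (x . y)^2 in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, -1.0)
k_val = poly_kernel(x, y)        # one dot product in 2-D, then a square
explicit = dot(phi(x), phi(y))   # dot product in the 3-D feature space
print(k_val, explicit)
```

Both routes give the same number, but the kernel never forms the feature vectors, which is what keeps the computation cheap as d and the input dimension grow.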

Common Kernel functions
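Kernels frequently listed in the SVM literature include the polynomial, Gaussian (RBF), and sigmoid kernels. The sketch below gives one plausible parameterization of each; the parameter names `d`, `sigma`, `kappa`, and `theta` and their defaults are assumptions, and the slide's own list may have differed.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def polynomial_kernel(x, y, d=2):
    """Inhomogeneous polynomial kernel: (x . y + 1)^d."""
    return (dot(x, y) + 1.0) ** d


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """Sigmoid kernel: tanh(kappa * x . y - theta)."""
    return math.tanh(kappa * dot(x, y) - theta)


x, y = (1.0, 0.0), (0.0, 1.0)
print(polynomial_kernel(x, y), gaussian_kernel(x, y), sigmoid_kernel(x, y))
```

The sigmoid kernel satisfies Mercer's condition only for some values of its parameters, so it needs more care than the other two.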

Example
[Worked example, equations not recovered: the data are mapped to a space of dimension h = 6 (Problem 4); the solution has w0 = 0, followed by the weight vector w and the discriminant.]
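The surviving values (h = 6, w0 = 0) are consistent with the XOR example in Duda et al., and the sketch below works under that assumption. The kernel $K(x, y) = (1 + x \cdot y)^2$, the label assignment, and the multipliers $\lambda_k = 1/8$ are taken from that textbook example as an assumption, not recovered from the slides.

```python
def kernel(x, y):
    """K(x, y) = (1 + x . y)^2; its implicit feature space has 6 dimensions."""
    return (1.0 + x[0] * y[0] + x[1] * y[1]) ** 2


# XOR configuration (assumed): same-sign points form one class.
xs = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
zs = [+1, +1, -1, -1]
lambdas = [0.125] * 4  # assumed dual solution; all four points are support vectors
w0 = 0.0


def g(x):
    """Kernelized discriminant: sum_k z_k lambda_k K(x_k, x) + w0."""
    return sum(z * lam * kernel(xk, x) for xk, z, lam in zip(xs, zs, lambdas)) + w0


for x, z in zip(xs, zs):
    print(x, z, g(x))
```

Expanding the kernel shows that under these assumptions the discriminant collapses to $g(x) = x_1 x_2$, so each training point sits exactly on the margin with $z_k g(x_k) = 1$.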

Comments
SVM training is based on exact optimization rather than approximate methods (it is a global optimization method with no local optima). SVMs appear to avoid overfitting in high-dimensional spaces and to generalize well from a small training set. Performance depends on the choice of the kernel and its parameters. Complexity depends on the number of support vectors, not on the dimensionality of the transformed space.