
1 Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis

2 Learning through “empirical risk” minimization Typically, a discriminant function g(x) is estimated from a finite set of n examples by minimizing an error function, e.g., the training error J = (1/2) Σ_k (z_k − g(x_k))², where z_k is the correct class label of sample x_k and g(x_k) is the predicted class; minimizing this quantity is called empirical risk minimization.
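As a quick sketch of the idea (my own illustration, not part of the slides; the toy data, weights, and function names are hypothetical), the training error of a candidate discriminant can be computed directly:

import numpy as np

def empirical_risk(g, X, z):
    # Fraction of training samples misclassified by the discriminant g
    # (0/1 training error); z holds the correct class labels in {-1, +1}.
    predictions = np.sign([g(x) for x in X])
    return np.mean(predictions != z)

# Hypothetical 2-D toy set and a candidate linear discriminant g(x) = w.x + w0
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -0.5]])
z = np.array([+1, +1, -1, -1])
w, w0 = np.array([1.0, 1.0]), 0.0
g = lambda x: np.dot(w, x) + w0

print(empirical_risk(g, X, z))   # 0.0 -- this g fits the training set perfectly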

3 Learning through “empirical risk” minimization (cont’d) Conventional empirical risk minimization does not imply good generalization performance. – There could be several different functions g(x) which all approximate the training data set well. – It is difficult to determine which of these functions would have the best generalization performance. [Figure: two candidate solutions, “Solution 1” and “Solution 2”, both consistent with the training data; which solution is best?]

4 Statistical Learning: Capacity and VC dimension To guarantee good generalization performance, the complexity or capacity of the learned functions must be controlled. Functions with high capacity are more complicated (i.e., have many degrees of freedom). [Figure: a low-capacity (simple) decision boundary vs. a high-capacity (highly flexible) one]

5 Statistical Learning: Capacity and VC dimension (cont’d) How can we measure the capacity of a discriminant function? – In statistical learning, the Vapnik-Chervonenkis (VC) dimension is a popular measure of the capacity of a classifier. – The VC dimension can be used to derive a probabilistic upper bound on the generalization error of a classifier.

6 Statistical Learning: Capacity and VC dimension (cont’d) It can be shown that a classifier that (1) minimizes the empirical risk and (2) has low VC dimension will generalize well regardless of the dimensionality of the input space: with probability (1 − δ), true error ≤ training error + sqrt([h(ln(2n/h) + 1) − ln(δ/4)] / n), where n is the number of training examples and h is the VC dimension. Minimizing both terms on the right-hand side is the structural risk minimization principle (Vapnik, 1995).
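As a numeric illustration of this bound (my own sketch, not from the slides; the training error, h, n, and δ values are arbitrary):

import numpy as np

def vc_bound(train_error, h, n, delta=0.05):
    # Vapnik-style upper bound on the true error, holding with probability 1 - delta.
    # h: VC dimension of the classifier family, n: number of training examples.
    confidence_term = np.sqrt((h * (np.log(2.0 * n / h) + 1.0) - np.log(delta / 4.0)) / n)
    return train_error + confidence_term

# The bound tightens as n grows and loosens as the capacity h grows.
for n in (100, 1000, 10000):
    print(n, round(vc_bound(train_error=0.05, h=10, n=n), 3))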

7 VC dimension and margin of separation Vapnik has shown that maximizing the margin of separation (i.e., empty space between classes) is equivalent to minimizing the VC dimension. The optimal hyperplane is the one giving the largest margin of separation between the classes.

8 Margin of separation and support vectors How is the margin defined? – The margin is defined by the distance of the nearest training samples from the hyperplane. – Intuitively speaking, these are the most difficult samples to classify. – We refer to these samples as support vectors.

9 Margin of separation and support vectors (cont’d) [Figure: several different separating solutions and their corresponding margins]

10 SVM Overview The SVM is primarily a two-class classifier but can be extended to multiple classes. It performs structural risk minimization to achieve good generalization performance. The optimization criterion is the margin of separation between the classes. Training is equivalent to solving a quadratic programming problem with linear constraints.

11 Linear SVM: separable case Linear discriminant: g(x) = w·x + w0. Class labels: z_k = +1 if x_k ∈ ω1 and z_k = −1 if x_k ∈ ω2. Consider the equivalent problem: decide ω1 if g(x) > 0 and ω2 if g(x) < 0, i.e., require z_k g(x_k) > 0 for every training sample x_k.

12 Linear SVM: separable case (cont’d) The distance r of a point x_k from the separating hyperplane is r = g(x_k)/||w||, and it should satisfy the constraint z_k g(x_k)/||w|| ≥ b, where b > 0 is the margin. To constrain the length of w (for uniqueness), we impose b·||w|| = 1; the constraint then becomes z_k g(x_k) ≥ 1.
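A small sketch of these quantities (my own example; the hyperplane w, w0 and the sample points are hypothetical):

import numpy as np

w, w0 = np.array([2.0, 1.0]), -1.0        # hypothetical hyperplane g(x) = w.x + w0
X = np.array([[1.0, 1.0], [0.0, -1.0]])   # hypothetical samples
z = np.array([+1, -1])                    # their class labels

g = X @ w + w0                            # g(x_k) for each sample
r = g / np.linalg.norm(w)                 # signed distance of each x_k from the hyperplane
print(r)                                  # positive for class +1, negative for class -1
print(z * g >= 1)                         # canonical constraint z_k g(x_k) >= 1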

13 Linear SVM: separable case (cont’d) Maximize the margin 2/||w||, or equivalently minimize J(w) = ||w||²/2 subject to z_k (w·x_k + w0) ≥ 1 for all k; this is a quadratic programming problem with linear inequality constraints.
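In practice this QP is handed to an off-the-shelf solver. As a hedged sketch (assuming scikit-learn is available; the toy data is hypothetical), a hard margin can be approximated with a linear-kernel SVC and a very large C:

import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 3.0],
              [-1.0, -1.5], [-2.0, 0.0], [-1.5, -2.5]])
z = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6)   # very large C approximates the hard-margin (separable) case
clf.fit(X, z)

w, w0 = clf.coef_[0], clf.intercept_[0]
print('w =', w, ' w0 =', w0)
print('margin width 2/||w|| =', 2.0 / np.linalg.norm(w))
print('support vectors:', clf.support_vectors_)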

14 Linear SVM: separable case (cont’d) Using Lagrange optimization, minimize L(w, w0, λ) = ||w||²/2 − Σ_k λ_k [z_k (w·x_k + w0) − 1] with λ_k ≥ 0. It is easier to solve the “dual” problem (Kuhn–Tucker construction): maximize Σ_k λ_k − (1/2) Σ_k Σ_j λ_k λ_j z_k z_j (x_k·x_j) subject to λ_k ≥ 0 and Σ_k λ_k z_k = 0.

15 Linear SVM: separable case (cont’d) The solution is given by w = Σ_k λ_k z_k x_k, with w0 = z_k − w·x_k for any k such that λ_k > 0. The discriminant is given by g(x) = Σ_k λ_k z_k (x_k·x) + w0, which involves the training samples only through dot products.
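Continuing the hedged scikit-learn sketch from above (repeated here so it runs on its own), the solution can be reassembled from the support vectors: the solver's dual_coef_ attribute stores the products λ_k z_k, so w and w0 can be recovered and compared with the solver's own values.

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 3.0],
              [-1.0, -1.5], [-2.0, 0.0], [-1.5, -2.5]])
z = np.array([+1, +1, +1, -1, -1, -1])
clf = SVC(kernel='linear', C=1e6).fit(X, z)

lam_z = clf.dual_coef_[0]            # lambda_k * z_k for each support vector
sv = clf.support_vectors_
w = lam_z @ sv                       # w = sum_k lambda_k z_k x_k
k = clf.support_[0]                  # index of one support vector
w0 = z[k] - w @ X[k]                 # w0 = z_k - w.x_k for any support vector x_k
print('reassembled   w =', w, ' w0 =', w0)
print('solver output w =', clf.coef_[0], ' w0 =', clf.intercept_[0])
print('only', len(sv), 'of', len(X), 'samples have nonzero lambda (the support vectors)')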

16 Linear SVM: separable case (cont’d) It can be shown that if x_k is not a support vector, then the corresponding λ_k = 0. The solution depends on the support vectors only!

17 Linear SVM: non-separable case Allow misclassifications (i.e., a soft-margin classifier) by introducing positive error (slack) variables ψ_k: minimize ||w||²/2 + c Σ_k ψ_k subject to z_k (w·x_k + w0) ≥ 1 − ψ_k and ψ_k ≥ 0, where c is a constant.

18 Linear SVM: non-separable case (cont’d) The constant c controls the trade-off between the margin and the misclassification errors. This formulation also aims to prevent outliers from dominating the optimal hyperplane.
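A hedged sketch of this trade-off (assuming scikit-learn; the noisy toy data is randomly generated): a small c gives a wider, more tolerant margin, while a large c penalizes slack heavily.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, size=(30, 2)),   # class +1, overlapping with
               rng.normal(-1.5, 1.0, size=(30, 2))])  # class -1
z = np.array([+1] * 30 + [-1] * 30)

for c in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=c).fit(X, z)
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f'c={c:<6} margin width={width:.2f}  support vectors={clf.n_support_.sum()}')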

19 Linear SVM: non-separable case (cont’d) It is easier to solve the “dual” problem (Kuhn–Tucker construction): maximize Σ_k λ_k − (1/2) Σ_k Σ_j λ_k λ_j z_k z_j (x_k·x_j) subject to 0 ≤ λ_k ≤ c and Σ_k λ_k z_k = 0; the only change from the separable case is the upper bound c on each λ_k.

20 Nonlinear SVM Extension to the non-linear case involves mapping the data to an h-dimensional space via a mapping Φ(x). Mapping the data to a sufficiently high-dimensional space is likely to make the data linearly separable in that space.
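A minimal sketch of this idea (my own toy example): 1-D data that is not linearly separable becomes separable after the hypothetical mapping Φ(x) = (x, x²).

import numpy as np

# Class +1 sits in the middle, class -1 on both sides: no single threshold on x separates them.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
z = np.array([-1, -1, +1, +1, +1, -1, -1])

# Map each sample to 2-D: phi(x) = (x, x^2)
phi = np.column_stack([x, x ** 2])

# In the mapped space the line x2 = 2 separates the classes:
# the second coordinate is < 2 for every +1 sample and > 2 for every -1 sample.
print(phi[:, 1], z)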

21 Nonlinear SVM (cont’d) Example: [mapping example not captured in the transcript]

22 Nonlinear SVM (cont’d) Linear SVM discriminant: g(x) = Σ_k λ_k z_k (x_k·x) + w0. Non-linear SVM discriminant: g(x) = Σ_k λ_k z_k (Φ(x_k)·Φ(x)) + w0.

23 Nonlinear SVM (cont’d) The disadvantage of this approach is that the mapping Φ(x) might be very expensive to compute. Is there an efficient way to compute the dot products Φ(x_k)·Φ(x) without forming Φ explicitly?

24 The kernel trick Compute the dot products Φ(x_k)·Φ(x) in the original space using a kernel function K(x_k, x) = Φ(x_k)·Φ(x), so that Φ never has to be evaluated explicitly.
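A numeric check of the trick (my own sketch) for the degree-2 polynomial kernel K(x, y) = (x·y)², whose explicit mapping in 2-D can be taken as Φ(x) = (x1², √2·x1·x2, x2²):

import numpy as np

def phi(x):
    # One explicit degree-2 feature map for 2-D input
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def K(x, y):
    # Polynomial kernel of degree 2, computed entirely in the original 2-D space
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))   # dot product computed in the 3-D feature space
print(K(x, y))                  # same value, without ever forming phi(x) or phi(y)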

25 The kernel trick (cont’d) Mercer’s condition – A kernel function can be expressed as a dot product in some space if and only if it satisfies Mercer’s condition (see Burges’ paper). – Mercer’s condition does not tell us how to construct Φ() or even what the high-dimensional space is. Advantages of the kernel trick – No need to know or compute Φ(). – Computations remain feasible even if the feature space has very high dimensionality.
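An informal numeric check of Mercer's condition (my own sketch, not from the slides): for a valid kernel, the Gram matrix of any sample set should be positive semi-definite, i.e., all eigenvalues ≥ 0 up to round-off.

import numpy as np

def gram_matrix(kernel, X):
    # Gram (kernel) matrix: K[i, j] = kernel(x_i, x_j) for a set of samples X
    return np.array([[kernel(a, b) for b in X] for a in X])

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))

poly = lambda a, b: (np.dot(a, b) + 1.0) ** 2             # polynomial kernel
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)    # Gaussian (RBF) kernel

for name, kern in [('poly', poly), ('rbf', rbf)]:
    eigvals = np.linalg.eigvalsh(gram_matrix(kern, X))
    print(name, 'min eigenvalue =', eigvals.min())        # approximately >= 0 for Mercer kernels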

26 Polynomial Kernel K(x, y) = (x·y)^d

27 Polynomial Kernel (cont’d)

28 Common Kernel functions
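The table on this slide was not captured in the transcript; the kernels most commonly listed for SVMs are the polynomial, Gaussian (RBF), and sigmoid kernels. A brief sketch of the usual formulas (my own, with arbitrary parameter values):

import numpy as np

def polynomial_kernel(x, y, d=2, c=1.0):
    # K(x, y) = (x.y + c)^d
    return (np.dot(x, y) + c) ** d

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    # K(x, y) = tanh(kappa * x.y + theta); a valid Mercer kernel only for some parameter choices
    return np.tanh(kappa * np.dot(x, y) + theta)

x, y = np.array([1.0, 0.5]), np.array([-0.5, 2.0])
print(polynomial_kernel(x, y), gaussian_kernel(x, y), sigmoid_kernel(x, y))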

29–35 Example A worked example (slides 29–35, referencing “Problem 4”) trains a nonlinear SVM on a small data set: the transformed feature space has dimensionality h = 6, the solution has w0 = 0, and the slides derive the weight vector w and the resulting discriminant. [The equations and figures of this example were not captured in the transcript.]
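Since the equations of the worked example did not survive the transcript, here is a stand-in sketch (my own, not necessarily the example on the slides): a nonlinear SVM with a degree-2 polynomial kernel solving the classic XOR-style problem, assuming scikit-learn is available.

import numpy as np
from sklearn.svm import SVC

# XOR-style toy problem: not linearly separable in the original 2-D space
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
z = np.array([+1, +1, -1, -1])

# Degree-2 homogeneous polynomial kernel: with gamma=1 and coef0=0, K(x, y) = (x.y)^2
clf = SVC(kernel='poly', degree=2, gamma=1.0, coef0=0.0, C=1e6)
clf.fit(X, z)

print(clf.predict(X))              # all four training points classified correctly
print(clf.n_support_)              # every point ends up as a support vector here
print(clf.decision_function(X))    # sign follows the product x1*x2, as expected for XOR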

36 Comments SVM training is based on exact optimization, not on approximate methods (i.e., it is a global optimization problem with no local optima). SVMs appear to avoid overfitting in high-dimensional spaces and to generalize well even from small training sets. Performance depends on the choice of the kernel and its parameters. The classifier’s complexity depends on the number of support vectors, not on the dimensionality of the transformed space.

