1 Support Vector Machines (SVMs) Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis
2 Learning through “empirical risk” minimization Estimate g(x) from a finite set of observations by minimizing an error function, for example the training error (also called the empirical risk), computed from the known class labels of the training samples.
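As a rough illustration of the training error mentioned above, the sketch below counts the fraction of misclassified training samples for a fixed linear discriminant. The label encoding (+1 and -1 for the two classes), the 0-1 error count, the linear form of g, and the names `g` and `empirical_risk` are all assumptions made for illustration; the slide's own error function may differ.

```python
def g(x, w, w0):
    """Linear discriminant g(x) = w . x + w0 (assumed form)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def empirical_risk(samples, labels, w, w0):
    """Fraction of training samples whose predicted sign disagrees with the label."""
    errors = 0
    for x, z in zip(samples, labels):
        predicted = 1 if g(x, w, w0) > 0 else -1
        if predicted != z:
            errors += 1
    return errors / len(samples)


# Toy data (made up): two samples per class in 2-D.
samples = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, 0.5)]
labels = [1, 1, -1, -1]

risk_a = empirical_risk(samples, labels, w=(1.0, 0.0), w0=0.0)  # splits on x1
risk_b = empirical_risk(samples, labels, w=(0.0, 1.0), w0=0.0)  # splits on x2
print(risk_a, risk_b)
```

On this toy set the first discriminant makes no training errors and the second misclassifies one of the four samples.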
3 Learning through “empirical risk” minimization (cont’d) Conventional empirical risk minimization does not imply good generalization performance. There could be several different functions g(x) which all approximate the training data set well. It is difficult to determine which function would have the best generalization performance.
4 Learning through “empirical risk” minimization (cont’d) (Figure panels: Solution 1, Solution 2.) Which solution is better?
5 Statistical Learning: Capacity and VC dimension To guarantee good generalization performance, the capacity (i.e., complexity) of the learned functions must be controlled. Functions with high capacity are more complicated (i.e., have many degrees of freedom). (Figure panels: low capacity, high capacity.)
6 Statistical Learning: Capacity and VC dimension (cont’d) How do we measure capacity? In statistical learning, the Vapnik-Chervonenkis (VC) dimension is a popular measure of capacity. The VC dimension can be used to derive a probabilistic upper bound on the generalization error of a classifier.
7 Statistical Learning: Capacity and VC dimension (cont’d) A function that (1) minimizes the empirical risk and (2) has low VC dimension will generalize well regardless of the dimensionality of the input space. The bound on the generalization error holds with probability (1-δ), where n is the number of training examples (Vapnik, 1995, “Structural Risk Minimization Principle”).
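The bound on this slide is commonly written as the empirical risk plus a confidence term that grows with the VC dimension and shrinks with the number of training examples n. The sketch below assumes that commonly cited form (as given, for example, in Burges' tutorial); the function names and the numbers are hypothetical, and the slide's exact expression may differ.

```python
import math


def vc_confidence(vc_dim, n, delta):
    """Confidence term sqrt((h * (ln(2n/h) + 1) - ln(delta/4)) / n), with h = vc_dim."""
    return math.sqrt((vc_dim * (math.log(2 * n / vc_dim) + 1) - math.log(delta / 4)) / n)


def risk_bound(train_error, vc_dim, n, delta):
    """Upper bound on the generalization error, holding with probability 1 - delta."""
    return train_error + vc_confidence(vc_dim, n, delta)


# Same training error and data size, different capacity (made-up numbers).
low_capacity = risk_bound(0.05, vc_dim=10, n=10000, delta=0.05)
high_capacity = risk_bound(0.05, vc_dim=1000, n=10000, delta=0.05)
print(low_capacity, high_capacity)
```

With the training error held fixed, the function class with the lower VC dimension receives the tighter bound, which is the motivation for controlling capacity.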
8 VC dimension and margin of separation Vapnik has shown that maximizing the margin of separation (i.e., empty space between classes) is equivalent to minimizing the VC dimension. The optimal hyperplane is the one giving the largest margin of separation between the classes.
9 Margin of separation and support vectors How is the margin defined? The margin is defined by the distance of the nearest training samples from the hyperplane. We refer to these samples as support vectors. Intuitively speaking, these are the most difficult samples to classify.
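The margin definition above can be sketched numerically: for a given hyperplane, compute each sample's distance to it and take the minimum. The hyperplane parameterization (w, w0), the toy data, and the helper name `distances_to_hyperplane` are assumptions for illustration only.

```python
import math


def distances_to_hyperplane(samples, w, w0):
    """Distance |w . x + w0| / ||w|| of each sample from the hyperplane."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return [abs(sum(wi * xi for wi, xi in zip(w, x)) + w0) / norm_w for x in samples]


# Toy data (made up) and a candidate hyperplane x1 + x2 = 0.
samples = [(1.0, 1.0), (3.0, 2.0), (-1.0, -1.0), (-2.0, -4.0)]
w, w0 = (1.0, 1.0), 0.0

dists = distances_to_hyperplane(samples, w, w0)
margin = min(dists)
# Samples attaining the minimum distance play the role of support vectors.
support = [x for x, d in zip(samples, dists) if math.isclose(d, margin)]
print(margin, support)
```

Here the two samples closest to the hyperplane, one from each side, determine the margin; moving any of the other samples slightly would not change it.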
10 Margin of separation and support vectors (cont’d) (Figure panels: different solutions, corresponding margins.)
11 SVM Overview SVMs are primarily two-class classifiers but can be extended to multiple classes. They perform structural risk minimization to achieve good generalization performance. The optimization criterion is the margin of separation between classes. Training is equivalent to solving a quadratic programming problem with linear constraints.
12 Linear SVM: separable case Linear discriminant: decide ω1 if g(x) > 0 and ω2 if g(x) < 0. An equivalent problem is then stated in terms of the class labels.
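One common way to write the quantities on this slide is sketched below. The symbols w_0 for the bias, z_k for the class label of sample x_k, and the ±1 encoding are assumptions (they follow a widely used textbook convention); the slide's own notation may differ.

```latex
g(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + w_0, \qquad
z_k = \begin{cases} +1 & \text{if } \mathbf{x}_k \in \omega_1 \\ -1 & \text{if } \mathbf{x}_k \in \omega_2 \end{cases}, \qquad
z_k\, g(\mathbf{x}_k) > 0 \quad \text{for } k = 1, \dots, n
```

With this encoding, the two decision rules collapse into the single condition that every training sample satisfies z_k g(x_k) > 0.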
13 Linear SVM: separable case (cont’d) The distance of a point xk from the separating hyperplane should satisfy a margin constraint. To constrain the length of w (for uniqueness), an additional normalization is imposed, which simplifies that constraint.
14 Linear SVM: separable case (cont’d) Maximizing the margin leads to a quadratic programming problem.
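A standard way to write this quadratic programming problem is sketched below. It assumes labels z_k in {+1, -1}, a discriminant of the form w^T x + w_0, and the usual normalization in which the nearest samples satisfy z_k (w^T x_k + w_0) = 1, so that they lie at distance 1/||w|| from the hyperplane. These conventions are assumptions and may not match the slide's notation exactly.

```latex
\min_{\mathbf{w},\, w_0} \ \tfrac{1}{2}\,\|\mathbf{w}\|^{2}
\quad \text{subject to} \quad
z_k\,(\mathbf{w}^{T}\mathbf{x}_k + w_0) \ge 1, \qquad k = 1, \dots, n
```

Minimizing ||w|| under these constraints is the same as maximizing the distance 1/||w|| of the nearest samples, and the objective is quadratic with linear constraints.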
15 Linear SVM: separable case (cont’d) Using Lagrange optimization, a Lagrangian function is minimized. It is easier to solve the “dual” problem (Kuhn-Tucker construction).
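For reference, the textbook form of the Lagrangian and of its dual is sketched below, assuming labels z_k in {+1, -1}, the constraints z_k (w^T x_k + w_0) >= 1, and multipliers λ_k >= 0. The notation is an assumption and the slide's version may be arranged differently.

```latex
L(\mathbf{w}, w_0, \boldsymbol{\lambda}) = \tfrac{1}{2}\|\mathbf{w}\|^{2}
 - \sum_{k=1}^{n} \lambda_k \left[ z_k(\mathbf{w}^{T}\mathbf{x}_k + w_0) - 1 \right],
 \qquad \lambda_k \ge 0

\max_{\boldsymbol{\lambda}} \ \sum_{k=1}^{n} \lambda_k
 - \tfrac{1}{2} \sum_{k=1}^{n}\sum_{j=1}^{n} \lambda_k \lambda_j\, z_k z_j\, \mathbf{x}_j^{T}\mathbf{x}_k
 \quad \text{subject to} \quad \sum_{k=1}^{n} z_k \lambda_k = 0, \quad \lambda_k \ge 0
```

The dual depends on the training samples only through the dot products x_j^T x_k.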
16 Linear SVM: separable case (cont’d) The solution is expressed in terms of dot products between samples.
17 Linear SVM: separable case (cont’d) The discriminant involves only dot products. It can be shown that if xk is not a support vector, then the corresponding λk = 0. Only the support vectors contribute to the solution!
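The claim that only support vectors contribute can be checked on a tiny example. The sketch below assumes the usual expansion w = Σ λk zk xk with labels zk in {+1, -1} and a bias w0; the three toy samples and their multipliers were worked out by hand for illustration and are not the output of a solver.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Toy problem (made up): three samples on the x1 axis.
samples = [(1.0, 0.0), (-1.0, 0.0), (3.0, 0.0)]
labels = [1, -1, 1]
# Multipliers worked out by hand for this toy problem (not produced by a solver).
lambdas = [0.5, 0.5, 0.0]

# w = sum_k lambda_k * z_k * x_k
w = [sum(l * z * x[i] for l, z, x in zip(lambdas, labels, samples)) for i in range(2)]

# w0 from any support vector (lambda_k > 0), using z_k * (w . x_k + w0) = 1.
k = next(i for i, l in enumerate(lambdas) if l > 0)
w0 = labels[k] - dot(w, samples[k])


def g(x):
    """Discriminant written as a sum of dot products with the training samples."""
    return sum(l * z * dot(xk, x) for l, z, xk in zip(lambdas, labels, samples)) + w0


support_vectors = [x for x, l in zip(samples, lambdas) if l > 0]
print(w, w0, support_vectors, [g(x) for x in samples])
```

The third sample lies well beyond the margin, has a zero multiplier, and drops out of both w and g(x); deleting it from the training set would leave the hyperplane unchanged.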
18 Linear SVM: non-separable case Allow misclassifications (i.e., a soft margin classifier) by introducing positive error (slack) variables ψk.
19 Linear SVM: non-separable case (cont’d) The constant c controls the trade-off between margin and misclassification errors. This aims to prevent outliers from affecting the optimal hyperplane.
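The trade-off controlled by c can be illustrated by evaluating a soft-margin objective for two fixed hyperplanes. The sketch below assumes the common form 0.5·||w||² + c·Σ ψk with ψk = max(0, 1 - zk (w·xk + w0)) and labels zk in {+1, -1}. It only evaluates the objective and does not solve the quadratic program; the data and function names are made up for illustration.

```python
def slack_variables(samples, labels, w, w0):
    """psi_k = max(0, 1 - z_k * (w . x_k + w0)) for each training sample."""
    return [
        max(0.0, 1.0 - z * (sum(wi * xi for wi, xi in zip(w, x)) + w0))
        for x, z in zip(samples, labels)
    ]


def soft_margin_objective(samples, labels, w, w0, c):
    """0.5 * ||w||^2 + c * sum_k psi_k (assumed form of the objective)."""
    psi = slack_variables(samples, labels, w, w0)
    return 0.5 * sum(wi * wi for wi in w) + c * sum(psi)


# Toy 1-D data (made up); the last negative sample is an outlier.
samples = [(2.0,), (3.0,), (-2.0,), (-3.0,), (1.0,)]
labels = [1, 1, -1, -1, -1]

w_wide, w0_wide = (1.0,), 0.0       # wide margin, misclassifies the outlier
w_narrow, w0_narrow = (2.0,), -3.0  # narrow margin, classifies every sample correctly

for c in (0.1, 10.0):
    cost_wide = soft_margin_objective(samples, labels, w_wide, w0_wide, c)
    cost_narrow = soft_margin_objective(samples, labels, w_narrow, w0_narrow, c)
    print(c, cost_wide, cost_narrow)
```

For the small value of c the wide-margin hyperplane has the lower cost even though it misclassifies the outlier; for the large value the narrow-margin hyperplane wins because slack is penalized heavily.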
20 Linear SVM: non-separable case (cont’d) As in the separable case, it is easier to solve the “dual” problem (Kuhn-Tucker construction).
21 Nonlinear SVM Extending these concepts to the non-linear case involves mapping the data to a high-dimensional space h. Mapping the data to a sufficiently high-dimensional space is likely to make the data linearly separable in that space.
24 Nonlinear SVM (cont’d) The disadvantage of this approach is that the mapping Φ() might be very computationally intensive to compute! Is there an efficient way to compute the dot products that the non-linear SVM requires?
25 The kernel trick Compute dot products using a kernel function.
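A small numerical check makes the idea concrete. The sketch below uses a homogeneous degree-2 polynomial kernel on 2-D inputs, chosen here only as a familiar example (the slide does not name a specific kernel), and compares it with the dot product under an explicit mapping. The names `phi` and `poly_kernel` are hypothetical.

```python
import math


def phi(x):
    """Explicit map for 2-D inputs: (x1^2, sqrt(2) * x1 * x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel K(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


x, y = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))  # dot product in the mapped space
implicit = poly_kernel(x, y)                           # same value, no mapping needed
print(explicit, implicit)
```

Both routes give the same number up to floating-point rounding, but the kernel only ever touches the original 2-D vectors.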
26 The kernel trick (cont’d) Comments: Kernel functions which can be expressed as a dot product in some space satisfy Mercer’s condition (see Burges’ paper). Mercer’s condition does not tell us how to construct Φ() or even what the high-dimensional space is. Advantages of the kernel trick: there is no need to know Φ(), and computations remain feasible even if the feature space has high dimensionality.
37 Comments SVM is based on exact optimization, not on approximate methods (i.e., it is a global optimization method with no local optima). It appears to avoid overfitting in high-dimensional spaces and to generalize well using a small training set. Performance depends on the choice of the kernel and its parameters. Its complexity depends on the number of support vectors, not on the dimensionality of the transformed space.