2 Outline
- The problem: finding a sparse decision (and regression) machine that uses kernels
- The solution: Support Vector Machines (SVMs) and Relevance Vector Machines (RVMs)
- The core ideas behind the solutions
- The mathematical details
3 The problem (1)
Methods introduced in chapters 3 and 4:
- Take into account all data points in the training set -> cumbersome
- Do not take advantage of kernel methods -> basis functions have to be explicit
Example: least squares and logistic regression
4 The problem (2)
Kernel methods require evaluation of the kernel function $k(\mathbf{x}_n, \mathbf{x}_m)$ for all pairs of training points -> cumbersome
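To make the cost concrete, here is a minimal sketch that builds the full kernel (Gram) matrix for a handful of points. The Gaussian kernel, the value of `gamma`, and the sample points are illustrative assumptions, not taken from the slides.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel; gamma = 1.0 is an illustrative choice."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gram_matrix(points, kernel):
    """Evaluate the kernel for every pair of training points."""
    return [[kernel(xn, xm) for xm in points] for xn in points]

# Hypothetical training inputs
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
K = gram_matrix(points, rbf_kernel)

# N points require N * N kernel evaluations (N * (N + 1) / 2 if symmetry is used)
n_evaluations = len(K) * len(K[0])
```

The number of evaluations grows quadratically with the size of the training set, which is the motivation for solutions that keep only a subset of the points.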
5 The solution (1)
Support vector machines (SVMs) are kernel machines that compute a decision boundary making sparse use of data points
6 The solution (2)
Relevance vector machines (RVMs) are kernel machines that compute a posterior class probability making sparse use of data points
7 The solution (3)
SVMs as well as RVMs can also be used for regression
even sparser!
8 SVM: The core idea (1)
The class separator that maximizes the margin between itself and the nearest data points will have the smallest generalization error.
11 RVM: The core idea (1)
Exclude basis vectors whose presence reduces the probability of the observed data
12 RVM: The core idea (2)
For classification and regression:
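13 SVM: The details (1)
Equation of the decision surface: $y(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b = 0$
Distance of a point $\mathbf{x}_n$ with target $t_n \in \{-1, +1\}$ from the decision surface: $\frac{t_n y(\mathbf{x}_n)}{\|\mathbf{w}\|} = \frac{t_n (\mathbf{w}^T \phi(\mathbf{x}_n) + b)}{\|\mathbf{w}\|}$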
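14 SVM: The details (2)
Distance of a point from the decision surface: $\frac{t_n (\mathbf{w}^T \phi(\mathbf{x}_n) + b)}{\|\mathbf{w}\|}$
Maximum margin solution: $\arg\max_{\mathbf{w}, b} \left\{ \frac{1}{\|\mathbf{w}\|} \min_n \left[ t_n (\mathbf{w}^T \phi(\mathbf{x}_n) + b) \right] \right\}$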
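As a small illustration of the margin criterion, the sketch below computes the distance of each point from a linear decision surface and compares the margins of two candidate separators. The identity feature map $\phi(\mathbf{x}) = \mathbf{x}$, the data, and the two candidates are assumptions made for the example.

```python
import math

def decision_value(w, b, x):
    """y(x) = w^T phi(x) + b, assuming phi(x) = x (linear case)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance(w, b, x, t):
    """Distance t * y(x) / ||w||; positive if x is correctly classified."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return t * decision_value(w, b, x) / norm_w

def margin(w, b, X, T):
    """Margin of a separator: distance of the closest training point."""
    return min(distance(w, b, x, t) for x, t in zip(X, T))

# Hypothetical, linearly separable toy data with targets in {-1, +1}
X = [(2.0, 0.0), (3.0, 1.0), (-1.0, 0.0), (-2.0, -1.0)]
T = [1, 1, -1, -1]

# Two hypothetical candidate separators (w, b)
candidates = {"A": ((1.0, 0.0), -0.5), "B": ((1.0, 1.0), 0.0)}
margins = {name: margin(w, b, X, T) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)  # the candidate with the larger margin
```

An SVM searches over all $(\mathbf{w}, b)$ rather than a fixed list of candidates; the comparison here only shows what quantity is being maximized.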
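15 SVM: The details (3)
Distance of a point from the decision surface: $\frac{t_n (\mathbf{w}^T \phi(\mathbf{x}_n) + b)}{\|\mathbf{w}\|}$
This distance is unchanged by the rescaling $\mathbf{w} \to \kappa \mathbf{w}$, $b \to \kappa b$. We therefore may rescale $\mathbf{w}$ and $b$ such that $t_n (\mathbf{w}^T \phi(\mathbf{x}_n) + b) = 1$ for the point closest to the surface.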
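The rescaling argument can be checked numerically. The sketch below divides a separating $(\mathbf{w}, b)$ by the smallest value of $t_n y(\mathbf{x}_n)$ and confirms that the margin stays the same; the linear feature map, the data, and the starting parameters are assumptions chosen for illustration.

```python
import math

def t_times_y(w, b, x, t):
    """t * (w^T x + b), assuming phi(x) = x."""
    return t * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

def canonical_form(w, b, X, T):
    """Rescale (w, b) so that the closest point satisfies t * y(x) = 1.

    Assumes (w, b) separates the data, so the scale factor is positive.
    """
    kappa = min(t_times_y(w, b, x, t) for x, t in zip(X, T))
    return tuple(wi / kappa for wi in w), b / kappa

# Hypothetical toy data and an arbitrarily scaled separator
X = [(2.0, 0.0), (3.0, 1.0), (-1.0, 0.0), (-2.0, -1.0)]
T = [1, 1, -1, -1]
w, b = (4.0, 0.0), -2.0

margin_before = min(t_times_y(w, b, x, t) for x, t in zip(X, T)) / norm(w)
w_c, b_c = canonical_form(w, b, X, T)
closest = min(t_times_y(w_c, b_c, x, t) for x, t in zip(X, T))
margin_after = closest / norm(w_c)  # equals 1 / ||w_c|| in canonical form
```

In canonical form the margin is simply $1 / \|\mathbf{w}\|$, which is why maximizing the margin becomes a problem in $\|\mathbf{w}\|$ alone.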
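16 SVM: The details (4)
Therefore, we can reduce $\arg\max_{\mathbf{w}, b} \left\{ \frac{1}{\|\mathbf{w}\|} \min_n \left[ t_n (\mathbf{w}^T \phi(\mathbf{x}_n) + b) \right] \right\}$ to $\arg\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2$ under the constraint $t_n (\mathbf{w}^T \phi(\mathbf{x}_n) + b) \geq 1$ for $n = 1, \dots, N$.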
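17 SVM: The details (5)
To solve this, we introduce Lagrange multipliers $a_n \geq 0$ and minimize $L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n (\mathbf{w}^T \phi(\mathbf{x}_n) + b) - 1 \right\}$ with respect to $\mathbf{w}$ and $b$.
Equivalently, we can maximize the dual representation $\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$ subject to $a_n \geq 0$ and $\sum_{n=1}^{N} a_n t_n = 0$, where the kernel function $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$ can be chosen without specifying $\phi$ explicitly.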
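The step from the Lagrangian to the dual can be spelled out as follows. This is a sketch of the standard derivation, using the notation $a_n$ for the multipliers and $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$ for the kernel.

```latex
\begin{align}
L(\mathbf{w}, b, \mathbf{a})
  &= \tfrac{1}{2}\|\mathbf{w}\|^2
     - \sum_{n=1}^{N} a_n \left\{ t_n \left( \mathbf{w}^T \phi(\mathbf{x}_n) + b \right) - 1 \right\} \\
\frac{\partial L}{\partial \mathbf{w}} = 0
  &\;\Rightarrow\; \mathbf{w} = \sum_{n=1}^{N} a_n t_n \phi(\mathbf{x}_n) \\
\frac{\partial L}{\partial b} = 0
  &\;\Rightarrow\; \sum_{n=1}^{N} a_n t_n = 0 \\
\tilde{L}(\mathbf{a})
  &= \sum_{n=1}^{N} a_n
     - \tfrac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N}
       a_n a_m t_n t_m \, k(\mathbf{x}_n, \mathbf{x}_m)
\end{align}
```

Substituting the expression for $\mathbf{w}$ removes every explicit occurrence of $\phi$; only inner products $\phi(\mathbf{x}_n)^T \phi(\mathbf{x}_m)$ remain, and the term in $b$ vanishes because of the second condition.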
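18 SVM: The details (6)
Because of the constraint $a_n \left\{ t_n y(\mathbf{x}_n) - 1 \right\} = 0$, only those $a_n$ survive for which $\mathbf{x}_n$ is on the margin, i.e. $t_n y(\mathbf{x}_n) = 1$. This leads to sparsity.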
19 SVM: The details (7)
Based on numerical optimization of the parameters $a_n$ and $b$, predictions on new data points can be made by evaluating the sign of $y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$
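A minimal sketch of the prediction step, assuming a linear kernel and a three-point toy problem whose dual solution can be worked out by hand (multipliers 0.5, 0.5, 0 and offset 0). The data and all function names are hypothetical; in practice the multipliers come from a numerical optimizer.

```python
def linear_kernel(x, z):
    """k(x, z) = x^T z (illustrative kernel choice)."""
    return sum(xi * zi for xi, zi in zip(x, z))

def decision_value(x, X, T, a, b, kernel):
    """y(x) = sum_n a_n t_n k(x, x_n) + b; points with a_n = 0 are skipped."""
    return sum(an * tn * kernel(x, xn)
               for xn, tn, an in zip(X, T, a) if an > 0) + b

def predict(x, X, T, a, b, kernel):
    """Class label given by the sign of y(x)."""
    return 1 if decision_value(x, X, T, a, b, kernel) >= 0 else -1

# Toy problem: the max-margin separator is w = (1, 0), b = 0
X = [(1.0, 0.0), (-1.0, 0.0), (3.0, 0.0)]
T = [1, -1, 1]
a = [0.5, 0.5, 0.0]  # third point lies beyond the margin, so its multiplier is 0
b = 0.0

support_vectors = [xn for xn, an in zip(X, a) if an > 0]
y_new = decision_value((2.0, 1.0), X, T, a, b, linear_kernel)
label_new = predict((-0.5, 4.0), X, T, a, b, linear_kernel)
```

Only the two support vectors enter the sum, so the third training point could be discarded after training without changing any prediction.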
20 SVM: The details (8)
In cases where the data points are not separable in feature space, we need a soft margin, i.e. a (limited) tolerance for misclassified points. To achieve this, we introduce slack variables $\xi_n \geq 0$ with $t_n y(\mathbf{x}_n) \geq 1 - \xi_n$
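The role of the slack variables can be illustrated with a short sketch. For a given decision value $y(\mathbf{x}_n)$, the smallest slack satisfying $\xi_n \geq 0$ and $t_n y(\mathbf{x}_n) \geq 1 - \xi_n$ is $\max(0, 1 - t_n y(\mathbf{x}_n))$. The (target, decision value) pairs below are made up for illustration.

```python
def slack(t, y):
    """Smallest xi >= 0 such that t * y >= 1 - xi."""
    return max(0.0, 1.0 - t * y)

# Hypothetical (target, decision value) pairs
examples = [(1, 2.0), (1, 1.0), (1, 0.4), (-1, 0.5), (-1, -3.0)]
slacks = [slack(t, y) for t, y in examples]

# xi = 0      : on or beyond the margin, correct side
# 0 < xi <= 1 : inside the margin but still correctly classified
# xi > 1      : misclassified
misclassified = [xi > 1.0 for xi in slacks]
```

Here the third point sits inside the margin and the fourth is misclassified; the remaining points need no slack.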
22 SVM: The details (10)
The same procedure as before (with additional Lagrange multipliers and corresponding additional constraints) again yields a sparse kernel-based solution: $y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$
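23 SVM: The details (11)
The soft-margin approach can be formulated as minimizing the regularized error function $C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|\mathbf{w}\|^2$
This formulation can be extended to use SVMs for regression: $C \sum_{n=1}^{N} (\xi_n + \hat{\xi}_n) + \frac{1}{2} \|\mathbf{w}\|^2$, where $\xi_n$ and $\hat{\xi}_n$ are slack variables describing the position of a data point above or below a tube of width $2\epsilon$ around the estimate $y$.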
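A short sketch of the tube picture for regression: a target inside the tube $y \pm \epsilon$ costs nothing, a target above it contributes $\xi$, and a target below it contributes $\hat{\xi}$. The function names, the value of `eps`, and the sample values are assumptions for illustration.

```python
def tube_slacks(t, y, eps):
    """Return (xi, xi_hat) for target t and estimate y."""
    xi = max(0.0, t - y - eps)      # target lies above the tube
    xi_hat = max(0.0, y - t - eps)  # target lies below the tube
    return xi, xi_hat

def eps_insensitive_loss(t, y, eps):
    """Zero inside the tube, linear outside; equals xi + xi_hat."""
    return max(0.0, abs(y - t) - eps)

eps = 0.5  # hypothetical tube half-width
above = tube_slacks(2.0, 1.0, eps)    # target 1.0 above the estimate
inside = tube_slacks(1.0, 1.25, eps)  # target within the tube
below = tube_slacks(0.0, 2.0, eps)    # target 2.0 below the estimate
```

At most one of the two slacks is nonzero for any point, and points inside the tube do not contribute to the error term at all, which is the source of sparsity in the regression case.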
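25 SVM: The details (13)
Again, optimization using Lagrange multipliers $a_n$ and $\hat{a}_n$ yields a sparse kernel-based solution: $y(\mathbf{x}) = \sum_{n=1}^{N} (a_n - \hat{a}_n) k(\mathbf{x}, \mathbf{x}_n) + b$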
26 SVM: Limitations
- Output is a decision, not a posterior probability
- Extension of classification to more than two classes is problematic
- The parameters C and ϵ have to be found by methods such as cross-validation
- Kernel functions are required to be positive definite