SVM — Support Vector Machines A new classification method for both linear and nonlinear data It uses a nonlinear mapping to transform the original training.


1 SVM: Support Vector Machines
A new classification method for both linear and nonlinear data
It uses a nonlinear mapping to transform the original training data into a higher dimension
In the new dimension, it searches for the linear optimal separating hyperplane (i.e., the “decision boundary”)
With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)
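The claim that a nonlinear mapping can make data linearly separable can be illustrated with a small Python sketch. The data set, the mapping phi, and the weights below are made up for illustration and are not from the slides.

```python
# Toy 1-D data (hypothetical): the +1 class sits on both sides of the -1 class.
points = [-2.0, -1.0, 0.0, 1.0, 2.0]
labels = [+1, -1, -1, -1, +1]


def separable_by_threshold(xs, ys):
    """Return True if a single threshold on x splits the two classes perfectly."""
    ordered = [y for _, y in sorted(zip(xs, ys))]
    # In 1-D the classes are separable only if the sorted labels change at most once.
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1


def phi(x):
    """Assumed nonlinear mapping from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)


def classify(z, w=(0.0, 1.0), b=-2.5):
    """Linear decision rule in the mapped space: sign of w . z + b."""
    score = w[0] * z[0] + w[1] * z[1] + b
    return +1 if score > 0 else -1


print(separable_by_threshold(points, labels))        # False
print([classify(phi(x)) for x in points] == labels)  # True
```

In the original space no threshold works, but after mapping to (x, x^2) the hand-picked hyperplane x^2 = 2.5 separates the classes.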

2 SVM: History and Applications
Vapnik and colleagues (1992); groundwork from Vapnik & Chervonenkis’ statistical learning theory in the 1960s
Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization)
Used for both classification and prediction
Applications:
–handwritten digit recognition, object recognition, speaker identification, benchmark time-series prediction tests


11 SVM: Linearly Separable
A separating hyperplane can be written as W · X + b = 0, where W = {w_1, w_2, …, w_n} is a weight vector and b is a scalar (bias)
For 2-D, with the bias written as w_0, it can be written as w_0 + w_1 x_1 + w_2 x_2 = 0
The hyperplanes defining the sides of the margin:
H_1: w_0 + w_1 x_1 + w_2 x_2 ≥ 1 for y_i = +1, and
H_2: w_0 + w_1 x_1 + w_2 x_2 ≤ –1 for y_i = –1
Any training tuples that fall on hyperplanes H_1 or H_2 (i.e., the sides defining the margin) are support vectors
This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints → Quadratic Programming (QP) → Lagrangian multipliers
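The margin constraints above can be checked numerically. The following is a minimal Python sketch with a hand-picked weight vector and toy data (both assumptions, not from the slides). It also uses the standard fact, not stated on the slide, that the distance between H_1 and H_2 is 2 / ||W||.

```python
import math


def decision_value(w, x, b):
    """Compute W . X + b for one tuple X."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def satisfies_margin(w, b, data):
    """Check y_i * (W . X_i + b) >= 1 for every training tuple (X_i, y_i)."""
    return all(y * decision_value(w, x, b) >= 1 for x, y in data)


def margin_width(w):
    """Distance between H_1 and H_2, i.e. 2 / ||W||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical 2-D training tuples and a hand-picked hyperplane.
data = [((1.0, 1.0), +1), ((3.0, 3.0), +1), ((-1.0, -1.0), -1)]
w, b = (0.5, 0.5), 0.0

# Tuples lying exactly on H_1 or H_2 are the support vectors.
on_margin = [x for x, y in data if math.isclose(y * decision_value(w, x, b), 1.0)]

print(satisfies_margin(w, b, data))
print(on_margin)
print(margin_width(w))  # about 2.83, the distance between (1, 1) and (-1, -1)
```

Here (1, 1) and (-1, -1) fall on the margin hyperplanes, while (3, 3) lies beyond H_1 and does not constrain the solution.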

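12 Support vectors
The support vectors define the maximum margin hyperplane!
–All other instances can be deleted without changing its position and orientation
This means the hyperplane can be written in terms of the support vectors alone: b + Σ_i α_i y_i (X_i · X) = 0, where the sum runs over the support vectors X_i with class labels y_i, and the α_i and b are learned parameters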
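A small Python sketch of why the other instances can be deleted, assuming the usual dual-form decision function f(X) = b + Σ_i α_i y_i (X_i · X). The training tuples and α values are hand-picked for illustration; in practice they come from the optimizer.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def decision_value(x, training, alphas, b):
    """f(X) = b + sum_i alpha_i * y_i * (X_i . X)."""
    return b + sum(a * y * dot(xi, x) for (xi, y), a in zip(training, alphas))


# Hypothetical training tuples (X_i, y_i) and multipliers alpha_i.
training = [((1.0, 1.0), +1), ((3.0, 3.0), +1), ((-1.0, -1.0), -1)]
alphas = [0.25, 0.0, 0.25]  # (3, 3) has alpha = 0, so it is not a support vector
b = 0.0

# Keep only the instances with alpha_i > 0.
support = [(t, a) for t, a in zip(training, alphas) if a > 0]
sv_training = [t for t, _ in support]
sv_alphas = [a for _, a in support]

query = (2.0, 0.0)
full = decision_value(query, training, alphas, b)
reduced = decision_value(query, sv_training, sv_alphas, b)
print(full, reduced)  # both values are identical
```

Dropping the instance with α = 0 leaves every decision value unchanged, which is the sense in which the support vectors alone define the hyperplane.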

13 Finding support vectors
Support vector: training instance for which α_i > 0
How to determine α_i and b? This is a constrained quadratic optimization problem
–Off-the-shelf tools exist for solving these problems
–However, special-purpose algorithms are faster
–Example: Platt’s sequential minimal optimization algorithm (implemented in WEKA)
Note: all this assumes separable data!

15 Extending linear classification
Linear classifiers can’t model nonlinear class boundaries
Simple trick:
–Map attributes into a new space consisting of combinations of attribute values
–E.g.: all products of n factors that can be constructed from the attributes
Example with two attributes a_1, a_2 and n = 3: x = w_1 a_1^3 + w_2 a_1^2 a_2 + w_3 a_1 a_2^2 + w_4 a_2^3
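The mapping into products of n factors can be sketched in a few lines of Python. The helper name pseudo_attributes and the sample attribute values are assumptions made for illustration.

```python
from itertools import combinations_with_replacement
from math import prod


def pseudo_attributes(attrs, n):
    """All products of n factors drawn (with repetition) from the attributes."""
    return [prod(combo) for combo in combinations_with_replacement(attrs, n)]


# Two attributes a_1 = 2 and a_2 = 3 with n = 3 give the four products
# a_1^3, a_1^2 * a_2, a_1 * a_2^2, a_2^3.
a1, a2 = 2.0, 3.0
print(pseudo_attributes((a1, a2), 3))  # [8.0, 12.0, 18.0, 27.0]
```

A linear model fitted on these four values is a cubic model in the original two attributes.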

18 Nonlinear SVMs
“Pseudo attributes” represent attribute combinations
Overfitting is usually not a problem because the maximum margin hyperplane is stable
–There are usually few support vectors relative to the size of the training set
Computation time is still an issue
–Each time the dot product is computed, all the “pseudo attributes” must be included

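19 A mathematical trick
Avoid computing the “pseudo attributes”!
Compute the dot product before doing the nonlinear mapping
Example: for the decision function f(X) = b + Σ_i α_i y_i (X_i · X), where the sum runs over the support vectors X_i, compute f(X) = b + Σ_i α_i y_i (X_i · X)^n instead
Corresponds to a map into the instance space spanned by all products of n attributes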
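The equivalence behind this trick can be verified numerically for two attributes. The Python sketch below compares (X · Y)^n with the dot product of explicitly mapped vectors. The square-root binomial weights in explicit_map are an assumption needed to make the two sides match exactly; they rescale the products but span the same space.

```python
from math import comb, isclose, sqrt


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def explicit_map(x, n):
    """Map (x1, x2) to the weighted products x1^(n-k) * x2^k for k = 0..n."""
    x1, x2 = x
    return [sqrt(comb(n, k)) * x1 ** (n - k) * x2 ** k for k in range(n + 1)]


def poly_kernel(x, y, n):
    """Dot product first, nonlinearity second: (X . Y)^n."""
    return dot(x, y) ** n


x, y, n = (1.0, 2.0), (3.0, 0.5), 3
slow = dot(explicit_map(x, n), explicit_map(y, n))  # builds the pseudo attributes
fast = poly_kernel(x, y, n)                         # never builds them
print(isclose(slow, fast))
```

The fast version costs one dot product in the original space regardless of n, while the slow version grows with the number of pseudo attributes.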

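20 Other kernel functions
The function that replaces the dot product is called a “kernel function”
Polynomial kernel: f(X) = b + Σ_i α_i y_i (X_i · X)^n
We can use others: f(X) = b + Σ_i α_i y_i K(X_i, X)
Only requirement: K(X_i, X_j) = Φ(X_i) · Φ(X_j) for some mapping Φ
Examples:
–Polynomial of degree d: K(X_i, X_j) = (X_i · X_j + 1)^d
–Gaussian radial basis function: K(X_i, X_j) = exp(–‖X_i – X_j‖^2 / (2σ^2))
–Sigmoid: K(X_i, X_j) = tanh(β X_i · X_j + δ)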
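A minimal Python sketch of a kernelized decision function with interchangeable kernels. The three kernels shown are common textbook choices and not necessarily the ones on the original slide; all parameter names and default values are assumptions. The sigmoid kernel satisfies the dot-product requirement only for some parameter settings.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def polynomial_kernel(x, y, degree=3, c=1.0):
    """(X . Y + c)^degree"""
    return (dot(x, y) + c) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||X - Y||^2); gamma plays the role of 1 / (2 * sigma^2)."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, beta=0.1, delta=0.0):
    """tanh(beta * X . Y + delta)"""
    return math.tanh(beta * dot(x, y) + delta)


def kernel_decision_value(x, support_vectors, labels, alphas, b, kernel):
    """f(X) = b + sum_i alpha_i * y_i * K(X_i, X)."""
    return b + sum(
        a * y * kernel(sv, x) for sv, y, a in zip(support_vectors, labels, alphas)
    )


# Hypothetical support vectors, labels, and multipliers.
svs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [+1, -1]
alphas = [0.25, 0.25]
print(kernel_decision_value((2.0, 0.0), svs, ys, alphas, 0.0, rbf_kernel))
```

Swapping the kernel argument changes the shape of the decision boundary without touching the rest of the code.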

21 Problems with this approach
1st problem: speed
–10 attributes and n = 5 → more than 2000 coefficients
–Use linear regression with attribute selection
–Run time is cubic in the number of attributes
2nd problem: overfitting
–Number of coefficients is large relative to the number of training instances
–Curse of dimensionality kicks in
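The coefficient count can be checked directly: the number of distinct products of n factors drawn from m attributes is the number of multisets of size n, C(m + n - 1, n). A short Python sketch (the function name is made up for illustration):

```python
from math import comb


def num_coefficients(num_attributes, n):
    """Number of distinct products of n factors from num_attributes attributes."""
    return comb(num_attributes + n - 1, n)


print(num_coefficients(10, 5))  # 2002
print(num_coefficients(2, 3))   # 4
```

With 10 attributes and n = 5 this gives 2002 coefficients, which matches the "more than 2000" figure above.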

22 Sparse data
SVM algorithms speed up dramatically if the data is sparse (i.e., many values are 0)
Why? Because they compute lots and lots of dot products
Sparse data → dot products can be computed very efficiently
–Iterate only over non-zero values
SVMs can process sparse datasets with tens of thousands of attributes
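One way to iterate only over non-zero values is to store each instance as a mapping from attribute index to value. The Python sketch below assumes that dictionary representation, which is one of several possible sparse formats.

```python
def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the shorter vector so the work depends on its non-zero count.
    if len(u) > len(v):
        u, v = v, u
    return sum(value * v[i] for i, value in u.items() if i in v)


# Two instances with 10,000 possible attributes but only three non-zeros each.
u = {3: 2.0, 5000: 1.5, 9999: -1.0}
v = {3: 4.0, 42: 7.0, 9999: 0.5}
print(sparse_dot(u, v))  # 7.5
```

The cost depends on the number of stored entries, not on the nominal number of attributes.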

23 Applications
Machine vision: e.g., face identification
–Outperforms alternative approaches (1.5% error)
Handwritten digit recognition: USPS data
–Comparable to best alternative (0.8% error)
Bioinformatics: e.g., prediction of protein secondary structure
Text classification
Can modify SVM technique for numeric prediction problems

