Presentation on theme: "The value of kernel function represents the inner product of two training points in feature space Kernel functions merge two steps 1. map input data from."— Presentation transcript:

1 Kernel Technique: Based on Mercer's Condition (1909)
The value of a kernel function represents the inner product of two training points in feature space. Kernel functions merge two steps: (1) map the input data from input space to feature space (which might be infinite-dimensional); (2) compute the inner product in the feature space.

2 A Simple Example of a Kernel
Polynomial kernel of degree 2: let x, z ∈ R² and let the nonlinear map φ: R² → R³ be defined by φ(x) = (x₁², √2·x₁x₂, x₂²). Then φ(x)ᵀφ(z) = (xᵀz)². There are many other nonlinear maps ψ that satisfy the same relation ψ(x)ᵀψ(z) = (xᵀz)².
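A quick numerical check of this identity, as a minimal NumPy sketch (the test points x and z are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map on R^2: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))   # inner product in feature space: 1.0
print((x @ z) ** 2)      # kernel evaluated in input space: 1.0
```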

3 Power of the Kernel Technique
Consider a nonlinear map φ: Rⁿ → Rᵖ whose features are all the distinct monomials of degree d (suitably weighted, as with the √2 factor above); then p = C(n+d−1, d), which grows explosively with n and d. Is it necessary to compute φ(x) explicitly? No: we only need the value of φ(x)ᵀφ(z)! This can be achieved by evaluating (xᵀz)^d directly in the input space.
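To make the dimension explosion concrete, here is a small sketch (the choice n = 100, d = 5 is illustrative, not from the slides):

```python
from math import comb
import numpy as np

n, d = 100, 5                 # input dimension and monomial degree (illustrative)
p = comb(n + d - 1, d)        # number of distinct monomials of degree d
print(p)                      # 91962520-dimensional explicit feature space

rng = np.random.default_rng(0)
x, z = rng.standard_normal(n), rng.standard_normal(n)
k = (x @ z) ** d              # same inner product, O(n) work, no p-vector built
print(k)
```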

4 More Examples of Kernels
 Polynomial kernel: K(x, z) = (xᵀz + b)^d, where d is a positive integer (with d = 1 and b = 0 this reduces to the linear kernel K(x, z) = xᵀz)
 Gaussian (radial basis) kernel: K(x, z) = exp(−μ‖x − z‖²), with μ > 0
 For a data matrix A with rows Aᵢ, the ij-entry of the kernel matrix K(A, Aᵀ) represents the "similarity" of data points Aᵢ and Aⱼ
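Minimal NumPy implementations of these kernel matrices (the parameter defaults b, d, and mu are assumptions for illustration):

```python
import numpy as np

def polynomial_kernel(A, B, b=1.0, d=2):
    """K(A, B)_ij = (A_i . B_j + b)^d; d = 1, b = 0 recovers the linear kernel."""
    return (A @ B.T + b) ** d

def gaussian_kernel(A, B, mu=0.1):
    """K(A, B)_ij = exp(-mu * ||A_i - B_j||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-mu * sq)

A = np.random.default_rng(0).standard_normal((5, 3))
K = gaussian_kernel(A, A)     # K[i, j] is the "similarity" of points i and j
```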

5 Nonlinear SVM Motivation
 Linear SVM (linear separator: xᵀw = γ): min over (w, γ, ξ ≥ 0) of ν·eᵀξ + ½·wᵀw subject to D(Aw − eγ) + ξ ≥ e, a quadratic program (QP), where D is the diagonal matrix of ±1 labels
 By QP "duality", w = AᵀDu, so maximizing the margin in the "dual space" gives a minimization problem in the dual variables u in which the data enter only through the Gram matrix AAᵀ
 Dual SSVM with separator xᵀAᵀDu = γ: min over (u, γ) of ν/2·‖(e − D(AAᵀDu − eγ))₊‖² + ½(uᵀu + γ²)

6 Nonlinear Smooth SVM
 Replace AAᵀ by a nonlinear kernel K(A, Aᵀ): min over (u, γ) of ν/2·‖(e − D(K(A, Aᵀ)Du − eγ))₊‖² + ½(uᵀu + γ²), with the plus function smoothed
 Use the Newton-Armijo algorithm to solve the problem; each iteration solves m+1 linear equations in m+1 variables
 The nonlinear classifier depends on the entire dataset A. Nonlinear classifier: f(x) = sign(K(xᵀ, Aᵀ)Du − γ)
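A minimal sketch of this formulation in Python. It minimizes the smoothed objective with SciPy's BFGS rather than the slides' Newton-Armijo algorithm, and the smoothing parameter alpha and use of scipy.optimize.minimize are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def smooth_plus(x, alpha=5.0):
    # Smooth approximation of the plus function max(x, 0):
    # x + log(1 + exp(-alpha*x)) / alpha, written stably via logaddexp.
    return np.logaddexp(alpha * x, 0.0) / alpha

def fit_smooth_svm(K, d, nu=1.0):
    """Minimize nu/2*||p(e - D(K D u - e*gamma))||^2 + 1/2*(u'u + gamma^2)
    over the m+1 variables (u, gamma); K is the m x m kernel matrix,
    d the vector of +/-1 labels."""
    m = len(d)

    def objective(theta):
        u, gamma = theta[:m], theta[m]
        r = smooth_plus(1.0 - d * (K @ (d * u) - gamma))
        return 0.5 * nu * r @ r + 0.5 * (u @ u + gamma**2)

    res = minimize(objective, np.zeros(m + 1), method="BFGS")
    return res.x[:m], res.x[m]   # classifier: sign(K(x', A') D u - gamma)
```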

7 Difficulties with Nonlinear SVM for Large Problems
 The nonlinear kernel K(A, Aᵀ) is fully dense, and computational complexity depends on the number of examples m
 The separating surface depends on almost the entire dataset
 Consequences for large m: the solver needs to generate and store m² kernel entries, so it runs out of memory while storing the kernel matrix and takes long CPU time to compute the dense kernel matrix
 The entire dataset must be stored even after the problem is solved

8 Solving the SVM with a Massive Dataset
 Standard optimization techniques require that the data be held in memory, which limits the SVM to datasets of a few thousand points
 Solution I: SMO (Sequential Minimal Optimization): repeatedly solve the sub-optimization problem defined by a working set of size 2, increasing the objective function iteratively
 Solution II: RSVM (Reduced Support Vector Machine)

9 Reduced Support Vector Machine
(i) Choose a random subset matrix Ā ∈ R^{m̄×n} of the entire data matrix A ∈ R^{m×n}, with m̄ ≪ m. (ii) Solve the following problem by Newton's method: min over (ū, γ) of ν/2·‖(e − D(K(A, Āᵀ)ū − eγ))₊‖² + ½(ūᵀū + γ²). (iii) The nonlinear classifier is defined by the optimal solution (ū, γ) of step (ii). Nonlinear classifier: f(x) = sign(K(xᵀ, Āᵀ)ū − γ). Using only the small square kernel K(Ā, Āᵀ) instead of the rectangular kernel K(A, Āᵀ) gives lousy results!
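A self-contained sketch of the three RSVM steps (the Gaussian kernel, BFGS in place of Newton's method, and all parameter values are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, mu=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-mu * sq)

def rsvm_fit(A, d, mbar=50, nu=1.0, seed=0):
    # (i) choose a random subset matrix Abar of the data matrix A
    rng = np.random.default_rng(seed)
    Abar = A[rng.choice(len(A), size=mbar, replace=False)]
    K = gaussian_kernel(A, Abar)              # m x mbar rectangular kernel

    # (ii) minimize the smoothed objective over only mbar + 1 variables
    def objective(theta):
        u, gamma = theta[:mbar], theta[mbar]
        r = np.logaddexp(5.0 * (1.0 - d * (K @ u - gamma)), 0.0) / 5.0
        return 0.5 * nu * r @ r + 0.5 * (u @ u + gamma**2)

    theta = minimize(objective, np.zeros(mbar + 1), method="BFGS").x
    u, gamma = theta[:mbar], theta[mbar]

    # (iii) the classifier needs only the reduced set Abar, not all of A
    return lambda X: np.sign(gaussian_kernel(X, Abar) @ u - gamma)
```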

10 A Nonlinear Kernel Application
Checkerboard training set: 1000 points in R²; separate 486 asterisks from 514 dots.

11 Conventional SVM Result on Checkerboard Using 50 Randomly Selected Points Out of 1000

12 RSVM Result on Checkerboard Using SAME 50 Random Points Out of 1000

13 RSVM on Moderate Sized Problems (Best Test Set Correctness %, CPU seconds)

Dataset (m × n, m̄)             RSVM K(A,Āᵀ)      SVM K(A,Aᵀ)       SVM on subset K(Ā,Āᵀ)
                                %       sec       %       sec       %       sec
Cleveland Heart 297 × 13, 30    86.47   3.04      85.92   32.42     76.88   1.58
BUPA Liver 345 × 6, 35          74.86   2.68      73.62   32.61     68.95   2.04
Ionosphere 351 × 34, 35         95.19   5.02      94.35   59.88     88.70   2.13
Pima Indians 768 × 8, 50        78.64   5.72      76.59   328.3     57.32   4.64
Tic-Tac-Toe 958 × 9, 96         98.75   14.56     98.43   1033.5    88.24   8.87
Mushroom 8124 × 22, 215         89.04   466.20    N/A     N/A       83.90   221.50

N/A: the conventional SVM ran out of memory on this dataset.

14 RSVM on Large UCI Adult Dataset
Average correctness % and standard deviation over 50 runs:

(Training, Testing)   K(A,Āᵀ) %   ± std    K(Ā,Āᵀ) %   ± std    m̄     m̄/m
(6414, 26148)         84.47       0.001    77.03       0.014    210    3.2%
(11221, 21341)        84.71       0.001    75.96       0.016    225    2.0%
(16101, 16461)        84.90       0.001    75.45       0.017    242    1.5%
(22697, 9865)         85.31       0.001    76.73       0.018    284    1.2%
(32562, 16282)        85.07       0.001    76.95       0.013    326    1.0%

15 The Reduced Set Plays the Most Important Role in RSVM
It is natural to raise two questions:
 Is there a way to choose the reduced set, other than random selection, so that RSVM will have better performance?
 Is there a mechanism to determine the size of the reduced set automatically or dynamically?

16 Reduced Set Selection According to the Data Scatter in Input Space
 Choose the reduced set randomly, but keep only the points that are more than a certain minimal distance apart
 These points are expected to be a representative sample (a sketch of this selection rule follows)
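One way to implement this selection rule, as a small sketch (the parameter names mbar and min_dist are assumptions):

```python
import numpy as np

def scatter_filtered_subset(A, mbar, min_dist, seed=0):
    """Scan candidates in random order; keep a point only if it lies at least
    min_dist away (in input space) from every point already kept."""
    rng = np.random.default_rng(seed)
    kept = []
    for i in rng.permutation(len(A)):
        if all(np.linalg.norm(A[i] - A[j]) >= min_dist for j in kept):
            kept.append(i)
        if len(kept) == mbar:
            break
    return A[kept]
```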

17 Data Scatter in Input Space is NOT Good Enough
An example is given as follows: training data analogous to the XOR problem. [Figure: twelve labeled training points (1-12) scattered in input space.]

18 Mapping to Feature Space
Map the input data via the nonlinear mapping φ(x) = (x₁², √2·x₁x₂, x₂²); this is equivalent to the polynomial kernel with degree 2: K(x, z) = (xᵀz)².
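The following sketch shows why input-space scatter can be misleading under this map; the four XOR-style points are illustrative stand-ins for the slides' twelve points:

```python
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

pts = np.array([[1., 1.], [-1., -1.], [1., -1.], [-1., 1.]])
for p in pts:
    print(p, "->", phi(p))
# x and -x map to the SAME feature vector: points that are far apart
# in input space can coincide in feature space.
```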

19 Data Points in the Feature Space
[Figure: the twelve training points after mapping; several distinct input points coincide in feature space.]

20 The Polynomial Kernel Matrix
[Figure: the degree-2 polynomial kernel matrix of the twelve example points.]

21 Experiment Result
[Figure: separating surface obtained on the twelve-point example.]

22 Express the Classifier as a Linear Combination of Kernel Functions
 In SSVM, the nonlinear separating surface is K(xᵀ, Aᵀ)Du = γ: a linear combination of m kernel functions, one per training point
 In RSVM, the nonlinear separating surface is K(xᵀ, Āᵀ)ū = γ: a linear combination of only m̄ kernel functions, one per reduced-set point

23 Motivation of IRSVM: The Strength of Weak Ties
 Mark S. Granovetter, "The Strength of Weak Ties," The American Journal of Sociology, Vol. 78, No. 6, pp. 1360-1380, May 1973
 If the kernel functions are very similar, the space spanned by these kernel functions will be very limited

24 Incremental Reduced SVMs
 Start with a very small reduced set, then add a new data point only when its kernel vector is dissimilar to the current set of kernel functions
 Such a point contributes the most extra information for generating the separating surface
 Repeat until several successive points cannot be added

25 How to Measure the Dissimilarity?
 Add a point into the reduced set if the distance from its kernel vector to the column space of the current reduced kernel matrix is greater than a threshold

26 Solving Least Squares Problems
 This distance can be determined by solving a least squares problem (LSP): min over β of ‖K̃β − k‖², where K̃ is the current reduced kernel matrix and k is the kernel vector of the candidate point
 The LSP has a unique solution β* = (K̃ᵀK̃)⁻¹K̃ᵀk if K̃ has full column rank, and the distance is ‖K̃β* − k‖
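A minimal NumPy sketch of this computation (the function and variable names are illustrative):

```python
import numpy as np

def distance_to_column_space(Ktilde, k):
    """Distance from the kernel vector k to the column space of the reduced
    kernel matrix Ktilde, via the LSP min_beta ||Ktilde @ beta - k||."""
    beta, *_ = np.linalg.lstsq(Ktilde, k, rcond=None)
    return np.linalg.norm(Ktilde @ beta - k)
```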

27 IRSVM Algorithm Pseudo-code (sequential version)
1  Randomly choose two data points from the training data as the initial reduced set
2  Compute the reduced kernel matrix
3  For each data point not in the reduced set:
4      Compute its kernel vector
5      Compute the distance from the kernel vector
6          to the column space of the current reduced kernel matrix
7      If the distance exceeds a certain threshold:
8          Add this point to the reduced set and form the new reduced kernel matrix
9  Until several successive failures have happened at line 7
10 Solve the QP problem of the nonlinear SVM with the obtained reduced kernel
11 Classify a new data point by the resulting separating surface
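A runnable sketch of the selection loop (lines 1-9); the final SVM solve of line 10 is omitted, and kernel is any kernel-matrix function such as the gaussian_kernel sketched earlier:

```python
import numpy as np

def irsvm_select(A, kernel, threshold, fail_limit=5, seed=0):
    rng = np.random.default_rng(seed)
    order = list(rng.permutation(len(A)))
    reduced = order[:2]                        # line 1: two random initial points
    fails = 0
    for i in order[2:]:
        Kt = kernel(A, A[reduced])             # lines 2, 8: reduced kernel matrix
        k = kernel(A, A[[i]]).ravel()          # line 4: kernel vector of point i
        beta, *_ = np.linalg.lstsq(Kt, k, rcond=None)
        dist = np.linalg.norm(Kt @ beta - k)   # lines 5-6: distance to column space
        if dist > threshold:                   # line 7
            reduced.append(i)
            fails = 0
        else:
            fails += 1
            if fails >= fail_limit:            # line 9: successive failures
                break
    return A[reduced]
```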

28 Speeding up IRSVM
 We have to solve the LSP many times, and its cost is dominated by the size of the current reduced kernel matrix
 The main cost therefore depends on the current size of the reduced set, not on which candidate point is being tested
 Taking advantage of this, we examine a batch of data points at the same time

29 IRSVM Algorithm Pseudo-code (batch version)
1  Randomly choose two data points from the training data as the initial reduced set
2  Compute the reduced kernel matrix
3  For each batch of data points not in the reduced set:
4      Compute their kernel vectors
5      Compute the corresponding distances from these kernel vectors
6          to the column space of the current reduced kernel matrix
7      For the points whose distance exceeds a certain threshold:
8          Add those points to the reduced set and form the new reduced kernel matrix
9  Until no data point in a batch is added at lines 7-8
10 Solve the QP problem of the nonlinear SVM with the obtained reduced kernel
11 Classify a new data point by the resulting separating surface
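The batch version replaces the per-point LSP with one multi-right-hand-side solve; a minimal sketch (names are illustrative):

```python
import numpy as np

def batch_distances(Kt, Kbatch):
    """Distances from a batch of kernel vectors (the columns of Kbatch) to the
    column space of Kt; a single least squares call handles the whole batch."""
    Beta, *_ = np.linalg.lstsq(Kt, Kbatch, rcond=None)
    return np.linalg.norm(Kt @ Beta - Kbatch, axis=0)
```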

30 IRSVM on Four Public Datasets

31 IRSVM on UCI Adult datasets

32 Time comparison on Adult datasets

33 IRSVM 10 Runs Average on 6414 Points Adult Training Set

