Presentation on theme: "Reduced Support Vector Machine" — Presentation transcript:
1 Reduced Support Vector Machine
Nonlinear Classifier:
(i) Choose a random subset matrix $\bar{A} \in \mathbb{R}^{\bar{m} \times n}$ of the entire data matrix $A \in \mathbb{R}^{m \times n}$
(ii) Solve the following problem by Newton's method:
$\min_{(\bar{u},\gamma) \in \mathbb{R}^{\bar{m}+1}} \; \frac{\nu}{2}\left\| p\!\left(e - D\left(K(A,\bar{A}')\bar{u} - e\gamma\right), \alpha\right) \right\|_2^2 + \frac{1}{2}\left(\bar{u}'\bar{u} + \gamma^2\right)$
(iii) The nonlinear classifier is defined by the optimal solution $(\bar{u},\gamma)$ in step (ii):
$K(x', \bar{A}')\bar{u} - \gamma = 0$
Using only the reduced kernel $K(\bar{A}, \bar{A}')$ instead of the rectangular kernel $K(A, \bar{A}')$ gives lousy results!
2 Reduced Set: Plays the Most Important Role in RSVM
It is natural to raise two questions:
Is there a way to choose the reduced set other than random selection so that RSVM will have better performance?
Is there a mechanism to determine the size of the reduced set automatically or dynamically?
3 Reduced Set Selection According to the Data Scatter in Input Space
Choose the reduced set randomly, but keep only the points that are more than a certain minimal distance apart
These points are expected to be a representative sample
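A minimal sketch of this scatter-based selection in NumPy; the data matrix A, the reduced-set size m_bar, and the min_dist threshold are illustrative names, not taken from the slides:

```python
import numpy as np

def scatter_reduced_set(A, m_bar, min_dist, seed=0):
    """Randomly scan the training points and keep one only if it is at
    least min_dist away (Euclidean) from every point already kept."""
    rng = np.random.default_rng(seed)
    kept = []
    for i in rng.permutation(len(A)):
        if all(np.linalg.norm(A[i] - A[j]) >= min_dist for j in kept):
            kept.append(i)
        if len(kept) == m_bar:      # stop once the reduced set is full
            break
    return A[kept]
```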
4 Data Scatter in Input Space is NOT Good Enough
An example is given as follows:
[Figure: 12 training points, labeled 1-12, plotted in the input space]
Training data analogous to the XOR problem
5 Mapping to Feature Space
Map the input data via the nonlinear mapping $\phi$: for $x = (x_1, x_2)$,
$\phi(x) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\right)$
Equivalent to the polynomial kernel with degree 2:
$K(x, z) = (x'z)^2 = \langle \phi(x), \phi(z) \rangle$
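A quick numerical check of this equivalence, assuming 2-D inputs and the homogeneous degree-2 mapping written above:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for x = (x1, x2)
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def poly2_kernel(x, z):
    # Homogeneous polynomial kernel of degree 2
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# The inner product in feature space equals the kernel value in input space
assert np.isclose(np.dot(phi(x), phi(z)), poly2_kernel(x, z))
```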
6 Data Points in the Feature Space
[Figure: the 12 training points of the XOR-like example plotted in the feature space]
9 Express the Classifier as a Linear Combination of Kernel Functions
In SSVM, the nonlinear separating surface is:
$K(x', A')u - \gamma = 0$, a linear combination of a set of $m$ kernel functions
In RSVM, the nonlinear separating surface is:
$K(x', \bar{A}')\bar{u} - \gamma = 0$, a linear combination of a set of $\bar{m}$ kernel functions
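To make the "linear combination" concrete, here is a sketch of evaluating the separating surface with a Gaussian kernel; A_bar, u_bar, gamma, and the width mu stand for the reduced set and the optimal solution and are assumed names:

```python
import numpy as np

def gaussian_kernel(X, Y, mu=0.1):
    # K(X, Y')_{ij} = exp(-mu * ||x_i - y_j||^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-mu * d2)

def rsvm_decision(x, A_bar, u_bar, gamma):
    """f(x) = K(x', A_bar') u_bar - gamma: a weighted sum of the kernel
    functions centered at the reduced-set points."""
    return float(gaussian_kernel(x[None, :], A_bar) @ u_bar - gamma)
```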
10 Motivation of IRSVM: The Strength of Weak Ties
If the kernel functions are very similar, the space spanned by these kernel functions will be very limited.
"The Strength of Weak Ties," Mark S. Granovetter, The American Journal of Sociology, Vol. 78, No. 6, May 1973
11 Incremental Reduced SVMs
Start with a very small reduced set, then add a new data point only when its kernel vector is dissimilar to the current set of kernel functions
Such a point contributes the most extra information for generating the separating surface
Repeat until several successive points cannot be added
12 How to Measure the Dissimilarity?
Add a point into the reduced set if the distance from its kernel vector to the column space of the current reduced kernel matrix $K(A, \bar{A}')$ is greater than a threshold
13 Solving Least Squares Problems
This distance can be determined by solving a least squares problem (LSP):
$\min_{\beta} \left\| K(A, \bar{A}')\beta - k \right\|_2^2$, where $k$ is the kernel vector of the candidate point
The LSP has a unique solution if $K(A, \bar{A}')$ has full column rank, and the solution is $\beta^* = \left(K(A,\bar{A}')'K(A,\bar{A}')\right)^{-1} K(A,\bar{A}')' k$
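A sketch of this distance computation with NumPy's least-squares solver, assuming K_bar holds the current reduced kernel matrix K(A, A_bar') and k the candidate kernel vector:

```python
import numpy as np

def distance_to_column_space(K_bar, k):
    """Distance from k to the column space of K_bar: the residual norm of
    the least squares problem min_beta ||K_bar @ beta - k||_2."""
    beta, *_ = np.linalg.lstsq(K_bar, k, rcond=None)
    return np.linalg.norm(K_bar @ beta - k)
```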
14 IRSVM Algorithm pseudo-code (sequential version)
1 Randomly choose two data points from the training data as the initial reduced set
2 Compute the reduced kernel matrix
3 For each data point not in the reduced set
4   Compute its kernel vector
5   Compute the distance from the kernel vector
6     to the column space of the current reduced kernel matrix
7   If its distance exceeds a certain threshold
8     Add this point into the reduced set and form the new reduced kernel matrix
9 Until several successive failures happen in line 7
10 Solve the QP problem of nonlinear SVMs with the obtained reduced kernel
11 A new data point is classified by the separating surface
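A sketch of the selection loop in lines 1-9, reusing the gaussian_kernel helper from the earlier sketch; the threshold and the failure count are assumptions, and the final solve in line 10 is omitted:

```python
import numpy as np

def irsvm_select(A, threshold=1e-2, max_failures=5, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(A))
    reduced = list(order[:2])                  # line 1: two random initial points
    K_bar = gaussian_kernel(A, A[reduced])     # line 2: reduced kernel matrix
    failures = 0
    for i in order[2:]:                        # line 3: scan the remaining points
        k = gaussian_kernel(A, A[[i]])[:, 0]   # line 4: kernel vector of point i
        beta, *_ = np.linalg.lstsq(K_bar, k, rcond=None)      # lines 5-6
        if np.linalg.norm(K_bar @ beta - k) > threshold:      # line 7
            reduced.append(i)                                 # line 8
            K_bar = np.column_stack([K_bar, k])
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:       # line 9: several successive failures
                break
    return reduced
```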
15 Speed up IRSVM
We have to solve the LSP many times, and its main cost comes from quantities that involve only the current reduced kernel matrix, not the particular candidate kernel vector
Taking advantage of this, we examine a batch of data points at the same time
16 IRSVM Algorithm pseudo-code (batch version)
1 Randomly choose two data points from the training data as the initial reduced set
2 Compute the reduced kernel matrix
3 For each batch of data points not in the reduced set
4   Compute their kernel vectors
5   Compute the corresponding distances from these kernel vectors
6     to the column space of the current reduced kernel matrix
7   For those points whose distance exceeds a certain threshold
8     Add those points into the reduced set and form the new reduced kernel matrix
9 Until no data points in a batch were added in lines 7-8
10 Solve the QP problem of nonlinear SVMs with the obtained reduced kernel
11 A new data point is classified by the separating surface
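A sketch of the batch distance computation behind this version: np.linalg.lstsq accepts a matrix right-hand side, so one call handles a whole batch of candidate kernel vectors against the same reduced kernel matrix (K_bar and K_batch are assumed names):

```python
import numpy as np

def batch_distances(K_bar, K_batch):
    """Distances from each column of K_batch (one candidate kernel vector
    per column) to the column space of the reduced kernel matrix K_bar."""
    B, *_ = np.linalg.lstsq(K_bar, K_batch, rcond=None)
    return np.linalg.norm(K_bar @ B - K_batch, axis=0)
```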
20 IRSVM: Average over 10 Runs on the 6,414-Point Adult Training Set
21 Empirical Risk Minimization (ERM)
($P(x, y)$ and the expected risk are not needed)
Replace the expected risk over $P(x, y)$ by an average over the training examples
The empirical risk: $R_{emp}[h] = \frac{1}{\ell}\sum_{i=1}^{\ell}\frac{1}{2}\left|h(x_i) - y_i\right|$
Find the hypothesis $h^*$ with the smallest empirical risk: $h^* = \arg\min_{h \in H} R_{emp}[h]$
Only focusing on empirical risk will cause overfitting
22 VC Confidence (the Bound between the Expected Risk and the Empirical Risk)
The following inequality holds with probability $1 - \eta$:
$R[h] \le R_{emp}[h] + \sqrt{\frac{v\left(\log(2\ell/v) + 1\right) - \log(\eta/4)}{\ell}}$
where $v$ is the VC dimension of the hypothesis space and $\ell$ is the number of training examples.
C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery 2 (2), 1998
23 Why We Maximize the Margin? (Based on Statistical Learning Theory)
The Structural Risk Minimization (SRM):
The expected risk will be less than or equal to the empirical risk (training error) + the VC (error) bound
Maximizing the margin bounds the VC dimension of the classifier, which keeps the VC (error) bound small
24 Bioinformatics Challenge
Learning in very high dimensions with very few samples
Colon cancer dataset: 2,000 genes vs. 62 samples
Acute leukemia dataset: 7,129 genes vs. 72 samples
Feature selection will be needed
25 Feature Selection Approaches
Filter model: the attribute set is filtered to produce the most promising subset before learning commences
  Weight score approach
Wrapper model: the learning algorithm is wrapped into the selection procedure
  1-norm SVM
  IRSVM
26 Feature Selection – Filter Model Using the Weight Score Approach
27 Filter Model – Weight Score Approach
The weight score of feature $j$:
$w_j = \frac{\left|\mu_j^{+} - \mu_j^{-}\right|}{\sigma_j^{+} + \sigma_j^{-}}$
where $\mu_j^{\pm}$ and $\sigma_j^{\pm}$ are the mean and standard deviation of feature $j$ for the training examples of the positive or negative class.
28 Filter Model – Weight Score Approach
$w_j$ is defined as the ratio between the difference of the means of expression levels and the sum of the standard deviations in the two classes.
Select the genes with the largest $w_j$ as our top features.
The weight score is calculated using information about a single feature only.
Highly linearly correlated features might be selected by this approach.
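A sketch of the weight-score computation, assuming X is a (samples x genes) expression matrix and y holds labels +1 / -1; all names are illustrative:

```python
import numpy as np

def weight_scores(X, y):
    """w_j = |mean_+ - mean_-| / (std_+ + std_-) for each feature j."""
    pos, neg = X[y == 1], X[y == -1]
    return np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0))

def top_genes(X, y, n=50):
    # Indices of the n genes with the largest weight scores
    return np.argsort(weight_scores(X, y))[::-1][:n]
```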
29 Wrapper Model – IRSVM: Find a Linear Classifier
I. Randomly choose a very small feature subset from the input features as the initial feature reduced set.
II. Select a feature vector not in the current feature reduced set and compute the distance between this vector and the space spanned by the current feature reduced set.
III. If the distance is larger than a given gap, add this feature vector to the feature reduced set.
IV. Repeat steps II and III until no feature can be added to the current feature reduced set.
V. The features in the resulting feature reduced set are our final result of feature selection.
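A sketch of steps I-V applied to the columns of a data matrix X: a feature (column) is added when its distance to the span of the already selected columns exceeds the gap; the gap value and the initial subset size are assumptions:

```python
import numpy as np

def irsvm_feature_select(X, gap=1e-3, n_init=2, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[1])
    selected = list(order[:n_init])      # step I: small random initial feature set
    for j in order[n_init:]:             # steps II-IV: scan the remaining features
        F = X[:, selected]
        beta, *_ = np.linalg.lstsq(F, X[:, j], rcond=None)
        if np.linalg.norm(F @ beta - X[:, j]) > gap:   # step III: add if far from span
            selected.append(j)
    return selected                      # step V: the final selected features
```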