
1 Feature/Model Selection by Linear Programming SVM, Combined with State-of-Art Classifiers: What Can We Learn About the Data
Erinija Pranckeviciene, Ray Somorjai, Institute for Biodiagnostics, NRC Canada

2 Outline of the presentation
Description of the algorithm
Results on the Agnostic Learning vs. Prior Knowledge (AL vs. PK) challenge datasets
Conclusions

3 Motivation to enter the Challenge
For small-sample-size, high-dimensional datasets, a feature selection procedure adapts to the peculiarities of the training dataset (sample bias). An ideal model selection procedure would produce stable estimates of the classification error rate, and the identities of the discovered features would not vary much across different random splits. Our experiments with the Linear Programming SVM (LP-SVM) on biomedical datasets produced results that were more robust to sample bias and exhibited this property. We therefore decided to test LP-SVM's robustness in a controlled experiment, on the independent platform of the AL vs. PK challenge.

4 Classification with LP-SVM
The formulation of LP-SVM (known as Liknon; Bhattacharyya et al.) is very similar to that of the conventional linear SVM, except that the objective function is linear, because the regularization term is the L1 norm of the weights. The solution of the LP-SVM is a linear discriminant whose weight magnitudes identify the original features that are important for class discrimination. Different values of the regularization parameter C in the optimization problem produce different discriminants.
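For reference, the Liknon-style linear program can be written as follows (a sketch in our own notation, not copied from the slides; the weights are split as w = u − v with u, v ≥ 0 so that the L1 norm becomes a linear objective):

```latex
\begin{aligned}
\min_{\mathbf{u},\mathbf{v},b,\boldsymbol{\xi}}\quad
  & \sum_{k=1}^{d}\,(u_k + v_k) \;+\; C\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\quad
  & y_i\big((\mathbf{u}-\mathbf{v})^{\top}\mathbf{x}_i + b\big) \;\ge\; 1-\xi_i,
    \qquad i=1,\dots,n, \\
  & u_k \ge 0,\; v_k \ge 0,\; \xi_i \ge 0 .
\end{aligned}
```

At the optimum the first term equals the L1 norm of w = u − v, and the features with nonzero |w_k| are the ones the discriminant actually uses.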

5 Outline of the algorithm
1) The available training data are processed in 10-fold stratified cross-validation (stratified on the existing class proportions): 9/10 of the data are used for training and 1/10 for independent testing.
2) The training portion is split randomly into a balanced training set and an unbalanced monitoring set.
3) We perform 31 such random splits (a sketch of the splitting scheme follows below).
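A minimal sketch of steps 1-3, assuming NumPy and scikit-learn; the per-class training-set size and the toy data are our own placeholders, not the values used in the challenge:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def balanced_split(X, y, rng, n_per_class):
    """Split one fold's training portion into a balanced training set and an
    unbalanced monitoring set (all remaining samples)."""
    train_idx = []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        train_idx.extend(idx[:n_per_class])
    train_idx = np.asarray(train_idx)
    monitor_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, monitor_idx

X, y = np.random.randn(200, 50), np.repeat([0, 1], [140, 60])   # toy data
rng = np.random.default_rng(0)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for fold_train, fold_test in skf.split(X, y):     # step 1: 10-fold stratified CV
    for split in range(31):                       # step 3: 31 random splits
        tr, mon = balanced_split(X[fold_train], y[fold_train], rng, n_per_class=25)
        # steps 4-6 (next slide): fit LP-SVM models on tr, monitor BER on mon
```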

6 Evolution of the models
4) In every split, the training set is used to find several LP-SVM discriminants, determined by a sequence of values of the regularization parameter C. Increasing C increases the number of selected features.
5) A balanced error rate (BER) is estimated for every discriminant on the monitoring set.
6) The discriminant/model with the smallest monitoring BER is retained (see the sketch below).
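A sketch of steps 4-6 for a single split. The BER is the standard one (mean of the per-class error rates); scikit-learn's L1-penalized LinearSVC is used only as a stand-in for the actual LP-SVM solver, which it approximates but does not reproduce:

```python
import numpy as np
from sklearn.svm import LinearSVC

def balanced_error_rate(y_true, y_pred):
    """BER: average of the error rates computed separately in each class."""
    classes = np.unique(y_true)
    return np.mean([np.mean(y_pred[y_true == c] != c) for c in classes])

def best_model_for_split(X_tr, y_tr, X_mon, y_mon, C_values):
    """Fit one L1-regularized linear model per C value and keep the one
    with the smallest monitoring BER."""
    best = None
    for C in C_values:
        clf = LinearSVC(penalty='l1', dual=False, C=C, max_iter=5000).fit(X_tr, y_tr)
        ber = balanced_error_rate(y_mon, clf.predict(X_mon))
        if best is None or ber < best[0]:
            # remember BER, the C value, and the indices of the selected features
            best = (ber, C, np.flatnonzero(np.abs(clf.coef_.ravel()) > 1e-8))
    return best
```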

7 Example of the evolution of the models on synthetic data

8 Feature profiles and other classifiers
7) Within a single fold, a feature profile is derived by counting how frequently each feature is included in the best-BER discriminants (we have 31 of them, one per split).
8) As a result, we have an ensemble of linear discriminants operating on the selected features, and the feature profile is also tested with other classifiers: several thresholds of the frequency of inclusion were examined for the different datasets, using state-of-the-art classification rules such as k-NN, the Fisher discriminant, etc. (a sketch follows below).
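A sketch of the feature-profile computation (step 7) and its use with another classifier (step 8); the 50% threshold and the 1-NN choice are illustrative only, not the settings used for every dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def feature_profile(selected_feature_lists, n_features):
    """Fraction of the best-BER discriminants (one per split) in which
    each original feature appears."""
    counts = np.zeros(n_features)
    for feats in selected_feature_lists:
        counts[feats] += 1
    return counts / len(selected_feature_lists)

# Hypothetical use inside one fold, with best_feats_per_split collected
# from the 31 splits of that fold:
# profile = feature_profile(best_feats_per_split, X.shape[1])
# keep = np.flatnonzero(profile >= 0.5)                  # threshold Th
# knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr[:, keep], y_tr)
```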

9 Final model selection
9) The performance of all competing models derived in a single fold is estimated by the BER on the independent test set; this yields 10 estimates.
10) The final model is selected out of the 10 estimated models (see below).
11) The identities of the features occurring in all profiles can also be examined separately.
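The final selection then reduces to comparing the per-fold winners by their independent-test BER; a trivial sketch with made-up toy values:

```python
# one (description, independent-test BER) pair per fold -- toy numbers for illustration
fold_results = [("fold %d best model" % k, ber)
                for k, ber in enumerate([0.21, 0.19, 0.20, 0.18, 0.22,
                                         0.19, 0.20, 0.21, 0.18, 0.19])]
final_model = min(fold_results, key=lambda r: r[1])   # smallest test-set BER wins
print(final_model)
```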

10 Experimental setup: algorithmic parameters for the AL vs. PK datasets
T1+T2 - size of the training set
M1+M2 - size of the monitoring set
V1+V2 - size of the validation set
Dim - dimensionality of the data
Models - number of models tested
Th - threshold on the frequency of inclusion of a feature in the feature profile

11 ADA results
Identity: features occurring in 100% of the profiles: 2, 8, 9, 18, 20, 24, 30
Last 1, test error 0.181812
Last 2, test error 0.181899
Last 3, test error 0.183274 (Th 55%)

12 GINA results
Identity: features occurring in more than 85% of the profiles: 367, 815, 510, 648, 424
Last 3, by classifier: knn1 0.060, knn3 0.058, ensemble 0.153
Last 1, test error 0.0583085
Last 2, test error 0.0539862
Last 3, test error 0.0533136 (Th 50%)

13 HIVA results
Identity: features occurring in 90% of the profiles
Last 1, test error 0.299795
Last 2, test error 0.313
Last 3, test error 0.305 (Th 20%)
Best (former) entry: 0.2939

14 NOVA results
Identity: features occurring in 100% of the profiles
Last 1, test error 0.075
Last 2, test error 0.074
Last 3, test error 0.081 (Th 80%)
Best (former) entry: 0.0725

15 SYLVA results
Identity: features occurring in 100% of the profiles: 202, 55
Last 1, test error 0.0195
Last 2, test error 0.0197
Last 3, test error 0.01897 (Th 20%)

16 Determination of C values
Given the N1 and N2 measurements x of an individual feature k in the two classes, a critical C value is computed for that feature; the idea behind this comes from the analysis of the dual problem. The C values corresponding to the d features are sorted in ascending order and a model is solved for each. If there are many features, many models have to be solved, which is computationally infeasible, so the C values have to be condensed.
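A sketch of this ranking step. The per-feature formula on the original slide is not reproduced in this transcript; here we assume the critical value is the reciprocal of the absolute difference between the class means of that feature, which is one reading of the dual constraints when all dual variables sit at their upper bound. The exact constant used in the original work may differ:

```python
import numpy as np

def candidate_C_values(X, y):
    """Per-feature candidate C values, sorted in ascending order.
    Assumption: C_k = 1 / |mean of feature k in class +1 - mean in class -1|,
    so a feature with well-separated class means enters the model at a small C."""
    X1, X2 = X[y == 1], X[y == -1]            # the N1 and N2 measurements per feature
    gap = np.abs(X1.mean(axis=0) - X2.mean(axis=0))
    C = 1.0 / np.maximum(gap, 1e-12)          # guard against identical class means
    return np.sort(C)

# one LP-SVM model would then be solved for each value in candidate_C_values(X, y)
```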

17 Different ways of condensing C
The challenge submissions differed in how the C values were chosen. Initially, a histogram of the C values was used; judging by the final ranking, this method worked better for HIVA and NOVA. In the last submissions, the rate of change of a slightly modified primal objective function was used; this worked better for ADA, GINA and SYLVA. We are still looking for a less heuristic, more principled method.
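A minimal sketch of the histogram-based condensation (the bin count is our own choice; the rate-of-change variant used in the later submissions is not shown):

```python
import numpy as np

def condense_by_histogram(C_values, n_bins=30):
    """Collapse the d per-feature C values into a handful of representatives:
    histogram them and keep one value (the bin centre) per non-empty bin."""
    counts, edges = np.histogram(C_values, bins=n_bins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres[counts > 0]
```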

18 Conclusions
The main advantages of our method are its simplicity and the interpretability of its results; the disadvantage is the high computational burden. Ensembles tend to perform better than individual rules, except on GINA. The same feature identities were consistently discovered across all splits and folds. The derived feature identities still have to be compared with the ground truth from the Prior Knowledge track. Some arbitrariness, unavoidable in this experiment, will be addressed in future work: the threshold on the feature profile, the numbers of samples for training and monitoring, the number of splits, and the number of models.

19 Many thanks to Muoi Tran for discussions and support, and to you for your attention!

