
1 Correlation Aware Feature Selection http://mpa.itc.it Annalisa Barla Cesare Furlanello Giuseppe Jurman Stefano Merler Silvano Paoli Berlin – 8/10/2005

2 Overview: On Feature Selection; Correlation Aware Ranking; Synthetic Example.

3 Feature Selection. Step-wise variable selection: n* < N effective variables modeling the classification function. [Diagram: N steps, from N features at step 1 down to one feature at step N.]

4 Feature Selection. Step-wise selection of the features. [Diagram: at each step the features split into ranked and discarded sets.]

5 Ranking. Two routes: classifier-independent filters (which ignore the labelling), or a ranking induced by a classifier. Prefiltering is risky: you might discard features that turn out to be important.

6 Support Vector Machines. Classification function: f(x) = sign(w · x + b), the Optimal Separating Hyperplane.

7 The classification/ranking machine. The RFE idea (Guyon et al. 2002): given N features (genes),
1. Train an SVM.
2. Compute a cost function J from the weight coefficients of the SVM.
3. Rank the features by their contribution to J.
4. Discard the feature contributing least to J.
5. Reapply the procedure to the remaining N-1 features.
This is Recursive Feature Elimination (RFE): features are ranked according to their contribution to the classification, given the training data. It is time- and data-consuming, and at risk of selection bias.
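A minimal sketch of this loop, assuming a linear SVM from scikit-learn and taking J as the squared weight w_i^2 (as in Guyon et al. 2002); function and variable names are illustrative, not the authors' code:

```python
import numpy as np
from sklearn.svm import SVC

def rfe_rank(X, y):
    """Recursive Feature Elimination: drop the feature contributing least to J = w_i^2."""
    active = list(range(X.shape[1]))
    ranking = []                          # filled from least to most important
    while active:
        svm = SVC(kernel="linear").fit(X[:, active], y)
        w = svm.coef_.ravel()             # weights of the separating hyperplane
        worst = int(np.argmin(w ** 2))    # smallest contribution to J
        ranking.append(active.pop(worst))
    return ranking[::-1]                  # most important feature first
```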

8 RFE-based Methods. Eliminating chunks of features at a time. Parametric: Sqrt(N)-RFE, Bisection-RFE. Non-parametric: E-RFE (adapting to the weight distribution), thresholding the weights at a value w*.
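A hedged sketch of the non-parametric variant: where plain RFE drops one feature per step, E-RFE drops every feature whose weight falls below a threshold w* adapted to the current weight distribution. The rule for w* below (a fraction of the mean absolute weight) is an illustrative placeholder, not the published one:

```python
import numpy as np

def erfe_step(w, frac=0.5):
    """One E-RFE-style step: return indices of all features with |w_i| below w*.

    frac is a placeholder; the actual E-RFE threshold adapts to the
    empirical distribution of the weights.
    """
    w_star = frac * np.abs(w).mean()
    return np.where(np.abs(w) < w_star)[0], w_star
```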

9 Variable Elimination. Given a family F = {x_1, x_2, …, x_H} of correlated genes (pairwise correlation above a given threshold T) such that w(x_1) ≈ w(x_2) ≈ … ≈ ε < w*: each single weight is negligible. BUT w(x_1) + w(x_2) + … + w(x_H) >> w*.
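The pitfall can be seen directly with a linear SVM: the L2 penalty spreads weight evenly over exact copies of a feature, so each copy's weight shrinks roughly in proportion to the family size, even though the family as a whole is decisive. A small numpy/scikit-learn illustration (exact values depend on C and the seed):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
y = np.repeat([1, -1], 50)
x = rng.normal(y, 1.0)                        # one informative gene: N(1,1) vs N(-1,1)
w1 = SVC(kernel="linear").fit(x[:, None], y).coef_.ravel()
w50 = SVC(kernel="linear").fit(np.tile(x[:, None], (1, 50)), y).coef_.ravel()
print(w1[0], w50[0])                          # each of the 50 copies carries roughly 1/50 of the weight
```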

10 Correlated Genes (1)

11 Correlated Genes (2)

12 Synthetic Data. Binary problem: 100 (50 + 50) samples of 1000 genes. Genes 1-50: randomly extracted from N(1,1) for class 1 and N(-1,1) for class 2. Genes 51-100: a single gene extracted the same way, repeated 50 times. Genes 101-1000: extracted from Unif(-4,4). In total: 51 significant features. [Diagram: the 100 × 1000 data matrix, 50 samples per class.]
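The benchmark can be regenerated along these lines (a sketch: the slide fixes the distributions but not the seed or the sampling order):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.repeat([1, -1], 50)                                 # 50 + 50 samples

X = rng.uniform(-4, 4, size=(100, 1000))                   # genes 101-1000: pure noise
X[:, :50] = rng.normal(y[:, None], 1.0, size=(100, 50))    # genes 1-50: N(1,1) vs N(-1,1)
X[:, 50:100] = rng.normal(y, 1.0)[:, None]                 # genes 51-100: one informative gene copied 50 times
```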

13 Our algorithm, step j. [Flowchart lost in transcription.]
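The flowchart itself is lost, but slides 9 and 19 suggest the step: before discarding the low-weight set at step j, group the candidates into correlated families F_l and rescue a representative of any family whose summed weight exceeds w*. The sketch below is a reconstruction under that reading; the function name, the correlation threshold rho, and the keep-the-strongest rule are all assumptions:

```python
import numpy as np

def correlation_correction(X, w, discard, w_star, rho=0.9):
    """Rescue one representative per correlated family in the discard set.

    X: samples x genes matrix; w: current SVM weights; discard: indices
    E-RFE is about to eliminate; rho: correlation threshold (assumed).
    """
    saved, remaining = [], list(discard)
    while remaining:
        i = remaining.pop(0)
        # family F_l: discarded genes highly correlated with gene i
        mates = [j for j in remaining
                 if abs(np.corrcoef(X[:, i], X[:, j])[0, 1]) > rho]
        family = [i] + mates
        if np.abs(w[family]).sum() > w_star:          # negligible alone, decisive together
            saved.append(family[int(np.argmax(np.abs(w[family])))])
        remaining = [j for j in remaining if j not in mates]
    return saved   # indices to move back into the ranked set
```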

14 Methodology: implemented within the BioDCV system (50 replicates); realized through R-C code interaction.

15 Synthetic Data. [Plot: gene rank across steps, genes 1-1000.] Gene 100 is consistently ranked 2nd.

16 Work in Progress: preservation of highly correlated genes with low initial weights on microarray datasets; robust correlation measures; different techniques to detect the F_l families (clustering, gene functions).

17

18 Synthetic Data. Features discarded at each E-RFE step (columns: genes 1-50 | genes 51-100 | genes >100). The run with the correlation correction is identical except at step 9, where it discards 49 rather than 50 of genes 51-100:

Step  | genes 1-50 | genes 51-100 | genes >100
0     | 0          | 0            | 282
1     | 0          | 0            | 158
2     | 0          | 0            | 67
3     | 0          | 0            | 59
4     | 0          | 0            | 167
5     | 0          | 0            | 39
6     | 0          | 0            | 31
7     | 0          | 0            | 13
8     | 0          | 0            | 9
9     | 0          | 50 (49)      | 6
10    | 0          | 0            | 55
11    | 0          | 0            | 32
12    | 0          | 0            | 20
13    | 0          | 0            | 0
14    | 0          | 0            | 18
SAVED | 50         | 0            | 0

19 Synthetic Data. Features discarded at step 9 by the E-RFE procedure: genes 51-100, plus the noise genes 227, 559, 864, 470, 363, 735. Correlation correction: saves feature 100.
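Continuing the hypothetical sketch from slide 13, the rescue shown here would amount to (0-based indices; X, w, and w_star as defined there):

```python
discard = list(range(50, 100)) + [226, 558, 863, 469, 362, 734]  # step-9 discard set
saved = correlation_correction(X, w, discard, w_star)
# genes 51-100 form one family whose summed weight exceeds w*,
# so one representative (feature 100 on the slide) is kept
```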

20 Challenges for predictive profiling.
INFRASTRUCTURE: MPACluster -> available for batch jobs; connecting with IFOM -> 2005; running at IFOM -> 2005/2006; production on GRID resources (spring 2005).
ALGORITHMS II: 1. Gene list fusion: suite of algebraic/statistical methods. 2. Prediction over multi-platform gene expression datasets (sarcoma, breast cancer): large-scale semi-supervised analysis. 3. New SVM kernels for prediction on spectrometry data within complete validation.

21 Prefiltering is risky: you might discard features that turn out to be important. Nevertheless, wrapper methods are quite costly. Moreover, with gene expression data you also have to deal with particular situations, such as clones or highly correlated features, that can be a pitfall for several selection methods. A classic alternative is to map to linear combinations of features and then select: Principal Component Analysis; metagenes (a simplified model for pathways, though biological interpretations require caution). But then we are no longer working with the original features. [Figure: eigen-craters for unexploded-bomb risk maps.]
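A minimal sketch of the map-then-select alternative, using PCA from scikit-learn on the expression matrix X from the synthetic example (the number of components is illustrative):

```python
from sklearn.decomposition import PCA

Z = PCA(n_components=10).fit_transform(X)   # metagene-like linear combinations of the original genes
# ranking/classification now operates on the columns of Z, not on the original features
```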

22

23 A few issues in feature selection, with a particular interest in the classification of genomic data. WHY? To ease the computational burden: discard the (apparently) less significant features and train in a simplified space, alleviating the curse of dimensionality. To enhance information: highlight (and rank) the most important features, improving knowledge of the underlying process. HOW? As a pre-processing step: employ a statistical filter (t-test, S2N). As a learning step: link the feature ranking to the classification task (wrapper methods, …).
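As a concrete instance of the pre-processing route, a signal-to-noise (S2N) filter in its usual mean/standard-deviation form (an assumption; the slide names the statistic but not the formula):

```python
import numpy as np

def s2n_scores(X, y):
    """Signal-to-noise statistic per gene: |mu1 - mu2| / (sd1 + sd2)."""
    a, b = X[y == 1], X[y == -1]
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0))

# keep the top-k genes:  top = np.argsort(s2n_scores(X, y))[::-1][:k]
```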


25 Feature Selection within Complete Validation Experimental Setups. Complete validation is needed to decouple model tuning from (ensemble) model accuracy estimation: otherwise, selection bias effects arise. [Figure: accumulating relative importance from Random Forest models for the identification of sensory drivers (with P. Granitto, IASMA).]

