Published by Andrew O'Keefe. Modified over 4 years ago.

1
**Using SVM Weight-Based Methods to Identify Causally Relevant and Non-Causally Relevant Variables**

NIPS 2006 Workshop on Causality and Feature Selection

Alexander Statnikov¹, Douglas Hardin¹,², Constantin Aliferis¹,³

¹Department of Biomedical Informatics, ²Department of Mathematics, ³Department of Cancer Biology, Vanderbilt University, Nashville, TN, USA

2
**Major Goals of Variable Selection**

- Construct faster and more cost-effective classifiers.
- Improve the prediction performance of the classifiers.
- Gain insight into the underlying data-generating process.

3
**Taxonomy of Variables**

Variables are divided into relevant and irrelevant. Relevant variables are further divided into causally relevant and non-causally relevant.

Figure: example network around the response T with variables D, E, F, J, K, L, and M, labeled as relevant (causally or non-causally) or irrelevant.

Speaker note: Say that this classification (into causally relevant and irrelevant) is done for the purposes of this paper. In real-world networks, there may be thousands of variables upstream of T. In this case, we can identify them by random selection. What we are concerned with here is to select causal variables _locally_.

4
**Support Vector Machine (SVM) Weight-Based Variable Selection Methods**

- Scale up to datasets with many thousands of variables and as few as dozens of samples.
- Often yield variables that are more predictive than the ones output by other variable selection techniques or the full (unreduced) variable set (Guyon et al, 2002; Rakotomamonjy, 2003).
- Currently unknown: do we gain insight into the causal structure?
- Known results (Hardin et al, 2004), stated with the Kohavi-John definition of relevance:
  - Irrelevant variables will be given a 0 weight by a linear SVM in the sample limit.
  - A linear SVM may assign 0 weight to strongly relevant variables and nonzero weight to weakly relevant variables.

5
**Simulation Experiments**

Network structure 1:

- P(Y=0) = ½ and P(Y=1) = ½. Y is hidden from the learner.
- {Xi}, i = 1,…,N, are binary variables with P(Xi=0|Y=0) = q and P(Xi=1|Y=1) = q.
- {Zi}, i = 1,…,M, are independent binary variables with P(Zi=0) = ½ and P(Zi=1) = ½.
- T is a binary response variable with P(T=0|X1=0) = 0.95 and P(T=1|X1=1) = 0.95.
- q = 0.95 in network 1a and q = 0.99 in network 1b.

Figure: Y (hidden from the learner) is the parent of X1, X2, …, XN (relevant variables); X1 (causally relevant) is the parent of T (response); Z1, Z2, …, ZM are irrelevant variables.

Speaker note: Mention that the network structures obey the Causal Markov Condition.
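The definition of network structure 1 can be turned into a small sampler. The following is a minimal sketch, not the authors' code: the function name `sample_network1`, the tuple layout, and the random seed are illustrative assumptions, and only the probabilities come from the slide.

```python
import random

def sample_network1(n_samples, n_relevant, n_irrelevant, q, rng):
    """Draw samples from network structure 1.

    Returns a list of (x, z, t) tuples: x holds X1..XN, z holds Z1..ZM,
    and t is the response. The hidden variable Y is sampled but not returned.
    """
    samples = []
    for _ in range(n_samples):
        y = rng.randint(0, 1)  # P(Y=0) = P(Y=1) = 1/2
        # Each Xi equals Y with probability q: P(Xi=0|Y=0) = P(Xi=1|Y=1) = q.
        x = [y if rng.random() < q else 1 - y for _ in range(n_relevant)]
        # Each Zi is an independent fair coin.
        z = [rng.randint(0, 1) for _ in range(n_irrelevant)]
        # T equals X1 with probability 0.95.
        t = x[0] if rng.random() < 0.95 else 1 - x[0]
        samples.append((x, z, t))
    return samples

rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
# Network 1a uses q = 0.95; network 1b would use q = 0.99.
data = sample_network1(n_samples=100, n_relevant=10, n_irrelevant=10, q=0.95, rng=rng)
```

Because every Xi depends on T only through Y and X1, all Xi are correlated with T, but only X1 is a direct cause.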

6
**Simulation Experiments**

Network structure 1 in real-world distributions.

Figure: adrenal gland cancer pathway produced by Ariadne Genomics PathwayStudio software version 4.0, showing the disease and its putative causes (except for kras).

7
**Simulation Experiments**

Network structure 2:

- {Xi}, i = 1,…,N, are independent binary variables with P(Xi=0) = ½ and P(Xi=1) = ½.
- {Zi}, i = 1,…,M, are independent binary variables with P(Zi=0) = ½ and P(Zi=1) = ½.
- Y is a “synthesis variable” defined by a function (the formula is not preserved in this transcript).
- T is a binary response variable defined by a formula (also not preserved) in which the vi are generated from the uniform U(0,1) distribution and are fixed for all experiments.

Figure labels: X1, X2, …, XN (causally relevant); T (response); Y; Z1, Z2, …, ZM (irrelevant variables); relevant variables.

8
**Simulation Experiments**

Network structure 2 in real-world distributions.

Figure labels: putative causes of the disease; targets of putative causes of the disease.

9
**Data Generation**

- Generated 30 training samples of each size in {100, 200, 500, 1000} for different values of N (number of all relevant variables) = {10, 100} and M (number of irrelevant variables) = {10, 100, 1000}.
- Generated testing samples of size 5000 for different values of N and M.
- Added noise to simulate random measurement errors: replaced {0%, 1%, 10%} of each variable's values with values randomly sampled from the distribution of that variable in the simulated data.

Speaker note: Mention that these sample sizes are realistic, e.g. what is used in molecular high-throughput data analysis. Not asymptotic.
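The noise step can be sketched as follows. This is one possible reading of the description, not the authors' procedure: it assumes that exactly `round(fraction * n)` positions are replaced per variable and that replacement values are drawn from the variable's own empirical distribution, so a replacement may coincide with the old value. The function name `add_measurement_noise` is hypothetical.

```python
import random

def add_measurement_noise(column, fraction, rng):
    """Replace `fraction` of the entries in `column` with draws from the
    column's own empirical distribution (a draw may equal the old value)."""
    noisy = list(column)
    n_replace = round(fraction * len(column))
    # Pick distinct positions to overwrite, then resample each from the column.
    for i in rng.sample(range(len(column)), n_replace):
        noisy[i] = rng.choice(column)
    return noisy

rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
column = [0, 1] * 50    # made-up values of one binary variable, 100 samples
noisy_column = add_measurement_noise(column, 0.10, rng)
```

Applying the function to each variable in turn, with `fraction` set to 0.0, 0.01, or 0.10, gives the three noise levels listed above.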

10
**Overview of Experiments with SVM Weight-Based Methods**

Variable selection by SVM weights & classification:
- Used C = {0.001, 0.01, 0.1, 1, 10, 100, 1000}
- Classified with the 10%, 20%, …, 90%, 100% top-ranked variables
- Also classified baselines (causally relevant, non-causally relevant, all relevant, and irrelevant)

Variable selection by SVM-RFE & classification:
- Removed one variable at a time
- 75% training / 25% testing
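Both selection schemes can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: variables are ranked by the absolute value of their weight, and `fit_weights` is a hypothetical callback that stands in for training a linear SVM on the currently active variables. The 75%/25% split used to evaluate each subset is not shown.

```python
def rank_by_weight(weights):
    """Return variable indices ordered from largest to smallest |weight|."""
    return sorted(range(len(weights)), key=lambda j: -abs(weights[j]))

def top_fraction(weights, fraction):
    """Keep the top `fraction` of variables (at least one) by |weight|."""
    k = max(1, round(fraction * len(weights)))
    return rank_by_weight(weights)[:k]

def recursive_feature_elimination(fit_weights, n_variables, n_keep):
    """Drop the variable with the smallest |weight|, one at a time.

    `fit_weights(active)` is a hypothetical callback: it should train a
    linear SVM on the variables listed in `active` and return one weight
    per entry of `active`, in the same order.
    """
    active = list(range(n_variables))
    eliminated = []
    while len(active) > n_keep:
        w = fit_weights(active)
        worst = min(range(len(active)), key=lambda j: abs(w[j]))
        eliminated.append(active.pop(worst))
    return active, eliminated

# Made-up weights, used here in place of a trained SVM.
weights = [0.1, -0.9, 0.5, 0.0]
kept, dropped = recursive_feature_elimination(
    lambda active: [weights[j] for j in active], n_variables=4, n_keep=2)
```

In real use the callback would retrain the SVM at every step, so the weights of the remaining variables can change after each elimination; that retraining is what distinguishes RFE from a single ranking.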

11
**SVM Formulation Used**
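The formula shown on this slide is not preserved in the transcript. For orientation only, a commonly used soft-margin formulation of the linear SVM is given here; it is an assumption and may differ from the one the authors used. The symbols are also assumptions: x_i is the i-th training vector, t_i ∈ {−1, +1} its response, w the weight vector, b the offset, ξ_i the slack variables, and C the penalty parameter varied in the experiments.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
t_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

In this form, a small C favors a wide margin over fitting every training point, and a large C does the reverse.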

12
**Results**

Speaker note: Give a preview: I will show experiments that lead to the point that SVM weight-based methods cannot be used for local causal discovery.

13
**I. SVMs Can Assign Higher Weights to the Irrelevant Variables than to the Non-Causally Relevant Ones**

Figure: average ranks of variables (by SVM weights) over 30 random training samples of size 100 (w/o noise) from network 1a with 100 relevant and irrelevant variables. Panels: C is small (≤0.01); C is large (≥0.1).

Speaker note: Explain the meaning of ranks: high rank = high weight. One may say: use the value of C that better fits the data. They both fit well.
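Averaging ranks over repeated training samples can be sketched as below. The convention that a high rank means a high weight follows the slide; ranking by absolute weight and breaking ties by variable index are assumptions, and the weight vectors are made up for illustration.

```python
def weight_ranks(weights):
    """Rank variables by |weight|: rank 1 is the smallest, rank n the largest.
    Ties are broken by variable index."""
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]))
    ranks = [0] * len(weights)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks

def average_ranks(weight_vectors):
    """Average each variable's rank over several trained models."""
    n = len(weight_vectors[0])
    totals = [0.0] * n
    for w in weight_vectors:
        for j, r in enumerate(weight_ranks(w)):
            totals[j] += r
    return [total / len(weight_vectors) for total in totals]

# Two made-up weight vectors standing in for SVMs trained on two samples.
avg = average_ranks([[0.2, -0.7, 0.1], [0.9, 0.3, -0.5]])
```

With 30 training samples, `weight_vectors` would hold 30 weight vectors, one per trained SVM.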

14
**I. SVMs Can Assign Higher Weights to the Irrelevant Variables than to the Non-Causally Relevant Ones**

Figure: AUC analysis for discrimination between groups of all relevant and irrelevant variables based on SVM weights.

Figure: AUC classification performance obtained on the 5,000-sample independent testing set: results for variable ranking based on SVM weights.
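The AUC for discriminating two groups of variables by their SVM weights can be computed directly from pairwise comparisons. This is a minimal sketch that assumes absolute weights are used as scores; the weights and group memberships below are made up for illustration.

```python
def group_auc(scores_positive, scores_negative):
    """AUC as the probability that a random positive-group score exceeds a
    random negative-group score, counting ties as one half."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))

weights = [0.8, -0.6, 0.05, 0.3, -0.1]           # made-up SVM weights
relevant = [abs(weights[j]) for j in (0, 1, 2)]  # hypothetical relevant variables
irrelevant = [abs(weights[j]) for j in (3, 4)]   # hypothetical irrelevant variables
auc = group_auc(relevant, irrelevant)
```

An AUC of 1 would mean every relevant variable outranks every irrelevant one, 0.5 corresponds to a ranking no better than chance, and values below 0.5 mean irrelevant variables tend to receive the higher weights.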

15
**II. SVMs Can Select Irrelevant Variables More Frequently than Non-Causally Relevant Ones**

Figure: probability of selecting variables (by SVM-RFE) estimated over 30 random training samples of size 100 (w/o noise) from network 1a with 100 relevant and irrelevant variables. Panels: C is small (≤0.01); C is large (≥0.1).
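The selection probability of a variable can be estimated as the fraction of training samples in which it was selected. A minimal sketch, with a hypothetical function name and made-up selections:

```python
def selection_probability(selected_sets, n_variables):
    """Fraction of runs in which each variable was selected."""
    counts = [0] * n_variables
    for selected in selected_sets:
        for j in set(selected):  # count each variable at most once per run
            counts[j] += 1
    return [count / len(selected_sets) for count in counts]

# Made-up selections from four runs over three variables.
probs = selection_probability([[0, 1], [1], [1, 2], [0]], n_variables=3)
```

In the setting described above, `selected_sets` would contain 30 entries, one per training sample.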

16
**II. SVMs Can Select Irrelevant Variables More Frequently than Non-Causally Relevant Ones**

AUC classification performance obtained on the 5,000-sample independent testing set: results for variable selection by SVM-RFE

17
**III. SVMs Can Assign Higher Weights to the Non-Causally Relevant Variables Than to the Causally Relevant Ones**

Figure: average ranks of variables (by SVM weights) over 30 random training samples of size 500 (w/o noise) from network 2 with 100 relevant and irrelevant variables.

Figure: AUC analysis for discrimination between groups of causally relevant and non-causally relevant variables based on SVM weights.

18
**IV. SVMs Can Select Non-Causally Relevant Variables More Frequently Than the Causally Relevant Ones**

Probability of selecting variables (by SVM-RFE) estimated over 30 random training samples of size 500 (w/o noise) from network 2 with 100 relevant and irrelevant variables

19
**V. SVMs Can Assign Higher Weights to the Irrelevant Variables Than to the Causally Relevant Ones**

Figure: average ranks of variables (by SVM weights) over 30 random training samples of size 100 (w/o noise) from network 2 with 100 relevant and irrelevant variables.

Figure: AUC analysis for discrimination between groups of causally relevant and non-causally relevant variables based on SVM weights.

20
**VI. SVMs Can Select Irrelevant Variables More Frequently Than the Causally Relevant Ones**

Probability of selecting variables (by SVM-RFE) estimated over 30 random training samples of size 100 (w/o noise) from network 2 with 100 relevant and irrelevant variables

21
**Theoretical Example 1 (Network structure 2)**

- P(X1=-1) = ½, P(X1=1) = ½, P(X2=-1) = ½, and P(X2=1) = ½.
- Y is a “synthesis variable” defined by a function (the formula is not preserved in this transcript).
- T is a binary response variable defined by a formula (also not preserved).
- Variables X1, X2, and Y have expected value 0 and variance 1.

Figure: network over X1, X2, T, and Y.

The application of linear SVMs results in the following weights: 1/2 for X1, 1/2 for X2, and a larger weight for Y (the value is not preserved in this transcript). Therefore, the non-causally relevant variable Y receives a higher SVM weight than the causally relevant ones.

Speaker note: Mention that this holds in the sample limit.

22
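**Theoretical Example 2**

Figure residue: classes T = + and T = -; variables X, Y, and T; two graphs G1 and G2; conditional statements involving Y and T given X, and X and Y given T (the relation symbols are not preserved).

The maximum-gap inductive bias is inconsistent with local causal discovery.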

23
**Discussion**

1. Using nonlinear SVM weight-based methods. Preliminary experiment: when polynomial SVM-RFE is used, the non-causally relevant variable is never selected in network structure 2. However, the performance of polynomial SVM-RFE is similar to that of linear SVM-RFE.
2. The framework of formal causal discovery (Spirtes et al, 2000) provides algorithms that can solve these problems, e.g. HITON (Aliferis et al, 2003) or MMPC & MMMB (Tsamardinos et al, 2003; Tsamardinos et al, 2006).
3. Methods based on modified SVM formulations, e.g. 0-norm and 1-norm penalties (Weston et al, 2003; Zhu et al, 2004).
4. Extend the empirical evaluation to different distributions.

24
**Conclusion**

- Causal interpretation of the current SVM weight-based variable selection techniques must be conducted with great caution by practitioners.
- The inductive bias employed by SVMs is locally causally inconsistent.
- New SVM methods may be needed to address this issue, and this is an exciting and challenging area of research.

Speaker note: Say that this is not of theoretical concern. This interpretation is used in bioinformatics.
