
Using SVM Weight-Based Methods to Identify Causally Relevant and Non-Causally Relevant Variables
NIPS 2006 Workshop on Causality and Feature Selection
Alexander Statnikov1, Douglas Hardin1,2, Constantin Aliferis1,3
1Department of Biomedical Informatics, 2Department of Mathematics, 3Department of Cancer Biology, Vanderbilt University, Nashville, TN, USA

Major Goals of Variable Selection
- Construct faster and more cost-effective classifiers.
- Improve the prediction performance of the classifiers.
- Gain insight into the underlying data-generating process.

Taxonomy of Variables
Variables are divided into relevant and irrelevant; relevant variables are further divided into causally relevant and non-causally relevant.
[Diagram: an example causal network with response T and variables D, E, F, J, K, L, M illustrating the causally relevant, non-causally relevant, and irrelevant groups.]
(Speaker note: say that this classification (into causally relevant and irrelevant) is done for the purposes of this paper. In real-world networks, there may be thousands of variables upstream of T. In this case, we can identify them by random selection. What we are concerned with here is to select causal variables locally.)

Support Vector Machine (SVM) Weight-Based Variable Selection Methods
- Scale up to datasets with many thousands of variables and as few as dozens of samples.
- Often yield variables that are more predictive than those output by other variable selection techniques or than the full (unreduced) variable set (Guyon et al., 2002; Rakotomamonjy, 2003).
- Currently unknown: do we gain insight into the causal structure?
- (Hardin et al., 2004): irrelevant variables will be given zero weight by a linear SVM in the sample limit; however, a linear SVM may assign zero weight to strongly relevant variables and nonzero weight to weakly relevant variables. Hardin et al. use the Kohavi-John definition of relevance.
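Concretely, "weight-based" selection here means ranking variables by the magnitude of the weights of a fitted linear SVM. The sketch below shows only that ranking step using scikit-learn; the toy data generator and the C value are placeholders, not the authors' setup.

```python
# Minimal sketch: rank variables by the absolute weights of a linear SVM.
# Toy data for illustration only; requires numpy and scikit-learn.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.binomial(1, 0.5, size=(100, 20)).astype(float)    # 100 samples, 20 binary variables
y = np.where(rng.rand(100) < 0.95, X[:, 0], 1 - X[:, 0])  # response driven by the first variable
y = y.astype(int)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
weights = np.abs(svm.coef_.ravel())                        # one weight per variable
ranking = np.argsort(-weights)                             # variable indices, largest weight first
print("Top 5 variables by |weight|:", ranking[:5])
```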

Simulation Experiments: Network Structure 1
- P(Y=0) = ½ and P(Y=1) = ½; Y is hidden from the learner.
- {Xi}, i=1,…,N, are binary variables with P(Xi=0|Y=0) = q and P(Xi=1|Y=1) = q.
- {Zi}, i=1,…,M, are independent binary variables with P(Zi=0) = ½ and P(Zi=1) = ½.
- T is a binary response variable with P(T=0|X1=0) = 0.95 and P(T=1|X1=1) = 0.95.
- q = 0.95 gives network 1a; q = 0.99 gives network 1b.
[Diagram: the hidden Y is a common parent of the relevant variables X1,…,XN; X1 is the causally relevant parent of the response T; Z1,…,ZM are the irrelevant variables.]
(Speaker note: mention that the network structures obey the Causal Markov Condition.)
A sketch of this generator is given below.
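The following sketch implements the generative process just described (hidden Y, relevant Xi that copy Y with fidelity q, independent irrelevant Zi, and T following X1 95% of the time). The function and argument names are mine, and the sample sizes are examples.

```python
# Sketch of the network-structure-1 generator described above.
import numpy as np

def simulate_network1(n_samples, n_relevant, n_irrelevant, q=0.95, seed=0):
    rng = np.random.RandomState(seed)
    y = rng.binomial(1, 0.5, n_samples)                        # hidden variable Y
    # Each Xi equals Y with probability q: P(Xi=0|Y=0) = P(Xi=1|Y=1) = q
    copy_y = rng.rand(n_samples, n_relevant) < q
    X_rel = np.where(copy_y, y[:, None], 1 - y[:, None])
    # Irrelevant Zi are independent fair coin flips
    X_irr = rng.binomial(1, 0.5, (n_samples, n_irrelevant))
    # T follows X1 with probability 0.95: P(T=0|X1=0) = P(T=1|X1=1) = 0.95
    follow_x1 = rng.rand(n_samples) < 0.95
    T = np.where(follow_x1, X_rel[:, 0], 1 - X_rel[:, 0])
    return np.hstack([X_rel, X_irr]), T

# Example: network 1a (q = 0.95) with N = 100 relevant and M = 100 irrelevant variables
X, T = simulate_network1(n_samples=100, n_relevant=100, n_irrelevant=100, q=0.95)
```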

Simulation Experiments: Network Structure 1 in Real-World Distributions
[Figure: adrenal gland cancer pathway produced by Ariadne Genomics PathwayStudio software version 4.0 (http://www.ariadnegenomics.com/), showing the disease and its putative causes (except for kras).]

Simulation Experiments: Network Structure 2
- {Xi}, i=1,…,N, are independent binary variables with P(Xi=0) = ½ and P(Xi=1) = ½.
- {Zi}, i=1,…,M, are independent binary variables with P(Zi=0) = ½ and P(Zi=1) = ½.
- Y is a "synthesis variable" computed from the Xi by a fixed function (equation shown on the slide).
- T is a binary response variable (defining equation shown on the slide), where the vi's are generated from the uniform U(0,1) distribution and are fixed for all experiments.
[Diagram: X1,…,XN (causally relevant) are parents of both the response T and the synthesis variable Y (non-causally relevant); Z1,…,ZM are the irrelevant variables.]
A hedged sketch, with an assumed synthesis function, follows below.
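The defining equations for Y and T are not preserved in this transcript, so the sketch below uses stand-ins: it assumes Y is a weighted sum of the Xi with the fixed U(0,1) weights vi, and that T thresholds the same weighted sum at its expected value. These are illustrative assumptions, not the authors' actual definitions.

```python
# Hedged sketch of a network-structure-2-style generator.
# The synthesis function for Y and the rule defining T are ASSUMED here
# (weighted sum and thresholding); the slide's actual equations are not in the transcript.
import numpy as np

def simulate_network2_like(n_samples, n_relevant, n_irrelevant, seed=0):
    rng = np.random.RandomState(seed)
    v = np.random.RandomState(123).rand(n_relevant)           # fixed vi ~ U(0,1), reused across experiments
    X_rel = rng.binomial(1, 0.5, (n_samples, n_relevant))     # independent causally relevant Xi
    X_irr = rng.binomial(1, 0.5, (n_samples, n_irrelevant))   # independent irrelevant Zi
    score = X_rel @ v
    Y = score                                                 # assumed "synthesis variable"
    T = (score > v.sum() / 2).astype(int)                     # assumed thresholding rule for the response
    data = np.column_stack([X_rel, Y, X_irr])
    return data, T

data, T = simulate_network2_like(n_samples=500, n_relevant=100, n_irrelevant=100)
```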

Simulation Experiments: Network Structure 2 in Real-World Distributions
[Figure: pathway showing putative causes of the disease and the targets of those putative causes.]

Data Generation
- Generated 30 training samples for each size in {100, 200, 500, 1000}, for different values of N (number of relevant variables) in {10, 100} and M (number of irrelevant variables) in {10, 100, 1000}.
- Generated testing samples of size 5000 for the different values of N and M.
- Added noise to simulate random measurement errors: replaced {0%, 1%, 10%} of each variable's values with values randomly sampled from the distribution of that variable in the simulated data (see the sketch below).
(Speaker note: mention that these sample sizes are realistic, e.g., what is used in molecular high-throughput data analysis; this is not an asymptotic setting.)
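A sketch of the noise-injection step: for each variable, a fixed fraction of its values is replaced with values drawn from that variable's own marginal distribution in the simulated data. The function name and the toy data are placeholders.

```python
# Sketch of the measurement-noise model described above.
import numpy as np

def add_measurement_noise(X, fraction, seed=0):
    rng = np.random.RandomState(seed)
    X_noisy = X.copy()
    n_samples, n_vars = X.shape
    n_replace = int(round(fraction * n_samples))
    for j in range(n_vars):
        rows = rng.choice(n_samples, size=n_replace, replace=False)  # entries to corrupt
        X_noisy[rows, j] = rng.choice(X[:, j], size=n_replace)       # resample from this variable's distribution
    return X_noisy

X = np.random.RandomState(1).binomial(1, 0.5, size=(100, 5)).astype(float)  # toy data
X_noisy_10 = add_measurement_noise(X, fraction=0.10)                        # 10% noise level
```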

Overview of Experiments with SVM Weight-Based Methods
Variable selection by SVM weights & classification:
- Used C in {0.001, 0.01, 0.1, 1, 10, 100, 1000}.
- Classified using the 10%, 20%, …, 90%, 100% top-ranked variables.
- Also classified the baseline variable sets (causally relevant, non-causally relevant, all relevant, and irrelevant).
Variable selection by SVM-RFE & classification:
- Removed one variable at a time.
- Used a 75% training / 25% testing split.
A scikit-learn sketch of both procedures is given below.
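This is not the authors' code; it is a minimal scikit-learn sketch of the two procedures on toy data: ranking by linear-SVM weights followed by classification with the top-ranked variables, and SVM-RFE removing one variable per step with a 75%/25% split.

```python
# Sketch of the evaluation pipeline described above (toy data, illustrative only).
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X = rng.binomial(1, 0.5, size=(200, 30)).astype(float)
y = np.where(rng.rand(200) < 0.95, X[:, 0], 1 - X[:, 0]).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)  # 75% / 25% split

for C in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:                    # the C grid from the slide
    # (a) variable selection by SVM weights, then classification with the top 10% of variables
    weights = np.abs(SVC(kernel="linear", C=C).fit(X_tr, y_tr).coef_.ravel())
    top = np.argsort(-weights)[: max(1, int(0.1 * X.shape[1]))]
    clf = SVC(kernel="linear", C=C).fit(X_tr[:, top], y_tr)
    auc_weights = roc_auc_score(y_te, clf.decision_function(X_te[:, top]))

    # (b) SVM-RFE: recursively remove one variable at a time
    rfe = RFE(SVC(kernel="linear", C=C), n_features_to_select=3, step=1).fit(X_tr, y_tr)
    clf_rfe = SVC(kernel="linear", C=C).fit(X_tr[:, rfe.support_], y_tr)
    auc_rfe = roc_auc_score(y_te, clf_rfe.decision_function(X_te[:, rfe.support_]))
    print(f"C={C}: AUC(top weights)={auc_weights:.3f}, AUC(SVM-RFE)={auc_rfe:.3f}")
```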

SVM Formulation Used
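The formulation itself appeared as an equation image and is not preserved in this transcript. For reference, the standard soft-margin linear SVM with a single cost parameter C, which the C grid used elsewhere in the talk suggests (this is an assumption, not a statement of what the slide showed), is:

```latex
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert_2^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{subject to} \quad y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1,\dots,n.
```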

Results
(Speaker note: give a preview: the experiments that follow lead to the point that SVM weight-based methods cannot be used for local causal discovery.)

I. SVMs Can Assign Higher Weights to the Irrelevant Variables than to the Non-Causally Relevant Ones
[Figure: average ranks of variables (by SVM weights) over 30 random training samples of size 100 (without noise) from network 1a with 100 relevant and irrelevant variables; panels for small C (≤ 0.01) and large C (≥ 0.1).]
(Speaker note: explain the meaning of ranks: a high rank corresponds to a high weight. One may say "use the value of C that better fits the data", but both values fit the data well.)

I. SVMs Can Assign Higher Weights to the Irrelevant Variables than to the Non-Causally Relevant Ones (continued)
[Figure: AUC analysis for discrimination between the groups of all relevant and irrelevant variables based on SVM weights.]
[Figure: AUC classification performance obtained on the 5,000-sample independent testing set; results for variable ranking based on SVM weights.]
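A sketch of how such a group-discrimination AUC can be computed: each variable's |SVM weight| serves as a score, and the known relevant/irrelevant group labels (available because the data are simulated) serve as the ground truth. The data and group sizes below are toy placeholders.

```python
# Sketch: AUC for discriminating relevant from irrelevant variables by |SVM weight|.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
n_samples, n_rel, n_irr = 100, 10, 10
hidden_y = rng.binomial(1, 0.5, n_samples)                           # hidden Y as in network 1
X_rel = np.where(rng.rand(n_samples, n_rel) < 0.95, hidden_y[:, None], 1 - hidden_y[:, None])
X_irr = rng.binomial(1, 0.5, (n_samples, n_irr))                     # irrelevant variables
T = np.where(rng.rand(n_samples) < 0.95, X_rel[:, 0], 1 - X_rel[:, 0])

X = np.hstack([X_rel, X_irr]).astype(float)
weights = np.abs(SVC(kernel="linear", C=1.0).fit(X, T).coef_.ravel())
group = np.r_[np.ones(n_rel), np.zeros(n_irr)]                       # 1 = relevant, 0 = irrelevant
print("Relevant-vs-irrelevant AUC from |weights|:", roc_auc_score(group, weights))
```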

II. SVMs Can Select Irrelevant Variables More Frequently than Non-Causally Relevant Ones
[Figure: probability of selecting variables (by SVM-RFE) estimated over 30 random training samples of size 100 (without noise) from network 1a with 100 relevant and irrelevant variables; panels for small C (≤ 0.01) and large C (≥ 0.1).]
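A sketch of how the "probability of selecting a variable" can be estimated: repeat the draw-a-training-sample / run-SVM-RFE loop and count how often each variable survives. The generator, sample size, and number of retained variables are toy placeholders.

```python
# Sketch: per-variable selection frequency of SVM-RFE over repeated training samples.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

def draw_training_sample(seed, n=100, n_rel=5, n_irr=5):
    rng = np.random.RandomState(seed)
    hidden_y = rng.binomial(1, 0.5, n)
    X_rel = np.where(rng.rand(n, n_rel) < 0.95, hidden_y[:, None], 1 - hidden_y[:, None])
    X_irr = rng.binomial(1, 0.5, (n, n_irr))
    T = np.where(rng.rand(n) < 0.95, X_rel[:, 0], 1 - X_rel[:, 0])
    return np.hstack([X_rel, X_irr]).astype(float), T

counts = np.zeros(10)
for seed in range(30):                                               # 30 random training samples
    X, T = draw_training_sample(seed)
    rfe = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=3, step=1).fit(X, T)
    counts += rfe.support_.astype(float)
print("Estimated selection probability per variable:", counts / 30)
```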

II. SVMs Can Select Irrelevant Variables More Frequently than Non-Causally Relevant Ones (continued)
[Figure: AUC classification performance obtained on the 5,000-sample independent testing set; results for variable selection by SVM-RFE.]

III. SVMs Can Assign Higher Weights to the Non-Causally Relevant Variables Than to the Causally Relevant Ones
[Figure: average ranks of variables (by SVM weights) over 30 random training samples of size 500 (without noise) from network 2 with 100 relevant and irrelevant variables.]
[Figure: AUC analysis for discrimination between the groups of causally relevant and non-causally relevant variables based on SVM weights.]

IV. SVMs Can Select Non-Causally Relevant Variables More Frequently Than the Causally Relevant Ones
[Figure: probability of selecting variables (by SVM-RFE) estimated over 30 random training samples of size 500 (without noise) from network 2 with 100 relevant and irrelevant variables.]

V. SVMs Can Assign Higher Weights to the Irrelevant Variables Than to the Causally Relevant Ones
[Figure: average ranks of variables (by SVM weights) over 30 random training samples of size 100 (without noise) from network 2 with 100 relevant and irrelevant variables.]
[Figure: AUC analysis for discrimination between the groups of causally relevant and irrelevant variables based on SVM weights.]

VI. SVMs Can Select Irrelevant Variables More Frequently Than the Causally Relevant Ones
[Figure: probability of selecting variables (by SVM-RFE) estimated over 30 random training samples of size 100 (without noise) from network 2 with 100 relevant and irrelevant variables.]

Theoretical Example 1 (Network Structure 2)
- P(X1=-1) = ½, P(X1=1) = ½, P(X2=-1) = ½, and P(X2=1) = ½.
- Y is a "synthesis variable" (defining equation shown on the slide); T is a binary response variable (defining equation shown on the slide).
- Variables X1, X2, and Y have expected value 0 and variance 1.
- The application of linear SVMs results in the following weights: 1/2 for X1, 1/2 for X2, and a larger weight for Y (value shown on the slide).
- Therefore, the non-causally relevant variable Y receives a higher SVM weight than the causally relevant ones.
[Diagram: X1 and X2 are parents of both T and Y.]
(Speaker note: mention that this holds in the sample limit.)

Theoretical Example 2
[Figure: a scatter plot of the two classes (T = + and T = -) in the (X, Y) plane, together with two causal graphs G1 and G2 over X, Y, and T. In one graph Y is independent of T given X; in the other, X is independent of Y given T.]
The maximum-gap inductive bias is inconsistent with local causal discovery.

Discussion
1. Using nonlinear SVM weight-based methods. Preliminary experiment: when polynomial SVM-RFE is used, the non-causally relevant variable is never selected in network structure 2; however, the classification performance of polynomial SVM-RFE is similar to that of linear SVM-RFE.
2. The framework of formal causal discovery (Spirtes et al., 2000) provides algorithms that can solve these problems, e.g., HITON (Aliferis et al., 2003) or MMPC & MMMB (Tsamardinos et al., 2003; Tsamardinos et al., 2006).
3. Methods based on modified SVM formulations, e.g., 0-norm and 1-norm penalties (Weston et al., 2003; Zhu et al., 2004).
4. Extend the empirical evaluation to different distributions.

Conclusion
- Causal interpretation of current SVM weight-based variable selection techniques must be conducted with great caution by practitioners.
- The inductive bias employed by SVMs is locally causally inconsistent.
- New SVM methods may be needed to address this issue; this is an exciting and challenging area of research.
(Speaker note: say that this is not only of theoretical concern; this interpretation is used in bioinformatics.)