Feature Selection Bioinformatics Data Analysis and Tools


Lecture 8: Feature Selection - Bioinformatics Data Analysis and Tools
Elena Marchiori (elena@few.vu.nl)

Why select features? Select a subset of "relevant" input variables. Advantages:
- it is cheaper to measure fewer variables
- the resulting classifier is simpler and potentially faster
- prediction accuracy may improve by discarding irrelevant variables
- identifying the relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection)

Selection based on variance: an unsupervised pre-filter that ranks features by their variance across samples and keeps the top ones. [Figure: correlation plots of the 3-class Leukemia data, with no feature selection vs. the top 100 features selected by variance; color scale from -1 to +1.]
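A pre-filter of this kind can be sketched in a few lines of numpy. The data matrix and the choice of 100 features below are illustrative, not the lecture's actual Leukemia data:

```python
import numpy as np

def top_variance_features(X, k=100):
    """Return the indices of the k features (columns) with highest variance."""
    variances = X.var(axis=0)
    # argsort is ascending, so reverse it and take the first k indices
    return np.argsort(variances)[::-1][:k]

# Toy data: 20 samples x 500 features standing in for an expression matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))
X[:, 7] *= 10.0          # give feature 7 an artificially large variance
idx = top_variance_features(X, k=100)
```

Because the selection never looks at the class labels, it can be applied before clustering as well as before classification.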

Approaches: wrapper, filter, embedded.
- Wrapper feature selection takes into account the contribution of a feature subset to the performance of a given type of classifier.
- Filter feature selection is based on an evaluation criterion quantifying how well features (or feature subsets) discriminate the two classes.
- Embedded feature selection is part of the training procedure of a classifier (e.g. decision trees).

Embedded methods attempt to jointly (simultaneously) train both a classifier and a feature subset. They often optimize an objective function that jointly rewards classification accuracy and penalizes the use of more features. Intuitively appealing. Example: tree-building algorithms. (Adapted from J. Fridlyand)

Approaches to feature selection (adapted from Shin and Jasso):
- Filter approach: input features -> feature selection by distance-metric score -> train model -> model.
- Wrapper approach: input features -> search over feature sets, scoring each candidate set by training a model -> model; the importance of features is given by the model.

Filter methods reduce the p original features to s selected features, with s << p, before classifier design. Features are scored independently and the top s are used by the classifier. Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc. Easy to interpret; can provide some insight into the disease markers. (Adapted from J. Fridlyand)
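A minimal sketch of such a filter, using a two-sample t-like score (difference of class means scaled by the within-class variances); the data and the choice s = 10 are illustrative:

```python
import numpy as np

def t_score(X, y):
    """Per-feature two-sample t-like statistic for binary labels y in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    v0, v1 = X0.var(axis=0, ddof=1), X1.var(axis=0, ddof=1)
    return np.abs(m0 - m1) / np.sqrt(v0 / len(X0) + v1 / len(X1))

def filter_select(X, y, s):
    """Score features independently and keep the top s (s << p)."""
    return np.argsort(t_score(X, y))[::-1][:s]

rng = np.random.default_rng(1)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 200))
X[y == 1, 3] += 5.0      # make feature 3 strongly differential
selected = filter_select(X, y, s=10)
```

Note that each feature is scored on its own, which is exactly the source of the redundancy and interaction problems discussed next.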

Problems with filter methods:
- Redundancy in the selected features: features are considered independently, so they are not measured on whether they contribute new information.
- Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others).
- The classifier has no say in which features should be used: some scores may be more appropriate in conjunction with some classifiers than others.
(Adapted from J. Fridlyand)

Dimension reduction: a variant on a filter method. Rather than retaining a subset of s features, perform dimension reduction by projecting the features onto s principal components of variation (e.g. PCA). The problem is that we are no longer dealing with one feature at a time but with a linear (or possibly more complicated) combination of all features. It may be good enough for a black box, but how does one build a diagnostic chip on a "supergene"? (even though we don't want to confuse the tasks). These methods tend not to work better than simple filter methods. (Adapted from J. Fridlyand)
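The projection step can be sketched with a plain SVD; the toy data and the choice s = 5 are illustrative:

```python
import numpy as np

def pca_project(X, s):
    """Project samples onto the top s principal components of variation."""
    Xc = X - X.mean(axis=0)              # center each feature
    U, sing, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:s].T                 # each new "feature" mixes all originals

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 80))
Z = pca_project(X, s=5)
```

The last line makes the "supergene" objection concrete: every column of Z is a weighted combination of all 80 original features, so no small probe set corresponds to it.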

Wrapper methods also reduce the p original features to s << p, but iteratively: many feature subsets are scored based on classification performance, and the best one is used. Selection of subsets: forward selection, backward selection, forward-backward selection, tree harvesting, etc. (Adapted from J. Fridlyand)
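Greedy forward selection, the simplest of these searches, can be sketched as below. The nearest-centroid classifier and leave-one-out scoring are illustrative stand-ins for whatever classifier the wrapper wraps:

```python
import numpy as np

def loo_error(X, y, feats):
    """Leave-one-out error of a nearest-centroid classifier on given features."""
    errors = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        Xtr, ytr = X[mask][:, feats], y[mask]
        c0 = Xtr[ytr == 0].mean(axis=0)
        c1 = Xtr[ytr == 1].mean(axis=0)
        x = X[i, feats]
        pred = 0 if np.linalg.norm(x - c0) <= np.linalg.norm(x - c1) else 1
        errors += int(pred != y[i])
    return errors / len(y)

def forward_select(X, y, s):
    """Greedily add the feature that most lowers LOO error, up to s features."""
    chosen = []
    for _ in range(s):
        remaining = [f for f in range(X.shape[1]) if f not in chosen]
        best = min(remaining, key=lambda f: loo_error(X, y, chosen + [f]))
        chosen.append(best)
    return chosen

rng = np.random.default_rng(3)
y = np.array([0] * 8 + [1] * 8)
X = rng.normal(size=(16, 30))
X[y == 1, 5] += 4.0      # feature 5 separates the classes
subset = forward_select(X, y, s=2)
```

Even this toy run makes the cost visible: every candidate feature at every step requires a full LOO evaluation of the classifier.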

Problems with wrapper methods:
- Computationally expensive: for each feature subset considered, a classifier must be built and evaluated.
- No exhaustive search is possible (2^p subsets to consider): generally only greedy algorithms.
- Easy to overfit.
(Adapted from J. Fridlyand)

Example: microarray analysis. "Labeled" cases: 38 bone marrow samples (27 ALL, 11 AML), each with 7129 gene expression values. Train a model (using neural networks, support vector machines, Bayesian nets, etc.) that identifies key genes; apply the model to 34 new unlabeled bone marrow samples to predict AML vs. ALL.

Microarray data poses challenges to machine learning algorithms: few samples for analysis (38 labeled); extremely high-dimensional data (7129 gene expression values per sample); noisy data; complex underlying mechanisms, not fully understood.

Some genes are more useful than others for building classification models. Example: genes 36569_at and 36495_at are useful. [Scatter plot of the two genes' expression values, with the AML and ALL samples forming separable groups.]


Some genes are more useful than others for building classification models. Example: genes 37176_at and 36563_at are not useful. [Scatter plot: the AML and ALL samples overlap on these two genes.]

Importance of feature (gene) selection:
- The majority of genes are not directly related to leukemia.
- Having a large number of features enhances the model's flexibility but makes it prone to overfitting; noise and the small number of training samples make this even more likely.
- Some types of models, like kNN, do not scale well with many features.

With 7129 genes, how do we choose the best? Use distance metrics that capture class separation: rank the genes according to their distance-metric score (from HIGH to LOW) and choose the top n ranked genes.

Distance metrics: Tamayo's relative class separation, the t-test, the Bhattacharyya distance.
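The first of these, Tamayo's relative class separation (also called the signal-to-noise ratio), is the between-class mean difference divided by the sum of the within-class standard deviations. A minimal sketch on illustrative data:

```python
import numpy as np

def tamayo_score(X, y):
    """Relative class separation |mu0 - mu1| / (sigma0 + sigma1) per feature."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    s0, s1 = X0.std(axis=0, ddof=1), X1.std(axis=0, ddof=1)
    return np.abs(m0 - m1) / (s0 + s1)

# Two genes: one shifted between classes (high score), one not (low score)
X = np.array([[0.1, 1.0], [0.2, 2.0], [0.0, 1.5],
              [3.1, 1.2], [3.0, 1.8], [2.9, 1.4]])
y = np.array([0, 0, 0, 1, 1, 1])
scores = tamayo_score(X, y)
```

Here the first gene gets score 2.9 / (0.1 + 0.1) = 14.5, while the second gene, whose class means nearly coincide, scores close to zero.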

SVM-RFE: a wrapper. Recursive Feature Elimination:
- train a linear SVM -> linear decision function
- use the absolute values of the variable weights to rank the variables
- remove the half of the variables with lower rank
- repeat the above steps (train, rank, remove) on the data restricted to the variables not removed
Output: a subset of variables.

SVM-RFE uses the linear binary classifier decision function f(x) = sign(w . x + b); the score of variable i is |w_i|. At each iteration, Recursive Feature Elimination eliminates the threshold% of variables with the lowest scores and recomputes the scores of the remaining variables.

SVM-RFE: I. Guyon et al., Machine Learning, 46, 389-422, 2002.

RELIEF (Kira K. and Rendell L., 10th Int. Conf. on AI, 129-134, 1992)
Idea: relevant variables make the nearest examples of the same class closer and push the nearest examples of opposite classes farther apart.
weights = zero
for all examples in the training set:
    find the nearest example from the same class (hit) and from the opposite class (miss)
    update the weight of each variable by adding abs(example - miss) - abs(example - hit)

RELIEF assigns weights to variables based on how well they separate samples from their nearest neighbors (nnb) from the same and from the opposite class.

RELIEF
% input: X (two classes)
% output: W (weights assigned to variables)
nr_var = total number of variables
weights = zero vector of size nr_var
for all x in X do
    hit(x) = nnb of x from the same class
    miss(x) = nnb of x from the opposite class
    weights += abs(x - miss(x)) - abs(x - hit(x))
end
nr_ex = number of examples in X
return W = weights / nr_ex

Note: variables have to be normalized first (e.g., divide each variable by its (max - min) value).
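A direct transcription of the pseudocode above into numpy; the toy data is illustrative, and features are range-normalized first, as the note requires:

```python
import numpy as np

def relief(X, y):
    """RELIEF weights: reward features that differ at the nearest miss and
    penalize features that differ at the nearest hit, averaged over samples."""
    # normalize each variable by its (max - min) range
    span = X.max(axis=0) - X.min(axis=0)
    Xn = X / np.where(span == 0, 1.0, span)
    n = len(y)
    weights = np.zeros(X.shape[1])
    for i in range(n):
        d = np.abs(Xn - Xn[i]).sum(axis=1)       # L1 distances to sample i
        d[i] = np.inf                            # exclude the sample itself
        same = np.where(y == y[i])[0]
        opp = np.where(y != y[i])[0]
        hit = same[np.argmin(d[same])]
        miss = opp[np.argmin(d[opp])]
        weights += np.abs(Xn[i] - Xn[miss]) - np.abs(Xn[i] - Xn[hit])
    return weights / n

rng = np.random.default_rng(5)
y = np.array([0] * 12 + [1] * 12)
X = rng.normal(size=(24, 6))
X[y == 1, 0] += 3.0      # only feature 0 is relevant
W = relief(X, y)
```

On this data the relevant feature ends up with the largest (positive) weight, while the irrelevant features hover near zero, since for them hits and misses differ by similar amounts on average.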

Example: What are the weights of s1, s2, s3 and s4 assigned by RELIEF?

Classification error estimates (N samples):
- Training (empirical) error: error on the training data itself.
- Test error: error on an independent test set.
- Cross-validation (CV) error: split the data into n folds (leave-one-out, LOO, when n = N); in each round use 1/n of the samples for testing and the remaining (n-1)/n for training; count the errors over all rounds and summarize them as the CV error rate.

Two schemes of cross-validation over N samples:
- CV1: inside each LOO iteration, train and test the feature selector together with the classifier; count the errors.
- CV2: run the feature selector once on all samples, then use LOO only to train and test the classifier; count the errors.

Difference between CV1 and CV2: CV1 performs gene selection within LOOCV; CV2 performs gene selection before LOOCV. CV2 can yield an optimistic estimate of the true classification error. CV2 was used in the paper by Golub et al.: 0 training errors, 2 CV errors (5.26%), 5 test errors (14.7%). The CV error differs from the test error!
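The optimism of CV2 is easy to demonstrate on pure-noise data, where the honest error rate is 50%: selecting genes on all samples before LOOCV leaks the test labels into the selection step. A sketch with synthetic data; the correlation filter and nearest-centroid classifier are illustrative stand-ins, not the paper's pipeline:

```python
import numpy as np

def corr_select(X, y, s):
    """Top-s features by absolute correlation with the class label."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    r = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(r)[::-1][:s]

def centroid_predict(Xtr, ytr, x):
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    return 0 if np.linalg.norm(x - c0) <= np.linalg.norm(x - c1) else 1

def loo_accuracy(X, y, s, select_inside):
    hits = 0
    outside = corr_select(X, y, s)               # CV2: selection sees all labels
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        feats = corr_select(X[mask], y[mask], s) if select_inside else outside
        hits += int(centroid_predict(X[mask][:, feats], y[mask], X[i, feats]) == y[i])
    return hits / len(y)

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 1000))                  # pure noise: no real signal
y = np.array([0] * 20 + [1] * 20)
acc_cv1 = loo_accuracy(X, y, s=10, select_inside=True)   # selection inside the loop
acc_cv2 = loo_accuracy(X, y, s=10, select_inside=False)  # selection before the loop
```

On labels that carry no information, CV1 stays near chance level while CV2 reports a much higher accuracy, which is exactly the selection bias described above.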

Significance of classification results via a permutation test:
- permute the class labels of the samples
- compute the LOOCV error on the data with permuted labels
- repeat this a large number of times
- compare with the LOOCV error on the original data:
P-value = (# of times the LOOCV error on permuted data is <= the LOOCV error on the original data) / (total # of permutations considered)
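The test can be sketched on top of any error estimator; here a small nearest-centroid LOO error function stands in for the real pipeline, and the data, sizes, and 200 permutations are all illustrative:

```python
import numpy as np

def loo_error(X, y):
    """LOO error of a nearest-centroid classifier (stand-in for the real pipeline)."""
    errors = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        c0 = X[mask][y[mask] == 0].mean(axis=0)
        c1 = X[mask][y[mask] == 1].mean(axis=0)
        pred = 0 if np.linalg.norm(X[i] - c0) <= np.linalg.norm(X[i] - c1) else 1
        errors += int(pred != y[i])
    return errors / len(y)

def permutation_pvalue(X, y, n_perm=200, seed=0):
    """Fraction of label permutations whose LOO error is <= the original's."""
    rng = np.random.default_rng(seed)
    observed = loo_error(X, y)
    hits = sum(loo_error(X, rng.permutation(y)) <= observed for _ in range(n_perm))
    return hits / n_perm

rng = np.random.default_rng(7)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 5))
X[y == 1] += 2.0          # real class signal, so few permutations should do as well
p = permutation_pvalue(X, y)
```

A small p-value means the observed LOOCV error is rarely matched by chance labelings, i.e. the classification result is unlikely to be an artifact of the small sample.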

Application: biomarker detection with mass spectrometric data of mixed quality (E. Marchiori et al., IEEE CIBCB, 385-391, 2005). MALDI-TOF data; samples are of mixed quality due to different storage times; controlled molecule spiking was used to generate the two classes.

Profiles of one spiked sample. [Figure: spectra of the sample.]

Comparison of ML algorithms (feature selection + classification): RFE+SVM, RFE+kNN, RELIEF+SVM, RELIEF+kNN.

LOOCV results: the misclassified samples are of bad quality (higher storage time), and the selected features do not always correspond to the m/z values of the spiked molecules.

LOOCV results:
- The variables selected by RELIEF correspond to the spiked peptides.
- RFE is less robust than RELIEF over the LOOCV runs and also selects "irrelevant" variables.
- RELIEF-based feature selection thus yields more interpretable results than RFE.

BUT... RFE+SVM yields higher LOOCV accuracy than RELIEF+SVM, and RFE+kNN higher accuracy than RELIEF+kNN (perfect LOOCV classification for RFE+1NN). So RFE-based feature selection yields better predictive performance than RELIEF.

Conclusion: better predictive performance does not necessarily correspond to stability and interpretability of the results. Open issues: (ML/BIO) an ad-hoc measure of relevance for the potential biomarkers identified by feature selection algorithms (using domain knowledge)? (ML) Is stability of feature selection algorithms more important than predictive accuracy?