CZ5225: Modeling and Simulation in Biology
Lecture 8: Microarray disease predictor - gene selection by feature selection methods
Prof. Chen Yu Zong
Tel:
Room 07-24, Level 8, S17, National University of Singapore

2 Gene selection?
– All the classification methods we have studied so far use all genes/features.
– Molecular biologists and oncologists seem convinced that only a small subset of genes is responsible for a particular biological property, so they want the genes that are most important in discriminating disease types and treatment outcomes.
– Practical reasons: a clinical device measuring thousands of genes is not financially practical.

3 Disease Example: Childhood Leukemia
– Cancer in the cells of the immune system
– Approx. 35 new cases in Denmark every year
– 50 years ago all patients died; today approx. 78% are cured
– Risk groups: standard, intermediate, high, very high, extra high
– Treatment: chemotherapy, bone marrow transplantation, radiation

4 Risk Classification Today
– Prognostic factors: immunophenotype, age, leukocyte count, number of chromosomes, translocations, treatment response
– Patient data: clinical data, immunophenotyping, morphology, genetic measurements, microarray technology
– Risk group: standard, intermediate, high, very high, extra high

5 Study and Diagnosis of Childhood Leukemia
– Diagnostic bone marrow samples from leukemia patients
– Platform: Affymetrix Focus Array (8793 human genes)
– Immunophenotype: 18 patients with precursor B immunophenotype, 17 patients with T immunophenotype
– Outcome 5 years from diagnosis: 11 patients with relapse, 18 patients in complete remission

6 Problem: Too much data!
[Table: gene expression values for Affymetrix probe sets (rows, e.g. ..._at, ..._s_at, ..._x_at) across patients Pat1, Pat2, Pat3, ...; the numeric values are not reproduced in this transcript.]

7 So, what do we do?
– Reduction of dimensions: Principal Component Analysis (PCA)
– Feature selection (gene selection): significant genes (t-test), selection of a limited number of genes

8 Principal Component Analysis (PCA)
– Used for visualization of complex data
– Developed to capture as much of the variation in the data as possible
– Generic features of principal components: they are summary variables, linear combinations of the original variables, uncorrelated with each other, and capture as much of the original variance as possible

9 Principal components
1. First principal component (PC1): the direction along which there is the greatest variation
2. Second principal component (PC2): the direction with the maximum variation left in the data, orthogonal to the direction (i.e. vector) of PC1
3. Third principal component (PC3): the direction with the maximal variation left in the data, orthogonal to the plane of PC1 and PC2 (less frequently used)
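A minimal sketch of how such a projection might be computed, assuming an expression matrix with one row per patient and one column per gene; the data here is random and all variable names are illustrative, not taken from the lecture:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for the leukemia expression matrix (patients x genes)
rng = np.random.default_rng(0)
X = rng.normal(size=(34, 8793))

# Standardize each gene so that no single gene dominates the variance
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Project onto the first two principal components for visualization
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)        # shape (34, 2): PC1 and PC2 per patient
print(pca.explained_variance_ratio_)     # fraction of variance captured by PC1, PC2
```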

10 Example: 3 dimensions => 2 dimensions

11 PCA - Example

12 PCA on all genes
– Leukemia data, precursor B and T
– Plot of 34 patients, 8793 dimensions (genes) reduced to 2

13 Ranking of PCs and Gene Selection

14 The t-test method
– Compares the means (μ1 and μ2) of two data sets and tells us whether they can be assumed to be equal
– Can be used to identify significant genes, i.e. those whose expression changes a lot between the two classes
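For reference, one standard form of the two-sample t-statistic used to score a gene (the slide's symbols are not reproduced in the transcript, so take this as the usual textbook definition rather than the lecturer's exact notation):

$$ t \;=\; \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$

where $\bar{x}_g$, $s_g^2$ and $n_g$ are the mean, variance and number of samples of group $g$ (e.g. precursor B vs T patients) for that gene; genes with large $|t|$ are the most discriminative.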

15 PCA on the 100 top significant genes based on the t-test
– Plot of 34 patients, 100 dimensions (genes) reduced to 2

16 The next question: Can we classify new patients?
– Plot of 34 patients, 100 dimensions (genes) reduced to 2, with a new patient (P99) of unknown class marked on the plot
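To make the question concrete, a minimal sketch (continuing the PCA sketch above, with placeholder labels and a random stand-in for the new patient's profile; none of these values come from the lecture) of classifying a new sample in the reduced PC space:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# `rng`, `scaler`, `pca` and `scores` are reused from the PCA sketch above;
# the labels below are placeholders, not the real diagnoses.
labels = np.array(["B"] * 17 + ["T"] * 17)

clf = KNeighborsClassifier(n_neighbors=3).fit(scores, labels)

# A new patient's profile is standardized and projected with the scaler/PCA
# fitted on the training patients, then classified in PC space.
x_new = rng.normal(size=(1, 8793))               # placeholder for patient P99
x_new_pc = pca.transform(scaler.transform(x_new))
print(clf.predict(x_new_pc))
```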

17 Feature Selection: Problem Statement
– A process of selecting a minimum subset of features that is sufficient to construct a hypothesis consistent with the training examples (Almuallim and Dietterich, 1991)
– Selecting a minimum subset G such that P(C|G) is equal, or as close as possible, to P(C|F) (Koller and Sahami, 1996)

18 Feature Selection Strategies
Wrapper methods
– Relying on a predetermined classification algorithm
– Using predictive accuracy as the goodness measure
– High accuracy, but computationally expensive
Filter methods
– Separating feature selection from classifier learning
– Relying on general characteristics of the data (distance, correlation, consistency)
– No bias towards any learning algorithm; fast
Embedded methods
– Jointly or simultaneously train both a classifier and a feature subset by optimizing an objective function that rewards classification accuracy and penalizes the use of more features

19 Feature Selection Strategies: Filter Methods
– Features (genes) are scored according to the evidence of predictive power and then ranked; the top s genes with the highest scores are selected and used by the classifier.
– Scores: t-statistics, F-statistics, signal-to-noise ratio, …
– The number of features selected, s, is determined by cross-validation.
– Advantage: fast and easy to interpret.
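A sketch of the filter strategy under these assumptions: random stand-in data, the per-gene F-statistic (which equals the squared t-statistic for two classes) as the score, and the selection step kept inside the cross-validation loop so that s is chosen without peeking at held-out folds. All names and numbers are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Illustrative data: 34 patients x 8793 genes, binary class labels
rng = np.random.default_rng(1)
X = rng.normal(size=(34, 8793))
y = rng.integers(0, 2, size=34)

# Filter: score each gene independently, keep the s best, then train a
# classifier on the reduced matrix; s is chosen by cross-validation.
for s in (10, 50, 100, 500):
    model = make_pipeline(SelectKBest(f_classif, k=s), LinearSVC())
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"s = {s:4d}  CV accuracy = {acc:.2f}")
```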

20 Feature Selection Strategies: Problems of Filter Methods
– Genes are considered independently.
– Redundant genes may be included.
– Genes that have strong discriminant power jointly but weak contributions individually will be ignored.
– The filtering procedure is independent of the classification method.

21 Feature Selection
– Step-wise variable selection: n* < N effective variables modeling the classification function
[Diagram: N features evaluated over N steps, from Step 1 (one feature) to Step N (N features).]

22 Feature Selection
– Step-wise selection of the features
[Diagram: at each step, features are split into ranked (retained) features and discarded features.]

23 Feature Selection Strategies: Wrapper Methods
– Iterative search: many "feature subsets" are scored based on classification performance and the best is used.
– Subset selection: forward selection, backward selection, and their combinations.
– The problem is very similar to variable selection in regression.
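A sketch of greedy forward selection wrapped around a linear SVM; the data is random and the helper name and parameters are made up for this example:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def forward_select(X, y, n_keep=5, cv=5):
    """Greedy forward selection: repeatedly add the single gene that most
    improves the cross-validated accuracy of the wrapped classifier."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_keep):
        scores = []
        for j in remaining:
            cols = selected + [j]
            acc = cross_val_score(LinearSVC(), X[:, cols], y, cv=cv).mean()
            scores.append((acc, j))
        _, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Illustrative data with only 50 genes; with thousands of genes this loop
# becomes very costly, which is exactly the drawback discussed below.
rng = np.random.default_rng(2)
X = rng.normal(size=(34, 50))
y = rng.integers(0, 2, size=34)
print(forward_select(X, y, n_keep=3))
```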

24 Feature Selection Strategies: Wrapper Methods
– Analogous to variable selection in regression.
– Exhaustive search is not feasible, so greedy algorithms are used instead.
– Confounding can happen in both scenarios. In regression it is usually recommended not to include highly correlated covariates in the analysis to avoid confounding, but it is impossible to avoid confounding in feature selection for microarray classification.

25 Feature Selection Strategies: Problems of Wrapper Methods
– Computationally expensive: for each feature subset considered, the classifier is built and evaluated.
– Exhaustive search is impossible; only greedy search is practical.
– Easy to overfit.

26 Feature Selection Strategies: Embedded Methods
– Attempt to jointly or simultaneously train both a classifier and a feature subset.
– Often optimize an objective function that rewards classification accuracy and penalizes the use of more features.
– Intuitively appealing.
– Examples: nearest shrunken centroids, CART and other tree-based algorithms.
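As an illustration of the embedded idea, a sketch using a tree ensemble (a random forest, chosen here as one representative of the tree-based algorithms mentioned above) whose feature importances fall out of the training procedure itself; data and parameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative data: tree-based methods select features as a by-product of
# training, so the importance scores come from the fitted model itself.
rng = np.random.default_rng(3)
X = rng.normal(size=(34, 200))
y = rng.integers(0, 2, size=34)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top_genes = np.argsort(forest.feature_importances_)[::-1][:10]
print(top_genes)   # indices of the 10 genes the ensemble relied on most
```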

27 Feature Selection Strategies: Example of a Wrapper Method
Recursive Feature Elimination (RFE):
1. Train the classifier (SVM, or LDA).
2. Compute the ranking criterion for all features.
3. Remove the feature with the smallest ranking criterion.
4. Repeat steps 1-3.
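A sketch of these steps with a linear SVM, where the ranking criterion is the magnitude of each feature's weight in the trained hyperplane; the data and parameter values are illustrative:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Illustrative data; a linear SVC exposes the weight vector w, and RFE
# repeatedly retrains and drops the feature with the smallest |w_j|.
rng = np.random.default_rng(4)
X = rng.normal(size=(34, 500))
y = rng.integers(0, 2, size=34)

selector = RFE(SVC(kernel="linear"), n_features_to_select=20, step=1).fit(X, y)
print(np.flatnonzero(selector.support_))   # indices of the 20 retained genes
```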

28 Feature Ranking
– Weight and rank individual features; select the top-ranked ones.
– Advantages: efficient (O(N) in the dimensionality N), easy to implement.
– Disadvantages: hard to determine the threshold; unable to consider correlation between features.

29 Leave-one-out method

30 Basic idea
– Use the leave-one-out (LOO) criterion, or an upper bound on it, to select features by searching over all possible subsets of n features for the one that minimizes the criterion.
– When such a search is impossible because there are too many possibilities, scale each feature by a real-valued variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables.
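A sketch of the first part of this idea, using the LOO error itself as the criterion and searching exhaustively over small subsets of a deliberately tiny illustrative gene set (with thousands of genes this is exactly the search that becomes impossible):

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Illustrative data with few genes so exhaustive subset search stays feasible
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 8))
y = rng.integers(0, 2, size=30)

best = None
for subset in combinations(range(X.shape[1]), 3):        # every 3-gene subset
    acc = cross_val_score(SVC(kernel="linear"), X[:, subset], y,
                          cv=LeaveOneOut()).mean()        # LOO accuracy
    if best is None or acc > best[0]:
        best = (acc, subset)
print(best)   # best LOO accuracy and the gene subset achieving it
```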

31 Illustration: rescale features to minimize the LOO bound R²/M²
[Figure: in the original (x1, x2) space the enclosing radius R exceeds the margin M, so R²/M² > 1; after rescaling the features, M = R and R²/M² = 1.]

32 Three upper bounds on LOO
– Radius-margin bound: simple to compute, continuous; very loose, but often tracks LOO well.
– Jaakkola-Haussler bound: somewhat tighter; simple to compute; discontinuous, so it needs to be smoothed; valid only for SVMs with no b term.
– Span bound: tight (as a Britney Spears outfit); complicated to compute; discontinuous, so it needs to be smoothed.

33 Radius margin bound
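The equation on this slide is an image that did not survive the transcript. One commonly cited form of the radius/margin bound, stated here as an assumption about what the slide showed, bounds the leave-one-out error of a hard-margin SVM by

$$ \mathrm{LOO\ error} \;\le\; \frac{1}{\ell}\,\frac{R^2}{M^2}, $$

where $\ell$ is the number of training samples, $R$ is the radius of the smallest sphere enclosing the training points in feature space, and $M$ is the margin (so $R^2/M^2 = R^2\lVert \mathbf{w}\rVert^2$).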

34 Jaakkola-Haussler bound

35 Span bound

36 Classification function with gene selection
– We add a scaling parameter σ to the SVM which scales the genes; genes corresponding to small σj are removed.
– The SVM function then has the form shown below.
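The formula itself is an image in the original slides. In the SVM feature-scaling formulation this slide describes (e.g. Weston et al., NIPS 2000) the decision function can be written as follows; treat the exact notation as a reconstruction rather than the lecturer's own:

$$ f(\mathbf{x}) \;=\; \sum_{i} \alpha_i\, y_i\, K(\boldsymbol{\sigma} * \mathbf{x},\; \boldsymbol{\sigma} * \mathbf{x}_i) \;+\; b, $$

where $\boldsymbol{\sigma} * \mathbf{x}$ denotes element-wise scaling of the expression vector $\mathbf{x}$ by $\boldsymbol{\sigma}$; after training, genes with small $\sigma_j$ are discarded.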