Differential Expression and Tree-based Modeling
Class web site: Statistics for Microarrays

cDNA gene expression data
Data on G genes for n samples. Gene expression level of gene i in mRNA sample j = (normalized) log(Red intensity / Green intensity).
[Slide shows the genes × mRNA samples data matrix, with columns sample1, sample2, sample3, sample4, sample5, …]
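
To make the data layout concrete, here is a minimal R sketch of forming such a matrix. The intensity matrices are simulated stand-ins (not the class data), and the normalization step mentioned on the slide is omitted:

```r
## Minimal sketch: forming the expression matrix of log ratios.
## R.int and G.int are hypothetical G x n matrices of red (Cy5) and
## green (Cy3) intensities; simulated here so the snippet is self-contained.
set.seed(1)
R.int <- matrix(rexp(6000 * 5, rate = 1e-3), nrow = 6000)
G.int <- matrix(rexp(6000 * 5, rate = 1e-3), nrow = 6000)

## M[i, j]: log ratio for gene i in sample j (normalization omitted)
M <- log2(R.int / G.int)
dim(M)  # 6000 genes x 5 samples
```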

Identifying Differentially Expressed Genes
Goal: Identify genes associated with a covariate or response of interest. Examples:
– Qualitative covariates or factors: treatment, cell type, tumor class
– Quantitative covariates: dose, time
– Responses: survival, cholesterol level
– Any combination of these!

Differentially Expressed Genes
Simultaneously test m null hypotheses, one for each gene j:
H_j: no association between the expression level of gene j and the covariate/response.
– Combine expression data from the different slides and estimate the effects of interest
– Compute a test statistic T_j for each gene j
– Adjust for multiple hypothesis testing

Test statistics
– Qualitative covariates: e.g. two-sample t-statistic, Mann-Whitney statistic, F-statistic
– Quantitative covariates: e.g. standardized regression coefficient
– Survival response: e.g. score statistic for the Cox model

QQ-Plot
Used to assess whether a sample follows a particular (e.g. normal) distribution, or to compare two samples; also a method for spotting outliers when the data are mostly normal. Recall that for the normal distribution, approximately:
– 68% of values lie within 1 SD of the mean
– 95% within 2 SDs
– 99.7% within 3 SDs
[Figure: a normal QQ-plot; each sample quantile is plotted against the value from the normal distribution that yields the same quantile (e.g. −1.15).]
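
A minimal R sketch of the plot described above, using simulated statistics rather than the class data:

```r
## Minimal sketch: normal QQ-plot for a vector of statistics.
set.seed(2)
x <- c(rnorm(5950), rnorm(50, mean = 4))  # mostly null, a few shifted values

qqnorm(x)   # sample quantiles vs theoretical normal quantiles
qqline(x)   # reference line; points far off it are candidate outliers
```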

Typical Deviations from Straight Line Patterns
– Outliers
– Curvature at both ends (long or short tails)
– Convex/concave curvature (asymmetry)
– Horizontal segments, plateaus, gaps

Outliers

Long Tails

Short Tails

Asymmetry

Plateaus/Gaps

Example: Apo AI experiment (Callow et al., Genome Research, 2000)
GOAL: Identify genes with altered expression in the livers of one line of mice with very low HDL cholesterol levels compared to inbred control mice.
Experiment:
– Apo AI knock-out mouse model
– 8 knockout (ko) mice and 8 control (ctl) mice (C57Bl/6)
– 16 hybridisations: mRNA from each of the 16 mice is labelled with Cy5; pooled mRNA from control mice is labelled with Cy3
– Probes: ~6,000 cDNAs, including 200 related to lipid metabolism

Which genes have changed?
This method can be used with replicated data:
1. For each gene and each hybridisation (8 ko + 8 ctl), use M = log2(R/G).
2. For each gene, form the t-statistic:
   t = (average of the 8 ko Ms − average of the 8 ctl Ms) / sqrt((1/8)(SD of the 8 ko Ms)² + (1/8)(SD of the 8 ctl Ms)²)
3. Form a histogram of the 6,000 t values.
4. Make a normal Q-Q plot; look for values “off the line”.
5. Adjust for multiple testing.
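
A minimal R sketch of steps 1–4, using a simulated matrix M of log2(R/G) values in place of the Apo AI data:

```r
## Simulated 6000 x 16 matrix M of log2(R/G) values:
## columns 1-8 are knockout (ko), columns 9-16 are control (ctl).
set.seed(3)
M   <- matrix(rnorm(6000 * 16), nrow = 6000)
ko  <- M[, 1:8]
ctl <- M[, 9:16]

## step 2: two-sample t-statistic per gene (SD^2 = var)
t.stat <- (rowMeans(ko) - rowMeans(ctl)) /
  sqrt(apply(ko, 1, var) / 8 + apply(ctl, 1, var) / 8)

hist(t.stat, breaks = 50)        # step 3: histogram of ~6,000 t values
qqnorm(t.stat); qqline(t.stat)   # step 4: look for values off the line
```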

Histogram & Q-Q plot ApoA1

Plots of t-statistics

Assigning p-values to measures of change
Estimate a p-value for each comparison (gene) using the permutation distribution of the t-statistics: for each possible permutation of the trt/ctl labels, compute the two-sample t-statistic t* for each gene. The unadjusted p-value for a particular gene is estimated by the proportion of t*'s greater in absolute value than the observed t.
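
A minimal R sketch of the permutation procedure, reusing the simulated M and ko/ctl layout from the previous sketch; there are choose(16, 8) = 12,870 label permutations, and for brevity we sample B of them rather than enumerating all:

```r
set.seed(4)
B      <- 1000
labels <- rep(c("ko", "ctl"), each = 8)

t.fun <- function(m, lab) {
  a <- m[, lab == "ko"]; b <- m[, lab == "ctl"]
  (rowMeans(a) - rowMeans(b)) /
    sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
}

t.obs  <- t.fun(M, labels)
t.star <- replicate(B, t.fun(M, sample(labels)))  # 6000 x B matrix of t*'s

## unadjusted p-value: proportion of |t*| at least as large as |t.obs|
p.unadj <- rowMeans(abs(t.star) >= abs(t.obs))
```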

Multiple Testing

            # not rejected   # rejected    totals
# true H    U                V (false +)   m0
# false H   T (false −)      S             m1
totals      m − R            R             m

– Per-comparison error rate = E(V)/m
– Family-wise error rate = P(V ≥ 1)
– Per-family error rate = E(V)
– False discovery rate = E(V/R)
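
As a simple stand-in for the adjustment step (not necessarily the procedure used for the Apo AI results), R's built-in p.adjust offers standard corrections targeting these error rates, applied here to the unadjusted permutation p-values from the previous sketch:

```r
p.bonf <- p.adjust(p.unadj, method = "bonferroni")  # controls the FWER
p.holm <- p.adjust(p.unadj, method = "holm")        # FWER, less conservative
p.bh   <- p.adjust(p.unadj, method = "BH")          # controls the FDR, E(V/R)

sum(p.holm <= 0.01)  # number of genes called at adjusted p <= 0.01
```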

Apo AI: Adjusted and unadjusted p-values for the 50 genes with the largest absolute t-statistics

Genes with adjusted p-value ≤ 0.01

Single-slide methods
– Model-dependent rules for deciding whether (R, G) corresponds to a differentially expressed gene
– Amounts to drawing two curves in the (R, G)-plane and calling a gene differentially expressed if it falls outside the region between the two curves
– At this time, not enough is known about the systematic and random variation within a microarray experiment to justify these strong modeling assumptions
– n = 1 slide may not be enough!

Single-slide methods
– Chen et al.: each (R, G) is assumed normally and independently distributed with constant CV; decision based on R/G only (purple)
– Newton et al.: Gamma-Gamma-Bernoulli hierarchical model for each (R, G) (yellow)
– Roberts et al.: each (R, G) is assumed normally and independently distributed with variance depending linearly on the mean
– Sapir & Churchill: each log R/G is assumed distributed according to a mixture of normal and uniform distributions; decision based on R/G only (turquoise)

Matt Callow’s Srb1 dataset (#8): Newton’s, Sapir & Churchill’s, and Chen’s single-slide methods. Note the difficulty in assigning valid p-values based on a single slide.

Another example: survival analysis with expression data
Bittner et al. looked at differences in survival between the two groups (the ‘cluster’ and the ‘unclustered’ samples); the ‘cluster’ group seemed to have longer survival.

Kaplan-Meier Survival Curves, Bittner et al.

[Figure: average linkage hierarchical clustering, survival samples only; branches labeled ‘unclustered’ and ‘cluster’.]

Kaplan-Meier Survival Curves, reduced grouping

Identification of genes associated with survival
For each gene j, j = 1, …, 3613, model the instantaneous failure rate, or hazard function, h(t) with the Cox proportional hazards model
h(t) = h_0(t) · exp(β_j x_ij),
and look for genes with both:
– a large estimated effect size β_j
– a large standardized effect size β_j / SE(β_j)
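
A minimal R sketch of fitting one Cox model per gene with the survival package; the expression matrix and survival outcomes here are simulated placeholders, not the Bittner et al. data:

```r
library(survival)
set.seed(5)
expr   <- matrix(rnorm(100 * 38), nrow = 100)  # 100 genes x 38 samples
time   <- rexp(38)                             # follow-up times
status <- rbinom(38, 1, 0.7)                   # 1 = event, 0 = censored

fit.one <- function(x) {
  s <- summary(coxph(Surv(time, status) ~ x))$coefficients
  c(beta = s[1, "coef"], z = s[1, "z"])  # effect size and beta / SE(beta)
}

res <- t(apply(expr, 1, fit.one))
head(res[order(-abs(res[, "z"])), ])  # genes ranked by standardized effect
```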

Findings
The top 5 genes by this method are not in the Bittner et al. ‘weighted gene list’. Why?
– The weighted gene list is based on the entire sample; our method used only half
– The weighting relies on the Bittner et al. cluster assignment
– Other possibilities?

Statistical Significance of Cox Model Coefficients

Limitations of Single-Gene Tests
– May be too noisy in general to show much
– Do not reveal coordinated effects of positively correlated genes
– Hard to relate to pathways

Some ideas for further work
– Expand models to include more genes and possibly two-way interactions
– Nonparametric tree-based subset selection – would require much larger sample sizes

(BREAK)

Trees
– Provide a means to express knowledge
– Can aid in decision making
– Can be portrayed graphically or by means of a chart or ‘key’, e.g. (MASS space shuttle data):

stability  error  sign  wind  magnitude     visibility  DECISION
any                                         no          auto
xstab      any                              yes         noauto
stab       LX     any                       yes         noauto
stab       XL     any                       yes         noauto
stab       MM     nn    tail  any           yes         noauto
any                           Out of range  yes         noauto
etc.

Tree-based Methods – References
– Hastie, Tibshirani and Friedman (2001), The Elements of Statistical Learning
– Venables and Ripley (1999), Modern Applied Statistics with S-PLUS (MASS)
– Ripley (1996), Pattern Recognition and Neural Networks
– Breiman, Friedman, Olshen and Stone (1984), Classification and Regression Trees

Tree-based Methods
– Automatic construction of decision trees dates from social science work in the early 1960s (AID)
– Breiman et al. (1984) proposed new algorithms for tree construction (CART)
– Tree construction can be seen as a type of variable selection

Response types
– Categorical outcome: classification tree
– Continuous outcome: regression tree
– Survival outcome: survival tree
Software: available R packages include tree and rpart (tssa is available in S).
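
A minimal R sketch with the rpart package mentioned above, using the built-in iris data as a stand-in for expression data:

```r
library(rpart)

## a classification tree: categorical outcome, method = "class"
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)             # the tree as a textual 'key'
plot(fit); text(fit)   # graphical portrayal
```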

Trees Partition the Feature Space
– The end point of a tree is a (labeled) partition of the (feature) space X of possible observations
– Tree-based methods partition X into rectangular regions, trying to make the (average) responses in each box as different as possible
– In logical problems it is assumed that there exists a partition of X that will correctly classify all observations; the task is to find a tree that succinctly describes this partition

Partitions and CART
[Figure: a general partition of the (X1, X2) feature space contrasted with the rectangular partition produced by CART, with splits at t1–t4 defining regions R1–R5.]

Partitions and CART
[Figure: the rectangular partition of (X1, X2) into regions R1–R5 and the corresponding binary tree with splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, X2 ≤ t4.]

Tree Comparison
– Measure how well the partition created by a tree corresponds to the correct decision rule (classification)
– For a logical problem, count the number of errors
– For a statistical problem, the class distributions usually overlap, so no partition unambiguously describes the classes – instead, estimate the misclassification probability

Three Aspects of Tree Construction
1. Split selection rule
2. Split-stopping rule
3. Assignment of predicted values

Split Selection
– Binary splits
– Look only one step ahead – avoids massive computation time by not attempting to optimize whole-tree performance
– Choose an impurity measure to optimize each split: the Gini index or entropy, rather than the misclassification rate, for a classification tree; deviance (squared error) for a regression tree
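
A minimal R sketch of the impurity measures named above, for a node with class proportions p:

```r
## node impurity measures for a vector p of class proportions (sums to 1)
gini     <- function(p) 1 - sum(p^2)
entropy  <- function(p) -sum(p[p > 0] * log2(p[p > 0]))
misclass <- function(p) 1 - max(p)

p <- c(0.7, 0.2, 0.1)
c(gini = gini(p), entropy = entropy(p), misclass = misclass(p))
```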

Split-stopping
– Issue: a very large tree will tend to overfit the data (and therefore lack generalizability), while too small a tree might not capture important structure
– Usual solution: grow a large/maximal tree (stop splitting only when some minimum node size, say 5 or 10, is reached), followed by (cost-complexity) pruning

Pruning
– Consider a sequence of rooted subtrees
– Measure R_i (e.g. deviance) at the leaves, with R = Σ R_i
– Minimize the cost-complexity measure R_α = R + α · size
– α governs the tradeoff between tree size and goodness of fit
– Choose α to minimize the cross-validated error (misclassification or deviance)
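
A minimal R sketch of this grow-then-prune recipe with rpart, whose complexity parameter cp plays the role of α (again on iris as stand-in data):

```r
library(rpart)
set.seed(6)

## grow a large/maximal tree by turning off the default stopping rules
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 5, cp = 0))
printcp(fit)  # cp table: tree size vs cross-validated error (xerror)

## choose the cp with the smallest cross-validated error, then prune
best.cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best.cp)
```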

Assignment of Predicted Values
– Assign a value to each leaf (terminal node)
– Classification: (weighted) voting among the observations in the node
– Regression: mean of the observations in the node

Other Issues (I)
– Loss matrix: procedures can be modified for asymmetric losses
– Missing predictor values: can create a ‘missing’ category; surrogate splits exploit correlation between predictors
– Linear combination splits

Other Issues (II)
– Tree instability: small changes in the data can result in very different series of splits, causing difficulties in interpretation; aggregating trees reduces this (e.g. bagging – see the sketch below)
– Lack of smoothness: more of an issue in regression trees; Multivariate Adaptive Regression Splines (MARS) address this
– Difficulty in capturing additive structure with binary trees
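
A minimal R sketch of bagging as mentioned above: grow trees on bootstrap samples and aggregate their predictions by majority vote (iris again stands in for real data):

```r
library(rpart)
set.seed(7)

B <- 25
preds <- replicate(B, {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]   # bootstrap sample
  fit  <- rpart(Species ~ ., data = boot, method = "class")
  as.character(predict(fit, newdata = iris, type = "class"))
})

## majority vote across the B trees for each observation
vote <- apply(preds, 1, function(x) names(which.max(table(x))))
mean(vote == iris$Species)  # resubstitution accuracy of the bagged trees
```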

Acknowledgements
Sandrine Dudoit, Jane Fridlyand, Yee Hwa (Jean) Yang, Debashis Ghosh, Erin Conlon, Ingrid Lonnstedt, Terry Speed