Bayesian Modelling for Differential Gene Expression
Alex Lewin (Imperial College), Sylvia Richardson (IC Epidemiology), Tim Aitman (IC Microarray Centre)
In collaboration with Anne-Mette Hein, Natalia Bochkina (IC Epidemiology), Helen Causton (IC Microarray Centre), Peter Green (Bristol)

Insulin-resistance gene Cd36
cDNA microarray: hybridisation signal for SHR much lower than for the Brown Norway and SHR.4 control strains.
Aitman et al. 1999, Nature Genet 21:76-83

Larger microarray experiment: look for other genes associated with Cd36.
Microarray Data: 3 SHR compared with 3 transgenic rats (with Cd36); 3 wildtype (normal) mice compared with 3 mice with Cd36 knocked out; ~12,000 genes on each array.
Biological Question: find genes which are expressed differently between animals with and without Cd36.

- Bayesian Hierarchical Model for Differential Expression
- Decision Rules
- Predictive Model Checks
- Simultaneous estimation of normalization and differential expression
- Gene Ontology analysis for differentially expressed genes

Microarray analysis is a multi-step process:
- Low-level model (how gene expression is estimated from signal)
- Normalisation (to make arrays comparable)
- Differential expression
- Clustering, partition model
We aim to integrate all the steps in a common statistical framework.

Bayesian Modelling Framework
- Model different sources of variability simultaneously: within array, between array, ...
- Uncertainty is propagated from data to parameter estimates (so conclusions are not over-optimistic).
- Share information in appropriate ways to get robust estimates.

Gene Expression Data
3 wildtype mice, fat tissue, hybridised to Affymetrix chips.
Newton et al. showed these data are fit well by Gamma or log-Normal distributions.
Kerr et al.: linear model on the log scale.
[Figure: sd vs. mean of log expression]

Bayesian hierarchical model for differential expression
Data: $y_{gsr}$ = log expression for gene g, condition s, replicate r.
$\alpha_g$ = gene effect
$\delta_g$ = differential effect for gene g between the 2 conditions
$\beta_{r(g)s}$ = array effect (expression-level dependent)
$\sigma^2_{gs}$ = gene variance
1st level:
$y_{g1r} \mid \alpha_g, \delta_g, \sigma_{g1} \sim N(\alpha_g - \tfrac{1}{2}\delta_g + \beta_{r(g)1},\ \sigma^2_{g1})$
$y_{g2r} \mid \alpha_g, \delta_g, \sigma_{g2} \sim N(\alpha_g + \tfrac{1}{2}\delta_g + \beta_{r(g)2},\ \sigma^2_{g2})$
$\sum_r \beta_{r(g)s} = 0$; $\beta_{r(g)s}$ is a function of $\alpha_g$ and parameters {a} and {b}.

Priors for gene effects
Mean effect $\alpha_g$: $\alpha_g \sim$ Uniform (much wider than the data range).
Differential effect $\delta_g$:
$\delta_g \sim N(0, 10^4)$ - fixed effects (no structure in the prior),
OR mixture: $\delta_g \sim \pi_0\, \delta_0 + \pi_1\, G^-(1.5, \lambda_1) + \pi_2\, G^+(1.5, \lambda_2)$
(the point mass $\delta_0$ is the null $H_0$; explicit modelling of the alternative).
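
A small numpy sketch of drawing differential effects from this three-component mixture; the weights and Gamma rate parameters (pi, lambda_1, lambda_2) below are placeholder values, since in the real model they are unknowns to be estimated.

    import numpy as np

    rng = np.random.default_rng(0)
    n_genes = 10000

    # placeholder mixture weights and Gamma rates (estimated in the real model)
    pi = np.array([0.9, 0.05, 0.05])      # P(null), P(down-regulated), P(up-regulated)
    lam1, lam2 = 2.0, 2.0

    z = rng.choice(3, size=n_genes, p=pi)  # latent allocation: 0 = null, 1 = G^-, 2 = G^+
    delta = np.zeros(n_genes)
    # Gamma(shape=1.5, rate=lam) corresponds to scale=1/lam in numpy
    delta[z == 1] = -rng.gamma(1.5, 1.0 / lam1, size=(z == 1).sum())  # negative component
    delta[z == 2] = rng.gamma(1.5, 1.0 / lam2, size=(z == 2).sum())   # positive component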

References
Fixed effects: Kerr et al.
Mixture models: Newton et al. 2004 (non-parametric mixture); Lönnstedt and Speed 2003, Smyth 2004 (conjugate mixture prior); Broët et al. 2002 (several levels of DE).

Prior for gene variances
Two extreme cases:
(1) Constant variance: $\epsilon_{gsr} \sim N(0, \sigma^2)$ - too stringent, poor fit.
(2) Independent variances: $\epsilon_{gsr} \sim N(0, \sigma_g^2)$ - variance estimates based on few replications are highly variable.
We need to share information between genes to better estimate their variances, while allowing some variability: a hierarchical model.

Prior for gene variances
2nd level: $\sigma^2_{gs} \mid \mu_s, \tau_s \sim \text{logNormal}(\mu_s, \tau_s)$
The hyper-parameters $\mu_s$ and $\tau_s$ can be influential.
Empirical Bayes (e.g. Lönnstedt and Speed 2003, Smyth 2004): fixes $\mu_s$, $\tau_s$.
Fully Bayesian, 3rd level: $\mu_s \sim N(c, d)$, $\tau_s \sim \text{Gamma}(e, f)$.

Gene-specific variances are stabilised
Variances are estimated using information from all G x R measurements (~12,000 x 3) rather than just 3.
Variances are stabilised and shrunk towards the average variance.

Prior for array effects (normalization)
Spline curve: $\beta_{r(g)s}$ is quadratic in $\alpha_g$ for $a_{rs(k-1)} \le \alpha_g \le a_{rs(k)}$, with coefficients $(b_{rsk}^{(1)}, b_{rsk}^{(2)})$, $k = 1, \ldots,$ #breakpoints.
Locations of the break points ($a_0, a_1, a_2, a_3, \ldots$) are not fixed.
Must do sensitivity checks on the number of break points.

[Figure: array effect as a function of gene effect - loess curve vs. Bayesian posterior mean]

Effect of normalisation on density
[Figure: densities before ($y_{gsr}$) and after ($y_{gsr} - \hat{\beta}_{r(g)s}$) normalisation, wildtype and knockout]

Bayesian hierarchical model for differential expression
1st level: $y_{gsr} \mid \alpha_g, \delta_g, \sigma_{gs} \sim N(\alpha_g \pm \tfrac{1}{2}\delta_g + \beta_{r(g)s},\ \sigma^2_{gs})$ (sign according to condition s, as above)
2nd level: fixed-effect priors for $\alpha_g$, $\delta_g$; array effect coefficients, Normal and Uniform; $\sigma^2_{gs} \mid \mu_s, \tau_s \sim \text{logNormal}(\mu_s, \tau_s)$
3rd level: $\mu_s \sim N(c, d)$, $\tau_s \sim \text{Gamma}(e, f)$

WinBUGS software for fitting Bayesian models
Declare the model; WinBUGS does the calculations.

    # 1st level: likelihood for condition 1
    for( i in 1 : ngenes ) {
       for( j in 1 : nreps ) {
          y1[i, j] ~ dnorm(x1[i, j], tau1[i])
          x1[i, j] <- alpha[i] - 0.5*delta[i] + beta1[i, j]
       }
    }
    # hierarchical (logNormal) prior for the gene variances
    for( i in 1 : ngenes ) {
       tau1[i] <- 1.0/sig21[i]
       sig21[i] <- exp(lsig21[i])
       lsig21[i] ~ dnorm(mm1, tt1)
    }
    # hyper-priors
    mm1 ~ dnorm(0.0, 1.0E-3)
    tt1 ~ dgamma(0.01, 0.01)

WinBUGS software for fitting Bayesian models
Output: the whole posterior distribution; posterior means, medians, quantiles.

- Bayesian Hierarchical Model for Differential Expression
- Decision Rules
- Predictive Model Checks
- Simultaneous estimation of normalization and differential expression
- Gene Ontology analysis for differentially expressed genes

Decision Rules for Inference
So far we have discussed fitting the model. How do we decide which genes are differentially expressed?
Parameters of interest: $\alpha_g$, $\delta_g$, $\sigma_g$.
- What quantity do we consider: $\delta_g$, $\delta_g / \sigma_g$, ...?
- How do we summarize the posterior distribution?

Fixed Effects Model: inference on $\delta_g$
(1) $d_g = E(\delta_g \mid \text{data})$: posterior mean, like a point estimate of the log fold change.
Decision rule: gene g is DE if $|d_g| > \delta_{cut}$ (biological interest).
(2) $p_g = P(|\delta_g| > \delta_{cut} \mid \text{data})$: posterior probability, which incorporates uncertainty.
Decision rule: gene g is DE if $p_g > p_{cut}$ (biological interest and statistical confidence).
This allows the biologist to specify what size of effect is interesting (not just statistical significance).
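
As an illustration of how these two rules can be computed from MCMC output, here is a minimal Python sketch; the array delta_samples (posterior draws of delta_g for each gene) and the cut-off values are hypothetical placeholders, not output of the WinBUGS model above.

    import numpy as np

    # Hypothetical MCMC output: n_samples posterior draws of delta_g for each gene
    n_samples, n_genes = 5000, 12000
    rng = np.random.default_rng(0)
    delta_samples = rng.normal(0.0, 0.5, size=(n_samples, n_genes))  # placeholder draws

    delta_cut = np.log(2)   # size of effect considered biologically interesting
    p_cut = 0.8             # required posterior probability

    # (1) posterior mean, like a point estimate of the log fold change
    d_g = delta_samples.mean(axis=0)
    de_by_mean = np.abs(d_g) > delta_cut

    # (2) posterior probability that |delta_g| exceeds delta_cut
    p_g = (np.abs(delta_samples) > delta_cut).mean(axis=0)
    de_by_prob = p_g > p_cut

    print("genes called DE by rule (1):", de_by_mean.sum())
    print("genes called DE by rule (2):", de_by_prob.sum())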

Fixed Effects Model: inference on $\delta_g / \sigma_g$
(1) $t_g = E(\delta_g \mid \text{data}) / E(\sigma_g \mid \text{data})$: like a t-statistic.
Decision rule: gene g is DE if $|t_g| > t_{cut}$.
(2) $p_g = P(|\delta_g / \sigma_g| > t_{cut} \mid \text{data})$ (statistical confidence).
Decision rule: gene g is DE if $p_g > p_{cut}$.
Bochkina and Richardson (in preparation).

Mixture Model
$\delta_g \sim \pi_0\, \delta_0 + \pi_1\, G^-(1.5, \lambda_1) + \pi_2\, G^+(1.5, \lambda_2)$ (the point mass $\delta_0$ is $H_0$; explicit modelling of the alternative).
(1) $d_g = E(\delta_g \mid \text{data})$: posterior mean, a shrunk estimate of the log fold change.
Decision rule: gene g is DE if $|d_g| > \delta_{cut}$.
(2) Classify genes into the mixture components: $p_g = P(\text{gene } g \text{ not in } H_0 \mid \text{data})$.
Decision rule: gene g is DE if $p_g > p_{cut}$.

Illustration of decision rule: $p_g = P(|\delta_g| > \log(2)$ and $\alpha_g > 4 \mid \text{data})$
[Figure legend: x marks $p_g > 0.8$; triangles mark t-statistic > 2.78 (95% CI)]

- Bayesian Hierarchical Model for Differential Expression
- Decision Rules
- Predictive Model Checks
- Simultaneous estimation of normalization and differential expression
- Gene Ontology analysis for differentially expressed genes

Bayesian P-values
Compare observed data to a null distribution.
P-value: the probability of an observation from the null distribution being more extreme than the actual observation.
If all observations come from the null distribution, the distribution of p-values is Uniform.

Cross-validation p-values
The distribution of p-values $\{p_i, i = 1, \ldots, n\}$ is approximately Uniform if the model adequately describes the data ($n$ units of observation).
The idea of cross-validation is to split the data: one part for fitting the model, the rest for validation.
For each observation $y_i$, run the model on the rest of the data $y_{-i}$ and predict new data $y_i^{new}$ from the posterior distribution.
Bayesian p-value: $p_i = \Pr(y_i^{new} > y_i \mid \text{data } y_{-i})$.

Posterior predictive p-values
For large n it is not possible to run the model n times. Instead, run the model on all the data; for each observation $y_i$, predict new data $y_i^{new}$ from the posterior distribution.
Bayesian p-value: $p_i = \Pr(y_i^{new} > y_i \mid \text{all data})$.
Because "all data" includes $y_i$, the p-values are less extreme than they should be: they are conservative (not quite Uniform).

Example: checking the priors on gene variances
1) Compare equal and exchangeable variance models.
2) Compare different exchangeable priors.
We want to compare data for each gene, not each gene and replicate, so we use the sample variance $S_g^2$ (suppressing the index s here).
Bayesian p-value: $\Pr(S_g^{2,new} > S_g^{2,obs} \mid \text{data})$.

WinBUGS code for posterior predictive checks

    for( i in 1 : ngenes ) {
       for( j in 1 : nreps ) {
          y1[i, j] ~ dnorm(x1[i, j], tau1[i])
          # replicate from the relevant sampling distribution
          ynew1[i, j] ~ dnorm(x1[i, j], tau1[i])
          x1[i, j] <- alpha[i] - 0.5*delta[i] + beta1[i, j]
       }
       # calculate observed and predicted sample variances
       s21[i] <- pow(sd(y1[i, ]), 2)
       s2new1[i] <- pow(sd(ynew1[i, ]), 2)
       # count how often the predicted sample variance exceeds the observed one
       pval1[i] <- step(s2new1[i] - s21[i])
    }

Posterior predictive check
[Figure: DAG showing the structure of the model - prior parameters, mean parameters, $\sigma_g^2$, $y_{gr}$ and $y_{gr}^{new}$, $S_g^2$ and $S_g^{2,new}$, for g = 1:G, r = 1:R]

Mixed predictive check
[Figure: DAG as above, but with $\sigma_g^{2,new}$ drawn from the prior before predicting $y_{gr}^{new}$ and $S_g^{2,new}$]
Less conservative than the posterior predictive check (Marshall and Spiegelhalter, 2003).
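
A minimal Python sketch of the mixed predictive p-value, assuming you have saved posterior draws of the hyperparameters mu and tau (the precision of the log-variances) and the observed per-gene sample variances; the variable names and placeholder values below are illustrative, not taken from the WinBUGS output.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical posterior draws of the logNormal hyperparameters
    n_draws = 4000
    mu_draws = rng.normal(-2.0, 0.1, size=n_draws)     # placeholder draws of mu
    tau_draws = rng.gamma(5.0, 0.5, size=n_draws)       # placeholder draws of tau (precision)

    n_genes, n_reps = 1000, 3
    s2_obs = rng.gamma(1.0, 0.1, size=n_genes)          # placeholder observed sample variances

    pvals = np.zeros(n_genes)
    for g in range(n_genes):
        # draw a new gene variance from the prior for each posterior draw of (mu, tau)
        sig2_new = np.exp(rng.normal(mu_draws, 1.0 / np.sqrt(tau_draws)))
        # draw new replicates (the mean cancels in the sample variance, so 0 is used)
        y_new = rng.normal(0.0, np.sqrt(sig2_new)[:, None], size=(n_draws, n_reps))
        s2_new = y_new.var(axis=1, ddof=1)
        pvals[g] = (s2_new > s2_obs[g]).mean()

    # Under an adequate model the histogram of pvals should be close to Uniform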

Four models for gene variances
Equal variance model:
Model 1: $\sigma^2 \sim \text{logNormal}(0, 10000)$
Exchangeable variance models:
Model 2: $\sigma_g^{-2} \sim \text{Gamma}(2, \beta)$
Model 3: $\sigma_g^{-2} \sim \text{Gamma}(\alpha, \beta)$
Model 4: $\sigma_g^{2} \sim \text{logNormal}(\mu, \tau)$
($\alpha$, $\beta$, $\mu$, $\tau$ are all parameters)

Bayesian predictive p-values

- Bayesian Hierarchical Model for Differential Expression
- Decision Rules
- Predictive Model Checks
- Simultaneous estimation of normalization and differential expression
- Gene Ontology analysis for differentially expressed genes

Expression-level-dependent normalization
Many gene expression data sets need normalization which depends on expression level. Usually normalization is performed in a pre-processing step before the model for differential expression is used. Such analyses ignore the fact that the expression level is measured with variability; ignoring this variability leads to bias in the function used for normalization.

Simulated Data
Gene variances: similar range and distribution to the mouse data.
Array effects: cubic functions of expression level.
Differential effects:
900 genes: $\delta_g = 0$
50 genes: $\delta_g \sim N(\log(3), \cdot)$
50 genes: $\delta_g \sim N(-\log(3), \cdot)$
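
A rough Python sketch of a data-generating scheme in this spirit. The spread of the non-null effects, the gene-variance distribution, the expression-level range and the cubic array-effect coefficients are not given on the slide, so the values below are assumptions for illustration only.

    import numpy as np

    rng = np.random.default_rng(2)
    n_null, n_up, n_down, n_reps = 900, 50, 50, 3
    n_genes = n_null + n_up + n_down

    # differential effects: 900 null genes, 50 up- and 50 down-regulated
    # (the spread 0.1 is an assumed value; the slide does not specify it)
    delta = np.concatenate([
        np.zeros(n_null),
        rng.normal(np.log(3), 0.1, size=n_up),
        rng.normal(-np.log(3), 0.1, size=n_down),
    ])

    alpha = rng.uniform(2.0, 12.0, size=n_genes)        # expression levels (assumed range)
    sigma = np.sqrt(rng.gamma(1.5, 0.05, size=n_genes))  # gene sds (assumed distribution)

    def array_effect(a, coefs):
        # cubic array effect as a function of expression level (assumed coefficients)
        return coefs[0] + coefs[1] * a + coefs[2] * a**2 + coefs[3] * a**3

    y = np.zeros((n_genes, 2, n_reps))
    for s, sign in enumerate((-0.5, +0.5)):
        for r in range(n_reps):
            beta = array_effect(alpha, coefs=(0.1 * r, -0.02, 0.003, -0.0001))
            y[:, s, r] = alpha + sign * delta + beta + rng.normal(0, sigma)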

Array effects and variability for simulated data
[Figure: data points are $y_{gsr} - \bar{y}_g$ (r = 1...3); curves are $\beta_{r(g)s}$ (r = 1...3)]

Two-step method (using loess)
1) Use loess smoothing to obtain array effects $\beta^{loess}_{r(g)s}$.
2) Subtract the loess array effects from the data: $y^{loess}_{gsr} = y_{gsr} - \beta^{loess}_{r(g)s}$.
3) Run our model on $y^{loess}_{gsr}$ with no array effects.
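
A minimal sketch of this two-step pre-normalization using the lowess smoother from statsmodels; the arrays y and abar (per-gene mean expression used as the x-variable) are placeholders standing in for the real data.

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(3)
    n_genes, n_reps = 1000, 3
    abar = rng.uniform(2, 12, size=n_genes)                          # placeholder mean expression
    y = abar[:, None] + rng.normal(0, 0.3, size=(n_genes, n_reps))   # placeholder replicate data

    # Steps 1 and 2: fit each array against mean expression, subtract the fitted array effect
    y_loess = np.empty_like(y)
    for r in range(n_reps):
        m = y[:, r] - abar                                  # array-vs-mean differences
        beta_loess = lowess(m, abar, frac=0.3, return_sorted=False)
        y_loess[:, r] = y[:, r] - beta_loess

    # Step 3: run the differential expression model on y_loess with no array effects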

Decision rules for selecting differentially expressed genes
If $P(|\delta_g| > \delta_{cut} \mid \text{data}) > p_{cut}$, then gene g is called differentially expressed.
$\delta_{cut}$ is chosen according to the biological hypothesis of interest (here we use log(3)).
$p_{cut}$ corresponds to the error rate (e.g. False Discovery Rate or mis-classification penalty) considered acceptable.

Full model vs. two-step method
[Figure: observed False Discovery Rate plotted against $p_{cut}$, averaged over 5 simulations; solid line for the full model, dashed line for the pre-normalized method]

Different two-step methods
1) $y^{loess}_{gsr} = y_{gsr} - \beta^{loess}_{r(g)s}$
2) $y^{model}_{gsr} = y_{gsr} - E(\beta_{r(g)s} \mid \text{data})$
Results from the two different two-step methods are much closer to each other than to the full model results.

- Bayesian Hierarchical Model for Differential Expression
- Decision Rules
- Predictive Model Checks
- Simultaneous estimation of normalization and differential expression
- Gene Ontology analysis for differentially expressed genes

Gene Ontology (GO)
Database of biological terms, arranged in a graph connecting related terms.
Directed Acyclic Graph: links indicate more specific terms.
~16,000 terms (from the QuickGO website, EBI).

Gene Ontology (GO)
[Figure: example GO graph, from the QuickGO website (EBI)]

Gene Annotations
Genes/proteins are annotated to relevant GO terms.
A gene may be annotated to several GO terms; a GO term may have thousands of genes annotated to it (or none).
A gene annotated to term A is annotated to all ancestors of A (terms that are related and more general).
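
The ancestor-propagation rule can be made concrete with a toy sketch: given each GO term's parents, a gene annotated to a term is counted as annotated to every ancestor as well. The tiny DAG below is invented purely for illustration.

    # Toy GO fragment: each term maps to its parent terms (invented for illustration)
    parents = {
        "inflammatory response": ["response to wounding", "defense response"],
        "response to wounding": ["response to stress"],
        "defense response": ["response to stimulus"],
        "response to stress": ["response to stimulus"],
        "response to stimulus": ["biological process"],
        "biological process": [],
    }

    def ancestors(term):
        # all ancestors of a term (more general, related terms)
        out = set()
        stack = list(parents.get(term, []))
        while stack:
            t = stack.pop()
            if t not in out:
                out.add(t)
                stack.extend(parents.get(t, []))
        return out

    # A gene annotated to "inflammatory response" is annotated to all its ancestors too
    direct = {"inflammatory response"}
    effective_annotation = direct | set().union(*(ancestors(t) for t in direct))
    print(sorted(effective_annotation))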

GO annotations of genes associated with the insulin-resistance gene Cd36
Compare the GO annotations of the genes most and least differentially expressed.
Most differentially expressed: $p_g > 0.5$ (280 genes).
Least differentially expressed: $p_g < 0.2$ (11,171 genes).

GO annotations of genes associated with the insulin-resistance gene Cd36
For each GO term, Fisher's exact test on the proportion of differentially expressed genes with the annotation vs. the proportion of non-differentially expressed genes with the annotation (FatiGO website):

                               annotated to GO term   not annotated to GO term
    genes most diff. exp.               A                        B
    genes least diff. exp.              C                        D

Observed O = A; expected E = C*(A+B)/(C+D) if there is no association of GO annotation with DE.
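
For a single GO term, the 2x2 test above can be reproduced with scipy; the counts below are made up for illustration (only the group totals, 280 and 11,171 genes, come from the slides).

    from scipy.stats import fisher_exact

    # Hypothetical counts for one GO term
    A, B = 12, 268        # most differentially expressed genes: annotated / not annotated
    C, D = 150, 11021     # least differentially expressed genes: annotated / not annotated

    odds_ratio, p_value = fisher_exact([[A, B], [C, D]], alternative="greater")

    O = A                        # observed DE genes annotated to the term
    E = C * (A + B) / (C + D)    # expected under no association (as on the slide)
    print(f"O={O}, E={E:.1f}, odds ratio={odds_ratio:.2f}, p={p_value:.3g}")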

GO annotations of genes associated with the insulin-resistance gene Cd36
[Table of GO terms with O = observed number of differentially expressed genes and E = expected number of differentially expressed genes]

All GO ancestors of Inflammatory response (O=4, E=1.2)
[Figure: GO graph of ancestor terms with observed and expected counts]
Biological process; Physiological process; Organismal movement; Response to stimulus; Response to external stimulus (O=12, E=4.7); Response to biotic stimulus (O=14, E=6.9); Response to external biotic stimulus*; Response to stress (O=12, E=5.9); Response to wounding (O=6, E=1.8); Defense response (O=11, E=5.8); Immune response (O=9, E=4.5); Response to pest, pathogen or parasite (O=8, E=2.6).
* This term was not accessed by FatiGO.
Relations between GO terms were found using QuickGO.

Further work to do on GO
- Account for dependencies between GO terms
- Multiple testing corrections
- Uncertainty in annotation
(work in preparation)

Summary
- The Bayesian hierarchical model is flexible and estimates variances robustly.
- Predictive model checks show the exchangeable prior is good for gene variances.
- Useful to find GO terms over-represented among the most differentially expressed genes.
Paper available (Lewin et al. 2005, Biometrics, in press).

Decision Rules (mixture model)
In the full Bayesian framework, introduce a latent allocation variable $z_g = 0, 1$ for gene g in the null or alternative.
For each gene, calculate the posterior probability of belonging to the unmodified component: $p_g = \Pr(z_g = 0 \mid \text{data})$.
Classify using a cut-off on $p_g$ (the Bayes rule corresponds to 0.5).
For any given cut-off on $p_g$, we can estimate FDR and FNR: for gene-list S, est.(FDR | data) $= \sum_{g \in S} p_g / |S|$.
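
A small sketch of the estimated FDR calculation for a gene list chosen by thresholding these posterior probabilities; p_g here stands for Pr(z_g = 0 | data) and is a placeholder array rather than real model output, and the FNR line is one natural analogue of the FDR formula on the slide.

    import numpy as np

    rng = np.random.default_rng(4)
    p_g = rng.beta(0.5, 2.0, size=12000)     # placeholder posterior probabilities Pr(z_g = 0 | data)

    p_cut = 0.5                               # Bayes rule classification cut-off
    gene_list = np.flatnonzero(p_g < p_cut)   # genes classified as differentially expressed

    # Estimated FDR for the selected list: average posterior probability of being null
    est_fdr = p_g[gene_list].mean() if gene_list.size else 0.0
    # Estimated FNR among unselected genes: average probability of being non-null
    not_selected = np.flatnonzero(p_g >= p_cut)
    est_fnr = (1 - p_g[not_selected]).mean() if not_selected.size else 0.0

    print(f"selected {gene_list.size} genes, est. FDR = {est_fdr:.3f}, est. FNR = {est_fnr:.3f}")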

The Null Hypothesis
- Composite null
- Point null, alternative not modelled
- Point null, alternative modelled