
1 BGX 1 Sylvia Richardson Centre for Biostatistics Imperial College, London Statistical Analysis of Gene Expression Data In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Marys) Tim Aitman (Hammersmith) Peter Green (Bristol) BBSRC Biological Atlas of Insulin Resistance

2 BGX 2 Statistical modelling and biology Extracting the message from microarray data needs statistical as well as biological understanding Statistical modelling – in contrast to data analysis – gives a framework for formally organising assumptions about signal and noise Our models are structured, reflecting data generation process: –Bayesian hierarchical modelling approach –Inference based on posterior distribution of quantities of interest

3 BGX 3 What are gene expression data? DNA microarrays are used to measure the relative abundance of mRNA, providing information on gene expression in a particular cell type, under specific conditions. Gene expression data (e.g. Affymetrix) result from the scanning of arrays where hybridisation between a sample and a large number of probes has taken place: –gene expression measure for each gene. The expression levels of tens of thousands of probes are measured on a single microarray: –gene expression profile. Typically, gene expression profiles are obtained for several samples, in a single or related experiments: –gene expression data matrix

4 BGX 4 Common characteristics of data sets in transcriptomics: High dimensional data (tens of thousands of genes) and few samples. Many sources of variability (low signal/noise ratio): condition/treatment; biological; array manufacture; imaging; technical; within/between array variation; gene specific variability of the probes for a gene (e.g. for Affymetrix)

5 BGX 5 Analysing gene expression data. Gene expression data can be used in several types of analysis: -- Comparison of gene expression under different experimental conditions, or in different tissues -- Building a predictive model for classification or prognosis based on gene expression measurements -- Exploration of patterns in gene expression matrices. [Figure: gene expression data matrix, genes (20000) by samples, entries are gene expression levels]

6 BGX 6 Common statistical issues: Pre-processing and data reduction –account for the uncertainty of the signal? –making arrays comparable: normalisation. Realistic assessment of uncertainty. Multiplicity: control of error rates. Need to borrow information. Importance of including prior biological knowledge. Illustrate how structured statistical modelling can help to tease out signal from noise and strengthen inference in the context of differential expression studies

7 BGX 7 Outline Background Modelling uncertainty in the signal Bayesian hierarchical models for differential expression experiments –posterior predictive checks –use of posterior distribution of parameters of interest to select genes of interest Further structure: mixture models

8 BGX 8 I – Modelling uncertainty in the signal: A fully Bayesian Gene expression index for Affymetrix Gene Chip arrays (Anne Mette Hein). Data: Affymetrix chip: - Each gene g is represented by a probe set, consisting of a number of probe pairs (reporters) j: Perfect match (PM) and Mismatch (MM). Aim: Formulate a model to combine PM and MM values into a new expression value for the gene -- BGX: - Base the model on biological assumptions - Combine good features of Li and Wong (dChip) and RMA (Robust Multi-array Average, Irizarry et al). Use a flexible Bayesian framework that allows us to obtain a measure of uncertainty of the expression and to integrate further components of the experimental design

9 BGX 9 Single array model: Motivation
-- Key observation: PMs and MMs both increase with spike-in concentration (MMs slower than PMs). Conclusion: MMs bind fraction of signal
-- Key observation: Spread of PMs increases with level. Conclusion: Multiplicative (and additive) error; transformation needed
-- Key observation: Considerable variability in PM (and MM) response within a probe set. Conclusion: Varying reliability in gene expression estimation for different genes
-- Key observation: Probe effects approximately additive on log-scale. Conclusion: Estimate gene expression measure from PMs and MMs on log scale

10 BGX 10 BGX single array model. Gene and probe specific S and H (g: 1,…,1000s, j: 1,…,tens):
$PM_{gj} \sim N(S_{gj} + H_{gj}, \tau^2)$ and $MM_{gj} \sim N(\Phi S_{gj} + H_{gj}, \tau^2)$, where $\tau^2$ is the additive background noise and $\Phi$ is the fraction of the signal bound by the MM probe.
Expression measure for gene g is built from: $\log(S_{gj} + 1) \sim TN(\mu_g, \sigma_g^2)$, j=1,…,J (20) (BGX expression measure).
Shrinkage (exchangeability): $\log(\sigma_g^2) \sim N(a, b^2)$, with a and b set by empirical Bayes.
Non-specific hybridisation (array wide distribution): $\log(H_{gj} + 1) \sim TN(\lambda, \eta^2)$, j=1,…,J (20), g=1,…,G.
Remaining priors: vague.
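To make the generative structure of the single array model concrete, here is a minimal simulation sketch for one probe set. It assumes that TN denotes a normal distribution truncated at zero (so that S and H are non-negative), and every parameter value below is illustrative rather than taken from the talk. The function names `truncated_normal` and `simulate_probe_set` are hypothetical.

```python
import math
import random


def truncated_normal(rng, mean, sd, lower=0.0):
    """Draw from N(mean, sd^2) restricted to [lower, inf) by rejection sampling.

    Rejection is efficient here because the illustrative means sit well above 0.
    """
    while True:
        x = rng.gauss(mean, sd)
        if x >= lower:
            return x


def simulate_probe_set(rng, mu_g, sigma_g, lam, eta, phi, tau, n_probes=20):
    """Simulate (PM, MM) pairs for one gene under the single array model."""
    pairs = []
    for _ in range(n_probes):
        # log(S_gj + 1) ~ TN(mu_g, sigma_g^2): specific hybridisation signal
        s = math.exp(truncated_normal(rng, mu_g, sigma_g)) - 1.0
        # log(H_gj + 1) ~ TN(lambda, eta^2): non-specific hybridisation
        h = math.exp(truncated_normal(rng, lam, eta)) - 1.0
        # PM sees the full signal, MM only the fraction phi; both share H
        pm = rng.gauss(s + h, tau)
        mm = rng.gauss(phi * s + h, tau)
        pairs.append((pm, mm))
    return pairs


rng = random.Random(1)
probe_set = simulate_probe_set(rng, mu_g=7.0, sigma_g=0.5, lam=4.0, eta=0.5,
                               phi=0.1, tau=5.0)
mean_pm = sum(pm for pm, _ in probe_set) / len(probe_set)
mean_mm = sum(mm for _, mm in probe_set) / len(probe_set)
```

With a small fraction `phi`, the simulated PM values sit well above the MM values for an expressed gene, which is the pattern the model is designed to exploit.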

11 BGX 11 BGX model: inference (Hein et al, Biostatistics, 2005). For each gene g: obtain a distribution for the signal (log scale) $\mu_g$. [Figure: PM and MM data leading to the BGX gene expression distribution] Implemented in WinBUGS and C++ (MCMC). All parameters estimated jointly in full Bayesian framework. Posterior distributions of parameters (and functions) obtained. The single array model can be extended to estimate signal from several biological replicates, as well as differential signal between conditions

12 BGX 12 Single array model: examples of posterior distributions of BGX indices. Each curve represents a gene. Examples with data: o : $\log(PM_{gj} - MM_{gj})$, j=1,…,J (at 0 if not defined). Mean ± 1 SD

13 BGX 13 Comparison with other expression measures. 11 genes spiked in at 13 (increasing) concentrations. BGX index $\mu_g$ increases with concentration, except for gene 7 (incorrectly spiked-in??). Indication of smooth & sustained increase over a wider range of concentrations

14 BGX 14 95% credibility intervals for Bayesian gene expression index 11 spike-in genes at 13 different concentrations Note how the variability is substantially larger for low expression level Each colour corresponds to a different spike-in gene Gene 7 : broken red line

15 BGX 15 II – Modelling differential expression. [Diagram: start with given point estimates of expression; Condition 1 and Condition 2 each have a hierarchical model of replicate variability and array effect; differential expression parameter; posterior distribution (flat prior); mixture modelling for classification]

16 BGX 16 Data Sets and Biological question. Biological Question: Understand the mechanisms of insulin resistance, using animal models where key genes are knocked out. A) Cd36 Knock out Data set (MAS 5): 3 wildtype (normal) mice compared with 3 mice with Cd36 knocked out ( genes on each array ). B) IRS2 Knock out Data set (RMA): 8 wildtype (normal) mice compared with 8 mice with IRS2 gene knocked out ( genes on each array)

17 BGX 17 Mouse data set A: Exploratory analysis showing array effect. [Figure panels: Condition 1 (3 replicates); Condition 2 (3 replicates); spline curves shown] Needs normalisation

18 BGX 18 Bayesian hierarchical model for differential expression (Lewin et al, Biometrics, 2005).
Data: $y_{gcr}$ = log gene expression for gene g, replicate r, condition c.
$\alpha_g$ = gene effect; $d_g$ = differential effect for gene g between 2 conditions; $\beta_{r(g)c}$ = array effect, modelled as a smooth (spline) function of $\alpha_g$; $\sigma_{gc}^2$ = gene specific variance.
1st level: $y_{g1r} \sim N(\alpha_g - \tfrac{1}{2} d_g + \beta_{r(g)1}, \sigma_{g1}^2)$ and $y_{g2r} \sim N(\alpha_g + \tfrac{1}{2} d_g + \beta_{r(g)2}, \sigma_{g2}^2)$, with $\sum_r \beta_{r(g)c} = 0$ and $\beta_{r(g)c}$ = function of $\alpha_g$, parameters {c,d}.
2nd level: flat priors for $\alpha_g$, $d_g$, {c,d}; $\sigma_{gc}^2 \sim \mathrm{lognormal}(a_c, b_c)$ (exchangeable variances).

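19 BGX 19 Directed Acyclic Graph for the differential expression model (no array effect represented). [Figure: DAG with hyperparameters $(a_1, b_1)$ and $(a_2, b_2)$, variances $\sigma_{g1}^2$ and $\sigma_{g2}^2$, sample variances $s_{g1}^2$ and $s_{g2}^2$, gene effect $\alpha_g$, differential effect $d_g$, and data summaries $\tfrac{1}{2}(y_{g1.} + y_{g2.})$ and $\tfrac{1}{2}(y_{g1.} - y_{g2.})$]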

20 BGX 20 Differential expression model Joint modelling of array effects and differential expression: Performs normalisation simultaneously with estimation Gives fewer false positives How to check some of the modelling assumptions? Posterior predictive checks How to use the posterior distribution of d g to select genes of interest ? Decision rules

21 BGX 21 Bayesian Model Checking. Check assumptions on gene variances, e.g. exchangeable variances, what distribution? Predict sample variance $s_g^{2,new}$ (a chosen checking function) from the model specification (not using the data for this). Compare predicted $s_g^{2,new}$ with observed $s_g^{2,obs}$. Bayesian p-value: Prob($s_g^{2,new} > s_g^{2,obs}$). Distribution of p-values approx Uniform if model is true (Marshall and Spiegelhalter, 2003). Easily implemented in MCMC algorithm
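The check described on this slide can be sketched as follows, for a single gene and a single condition. This is an illustration under stated assumptions, not the authors' implementation: the lognormal is taken to mean that the log variance is normal with mean a and standard deviation b, replicates are taken to be normal so that the sample variance is a scaled chi-square, and `hyper_draws` stands in for MCMC draws of (a, b), which are fixed made-up values here. The function name `predictive_pvalue` is hypothetical.

```python
import math
import random


def predictive_pvalue(rng, s2_obs, n_rep, hyper_draws):
    """Estimate Prob(s2_new > s2_obs) by simulating one s2_new per draw of (a, b)."""
    exceed = 0
    for a, b in hyper_draws:
        # New variance from the exchangeable prior: log(sigma2_new) ~ N(a, b^2)
        sigma2_new = math.exp(rng.gauss(a, b))
        # Sample variance of n_rep normal replicates: sigma2 * chi2_{n-1} / (n-1);
        # a chi-square with k degrees of freedom is Gamma(shape=k/2, scale=2)
        chi2 = rng.gammavariate((n_rep - 1) / 2.0, 2.0)
        s2_new = sigma2_new * chi2 / (n_rep - 1)
        if s2_new > s2_obs:
            exceed += 1
    return exceed / len(hyper_draws)


rng = random.Random(0)
hyper_draws = [(-2.0, 0.5)] * 4000  # placeholder for real MCMC output
p_small = predictive_pvalue(rng, s2_obs=1e-9, n_rep=3, hyper_draws=hyper_draws)
p_large = predictive_pvalue(rng, s2_obs=1e9, n_rep=3, hyper_draws=hyper_draws)
p_typical = predictive_pvalue(rng, s2_obs=0.1, n_rep=3, hyper_draws=hyper_draws)
```

Computing one such p-value per gene and plotting their histogram gives the comparison with the Uniform distribution mentioned on the slide.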

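22 BGX 22 Bayesian model checking. [Figure: the DAG of the differential expression model (hyperparameters $(a_1, b_1)$ and $(a_2, b_2)$, variances $\sigma_{g1}^2$ and $\sigma_{g2}^2$, sample variances $s_{g1}^2$ and $s_{g2}^2$, gene effect, $d_g$, $\tfrac{1}{2}(y_{g1.} + y_{g2.})$ and $\tfrac{1}{2}(y_{g1.} - y_{g2.})$), with added nodes for the predicted variance $\sigma_{g1}^{2,new}$ and for the predicted and observed sample variances $s_{g1}^{2,new}$ and $s_{g1}^{2,obs}$]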

23 BGX 23 Mouse Data set A

24 BGX 24 Use of tail probabilities for selecting gene lists. $d_g$: log fold change. $t_g = d_g / (\sigma_{g1}^2 / n_1 + \sigma_{g2}^2 / n_2)^{1/2}$: standardised difference ($n_1$ and $n_2$: number of replicates in each condition). -- Obtain the posterior distribution of $d_g$ and/or $t_g$ -- Compute directly the posterior probability of genes satisfying criterion X of interest, e.g. $d_g$ > threshold or $t_g$ > percentile: $p_{g,X}$ = Prob(g of interest | Criterion X, data) -- Compute the distributions of ranks, … Interesting statistical issues on relative merits and properties of different selection rules based on tail probabilities

25 BGX 25 Bayesian T test: using the posterior distribution of $t_g$ (standardised difference) (Natalia Bochkina). Compute Probability($|t_g| > 2$ | data). Order genes. Select genes such that Probability($|t_g| > 2$ | data) > cut-off (in blue). By comparison, additional genes selected by a standard T test with p value < 5% are in red. Data set B
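The selection rule on this slide can be sketched from MCMC output as below. The sketch assumes that, for each gene, posterior draws of the log fold change and of the two condition variances are available as tuples; the toy draws, the gene labels, and the function names are hypothetical.

```python
import math


def standardised_difference(d, sigma2_1, sigma2_2, n1, n2):
    """t_g = d_g / sqrt(sigma2_g1 / n1 + sigma2_g2 / n2) for one posterior draw."""
    return d / math.sqrt(sigma2_1 / n1 + sigma2_2 / n2)


def tail_probability(draws, n1, n2, threshold=2.0):
    """Fraction of posterior draws (d, sigma2_1, sigma2_2) with |t_g| > threshold."""
    hits = sum(
        abs(standardised_difference(d, s1, s2, n1, n2)) > threshold
        for d, s1, s2 in draws
    )
    return hits / len(draws)


def select_genes(draws_by_gene, n1, n2, cutoff):
    """Order genes by Prob(|t_g| > 2 | data) and keep those above the cut-off."""
    probs = {g: tail_probability(draws, n1, n2) for g, draws in draws_by_gene.items()}
    ranked = sorted(probs, key=probs.get, reverse=True)
    return [g for g in ranked if probs[g] > cutoff]


# Toy posterior draws (d_g, sigma2_g1, sigma2_g2); real draws would come from MCMC.
toy_draws = [(1.0, 0.3, 0.3), (0.1, 0.3, 0.3), (-2.0, 0.3, 0.3), (0.5, 0.3, 0.3)]
prob = tail_probability(toy_draws, n1=3, n2=3)

selected = select_genes(
    {"geneA": [(2.0, 0.3, 0.3)] * 10, "geneB": [(0.1, 0.3, 0.3)] * 10},
    n1=3, n2=3, cutoff=0.8,
)
```

Because the probability is computed draw by draw, uncertainty in the variances propagates into the selection, unlike a plug-in t statistic.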

26 BGX 26 Credibility intervals for ranks 100 genes with lowest rank (most under/ over expressed) Low rank, high uncertainty Low rank, low uncertainty
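A minimal sketch of how credibility intervals for ranks might be obtained from MCMC output: rank the genes within each iteration, then take empirical quantiles of each gene's ranks. The input layout (`draws[i][g]` is the value of the differential effect for gene g at iteration i), the quantile convention, and the function name are assumptions made for illustration.

```python
import math


def rank_intervals(draws, level=0.95):
    """Return one (lower, upper) rank interval per gene.

    Rank 1 is the most under-expressed gene within an iteration.
    """
    n_genes = len(draws[0])
    ranks = [[] for _ in range(n_genes)]
    for iteration in draws:
        # Rank genes within this iteration by their sampled value
        order = sorted(range(n_genes), key=lambda g: iteration[g])
        for rank, g in enumerate(order, start=1):
            ranks[g].append(rank)
    alpha = (1.0 - level) / 2.0
    intervals = []
    for gene_ranks in ranks:
        gene_ranks.sort()
        m = len(gene_ranks)
        # Conservative empirical quantiles: round outwards at both ends
        lower = gene_ranks[int(math.floor(alpha * (m - 1)))]
        upper = gene_ranks[int(math.ceil((1.0 - alpha) * (m - 1)))]
        intervals.append((lower, upper))
    return intervals


# Toy draws: genes 0 and 1 swap order in half of the iterations, gene 2 is stable.
toy = [[-3.0, 0.0, 3.0]] * 5 + [[0.0, -3.0, 3.0]] * 5
intervals = rank_intervals(toy)
```

A wide interval flags a gene whose position in the list is uncertain, which is the distinction the slide draws between low rank with high and with low uncertainty.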

27 BGX 27 III – Mixture and Bayesian estimation of False Discovery Rates (FDR). Mixture models can be used to perform a model based classification. Mixture models can be considered at the level of the data (e.g. clustering time profiles) or for the underlying parameters. Mixture models can be used to detect differentially expressed genes if a model of the alternative is specified. One benefit is that an estimate of the uncertainty of the classification, the False Discovery Rate, is simultaneously obtained

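28 BGX 28 Mixture framework for differential expression (we assume that the data has been pre normalised).
$y_{g1r} = \alpha_g - \tfrac{1}{2} d_g + \varepsilon_{g1r}$, r = 1, …, $R_1$
$y_{g2r} = \alpha_g + \tfrac{1}{2} d_g + \varepsilon_{g2r}$, r = 1, …, $R_2$
$\mathrm{Var}(\varepsilon_{gcr}) = \sigma_{gc}^2 \sim IG(a_c, b_c)$
Explicit modelling of the alternative: $d_g \sim \pi_0 \delta_0 + \pi_1 G(-x \mid 1.5, \eta_1) + \pi_2 G(x \mid 1.5, \eta_2)$, where the first component is $H_0$ and the two Gamma components are $H_1$.
Dirichlet distribution for $(\pi_0, \pi_1, \pi_2)$; Exp(1) hyper prior for $\eta_1$ and $\eta_2$.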
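To illustrate the three-component prior on the differential effect (a point mass at zero for unchanged genes, a mirrored Gamma for under-expressed genes and a Gamma for over-expressed genes), here is a small sampling sketch. It assumes that the second Gamma parameter is a rate; the weights and rates are illustrative values, and the names `pi0`, `pi1`, `pi2`, `eta1`, `eta2` and `sample_d` are chosen for this example.

```python
import random


def sample_d(rng, pi0, pi1, pi2, eta1, eta2):
    """Draw one differential effect from the three-component mixture prior."""
    assert abs(pi0 + pi1 + pi2 - 1.0) < 1e-9, "mixture weights must sum to 1"
    u = rng.random()
    if u < pi0:
        return 0.0  # null component: no differential expression
    if u < pi0 + pi1:
        # Under-expressed: minus a Gamma(shape=1.5, rate=eta1) variate
        return -rng.gammavariate(1.5, 1.0 / eta1)
    # Over-expressed: Gamma(shape=1.5, rate=eta2) variate
    return rng.gammavariate(1.5, 1.0 / eta2)


rng = random.Random(0)
draws = [sample_d(rng, pi0=0.8, pi1=0.1, pi2=0.1, eta1=1.0, eta2=1.0)
         for _ in range(10000)]
fraction_null = sum(1 for d in draws if d == 0.0) / len(draws)
```

A shape of 1.5 gives the Gamma density zero mass at the origin, so the alternative components are separated from the null spike.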

29 BGX 29 Mixture for classification of DE genes. Calculate the posterior probability for any gene of belonging to the unmodified component: $p_{g0}$ | data. Classify using a cut-off on $p_{g0}$: i.e. declare gene is DE if $1 - p_{g0} > p_{cut}$. Bayes rule corresponds to $p_{cut} = 0.5$. Bayesian estimate of FDR (and FNR) for any list (Newton et al 2003, Broët et al 2004): Bayes FDR(list) | data $= \frac{1}{\mathrm{card}(\mathrm{list})} \sum_{g \in \mathrm{list}} p_{g0}$
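The classification rule and the FDR estimate above can be sketched in a few lines. The posterior null probabilities below are made-up numbers; in practice they would be estimated from the MCMC output. The FNR estimate shown (the average of one minus the null probability over genes not on the list) is the natural counterpart of the FDR formula and is stated here as an assumption, since the slide does not spell it out. All function names are hypothetical.

```python
def classify(p_null, p_cut=0.5):
    """Indices of genes declared DE, i.e. those with 1 - p_g0 > p_cut."""
    return [g for g, p0 in enumerate(p_null) if 1.0 - p0 > p_cut]


def bayes_fdr(p_null, selected):
    """Average posterior null probability over the selected list."""
    if not selected:
        return 0.0  # convention for an empty list
    return sum(p_null[g] for g in selected) / len(selected)


def bayes_fnr(p_null, selected):
    """Average posterior non-null probability over the genes left off the list."""
    chosen = set(selected)
    rest = [g for g in range(len(p_null)) if g not in chosen]
    if not rest:
        return 0.0
    return sum(1.0 - p_null[g] for g in rest) / len(rest)


p_null = [0.05, 0.10, 0.90, 0.60, 0.30]  # illustrative values of p_g0
de_list = classify(p_null)               # Bayes rule, p_cut = 0.5
fdr = bayes_fdr(p_null, de_list)
fnr = bayes_fnr(p_null, de_list)
```

Sweeping `p_cut` and recomputing both rates gives the kind of trade-off curve used to choose a classification rule.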

30 BGX 30 Performance of the mixture prior. Joint estimation of all the mixture parameters (including $\pi_0$) using MCMC algorithms avoids plugging-in of values that are influential on the classification. Estimation of all parameters combines information from biological replicates and between condition contrasts. Performance has been tested on simulated data sets

31 BGX 31 Plot of true difference in each case. [Figure panels: $\pi_0$ = 0.8, 500 DE; $\pi_0$ = 0.9, 250 DE; $\pi_0$ = 0.99, 25 DE; $\pi_0$ = 0.95, 125 DE; $\pi_0$ = 0.80, 500 DE]

32 BGX 32 Examples of simulated data for each case

33 BGX 33 Results averaged over 50 replications. [Figure panels, average estimate $\hat{\pi}_0$: 0.99; 0.80; 0.90; 0.78; 0.95] Good estimates of $\pi_0$ = Prob(null) for each case

34 BGX 34 Comparison of estimated (dotted lines) and observed (full lines) FDR (black) and FNR (red) rates as the cut-off for declaring DE is varied. Bayesian mixture: good estimates of FDR and FNR; easy way to choose an efficient classification rule

35 BGX 35 In summary: Integrated gene expression analysis. Uses the natural hierarchical structure of the data, e.g. probes within genes within replicate arrays within condition, to synthesize, borrow information and provide realistic quantification of uncertainty. Posterior distributions can be exploited for inference with few replicates: choice of decision rules. Framework where biological prior information, e.g. on the structure of the probes or on chromosomal location, can be incorporated. Model based classification, e.g. through mixtures, provides interpretable output and a structure to deal with multiplicity. General framework for investigating other questions

36 BGX 36 Many interesting questions in the analysis of gene expression data -- Comparison of gene expression under different experimental conditions, or in different tissues -- Integrated gene expression analysis -- Investigate high dimensional classification rules (prediction with large number of variables) and large p small n regression problems (shrinkage or variable selection) -- Building a predictive model for classification or prognosis based on gene expression measurements, finding signatures

37 BGX 37 Association of gene expression with prognosis Investigate properties of high dimensional classification rules (prediction with large number of variables) and large p small n regression problems (shrinkage or variable selection) Expression plot of 115 prognostic genes comprising The Ovarian Cancer Prognostic Profile

38 BGX Comparison of gene expression under different experimental conditions, or in different tissues -- Building a predictive model for classification or prognosis based on gene expression measurements, finding signatures Other questions …. -- Integrated gene expression analysis -- Investigate high dimensional classification rules (prediction with large number of variables) and large p small n regression problems (shrinkage or variable selection) -- Perform unsupervised model based clustering -- Estimate graphical models -- Exploration of patterns and association networks in gene expression matrices

39 BGX Comparison of gene expression under different experimental conditions, or in different tissues -- Classification of gene expression profiles and association of gene expression with other factors, e.g. prognosis (prediction problem) Exploration of patterns in gene expression matrices Perform unsupervised model based clustering (e.g. semi-parametric using basis functions, mixtures or DP processes) Development of central nervous systems in rats (9 time points) samples genes

40 BGX 40 BBSRC Exploiting Genomics grant Colleagues Natalia Bochkina, Anne Mette Hein, Alex Lewin (Imperial College) Peter Green (Bristol University) Philippe Broët (INSERM, Paris) Papers and technical reports: Thanks

