Structured statistical modelling of gene expression data
Peter Green (Bristol)
Sylvia Richardson, Alex Lewin, Anne-Mette Hein (Imperial)
with Clare Marshall, Natalia Bochkina (Imperial), Graeme Ambler (Bristol), Tim Aitman and Helen Causton (Hammersmith)
BGX Windsor, October 2004

Statistical modelling and biology
Extracting the message from microarray data needs statistical as well as biological understanding.
Statistical modelling, in contrast to data analysis, gives a framework for formally organising assumptions about signal and noise.
Our models are structured, reflecting the data generation process: highly structured stochastic systems.

Background and 3 studies
Hierarchical modelling
A fully Bayesian gene expression index (BGX)
Differential expression and array effects
Two-way clustering

Part 1
Hierarchical modelling
A fully Bayesian gene expression index (BGX)
Differential expression and array effects
Two-way clustering

Gene expression using Affymetrix chips
[Figure, slide courtesy of Affymetrix. Labels: array, 1.28 cm, carrying approx. ½ million different complementary oligonucleotides; oligonucleotide element, 20 µm, holding millions of copies of a specific oligonucleotide sequence; single-stranded, labelled RNA sample; hybridised spot; image of hybridised array with zoom, showing expressed and non-expressed genes.]

Variation and uncertainty
Gene expression data (e.g. Affymetrix) is the result of multiple sources of variability:
condition/treatment
biological
array manufacture
imaging
technical
within/between array variation
gene-specific variability
Structured statistical modelling allows all of this uncertainty to be considered at once.

Costs and benefits of this approach
Advantages of avoiding a plug-in approach: uncertainties are propagated throughout the model; estimates of variability are realistic; bias is avoided.
The price you pay is computational cost: intricate implementation; longer run times (but far shorter than the experimental protocol!).

A fully Bayesian Gene eXpression index for Affymetrix GeneChip arrays
Anne-Mette Hein
Sylvia Richardson, Helen Causton, Graeme Ambler, Peter Green
[Diagram: PM/MM probe pairs, with gene-specific variability (probe), feeding the BGX gene index.]

Single array model: motivation
Key observation: PMs and MMs both increase with spike-in concentration (MMs more slowly than PMs). Conclusion: MMs bind a fraction of the signal.
Key observation: the spread of PMs increases with level. Conclusion: multiplicative (and additive) error; transformation needed.
Key observation: considerable variability in PM (and MM) response within a probe set. Conclusion: varying reliability in gene expression estimation for different genes.
Key observation: probe effects are approximately additive on the log scale. Conclusion: estimate the gene expression measure from PMs and MMs on the log scale.

Model assumptions and key biological parameters
The intensity of the PM measurement for probe (reporter) j and gene g is due to binding
of labelled fragments that perfectly match the oligos in the spot (the true signal $S_{gj}$), and
of labelled fragments that do not perfectly match these oligos (the non-specific hybridisation $H_{gj}$).
The intensity of the corresponding MM measurement is caused
by binding of a fraction $\Phi$ of the true signal $S_{gj}$, and
by non-specific hybridisation $H_{gj}$.

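BGX single array model
Indices: $g = 1, \dots, G$ (thousands), $j = 1, \dots, J$ (11 to 20). TN denotes a truncated normal distribution.
Background noise, additive signal: $PM_{gj} \sim N(S_{gj} + H_{gj}, \tau^2)$
Non-specific hybridisation fraction $\Phi$: $MM_{gj} \sim N(\Phi S_{gj} + H_{gj}, \tau^2)$
Signal, $j = 1, \dots, J$: $\log(S_{gj} + 1) \sim TN(\mu_g, \xi_g^2)$
Non-specific hybridisation, array-wide distribution, $j = 1, \dots, J$: $\log(H_{gj} + 1) \sim TN(\lambda, \eta^2)$
Gene-specific error terms, exchangeable: $\log(\xi_g^2) \sim N(a, b^2)$ (empirical Bayes)
Gene expression index (BGX): $\theta_g = \mathrm{median}(TN(\mu_g, \xi_g^2))$; pools information over probes $j = 1, \dots, J$
Priors (vague): $\tau^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\Phi \sim B(1, 1)$, $\mu_g \sim U(0, 15)$, $\eta^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\lambda \sim N(0, 10^3)$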
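The single array model can be read as a recipe for generating data. The Python sketch below simulates PM/MM pairs for one probe set under that reading and evaluates the index as the median of the truncated normal. It is an illustration only, not the authors' BGX code: truncation of TN at zero, every parameter value, and the function names `sample_tn`, `tn_median` and `simulate_probe_set` are assumptions made for the example.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def sample_tn(mu, sd, rng):
    """Draw from N(mu, sd^2) truncated to [0, inf) by rejection (truncation point assumed)."""
    while True:
        x = rng.gauss(mu, sd)
        if x >= 0.0:
            return x

def tn_median(mu, sd):
    """Median of N(mu, sd^2) truncated to [0, inf)."""
    below = STD_NORMAL.cdf((0.0 - mu) / sd)  # probability mass removed by truncation
    return mu + sd * STD_NORMAL.inv_cdf(below + 0.5 * (1.0 - below))

def simulate_probe_set(mu_g, xi_g, lam, eta, phi, tau, n_probes, seed=0):
    """Simulate (PM, MM) pairs for one gene; all parameter values are hypothetical."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_probes):
        s = math.exp(sample_tn(mu_g, xi_g, rng)) - 1.0  # log(S + 1) ~ TN(mu_g, xi_g^2)
        h = math.exp(sample_tn(lam, eta, rng)) - 1.0    # log(H + 1) ~ TN(lambda, eta^2)
        pm = rng.gauss(s + h, tau)                      # PM ~ N(S + H, tau^2)
        mm = rng.gauss(phi * s + h, tau)                # MM ~ N(phi * S + H, tau^2)
        pairs.append((pm, mm))
    return pairs

pairs = simulate_probe_set(mu_g=6.0, xi_g=0.5, lam=4.0, eta=0.7,
                           phi=0.1, tau=10.0, n_probes=11)
index = tn_median(6.0, 0.5)  # the index for this gene under the assumed truncation
```

When $\mu_g$ is far above zero relative to $\xi_g$, truncation removes almost no mass and the median is essentially $\mu_g$; the two differ only for weakly expressed genes.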

Markov chain Monte Carlo (MCMC) computation
Fitting of Bayesian models has been hugely facilitated by the advent of these simulation methods.
They produce a large sample of values of all unknowns, from the posterior given the data.
Easy to set up for hierarchical models.
BUT they can be slow to run (for many variables!) and can fail to converge reliably.
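To make "a large sample of values of all unknowns" concrete, here is a minimal random-walk Metropolis sketch in Python for a toy problem: the posterior of a normal mean with known standard deviation and a flat prior. This is not the sampler used for the BGX models; the toy model, the data values, the step size and the function names are all assumptions chosen for illustration.

```python
import math
import random

def log_posterior(mu, data, sigma=1.0):
    """Log posterior of mu up to a constant: flat prior, normal likelihood, known sigma."""
    return -sum((y - mu) ** 2 for y in data) / (2.0 * sigma ** 2)

def metropolis(data, n_iter=5000, step=0.5, start=0.0, seed=1):
    """Random-walk Metropolis; returns the full chain of sampled values of mu."""
    rng = random.Random(seed)
    mu = start
    current = log_posterior(mu, data)
    draws = []
    for _ in range(n_iter):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        draws.append(mu)
    return draws

data = [1.8, 2.1, 2.4, 1.9, 2.3]   # hypothetical observations
chain = metropolis(data)
kept = chain[1000:]                # discard an arbitrary burn-in
posterior_mean = sum(kept) / len(kept)
```

For this toy model the exact posterior is normal, centred at the sample mean, so the chain can be checked against a known answer. No such check exists for a large hierarchical model, which is why slow mixing and unreliable convergence matter in practice.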

Sample in place of a distribution: 1D [figure]

Sample in place of a distribution: 2D [figure]

Single array model performance
Data set with varying concentrations (GeneLogic): 14 samples of cRNA from an acute myeloid leukemia (AML) tumor cell line.
In sample k, each of 11 genes is spiked in at concentration $c_k$:
sample k:    1    2    3     4    5    6    7    8    9     10  11  12  13   14
conc. (pM):  0.0  0.5  0.75  1.0  1.5  2.0  3.0  5.0  12.5  25  50  75  100  150
Each sample is hybridised to an array.
Consider the subset consisting of 500 normal genes + 11 spike-ins.

Signal & expression indices
10 arrays: gene 1 spiked in at increasing concentrations.
The `true signal`/expression index BGX increases with concentration.
[Figure] Lines: 95% credibility intervals for $\log(S_{gj} + 1)$. Curves: posterior for signal.

Non-specific hybridisation
10 arrays: gene 1 spiked in at increasing concentrations.
Non-specific hybridisation does not increase with concentration.
[Figure] Lines: 95% credibility intervals for $\log(H_{gj} + 1)$. Curves: posterior for signal.

Comparison with other expression measures
11 genes spiked in at 13 (increasing) concentrations.
The BGX index $\theta_g$ increases with concentration ... except for gene 7 (incorrectly spiked in??).
Indication of a smooth & sustained increase over a wider range of concentrations.

Single array model: examples of posterior distributions of BGX indices
Each curve represents a gene.
Examples with data: o: $\log(PM_{gj} - MM_{gj})$, $j = 1, \dots, J_g$ (plotted at 0 if not defined). Mean ± 1 SD.

95% credibility intervals for the Bayesian gene expression index
11 spike-in genes at 13 different concentrations (data set A).
Note how the variability is substantially larger at low expression levels.
[Figure] Each colour corresponds to a different spike-in gene. Gene 7: broken red line.

Bayesian modelling of differential gene expression, adjusting for array effects
Alex Lewin
Sylvia Richardson, Natalia Bochkina, Clare Marshall, Anne Glazier, Tim Aitman
The spontaneously hypertensive rat (SHR): a model of human insulin resistance syndromes.
Deficiency in the gene Cd36 was found to be associated with insulin resistance in SHR.
Following this, several animal models were developed in which other relevant genes are knocked out, allowing comparison between knocked-out and wildtype (normal) mice or rats.
See poster!

Data set & biological question
Microarray data:
Data set A (MAS 5; 12000 genes on each array): 3 SHR compared with 3 transgenic rats.
Data set B (RMA; 22700 genes on each array): 8 wildtype (normal) mice compared with 8 knocked-out mice.
Biological question: find genes which are expressed differently in wildtype and knockout/transgenic mice.

Exploratory analysis showing array effect
[Figure panels: Condition 1 (3 replicates); Condition 2 (3 replicates).]

Differential expression model
The quantity of interest is the difference between conditions for each gene: $d_g$, $g = 1, \dots, N$.
Joint model for the 2 conditions:
$y_{g1r} = \alpha_g - \tfrac{1}{2} d_g + \beta_{1r}(\alpha_g) + \varepsilon_{g1r}, \quad r = 1, \dots, R_1$
$y_{g2r} = \alpha_g + \tfrac{1}{2} d_g + \beta_{2r}(\alpha_g) + \varepsilon_{g2r}, \quad r = 1, \dots, R_2$
where
$y_{gcr}$ is log gene expression for gene g, condition c, replicate r;
$\alpha_g$ is the overall gene effect;
$\beta_{cr}(\alpha)$ is the array effect, a smooth function of $\alpha$;
$\varepsilon_{gcr}$ is normally distributed error, with gene- and condition-specific variance.

Differential expression model
Joint modelling of array effects and differential expression:
performs normalisation simultaneously with estimation;
gives fewer false positives;
can work with any desired composite criterion for identifying interesting genes, e.g. fold change and overall expression level.

Data set A (MAS5): 3 wildtype mice compared to 3 knockout mice (U74A chip)
Criterion: a gene is of interest if |log fold change| > log(2) and log(overall expression) > 4.
[Figure: plot of log fold change versus overall expression level. Genes with $p_{g,X} > 0.5$ (green): 280. Genes with $p_{g,X} > 0.8$ (red): 46. One point is labelled $p_{g,X} = 0.49$.]
The majority of the genes have very small $p_{g,X}$: 90% of genes have $p_{g,X} < 0.2$.
Genes with low overall expression have a greater range of fold change than those with higher expression.
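The slides do not define $p_{g,X}$. A natural reading, assumed here, is that it is the posterior probability that gene g satisfies criterion X, which can be estimated as the fraction of MCMC draws in which the criterion holds. The Python sketch below does this for one gene, taking the log fold change to be the difference parameter $d_g$ and the overall expression to be the overall gene effect on the log scale. The function name and the draws are hypothetical.

```python
import math

def prob_of_interest(d_draws, alpha_draws, fold=2.0, min_expression=4.0):
    """Fraction of paired posterior draws that meet the composite criterion
    |log fold change| > log(fold) and overall expression > min_expression."""
    if len(d_draws) != len(alpha_draws) or not d_draws:
        raise ValueError("need equally many, non-empty draws")
    threshold = math.log(fold)
    hits = sum(1 for d, a in zip(d_draws, alpha_draws)
               if abs(d) > threshold and a > min_expression)
    return hits / len(d_draws)

# Hypothetical posterior draws for a single gene (not real data).
d = [0.9, 0.8, 0.2, -1.1, 0.75]
alpha = [5.0, 4.5, 6.0, 3.0, 4.2]
p = prob_of_interest(d, alpha)
```

Because the criterion is evaluated draw by draw, any composite rule can be substituted without refitting the model; only the test inside the sum changes.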

Data set B (RMA): 8 wildtype mice compared to 8 knockout mice
Criterion: a gene is of interest if |log fold change| > log(1.5).
[Figure: plot of log fold change versus overall expression level. Genes with $p_{g,X} > 0.5$ (green): 292. Genes with $p_{g,X} > 0.8$ (red): 139.]
The majority of the genes have very small $p_{g,X}$: 97% of genes have $p_{g,X} < 0.2$.

Integrated modelling of Affymetrix data
[Diagram: for each of condition 1 and condition 2, PM/MM probe pairs with gene-specific variability (probe) feed the BGX gene index; a hierarchical model of replicate (biological) variability and array effect gives the distribution of the expression index for gene g in that condition; the two conditions are linked through the distribution of the differential expression parameter.]

Part 4
Hierarchical modelling
A fully Bayesian gene expression index (BGX)
Differential expression and array effects
Two-way (gene by sample) clustering

Hierarchical clustering of samples
A subset of 1161 gene expression profiles, obtained in 60 different samples (Ross et al, Nature Genetics, 2000).
The gene expression profiles cluster according to tissue of origin of the samples.
[Figure] Red: more mRNA; green: less mRNA in the sample compared to a reference.

Non-model-based clustering
Many clustering algorithms have been developed and used for exploratory purposes.
They rely on a measure of distance (dissimilarity) between gene or sample profiles, e.g. Euclidean.
Hierarchical clustering proceeds in an agglomerative manner: single profiles are joined to form groups using the distance metric, recursively.
A good visual tool, but it involves many arbitrary choices, so take care in interpretation!
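The agglomerative procedure can be sketched in a few lines of Python. The version below uses Euclidean distance and single linkage (the distance between two groups is the distance between their closest members). Both are examples of the arbitrary choices mentioned on the slide, picked here only for brevity; the function names and the toy profiles are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(profiles, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.
    Returns clusters as lists of profile indices."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: closest pair of members across the two clusters.
                dist = min(euclidean(profiles[p], profiles[q])
                           for p in clusters[i] for q in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy profiles (hypothetical): two tight pairs and one outlier.
profiles = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.0, 0.0]]
groups = single_linkage(profiles, n_clusters=3)
```

Changing the linkage rule, the distance, or the number of clusters at which merging stops can change the grouping, and the procedure attaches no measure of uncertainty to the result.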

Model-based clustering
Build the cluster structure into the model, rather than estimating gene effects (say) first and post-processing to seek clusters.
The Bayesian setting allows use of real prior information where it exists (biological understanding of pathways etc., previous experiments, ...).

Additive ANOVA models for (log-) gene expression
g = gene, s = sample/condition
The simplest model: gene + sample [equation not preserved]
The model generates the method, and in this case performs a simple form of normalisation.
Under standard conditions, the (least-squares) estimates of gene effects are [equation not preserved]
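The slide's own formulas did not survive in this transcript. As a hedged illustration of the "gene + sample" idea, assume the model is log expression = overall mean + gene effect + sample effect + error, with a complete gene-by-sample table and effects constrained to sum to zero. Under those assumptions the least-squares estimate of each gene effect is that gene's mean minus the grand mean, and likewise for samples. The Python sketch below computes these; the notation and the function name are assumptions, not the authors'.

```python
def additive_fit(y):
    """Least-squares fit of y[g][s] = mu + gene[g] + sample[s] for a complete table,
    with gene effects and sample effects each constrained to sum to zero."""
    n_genes, n_samples = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (n_genes * n_samples)
    gene = [sum(row) / n_samples - grand for row in y]
    sample = [sum(y[g][s] for g in range(n_genes)) / n_genes - grand
              for s in range(n_samples)]
    return grand, gene, sample

# Toy log-expression table (hypothetical): 2 genes by 3 samples.
y = [[7.0, 8.0, 9.0],
     [4.0, 5.0, 6.0]]
mu, gene, sample = additive_fit(y)
```

Subtracting the fitted sample effects removes sample-wide shifts in level, which is the simple form of normalisation the slide refers to.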

... bring in mixture modelling ...
g = gene
$T_g$ = unknown cluster to which gene g belongs
[equation not preserved]
This is a mixture model (single sample first!)

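... finally allow clusters to overlap: Plaid model
h denotes a cluster, block or layer (pathway?)
[equation not preserved]
The gene membership indicator (subscript gh) and the sample membership indicator (subscript sh) each equal 0 or 1.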

Plaid model
[Figure: genes by samples.]

An early experiment: artificial raw data
Artificial data from a very special case of the Plaid model: a single sample s.
True H = 3; $b^{(h)}$ = 2.2, 3.4 and 4.7; noise $N(0, \sigma^2)$.
500 genes, some in each of the $2^3 = 8$ configurations of the gene membership indicators (subscript gh).
8 overlapping normal clusters.
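A data set of this kind can be generated with the short Python sketch below. It assumes that a gene's mean is the sum of the layer effects for the layers it belongs to, that genes are spread over the 8 membership patterns by simple cycling, and that the noise standard deviation is 0.5. None of these details is given on the slide, and the function name is hypothetical.

```python
import itertools
import random

def simulate_plaid_single_sample(effects, n_genes, noise_sd, seed=0):
    """Return a list of (membership pattern, observed value), one entry per gene."""
    rng = random.Random(seed)
    # All 2^H patterns of 0/1 layer membership.
    configs = list(itertools.product((0, 1), repeat=len(effects)))
    data = []
    for g in range(n_genes):
        membership = configs[g % len(configs)]  # cycle so every pattern occurs (assumption)
        mean = sum(b * m for b, m in zip(effects, membership))
        data.append((membership, mean + rng.gauss(0.0, noise_sd)))
    return data

layer_effects = [2.2, 3.4, 4.7]   # values quoted on the slide
data = simulate_plaid_single_sample(layer_effects, n_genes=500, noise_sd=0.5)
```

With these effects the 8 cluster means are 0, 2.2, 3.4, 4.7, 5.6, 6.9, 8.1 and 10.3, so neighbouring clusters overlap heavily once noise is added.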

[Figure: results for the artificial data. True H was 3; true $b^{(h)}$ were 2.2, 3.4, 4.7.]

Human fibroblast data: Lemon et al (2002)
18 samples split into 3 categories: serum starved, serum stimulated and a 50:50 mix of starved/stimulated.
We used the natural logarithm of Lemon et al.'s calculated LWF values as our measure of expression and subtracted gene and sample mean levels.
We then selected the 100 most variable genes across all 18 samples and used this 18×100 array as the input to our analysis.
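The preprocessing steps described here can be sketched in Python as follows. This is one reading of the description, not the authors' code: gene means are subtracted first and sample means second, variability is measured by variance across samples after centring, and rows are genes rather than samples. The function name and the toy values are hypothetical.

```python
import math
from statistics import mean, pvariance

def preprocess(values, n_keep):
    """values[g][s] are positive expression values (genes in rows, samples in columns).
    Returns (indices of kept genes, their log-transformed, centred rows)."""
    logged = [[math.log(v) for v in row] for row in values]
    # Subtract each gene's mean level.
    centred = [[x - mean(row) for x in row] for row in logged]
    # Subtract each sample's mean level.
    n_samples = len(values[0])
    sample_means = [mean(row[s] for row in centred) for s in range(n_samples)]
    centred = [[x - sample_means[s] for s, x in enumerate(row)] for row in centred]
    # Keep the n_keep genes with the largest variance across samples.
    ranked = sorted(range(len(centred)), key=lambda g: pvariance(centred[g]), reverse=True)
    keep = sorted(ranked[:n_keep])
    return keep, [centred[g] for g in keep]

# Toy table (hypothetical): 3 genes by 3 samples.
toy = [[1.0, 1.0, 1.0],
       [1.0, math.exp(1.0), math.exp(2.0)],
       [1.0, math.exp(2.0), math.exp(4.0)]]
kept, rows = preprocess(toy, n_keep=2)
```

For the fibroblast data the corresponding call would use `n_keep=100` on the table of LWF values.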

Bayesian clustering
The hierarchical model allows us to learn about all unknowns simultaneously.
In particular, this includes the complete 2-way classification, gene by sample, with numerical uncertainties.
We then construct visualisations of interesting aspects (marginal distributions) of this posterior.

Bayesian clustering: samples [figure]

Bayesian clustering: genes [figure]

More details, papers and code
www.stats.bris.ac.uk/BGX/
www.bgx.org.uk