1 Sylvia Richardson Centre for Biostatistics Imperial College, London Bayesian hierarchical modelling of genomic data In collaboration with Natalia Bochkina,

1 Sylvia Richardson Centre for Biostatistics Imperial College, London Bayesian hierarchical modelling of genomic data In collaboration with Natalia Bochkina, Anne Mette Hein, Alex Lewin (St Marys) Helen Causton and Tim Aitman (Hammersmith) Peter Green (Bristol) Philippe Broët (INSERM, Paris) BBSRC Exploiting Genomics grant

2 Outline Introduction A fully Bayesian gene expression index (BGX) Differential expression and array effects Mixture models Discussion

3 Part 1 Introduction. Recent developments in genomics have led to techniques – capable of interrogating the genome at different levels – aiming to capture one or several stages of the biological process: DNA -> mRNA -> protein -> phenotype

4 DNA -> mRNA -> protein Pictures from http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookTOC.html Protein-encoding genes are transcribed into mRNA (messenger), and the mRNA is translated to make proteins Fundamental process

5 What are gene expression data? DNA microarrays are used to measure the relative abundance of mRNA, providing information on gene expression in a particular cell, under particular conditions. The fundamental principle used to measure expression is hybridisation between a sample and probes: – Known sequences of single-stranded DNA representing genes are immobilised on the microarray – Tissue sample (with unknown concentration of RNA) is fluorescently labelled – Sample is hybridised to the array – Array is scanned to measure the amount of RNA present for each sequence. The expression levels of tens of thousands of probes are measured on a single microarray -> gene expression profile. [Figure label: gene expression measure]

6 Variation and uncertainty. Gene expression data (e.g. Affymetrix) are the result of multiple sources of variability: condition/treatment, biological, array manufacture, imaging, technical, gene specific variability of the probes for a gene, within/between array variation. Structured statistical modelling allows all sources of uncertainty to be considered at once.

7 Example of within vs between strains gene variability. 7 cross-bred strains of mice that differ only by a small portion of chromosome 1. Strains have different phenotypes related to immunological disorders. For each line, 9 animals were used to obtain 3 pooled RNA extracts from spleen: 7 x 3 samples. Excellent experimental design to minimise biological variability between replicate animals. Aim: to tease out differences between expression profiles of the 7 lines of mice and relate these to locations on chromosome 1.

8 Biological variability is large! [Figure panels: total variance calculated over the 21 samples; average (over the 7 groups) of within strain variance calculated from the 3 pooled samples; ratio within/total]
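The three quantities named on this slide can be sketched as follows. This is only an illustration, assuming a genes x samples matrix whose columns are ordered strain by strain; the function name `within_total_ratio` and the use of unbiased variances (`ddof=1`) are assumptions, not taken from the talk.

```python
import numpy as np

def within_total_ratio(expr, n_groups, n_reps):
    """Per-gene ratio of average within-group variance to total variance.

    expr: array of shape (genes, n_groups * n_reps), columns ordered by group.
    """
    expr = np.asarray(expr, dtype=float)
    # Total variance over all samples (21 in the mouse example)
    total_var = expr.var(axis=1, ddof=1)
    # Within-group variance from the replicates, averaged over the groups
    grouped = expr.reshape(expr.shape[0], n_groups, n_reps)
    within_var = grouped.var(axis=2, ddof=1).mean(axis=1)
    return within_var / total_var
```

For the mouse example this would be called with `n_groups=7` and `n_reps=3`; a ratio close to 1 indicates that the replicate-to-replicate variability dominates the differences between strains.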

9 Hierarchical clustering recovers the cross-bred lines structure for the 1000 genes most variable between strains. [Figure panels: 1000 genes most variable between strains; random set of 1000 genes]

10 Common characteristics of genomics data sets: high dimensional data (tens of thousands of genes) and few samples; many sources of variability (low signal/noise ratio). Common issues: pre-processing and data reduction; multiple testing; need to borrow information; importance of including prior biological knowledge.

11 Part 2 Introduction A fully Bayesian gene expression index (BGX) –Single array model –Multiple array model Differential expression and array effects Mixture models Discussion

12 A fully Bayesian Gene eXpression index for Affymetrix GeneChip arrays. Anne Mette Hein, SR, Helen Causton, Graeme Ambler, Peter Green. [Diagram labels: raw intensities (PM/MM probe pairs), background correction, gene specific variability (probe), gene index BGX]

13 Affymetrix GeneChips: each gene g is represented by a probe set (J: 11-20) with perfect match probes $PM_{g1},…,PM_{gJ}$ and mis-match probes $MM_{g1},…,MM_{gJ}$, to be summarised into an expression measure for gene g. [Figure, slide courtesy of Affymetrix: image of hybridised array with zoom on a hybridised spot, showing expressed PM and non-expressed PM]

14 Commonly used methods for estimating expression levels from GeneChips
MAS5: uses PMs and MMs. Imputes IMs (ideal mismatches) from the MMs so that all PM-MM differences are positive. Gene expression measure: estimate obtained by applying the Tukey biweight to the set of log(PM-MM) values in the probe set. Characteristics: positive, robust, noisy at low levels.
RMA: uses PMs only. Fits a model with additive gene and probe effects to log-scale background-corrected PMs using median polish. Characteristics: positive, robust, attenuated signal detection.
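As a rough illustration of the median polish step mentioned for RMA, the sketch below fits an additive row + column model to a small table by alternately sweeping out row and column medians. It is a generic Tukey median polish written for this transcript, not the RMA implementation; the layout (rows = probes, columns = arrays, entries = log-scale background-corrected PMs) and the function name are assumptions.

```python
import numpy as np

def median_polish(x, n_iter=10, tol=1e-8):
    """Decompose x[i, j] into overall + row[i] + col[j] + resid[i, j]."""
    resid = np.array(x, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])
    col = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        # Sweep row medians out of the residuals into the row effects
        rmed = np.median(resid, axis=1)
        resid -= rmed[:, None]
        row += rmed
        # Re-centre the column effects, moving their median to the overall term
        shift = np.median(col)
        col -= shift
        overall += shift
        # Sweep column medians out of the residuals into the column effects
        cmed = np.median(resid, axis=0)
        resid -= cmed[None, :]
        col += cmed
        shift = np.median(row)
        row -= shift
        overall += shift
        if max(np.abs(rmed).max(), np.abs(cmed).max()) < tol:
            break
    return overall, row, col, resid
```

Under the assumed layout, `overall + col[j]` plays the role of the expression summary for array j, and `row[i]` the probe effect.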

15 Variability across conditions is conditioned by the choice of summary measure! Beware of filtering. [Figure: mean (left) and empirical standard deviation (right) over 7 conditions (arrays) for 45000 genes, estimated by 2 different methods for quantifying gene expression]

16 Model assumptions and key biological parameters
The intensity of the PM measurement for probe (reporter) j and gene g is due to binding
– of labelled fragments that perfectly match the oligos in the spot: the true signal $S_{gj}$
– of labelled fragments that do not perfectly match these oligos: the non-specific hybridisation $H_{gj}$
The intensity of the corresponding MM measurement is caused
– by a binding fraction $\Phi$ of the true signal $S_{gj}$
– by non-specific hybridisation $H_{gj}$

17 BGX single array model: g=1,…,G (thousands), j=1,…,J (11-20)
$PM_{gj} \sim N(S_{gj} + H_{gj}, \tau^2)$, $MM_{gj} \sim N(\Phi S_{gj} + H_{gj}, \tau^2)$ (background noise, additive; $\Phi$: fraction)
$\log(S_{gj}+1) \sim TN(\mu_g, \xi_g^2)$, j=1,…,J (signal)
$\log(H_{gj}+1) \sim TN(\lambda, \eta^2)$ (non-specific hybridisation, array-wide distribution)
Gene specific error terms, exchangeable: $\log(\xi_g^2) \sim N(a, b^2)$ (a, b: empirical Bayes)
Gene expression index (BGX): $\mathrm{BGX}_g = \mathrm{median}(TN(\mu_g, \xi_g^2))$, pools information over probes j=1,…,J
Priors (vague): $\tau^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\Phi \sim B(1,1)$, $\mu_g \sim U(0,15)$, $\eta^2 \sim \Gamma(10^{-3}, 10^{-3})$, $\lambda \sim N(0, 10^3)$
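To make the generative structure of the single array model concrete, here is a small forward-simulation sketch. It assumes that TN denotes a normal distribution truncated to [0, ∞) and that the second argument of each distribution is a variance (the code takes standard deviations); all function names are hypothetical, and this is not the WinBUGS/C implementation used in the talk.

```python
import math
import random
from statistics import NormalDist

def tn_median(mu, sd):
    """Median of a normal(mu, sd^2) truncated to [0, inf): the BGX-style index."""
    nd = NormalDist(mu, sd)
    p0 = nd.cdf(0.0)  # probability mass removed by the truncation
    return nd.inv_cdf(p0 + 0.5 * (1.0 - p0))

def sample_tn(mu, sd, rng):
    """One draw from the truncated normal by inverting the CDF."""
    nd = NormalDist(mu, sd)
    p0 = nd.cdf(0.0)
    p = p0 + rng.random() * (1.0 - p0)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep inv_cdf strictly inside (0, 1)
    return max(0.0, nd.inv_cdf(p))

def simulate_probe_set(mu_g, xi_g, lam, eta, phi, tau, n_probes, rng):
    """Simulate (PM, MM) intensities for one gene under the assumed model."""
    pm, mm = [], []
    for _ in range(n_probes):
        s = math.exp(sample_tn(mu_g, xi_g, rng)) - 1.0  # true signal S_gj
        h = math.exp(sample_tn(lam, eta, rng)) - 1.0    # non-specific hybridisation H_gj
        pm.append(rng.gauss(s + h, tau))
        mm.append(rng.gauss(phi * s + h, tau))
    return pm, mm
```

For example, `simulate_probe_set(5.0, 0.5, 3.0, 0.5, 0.1, 10.0, 11, random.Random(1))` draws one probe set of 11 pairs. `tn_median` fails for strongly negative `mu` relative to `sd`, where the truncated mass is numerically 1; the sketch does not handle that case.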

18 Inference. Implemented in WinBUGS and C. Allows joint estimation of parameters in a full Bayesian framework. Obtain posterior distributions of the parameters (and functions of these) in the model. For each gene g: log-scale true signals $\log(S_{gj}+1)$, j=1,…,J; log-scale non-specific hybridisation $\log(H_{gj}+1)$, j=1,…,J; BGX gene expression index $\mathrm{BGX}_g$. NB: the index is a distribution! [Figure: mean and 2.5-97.5% credibility interval for each quantity]

19 Computational issues. We found slow mixing and large autocorrelation for the gene specific parameters ($\mu_g$, $\xi_g^2$). For low signal (bottom 25%) there is more variability of $S_{gj}$ and $H_{gj}$ and less separation, so there is less information on ($\mu_g$, $\xi_g^2$) and longer runs are needed. For the full hierarchical model, the convergence of the hyperparameters of the distribution of $\xi_g^2$ was problematic. We studied sensitivity to a range of plausible values for those and implemented an empirical Bayes version of the model, which was reproducible with sensible run lengths.

20 Posterior mean of the BGX index for gene g from a run of 30 000 sweeps versus those obtained from runs of 5 000, 10 000 and 20 000 sweeps. Reproducibility is obtained with short runs for large expression values. Longer runs are necessary for low expression values.

21 Single array model performance. Data set: varying concentrations (geneLogic). 14 samples of cRNA from an acute myeloid leukemia (AML) tumor cell line. In sample k, each of 11 genes is spiked in at concentration $c_k$:
sample k:        1    2    3     4    5    6    7    8    9     10  11  12  13   14
conc. c_k (pM):  0.0  0.5  0.75  1.0  1.5  2.0  3.0  5.0  12.5  25  50  75  100  150
Each sample is hybridised to an array. Consider the subset consisting of 500 normal genes + 11 spike-ins.

22 Single array model: examples of posterior distributions of BGX expression indices. Each curve (truncated normal with median param.) represents a gene. Examples with data: o: $\log(PM_{gj} - MM_{gj})$, j=1,…,J (plotted at 0 if not defined); mean +- 1 SD.

23 Single array model performance: 11 genes spiked in at 13 (increasing) concentrations. The BGX index increases with concentration, except for gene 7 (spiked-in??). Indication of smooth and sustained increase over a wider range of concentrations. [Figure: comparison with other expression measures]

24 2.5 – 97.5 % credibility intervals for the Bayesian expression index 11 spike-in genes at 13 different concentration (data set A) Note how the variability is substantially larger for low expression level Each colour corresponds to a different spike-in gene Gene 7 : broken red line

25 Integrated modelling of Affymetrix data. [Diagram: for each of condition 1 and condition 2, PM/MM probe pairs feed the gene specific variability (probe) and the gene index BGX, under a hierarchical model of replicate (biological) variability and array effect; the distributions of the expression index for gene g under condition 1 and condition 2 lead to the distribution of the differential expression parameter]

26 BGX multiple array model: conditions c=1,…,C, replicates r=1,…,$R_c$
$PM_{gjcr} \sim N(S_{gjcr} + H_{gjcr}, \tau_{cr}^2)$, $MM_{gjcr} \sim N(\Phi S_{gjcr} + H_{gjcr}, \tau_{cr}^2)$ (background noise, additive, array specific)
$\log(S_{gjcr}+1) \sim TN(\mu_{gc}, \xi_{gc}^2)$ (gene and condition specific)
$\log(H_{gjcr}+1) \sim TN(\lambda_{cr}, \eta_{cr}^2)$ (array-specific distribution of non-specific hybridisation)
Gene and condition specific BGX index: $\mathrm{BGX}_{gc} = \mathrm{median}(TN(\mu_{gc}, \xi_{gc}^2))$, pools information over replicate probe sets j=1,…,J, r=1,…,$R_c$

27 Subset of AffyU133A spike-in data set (AffyComp). Consider six arrays, 1154 genes (every 20th and 42 spike-ins). Same cRNA hybridised to all arrays EXCEPT for spike-ins:
Spike-in group:                         1     2     3     …  12     13     14
Spike-in genes:                         1-3   4-6   7-9   …  34-36  37-39  40-42
Conc. (pM), condition 1 (array 1-3):    0.0   0.25  0.50  …  128    256    512
Conc. (pM), condition 2 (array 4-6):    0.25  0.50  1.00  …  256    512    0.00
Fold change:                            -     2     2     …  2      2      -

28 BGX: measure of uncertainty provided. Posterior mean +- 1 SD credibility intervals for $\mathrm{diff}_g = \mathrm{bgx}_{g,1} - \mathrm{bgx}_{g,2}$. Spike-ins 1113-1154 are shown above the blue line. Blue stars show the RMA measure.

29 Part 3 Introduction A fully Bayesian gene expression index (BGX) Differential expression and array effects –Non linear array effects –Model checking Mixture models Discussion

30 Differential expression and array effects Alex Lewin SR, Natalia Bochkina, Tim Aitman

31 Data set and biological question. Biological question: understand the mechanisms of insulin resistance, using animal models in which key genes are knocked out and comparing gene expression between wildtype (normal) and knockout mice. Data set A (MAS5; 12000 genes on each array): 3 wildtype mice compared with 3 mice with Cd36 knocked out. Data set B (RMA; 22700 genes on each array): 8 wildtype mice compared with 8 knockout mice.

32 [Diagram: start with given point estimates of expression for condition 1 and condition 2, each with a hierarchical model of replicate variability and array effect; these lead to the differential expression parameter, analysed through its posterior distribution (flat prior) or through mixture modelling for classification]

33 Exploratory analysis of array effect. Mouse data set A: condition 1 (3 replicates), condition 2 (3 replicates). Spline curves shown. Needs normalisation.

34 Model for Differential Expression Expression-level-dependent normalisation Only few replicates per gene, so share information between genes to estimate variability of gene expression between the replicates To select interesting genes: –Use posterior distribution of quantities of interest, function of, ranks …. –Use mixture prior on the differential expression parameter

35 Bayesian hierarchical model for differential expression
Data: $y_{gsr}$ = log gene expression for gene g, condition s, replicate r
$\alpha_g$ = gene effect; $\delta_g$ = differential effect for gene g between 2 conditions; $\beta_{r(g)s}$ = array effect (expression-level dependent); $\sigma_{gs}^2$ = gene variance
1st level: $y_{g1r} \sim N(\alpha_g - \tfrac{1}{2}\delta_g + \beta_{r(g)1}, \sigma_{g1}^2)$, $y_{g2r} \sim N(\alpha_g + \tfrac{1}{2}\delta_g + \beta_{r(g)2}, \sigma_{g2}^2)$, with $\sum_r \beta_{r(g)s} = 0$ and $\beta_{r(g)s}$ = function of $\alpha_g$, parameters {a} and {b}
2nd level: priors for $\alpha_g$, $\delta_g$, coefficients {a} and {b}; $\sigma_{gs}^2 \sim$ lognormal($\mu_s$, $\tau_s$)

36 Details of array effects (normalisation). Piecewise polynomial with unknown break points: $\beta_{r(g)s}$ = quadratic in $\alpha_g$ for $a_{rs(k-1)} \le \alpha_g \le a_{rs(k)}$, with coefficients ($b_{rsk}^{(1)}$, $b_{rsk}^{(2)}$), k = 1,…, # break points. – Locations of break points not fixed – Must do sensitivity checks on # break points. Joint estimation of array effects and differential expression, in comparison to a 2-step method: – More accurate estimates of array effects – Lower percentage of false positives (simulation study)

37 Mouse data set A: 3 replicate arrays (wildtype mouse data). Model: posterior means $E(\beta_{r(g)s} \mid \text{data})$ v. $E(\alpha_g \mid \text{data})$. Data: $y_{gsr} - E(\alpha_g \mid \text{data})$. For this data set, cubic fits well.

38 Bayesian model checking. Check assumptions on gene variances, e.g. exchangeable variances: what distribution? Predict the sample variance $S_g^{2\,new}$ (a chosen checking function) from the model specification (not using the data for this). Compare the predicted $S_g^{2\,new}$ with the observed $S_g^{2\,obs}$. Bayesian p-value: Prob($S_g^{2\,new} > S_g^{2\,obs}$). The distribution of p-values is approximately Uniform if the model is true (Marshall and Spiegelhalter, 2003). Easily implemented in the MCMC algorithm.
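The check described on this slide might be sketched as below, under several assumptions that are not confirmed by the talk: the gene variances are lognormal with hyperparameters (mu, tau) read as mean and standard deviation on the log scale, posterior draws of those hyperparameters are already available from the MCMC run, and the predicted sample variance of R normal replicates is the new variance times a scaled chi-square. The function name is hypothetical.

```python
import numpy as np

def predictive_pvalues(s2_obs, mu_draws, tau_draws, n_reps, rng):
    """Per-gene estimate of Prob(S2_new > S2_obs) across MCMC draws."""
    s2_obs = np.asarray(s2_obs, dtype=float)
    exceed = np.zeros_like(s2_obs)
    for mu, tau in zip(mu_draws, tau_draws):
        # New gene variance from the second-level (lognormal) distribution,
        # without using that gene's own data
        sigma2_new = rng.lognormal(mean=mu, sigma=tau, size=s2_obs.shape)
        # Predicted sample variance of n_reps normal replicates
        chi2 = rng.chisquare(n_reps - 1, size=s2_obs.shape)
        s2_new = sigma2_new * chi2 / (n_reps - 1)
        exceed += s2_new > s2_obs
    return exceed / len(mu_draws)
```

A histogram of the returned values that is far from flat would then point to a poorly chosen variance distribution.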

39 Data set A

40 Possible statistics for differential expression. $\delta_g$: log fold change. $\delta_g^* = \delta_g / (\sigma_{g1}^2/R_1 + \sigma_{g2}^2/R_2)^{1/2}$: standardised difference. We obtain the posterior distribution of all {$\delta_g$} and/or {$\delta_g^*$}. Can compute directly the posterior probability of genes satisfying a criterion X of interest: $p_{g,X}$ = Prob(g of interest | Criterion X, data). Can compute the distributions of ranks.

41 Data set A: 3 wildtype mice compared to 3 knockout mice (U74A chip), MAS5. Criterion X: gene is of interest if |log fold change| > log(2) and log(overall expression) > 4. The majority of the genes have very small $p_{g,X}$: 90% of genes have $p_{g,X}$ < 0.2. Genes with $p_{g,X}$ > 0.5 (green): 280; $p_{g,X}$ > 0.8 (red): 46. [Figure: plot of log fold change versus overall expression level; one highlighted gene has $p_{g,X}$ = 0.49] Genes with low overall expression have a greater range of fold change than those with higher expression.
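Given MCMC output, the posterior probability for a criterion of this kind and the rank intervals are simple summaries over the draws. The sketch below assumes two arrays of shape (draws, genes) holding the sampled log fold changes and overall expression levels; the array layout and function names are illustrative only.

```python
import numpy as np

def prob_of_interest(delta_draws, overall_draws, fc_cut=np.log(2), expr_cut=4.0):
    """Fraction of draws in which each gene satisfies criterion X."""
    delta_draws = np.asarray(delta_draws, dtype=float)
    overall_draws = np.asarray(overall_draws, dtype=float)
    hit = (np.abs(delta_draws) > fc_cut) & (overall_draws > expr_cut)
    return hit.mean(axis=0)

def rank_intervals(delta_draws, probs=(2.5, 97.5)):
    """Credibility interval for the rank of each gene (1 = smallest delta)."""
    delta_draws = np.asarray(delta_draws, dtype=float)
    # Rank the genes within every draw, then summarise the ranks across draws
    ranks = delta_draws.argsort(axis=1).argsort(axis=1) + 1
    lower, upper = np.percentile(ranks, probs, axis=0)
    return lower, upper
```

Because the criterion is evaluated draw by draw, the uncertainty in both the fold change and the expression level is carried into the resulting probability.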

42 Experiment: 8 wildtype mice compared to 8 knockout mice, RMA. Criterion X: gene is of interest if |log fold change| > log(1.5). The majority of the genes have very small $p_{g,X}$: 97% of genes have $p_{g,X}$ < 0.2. Genes with $p_{g,X}$ > 0.5 (green): 292; $p_{g,X}$ > 0.8 (red): 139. [Figure: plot of log fold change versus overall expression level]

43 Posterior probabilities and log fold change Data set A : 3 replicates MAS5 Data set B : 8 replicates RMA

44 Credibility intervals for ranks 100 genes with lowest rank (most under/ over expressed) Low rank, high uncertainty Low rank, low uncertainty Data set B

45 Using the posterior distribution of $\delta_g^*$ (standardised difference). Compute Probability(|$\delta_g^*$| > 2 | data): Bayesian analogue of a t test! Order genes and select genes such that Probability(|$\delta_g^*$| > 2 | data) > cut-off (in blue). By comparison, additional genes selected by a standard t test with p-value < 5% are in red.

46 Part 4 Introduction A fully Bayesian gene expression index Differential expression and array effects Mixture models –Classification for differential expression –Bayesian estimate of False Discovery Rates –CGH arrays: models including information on clones spatial location on chromosome Discussion

47 Mixture and Bayesian estimation of false discovery rates Natalia Bochkina, Philippe Broët Alex Lewin, SR

48 Multiple testing problem. Gene lists can be built by computing a criterion separately for each gene and ranking. Thousands of genes are considered simultaneously. How to assess the performance of such lists? Statistical challenge: select interesting genes without including too many false positives in a gene list. A gene is a false positive if it is included in the list when it is truly unmodified under the experimental set-up. Want an evaluation of the expected false discovery rate (FDR).

49 Bayesian estimate of FDR. Step 1: Choose a gene specific parameter (e.g. $\delta_g$) or a gene statistic. Step 2: Model its prior (resp. marginal) distribution using a mixture model, with one component to model the unaffected genes (null hypothesis), e.g. point mass at 0 for $\delta_g$, and other components to model (flexibly) the alternative. Step 3: Calculate the posterior probability for any gene of belonging to the unmodified component: $p_{g0}$ | data. Step 4: Evaluate FDR (and FNR) for any list, assuming that all the gene classifications are independent (Broët et al 2004): Bayes FDR(list) | data = $\frac{1}{\mathrm{card}(\mathrm{list})} \sum_{g \in \mathrm{list}} p_{g0}$
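Step 4 amounts to averaging posterior null probabilities over the list. A minimal sketch follows, assuming the $p_{g0}$ values are already available as an array; the FNR estimate (average of $1 - p_{g0}$ over genes left out of the list) and the helper for choosing a list at a target FDR are natural companions written for illustration, not formulas quoted from the slide.

```python
import numpy as np

def bayes_fdr(p_null, in_list):
    """Estimated FDR of a list: mean posterior null probability of its genes."""
    p_null = np.asarray(p_null, dtype=float)
    return p_null[np.asarray(in_list, dtype=bool)].mean()

def bayes_fnr(p_null, in_list):
    """Estimated FNR: mean posterior non-null probability of excluded genes."""
    p_null = np.asarray(p_null, dtype=float)
    return (1.0 - p_null[~np.asarray(in_list, dtype=bool)]).mean()

def list_for_target_fdr(p_null, target):
    """Largest list (genes with smallest p_null first) with estimated FDR <= target."""
    p_null = np.asarray(p_null, dtype=float)
    order = np.argsort(p_null)
    running_fdr = np.cumsum(p_null[order]) / np.arange(1, p_null.size + 1)
    size = int(np.sum(running_fdr <= target))  # running mean is non-decreasing
    return order[:size]
```

Fixing the FDR and reading off the list size, as in `list_for_target_fdr`, is one of the two cut-off strategies mentioned in the talk; the other is the Bayes rule on $1 - p_{g0}$.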

50 Mixture framework for differential expression. To obtain a gene list, a commonly used method (cf Lönnstedt & Speed 2002, Newton 2003, Smyth 2003, …) is to define a mixture prior for $\delta_g$: $H_0$: $\delta_g = 0$, point mass at 0 with probability $p_0$; $H_1$: $\delta_g \sim$ flexible 2-sided distribution to model the pattern of differential expression. Classify each gene according to its posterior probability of not being in the null: $1 - p_{g0}$. Use Bayes rule or fix the FDR to get a cutoff.

51 Mixture prior for differential expression In full Bayesian framework, introduce latent allocation variable z g to help computations Joint estimation of all the mixture parameters (including p 0 ) avoids plugging-in of values (e.g. p 0 ) that are influential on the classification Sensitivity to prior settings of the alternative distribution Performance has been tested on simulated data sets Poster by Natalia Bochkina

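52 Performance of the mixture prior
$y_{g1r} = \alpha_g - \tfrac{1}{2}\delta_g + \varepsilon_{g1r}$, r = 1,…,$R_1$; $y_{g2r} = \alpha_g + \tfrac{1}{2}\delta_g + \varepsilon_{g2r}$, r = 1,…,$R_2$ (for simplification, we assume that the data have been pre-normalised)
Var($\varepsilon_{gsr}$) = $\sigma_{gs}^2 \sim IG(a_s, b_s)$
$\delta_g \sim p_0\,\delta_0 + p_1\,G(1.5, \lambda_1) + p_2\,G(1.5, \lambda_2)$, where the point mass $\delta_0$ is $H_0$ and the two Gamma components are $H_1$
Dirichlet distribution for ($p_0$, $p_1$, $p_2$); exponential hyperprior for $\lambda_1$ and $\lambda_2$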

53 Estimation. Estimation of all parameters combines information from biological replicates and between-condition contrasts:
$s_{gs}^2 = \frac{1}{R_s} \sum_r (y_{gsr} - y_{gs\cdot})^2$, s = 1,2: within-condition biological variability
$y_{gs\cdot} = \frac{1}{R_s} \sum_r y_{gsr}$: average expression over replicates
$\tfrac{1}{2}(y_{g1\cdot} + y_{g2\cdot})$: average expression over conditions
$\tfrac{1}{2}(y_{g1\cdot} - y_{g2\cdot})$: between-conditions contrast

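54 DAG for the mixture model, g = 1:G. [Diagram nodes: hyperparameters ($a_1$, $b_1$) and ($a_2$, $b_2$); variances $\sigma_{g1}^2$, $\sigma_{g2}^2$ with sample variances $s_{g1}^2$, $s_{g2}^2$; gene effect with $\tfrac{1}{2}(y_{g1\cdot} + y_{g2\cdot})$; $\delta_g$ with $\tfrac{1}{2}(y_{g1\cdot} - y_{g2\cdot})$; allocation $z_g$; mixture weights p and mixture component parameters]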

55 Simulated data. $y_{gr} \sim N(\delta_g, \sigma_g^2)$ (8 replicates); $\sigma_g^2 \sim IG(1.5, 0.05)$; $\delta_g \sim (-1)^{\mathrm{Bern}(0.5)}\, G(2,2)$ for g = 1:200; $\delta_g = 0$ for g = 201:1000. Choice of simulation parameters inspired by estimates found in analyses of biological data sets. [Figure: plot of the true differences]
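A data set of this shape could be generated as follows. The parameterisations are assumptions, since the slide does not state them: G(2,2) is read as shape 2 and rate 2, and IG(1.5, 0.05) as the distribution of the reciprocal of a Gamma variable with shape 1.5 and rate 0.05. The function name is hypothetical.

```python
import numpy as np

def simulate_differences(rng, n_genes=1000, n_de=200, n_reps=8):
    """Simulate replicated differences y[g, r] ~ N(delta_g, sigma2_g)."""
    # Inverse gamma variances: reciprocal of Gamma(shape=1.5, rate=0.05)
    sigma2 = 1.0 / rng.gamma(shape=1.5, scale=1.0 / 0.05, size=n_genes)
    # First n_de genes are differentially expressed, with a random sign
    delta = np.zeros(n_genes)
    sign = np.where(rng.random(n_de) < 0.5, -1.0, 1.0)
    delta[:n_de] = sign * rng.gamma(shape=2.0, scale=1.0 / 2.0, size=n_de)
    y = rng.normal(delta[:, None], np.sqrt(sigma2)[:, None], size=(n_genes, n_reps))
    return y, delta, sigma2
```

Calling `simulate_differences(np.random.default_rng(1))` returns the data matrix together with the true differences and variances, which is what is needed to compare observed and estimated error rates.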

56 Important feature: observed and estimated FDR/FNR correspond well. [Figure: FDR (black) and FNR (blue) as a function of Post Prob(g ∈ $H_1$) = $1 - p_{g0}$, with the Bayes rule cut-off marked]

57 Comparison of mixture classification and posterior probabilities for $\delta_g^*$ (standardised differences). [Figure: Probability(|$\delta_g^*$| > 2 | data) against Post Prob(g ∈ $H_1$); in red, the 200 genes with $\delta_g \ne 0$] 31 = 4% false negatives; 10 = 6% false positives.

58 Wrongly classified by mixture: truly dif. expressed, truly not dif. expressed Classification errors are on the borderline: Confusion between size of fold change and biological variability

59 Another simulation Can we improve estimation of within condition biological variability ? 2628 data points Many points added on borderline: classification errors in red

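60 DAG for the mixture model, g = 1:G. [Diagram nodes: hyperparameters ($a_1$, $b_1$) and ($a_2$, $b_2$); variances $\sigma_{g1}^2$, $\sigma_{g2}^2$ with sample variances $s_{g1}^2$, $s_{g2}^2$; gene effect with $\tfrac{1}{2}(y_{g1\cdot} + y_{g2\cdot})$; $\delta_g$ with $\tfrac{1}{2}(y_{g1\cdot} - y_{g2\cdot})$; allocation $z_g$; mixture weights p and mixture component parameters] The variance estimates are influenced by the mixture parameters. Use only partial information from the replicates to estimate $\sigma_{gs}^2$ and feed forward into the mixture?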

61 Mixture, full vs partial. Work in progress. Classification altered for 57 points: 46 data points with improved classification when feedback from the mixture is cut; 11 data points with changed but new incorrect classification.

62 Mixture models in CGH array experiments. Philippe Broët, SR; Curie Institute oncology department. CGH = Comparative Genomic Hybridization between fluorescein-labelled normal and pathologic samples to an array containing clones designed to cover certain areas of the genome.

63 Aim: study genomic alterations. In oncology, where carcinogenesis is associated with complex chromosomal alterations, CGH arrays can be used for detailed analysis of genomic changes in copy number (gains or losses of genetic information) in the tumor sample. Amplification of an oncogene or deletion of a tumor suppressor gene are considered important mechanisms for tumorigenesis. [Diagram: loss: tumor suppressor gene; gain: oncogene]

64 Specificity of CGH array experiment A priori biological knowledge from conventional CGH : Limited number of states for a sequence : - presence, - deletion, - gain(s) corresponding to different intensity ratios on the slide Mixture model to capture the underlying discrete states Clones located contiguously on chromosomes are likely to carry alterations of the same type Use clone spatial location in the allocation model Some CGH custom array experiments target restricted areas of the genome Large proportion of genomic alterations are expected

65 3 component mixture model with spatial allocation (work in progress)
$y_{gr} \sim N(\theta_g, \sigma_g^2)$: normal versus tumoral change, clone g, replicate measure r
$\theta_g \sim w_{g0} N(\mu_0, \sigma_0^2) + w_{g1} N(\mu_1, \sigma_1^2) + w_{g2} N(\mu_2, \sigma_2^2)$ (components: presence, deletion, gain)
$\mu_0$: known central estimate obtained from reference clones
Introduce centred spatial autoregressive Markov random fields {$u_g^0$}, {$u_g^1$}, {$u_g^2$} with nearest neighbours along the chromosomes (spatial neighbours of g: g-1 and g+1)
Define mixture proportions to depend on the chromosomic location via a logistic model: $w_{gk} = \exp(u_g^k) / \sum_m \exp(u_g^m)$, which favours allocation of nearby clones to the same component
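The logistic link between the spatial fields and the mixture proportions can be written in a few lines. This sketch only covers the mapping from field values to weights (not the Markov random field prior itself), and assumes the fields are stored as a clones x components array; the function name is hypothetical.

```python
import numpy as np

def mixture_weights(u):
    """w[g, k] = exp(u[g, k]) / sum_m exp(u[g, m]) for each clone g."""
    u = np.asarray(u, dtype=float)
    # Subtracting the row maximum leaves the ratios unchanged and avoids overflow
    z = u - u.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```

Since neighbouring clones have similar field values under a nearest-neighbour prior, their weight vectors are similar too, which is how the model favours allocating nearby clones to the same component.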

66 Curie Institute CGH platform. Focus on investigating deletion areas on chromosome 1 (tumour suppressor locus). Data on 190 clones. [Figure: clone values with reference value $\mu_0$ = -0.11; regions marked "Deletion?" and "Presence?"]

67 Mixture model posterior probability p of clone being deleted Classification with cut-off at p 0.8 Short arm

68 Summary
Bayesian gene expression measure (BGX): good range of resolution, provides credibility intervals.
Differential expression: expression-level-dependent normalisation; borrow information across genes for variance estimation; gene lists based on posterior probabilities or mixture classification.
False discovery rate: mixture gives good estimate of FDR and classifies.
Flexibility to incorporate a priori biological features, e.g. dependence on chromosomic location.
Future work: mixture prior on BGX index, with uncertainty propagated to mixture parameters; comparison of marginal and prior mixture approaches; clustering of profiles for more general experimental set-ups.

69 Thanks. Papers and technical reports:
Hein A.M., Richardson S., Causton H., Ambler G. and Green P. (2004) BGX: a fully Bayesian gene expression index for Affymetrix GeneChip data (to appear in Biostatistics).
Lewin A., Richardson S., Marshall C., Glazier A. and Aitman T. (2003) Bayesian Modelling of Differential Gene Expression (under revision for Biometrics).
Broët P., Lewin A., Richardson S., Dalmasso C. and Magdelenat H. (2004) A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments. Bioinformatics 20, 2562-2571.
Broët P., Richardson S. and Radvanyi F. (2002) Bayesian Hierarchical Model for Identifying Changes in Gene Expression from Microarray Experiments. Journal of Computational Biology 9, 671-683.
Available at http://www.bgx.org.uk/
