Alex Lewin Sylvia Richardson (IC Epidemiology) Tim Aitman (IC Microarray Centre) Philippe Broët (INSERM, Paris) In collaboration with Anne-Mette Hein,

Slides:



Advertisements
Similar presentations
1 Needles in Haystacks: Are There Any? How Many Are There? Where Are They? John Rice University of California, Berkeley.
Advertisements

C82MST Statistical Methods 2 - Lecture 2 1 Overview of Lecture Variability and Averages The Normal Distribution Comparing Population Variances Experimental.
Lecture 2 ANALYSIS OF VARIANCE: AN INTRODUCTION
Lewin A 1, Richardson S 1, Marshall C 1, Glazier A 2 and Aitman T 2 (2006), Biometrics 62, : Imperial College Dept. Epidemiology 2: Imperial College.
STATISTICAL TOOLS FOR SYNTHESIZING LISTS OF DIFFERENTIALLY EXPRESSED FEATURES IN MICROARRAY EXPERIMENTS Marta Blangiardo and Sylvia Richardson 1 1 Centre.
Estimating the False Discovery Rate in Multi-class Gene Expression Experiments using a Bayesian Mixture Model Alex Lewin 1, Philippe Broët 2 and Sylvia.
Model checks for complex hierarchical models Alex Lewin and Sylvia Richardson Imperial College Centre for Biostatistics.
1 Alex Lewin Centre for Biostatistics Imperial College, London Joint work with Natalia Bochkina, Sylvia Richardson BBSRC Exploiting Genomics grant Mixture.
1 Modelling of CGH arrays experiments Philippe Broët Faculté de Médecine, Université de Paris-XI Sylvia Richardson Imperial College London CGH = Competitive.
BGX 1 Sylvia Richardson Centre for Biostatistics Imperial College, London Statistical Analysis of Gene Expression Data In collaboration with Natalia Bochkina,
Bayesian mixture models for analysing gene expression data Natalia Bochkina In collaboration with Alex Lewin, Sylvia Richardson, BAIR Consortium Imperial.
Structured statistical modelling of gene expression data Peter Green (Bristol) Sylvia Richardson, Alex Lewin, Anne-Mette Hein (Imperial) with Clare Marshall,
Alex Lewin (Imperial College) Sylvia Richardson (IC Epidemiology) Tim Aitman (IC Microarray Centre) In collaboration with Anne-Mette Hein, Natalia Bochkina.
Model checking in mixture models via mixed predictive p-values Alex Lewin and Sylvia Richardson, Centre for Biostatistics, Imperial College, London Mixed.
1 Sylvia Richardson Centre for Biostatistics Imperial College, London Bayesian hierarchical modelling of genomic data In collaboration with Natalia Bochkina,
1 Sylvia Richardson Centre for Biostatistics Imperial College, London Bayesian hierarchical modelling of gene expression data In collaboration with Natalia.
BGX 1 Sylvia Richardson Natalia Bochkina Alex Lewin Centre for Biostatistics Imperial College, London Bayesian inference in differential expression experiments.
Linear Models for Microarray Data
Chapter 4 Inference About Process Quality
1 MAXIMUM LIKELIHOOD ESTIMATION OF REGRESSION COEFFICIENTS X Y XiXi 11  1  +  2 X i Y =  1  +  2 X We will now apply the maximum likelihood principle.
Insert Date HereSlide 1 Using Derivative and Integral Information in the Statistical Analysis of Computer Models Gemma Stephenson March 2007.
Statistical Inferences Based on Two Samples
DTU Informatics Introduction to Medical Image Analysis Rasmus R. Paulsen DTU Informatics TexPoint fonts.
Chapter 18: The Chi-Square Statistic
Experimental Design and Analysis of Variance
Multiple testing and false discovery rate in feature selection
Comments on Hierarchical models, and the need for Bayes Peter Green, University of Bristol, UK IWSM, Chania, July 2002.
Bayesian inference “Very much lies in the posterior distribution” Bayesian definition of sufficiency: A statistic T (x 1, …, x n ) is sufficient for 
CHAPTER 21 Inferential Statistical Analysis. Understanding probability The idea of probability is central to inferential statistics. It means the chance.
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
Chromatin Immuno-precipitation (CHIP)-chip Analysis
Microarray technology and analysis of gene expression data Hillevi Lindroos.
Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005.
Introduction to DNA Microarrays Todd Lowe BME 88a March 11, 2003.
Microarrays Dr Peter Smooker,
DNA Microarray Bioinformatics - #27612 Normalization and Statistical Analysis.
Differentially expressed genes
Sylvia Richardson, with Alex Lewin Department of Epidemiology and Public Health, Imperial College Bayesian modelling of differential gene expression data.
‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns Tim Randolph & Garth Tan Presentation for Stat 593E.
Data analytical issues with high-density oligonucleotide arrays A model for gene expression analysis and data quality assessment.
Different Expression Multiple Hypothesis Testing STAT115 Spring 2012.
Analysis of microarray data
CDNA Microarrays Neil Lawrence. Schedule Today: Introduction and Background 18 th AprilIntroduction and Background 25 th AprilcDNA Mircoarrays 2 nd MayNo.
Multiple testing in high- throughput biology Petter Mostad.
Essential Statistics in Biology: Getting the Numbers Right
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
A A R H U S U N I V E R S I T E T Faculty of Agricultural Sciences Introduction to analysis of microarray data David Edwards.
Lawrence Hunter, Ph.D. Director, Computational Bioscience Program University of Colorado School of Medicine
Statistical Principles of Experimental Design Chris Holmes Thanks to Dov Stekel.
Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public.
Introduction to Statistical Analysis of Gene Expression Data Feng Hong Beespace meeting April 20, 2005.
Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments Presented by Nan Lin 13 October 2002.
1 Global expression analysis Monday 10/1: Intro* 1 page Project Overview Due Intro to R lab Wednesday 10/3: Stats & FDR - * read the paper! Monday 10/8:
Suppose we have T genes which we measured under two experimental conditions (Ctl and Nic) in n replicated experiments t i * and p i are the t-statistic.
Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
ANALYSIS OF GENE EXPRESSION DATA. Gene expression data is a high-throughput data type (like DNA and protein sequences) that requires bioinformatic pattern.
1 Estimation of Gene-Specific Variance 2/17/2011 Copyright © 2011 Dan Nettleton.
Empirical Bayes Analysis of Variance Component Models for Microarray Data S. Feng, 1 R.Wolfinger, 2 T.Chu, 2 G.Gibson, 3 L.McGraw 4 1. Department of Statistics,
BIOSTATISTICS Hypotheses testing and parameter estimation.
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 6 –Multiple hypothesis testing Marshall University Genomics.
Distinguishing active from non active genes: Main principle: DNA hybridization -DNA hybridizes due to base pairing using H-bonds -A/T and C/G and A/U possible.
Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.
Microarray: An Introduction
Estimation of Gene-Specific Variance
Bayesian Semi-Parametric Multiple Shrinkage
Differential Gene Expression
Functional Genomics in Evolutionary Research
 The human genome contains approximately genes.  At any given moment, each of our cells has some combination of these genes turned on & others.
Alex Lewin (Imperial College) Sylvia Richardson (IC Epidemiology)
Presentation transcript:

Alex Lewin Sylvia Richardson (IC Epidemiology) Tim Aitman (IC Microarray Centre) Philippe Broët (INSERM, Paris) In collaboration with Anne-Mette Hein, Natalia Bochkina (IC Epidemiology) Helen Causton (IC Microarray Centre) Peter Green and Graeme Ambler (Bristol) Bayesian modelling of gene expression data

Contents Introduction to microarrays Differential expression Bayesian mixture estimation of false discovery rate Contents

Introduction to microarrays

Challenge: Identify function of all genes in genome DNA microarrays allow study of thousands of genes simultaneously Post-genome Genetics Research

DNA -> mRNA -> protein Gene Expression Pictures from

Hybridisation Known sequences of single-stranded DNA immobilised on microarray Tissue sample (with unknown concentration of RNA) fluorescently labelled Sample hybridised to array Excess sample washed off array Array scanned to measure amount of RNA present for each sequence on array DNA TGCT RNA ACGA

The Principle of Hybridisation 20µm Millions of copies of a specific oligonucleotide sequence element Image of Hybridised Array Approx. ½ million different complementary oligonucleotides Single stranded, labeled RNA sample Oligonucleotide element * * * * * 1.28c m Hybridised Spot Slide courtesy of Affymetrix Expressed genes Non-expressed genes Zoom Image of Hybridised Array

Output of Microarray Each gene is represented by several different DNA sequences (probes) Obtain intensity for each probe Different tissue samples on different arrays so compare gene expression for different experimental conditions

Differential Expression AL, Sylvia Richardson, Clare Marshall, Anne Glazier, Tim Aitman

Low-level Model (how gene expression is estimated from signal) Normalisation (to make arrays comparable) Differential Expression Clustering Partition Model Microarray analysis is a multi-step process We aim to integrate all the steps in a common statistical framework Differential Expression 1 of 18

Model different sources of variability simultaneously, within array, between array … Share information in appropriate ways to get better estimates, e.g. estimation of gene specific variability. Uncertainty propagated from data to parameter estimates. Incorporate prior information into the model. Bayesian hierarchical model framework Differential Expression 2 of 18

Data Set and Biological question Previous Work (Tim Aitman, Anne Marie Glazier) The spontaneously hypertensive rat (SHR): A model of human insulin resistance syndromes. Deficiency in gene Cd36 found to be associated with insulin resistance in SHR (spontaneously hypertensive rat) Differential Expression 3 of 18

Data Set and Biological question Microarray Data 3 SHR compared with 3 transgenic rats 3 wildtype (normal) mice compared with 3 mice with Cd36 knocked out genes on each array Biological Question Find genes which are expressed differently in wildtype and knockout mice. Differential Expression 4 of 18

Condition 1 (3 replicates) Condition 2 (3 replicates) Needs normalisation Spline curves shown Data Differential Expression 5 of 18

Model for Differential Expression Expression-level-dependent normalisation Only 3 replicates per gene, so share information between genes to estimate gene variances To select interesting genes, use posterior distribution of ranks Differential Expression 6 of 18

Data: y gr = log gene expression for gene g, replicate r g = gene effect r(g) = array effect (expression-level dependent) g 2 = gene variance 1st level y gr N( g + r(g), g 2 ), Σ r r(g) = 0 r(g) = function of g, parameters {a} and {b} Bayesian hierarchical model for genes under one condition Differential Expression 7 of 18

2nd level Priors for g, coefficients {a} and {b} g 2 lognormal (μ, τ) Hyper-parameters μ and τ can be influential. In a full Bayesian analysis, these are not fixed 3rd level μ N( c, d) τ lognormal (e, f) Bayesian hierarchical model for genes under one condition Differential Expression 8 of 18

Details of array effects (Normalisation) Piecewise polynomial with unknown break points: r(g) = quadratic in g for a rk-1 g a rk with coeff (b rk (1), b rk (2) ), k =1, … #breakpoints Locations of break points not fixed Must do sensitivity checks on # break points Cubic fits well for this data Differential Expression 9 of 18

Non linear fit of array effect as a function of gene effect loess cubic Differential Expression 10 of 18

Before (y gr ) After (y gr - r(g) ) WildtypeKnockout Effect of normalisation on density ^ Differential Expression 11 of 18

Variances are estimated using information from all G x R measurements (~12000 x 3) rather than just 3 Variances are stabilised and shrunk towards average variance Smoothing of the gene specific variances Differential Expression 12 of 18

Check our assumption of different variance for each gene Bayesian Model Checking Predict sample variance S g 2 new from the model for each gene Compare predicted S g 2 new with observed S g 2 obs Bayesian p-value Prob( S g 2 new > S g 2 obs ) Distribution of p-values Uniform if model is adequate Easily implemented in MCMC algorithm Differential Expression 13 of 18

Bayesian predictive p-values Exchangeable variance model is supported by the data Control for method: equal variance model has too little variability for the data Differential Expression 14 of 18

Differential expression model d g = differential effect for gene g between 2 conditions Joint model for the 2 conditions : y g1r N( g - ½ d g + r(g)1, g1 2 ), (condition 1) y g2r N( g + ½ d g + r(g)2, g2 2 ), (condition 2) Prior can be put on d g directly Differential Expression 15 of 18

Possible Statistics for Differential Expression d g log fold change d g * = d g / (σ 2 g1 / 3 + σ 2 g2 / 3 ) ½ (standardised difference) We obtain the joint distribution of all {d g } and/or {d g * } Distributions of ranks Differential Expression 16 of 18

Credibility intervals for ranks Differential Expression 17 of 18 Low rank, high uncertainty Low rank, low uncertainty 150 genes with lowest rank (most under- expressed)

Probability statements about ranks Under-expression: probability that gene is ranked in bottom 100 genes Have to choose rank cutoff (here 100) Have to choose how confident we want to be in saying the rank is less than the cutoff (eg prob=80%) Differential Expression 18 of 18

Summary: Differential Expression Expression-level-dependent normalisation Only 3 replicates per gene, so share information between genes to estimate gene variances To select interesting genes, use posterior distribution of ranks

Bayesian estimation of False Discovery Rate Philippe Broët, AL, Sylvia Richardson

Testing thousands of hypotheses simultaneously Traditional methods (Bonferroni) too conservative Challenge: select interesting genes without including too many false positives. Multiple Testing FDR 1 of 16

False Discovery Rate FDR 2 of 16 FDR=E(V/R) Declare negative Declare positive True negative True positive ST VU N-RR m0m0 m1m1 N Bonferroni: FWER=P(V>0)

Storey showed that FDR as a Bayesian Quantity FDR 3 of 16 Storey starts from p-values. We directly estimate posterior probabilities. E(V/R | R>0) = P(truly negative | declare positive)

Storey Estimate of P(null) P-values are Uniform if all genes obey the null hypothesis Estimate P(null) where density of p-values is approximately flat FDR 4 of 16

Storey Estimate of FDR Lists based on ranking genes (ordered rejection regions) List i is all genes with p-value p g <= p i cut For list i, P( declare positive | truly negative ) = p i cut FDR i = P( truly - ) P( declare + | truly - ) / P( declare + ) = P(null) p i cut N/N i FDR 5 of 16

Bayesian Estimate of FDR Classify genes as under-expressed, …, unaffected, …, over-expressed (may be several different levels of over and under-expression) unaffected null hypothesis For each gene, calculate probability of following the null distribution FDR = mean P(gene belonging to null) for genes declared positive FDR 6 of 16

Gene Expression Profiles FDR 7 of 16 Each gene has repeat measurements under several conditions: gene profile Null hypothesis: no change across conditions Summarize profile by F-statistic (one for each gene)

Transformation of F-statistics FDR 8 of 16 Transform F -> D approx. Normal if no change across conditions

Bayesian mixture model Mixture model specification D g ~ w 0 N(0, σ 0 2 ) + j=1:k w j N(μ j, σ j 2 ) μ j ordered, uniform on upper range k, unknown number of components -> alternative is modelled semi-parametrically Results integrated over different values of k FDR 9 of 16 NULLALTERNATIVE

Latent variable z g = 0, 1, …., k with prob w 0, w 1, …,w k P(gene g in null | data) = P(z g = 0 | data) Bayes Estimate of FDR FDR 10 of 16 For any given list L containing N L genes, FDR L = 1/N L Σ g on list L P(gene g belonging to null | data)

Compare Estimates of FDR FDR 11 of 16 Storey FDR i = 1/N i p i cut P(null) N Bayes FDR i = 1/N i Σ list i P(gene in null | data) NB Bayes estimate can be calculated for any list of genes, not just those based on ranking genes

Results for Simulated Data (Ordered lists of genes) Simulation study: 50 simulations of each set of profiles FDR 12 of 16 Usual methods (Storey q-value and SAM) overestimate FDR Bayes mixture estimate of FDR is closer to true value

Study of gene expression changes among 3 types of tumour: BRCA1, BRCA2 and sporadic tumours (Hedenfalk et al). Gene profiles across tumours summarized by F- statistics, transformed to D. Estimate FDR using Bayesian mixture model. Breast Cancer Data FDR 13 of 16

Fit data for all genes (2471) using the Bayesian mixture model Estimate FDR for pre-defined groups of genes with known functions: –apoptosis (26 genes) –cell cycle regulation (21 genes) –cytoskeleton (25 genes) FDR for subsets of genes FDR 14 of 16

FDR 15 of 16 Results for subsets of genes Slide from Philippe Broët Cytoskeleton FDR:66% Cycle regulation FDR:26% Apoptosis FDR:70%

Good estimate of FDR and FNR Semi-parametric model for differentially expressed genes. Obtain posterior probability for each gene. Can calculate FDR, FNR for any list of genes. Summary: FDR FDR 16 of 16

Differential Expression Expression-level-dependent normalisation Borrow information across genes for variances Joint distribution of ranks False Discovery Rate Flexible mixture gives good estimate of FDR Future work Mixture prior on log fold changes, with uncertainty propagated to mixture parameters Summary

Two papers submitted: Lewin, A., Richardson, S., Marshall C., Glazier A. and Aitman T. (2003) Bayesian Modelling of Differential Gene Expression. Broët, P., Lewin, A., Richardson, S., Dalmasso, C. and Magdelenat, H. (2004) A model-based approach for detecting distinctive gene expression profiles in multiclass response microarray experiments. Available at http ://