1 Alex Lewin Centre for Biostatistics Imperial College, London Joint work with Natalia Bochkina, Sylvia Richardson BBSRC Exploiting Genomics grant Mixture models for classifying differentially expressed genes
2 Modelling differential expression
Many different methods/models for differential expression:
– t-test
– t-test with stabilised variances (empirical Bayes, EB)
– Bayesian hierarchical models
– mixture models
Choice whether to model the alternative hypothesis or not
Our model:
– Model the alternative hypothesis
– Fully Bayesian
3 Mixture model features
– Gene means and fold differences: linear model on the log scale
– Gene variances: borrow information across genes by assuming exchangeable variances
– Mixture prior on fold difference parameters
– Point mass prior for null hypothesis
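4 Fully Bayesian mixture model for differential expression
1st level:
$y_{g1r} \mid \alpha_g, d_g, \sigma_{g1} \sim N(\alpha_g - \tfrac{1}{2} d_g,\ \sigma_{g1}^2)$
$y_{g2r} \mid \alpha_g, d_g, \sigma_{g2} \sim N(\alpha_g + \tfrac{1}{2} d_g,\ \sigma_{g2}^2)$
2nd level:
$\sigma_{gs}^2 \mid a_s, b_s \sim IG(a_s, b_s)$
$d_g \sim \pi_0 \delta_0 + \pi_1 G_-(1.5, \eta_1) + \pi_2 G_+(1.5, \eta_2)$
(the point mass $\pi_0 \delta_0$ represents $H_0$; the Gamma components $G_-$ and $G_+$ are the explicit modelling of the alternative)
3rd level:
Gamma hyperprior for $\eta_1, \eta_2, a_s, b_s$
Dirichlet distribution for $(\pi_0, \pi_1, \pi_2)$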
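To make the second-level mixture prior concrete, here is a minimal Python sketch that draws fold differences from a three-component mixture of this form (point mass at zero, a Gamma reflected onto the negative axis, and a Gamma on the positive axis). The function name, the weights and the rate values are illustrative assumptions, not values from the talk, and the sketch assumes the second Gamma argument is a rate.

```python
import random

def sample_fold_difference(weights, rate_neg, rate_pos, rng):
    """Draw one d_g from w0 * (point mass at 0) + w1 * G_-(1.5, rate_neg) + w2 * G_+(1.5, rate_pos).

    Assumption: the second Gamma argument is a rate, so scale = 1 / rate.
    """
    component = rng.choices([0, 1, 2], weights=weights)[0]
    if component == 0:
        return 0.0  # null gene: fold difference is exactly zero
    if component == 1:
        # under-expressed: Gamma draw reflected onto the negative axis
        return -rng.gammavariate(1.5, 1.0 / rate_neg)
    # over-expressed: Gamma draw on the positive axis
    return rng.gammavariate(1.5, 1.0 / rate_pos)

# Hypothetical weights and rates, chosen only for illustration.
rng = random.Random(0)
draws = [sample_fold_difference((0.8, 0.1, 0.1), 2.0, 2.0, rng) for _ in range(10000)]
frac_null = sum(d == 0.0 for d in draws) / len(draws)
print(f"fraction of draws exactly zero: {frac_null:.3f}")
```

Because the null component is a point mass, a draw is exactly zero only when the gene comes from the null, which is what lets the model separate "no change" from "small change".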
5 Decision Rules
In the full Bayesian framework, introduce a latent allocation variable $z_g$, with $z_g = 0$ if gene $g$ is in the null and $z_g = 1$ if it is in the alternative.
For each gene, calculate the posterior probability of belonging to the unmodified component: $p_g = \Pr(z_g = 0 \mid \text{data})$
Classify using a cut-off on $p_g$ (Bayes rule corresponds to 0.5).
For any given cut-off on $p_g$, can estimate FDR, FNR.
For gene list $S$: est. (FDR | data) $= \sum_{g \in S} p_g / |S|$
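The decision rule can be sketched in a few lines of Python. The FDR estimate is the formula on the slide (the mean of $p_g$ over the declared list). The FNR estimate used here, the mean of $1 - p_g$ over the genes not declared, is an assumed counterpart that the slide does not spell out. The function name and the example probabilities are hypothetical.

```python
def classify_and_estimate(p_null, cutoff=0.5):
    """Classify genes by a cut-off on p_g = Pr(z_g = 0 | data).

    Returns (declared indices, estimated FDR, estimated FNR).
    The FNR formula is an assumption: mean of (1 - p_g) over genes not declared.
    """
    declared = [g for g, p in enumerate(p_null) if p < cutoff]
    not_declared = [g for g, p in enumerate(p_null) if p >= cutoff]
    # est. FDR = sum of p_g over the declared list, divided by its size
    est_fdr = sum(p_null[g] for g in declared) / len(declared) if declared else 0.0
    # est. FNR = sum of (1 - p_g) over the remaining genes, divided by their number
    est_fnr = (sum(1.0 - p_null[g] for g in not_declared) / len(not_declared)
               if not_declared else 0.0)
    return declared, est_fdr, est_fnr

# Hypothetical posterior probabilities for six genes.
p_null = [0.01, 0.05, 0.30, 0.70, 0.95, 0.99]
declared, est_fdr, est_fnr = classify_and_estimate(p_null, cutoff=0.5)
print(declared, round(est_fdr, 3), round(est_fnr, 3))
```

Lowering the cut-off shortens the list and reduces the estimated FDR at the cost of a higher estimated FNR, so the cut-off can be chosen to balance the two.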
6 Simulation Study
Explore performance of the fully Bayesian mixture model in different situations:
– Non-standard distribution of DE genes
– Small number of DE genes
– Small number of replicate arrays
– Asymmetric distributions of over- and under-expressed genes
Simulated data: 50 simulated data sets for each of several different set-ups.
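7 Simulation Study
[…] genes, 8 replicates in each experimental condition
$d_g \sim \pi_0 \delta_0 + \pi_1 \left( \nu\, \mathrm{Unif}() + (1 - \nu)\, N() \right) + \pi_2 \left( \nu\, \mathrm{Unif}() + (1 - \nu)\, N() \right)$
$\sigma_{gs} \sim \mathrm{logNorm}(-1.8, 0.5)$ (logNorm based on data)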
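A simulated data set of this kind could be generated roughly as follows. This is a sketch under stated assumptions, not the set-up used in the talk: the number of genes, the mixture weights, the uniform range, the normal mean and standard deviation, and the distribution of the gene means are all hypothetical. It also assumes that logNorm(-1.8, 0.5) gives the mean and standard deviation of the log of the per-condition standard deviation, and that the first non-null component is the under-expressed one.

```python
import random

def simulate_dataset(n_genes, n_reps, weights, unif_weight, rng):
    """Simulate two-condition log-expression data with a mixture on fold differences."""
    genes = []
    for _ in range(n_genes):
        comp = rng.choices([0, 1, 2], weights=weights)[0]
        if comp == 0:
            d = 0.0  # null gene
        else:
            # DE gene: mixture of a uniform and a normal for the effect size
            if rng.random() < unif_weight:
                size = rng.uniform(0.1, 1.5)   # hypothetical range
            else:
                size = rng.gauss(1.5, 0.5)     # hypothetical mean and sd
            d = -size if comp == 1 else size   # assumed: comp 1 under-, comp 2 over-expressed
        alpha = rng.gauss(7.0, 1.0)            # hypothetical gene mean
        # one standard deviation per gene and condition (assumed parameterisation)
        sigma = [rng.lognormvariate(-1.8, 0.5) for _ in range(2)]
        y1 = [rng.gauss(alpha - 0.5 * d, sigma[0]) for _ in range(n_reps)]
        y2 = [rng.gauss(alpha + 0.5 * d, sigma[1]) for _ in range(n_reps)]
        genes.append({"d": d, "y1": y1, "y2": y2})
    return genes

sim = simulate_dataset(n_genes=1000, n_reps=8, weights=(0.8, 0.1, 0.1),
                       unif_weight=0.3, rng=random.Random(1))
n_null = sum(g["d"] == 0.0 for g in sim)
print(n_null, "of", len(sim), "simulated genes are null")
```

Keeping the true `d` alongside the data makes it possible to compute true error rates for any gene list produced by the model.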
8 Non-standard distributions of DE genes
[Figure: three panels showing the simulated distributions of DE genes with Gamma distributions superimposed; panel settings = 0.3, 0.5, 0.8; $\pi_0$ = 0.8. Each panel reports Av. est. $\pi_0$ = […] ± […].]
9 Small number of DE genes / Small number of replicate arrays
True $\pi_0$ = 0.95; True $\pi_0$ = […]
– […] replicates: Av. FDR = 7.0 %, Av. FNR = 2.0 %, Av. est. $\pi_0$ = […] ± […]
– […] replicates: Av. FDR = 17.9 %, Av. FNR = 3.6 %, Av. est. $\pi_0$ = […] ± […]
– […] replicates: Av. FDR = 9.2 %, Av. FNR = 0.6 %, Av. est. $\pi_0$ = […] ± […]
– […] replicates: Av. FDR = 17.6 %, Av. FNR = 0.9 %, Av. est. $\pi_0$ = […] ± 0.007
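10 Asymmetric distributions of over/under-expressed genes
True $\pi_0$ = 0.9, True $\pi_1$ = 0.09, True $\pi_2$ = 0.01
Av. est. $\pi_0$ = […] ± […], Av. est. $\pi_1$ = […] ± […], Av. est. $\pi_2$ = […] ± […]
$d_g \sim \pi_0 \delta_0 + \pi_1 \left( 0.6\, \mathrm{Unif}(0.01, 1.7) + 0.4\, N(1.7, 0.8) \right) + \pi_2 \left( 0.6\, \mathrm{Unif}(-0.7, \ldots) + 0.4\, N(-0.7, 0.8) \right)$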
11 Additional Checks
1) FDR / FNR can be estimated well
[Figure: True FDR against Est. FDR; True FNR against Est. FNR]
2) Model works when there are no DE genes
50 simulations of same set-up: Av. est. $\pi_0$ = […]. No genes are declared to be DE.
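In a simulation the null status of every gene is known, so the true error rates of a declared list can be computed directly and set against the model-based estimates. The sketch below shows one way to do that. The function name and the toy labels are hypothetical, and FNR is taken here to mean false negatives divided by the number of genes not declared, which is an assumption about the definition used.

```python
def true_error_rates(is_null, declared):
    """Compute true FDR and FNR for a declared gene list when the truth is known.

    is_null[g]  : True if gene g was simulated from the null
    declared[g] : True if gene g was declared DE
    Assumed definition: FNR = false negatives / number of genes not declared.
    """
    n_declared = sum(declared)
    n_not_declared = len(declared) - n_declared
    false_pos = sum(1 for n, d in zip(is_null, declared) if n and d)
    false_neg = sum(1 for n, d in zip(is_null, declared) if not n and not d)
    fdr = false_pos / n_declared if n_declared else 0.0
    fnr = false_neg / n_not_declared if n_not_declared else 0.0
    return fdr, fnr

# Toy example: six genes, three declared DE, one of them truly null.
is_null = [True, True, True, False, False, True]
declared = [False, False, True, True, True, False]
fdr, fnr = true_error_rates(is_null, declared)
print(fdr, fnr)
```

With no declared genes the function returns an FDR of 0.0 by convention, which matches the "no DE genes" check where the list is empty.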
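12 Comparison with conjugate mixture prior
Replace $d_g \sim \pi_0 \delta_0 + \pi_1 G_-(1.5, \eta_1) + \pi_2 G_+(1.5, \eta_2)$
with $d_g \sim \pi_0 \delta_0 + \pi_1 N(0, c\, \sigma_g^2)$
NB: We estimate both $c$ and $\pi_0$ in a fully Bayesian way.
[Table: True $\pi_0$ | Est. $\pi_0$ with Gamma prior | Est. $\pi_0$ with conjugate prior. Entries: […] ± […], the last being […] ± 0.001.]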
13 Application to Mouse data
Mouse wildtype (WT) and knock-out (KO) data (Affymetrix)
~[…] genes, 8 replicates in each of WT and KO
Gamma prior: Est. $\pi_0$ = […] ± […]; declares 59 genes DE
14 Summary
Good performance of fully Bayesian mixture model:
– can estimate proportion of DE genes in a variety of situations
– accurate estimation of FDR / FNR
Different mixture priors give similar classification results
Gives reasonable results for real data