Published by Amia Leblanc. Modified over 3 years ago.

1
Model checks for complex hierarchical models
Alex Lewin and Sylvia Richardson
Imperial College Centre for Biostatistics

2
Background and Aims
Many complex models are used in bioinformatics.
Classification/clustering can be greatly affected by the choice of distributions.
Our approach: exploit the structure of the model to perform predictive checks:
    hierarchical models generally involve exchangeability assumptions
    mixture models are partially exchangeable

3
Outline of Talk
Mixture model for gene expression data
Model checks for mixture model:
    distribution for gene-specific variances
    different mixture priors
Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

4
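Hierarchical mixture model for gene expression data
Data: paired log differences between 2 conditions
Indices: g = gene, r = replicate, j = mixture component
$y_{gr} \mid \delta_g, \sigma_g \sim N(\delta_g, \sigma_g^2)$, where $\delta_g$ is the differential effect for gene g and $\sigma_g^2$ is the variance for each gene
Various priors for $\delta_g$, $\sigma_g$:
    $\delta_g \mid \eta \sim \sum_j w_j h_j(\eta_j)$
    $\sigma_g^2 \mid \mu, \tau \sim f(\mu, \tau)$
$w \sim \mathrm{Dirichlet}(1, \ldots, 1)$
[Figure: graphical model with nodes $\delta_g$, $\bar{y}_g$, $S_g$, $\sigma_g$, $(\mu, \tau)$, $w_j$, $\eta_j$]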
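The generative structure of the model above can be sketched by simulating from it. This is a minimal illustration, not the authors' code: it assumes a point mass at zero with Uniform alternatives for $h_j$ and a Gamma distribution (shape $\mu$, rate $\tau$) for $f$, and it fixes the weights instead of drawing them from the Dirichlet prior. The function name and arguments are hypothetical, and components are indexed 0, 1, 2 rather than 1, 2, 3.

```python
import random

def simulate_dataset(n_genes, n_rep, weights, eta_minus, eta_plus, mu, tau, seed=0):
    """Simulate genes from a 3-component hierarchical mixture (illustrative choices)."""
    rng = random.Random(seed)
    genes = []
    for _ in range(n_genes):
        # z_g: 0 = under-expressed, 1 = null, 2 = over-expressed
        z = rng.choices([0, 1, 2], weights=weights)[0]
        if z == 0:
            delta = -rng.uniform(0.0, eta_minus)   # assumed h_1: Unif(-eta_minus, 0)
        elif z == 1:
            delta = 0.0                            # assumed h_2: point mass at 0
        else:
            delta = rng.uniform(0.0, eta_plus)     # assumed h_3: Unif(0, eta_plus)
        # sigma_g^2 ~ Gam(shape=mu, rate=tau); gammavariate expects a scale = 1/rate
        sigma2 = rng.gammavariate(mu, 1.0 / tau)
        # y_gr | delta_g, sigma_g ~ N(delta_g, sigma_g^2) for each replicate r
        y = [rng.gauss(delta, sigma2 ** 0.5) for _ in range(n_rep)]
        genes.append({"z": z, "delta": delta, "sigma2": sigma2, "y": y})
    return genes
```

For example, `simulate_dataset(1000, 4, [0.1, 0.8, 0.1], 2.0, 2.0, 2.0, 2.0)` gives 1000 genes with 4 replicates each, about 80% of them in the null component.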

5
Mixture model for gene expression data
Many mixture models have been proposed for gene expression data.
Set-up is similar to a variable selection prior: point mass + alternative distribution.
Particular choices for the alternative:
    Normal (Lönnstedt and Speed)
    Uniform (Parmigiani et al.)
    many others …

6
Mixture model for gene expression data
3-component mixture model, allowing for asymmetry in over- and under-expressed genes:
$\delta_g \mid \eta \sim w_1 h_1(\eta_1) + w_2 h_2(\eta_2) + w_3 h_3(\eta_3)$
Data: 6 knock-out and 5 wildtype mice, MAS5.0 processed data

7
Mixture model for gene expression data
Classify each gene into mixture components using posterior probabilities.
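One common way to do this from MCMC output is to estimate $P(z_g = j \mid \text{data})$ by the fraction of iterations in which gene g is allocated to component j, then assign each gene to its most probable component. The sketch below assumes the sampled allocations are stored as `z_samples[t][g]` with components indexed from 0; both function names are hypothetical.

```python
from collections import Counter

def posterior_probs(z_samples, n_components=3):
    """Estimate P(z_g = j | data) as the fraction of iterations with z_g = j."""
    n_iter = len(z_samples)
    n_genes = len(z_samples[0])
    probs = []
    for g in range(n_genes):
        counts = Counter(z_samples[t][g] for t in range(n_iter))
        probs.append([counts[j] / n_iter for j in range(n_components)])
    return probs

def classify(probs):
    """Assign each gene to the component with the highest posterior probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]
```

Ties are broken in favour of the lowest component index, which is an arbitrary choice for this sketch.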

8
Choice of mixture prior affects classification results
Mixture prior for $\delta_g$ | Est. $w_2$ (proportion in null)
$w_1 \mathrm{Unif}(-\eta_-, 0) + w_2 \delta(0) + w_3 \mathrm{Unif}(0, \eta_+)$ | 0.96
$w_1 \mathrm{Gam}^-(1.5, \eta_-) + w_2 \delta(0) + w_3 \mathrm{Gam}^+(1.5, \eta_+)$ | 0.68
$w_1 \mathrm{Gam}^-(1.5, \eta_-) + w_2 N(0, \varepsilon) + w_3 \mathrm{Gam}^+(1.5, \eta_+)$ | 0.99

9
Outline of Talk
Mixture model for gene expression data
Model checks for mixture model:
    distribution for gene-specific variances
    different mixture priors
Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

10
Predictive model checks
Predict new data from the model.
Use posterior predictive distribution.
Condition on hyperparameters (mixed predictive*, not very conservative).
Get a Bayesian p-value for each gene/marker/sample.
Use all p-values together (100s or 1000s) to assess model fit.
* Gelman, Meng and Stern 1995; Marshall and Spiegelhalter 2003

11
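Checking distribution for gene variances
Bayesian p-value for gene g: $p_g = \mathrm{Prob}(S^{\mathrm{mpred}} > S_g^{\mathrm{obs}} \mid \text{data})$
All genes are exchangeable, so use a histogram of p-values for all genes together.
[Figure: posterior distribution of $S^{\mathrm{mpred}}$ compared with $S_g^{\mathrm{obs}}$]
[Figure: graphical model with nodes $\delta_g$, $\bar{y}_g$, $S_g^{\mathrm{obs}}$, $\sigma_g$, $(\mu, \tau)$, $\sigma^{\mathrm{pred}}$; posterior predictive $S_g^{\mathrm{ppred}}$ and mixed predictive $S^{\mathrm{mpred}}$]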
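Written out, the two predictive distributions named on this slide differ in where the new variance parameter comes from. The following is a sketch that reads the densities off the graphical model; $p(S \mid \sigma)$ is used as shorthand for the sampling density of the variance statistic, and the integral forms are not taken from the slides.

```latex
% Posterior predictive: reuse the gene-specific sigma_g from the posterior
p(S_g^{\mathrm{ppred}} \mid \text{data})
  = \int p(S_g^{\mathrm{ppred}} \mid \sigma_g)\, p(\sigma_g \mid \text{data})\, d\sigma_g

% Mixed predictive: draw a new sigma^pred from the global distribution f,
% conditioning only on the hyperparameters (mu, tau)
p(S^{\mathrm{mpred}} \mid \text{data})
  = \iint p(S^{\mathrm{mpred}} \mid \sigma^{\mathrm{pred}})\,
          f(\sigma^{\mathrm{pred}} \mid \mu, \tau)\,
          p(\mu, \tau \mid \text{data})\, d\sigma^{\mathrm{pred}}\, d(\mu, \tau)
```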

12
Mixed v. posterior predictive
Predictive p-values for data simulated from the model: histograms should be Uniform.
Mixed predictive distribution is much less conservative than posterior predictive.
[Figure: histograms of p-values, two panels: "Using global distribution" and "Using gene-specific distributions"]

13
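Checking different variance models
Model differential expression between 3 transgenic and 3 wildtype mice.
Variance models compared:
    $\sigma_g^2 \mid \mu, \tau \sim \mathrm{Gam}(\mu, \tau)$, $\mu$ fixed
    $\sigma_g^2 \mid \mu, \tau \sim \mathrm{Gam}(\mu, \tau)$
    $\sigma_g^2 \mid \mu, \tau \sim \mathrm{logNorm}(\mu, \tau)$
    $\sigma_g^2 = \sigma^2$ for all genes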

14
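Implementation (MCMC)
    $p_g = 0$
    for $t = 1, \ldots, \mathrm{niter}$ {
        $\sigma_t^{\mathrm{pred}} \sim f(\mu_t, \tau_t)$
        $S_t^{\mathrm{mpred}} \sim \mathrm{Gam}(m, m (\sigma_t^{\mathrm{pred}})^{-2})$
        $p_g \leftarrow p_g + I[S_t^{\mathrm{mpred}} > S_g^{\mathrm{obs}}]$
    }
    $p_g \leftarrow p_g / \mathrm{niter}$
Just two extra parameters predicted at each iteration.
niter = no. MCMC iterations
m = (no. replicates – 1)/2
[Figure: graphical model with nodes $\delta_g$, $\bar{y}_g$, $S_g^{\mathrm{obs}}$, $\sigma_g$, $(\mu, \tau)$, $\sigma^{\mathrm{pred}}$ and mixed predictive $S^{\mathrm{mpred}}$]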
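The loop on this slide can be sketched in Python as a post-processing step over saved hyperparameter draws. This is an illustration under stated assumptions, not the authors' software: `f` is taken to be a Gamma distribution on $\sigma^2$ with shape $\mu_t$ and rate $\tau_t$ (one of the variance models compared in the talk), and the second argument of Gam is read as a rate. The function name and the `hyper_draws` layout are hypothetical.

```python
import random

def mixed_predictive_pvalues(s_obs, hyper_draws, n_rep, seed=0):
    """Estimate p_g = Prob(S_mpred > S_g_obs | data) for every gene.

    s_obs       : observed variance statistic S_g_obs for each gene
    hyper_draws : list of (mu_t, tau_t) posterior draws, one per MCMC iteration
    n_rep       : number of replicates (must be at least 2)
    """
    rng = random.Random(seed)
    m = (n_rep - 1) / 2.0
    counts = [0] * len(s_obs)
    for mu_t, tau_t in hyper_draws:
        # sigma2_pred ~ f(mu_t, tau_t); assumed Gam(shape=mu_t, rate=tau_t)
        sigma2_pred = rng.gammavariate(mu_t, 1.0 / tau_t)
        # S_mpred ~ Gam(m, rate = m / sigma2_pred), i.e. scale = sigma2_pred / m
        s_mpred = rng.gammavariate(m, sigma2_pred / m)
        # The same predicted value is compared with every gene's observed value
        for g, s in enumerate(s_obs):
            if s_mpred > s:
                counts[g] += 1
    return [c / len(hyper_draws) for c in counts]
```

Only `sigma2_pred` and `s_mpred` are drawn per iteration, matching the "two extra parameters" remark. In a real sampler these draws would normally be made inside the MCMC run itself.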

15
Outline of Talk
Mixture model for gene expression data
Model checks for mixture model:
    distribution for gene-specific variances
    different mixture priors
Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

16
Checking mixture prior
$\delta_g \mid \eta \sim w_1 h_1(\eta_1) + w_2 h_2(\eta_2) + w_3 h_3(\eta_3)$
OR
$\delta_g \mid \eta, z_g = j \sim h_j(\eta_j)$, $j = 1, \ldots, 3$, with $P(z_g = j) = w_j$
Model checking: focus on separate mixture components

17
Issues for mixture model checking
$\delta_g \mid \eta, z_g = j \sim h_j(\eta_j)$, $j = 1, \ldots, 3$
Think about MCMC iterations …
    Mixture component is estimated from genes currently assigned to that component.
    Can only define a p-value for a given gene and mixture component when the gene is assigned to that component (i.e. condition on $z_g$ in the p-value).
    So check each component using only the genes currently assigned (i.e. condition on $z_g$ in the histogram).

18
Predictive checks for mixture model
Bayesian p-value for gene g and mixture component j: $p_{gj} = \mathrm{Prob}(\bar{y}_{gj}^{\mathrm{mpred}} > \bar{y}_g^{\mathrm{obs}} \mid \text{data}, z_g = j)$
Genes assigned to the same mixture component are exchangeable, so use a histogram of p-values for each mixture component separately.
Histogram for component j is made only from genes with large $P(z_g = j)$.
[Figure: graphical model with nodes $\delta_g$, $\delta_j^{\mathrm{pred}}$, $w_j$, $\bar{y}_g$, $S_g$, $\bar{y}_{gj}^{\mathrm{mpred}}$, $\sigma_g$, $(\mu, \tau)$, $\eta_j$]

19
Condition on classification to check separate components
Effectively we condition on a best classification.
[Figure: predictive p-values for data simulated from the model, two panels: "All genes with $P(z_g = j) > 0$" and "Only genes with $P(z_g = j) > 0.5$"]
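The gene selection compared on this slide amounts to a threshold on $P(z_g = j)$ before building the histogram for component j. A minimal sketch follows, assuming the p-values and posterior probabilities are already available as nested lists `p[g][j]` and `probs[g][j]`, with `None` marking a p-value that could not be defined. The function names, the data layout and the 10-bin histogram are all hypothetical choices.

```python
def pvalues_for_component(p, probs, j, threshold=0.5):
    """Collect p_gj for genes whose posterior probability of component j exceeds threshold."""
    return [p[g][j] for g in range(len(p))
            if probs[g][j] > threshold and p[g][j] is not None]

def histogram(values, n_bins=10):
    """Count values in [0, 1] into equal-width bins; 1.0 falls in the last bin."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts
```

Setting `threshold=0.0` reproduces the "all genes with $P(z_g = j) > 0$" panel, and the default corresponds to the 0.5 cut-off.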

20
Checking different mixture distributions
$w_1 \mathrm{Unif}(-\eta_-, 0) + w_2 \delta(0) + w_3 \mathrm{Unif}(0, \eta_+)$
Outer mixture components skewed too much away from zero
Null component too narrow

21
Checking different mixture distributions
$w_1 \mathrm{Gam}^-(1.5, \eta_-) + w_2 \delta(0) + w_3 \mathrm{Gam}^+(1.5, \eta_+)$
Outer components skewed the opposite way
Null still too narrow?

22
Checking different mixture distributions
$w_1 \mathrm{Gam}^-(1.5, \eta_-) + w_2 N(0, \varepsilon) + w_3 \mathrm{Gam}^+(1.5, \eta_+)$
Better fit for all components

23
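Implementation
    $p_{gj} = 0$
    for $t = 1, \ldots, \mathrm{niter}$ {
        $\delta_{jt}^{\mathrm{pred}} \sim h_j(\eta_{jt})$, $j = 1, \ldots, 3$
        $\bar{y}_{gt}^{\mathrm{mpred}} \sim N(\delta_{jt}^{\mathrm{pred}}, \sigma_g^2 / \mathrm{nrep})$ for $j = z_{gt}$
        $p_{gj} \leftarrow p_{gj} + I[\bar{y}_{gt}^{\mathrm{mpred}} > \bar{y}_g^{\mathrm{obs}}]$ for $j = z_{gt}$
    }
    $p_{gj} \leftarrow p_{gj} / \mathrm{niter}(z_g = j)$
Need ngenes extra parameters at each iteration.
[Figure: graphical model with nodes $\delta_g$, $\delta_j^{\mathrm{pred}}$, $w_j$, $\bar{y}_g$, $S_g$, $\bar{y}_{gj}^{\mathrm{mpred}}$, $\sigma_g$, $(\mu, \tau)$, $\eta_j$]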
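A Python sketch of this loop, again as post-processing over saved MCMC state, might look as follows. It is an illustration under stated assumptions: each element of `iterations` is a dict holding the current allocations `z`, the current `eta` and the current gene variances `sigma2`, and the sampler for $h_j$ is passed in as a function. All of these names are hypothetical, and components are indexed 0, 1, 2. The example `draw_delta_uniform` uses the Uniform and point-mass prior from the talk purely as one possible choice of $h_j$.

```python
import random

def component_pvalues(ybar_obs, iterations, n_rep, draw_delta, n_components=3, seed=0):
    """Return (p, assigned); p[g][j] estimates Prob(ybar_mpred > ybar_obs[g] | data, z_g = j)."""
    rng = random.Random(seed)
    n_genes = len(ybar_obs)
    exceed = [[0] * n_components for _ in range(n_genes)]
    assigned = [[0] * n_components for _ in range(n_genes)]
    for it in iterations:
        # One predicted effect per component, drawn from h_j at the current eta
        delta_pred = [draw_delta(j, it["eta"], rng) for j in range(n_components)]
        for g in range(n_genes):
            j = it["z"][g]  # component gene g is currently assigned to
            sd = (it["sigma2"][g] / n_rep) ** 0.5
            ybar_mpred = rng.gauss(delta_pred[j], sd)
            assigned[g][j] += 1
            if ybar_mpred > ybar_obs[g]:
                exceed[g][j] += 1
    # Divide by the number of iterations in which gene g was in component j
    p = [[exceed[g][j] / assigned[g][j] if assigned[g][j] else None
          for j in range(n_components)] for g in range(n_genes)]
    return p, assigned

def draw_delta_uniform(j, eta, rng):
    """Illustrative h_j: Unif(-eta_minus, 0), point mass at 0, Unif(0, eta_plus)."""
    eta_minus, eta_plus = eta
    if j == 0:
        return -rng.uniform(0.0, eta_minus)
    if j == 1:
        return 0.0
    return rng.uniform(0.0, eta_plus)
```

A p-value is left as `None` when a gene was never assigned to that component, since it is only defined conditional on $z_g = j$. Dividing `assigned[g][j]` by the number of iterations gives an estimate of $P(z_g = j)$.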

24
Summary of model checking procedure
1. Find part of model where individuals are assumed to be exchangeable (so information is shared).
2. Choose test statistic $T$ (e.g. sample mean or variance).
3. Predict $T^{\mathrm{pred}}$ from distribution for exchangeable individuals (whole posterior for $T^{\mathrm{pred}}$).
4. Compare observed $T_i$ for each individual i to distribution of $T^{\mathrm{pred}}$.
5. For checking mixture components, condition on the best classification.

25
Outline of Talk
Mixture model for gene expression data
Model checks for mixture model:
    distribution for gene-specific variances
    different mixture priors
Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

26
Clustering and variable selection (Tadesse et al. 2005)
$y_i$ = vector of gene expression for each sample, $i = 1, \ldots, n$
Multivariate mixture model for clustering samples:
    $y_i \mid z_i = j \sim \mathrm{MVN}(\zeta_j, \Lambda_j)$, $j = 1, \ldots, J$
    $P(z_i = j) = w_j$
No. of mixture components ($J$) is estimated in the model.
Aim to select genes which are informative for clustering the samples.

27
Indices: i = sample, g = gene, j = mix. component
$\gamma$ = vector of indices of selected variables
complement of $\gamma$ = vector of indices of variables not used to cluster samples
Likelihood conditional on allocation to mixture: [equation not recoverable from the transcript]
Conjugate priors on multivariate means and covariance matrices
$P(\gamma_g = 1) = \varphi$

28
Clustering and variable selection (Tadesse et al. 2005)
Indices: i = sample, g = gene, j = mix. component
Model checking: want to check the distribution for each mixture component separately (conditional on $J$).
In addition, need to condition on a given variable selection.
Clearly impossible computationally.
[Figure: graphical model with nodes $\mu_j(\gamma)$, $\Sigma_j(\gamma)$, $y_i$, $y(\gamma)_j^{\mathrm{pred}}$, $w_j$, $\eta(\gamma)$, $\Omega(\gamma)$, $\varphi$, $J$]

29
Computing predictive p-values
1) Run model with no prediction.
2) Find the best configuration:
    set of selected variables $(\gamma)$
    no. of mixture components $J$
    allocation of samples to mixture components $z_i$
3) Re-run model with $(\gamma)$, $J$ and $z_i$ fixed, calculating predictive p-values:
$p_{ij} = \mathrm{Prob}(T_j^{\mathrm{pred}} > T_i^{\mathrm{obs}} \mid \text{data}, z_i = j, J, (\gamma))$, where $T = |y|^2$ (for example)

30
Conclusions
Choice of model distributions can greatly influence results of clustering and classification.
For models where information is shared across individuals, predictive checks can be used as an alternative to cross-validation.
Should be possible to do this even for quite complex models (if you can fit the model, you can check it).

31
Acknowledgements
Collaborators on BBSRC Exploiting Genomics Grant: Natalia Bochkina, Clare Marshall, Peter Green
Meeting on model checking in Cambridge: David Spiegelhalter, Shaun Seaman
BBSRC Exploiting Genomics Grant
Paper and software at http://www.bgx.org.uk/
