
1 MCQMC 2012 From inference to modelling to algorithms and back again Kerrie Mengersen QUT Brisbane

2 Acknowledgements: BRAG Bayesian methods and models + Fast computation + Applications in environment, health, biology, industry

3 So what’s the problem?

4 Matchmaking 101

5 Study 1: Presence/absence models Sama Low Choy Mark Stanaway

6 Plant biosecurity

7 Observations and data –Visual inspection symptoms –Presence / absence data –Space and time Dynamic invasion process –Growth, spread Inference –Map probability of extent over time –Useful scale for managing trade / eradication –Currently use informal qualitative approach Hierarchical Bayesian model to formalise the information From inference to model

8 Hierarchical Bayesian model for plant pest spread Data Model: Pr(data | incursion process and data parameters) – How data is observed given underlying pest extent Process Model: Pr(incursion process | process parameters) – Potential extent given epidemiology / ecology Parameter Model: Pr(data and process parameters) – Prior distribution to describe uncertainty in detectability, exposure, growth … The posterior distribution of the incursion process (and parameters) is related to the prior distribution and data by: Pr(process, parameters | data) ∝ Pr(data | process, parameters) Pr(process | parameters) Pr(parameters)
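This three-level factorisation translates directly into code. Below is a minimal sketch (not the actual biosecurity model) of an unnormalised log posterior combining a data model, a process model and a parameter model for a toy presence/absence incursion problem; the names p_detect and p_colonise, the tiny false-positive rate and the Beta priors are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def log_posterior(extent, params, detections, n_inspections):
    """Unnormalised log posterior for a toy presence/absence incursion model.

    extent        : latent presence indicator per site (0/1 array)
    params        : dict with 'p_detect' (per-inspection detection probability)
                    and 'p_colonise' (prior colonisation probability)
    detections    : observed detection counts per site
    n_inspections : number of inspections per site
    """
    p_det, p_col = params["p_detect"], params["p_colonise"]

    # Data model: Pr(data | process, parameters).  Detections are only
    # possible where the pest is present (tiny false-positive rate for safety).
    p_site = np.where(extent == 1, p_det, 1e-12)
    log_data = stats.binom.logpmf(detections, n_inspections, p_site).sum()

    # Process model: Pr(process | parameters) - independent colonisation of sites.
    log_process = stats.bernoulli.logpmf(extent, p_col).sum()

    # Parameter model: Pr(parameters) - Beta priors on detectability and exposure.
    log_prior = stats.beta.logpdf(p_det, 2, 2) + stats.beta.logpdf(p_col, 1, 9)

    return log_data + log_process + log_prior
```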

9 Early Warning Surveillance Priors based on emergency plant pest characteristics exposure rate for colonisation probability spread rates to link sites together for spatial analysis Add surveillance data Posterior evaluation modest reduction in area freedom large reduction in estimated extent residual “risk” maps to target surveillance

10 Observation Parameter Estimates Taking into account invasion process Hosts –Host suitability Inspector efficiency –Identify contributions

11 Study 2: Mixture models Clair Alston

12 CAT scanning sheep

13 Finite mixture model: y_i ~ Σ_j λ_j N(μ_j, σ_j²). Include spatial information. From inference to model: what proportions of the sheep carcase are muscle, fat and bone?
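For concreteness, here is a minimal sketch of Gibbs sampling for a plain k-component normal mixture (without the spatial extension used for the CT data); the priors — Dirichlet(1,…,1) weights, N(0, 10²) means, variances fixed at the empirical variance — are assumptions for illustration only, not the talk's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_normal_mixture(y, k, n_iter=2000):
    """Toy Gibbs sampler for a k-component normal mixture (no spatial term).

    Assumed priors: weights ~ Dirichlet(1,..,1), means ~ N(0, 10^2),
    component variances fixed at the empirical variance for brevity."""
    sigma2 = np.var(y)
    mu = rng.choice(y, size=k)                     # initial component means
    w = np.full(k, 1.0 / k)
    draws = []
    for _ in range(n_iter):
        # 1. Allocate each observation to a component (log scale, up to a constant)
        logp = np.log(w) - 0.5 * (y[:, None] - mu) ** 2 / sigma2
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(k, p=row) for row in p])
        # 2. Update the weights from their Dirichlet full conditional
        counts = np.bincount(z, minlength=k)
        w = rng.dirichlet(1 + counts)
        # 3. Update each mean from its normal full conditional
        for j in range(k):
            var_j = 1.0 / (counts[j] / sigma2 + 1.0 / 100.0)
            mean_j = var_j * y[z == j].sum() / sigma2
            mu[j] = rng.normal(mean_j, np.sqrt(var_j))
        draws.append((w.copy(), mu.copy()))
    return draws
```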

14

15 Inside a sheep

16

17 Study 3: State space models Nicole White

18 Parkinson’s Disease

19 PD symptom data Current methods for PD subtype classification rely on a few criteria and do not permit uncertainty in subgroup membership. Alternative: finite mixture model (equivalent to a latent class analysis for multivariate categorical outcomes) Symptom data: Duration of diagnosis, early onset PD, gender, handedness, side of onset

20 1. Define a finite mixture model based on patient responses to Bernoulli and multinomial questions. 2. Describe subgroups w.r.t. explanatory variables. 3. Obtain each patient's probability of class membership. y_ij: ith subject's response to item j. From inference to model
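As a sketch of step 3, the snippet below computes posterior class-membership probabilities in a Bernoulli (latent class) mixture under conditional independence of items; it assumes the mixture weights and item probabilities have already been estimated and is not the fitted PD model itself.

```python
import numpy as np

def class_membership_probs(Y, weights, theta):
    """Posterior class-membership probabilities in a Bernoulli (latent class)
    mixture -- a generic sketch, not the fitted PD model.

    Y       : (n_subjects, n_items) binary response matrix
    weights : (K,) estimated mixture weights
    theta   : (K, n_items) estimated item response probabilities per class
    """
    # log p(y_i | class k) under conditional independence of the items
    loglik = (Y[:, None, :] * np.log(theta) +
              (1 - Y[:, None, :]) * np.log(1 - theta)).sum(axis=2)
    logpost = np.log(weights) + loglik              # (n, K), unnormalised
    logpost -= logpost.max(axis=1, keepdims=True)   # stabilise before exp
    post = np.exp(logpost)
    return post / post.sum(axis=1, keepdims=True)
```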

21 PD: Symptom data

22 PD Signal data: “How will they respond?”

23 Inferential aims: identify spikes and assign them to an unknown number of source neurons; compare clusters between segments within a recording and between recordings at different locations of the brain (3 depths)

24 Microelectrode recordings Each recording was divided into 2.5 sec segments Discriminating features found via PCA

25 DP Model: y_i | θ_i ~ p(y_i | θ_i), θ_i ~ G, G ~ DP(α, G_0). With P PCs, y_i = (y_i1, .., y_iP) ~ MVN(μ, Σ); G_0 = p(μ, Σ); α ~ Ga(2, 2). From inference to model

26 Average waveforms

27 Comparing segments Comparison of models based on the symmetric Kullback-Leibler (KL) divergence: given two datasets y_i and y_j fitted with parameters θ_i and θ_j,
D(y_i, y_j | θ_i, θ_j) = (1/T_i) log[ p(y_i | θ_i) / p(y_i | θ_j) ] + (1/T_j) log[ p(y_j | θ_j) / p(y_j | θ_i) ]
(commonly applied in music/speech processing to assess the similarity of recordings). Each likelihood is approximated using output from M Gibbs iterations – a finite mixture approximation given each z^(m):
E[p(y | θ)] ≈ M⁻¹ Σ_{m=1..M} Π_{t=1..T} Σ_{k=1..K(m)} π_k^(m) f(y_t | θ_k^(m))
where K(m) = number of occupied components inferred by z^(m); π^(m) and θ^(m) can be simulated from their posterior distribution, given z^(m)
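A minimal sketch of this comparison in code, assuming each segment's retained Gibbs draws are stored as (weights, means, covariances) tuples for a multivariate-normal mixture (that storage format is an assumption, not the talk's implementation); the likelihood average is done in log space to avoid underflow.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_avg_likelihood(y, draws):
    """log of M^-1 * sum_m prod_t sum_k pi_k^(m) f(y_t | theta_k^(m)),
    computed in log space.  `draws` is a list of (weights, means, covs)
    tuples, one per retained Gibbs iteration (an assumed storage format)."""
    per_draw = []
    for w, mus, covs in draws:
        comp = np.column_stack([
            np.log(w[k]) + multivariate_normal.logpdf(y, mus[k], covs[k])
            for k in range(len(w))])                # (T, K): log pi_k f(y_t | .)
        per_draw.append(logsumexp(comp, axis=1).sum())
    return logsumexp(per_draw) - np.log(len(per_draw))

def symmetric_kl(y_i, y_j, draws_i, draws_j):
    """Symmetric KL divergence between two fitted segments, as on the slide."""
    d_ij = (log_avg_likelihood(y_i, draws_i) -
            log_avg_likelihood(y_i, draws_j)) / len(y_i)
    d_ji = (log_avg_likelihood(y_j, draws_j) -
            log_avg_likelihood(y_j, draws_i)) / len(y_j)
    return d_ij + d_ji
```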

28 Comparison of segments via KL divergence Hinton diagram showing similarities

29 Study 4: Spatial dynamic factor models Chris Strickland Ian Turner What can we learn about landuse from MODIS data?

30 Smart modelling Fit a spatial dynamic factor model Include spatial dependence through columns of the factor loadings matrix using a Gaussian Markov random field => nonseparable space-time covariance structure Important to use methods to achieve dimension reduction: –Space: Use Krylov subspace methods to take advantage of the sparse matrix structures –Time: Clever algorithms that avoid unwieldy inversions (solving systems with extremely large dimensions)
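To illustrate the point about exploiting sparse GMRF structure, here is a minimal sketch of solving a sparse precision system with conjugate gradients (a Krylov subspace method) rather than a dense factorisation; the random-walk precision matrix, the nugget and the problem size are illustrative assumptions, and this is only the linear-solve building block, not the full sampler of Strickland et al.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

# Minimal sketch: a first-order random-walk GMRF precision matrix on a 1-D
# lattice, solved with conjugate gradients (a Krylov subspace method) rather
# than a dense factorisation.  The size and the nugget are illustrative only.
n = 10_000
main = np.full(n, 2.0)
main[[0, -1]] = 1.0
Q = sp.diags([-np.ones(n - 1), main, -np.ones(n - 1)], [-1, 0, 1], format="csr")
Q = Q + 1e-4 * sp.identity(n)   # small nugget so the precision is non-singular

b = np.random.default_rng(1).normal(size=n)
x, info = cg(Q, b)              # solves Q x = b without ever forming Q^-1
assert info == 0
```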

31 Differentiate landuse SDFM: the 1st factor influences the temporal dynamics in the right half of the image (woodlands); the 3rd factor influences the left half (grasslands). [Figure panels: 1st trend component, 2nd trend component, common cyclical component]

32 Matchmaking 101

33 Smart models

34 Example 1: Generalisation (Judith Rousseau) Mixtures are great, but how do we choose k? True model: f_0(x) = Σ_{j=1..k_0} p_j⁰ g_{θ_j⁰}(x). Propose an overfitted model (k > k_0). Non-identifiable! All parameter values (p_1⁰, .., p_{k_0}⁰, 0, θ_1⁰, .., θ_{k_0}⁰) and all values (p_1⁰, .., p_j, …, p_{k_0}⁰, p_{k+1}, θ_1⁰, .., θ_{k_0}⁰, θ_j⁰) with p_j + p_{k+1} = p_j⁰ fit equally well.

35 So what? Multiplicity of possible solutions => the MLE does not have stable asymptotic behaviour. Not important when f_θ is the main object of interest, but important if we want to recover θ. It thus becomes crucial to know whether the posterior distribution under overfitted mixtures gives interpretable results.

36 Possible alternatives to avoid overfitting Fruhwirth-Schnatter (2006): either one of the component weights is zero or two of the component parameters are equal. Choose priors that bound the posterior away from the unidentifiability sets. Choose priors that induce shrinkage for elements of the component parameters. Problem: may not be able to fit the true model

37 Our result Assumptions: – L1 consistency of the posterior – Model g is three times differentiable, regular, and integrable – Prior on θ is continuous and positive, and the prior on (p_1, .., p_k) satisfies π(p) ∝ p_1^(α_1−1) ⋯ p_k^(α_k−1)

38 Our result - 1 If max(α_j, j ≤ k) < d/2 and k > k_0, then asymptotically the posterior weights of the superfluous components concentrate near 0, so the extra components empty out and the fitted mixture behaves stably.

39 Our result - 2 In contrast, if min(α_j, j ≤ k) > d/2 and k > k_0, then 2 or more components will tend to merge, each with non-negligible weight. This leads to less stable behaviour. In the intermediate case, if min(α_j, j ≤ k) ≤ d/2 ≤ max(α_j, j ≤ k), the situation varies depending on the α_j's and on the difference between k and k_0.

40 Implications: Model dimension When d/2 > max{α_j, j = 1,..,k}, dk_0 + Σ_{j ≥ k_0+1} α_j appears as an effective dimension of the model. This is different from the number of parameters, dk + k − 1, or from other "effective numbers of parameters". Similar results are obtained for other situations.

41 Example 1 y_i ~ N(0,1); fit p N(μ_1, 1) + (1−p) N(μ_2, 1). Here α_i = 1 > d/2.

42 Example 2 y_i ~ N(0,1); fit G = p N_2(μ_1, Σ_1) + (1−p) N_2(μ_2, Σ_2), Σ_j diagonal. d = 3; α_1 = α_2 = 1 < d/2.

43 Conclusions The result validates the use of Bayesian estimation in mixture models with too many components. It is one of the few examples where the prior can actually have an impact asymptotically, even to first order (consistency), and where choosing a less informative prior leads to better results. It also shows that the penalisation effect of integrating out the parameters, as considered in the Bayesian framework, is useful not only in model choice or testing contexts but also in estimation contexts.

44 Example 2: Empirical likelihoods Sometimes the likelihood associated with the data is not completely known or cannot be computed in a manageable time (eg population genetic models, hidden Markov models, dynamic models), so traditional tools based on stochastic simulation (eg, regular MCMC) are unavailable or unreliable. Eg, biosecurity spread model. Christian Robert

45 Algorithmic alternative: ABC We could simulate from the likelihood (even if we can't estimate it), using, e.g., ABC. Aim: approximate the posterior distribution π(θ | y) = k p(θ) f(y | θ), for θ in Θ. Produce a sample of parameters (θ_1, …, θ_M) by the following algorithm (a generic sketch is given below). Concerns about ABC: the choice of summary statistic η, distance metric ρ, and tolerance level ε
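The algorithm itself did not survive the transcript, so here is a generic sketch of the basic ABC rejection sampler the slide refers to; the Euclidean distance, the uniform prior and the normal toy model in the usage example are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def abc_rejection(y_obs, simulate, prior_sample, summary, eps, n_keep=1000):
    """Basic ABC rejection sampler.

    simulate(theta) -> pseudo-data, prior_sample() -> one draw of theta,
    summary(y) -> vector of summary statistics (eta), eps = tolerance."""
    s_obs = summary(y_obs)
    accepted = []
    while len(accepted) < n_keep:
        theta = prior_sample()
        s_sim = summary(simulate(theta))
        if np.linalg.norm(s_sim - s_obs) <= eps:   # Euclidean distance as rho
            accepted.append(theta)
    return np.array(accepted)

# Toy usage (assumed, not from the talk): infer a normal mean from two summaries
y_obs = rng.normal(3.0, 1.0, size=100)
draws = abc_rejection(
    y_obs,
    simulate=lambda t: rng.normal(t, 1.0, size=100),
    prior_sample=lambda: rng.uniform(-10, 10),
    summary=lambda y: np.array([y.mean(), y.std()]),
    eps=0.2)
```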

46 Model alternative: ELvIS Define parameters of interest as functionals of the cdf F (e.g. moments of F), then use Importance Sampling via the Empirical Likelihood. Select the F that maximises the likelihood of the data under the moment constraint. Given a constraint of the form E_θ(h(Y)) = φ(θ), the EL is defined as L_el(θ | y) = max_F Π_{i=1..n} {F(y_i) − F(y_{i−1})}. For example, in the 1-D case when θ = E(Y), the empirical likelihood in θ is the maximum of p_1 ⋯ p_n under the constraint Σ_{i=1..n} p_i y_i = θ
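A minimal sketch of the 1-D case just described: the empirical log-likelihood of a candidate mean is obtained by maximising Σ log p_i subject to Σ p_i = 1 and Σ p_i y_i = μ, which reduces to one-dimensional root-finding for the Lagrange multiplier. The implementation details (brentq, the bracketing offsets) are standard choices assumed here, not taken from the talk.

```python
import numpy as np
from scipy.optimize import brentq

def empirical_log_likelihood(mu, y):
    """Empirical log-likelihood of the mean constraint E(Y) = mu (1-D case).

    Maximises sum(log p_i) subject to sum(p_i) = 1 and sum(p_i * y_i) = mu.
    The solution is p_i = 1 / (n * (1 + lam * (y_i - mu))), where lam solves
    sum((y_i - mu) / (1 + lam * (y_i - mu))) = 0."""
    d = y - mu
    if d.max() <= 0 or d.min() >= 0:   # mu outside the convex hull of the data
        return -np.inf
    lo = -1.0 / d.max() + 1e-8
    hi = -1.0 / d.min() - 1e-8
    lam = brentq(lambda l: np.sum(d / (1.0 + l * d)), lo, hi)
    p = 1.0 / (len(y) * (1.0 + lam * d))
    return np.sum(np.log(p))

# Usage: profile the empirical likelihood over a grid of candidate means
y = np.random.default_rng(3).normal(0.5, 1.0, size=200)
grid = np.linspace(0.0, 1.0, 21)
profile = [empirical_log_likelihood(m, y) for m in grid]
```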

47 Quantile distributions A quantile distribution is defined by a closed-form quantile function F⁻¹(p; θ) and generally has no closed form for the density function. Properties: very flexible, very fast to simulate (simple inversion of the uniform distribution). Examples: 3/4/5-parameter Tukey's lambda distribution and generalisations; Burr family; g-and-k and g-and-h distributions.

48 g-and-k quantile distribution
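The formula on this slide is not retained in the transcript, so below is a sketch using a common parameterisation of the g-and-k quantile function (with the conventional c = 0.8 skewness constant, an assumption) and simulation by inverting the uniform, as the previous slide describes.

```python
import numpy as np
from scipy.stats import norm

def gk_quantile(u, A, B, g, k, c=0.8):
    """g-and-k quantile function, common parameterisation with c = 0.8."""
    z = norm.ppf(u)
    return A + B * (1 + c * (1 - np.exp(-g * z)) / (1 + np.exp(-g * z))) \
             * (1 + z ** 2) ** k * z

def simulate_gk(n, A, B, g, k, rng=None):
    """Simulate draws by inverting the uniform, as the slide notes."""
    rng = rng or np.random.default_rng()
    return gk_quantile(rng.uniform(size=n), A, B, g, k)

# (A, B, g, k) = (0, 1, 0, 0) recovers the standard normal;
# (3, 2, 1, 0.5) is the skewed, heavy-tailed example used on the later slides.
x = simulate_gk(10_000, 3, 2, 1, 0.5)
```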

49 Methods for estimating a quantile distribution: MLE using numerical approximation to the likelihood; moment matching; generalised bootstrap; location- and scale-free functionals; percentile matching; quantile matching; ABC; sequential MC approaches for multivariate extensions of the g-and-k

50 ELvIS in practice Two values of θ = (A, B, g, k): θ = (0, 1, 0, 0), the standard normal distribution, and θ = (3, 2, 1, 0.5), Allingham's choice. Two priors for θ: U(0,5)⁴, and A ~ U(−5,5), B ~ U(0,5), g ~ U(−5,5), k ~ U(−1,1). Two sample sizes: n = 100 and n = 1000

51 ELvIS in practice: θ = (3, 2, 1, 0.5), n = 100

52 Matchmaking 101

53 A wealth of algorithms!

54 From model to algorithm Models: logistic regression; non-Gaussian state space models; spatial dynamic factor models. Evaluate: computation time, maximum bias, sd, inefficiency factor (IF), accuracy rate. Chris Strickland

55 Logistic Regression k = 2, 4, 8, 20 covariates; n = 1000 observations Importance sampling (IS): – E[h(β)] = ∫ h(β) [p(β | y)/q(β | y)] q(β | y) dβ, with q(β | y) proportional to exp(−0.5 (β − β*)ᵀ V⁻¹ (β − β*)) – β* is the MLE (mode found by IRWLS); V is the inverse of the negative Hessian of the log posterior, −[∂² log p(y | X, β)/∂β ∂βᵀ]⁻¹, evaluated at β = β* Random walk Metropolis-Hastings (RWMH): – same proposal distribution Adaptive RWMH (Garthwaite, Yan, Sisson): – only needs starting values (β*) – the easiest candidate!
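As a sketch of the RWMH variant on this slide, the snippet below runs a random-walk Metropolis-Hastings chain for a logistic regression posterior with a flat prior, started at the mode with proposal covariance proportional to an externally supplied V (e.g. from an IRWLS fit); the 2.4/√k scaling and the flat prior are standard assumptions, not details from the talk.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_post(beta, X, y):
    """Log posterior for logistic regression with a flat prior (assumed)."""
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def rwmh_logistic(X, y, beta0, V, n_iter=5000, scale=2.4):
    """Random-walk Metropolis-Hastings started at the mode beta0, with proposal
    covariance proportional to V (e.g. the inverse observed information from an
    IRWLS fit), as on the slide.  The 2.4/sqrt(k) scaling is a standard choice."""
    k = len(beta0)
    L = np.linalg.cholesky(V) * scale / np.sqrt(k)
    beta, lp = beta0.copy(), log_post(beta0, X, y)
    draws, accepted = np.empty((n_iter, k)), 0
    for t in range(n_iter):
        prop = beta + L @ rng.normal(size=k)
        lp_prop = log_post(prop, X, y)
        if np.log(rng.uniform()) < lp_prop - lp:   # symmetric proposal: MH ratio
            beta, lp = prop, lp_prop
            accepted += 1
        draws[t] = beta
    return draws, accepted / n_iter
```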

56 Results [Table comparing IS, RWMH and ARWMH by k, time, bias, sd, IF and acc. rate; cell values not preserved in the transcript]

57 So what? If k is small (e.g. 8), even a naïve candidate is OK. When k is larger (e.g. 20), we need something more intelligent, e.g. adaptive MH. As models become more complicated, we need to become more sophisticated.

58 Importance sampler vs MCMC vs Particle filter vs Laplace approximation (INLA) Non-Gaussian state space model

59 Importance sampler √General algorithms (Durbin&Koopman global approximation) and Tailored algorithms (Liesenfeld&Richard 2006 – local approximation) √Independence sampler - don’t have to worry about correlated draws √Parallelisable – potentially much faster ×Difficult to come up with a good candidate distribution ×More difficult as the sample and model complexity increase ×More complex than MCMC to obtain non-standard expectations

60 Non-Gaussian state space model MCMC √Very flexible w.r.t extensions to other models √The same algorithms can be used as the sample size and/or parameter dimension grows (with some provisos) ×Can be slow, and complicated to achieve good acceptance and mixing ×Single move samplers perform poorly in terms of mixing Pitt&Shephard; Kim,Shephard&Chib √Very efficient (mixing) algorithms can be designed for specific problems, eg stochastic volatility K,S,C √General approaches available – simple reparametrisation can lead to vastly improved simulation efficiency Strickland et al

61 Non-Gaussian state space model Particle filters √ Easy to implement – at least intuitively √Updating estimates as the sample size grows is possibly a lot simpler and cheaper than a full MCMC or IS sampler update ×Perhaps not as easy as appears with parameter learning ×Particle approximation degenerates – may need MCMC steps, or alternatives ×To do full MCMC updates, need to store the entire history of particle approximation

62 Non-Gaussian state space model Integrated Nested Laplace Approximation (INLA) √ Extremely fast √ If model complexity stays the same, then it can work for very large problems √ R interface to code × Can only handle a small number of hyperparameters that can be forced into the GMRF approximation, so it is quite restrictive in the problems it can address

63 So what? Many algorithms: general versus specific, flexible versus tailored Pros and cons should be weighed against inferential aims, model and computational resources Blocking and reparametrisation are two good tricks, but we need to be clever about non-centred reparametrisation

64 Spatial Dynamic Factor Models Y_t = B f_t + ε_t, ε_t ~ N(0, Σ), Σ diagonal; factor loadings b_j ~ N(0, V(s)) Spatial correlation: – Lopes et al. use a GRF, O((p×k*)³) – Strickland et al. use a GMRF (images: large discrete spatial domain) + Krylov subspace methods to sample from the GMRF posterior; scales linearly, O(p×k*) – Rue uses a Cholesky decomposition, more complex, O((p×k*)^(3/2))

65 So what? Difference between (i) O((p×k*)³) and (ii) O(p×k*): if the data set becomes 100 times larger, (i) will take 1 million times longer, compared to 100 times longer for (ii). Instead of waiting 1 hour, you might have to wait more than 1 year…

66 Case study Spatial domain: 900 pixels Temporal: approx 200 periods (every 16 days) Total 180,000 observations, ~3600 parameters 10 mins for MCMC iterations (on a laptop!)

67 Conclusions The model is too complicated and the datasets too large for IS or PF. There are too many parameters for INLA. MCMC is thus desirable, but it is extremely important to choose good algorithms! As the model becomes more complex, the choice of smart algorithms becomes more important and can make a large difference in computation and estimation

68 Matchmaking 102

69 Smart algorithms

70 Hybrid algorithms Tierney (1994) Design an efficient algorithm that combines features of other algorithms, in order to overcome identified weaknesses in the component algorithms

71 Comparing algorithms Accuracy – bias (H) Efficiency – rate of convergence, rate of acceptance (A) – mixing speed (integrated autocorrelation time) τ(H) Applicability – simplicity of set-up, flexibility of tailoring Implementation – coding difficulty, memory storage – computational demand: total number of iterations (T), burn-in (T_0), correlation along the chain, measured by the effective sample size ESS_H = (T − T_0)/τ(H)
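A minimal sketch of the last two quantities: estimating the integrated autocorrelation time τ(H) of a chain of H-evaluations (truncating at the first negative autocorrelation, a simple heuristic assumed here) and the effective sample size ESS_H = (T − T_0)/τ(H).

```python
import numpy as np

def integrated_autocorr_time(x, max_lag=None):
    """Integrated autocorrelation time tau(H) of a chain of H-evaluations,
    truncated at the first negative autocorrelation (a simple heuristic)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    max_lag = max_lag or n // 3
    acf = np.array([np.dot(x[:n - l], x[l:]) / np.dot(x, x)
                    for l in range(1, max_lag)])
    cut = np.argmax(acf < 0) if np.any(acf < 0) else len(acf)
    return 1.0 + 2.0 * acf[:cut].sum()

def effective_sample_size(x, burnin=0):
    """ESS_H = (T - T0) / tau(H), as defined on the slide."""
    kept = np.asarray(x)[burnin:]
    return len(kept) / integrated_autocorr_time(kept)
```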

72 Hybrid algorithms - 1 Metropolis-Hastings Algorithm (MHAs) – Improve mixing speed via: parallel chains (MHA); repulsive proposal (MHA_RP, M&R), pinball sampler; delayed rejection (DRA, Tierney & Mira; DRA_LP; DRA_Pinball) – Improve applicability by: reversible jump (Green); Metropolis-adjusted Langevin (MALA, Tierney & Roberts; Besag & Green) Kate Lee, Christian Robert, Ross McVinish

73 Simulation study Model – mixture of 2-D normals, well separated: 0.5 N([0,0]ᵀ, I_2) + 0.5 N([5,5]ᵀ, I_2) – 100 replicated simulations – each result is obtained after running the algorithm for 1200 seconds using 10 particles – proposal variance = 4; the value in brackets is the MSE (similar results for variance = 2) Platform – Matlab, run on an SGI Altix XE cluster containing 112 64-bit Intel Xeon cores.

74 Results T = no. of simulations; A = acceptance rate; H = accuracy; σ²_H = var(H); τ_p(H) = autocorrelation time; ESS_H = effective sample size

75 Results MHA – shortest CPU time per iteration and largest sample size – need to tune the proposal variance to optimise performance MALA – can get trapped in the nearest mode if the scaling parameter in the variance of the proposal is not sufficiently large MHA with RP – induces a fast mixing chain, but need to choose a tuning parameter – expensive to compute – in rare cases the algorithm can be unstable and get stuck DRA – less correlated chains, higher acceptance rate – higher computational demand – Langevin proposal: improves mixing, but the loss in computational efficiency overwhelmed the gain in statistical efficiency – a normal random walk is faster to compute and improves mixing.

76 Hybrid algorithms - 2 Population Monte Carlo (PMC) –Extension of IS by allowing importance function to adapt to the target in an iterative manner Cappe et al –PMC with repulsive proposal: create ‘holes’ around existing particles Particle systems and MCMC –IS + MCMC M&R –parallel MCMC + SMC del Moral

77 PMC algorithm
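The algorithm on this slide is not preserved in the transcript; below is a generic sketch of a basic population Monte Carlo iteration in the spirit of Cappé et al. (1-D, Gaussian random-walk kernel, multinomial resampling), where the kernel scale, the particle count and the bimodal toy target are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

def pmc(log_target, n_particles=50, n_iter=20, tau=1.0):
    """Basic population Monte Carlo with a Gaussian random-walk kernel
    (generic 1-D sketch in the spirit of Cappe et al.)."""
    x = rng.normal(0.0, 5.0, size=n_particles)          # initial population
    for _ in range(n_iter):
        # propose each new particle from a kernel centred on a current particle
        prop = x + tau * rng.normal(size=n_particles)
        # importance weights: target / mixture-of-kernels proposal density
        q = np.mean(norm.pdf(prop[:, None], loc=x[None, :], scale=tau), axis=1)
        log_w = log_target(prop) - np.log(q)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # multinomial resampling adapts the population towards the target
        x = prop[rng.choice(n_particles, size=n_particles, p=w)]
    return x

# Toy usage: the well-separated bimodal target from the earlier simulation study
log_target = lambda z: np.logaddexp(norm.logpdf(z, 0, 1), norm.logpdf(z, 5, 1))
particles = pmc(log_target)
```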

78 PMC with repulsive effect

79 Simulation study repeated: 100 replicates, run for 500 seconds using 50 particles, with the first 100 iterations ignored √ accuracy of estimation √ fast exploration (significantly reduced integrated autocorrelation time) √ no instability (unlike the MCMC algorithms) √ less sensitivity to the importance function √ the repulsive effect improved mixing

80 Summary [Table of relative performance compared to MHA and PMC; cell values not preserved in the transcript. Columns: Statistical Efficiency (EPM, CR, RC), Computation (CE), Applicability (SP, FH, CP, Mode); rows: MALA_S, MHA_RP, DRA, DRA_LP, DRA_Pinball, PS, PMC_R.] Legend: EPM = efficiency of proposal move; CR = correlation reduction of chain; RC = rate of convergence; CE = cost effectiveness; SP = simplicity of programming; FH = flexibility of hyperparameters; CP = consistency of performance; Mode = preference between a single-mode and a multimodal problem

81 Hybrid algorithms - 3

82 ABCel Unlike ABC, ABCel does not require the choice of a summary statistic η, a distance metric ρ, or a tolerance level ε.

83 Conclusions 1.Combining features of individual algorithms may lead to complicated characteristics in a hybrid algorithm. 2.Each individual algorithm may have a strong individual advantage with respect to a particular performance criterion, but this does not guarantee that the hybrid method will enjoy a joint benefit of these strategies. 3.The combination of algorithms may add complexity in set-up, programming and computational expense.

84 Implementing smart algorithms: PyMCMC Python package for fast MCMC; takes advantage of the Python libraries Numpy and Scipy Classes for Gibbs, M-H, orientational bias MC, slice samplers, etc. linear (with stochastic search), logit, probit, log-linear, linear mixed-model, probit mixed-model, nonlinear mixed models, mixture, spatial mixture, spatial mixture with regressors, time series suite (including DFM and SDFM) Straightforward to optimise, extensible to C or Fortran, parallelisable (GPU) Chris Strickland

85 PyMCMC

86 Matchmaking algorithm! Cool problems [Diagram: Inference – Algorithm – Model; Past – Future]

87 Key References
Lee, K., Mengersen, K., Robert, C.P. (2012) Hybrid models. In: Case Studies in Bayesian Modelling (eds Alston, Mengersen, Pettitt). Wiley, to appear.
Stanaway, M., Reeves, R., Mengersen, K. (2010) Hierarchical Bayesian modelling of early detection surveillance for plant pest invasions. Environmental and Ecological Statistics.
Strickland, C., Simpson, D., Denham, R., Turner, I., Mengersen, K. Fast methods for spatial dynamic factor models. Computational Statistics & Data Analysis.
Strickland, C., Alston, C., Mengersen, K. (2011) PyMCMC. Journal of Statistical Software, under review.
White, N., Johnson, H., Silburn, P., Mengersen, K. Unsupervised sorting and comparison of extracellular spikes with Dirichlet process mixture models. Annals of Applied Statistics, under review.

