Presentation on theme: "Simple methods to improve MCMC efficiency in random effect models William Browne*, Mousa Golalizadeh*, Martin Green and Fiona Steele Universities of Bristol."— Presentation transcript:
Simple methods to improve MCMC efficiency in random effect models William Browne*, Mousa Golalizadeh*, Martin Green and Fiona Steele Universities of Bristol and Nottingham *Thanks to ESRC for supporting this work
Summary Introduction. Application 1 – Clutch size in great tits. Method 1: Hierarchical centering. Method 2: Parameter expansion. Application 2 – Mastitis incidence in dairy cattle. Method 3: Orthogonal predictors. Application 3 – Contraceptive discontinuation in Indonesia. Conclusions.
Introduction/Synopsis MCMC methods allow easy fitting of complex random effects models The simplest (default) MCMC algorithms can produce poorly mixing chains. By reparameterising the model one can greatly improve mixing. These reparameterisations are easy to implement in WinBUGS (or MLwiN) The choice of reparameterisation depends in part on the model/dataset.
Application 1: Great tit nesting behaviour (crossed random effects) Original work was collaborative research with Richard Pettifor (Institute of Zoology, London), and Robin McCleery and Ben Sheldon (University of Oxford).
Application 1: Great tit nesting behaviour (crossed random effects) A longitudinal study of great tits nesting in Wytham Woods, Oxfordshire. 6 responses : 3 continuous & 3 binary. Clutch size, lay date and mean nestling mass. Nest success, male and female survival. Data: 4165 nesting attempts over a period of 34 years. There are 4 higher-level classifications of the data: female parent, male parent, nestbox and year. We only consider Clutch size here
Data background SourceNumber of IDs Median #obs Mean #obs Year34104122.5 Nestbox96844.30 Male parent298611.39 Female parent294411.41 Note there is very little information on each individual male and female bird but we can get some estimates of variability via a random effects model. The data structure can be summarised as follows:
MCMC efficiency for clutch size response The MCMC algorithm used in the univariate analysis of clutch size was a simple 10-step Gibbs sampling algorithm.. To compare methods for each parameter we can look at the effective sample sizes (ESS) which give an estimate of how many independent samples we have for each parameter as opposed to 50,000 dependent samples. ESS = # of iterations/,
Clutch Size Here we see that the average clutch size is just below 9 eggs with large variability between female birds and some variability between years. Male birds and nest boxes have less impact.
Effective Sample sizes ParameterMLwiNWinBUGS Fixed Effect671602 Year3063229604 Nestbox833788 Male3633 Female30983685 Observation110135 Time519s2601s We will now consider methods that will improve the ESS values for particular parameters. We will firstly consider the fixed effect parameter.
Trace and autocorrelation plots for fixed effect using standard Gibbs sampling algorithm
Hierarchical Centering This method was devised by Gelfand et al. (1995) for use in nested models. Basically (where feasible) parameters are moved up the hierarchy in a model reformulation. For example : is equivalent to The motivation here is we remove the strong negative correlation between the fixed and random effects by reformulation.
Hierarchical Centering In our cross-classified model we have 4 possible hierarchies up which we can move parameters. We have chosen to move the fixed effect up the year hierarchy as its variance had biggest ESS although this choice is rather arbitrary. The ESS for the fixed effect increases 50-fold from 602 to 35,063 while for the year level variance we have a smaller improvement from 29,604 to 34,626. Note this formulation also runs faster 1864s vs 2601s (in WinBUGS).
Trace and autocorrelation plots for fixed effect using hierarchical centering formulation
Parameter Expansion We next consider the variances and in particular the between-male bird variance. When the posterior distribution of a variance parameter has some mass near zero this can hamper the mixing of the chains for both the variance parameter and the associated random effects. The pictures over the page illustrate such poor mixing. One solution is parameter expansion (Liu et al. 1998). In this method we add an extra parameter to the model to improve mixing.
Trace plots for between males variance and a sample male effect using standard Gibbs sampling algorithm
Autocorrelation plot for male variance and a sample male effect using standard Gibbs sampling algorithm
Parameter Expansion In our example we use parameter expansion for all 4 hierarchies. Note the parameters have an impact on both the random effects and their variance. The original parameters can be found by: Note the models are not identical as we now have different prior distributions for the variances.
Parameter Expansion For the between males variance we have a 20-fold increase in ESS from 33 to 600. The parameter expanded model has different prior distributions for the variances although these priors are still diffuse. It should be noted that the point and interval estimate of the level 2 variance has changed from 0.034 (0.002,0.126) to 0.064 (0.000,0.172). Parameter expansion is computationally slower 3662s vs 2601s for our example.
Trace plots for between males variance and a sample male effect using parameter expansion.
Combining the two methods Hierarchical centering and parameter expansion can easily be combined in the same model. Here we perform centering on the year classification and parameter expansion on the other 3 hierarchies.
Effective Sample sizes ParameterWinBUGS originally WinBUGS combined Fixed Effect60234296 Year2960434817 Nestbox7885170 Male33557 Female36858580 Observation1351431 Time2601s2526s As we can see below the effective sample sizes for all parameters are improved for this formulation while running time remains approximately the same.
Applications 2 & 3: Multilevel discrete-time survival analysis Used to model the durations until the occurrence of events e.g. length of time to death. Different from standard regression due to right-censoring and time varying covariates. In many applications events may be repeatable and the outcome is duration of continuous exposure to the risk of an event. In our applications cows may suffer mastitis more than once and women can initiate and discontinue use of contraceptives several times. We begin with two levels of data episodes nested within individuals. With a discrete time approach we expand the data to create a lower level – time interval, which represents a fixed period of time and is nested within episode.
Data Expansion Suppose we have time intervals (e.g. days, weeks) indexed by t = 1,...,K where K is the maximum duration of an episode. Let t ij denote the number of intervals for which individual i is observed in episode j. Before modelling we now need to expand the data for each episode ij to obtain t ij records. We have y tij =0 for t=1,…,t ij-1 and the response for t= t ij is 1 if an episode ends in an event and 0 for censored episodes. Example: Individual observed for 7 months and has event after 4 months: Response vector (0,0,0,1,0,0,0), Time intervals (1,2,3,4,1,2,3)
Modelling Having restructured the data we can analyse the event occurrence indicator y tij using standard methods for clustered binary response data. Here we model the probability of an event occurrence π tij as a function of duration (z t ) and covariates x tij with u j being an individual specific random effect representing unobserved time-invariant individual characteristics.
MCMC algorithm for random effect logistic regression models Consider the following simple model: A standard MCMC algorithm (as used in MLwiN (Rasbash et al. (2000), Browne (2003)) is Step 1: Update using random walk Metropolis sampling. Step 2: Update using random walk Metropolis sampling. Step 3: Update from its inverse Gamma full conditional using Gibbs Sampling
Hierarchical centered formulation We can reparameterise by replacing the residuals u j with random effects u * j = β 0 +u j. Here the u* j are (hierarchically) centred around β 0. This formulation allows Gibbs sampling to be used for the fixed effect β 0 in addition to the random effects variance. It is interesting to consider how an MCMC algorithm for this centered formulation can be expressed in terms of the original parameterisation.
MCMC Algorithm At iteration t+1 Step 1: Update β 0 from its normal conditional distribution: Adjust to keep u * j fixed. The other steps are unchanged by reparameterisation as the random walk Metropolis sampling for u * j is equivalent to the same step for u j and the final step is unchanged.
Why should hierarchical centering be useful for discrete time survival models ? As we have seen discrete time survival models are a special case of random effects logistic regression models. Expansion results in many higher level predictors that can be (hierarchically) centered around. Hierarchical centering should improve mixing but also speed up estimation. Will be particularly useful when we have many level 1 units per level 2 unit and large level 2 variance, where the speed up will be greatest – see application 2. Not always useful but other potential solutions – see application 3.
Application 2: Mastitis incidence in dairy cattle Mastitis – inflammation of mammary gland of dairy cows, usually caused by a bacterial infection. Infections arising in dry (non-lactating) period often result in clinical mastitis in early lactation. Green et al. (2007) use multilevel survival models to investigate how cow, farm and management factors in the dry period correlate with mastitis incidence. Data – 2 years x 52 dairy farms – 103 farm years – 8,710 cow lactation periods following dry periods expands to 256,382 records in total!
Mastitis model The Model has many predictors. The duration terms z t consists of an intercept plus polynomials in log time to order 3. The 4 levels are farm, farm year, cow dry period and weekly obs. although no random effects occur at the dry period level. The model can easily be converted to a hierarchical centered formulation by centering around the 10 farm-year level fixed effects
Mastitis predictors Predictors at dry period level (6) are : Parity of cow (5 categories – 4 dummies) Dummies for Somatic cell count high before drying off, and whether two SCC readings were available. At the farm level (9): One dummy for cows remaining standing for 30 minutes after dry cow treatment, Two dummy variables to indicate whether cubicle bedding is disinfected in the early dry period and whether this is not applicable due to the system used. Two dummy variables to indicate if transition cow cubicles are bedded at least once daily and whether there are in fact transition cow cubicles. A dummy variable as to whether the cubicle bedding is disinfected in the transition dry period Three dummy variables to indicate whether the transition cubicle feed and loaf area is scraped daily, more often than daily or doesn't exist.
Mastitis results The model was run for 50,000 iterations after an adapting period and a further burnin of 500 iterations in MLwiN. Due to the number of parameters in model we will in the table that follows simply give effective sample size (ESS) values although it should also be noted that parameter values for the non- centred formulation were rather variable due to the small ESS The estimation took ~19 hours for the non- centred formulation and 11½ hours for the centred formulation.
Method 3a: Orthogonal polynomials It is noticeable that the parameters α 1 -α 3 have small ESS. These correspond to the predictor polynomials in log duration and the three correlations between the predictors are -0.61, 0.79 and -0.90 respectively. We can without altering the rest of the analysis replace this group of predictors via a reparameterisation that makes the predictors less correlated and in fact we choose to make them orthogonal i.e. To do this we can keep the first predictor and then add the rest in one at a time solving the above at each stage. This results in predictors z * : logt, logt 2 +0.94logt, logt 3 + 1.50logt 2 -2.05logt
Results of orthogonal polynomials Fitting this model will give coefficient α * that can be easily converted back to the coefficients for the original predictors: α 1 = α * 1 +0.94 α * 2 -2.05 α * 3, α 2 = α * 2 +1.50 α * 3,, α 3 = α * 3 Effective sample size comparison: ParameterStandardOrthogonal α0α0 26953337 α1α1 9261997 α2α2 75779 α3α3 751038
Application 3: Contraceptive discontinuation in Indonesia Steele et al. (2004) use multilevel multistate models to study transitions in and out of contraceptive use in Indonesia. Here we consider a simplification of their model which considers only the transition from use to non-use, commonly referred to as contraceptive discontinuation. The data come from the 1997 Indonesia Demographic and Health Survey. Contraceptive use histories were collected retrospectively for the six-year period prior to the survey, and include information on the month and year of starting and ceasing use, the method used, and the reason for discontinuation. The analysis is based on 17,833 episodes of contraceptive use for 12,594 women, where an episode is defined as a continuous period of using the same method of contraception. Restructuring the data to discrete-time format with monthly time intervals leads to 365,205 records. To reduce the size, monthly intervals are grouped into six-month intervals and a binomial response is defined with denominator equal to the number of months of use within a six-month interval. Aggregation of intervals leads to a dataset with 68,515 records.
Model We here have intervals nested within episodes nested within women. Here we include discrete categories for the duration terms z t – a piecewise constant hazard with categories representing 6-11 months, 12-23 months, 24-35 months and >35 months with a base category of 0-5 months.
Predictors At the episode level we have: Age categorized as <25,25-34 and 34-49. Contraceptive method categorized as pill/injectable, Norplant/IUD, other modern and traditional. At the woman level we have: Education (3 categories). Type of region of residence (urban/rural). Socioeconomic status (low, medium or high).
Results ParamN CentredCentredParamN CentredCentred α0α0 166528β4β4 2051718820 α1α1 1340591β5β5 2111819313 α2α2 1155381β6β6 203650 α3α3 11900126β7β7 209344 α4α4 10463161β8β8 1696579 β1β1 13855163β9β9 648833 β2β2 1593311840β 10 687655 β3β3 1991120562σ2uσ2u 14 The table below gives the effective sample sizes based on runs of 250,000 iterations. Here the hierarchical centered formulation does really badly. This is because the cluster variance σ 2 u is very small: estimates of 0.041 and 0.022 for the two methods
A closer look at the residuals It is well known that hierarchical centering works best if the cluster level variance is substantial. Here we see that both the variance is small and the distribution of the residuals is not very normal. This is due to a few women who discontinue usage very quickly and often, whilst many women never discontinue! Normal scores Std residuals
Simple logistic regression We will consider first removing the random effects from the model (due to their small variance) which will result in a simple logistic regression model. It will then be impossible to perform hierarchical centering however we will consider an extension to the orthogonalisation performed in the previous application. Note that Hills and Smith (1992) talk about using orthogonal parameterisations and Roberts and Gilks give it one sentence in MCMC in Practice. Here we consider it in combination with the simple (single site) random walk Metropolis sampler where reduction of correlation in the posterior is perhaps most important.
Method 3b:Orthogonal parameterisation For simplicity assume we have all predictors in one matrix P and that we can write z t α+x tij β as p tij θ where θ=(α,β). Step 1: Number the predictors in some ordering 1,…,N. Step 2: Take each predictor in turn and replace it with a predictor that is orthogonal to all the already considered predictors. For predictor p k. Note this requires solving k-1 equations in k-1 unknown w parameters. A different orthogonal set of predictors results from each ordering.
Orthogonal parameterisation The second step of the algorithm produces both a set of orthogonal predictors that span the same space as the original predictors and a group of w coefficients that can be combined to form a lower diagonal matrix W. We can fit this model and recover the coefficients for the original predictors by pre-multiplication by W T. It is worth noting here that we use improper Uniform priors for the coefficients and if we used proper priors we would need to also calculate the Jacobian for the reparameterisation to ensure the same priors are used. We ordered the predictors in what follows so that the level 2 predictors were last before performing reparameterisation.
Results The following is based on 50,000 iterations: ParamOriginalOrthogonalParamOriginalOrthogonal α0α0 40311578β4β4 1130612010 α1α1 424912049β5β5 987712023 α2α2 361711878β6β6 50010945 α3α3 464311920β7β7 51411610 α4α4 526011864β8β8 564610591 β1β1 511412908β9β9 146611214 β2β2 778710188β 10 168610249 β3β3 102918518 Here we see almost universal benefit of the orthogonal parameterisation with virtually zero time costs and very little programming!
Combining orthogonalisation with parameter expansion Combining orthogonalisation and parameter expansion we have: We ran this model using WinBUGS and only 25,000 iterations following a burnin of 500 iterations which took 34 hours compared to 23½ for 250k in MLwiN without parameter expansion. The results are given overleaf.
Results for full model Here we compare simply using the orthogonal approach in MLwiN for 250k with both orthogonal predictors and parameter expansion in WinBUGS for 25k. Note this takes ~1.5 times as long: ParamOrthog. MLwiN Param Expan ParamOrthog. MLwiN Param Expan α0α0 2271414009β4β4 2353323488 α1α1 2393122017β5β5 2449823792 α2α2 2413615024β6β6 2381622546 α3α3 233034859β7β7 2342824422 α4α4 224571881β8β8 2286022995 β1β1 2334722609β9β9 2369723960 β2β2 2277920883β 10 2362423383 β3β3 2210514032σ2uσ2u 20318
Trace plots of variance chain Before Parameter Expansion After Parameter Expansion Here we see the far greater mixing of the variance chain after parameter expansion. It is worth noting that parameter expansion uses a different prior for σ 2 u and results in an estimate of 0.059 (0.048) as opposed to 0.008 (0.006) without and earlier estimates of 0.041(0.026) and 0.022 (0.018) before orthogonalisation. Note however that all estimates bar parameter expansion are based on very low ESS!
Conclusions Hierarchical centering – as is well known - works well when the cluster variance is big but is no good for small variances Other research building on hierarchical centering includes Papaspiliopoulus et al. (2003,2007) showing how to construct partially non-centred parameterisations. Parameter expansion works well to improve mixing when the cluster variance is small but results in a different prior for the variance. Orthogonalisation of predictors appears to be a good idea generally but is slightly more involved than the other reparameterisations i.e. the predictors need orthogonalising outside WinBUGS and the chains need transforming back. An interesting area of further research is choosing the order for orthogonalisation i.e. which set of orthogonal predictors to use.
References Browne, W.J. (2003). MCMC Estimation in MLwiN. London: Institute of Education, University of London Browne, W.J. (2004). An illustration of the use of reparameterisation methods for improving MCMC efficiency in crossed random effect models Multilevel Modelling Newsletter 16 (1): 13-25 Gamerman D. (1997) Sampling from the posterior distribution in generalized linear mixed models. Statistics and Computing. 7, 57--68. Gelfand, A.E., Sahu, S.K. and Carlin, B.P. (1995) Efficient parameterisations for normal linear mixed models. Biometrika 83, 479--488 Gelman, A., Huang, Z., van Dyk, D., and Boscardin, W.J. (2007). Using redundant parameterizations to fit hierarchical models. Journal of Computational and Graphical Statistics (to appear). Green, M.J., Bradley, A.J., Medley, G.F. and Browne, W.J. (2007) Cow, Farm and Management Factors during the Dry Period that determine the rate of Clinical Mastitis after calving. Journal of Dairy Science 90: 3764-- 3776. Hills, S.E. and Smith, A.F.M. (1992) Parameterization Issues in Bayesian Inference. In Bayesian Statistics 4, (J M Bernardo, J O Berger, A P Dawid, and A F M Smith, eds), Oxford University Press, UK, pp. 227--246.
References cont. Liu, C., Rubin, D.B., and Wu, Y.N. (1998) Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika 85 (4): 755-770. Liu, J.S., Wu, Y.N. (1999) Parameter Expansion for Data Augmentation. Journal Of The American Statistical Association 94: 1264-1274 Papaspiliopoulos, O, Roberts, G.O. and Skold, M. (2003) Non-centred Parameterisations for Hierarchical Models and Data Augmentation. In Bayesian Statistics 7, (J M Bernardo, M J Bayarri, J O Berger, A P Dawid, D Heckerman, A F M Smith and M West, eds), Oxford University Press, UK, pp. 307--32 Papaspiliopoulos, O, Roberts, G.O. and Skold, M. (2007) A General Framework for the Parametrization of Hierarchical Models. Statistical Science 22, 59--73. Rasbash, J., Browne, W.J., Healy, M, Cameron, B and Charlton, C. (2000). The MLwiN software package version 1.10. London: Institute of Education, University of London. Steele, F., Goldstein, H. and Browne, W.J. (2004). A general multilevel multistate competing risks model for event history data, with an application to a study of contraceptive use dynamics. Statistical Modelling 4: 145--159 Van Dyk, D.A., and Meng, X-L. (2001) The Art of Data Augmentation. Journal of Computational and Graphical Statistics. 10, 1--50.