# §❹ The Bayesian Revolution: Markov Chain Monte Carlo (MCMC)

Robert J. Tempelman

Simulation-based inference
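Suppose you’re interested in the following integral/expectation: $E[g(x)] = \int g(x) f(x)\,dx$, where f(x) is a density and g(x) is a function. You can draw random samples $x_1, x_2, \ldots, x_n$ from f(x). Then compute $\hat{g}_n = \frac{1}{n}\sum_{i=1}^{n} g(x_i)$, with Monte Carlo standard error $\sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}\left(g(x_i)-\hat{g}_n\right)^2}$. As $n \to \infty$, $\hat{g}_n \to E[g(x)]$.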
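A minimal Python sketch of this recipe (Python rather than the SAS used on the slides). The function name, the seed and the illustration with x ~ N(0,1) and g(x) = x² are assumptions chosen for the example, not taken from the presentation.

```python
import math
import random

def monte_carlo_expectation(draw, g, n, seed=1):
    """Estimate E[g(x)] from n independent draws of x, with its Monte Carlo standard error."""
    rng = random.Random(seed)
    values = [g(draw(rng)) for _ in range(n)]
    mean = sum(values) / n
    # Sample variance of g(x_i), then the standard error of their mean.
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

# Assumed illustration: x ~ N(0, 1) and g(x) = x^2, so that E[g(x)] = 1.
estimate, mcse = monte_carlo_expectation(
    lambda rng: rng.gauss(0.0, 1.0), lambda x: x * x, 20000
)
```

The standard error shrinks at rate 1/√n, so quadrupling n roughly halves it.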

Beauty of Monte Carlo methods
You can determine the distribution of any function of the random variable(s). Distribution summaries include means, medians, key percentiles (2.5%, 97.5%), standard deviations, etc. Generally more reliable than using the “Delta method”, especially for highly non-normal distributions.

Using method of composition for sampling (Tanner, 1996).
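Involves two stages of sampling. Example: suppose $Y_i | \lambda_i \sim \text{Poisson}(\lambda_i)$ and, in turn, $\lambda_i | \alpha, \beta \sim \text{Gamma}(\alpha, \beta)$. Then marginally $Y_i$ follows a negative binomial distribution with mean $\alpha/\beta$ and variance $(\alpha/\beta)(1+\beta^{-1})$.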

Using method of composition for sampling from negative binomial:

    data new;
      seed1 = 2; alpha = 2; beta = 0.25;
      do j = 1 to 10000;
        call rangam(seed1,alpha,x);
        lambda = x/beta;               /* Draw lambda_i | alpha,beta ~ Gamma(alpha,beta) */
        call ranpoi(seed1,lambda,y);   /* Draw Y_i ~ Poisson(lambda_i) */
        output;
      end;
    run;
    proc means mean var; var y; run;

The MEANS Procedure (Variable, Mean, Variance): y, 7.9749.
E(y) = α/β = 2/0.25 = 8. Var(y) = (α/β)(1+β⁻¹) = 8*(1+4) = 40.
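The same two-stage draw can be sketched in Python. This is an illustration under assumptions: the Poisson generator below is Knuth's multiplication method (the standard library has no Poisson sampler), and the function names are hypothetical.

```python
import math
import random

def draw_poisson(rng, lam):
    """Knuth's multiplication method; adequate for the moderate means used here."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def negative_binomial_by_composition(n, alpha, beta, seed=2):
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        lam = rng.gammavariate(alpha, 1.0 / beta)   # stage 1: shape alpha, rate beta
        draws.append(draw_poisson(rng, lam))        # stage 2: Poisson given lambda
    return draws

y = negative_binomial_by_composition(10000, alpha=2.0, beta=0.25)
mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / (len(y) - 1)
```

With α = 2 and β = 0.25, the sample mean and variance should land near the theoretical 8 and 40, up to Monte Carlo error.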

Another example? Student t.
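
    data new;
      seed1 = 29523; df = 4;
      do j = 1 to ;
        call rangam(seed1,df/2,x);
        lambda = x/(df/2);               /* Draw lambda_i | nu ~ Gamma(nu/2, nu/2) */
        t = rannor(seed1)/sqrt(lambda);  /* Draw t_i | lambda_i ~ Normal(0, 1/lambda_i) */
        output;
      end;
    run;
    proc means mean var p5 p95; var t; run;
    data tq;   /* data set name assumed; the statement is missing in the transcript */
      t5 = tinv(.05,4); t95 = tinv(.95,4);
    run;
    proc print; run;

Then t ~ Student t with ν degrees of freedom. PROC MEANS reports the mean, variance, 5th and 95th percentiles of the simulated t, to be compared with the exact quantiles t5 and t95 from PROC PRINT.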
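A hedged Python sketch of the same scale-mixture construction; the number of draws (50000) is an assumption, since the loop bound is not shown in the transcript.

```python
import math
import random

def student_t_by_composition(n, df, seed=29523):
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        lam = rng.gammavariate(df / 2.0, 2.0 / df)            # Gamma(shape df/2, rate df/2)
        draws.append(rng.gauss(0.0, 1.0) / math.sqrt(lam))    # Normal(0, 1/lam)
    return draws

t = sorted(student_t_by_composition(50000, df=4))
p05 = t[int(0.05 * len(t))]
p95 = t[int(0.95 * len(t))]
# For comparison, the exact 95th percentile of Student t with 4 df is about 2.132.
```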

Expectation-Maximization (EM)
OK, I know that EM is NOT a simulation-based inference procedure. However, it is based on data augmentation and is an important progenitor of Markov chain Monte Carlo (MCMC) methods. Recall the plant genetics example.

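Data augmentation
Augment the “data” by splitting the first cell into two cells with probabilities ½ and θ/4, giving 5 categories. With x the (unobserved) count in the ½ cell, the complete-data likelihood is $p(x, y | \theta) \propto (1/2)^{x} (\theta/4)^{y_1 - x + y_4} ((1-\theta)/4)^{y_2+y_3} \propto \theta^{y_1 - x + y_4}(1-\theta)^{y_2+y_3}$. Looks like a Beta distribution to me!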

Data augmentation (cont’d)
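So we have the joint distribution of the “complete” data. Consider the part just including the “missing data”: with x the count in the ½ cell, it is binomial, $x | \theta, y \sim \text{Binomial}(y_1, 2/(\theta+2))$.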

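Expectation-Maximization.
1. Expectation step: replace the missing count by its conditional expectation given the current θ. For the count in the θ/4 sub-cell of the first category (Ex2 in the PROC IML code), this is $y_1\theta/(\theta+2)$.
2. Maximization step: use first or second derivative methods to maximize the expected complete-data log-likelihood. Setting the first derivative to 0 gives $\theta = (\text{Ex2} + y_4)/(\text{Ex2} + y_2 + y_3 + y_4)$.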

Recall the data

| Genotype | Probability | Data (counts) |
|---|---|---|
| A_B_ | (2+θ)/4 | y1 = 1997 |
| aaB_ | (1−θ)/4 | y2 = 906 |
| A_bb | (1−θ)/4 | y3 = 904 |
| aabb | θ/4 | y4 = 32 |

θ → 0: close linkage in repulsion. θ → 1: close linkage in coupling. 0 ≤ θ ≤ 1.

PROC IML code:

    proc iml;
      y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
      theta = 0.20;                        /* Starting value */
      do iter = 1 to 20;
        Ex2 = y1*(theta)/(theta+2);        /* E-step */
        theta = (Ex2+y4)/(Ex2+y2+y3+y4);   /* M-step */
        print iter theta;
      end;
    run;

The output lists iter and theta for iterations 1 to 20. Slower than Newton-Raphson/Fisher scoring, but generally more robust to poorer starting values.
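As a cross-check, the same EM updates can be written in Python and compared with the root of the score equation, which for this multinomial reduces to the quadratic nθ² − (y1 − 2(y2+y3) − y4)θ − 2y4 = 0 with n = y1+y2+y3+y4. The function names are hypothetical; the quadratic comes from multiplying the score through by θ(1−θ)(2+θ).

```python
import math

def em_linkage(y1, y2, y3, y4, theta=0.20, n_iter=20):
    """EM for the linkage data; returns the path of theta values."""
    path = [theta]
    for _ in range(n_iter):
        ex2 = y1 * theta / (theta + 2.0)             # E-step
        theta = (ex2 + y4) / (ex2 + y2 + y3 + y4)    # M-step
        path.append(theta)
    return path

def mle_closed_form(y1, y2, y3, y4):
    """Positive root of n*theta^2 - b*theta - 2*y4 = 0, b = y1 - 2(y2 + y3) - y4."""
    n = y1 + y2 + y3 + y4
    b = y1 - 2 * (y2 + y3) - y4
    return (b + math.sqrt(b * b + 8 * n * y4)) / (2 * n)

path = em_linkage(1997, 906, 904, 32)
theta_em = path[-1]
theta_mle = mle_closed_form(1997, 906, 904, 32)
```

Both routes should agree to several decimals (about 0.0357), with the EM path approaching the root geometrically.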

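How to derive an asymptotic standard error using EM?
From Louis (1982): the observed information equals the expected complete-data information minus the missing information, $I(\theta | y) = E\left[-\partial^2 \log p(x, y|\theta)/\partial\theta^2 \mid y\right] - \text{Var}\left[\partial \log p(x, y|\theta)/\partial\theta \mid y\right]$.

Finish off: for the linkage example, with $x_2$ the count in the θ/4 sub-cell, $I(\theta|y) = [E(x_2|y,\theta) + y_4]/\theta^2 + (y_2+y_3)/(1-\theta)^2 - \text{Var}(x_2|y,\theta)/\theta^2$. Hence the asymptotic standard error is $I(\hat\theta|y)^{-1/2}$.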
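A small Python sketch of Louis's identity (observed information = expected complete-data information minus the variance of the complete-data score) for the linkage data. It assumes the split of the first cell into sub-cells with probabilities ½ and θ/4, so the count x2 in the θ/4 sub-cell is Binomial(y1, θ/(θ+2)) given y. Function names are hypothetical.

```python
import math

def louis_standard_error(theta, y1, y2, y3, y4):
    p = theta / (theta + 2.0)         # P(theta/4 sub-cell | first category)
    ex2 = y1 * p                      # E(x2 | y, theta)
    var_x2 = y1 * p * (1.0 - p)       # Var(x2 | y, theta), binomial
    complete_info = (ex2 + y4) / theta ** 2 + (y2 + y3) / (1.0 - theta) ** 2
    missing_info = var_x2 / theta ** 2   # variance of the complete-data score
    return 1.0 / math.sqrt(complete_info - missing_info)

def direct_standard_error(theta, y1, y2, y3, y4):
    """Same quantity from the second derivative of the observed log-likelihood."""
    info = y1 / (2.0 + theta) ** 2 + (y2 + y3) / (1.0 - theta) ** 2 + y4 / theta ** 2
    return 1.0 / math.sqrt(info)

se_louis = louis_standard_error(0.0357, 1997, 906, 904, 32)
se_direct = direct_standard_error(0.0357, 1997, 906, 904, 32)
```

Both functions should give about 0.0060 at θ ≈ 0.0357, the value used for sigma in the normal overlay of the histogram later in the slides.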

Stochastic Data Augmentation (Tanner, 1996)
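Posterior identity: $p(\theta|y) = \int p(\theta|x,y)\,p(x|y)\,dx$. Predictive identity: $p(x|y) = \int p(x|\phi,y)\,p(\phi|y)\,d\phi$. Together these imply $p(\theta|y) = \int K(\theta,\phi)\,p(\phi|y)\,d\phi$, where $K(\theta,\phi) = \int p(\theta|x,y)\,p(x|\phi,y)\,dx$ is the transition function for a Markov chain. This suggests an “iterative” method of composition approach for sampling.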

Sampling strategy from p(θ|y)
Start somewhere (starting value θ = θ[0]).
Cycle 1: sample x[1] from p(x | θ[0], y); sample θ[1] from p(θ | x[1], y).
Cycle 2: sample x[2] from p(x | θ[1], y); sample θ[2] from p(θ | x[2], y).
Etc. It’s like sampling from “E-steps” and “M-steps”.

What are these Full Conditional Densities (FCD) ?
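Recall the “complete” likelihood function, $\propto \theta^{y_1 - x + y_4}(1-\theta)^{y_2+y_3}$, with x the count in the ½ sub-cell. Assume the prior on θ is “flat”. FCD: θ | x, y ~ Beta(α = y1 − x + y4 + 1, β = y2 + y3 + 1); x | θ, y ~ Binomial(n = y1, p = 2/(θ+2)).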

IML code for Chained Data Augmentation Example

    proc iml;
      seed1 = 4;
      ncycle = 10000;                     /* total number of samples */
      theta = j(ncycle,1,0);
      y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
      beta = y2+y3+1;
      theta[1] = ranuni(seed1);           /* starting value: initial draw between 0 and 1 */
      do cycle = 2 to ncycle;
        p = 2/(2+theta[cycle-1]);
        xvar = ranbin(seed1,y1,p);        /* x | theta, y ~ Binomial(y1, p) */
        alpha = y1+y4-xvar+1;
        xalpha = rangam(seed1,alpha);
        xbeta = rangam(seed1,beta);
        theta[cycle] = xalpha/(xalpha+xbeta);   /* theta | x, y ~ Beta(alpha, beta) */
      end;
      create parmdata var {theta xvar};
      append;
    run;
    data parmdata; set parmdata; cycle = _n_; run;
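A minimal Python sketch of the same chain, alternating the binomial and beta full conditionals. The chain length (3000), the burn-in of 500 and the Bernoulli-sum binomial draw are assumptions made to keep the example short and dependency-free.

```python
import random

def data_augmentation_chain(y1, y2, y3, y4, n_cycles=3000, seed=4):
    rng = random.Random(seed)
    theta = rng.random()                     # starting value between 0 and 1
    chain = []
    for _ in range(n_cycles):
        p = 2.0 / (2.0 + theta)
        # x | theta, y ~ Binomial(y1, p), drawn here as a sum of Bernoulli trials
        x = sum(rng.random() < p for _ in range(y1))
        # theta | x, y ~ Beta(y1 - x + y4 + 1, y2 + y3 + 1)
        theta = rng.betavariate(y1 - x + y4 + 1, y2 + y3 + 1)
        chain.append(theta)
    return chain

chain = data_augmentation_chain(1997, 906, 904, 32)
kept = chain[500:]                           # discard an assumed burn-in of 500 cycles
post_mean = sum(kept) / len(kept)
```

The posterior mean should sit close to the maximum likelihood estimate (about 0.036) for data of this size.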

Trace Plot

    proc gplot data=parmdata; plot theta*cycle; run;

Trace plot with a “bad” starting value (same PROC GPLOT call). One should discard the first “few” samples to ensure that one is truly sampling from p(θ|y). The starting value should have no impact (“convergence in distribution”). How to decide on this stuff? See Cowles and Carlin (1996). Burn-in: throw away the first 1000 samples as “burn-in”.

Histogram of samples post burn-in

    proc univariate data=parmdata;
      where cycle > 1000;
      var theta;
      histogram / normal(color=red mu= sigma=0.0060);
    run;

The histogram shows the Bayesian inference (N = 9000 samples post burn-in); the red normal curve is the asymptotic likelihood inference. The output reports the posterior mean and posterior standard deviation, plus the observed (Bayesian) and asymptotic (likelihood) 5% and 95% quantiles.

Zooming in on Trace Plot
Hints of autocorrelation. Expected with Markov chain Monte Carlo simulation schemes. The number of drawn samples is NOT equal to the number of independent draws. The greater the autocorrelation, the greater the problem: need more samples!

Sample autocorrelation

    proc arima data=parmdata plots(only)=series(acf);
      where cycle > 1000;
      identify var=theta nlag= outcov=autocov;
    run;

Autocorrelation Check for White Noise (to lag 6, Pr > ChiSq < .0001): autocorrelations at lags 1 to 6 are 0.497, 0.253, 0.141, 0.079, 0.045, 0.029.

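How to estimate the effective number of independent samples (ESS)
Consider the posterior mean based on m samples, $\bar\theta = \frac{1}{m}\sum_{i=1}^{m}\theta^{[i]}$. Initial positive sequence estimator (Geyer, 1992; Sorensen and Gianola, 1995) of its variance: $\widehat{\text{Var}}(\bar\theta) = \frac{1}{m}\left(-\hat\gamma_0 + 2\sum_{k=0}^{t}\hat\Gamma_k\right)$, where $\hat\Gamma_k = \hat\gamma_{2k} + \hat\gamma_{2k+1}$ is the sum of adjacent lag autocovariances and $\hat\gamma_j$ is the lag-j autocovariance. Then $\text{ESS} = \hat\gamma_0/\widehat{\text{Var}}(\bar\theta)$.

Initial positive sequence estimator
Choose t such that all $\hat\Gamma_k > 0$ for k = 0, 1, …, t. SAS PROC MCMC chooses a slightly different cutoff (see documentation). Extensive autocorrelation across lags leads to a smaller ESS.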

SAS code
Recall: 9000 MCMC post burn-in cycles.

    %macro ESS1(data,variable,startcycle,maxlag);
    data _null_; set &data nobs=_n; call symputx('nsample',_n); run;
    proc arima data=&data;
      where cycle > &startcycle;      /* cycle counter as named in parmdata */
      identify var=&variable nlag=&maxlag outcov=autocov;
    run;
    proc iml;
      use autocov; read all var{'COV'} into cov;
      nsample = &nsample;
      nlag2 = nrow(cov)/2;
      Gamma = j(nlag2,1,0);
      cutoff = 0; t = 0;
      do while (cutoff = 0);
        t = t+1;
        Gamma[t] = cov[2*(t-1)+1] + cov[2*(t-1)+2];   /* sum of adjacent lag autocovariances */
        if Gamma[t] < 0 then cutoff = 1;
        if t = nlag2 then do;
          print "Too much autocorrelation";
          print "Specify a larger max lag";
          stop;
        end;
      end;
      varm = (-Cov[1] + 2*sum(Gamma)) / nsample;
      ESS = Cov[1]/varm;              /* effective sample size */
      stdm = sqrt(varm);              /* Monte Carlo standard error */
      parameter = "&variable";
      print parameter stdm ESS;
    run;
    %mend ESS1;

Executing %ESS1

    %ESS1(parmdata,theta,1000,1000);

Recall: 1000 MCMC burn-in cycles. The macro prints parameter, stdm and ESS for theta; here ESS = 2967, i.e. information equivalent to drawing 2967 independent draws from the posterior density.
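The initial positive sequence calculation can be sketched in Python as follows. One labeled difference from the macro above: this sketch stops before adding the first non-positive pair, which is how the estimator is usually defined. The AR(1) test chain with autocorrelation 0.5 is an assumed illustration, for which the ESS should be roughly one third of the chain length.

```python
import random

def effective_sample_size(x, max_pairs=500):
    """Initial positive sequence estimator: ESS = gamma_0 / Var(mean)."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]

    def acov(lag):
        return sum(z[i] * z[i + lag] for i in range(n - lag)) / n

    gamma0 = acov(0)
    total = 0.0
    for t in range(max_pairs):
        pair = acov(2 * t) + acov(2 * t + 1)   # sum of adjacent lag autocovariances
        if pair <= 0.0:                        # stop at the first non-positive pair
            break
        total += pair
    var_mean = (-gamma0 + 2.0 * total) / n     # variance of the chain mean
    return gamma0 / var_mean

# Assumed illustration: AR(1) chain with lag-1 autocorrelation 0.5.
rng = random.Random(11)
chain, value = [], 0.0
for _ in range(20000):
    value = 0.5 * value + rng.gauss(0.0, 1.0)
    chain.append(value)
ess = effective_sample_size(chain)
```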

How large of an ESS should I target?
Routinely…in the thousands or greater. Depends on what you want to estimate. Recommend no less than 100 for estimating “typical” location parameters: mean, median, etc. Several times that for “typical” dispersion parameters like variance. Want to provide key percentiles? i.e., 2.5th , 97.5th percentiles? Need to have ESS in the thousands! See Raftery and Lewis (1992) for further direction.

Worthwhile to consider this sampling strategy?
Not too much difference, if any, with likelihood inference. But how about smaller samples? e.g., y1=200,y2=91,y3=90,y4=3 Different story

Gibbs sampling: origins (Geman and Geman, 1984).
Gibbs sampling was first developed in statistical physics in relation to a spatial inference problem. Problem: the true image θ was corrupted by a stochastic process to produce an observable image y (data). Objective: restore or estimate the true image θ in the light of the observed image y. Inference on θ was based on the Markov random field joint posterior distribution, through successively drawing from updated FCD, which were rather easy to specify. These FCD each happened to be Gibbs distributions. The misnomer has been used since to describe a rather general process.

Gibbs sampling
Extension of chained data augmentation to the case of several unknown parameters. Consider p = 3 unknown parameters θ1, θ2, θ3, with joint posterior density p(θ1, θ2, θ3 | y). Gibbs sampling: MCMC sampling strategy where all FCD are recognizable: p(θ1 | θ2, θ3, y), p(θ2 | θ1, θ3, y), p(θ3 | θ1, θ2, y).

Gibbs sampling: the process
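1) Start with some “arbitrary” starting values (but within the allowable parameter space).
2) Draw θ1 from p(θ1 | θ2, θ3, y).
3) Draw θ2 from p(θ2 | θ1, θ3, y).
4) Draw θ3 from p(θ3 | θ1, θ2, y).
5) Repeat steps 2)-4) m times, always conditioning on the most recent values.
Steps 2-4 constitute one cycle of Gibbs sampling; one cycle = one random draw from p(θ1, θ2, θ3 | y). m: length of the Gibbs chain.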

General extension of Gibbs sampling
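When there are d parameters and/or blocks of parameters θ1, θ2, …, θd: again specify starting values θ1(0), θ2(0), …, θd(0). Sample from the FCDs in cycle k+1:
Sample θ1(k+1) from p(θ1 | θ2(k), θ3(k), …, θd(k), y).
Sample θ2(k+1) from p(θ2 | θ1(k+1), θ3(k), …, θd(k), y).
…
Sample θd(k+1) from p(θd | θ1(k+1), θ2(k+1), …, θd−1(k+1), y).
Generically, sample θi from p(θi | θ−i, y).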

Throw away enough burn-in samples (k<m)
θ(k+1), θ(k+2), ..., θ(m) are a realization of a Markov chain with equilibrium distribution p(θ|y). The m−k joint samples θ(k+1), θ(k+2), ..., θ(m) are then considered to be random drawings from the joint posterior density p(θ|y). Individually, the m−k samples θj(k+1), θj(k+2), ..., θj(m) are random samples of θj from the marginal posterior density p(θj|y), j = 1,2,…,d; i.e., θ−j are “nuisance” variables if interest is directed on θj.
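A toy Python illustration of the cycle-and-burn-in logic, under an assumed target: a standard bivariate normal with correlation ρ, whose full conditionals are N(ρ·other, 1 − ρ²). The target, starting values and chain lengths are chosen for the example and are not from the slides.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_cycles=20000, burn_in=1000, seed=1):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    theta1, theta2 = 5.0, -5.0                    # deliberately poor starting values
    kept = []
    for cycle in range(n_cycles):
        theta1 = rng.gauss(rho * theta2, sd)      # draw from p(theta1 | theta2)
        theta2 = rng.gauss(rho * theta1, sd)      # draw from p(theta2 | theta1)
        if cycle >= burn_in:                      # keep only post burn-in cycles
            kept.append((theta1, theta2))
    return kept

def sample_correlation(pairs):
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    s12 = sum((a - m1) * (b - m2) for a, b in pairs)
    s11 = sum((a - m1) ** 2 for a, _ in pairs)
    s22 = sum((b - m2) ** 2 for _, b in pairs)
    return s12 / math.sqrt(s11 * s22)

kept = gibbs_bivariate_normal(rho=0.8)
corr = sample_correlation(kept)
mean1 = sum(a for a, _ in kept) / len(kept)
```

The retained θ1 values alone are draws from its marginal, N(0, 1) here, which is the “nuisance variable” point made above.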

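Mixed model example with known variance components, flat prior on β.
Recall: y = Xβ + Zu + e, where u ~ N(0, G) and e ~ N(0, R). Write θ = (β′, u′)′, i.e. θ | y ~ N(θ̂, C⁻¹), where C is the coefficient matrix of the mixed model equations, r their right-hand side and θ̂ = C⁻¹r their solution. ALREADY KNOW JOINT POSTERIOR DENSITY!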

FCD for mixed effects model with known variance components
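OK, it is really pointless to use MCMC here, but let’s demonstrate. It can be shown that the FCD are θi | θ−i, y ~ N(θ̃i, 1/cii), with θ̃i = (ri − Σ_{j≠i} cij θj)/cii, where ri is the ith element of the right-hand side of the mixed model equations, cij is the element in the ith row and jth column of their coefficient matrix C, and cii is the ith diagonal element of C.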

Two ways to sample β and u
1. Block draw from p(θ|y): faster MCMC mixing (less/no autocorrelation across MCMC cycles), but slower computing time (depending on the dimension of θ), i.e. compute the Cholesky of C. Some alternative strategies are available (Garcia-Cortes and Sorensen, 1995).
2. Series of univariate draws from p(θi | θ−i, y): faster computationally, but slower MCMC mixing. Partial solution: “thinning the MCMC chain”, e.g., save every 10th cycle rather than every cycle.

Example: A split plot in time example (Data from Kuehl, 2000, pg.493)
Experiment designed to explore mechanisms for early detection of phlebitis during amiodarone therapy. Three intravenous treatments: (A1) Amiodarone (A2) the vehicle solution only (A3) a saline solution. 5 rabbits/treatment in a completely randomized design. 4 repeated measures/animal (30 min. intervals)

SAS data step

    data ear;
      input trt rabbit time temp;
      y = temp; A = trt; B = time;
      trtrabbit = compress(trt||'_'||rabbit);
      wholeplot = trtrabbit;
    cards;

etc.

The data (“spaghetti plot”)

Profile (Interaction) means plots

A split plot model assumption for repeated measures
[Diagram: Treatment 1 with Rabbits 1, 2 and 3, each measured at Times 1 to 4.] Rabbit is the experimental unit for treatment; rabbit is the block for time.

Suppose CS assumption was appropriate
CONDITIONAL SPECIFICATION: model the variation between experimental units (i.e. rabbits). This is a partially nested or split-plot design, i.e. for treatments, rabbit is the experimental unit; for time, rabbit is the block!

Analytical (non-simulation) Inference based on PROC MIXED
Let’s assume “known” variance components (σ²u = 0.1 and σ²e = 0.6, held fixed by the PARMS statement) and flat priors on the fixed effects, p(β) ∝ 1.

    title 'Split Plot in Time using Mixed';
    title2 'Known Variance Components';
    proc mixed data=ear noprofile;
      class trt time rabbit;
      model temp = trt time trt*time / solution;
      random rabbit(trt);
      parms (0.1) (0.6) / hold = 1,2;
      ods output solutionf = solutionf;
    run;
    proc print data=solutionf;
      where estimate ne 0;
    run;

(Partial) Output Obs Effect trt time Estimate StdErr DF 1 Intercept _
0.2200 0.3742 12 2 2.3600 0.5292 3 5 0.4899 36 6 7 9 trt*time 0.6928 10 11 13 0.3200 14 15 0.5800

MCMC inference
First set up dummy variables.

    /* Based on the zero out last level restrictions */
    proc transreg data=ear design order=data;
      model class(trt|time / zero=last);
      id y trtrabbit;
      output out=recodedsplit;
    run;
    proc print data=recodedsplit (obs=10);
      var intercept &_trgind;
    run;

Corner parameterization is implicit in SAS linear models software.

Partial Output (First two rabbits)
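Columns: Obs, _NAME_, Intercept, trt1, trt2, time1, time2, time3, trt*time indicators, trt, time, y, trtrabbit. Surviving y values: rabbit 1_1 (Obs 1 to 4): −0.3, −0.2, 1.2, 3.1; rabbit 1_2 (Obs 5 to 8): −0.5, 2.2, 3.3, 3.7; rabbit 1_3 (Obs 9 to 10): −1.1, 2.4. The indicator columns form part of the X matrix (full rank).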

MCMC using PROC IML
Full code available online.

    proc iml;
      seed = &seed;
      nburnin = 5000;        /* number of burn in samples */
      total = ;              /* total number of Gibbs cycles beyond burnin */
      thin = 10;             /* saving every "thin" */
      ncycle = total/thin;   /* leaving a total of ncycle saved samples */

Key subroutine (univariate sampling)

    start gibbs;   /* univariate Gibbs sampler */
      do j = 1 to dim;   /* dim = p + q */
        /* generate from full conditionals for fixed and random effects */
        solt = wry[j] - coeff[j,]*solution + coeff[j,j]*solution[j];
        solt = solt/coeff[j,j];
        vt = 1/coeff[j,j];
        solution[j] = solt + sqrt(vt)*rannor(seed);
      end;
    finish gibbs;
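The subroutine's logic can be sketched in Python on a toy system. The 2×2 coefficient matrix and right-hand side below are assumptions for illustration (not the rabbit data): with C = [[2, 1], [1, 2]] and r = [3, 3], the target is N(C⁻¹r, C⁻¹) with mean [1, 1] and marginal variances 2/3.

```python
import math
import random

def gibbs_cycle(coeff, rhs, solution, rng):
    """One cycle of univariate draws; the joint target is N(C^{-1} r, C^{-1})."""
    dim = len(solution)
    for j in range(dim):
        off_diag = sum(coeff[j][k] * solution[k] for k in range(dim) if k != j)
        mean_j = (rhs[j] - off_diag) / coeff[j][j]    # conditional mean
        sd_j = math.sqrt(1.0 / coeff[j][j])           # conditional standard deviation
        solution[j] = rng.gauss(mean_j, sd_j)
    return solution

# Toy system, assumed for illustration only.
coeff = [[2.0, 1.0], [1.0, 2.0]]
rhs = [3.0, 3.0]
rng = random.Random(7)
solution = [0.0, 0.0]
samples = []
for cycle in range(21000):
    gibbs_cycle(coeff, rhs, solution, rng)
    if cycle >= 1000:                                 # assumed burn-in of 1000 cycles
        samples.append(list(solution))
means = [sum(s[j] for s in samples) / len(samples) for j in range(2)]
```

Each coordinate update uses the most recent values of the others, which is what `coeff[j,]*solution` achieves in the IML code after adding back the diagonal term.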

Output samples to SAS data set called soldata

    proc means mean median std data=soldata; run;
    ods graphics on;
    %tadplot(data=soldata, var=_all_);
    ods graphics off;

%tadplot is a SAS autocall macro suited for processing MCMC samples.

Comparisons for fixed effects
MCMC (Some Monte Carlo error) EXACT (PROC MIXED) Variable Mean Median Std Dev N int 0.218 0.374 20000 TRT1 2.365 2.368 0.526 TRT2 -0.22 -0.215 0.532 TIME1 -0.902 -0.903 0.495 TIME2 0.0225 0.0203 0.491 TIME3 -0.64 -0.643 0.488 -1.915 -1.916 0.692 -1.224 -1.219 0.69 -0.063 -0.066 0.696 0.321 0.316 0.701 -0.543 -0.54 0.58 0.589 0.694 Effect trt time Estimate StdErr Intercept _ 0.2200 0.3742 1 2.3600 0.5292 2 0.4899 3 trt*time 0.6928 0.3200 0.5800

[Figure panels: trace plot, posterior density, autocorrelation plot]

Marginal/Cell Means
Effects on the previous 2-3 slides are not of particular interest. Marginal means can be derived using the contrast vectors that are used to compute least squares means in PROC GLM/MIXED/GLIMMIX etc.: lsmeans trt time trt*time / e; μAi: marginal mean for trt i. μBj: marginal mean for time j. μAiBj: cell mean for trt i, time j.

Examples of marginal/cell means
Marginal means Cell mean

Marginal/cell (“LS”) means.
MCMC (Monte Carlo error) EXACT (PROC MIXED) Variable Mean Median Std Dev A1 1.403 1.401 0.223 A2 -0.293 -0.292 A3 -0.162 0.224 B1 -0.501 -0.5 0.216 B2 0.366 0.365 0.213 B3 0.465 0.466 0.217 B4 0.932 0.931 A1B1 -0.234 -0.231 0.373 A1B2 1.382 0.371 A1B3 1.88 1.878 0.374 A1B4 2.583 0.372 A2B1 -0.584 -0.585 0.375 A2B2 -0.524 -0.526 A2B3 -0.062 -0.058 A2B4 -0.003 -0.005 0.377 A3B1 -0.684 A3B2 0.24 0.242 A3B3 -0.422 -0.423 0.376 A3B4 0.218 trt time Estimate Standard Error 1 1.4 0.2236 2 -0.29 3 -0.16 -0.5 0.216 0.3667 0.4667 4 0.9333 -0.24 0.3742 1.38 1.88 2.58 -0.58 -0.52 -0.06 -3.61E-16 -0.68 0.24 -0.42 0.22

Posterior densities of μA1, μB1, μA1B1.
Dotted lines: normal density inferences based on PROC MIXED. Solid lines: MCMC.

Generalized linear mixed models (Probit Link Model)
Stage 1: Stage 2: Stage 3:

Rethinking prior on β
A flat prior, i.e. p(β) ∝ 1, might not be the best idea for binary data, especially when the data are “sparse”. Animal breeders call this the “extreme category problem”: e.g., if all of the responses in a fixed effects subclass are either 1 or 0, then the ML/PM of the corresponding marginal mean will approach −/+ ∞. PROC LOGISTIC has the FIRTH option for this very reason. Alternative: β ~ N(0, Iσ²β). Typically, 16 < σ²β < 50 is probably sufficient on the underlying latent scale (conditionally N(0,1)).

Recall Latent Variable Concept (Albert and Chib, 1993)
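Suppose for animal i the latent variable (liability) is ℓi = xi′β + zi′u + ei, with ei ~ N(0,1). Then yi = 1 if ℓi > 0 and yi = 0 otherwise, so that Prob(yi = 1 | β, u) = Φ(xi′β + zi′u).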

Data augmentation with ℓ = {ℓi},
i.e. the distribution of Y becomes degenerate (a point mass) conditional on ℓ.

Rewrite hierarchical model
Stage 1a) Stage 1b) Those two stages define likelihood function

Joint Posterior Density
Let’s for now assume known σ²u:

FCD
Liabilities: ℓi | β, u, yi ~ N(xi′β + zi′u, 1), truncated to ℓi > 0 if yi = 1 and to ℓi ≤ 0 if yi = 0; i.e., draw from truncated normals.
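One way to draw such truncated normals is by inverting the normal cdf, sketched below in Python. It assumes a unit residual variance on the liability scale and a threshold at 0 (the Albert and Chib setup); the function name and the tiny clamp that keeps the inverse cdf away from 0 and 1 are implementation choices of this sketch.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def draw_liability(mu, y, rng):
    """Draw from N(mu, 1) truncated to (0, inf) if y == 1, or to (-inf, 0] if y == 0."""
    cut = STD_NORMAL.cdf(-mu)                  # P(liability <= 0)
    lo, hi = (cut, 1.0) if y == 1 else (0.0, cut)
    u = lo + (hi - lo) * rng.random()          # uniform on the allowed cdf range
    u = min(max(u, 1e-12), 1.0 - 1e-12)        # keep inv_cdf away from 0 and 1
    return mu + STD_NORMAL.inv_cdf(u)

rng = random.Random(3)
successes = [draw_liability(0.3, 1, rng) for _ in range(20000)]
failures = [draw_liability(0.3, 0, rng) for _ in range(20000)]
mean_success = sum(successes) / len(successes)
```

In a full sampler, mu would be the current value of xi′β + zi′u for record i, redrawn every cycle.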

FCD (cont’d) Fixed and random effects where

Alternative Sampling strategies for fixed and random effects
1. Joint multivariate draw from faster mixing…but computationally expensive? 2. Univariate draws from FCD using partitioned matrix results. Refer to Slides # 36, 37, 49 Slower mixing.

Recall “binarized” RCBD

MCMC analysis
5000 burn-in cycles, then 500,000 additional cycles, saving every 10: 50,000 saved cycles. Full conditional univariate sampling on fixed and random effects. “Known” σ²u = 0.50. Remember: no σ²e.

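Fixed effect comparison of inferences (conditional on “known” σ²u = 0.50)
MCMC:

| Variable | Mean | Median | Std Dev | N |
|---|---|---|---|---|
| intercept | 0.349 | 0.345 | 0.506 | 50000 |
| DIET1 | -0.659 | -0.654 | 0.64 | |
| DIET2 | 0.761 | 0.75 | 0.682 | |
| DIET3 | -1 | -0.993 | 0.649 | |
| DIET4 | 0.76 | 0.753 | 0.686 | |

PROC GLIMMIX, Solutions for Fixed Effects (blank cells did not survive in the transcript):

| Effect | diet | Estimate | Standard Error |
|---|---|---|---|
| Intercept | | 0.3097 | 0.4772 |
| diet | 1 | | 0.5960 |
| diet | 2 | 0.6761 | 0.6408 |
| diet | 3 | | 0.6104 |
| diet | 4 | 0.6775 | 0.6410 |
| diet | 5 | | . |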

Marginal Mean Comparisons
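MCMC, based on K′β:

| Variable | Mean | Median | Std Dev | N |
|---|---|---|---|---|
| mm1 | -0.31 | -0.302 | 0.499 | 50000 |
| mm2 | 1.11 | 1.097 | 0.562 | |
| mm3 | -0.651 | -0.644 | 0.515 | |
| mm4 | 1.109 | 1.092 | 0.563 | |
| mm5 | 0.349 | 0.345 | 0.506 | |

PROC GLIMMIX, diet Least Squares Means (blank cells did not survive in the transcript):

| diet | Estimate | Standard Error |
|---|---|---|
| 1 | | 0.4768 |
| 2 | 0.9858 | 0.5341 |
| 3 | | 0.4939 |
| 4 | 0.9872 | 0.5343 |
| 5 | 0.3097 | 0.4772 |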

Diet 1 Marginal Mean (μ+α1)

Posterior density discrepancy between MCMC and Empirical Bayes for μi?
Diet marginal means. Dotted lines: normal approximation based on PROC GLIMMIX. Solid lines: MCMC. Do we run the risk of overstating precision with conventional methods?

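MCMC, i.e., Φ(K′β), the normal cdf of the marginal means (Variable: Mean, Median, Std Dev, N; values listed as they survive): prob1: 0.391, 0.381, 0.173, 20000; prob2: 0.833, 0.864, 0.126; prob3: 0.282, 0.26, 0.157; prob4: 0.863; prob5: 0.623, 0.635.

PROC GLIMMIX, delta method (blank cells did not survive in the transcript):

| diet | Estimate | Standard Error | Mean | Standard Error Mean |
|---|---|---|---|---|
| 1 | | 0.4768 | 0.3883 | 0.1827 |
| 2 | 0.9858 | 0.5341 | 0.8379 | 0.1311 |
| 3 | | 0.4939 | 0.2769 | 0.1653 |
| 4 | 0.9872 | 0.5343 | 0.8382 | 0.1309 |
| 5 | 0.3097 | 0.4772 | 0.6216 | 0.1815 |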
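The delta-method columns can be reproduced with a few lines of Python: the probability is Φ(estimate) and its standard error is φ(estimate) × (standard error of the estimate). The simulation half is only an assumed stand-in for MCMC output: it draws the marginal mean from a normal with the GLIMMIX estimate and standard error, then transforms each draw. Function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def delta_method_probability(estimate, std_err):
    """Probit-scale estimate -> probability and its delta-method standard error."""
    return STD_NORMAL.cdf(estimate), STD_NORMAL.pdf(estimate) * std_err

def simulated_probability(estimate, std_err, n=50000, seed=3):
    """Transform normal draws of the marginal mean; return mean and 2.5%/97.5% points."""
    rng = random.Random(seed)
    probs = sorted(STD_NORMAL.cdf(rng.gauss(estimate, std_err)) for _ in range(n))
    return sum(probs) / n, probs[int(0.025 * n)], probs[int(0.975 * n)]

prob, se = delta_method_probability(0.3097, 0.4772)      # diet 5 row of the table
sim_mean, sim_lo, sim_hi = simulated_probability(0.3097, 0.4772)
```

The simulated interval stays inside (0, 1) and is asymmetric around the point estimate, whereas prob ± 1.96 × se is symmetric by construction; that is the kind of discrepancy expected near the boundaries.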

Comparison of Posterior Densities for Diet Marginal Mean Probabilities
Dotted lines: normal approximation based on PROC GLIMMIX. Solid lines: MCMC. Largest discrepancies along the boundaries.

Posterior density of Φ(μ+α1) & Φ(μ+α2)

Posterior density of Φ(μ+α2) - Φ(μ+α1)

| prob21_diff | Frequency | Percent |
|---|---|---|
| prob21_diff < 0 | 819 | 1.64 |
| prob21_diff >= 0 | 49181 | 98.36 |

Probability(Φ(μ+α2) - Φ(μ+α1) < 0) = 819/50000 = 0.0164. “Two-tailed” P-value = 2 × 0.0164 = 0.0328.

How does that compare with PROC GLIMMIX?
Estimates (blank cells did not survive in the transcript):

| Label | Estimate | Standard Error | DF | t Value | Pr > \|t\| | Mean | Standard Error Mean |
|---|---|---|---|---|---|---|---|
| diet 1 lsmean | | 0.4768 | 10000 | -0.60 | 0.5517 | 0.3883 | 0.1827 |
| diet 2 lsmean | 0.9858 | 0.5341 | | 1.85 | 0.0650 | 0.8379 | 0.1311 |
| diet1 vs diet2 dif | | 0.6433 | | -1.97 | 0.0484 | Non-est | . |

Recall, we assumed “known” σ²u, hence a normal rather than t-distributed test statistic.

What if variance components are not known?
Specify priors on variance components. Options?
1. Conjugate (scaled inverted chi-square), denoted as χ⁻²(νm, νm s²m).
2. Flat (and bounded as well?).
3. Gelman’s (2006) prior.

Relationship between Scaled Inverted Chi-Square & Inverted Gamma
Gelman’s prior Gelman’s prior

Gibbs sampling and mixed effects models
Recall the following hierarchical model:

Joint Posterior Density and FCD
FCD for β and u: same as before (normal). FCD for VC: χ⁻².
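A hedged Python sketch of the variance-component draw. It assumes independent random effects, u ~ N(0, Iσ²u), and a scaled inverted chi-square prior with ν degrees of freedom and scale s², so the full conditional is scaled inverted chi-square with ν + q degrees of freedom and scale sum u′u + νs². Function and argument names are hypothetical.

```python
import random

def draw_scaled_inv_chi2(df, scale_sum, rng):
    """Draw sigma^2 = scale_sum / chi2_df, a scaled inverted chi-square variate."""
    chi2 = rng.gammavariate(df / 2.0, 2.0)     # chi-square with df degrees of freedom
    return scale_sum / chi2

def draw_sigma2_u(u, prior_df, prior_scale, rng):
    """FCD draw for sigma^2_u given the current random effects u."""
    q = len(u)
    scale_sum = sum(v * v for v in u) + prior_df * prior_scale
    return draw_scaled_inv_chi2(prior_df + q, scale_sum, rng)

rng = random.Random(9)
draws = [draw_scaled_inv_chi2(10.0, 20.0, rng) for _ in range(50000)]
mean_draw = sum(draws) / len(draws)            # theory: scale_sum / (df - 2) = 2.5
```

A uniform prior on the standard deviation σu, as in Gelman (2006), fits this form with ν = −1 and s² = 0, so the draw uses q − 1 degrees of freedom and scale sum u′u.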

Back to Split Plot in Time Example
Empirical Bayes (EGLS based on REML):

    title 'Split Plot in Time using Mixed';
    title2 'UnKnown Variance Components';
    proc mixed data=ear covtest;
      class trt time rabbit;
      model temp = trt time trt*time / solution;
      random rabbit(trt);
      ods output solutionf = solutionf;
    run;
    proc print data=solutionf;
      where estimate ne 0;
    run;

Fully Bayes: 5000 burn-in cycles followed by subsequent cycles, saving every 10 post burn-in; Gelman’s prior on VC. Code available online.

Variance component inference
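PROC MIXED, Covariance Parameter Estimates (blank cells did not survive in the transcript):

| Cov Parm | Estimate | Standard Error | Z Value | Pr > Z |
|---|---|---|---|---|
| rabbit(trt) | | | 0.84 | 0.2001 |
| Residual | 0.5783 | 0.1363 | 4.24 | <.0001 |

MCMC:

| Variable | Mean | Median | Std Dev | N |
|---|---|---|---|---|
| sigmau | 0.127 | 0.0869 | 0.141 | 20000 |
| sigmae | 0.632 | 0.611 | 0.15 | |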

Random effects variance
[Figure: MCMC plots for the random effects variance and the residual variance]

Estimated effects ± se (sd)
PROC MIXED MCMC Effect trt time Estimate StdErr Intercept _ 0.22 0.3638 1 2.36 0.5145 2 -0.22 -0.9 0.481 0.02 3 -0.64 trt*time -1.92 0.6802 -1.22 -0.06 0.32 -0.54 0.58 Variable Mean Median Std Dev N intercept 0.217 0.214 0.388 20000 TRT1 2.363 2.368 0.55 TRT2 -0.22 -0.219 TIME1 -0.898 -0.893 0.499 TIME2 0.0206 0.0248 0.502 TIME3 -0.64 -0.635 0.501 -1.924 -1.931 0.708 -1.222 -1.22 0.71 -0.057 0.715 0.318 0.315 0.711 -0.54 -0.541 0.585 0.589

Marginal (“Least Squares”) Means
PROC MIXED MCMC Least Squares Means Effect trt time Estimate Standard Error DF 1 1.4000 0.2135 12 2 3 0.2100 36 0.3667 0.4667 4 0.9333 trt*time 0.3638 1.3800 1.8800 2.5800 4.44E-16 0.2400 0.2200 mA1 Variable Mean Median Std Dev A1 1.399 1.401 0.24 A2 -0.292 -0.29 0.237 A3 -0.16 -0.161 0.236 B1 -0.502 -0.501 0.224 B2 0.364 0.363 0.222 B3 0.467 0.466 B4 0.934 0.936 A1B1 -0.244 -0.246 0.389 A1B2 1.378 1.379 0.391 A1B3 1.882 1.88 A1B4 2.581 2.584 A2B1 -0.586 0.393 A2B2 -0.526 -0.525 0.385 A2B3 -0.058 -0.054 0.387 A2B4 0.0031 0.0017 0.386 A3B1 -0.676 -0.678 0.388 A3B2 0.239 0.241 A3B3 -0.422 -0.427 0.392 A3B4 0.219 0.216 mA1 mB1 mB1 mA1B1 mA1B1

Posterior Densities of μA1, μB1, μA1B1
Dotted lines: t densities based on estimates/stderrs from PROC MIXED. Solid lines: MCMC.

How about fully Bayesian inference in generalized linear mixed models?
Probit link GLMM. Extensions to handle unknown variance components are exactly the same given the augmented liability variables, i.e. scaled inverted chi-square is conjugate for σ²u. No “overdispersion” (σ²e) to contend with for binary data. But stay tuned for binomial/Poisson data!

Analysis of “binarized” RCBD data.
Fully Bayes: 10000 burn-in cycles, then further cycles thereafter, saving every 10; Gelman’s prior on VC.
Empirical Bayes:

    title 'Posterior inference conditional on unknown VC';
    proc glimmix data=binarize;
      class litter diet;
      model y = diet / covb solution dist=bin link=probit;
      random litter;
      lsmeans diet / diff ilink;
      /* diet coefficients restored from the labels; they are missing in the transcript */
      estimate 'diet 1 lsmean' intercept 1 diet 1 0 0 0 0 / ilink;
      estimate 'diet 2 lsmean' intercept 1 diet 0 1 0 0 0 / ilink;
      estimate 'diet1 vs diet2 dif' intercept 0 diet 1 -1 0 0 0;
    run;

Inferences on VC
MCMC (Analysis Variable: sigmau): Mean 2.048, Median 1.468, Std Dev 2.128, N 20000.

PROC GLIMMIX, Covariance Parameter Estimates:

| Method | Estimate | Standard Error |
|---|---|---|
| RSPL | 0.5783 | 0.5021 |
| Laplace | 0.6488 | 0.6410 |
| Quad | 0.6662 | 0.6573 |

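Inferences on marginal means (μ+αi)
PROC GLIMMIX (Method = Laplace), diet Least Squares Means (blank cells did not survive in the transcript):

| diet | Estimate | Standard Error | DF |
|---|---|---|---|
| 1 | | 0.5159 | 36 |
| 2 | 1.0929 | 0.5964 | |
| 3 | | 0.5335 | |
| 4 | 1.0946 | 0.5976 | |
| 5 | 0.3519 | 0.5294 | |

MCMC:

| Variable | Mean | Median | Std Dev | N |
|---|---|---|---|---|
| mm1 | -0.297 | -0.301 | 0.643 | 20000 |
| mm2 | 1.322 | 1.283 | 0.716 | |
| mm3 | -0.697 | -0.69 | 0.662 | |
| mm4 | 1.319 | 1.285 | 0.72 | |
| mm5 | 0.465 | 0.442 | 0.671 | |

The MCMC standard deviations are larger: they take into account uncertainty on the variance components.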

Posterior Densities of (μ+αi)
Dotted lines: t36 densities based on estimates and standard errors from PROC GLIMMIX (method=laplace). Solid lines: MCMC.

MCMC inferences on probabilities of “success” (based on Φ(μ+αi))

MCMC inferences on marginal probabilities: (based on )
Potentially big issues with empirical Bayes inference…dependent upon quality of VC inference & asymptotics!

Inference on Diet 1 vs. Diet 2 probabilities
MCMC:

| Variable | Mean | Median | Std Dev | N |
|---|---|---|---|---|
| Prob diet1 | 0.4 | 0.382 | 0.212 | 20000 |
| Prob diet2 | 0.857 | 0.899 | 0.137 | |
| Prob diff | 0.457 | 0.464 | 0.207 | |

PROC GLIMMIX, Estimates:

| Label | Mean | Standard Error Mean |
|---|---|---|
| diet 1 lsmean | 0.3812 | 0.1966 |
| diet 2 lsmean | 0.8628 | 0.1309 |
| diet1 vs diet2 dif | Non-est | . |

P-value =

MCMC:

| prob21_diff | Frequency | Percent |
|---|---|---|
| prob21_diff < 0 | 180 | 0.90 |
| prob21_diff >= 0 | 19820 | 99.10 |

Probability(Φ(μ+α2) - Φ(μ+α1) < 0) = 180/20000 = 0.0090 (“one-tailed”).

Any formal comparisons between GLS/REML/EB(M/PQL) and MCMC for GLMM?
Check Browne and Draper (2006). Normal data (LMM) Generally, inferences based on GLS/REML and MCMC are sufficiently close. Since GLS/REML is faster, it is the method of choice for classical assumptions. Non-normal data (GLMM). Quasi-likelihood based methods are particularly problematic in bias of point estimates and interval coverage of variance components. Side effects on fixed effects inference. Bayesian methods with diffuse priors are well calibrated for both properties for all parameters. Comparisons with Laplace not done yet.

A pragmatic take on using MCMC vs PL for GLMM under classical assumptions?
If datasets are too small to warrant asymptotic considerations, then the experiment is likely to be poorly powered. Otherwise, PL might ≈ MCMC inference. However, differences could depend on dimensionality, deviation of data distribution from normal, and complexity of design. The real big advantage of MCMC ---is multi-stage hierarchical models (see later)

Implications of design on Fully Bayes vs. PL inference for GLMM?
RCBD: Known for LMM, that inferences on treatment differences in RCBD are resilient to estimates of block VC. Inference on differences in treatment effects thereby insensitive to VC inferences in GLMM? Whole plot treatment factor comparisons in split plot designs? Greater sensitivity (i.e. whole plot VC). Sensitivity of inference for conditional versus “population-averaged” probabilities?

Ordinal Categorical Data
Back to the GF83 data. Gibbs sampling strategy laid out by Sorensen and Gianola (1995); Albert and Chib (1993). Simple extensions to what was considered earlier for linear/probit mixed models

Joint Posterior Density
Stages 1A 1B 2 2 (or something diffuse) 3

Anything different for FCD compared to probit binary?
Liabilities Thresholds: This leads to painfully slow mixing…a better strategy is based on Metropolis sampling (Cowles et al., 1996).

Fully Bayesian inference on GF83
5000 burn-in samples, then 50000 samples post burn-in, saving every 10. Diagnostic plots for σ²u.

Posterior Summaries

| Variable | Mean | Median | Std Dev | 5th Pctl | 95th Pctl |
|---|---|---|---|---|---|
| intercept | -0.222 | -0.198 | 0.669 | -1.209 | 0.723 |
| hy | 0.236 | 0.223 | 0.396 | -0.399 | 0.894 |
| age | -0.036 | -0.035 | 0.392 | -0.69 | 0.598 |
| sex | -0.172 | -0.171 | 0.393 | -0.818 | 0.48 |
| sire1 | -0.082 | -0.042 | 0.587 | -1 | 0.734 |
| sire2 | 0.116 | 0.0491 | 0.572 | -0.641 | 0.937 |
| sire3 | 0.194 | 0.106 | 0.625 | -0.64 | 1.217 |
| sire4 | -0.173 | -0.11 | 0.606 | -1.118 | 0.595 |
| sigmau | 1.362 | 0.202 | 8.658 | 0.0021 | 4.148 |
| thresh2 | 0.83 | 0.804 | 0.302 | 0.383 | 1.366 |
| probfemalecat1 | 0.609 | | 0.188 | 0.265 | 0.885 |
| probfemalecat2 | 0.827 | 0.864 | 0.148 | 0.53 | 0.986 |
| probmalecat1 | 0.539 | 0.545 | 0.183 | 0.23 | 0.836 |
| probmalecat2 | 0.79 | 0.821 | 0.154 | 0.491 | 0.974 |

Posterior densities of sex-specific cumulative probabilities (first two categories)
How would you interpret a “standard error” in this context?

Posterior densities of sex-specific probabilities (each category)

What if some FCD are not recognizable?
Examples: Poisson mixed models, logistic mixed models. Hmmm... Need a different strategy. Use Gibbs sampling whenever you can. Use Metropolis-Hastings sampling for FCD that are not recognizable. NEXT!