§❹ The Bayesian Revolution: Markov Chain Monte Carlo (MCMC)


§❹ The Bayesian Revolution: Markov Chain Monte Carlo (MCMC) Robert J. Tempelman

Simulation-based inference. Suppose you are interested in the integral/expectation E[g(x)] = ∫ g(x) f(x) dx, where f(x) is a density and g(x) is a function of interest. If you can draw random samples x1, x2, …, xn from f(x), then compute the Monte Carlo estimate ĝ = (1/n) Σ g(xi), with Monte Carlo standard error sqrt( Σ (g(xi) − ĝ)² / (n(n−1)) ). As n → ∞, ĝ → E[g(x)].
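
As a minimal illustration (not part of the original slides), the following SAS data step draws 100,000 standard normal samples and estimates E[x²] (true value 1) together with its Monte Carlo standard error; the data set and variable names here are arbitrary.

data mc;
   seed = 1234;
   do i = 1 to 100000;
      x = rannor(seed);   /* draw x ~ N(0,1) */
      gx = x*x;           /* g(x) = x**2, so E[g(x)] = 1 */
      output;
   end;
run;

proc means data=mc mean stderr;  /* stderr of the mean = Monte Carlo standard error */
   var gx;
run;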

Beauty of Monte Carlo methods. You can determine the distribution of any function of the random variable(s). Distribution summaries include means, medians, key percentiles (2.5%, 97.5%), standard deviations, etc. Generally more reliable than the delta method, especially for highly non-normal distributions.

Using the method of composition for sampling (Tanner, 1996). It involves two stages of sampling. Example: suppose Yi | λi ~ Poisson(λi) and, in turn, λi | α, β ~ Gamma(α, β). Then, marginally, Yi follows a negative binomial distribution with mean α/β and variance (α/β)(1 + β⁻¹).
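
A worked version of the marginalization behind this slide (a standard result, not spelled out on the original slide), with the Gamma(α, β) density written in rate-parameter form so that E[λ] = α/β:

$$
p(y) = \int_0^\infty \frac{\lambda^y e^{-\lambda}}{y!}\,\frac{\beta^\alpha}{\Gamma(\alpha)}\,\lambda^{\alpha-1}e^{-\beta\lambda}\,d\lambda
     = \frac{\Gamma(y+\alpha)}{y!\,\Gamma(\alpha)}\left(\frac{\beta}{\beta+1}\right)^{\alpha}\left(\frac{1}{\beta+1}\right)^{y},
$$

a negative binomial with mean α/β and variance (α/β)(1 + β⁻¹).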

Using the method of composition to sample from the negative binomial: draw λi | α, β ~ Gamma(α, β), then draw Yi | λi ~ Poisson(λi).

data new;
   seed1 = 2; alpha = 2; beta = 0.25;
   do j = 1 to 10000;
      call rangam(seed1,alpha,x);
      lambda = x/beta;
      call ranpoi(seed1,lambda,y);
      output;
   end;
run;
proc means mean var; var y; run;

The MEANS Procedure
Variable      Mean      Variance
y             7.9749    39.2638

Compare with E(y) = α/β = 2/0.25 = 8 and Var(y) = (α/β)(1 + β⁻¹) = 8*(1+4) = 40.

Another example: Student t. Draw λi | ν ~ Gamma(ν/2, ν/2), then draw ti | λi ~ Normal(0, 1/λi); then ti ~ Student t with ν degrees of freedom.

data new;
   seed1 = 29523; df = 4;
   do j = 1 to 100000;
      call rangam(seed1,df/2,x);
      lambda = x/(df/2);
      t = rannor(seed1)/sqrt(lambda);
      output;
   end;
run;
proc means mean var p5 p95; var t; run;
data check; t5 = tinv(.05,4); t95 = tinv(.95,4); run;
proc print; run;

Variable    Mean        Variance    5th Pctl    95th Pctl
t           -0.00524    2.011365    -2.1376     2.122201

Obs    t5         t95
1      -2.1319    2.13185
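
For reference (a standard result, not shown on the original slide): if λ ~ Gamma(ν/2, ν/2) and t | λ ~ N(0, 1/λ), then marginally

$$
p(t) = \int_0^\infty N(t \mid 0, 1/\lambda)\,\mathrm{Gamma}(\lambda \mid \nu/2, \nu/2)\,d\lambda
     \propto \left(1 + \frac{t^2}{\nu}\right)^{-(\nu+1)/2},
$$

i.e., a Student t density with ν degrees of freedom, which is what the composition scheme above exploits.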

Expectation-Maximization (EM). OK, EM is NOT a simulation-based inference procedure; however, it is based on data augmentation and is an important progenitor of Markov chain Monte Carlo (MCMC) methods. Recall the plant genetics example.

Data augmentation. The observed counts fall into four categories with probabilities (2+θ)/4, (1−θ)/4, (1−θ)/4 and θ/4. Augment the "data" by splitting the first cell into two cells with probabilities ½ and θ/4, giving 5 categories. The θ part of the resulting complete-data likelihood looks like a beta distribution!

Data augmentation (cont'd). Write y1 = x1 + x2, where x1 is the (missing) count in the ½ cell and x2 the count in the θ/4 cell. The joint distribution of the "complete" data is then

p(x1, x2, y2, y3, y4 | θ) ∝ (1/2)^x1 (θ/4)^x2 ((1−θ)/4)^(y2+y3) (θ/4)^y4.

Considering just the part involving the "missing data": conditional on y1 and θ, x2 ~ Binomial(y1, θ/(θ+2)).

Expectation-Maximization. Start with the complete-data log-likelihood, log L(θ | x2, y) = (x2 + y4) log θ + (y2 + y3) log(1 − θ) + constant. 1. Expectation (E-step): replace x2 by its conditional expectation given the data and the current estimate θ[t], E[x2 | y1, θ[t]] = y1 θ[t]/(θ[t] + 2).

2. Maximization (M-step): maximize the expected complete-data log-likelihood with respect to θ using first- or second-derivative methods. Setting the first derivative to zero, (E[x2] + y4)/θ − (y2 + y3)/(1 − θ) = 0, gives the update θ[t+1] = (E[x2] + y4)/(E[x2] + y2 + y3 + y4).

Recall the data:

Genotype    Probability    Data (counts)
A_B_        (2+θ)/4        y1 = 1997
aaB_        (1−θ)/4        y2 = 906
A_bb        (1−θ)/4        y3 = 904
aabb        θ/4            y4 = 32

θ → 0: close linkage in repulsion; θ → 1: close linkage in coupling; 0 ≤ θ ≤ 1.

PROC IML code:

proc iml;
y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
theta = 0.20;  /* Starting value */
do iter = 1 to 20;
   Ex2 = y1*(theta)/(theta+2);         /* E-step */
   theta = (Ex2+y4)/(Ex2+y2+y3+y4);    /* M-step */
   print iter theta;
end;
run;

iter    theta
1       0.1055303
2       0.0680147
3       0.0512031
4       0.0432646
5       0.0394234
6       0.0375429
7       0.036617
8       0.0361598
9       0.0359338
10      0.0358219
11      0.0357666
12      0.0357392
13      0.0357256
14      0.0357189
15      0.0357156
16      0.0357139
17      0.0357131
18      0.0357127
19      0.0357125
20      0.0357124

Slower than Newton-Raphson/Fisher scoring…but generally more robust to poorer starting values.

How to derive an asymptotic standard error using EM? From Louis (1982), the observed-data information equals the conditional expectation of the complete-data information minus the conditional variance of the complete-data score: I(θ; y) = E[I_complete(θ; x, y) | y, θ] − Var[S_complete(θ; x, y) | y, θ].

Finish off. Evaluating these two terms for the complete-data log-likelihood above and plugging in the converged estimate θ̂ = 0.0357 gives the observed information, and hence an asymptotic standard error for θ̂ of roughly 0.006 (compare with the posterior standard deviation obtained by MCMC below).

Stochastic Data Augmentation (Tanner, 1996). Posterior identity: p(θ | y) = ∫ p(θ | x, y) p(x | y) dx. Predictive identity: p(x | y) = ∫ p(x | φ, y) p(φ | y) dφ. Together these imply p(θ | y) = ∫ K(θ, φ) p(φ | y) dφ, where K(θ, φ) = ∫ p(θ | x, y) p(x | φ, y) dx is the transition function for a Markov chain. This suggests an "iterative" method-of-composition approach for sampling.

Sampling strategy from p(θ | y). Start somewhere (starting value θ = θ[0]). Cycle 1: sample x[1] from p(x | θ[0], y), then sample θ[1] from p(θ | x[1], y). Cycle 2: sample x[2] from p(x | θ[1], y), then sample θ[2] from p(θ | x[2], y). Etc. It's like sampling from "E-steps" and "M-steps".

What are these full conditional densities (FCD)? Recall the "complete" likelihood function and assume the prior on θ is "flat": p(θ) ∝ 1. Then the FCD are

θ | x, y ~ Beta(a = y1 − x + y4 + 1, b = y2 + y3 + 1)
x | θ, y ~ Binomial(n = y1, p = 2/(θ + 2)),

where x is the missing count in the ½ cell (so y1 − x is the count in the θ/4 cell).

IML code for the chained data augmentation example:

proc iml;
seed1 = 4;
ncycle = 10000;                /* total number of samples */
theta = j(ncycle,1,0);
y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
beta = y2+y3+1;
theta[1] = ranuni(seed1);      /* starting value: initial draw between 0 and 1 */
do cycle = 2 to ncycle;
   p = 2/(2+theta[cycle-1]);
   xvar = ranbin(seed1,y1,p);
   alpha = y1+y4-xvar+1;
   xalpha = rangam(seed1,alpha);
   xbeta = rangam(seed1,beta);
   theta[cycle] = xalpha/(xalpha+xbeta);
end;
create parmdata var {theta xvar};
append;
run;

data parmdata;
   set parmdata;
   cycle = _n_;
run;

Trace plot:

proc gplot data=parmdata;
   plot theta*cycle;
run;

Even with a "bad" starting value, the chain quickly moves into the right region. One should discard the first "few" samples to ensure that one is truly sampling from p(θ | y); the starting value should have no impact ("convergence in distribution"). How to decide on this stuff? See Cowles and Carlin (1996). Burn-in? Here, throw away the first 1000 samples as "burn-in".

Histogram of samples post burn-in:

proc univariate data=parmdata;
   where cycle > 1000;
   var theta;
   histogram / normal(color=red mu=0.0357 sigma=0.0060);
run;

The overlaid normal curve corresponds to asymptotic likelihood inference; the histogram summarizes the Bayesian inference.

N                      9000
Posterior Mean         0.03671503
Post. Std Deviation    0.00607971

Quantiles
Percent    Observed (Bayesian)    Asymptotic (Likelihood)
5.0        0.02702                0.02583
95.0       0.04728                0.04557

Zooming in on the trace plot: hints of autocorrelation, as expected with Markov chain Monte Carlo simulation schemes. The number of drawn samples is NOT equal to the number of independent draws. The greater the autocorrelation, the greater the problem, and the more samples are needed!

Sample autocorrelation:

proc arima data=parmdata plots(only)=series(acf);
   where cycle > 1000;
   identify var=theta nlag=1000 outcov=autocov;
run;

Autocorrelation Check for White Noise
To Lag    Chi-Square    Pr > ChiSq    Autocorrelations (lags 1-6)
6         3061.39       <.0001        0.497  0.253  0.141  0.079  0.045  0.029

How to estimate the effective number of independent samples (effective sample size, ESS). Consider the posterior mean based on m samples, θ̄ = (1/m) Σ θ[k]. The initial positive sequence estimator (Geyer, 1992; Sorensen and Gianola, 1995) of its variance is

Var(θ̄) ≈ [ −γ(0) + 2 Σ_{t=1}^{T} Γ(t) ] / m,

where γ(k) is the lag-k autocovariance of the chain and Γ(t) = γ(2t−2) + γ(2t−1) is the sum of adjacent-lag autocovariances. Then ESS = γ(0)/Var(θ̄), the number of independent draws that would give the same Monte Carlo variance.

Initial positive sequence estimator: choose the cutoff T as the largest t such that Γ(1), Γ(2), …, Γ(T) are all positive (i.e., truncate the sum at the first negative Γ(t)). SAS PROC MCMC chooses a slightly different cutoff (see its documentation). Extensive autocorrelation across lags leads to a smaller ESS.

SAS code (recall: 9000 MCMC post burn-in cycles):

%macro ESS1(data,variable,startcycle,maxlag);
data _null_;
   set &data nobs=_n;
   call symputx('nsample',_n);
run;
proc arima data=&data;
   where cycle > &startcycle;   /* 'cycle' is the iteration index created in parmdata */
   identify var=&variable nlag=&maxlag outcov=autocov;
proc iml;
use autocov;
read all var{'COV'} into cov;
nsample = &nsample;
nlag2 = nrow(cov)/2;
Gamma = j(nlag2,1,0);
cutoff = 0;
t = 0;
do while (cutoff = 0);
   t = t+1;
   Gamma[t] = cov[2*(t-1)+1] + cov[2*(t-1)+2];
   if Gamma[t] < 0 then cutoff = 1;
   if t = nlag2 then do;
      print "Too much autocorrelation";
      print "Specify a larger max lag";
      stop;
   end;
end;
varm = (-Cov[1] + 2*sum(Gamma)) / nsample;
ESS = Cov[1]/varm;              /* effective sample size */
stdm = sqrt(varm);              /* Monte Carlo standard error */
parameter = "&variable";
print parameter stdm ESS;
run;
%mend ESS1;

Executing %ESS1 (recall: 1000 MCMC burn-in cycles):

%ESS1(parmdata,theta,1000,1000);

parameter    stdm         ESS
theta        0.0001116    2967.1289

i.e., information equivalent to drawing 2967 independent draws from the density.

How large of an ESS should I target? Routinely…in the thousands or greater. Depends on what you want to estimate. Recommend no less than 100 for estimating “typical” location parameters: mean, median, etc. Several times that for “typical” dispersion parameters like variance. Want to provide key percentiles? i.e., 2.5th , 97.5th percentiles? Need to have ESS in the thousands! See Raftery and Lewis (1992) for further direction.

Is it worthwhile to consider this sampling strategy? Here there is not much difference, if any, from likelihood inference. But how about smaller samples, e.g., y1 = 200, y2 = 91, y3 = 90, y4 = 3? That is a different story.

Gibbs sampling: origins (Geman and Geman, 1984). Gibbs sampling was first developed in statistical physics in relation to a spatial inference problem. Problem: a true image was corrupted by a stochastic process to produce an observable image y (the data). Objective: restore or estimate the true image in light of the observed image y. Inference on the true image was based on the Markov random field joint posterior distribution, through successively drawing from updated FCD, which were rather easy to specify. These FCD each happened to be Gibbs distributions; the misnomer has been used since to describe a rather general process.

Gibbs sampling: an extension of chained data augmentation to the case of several unknown parameters. Consider p = 3 unknown parameters with joint posterior density p(θ1, θ2, θ3 | y). Gibbs sampling is an MCMC sampling strategy for the case where all FCD are recognizable: p(θ1 | θ2, θ3, y), p(θ2 | θ1, θ3, y) and p(θ3 | θ1, θ2, y).

Gibbs sampling: the process. 1) Start with some "arbitrary" starting values θ2[0], θ3[0] (but within the allowable parameter space). 2) Draw θ1[1] from p(θ1 | θ2[0], θ3[0], y). 3) Draw θ2[1] from p(θ2 | θ1[1], θ3[0], y). 4) Draw θ3[1] from p(θ3 | θ1[1], θ2[1], y). 5) Repeat steps 2)-4) m times. Steps 2)-4) constitute one cycle of Gibbs sampling; one cycle yields one (correlated) random draw from p(θ1, θ2, θ3 | y); m is the length of the Gibbs chain.

General extension of Gibbs sampling. When there are d parameters and/or blocks of parameters θ1, θ2, …, θd: again specify starting values θ2[0], …, θd[0], then sample from the FCD in cycle k+1:
Sample θ1[k+1] from p(θ1 | θ2[k], …, θd[k], y).
Sample θ2[k+1] from p(θ2 | θ1[k+1], θ3[k], …, θd[k], y).
…
Sample θd[k+1] from p(θd | θ1[k+1], …, θd−1[k+1], y).
Generically, sample each θi from its FCD given the most recent values of all other parameters and the data.
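
As a minimal generic illustration (not from the original slides), the following PROC IML sketch runs a two-parameter Gibbs sampler for a bivariate normal target with correlation rho, where each FCD is a univariate normal; the target distribution and all names here are assumptions for illustration only.

proc iml;
seed = 1;
rho = 0.8;                     /* assumed posterior correlation */
m = 5000;                      /* length of the Gibbs chain */
draws = j(m,2,0);
theta1 = 0; theta2 = 0;        /* starting values */
do k = 1 to m;
   /* FCD of theta1 | theta2 is N(rho*theta2, 1 - rho*rho) */
   theta1 = rho*theta2 + sqrt(1 - rho*rho)*rannor(seed);
   /* FCD of theta2 | theta1 is N(rho*theta1, 1 - rho*rho) */
   theta2 = rho*theta1 + sqrt(1 - rho*rho)*rannor(seed);
   draws[k,1] = theta1; draws[k,2] = theta2;
end;
postmean = draws[:,];          /* column means; should be near (0, 0) */
print postmean;
quit;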

Throw away enough burn-in samples (k < m). θ[k+1], θ[k+2], …, θ[m] are a realization of a Markov chain with equilibrium distribution p(θ | y). The m − k joint samples θ[k+1], θ[k+2], …, θ[m] are then considered to be random drawings from the joint posterior density p(θ | y). Individually, the m − k samples θj[k+1], θj[k+2], …, θj[m] are random samples of θj from its marginal posterior density p(θj | y), j = 1, 2, …, d; i.e., θ−j are "nuisance" variables if interest is directed at θj.

Mixed model example with known variance components and a flat prior on β. Recall the linear mixed model y = Xβ + Zu + e, where u | σ²u ~ N(0, I σ²u) and e | σ²e ~ N(0, I σ²e). Writing θ = (β′, u′)′ and taking p(β) ∝ 1, the joint posterior p(β, u | y, σ²u, σ²e) is multivariate normal, centered at the mixed model equation solutions with covariance matrix C⁻¹σ²e (C being the mixed model equations coefficient matrix); i.e., we ALREADY KNOW THE JOINT POSTERIOR DENSITY!

FCD for the mixed effects model with known variance components. OK, it is really pointless to use MCMC here, but let's demonstrate. It can be shown that the FCD of each element θi of θ = (β′, u′)′ is univariate normal:

θi | θ−i, y ~ N( (ri − Σ_{j≠i} cij θj)/cii , σ²e/cii ),

where ri is the ith element of the right-hand side of the mixed model equations, cij is the element in the ith row and jth column of the coefficient matrix C, and cii is its ith diagonal element.

Two ways to sample β and u.
1. A block draw from the joint FCD p(β, u | y, σ²u, σ²e): faster MCMC mixing (less/no autocorrelation across MCMC cycles), but slower computing time (depending on the dimension of θ), since it requires computing the Cholesky factor of C. Some alternative strategies are available (Garcia-Cortes and Sorensen, 1995). (See the sketch below.)
2. A series of univariate draws from the element-wise FCD above: faster computationally, but slower MCMC mixing. Partial solution: "thinning" the MCMC chain, e.g., saving every 10th cycle rather than every cycle.
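
A minimal sketch of the block-draw idea (not from the original slides): given the FCD mean vector mu (the mixed model equation solutions) and covariance V = C⁻¹σ²e, a joint draw can be obtained with a Cholesky factor; the toy matrices and all names here are assumptions.

proc iml;
/* toy coefficient matrix C, right-hand side r, and residual variance */
C = {4 1, 1 3};
r = {2, 1};
sig2e = 1;
mu = solve(C, r);              /* FCD mean: mixed model equation solutions */
V  = inv(C)*sig2e;             /* FCD covariance */
R  = root(V);                  /* upper triangular R with R`*R = V */
seed = 123;
z = j(nrow(V),1,0);
do i = 1 to nrow(V);
   z[i] = rannor(seed);        /* independent standard normals */
end;
draw = mu + T(R)*z;            /* one block draw from N(mu, V) */
print draw;
quit;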

Example: A split plot in time example (Data from Kuehl, 2000, pg.493) Experiment designed to explore mechanisms for early detection of phlebitis during amiodarone therapy. Three intravenous treatments: (A1) Amiodarone (A2) the vehicle solution only (A3) a saline solution. 5 rabbits/treatment in a completely randomized design. 4 repeated measures/animal (30 min. intervals)

SAS data step:

data ear;
   input trt rabbit time temp;
   y = temp;
   A = trt;
   B = time;
   trtrabbit = compress(trt||'_'||rabbit);
   wholeplot = trtrabbit;
   cards;
1 1 1 -0.3
1 1 2 -0.2
1 1 3 1.2
1 1 4 3.1
1 2 1 -0.5
1 2 2 2.2
1 2 3 3.3
1 2 4 3.7
etc.

The data (“spaghetti plot”)

Profile (Interaction) means plots

A split plot model assumption for repeated measures (diagram): within each treatment, rabbits are nested in treatment and times 1-4 are measured within each rabbit. RABBIT IS THE EXPERIMENTAL UNIT FOR TREATMENT; RABBIT IS THE BLOCK FOR TIME.

Suppose the compound symmetry (CS) assumption is appropriate. CONDITIONAL SPECIFICATION: model the variation between experimental units (i.e., rabbits) with a random rabbit effect. This is a partially nested or split-plot design; i.e., for treatments, the rabbit is the experimental unit, and for time, the rabbit is the block!

Analytical (non-simulation) inference based on PROC MIXED. Let's assume "known" variance components (σ²rabbit = 0.1, σ²e = 0.6, held fixed via the PARMS statement) and flat priors on the fixed effects, p(β) ∝ 1.

title 'Split Plot in Time using Mixed';
title2 'Known Variance Components';
proc mixed data=ear noprofile;
   class trt time rabbit;
   model temp = trt time trt*time / solution;
   random rabbit(trt);
   parms (0.1) (0.6) / hold = 1,2;
   ods output solutionf = solutionf;
run;
proc print data=solutionf;
   where estimate ne 0;
run;

(Partial) output:

Obs   Effect      trt   time   Estimate   StdErr   DF
1     Intercept                0.2200     0.3742   12
2     trt         1            2.3600     0.5292
3     trt         2           -0.2200
5     time              1     -0.9000     0.4899   36
6     time              2      0.02000
7     time              3     -0.6400
9     trt*time    1     1     -1.9200     0.6928
10    trt*time    1     2     -1.2200
11    trt*time    1     3     -0.06000
13    trt*time    2     1      0.3200
14    trt*time    2     2     -0.5400
15    trt*time    2     3      0.5800

MCMC inference. First set up dummy variables, using the corner parameterization (zero out the last level) implicit in SAS linear models software:

/* Based on the zero-out-last-level restrictions */
proc transreg data=ear design order=data;
   model class(trt|time / zero=last);
   id y trtrabbit;
   output out=recodedsplit;
run;
proc print data=recodedsplit (obs=10);
   var intercept &_trgind;
run;

Partial output (first 10 observations of recodedsplit), showing the full-rank X-matrix columns Intercept, trt1, trt2, time1, time2, time3 and the trt*time dummies, together with trt, time, y and trtrabbit. The y values are -0.3, -0.2, 1.2, 3.1 (rabbit 1_1); -0.5, 2.2, 3.3, 3.7 (rabbit 1_2); -1.1, 2.4 (rabbit 1_3).

MCMC using PROC IML (full code available online):

proc iml;
seed = &seed;
nburnin = 5000;       /* number of burn-in samples */
total = 200000;       /* total number of Gibbs cycles beyond burn-in */
thin = 10;            /* saving every "thin"th cycle */
ncycle = total/thin;  /* leaving a total of ncycle saved samples */

Key subroutine (univariate sampling):

start gibbs;   /* univariate Gibbs sampler */
   do j = 1 to dim;   /* dim = p + q */
      /* generate from full conditionals for fixed and random effects */
      solt = wry[j] - coeff[j,]*solution + coeff[j,j]*solution[j];
      solt = solt/coeff[j,j];
      vt = 1/coeff[j,j];
      solution[j] = solt + sqrt(vt)*rannor(seed);
   end;
finish gibbs;

Output the samples to a SAS data set called soldata:

proc means mean median std data=soldata;
run;
ods graphics on;
%tadplot(data=soldata, var=_all_);
ods graphics off;

%tadplot is a SAS autocall macro suited for processing MCMC samples.

Comparisons for fixed effects: MCMC (some Monte Carlo error; N = 20000 saved samples) vs. EXACT (PROC MIXED).

Effect          MCMC Mean   MCMC Median   MCMC Std Dev   MIXED Estimate   StdErr
Intercept       0.218                     0.374          0.2200           0.3742
trt 1           2.365       2.368         0.526          2.3600           0.5292
trt 2           -0.22       -0.215        0.532          -0.2200
time 1          -0.902      -0.903        0.495          -0.9000          0.4899
time 2          0.0225      0.0203        0.491          0.02000
time 3          -0.64       -0.643        0.488          -0.6400
trt*time 1 1    -1.915      -1.916        0.692          -1.9200          0.6928
trt*time 1 2    -1.224      -1.219        0.69           -1.2200
trt*time 1 3    -0.063      -0.066        0.696          -0.06000
trt*time 2 1    0.321       0.316         0.701          0.3200
trt*time 2 2    -0.543      -0.54                        -0.5400
trt*time 2 3    0.58        0.589         0.694          0.5800

%TADPLOT output on the intercept: trace plot, posterior density, and autocorrelation plot.

Marginal/cell means. The effects on the previous 2-3 slides are not of particular interest in themselves. Marginal means can be derived using the same contrast vectors that are used to compute least squares means in PROC GLM/MIXED/GLIMMIX, etc. (lsmeans trt time trt*time / e;). Notation: μAi is the marginal mean for trt i, μBj the marginal mean for time j, and μAiBj the cell mean for trt i and time j.

Examples of marginal/cell means under the corner parameterization: a cell mean is μAiBj = intercept + trt_i + time_j + (trt*time)_ij (with last-level effects set to zero), and a marginal mean averages the cell means over the other factor, e.g., μA1 = (1/4) Σ_j μA1Bj and μB1 = (1/3) Σ_i μAiB1. (See the sketch below.)
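
A minimal sketch of the idea (not from the original code): if the saved MCMC samples are held in a matrix with one row per saved cycle and one column per location parameter, and k is the same coefficient vector that PROC MIXED would use for a least squares mean, then each saved cycle yields one posterior draw of that marginal mean. The data set name and the equal-weight placeholder contrast below are assumptions.

proc iml;
use soldata;
read all var _NUM_ into samples;   /* assumes soldata holds the saved draws */
close soldata;
p = ncol(samples);
k = j(p,1,1/p);                    /* placeholder contrast; in practice use the LSMEANS coefficients */
mm = samples*k;                    /* one marginal-mean draw per saved MCMC cycle */
mean_mm = mm[:];                   /* posterior mean of the marginal mean */
print mean_mm;
quit;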

Marginal/cell ("LS") means: MCMC (Monte Carlo error) vs. EXACT (PROC MIXED).

Mean    MCMC Mean   MCMC Median   MCMC Std Dev   MIXED Estimate   Standard Error
A1      1.403       1.401         0.223          1.4              0.2236
A2      -0.293      -0.292                       -0.29
A3      -0.162                    0.224          -0.16
B1      -0.501      -0.5          0.216          -0.5             0.216
B2      0.366       0.365         0.213          0.3667
B3      0.465       0.466         0.217          0.4667
B4      0.932       0.931                        0.9333
A1B1    -0.234      -0.231        0.373          -0.24            0.3742
A1B2    1.382                     0.371          1.38
A1B3    1.88        1.878         0.374          1.88
A1B4    2.583                     0.372          2.58
A2B1    -0.584      -0.585        0.375          -0.58
A2B2    -0.524      -0.526                       -0.52
A2B3    -0.062      -0.058                       -0.06
A2B4    -0.003      -0.005        0.377          -3.61E-16
A3B1    -0.684                                   -0.68
A3B2    0.24        0.242                        0.24
A3B3    -0.422      -0.423        0.376          -0.42
A3B4    0.218                                    0.22

Posterior densities of μA1, μB1, μA1B1. Dotted lines: normal densities based on PROC MIXED; solid lines: MCMC.

Generalized linear mixed models (probit link model). Stage 1 (data): yi | β, u ~ Bernoulli(pi) with pi = Φ(xi′β + zi′u). Stage 2 (random effects): u | σ²u ~ N(0, I σ²u). Stage 3 (priors): priors on β and σ²u.

Rethinking the prior on β. A flat prior, p(β) ∝ 1, might not be the best idea for binary data, especially when the data are "sparse". Animal breeders call this the "extreme category problem": e.g., if all of the responses in a fixed-effects subclass are either 1 or 0, then the ML/posterior-mode estimate of the corresponding marginal mean will approach −/+∞. (PROC LOGISTIC has the FIRTH option for this very reason.) Alternative: a proper normal prior β ~ N(0, I σ²β); typically 16 < σ²β < 50 is probably sufficient on the underlying latent scale (which is conditionally N(0,1)).

Recall the latent variable concept (Albert and Chib, 1993). Suppose for animal i there is a latent liability λi = xi′β + zi′u + ei with ei ~ N(0,1), and yi = 1 if λi > 0, yi = 0 otherwise. Then P(yi = 1 | β, u) = Φ(xi′β + zi′u), the probit link model.

Data augmentation with λ = {λi}: conditional on λ, the distribution of y becomes degenerate (a point mass), p(yi | λi) = I(λi > 0)^yi I(λi ≤ 0)^(1−yi).

Rewrite the hierarchical model. Stage 1a): yi | λi follows the point-mass (indicator) distribution above. Stage 1b): λi | β, u ~ N(xi′β + zi′u, 1). Together, these two stages define the likelihood function (integrating over λ recovers the probit likelihood).

Joint posterior density. Now p(β, u, λ | y) ∝ p(y | λ) p(λ | β, u) p(u | σ²u) p(β). Let's for now assume known σ²u.

FCD for the liabilities: λi | β, u, yi is N(xi′β + zi′u, 1) truncated to (0, ∞) if yi = 1 and truncated to (−∞, 0] if yi = 0; i.e., draw from truncated normals. (A sampling sketch follows below.)
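
A minimal inverse-CDF sketch for these truncated normal draws (not part of the original code): mu stands for the current value of xi′β + zi′u, and all names and values here are assumptions for illustration.

data _null_;
   seed = 77;
   mu = 0.3;                      /* assumed current linear predictor */
   y  = 1;                        /* observed binary response */
   u  = ranuni(seed);
   if y = 1 then
      /* draw from N(mu,1) truncated to (0, infinity) */
      lambda = mu + probit( probnorm(-mu) + u*(1 - probnorm(-mu)) );
   else
      /* draw from N(mu,1) truncated to (-infinity, 0] */
      lambda = mu + probit( u*probnorm(-mu) );
   put lambda=;
run;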

FCD (cont'd) for the fixed and random effects: given λ, the model is just a linear mixed model with the liabilities as the "data" and residual variance 1, so β and u have the same (multivariate or univariate) normal FCD as before, with λ replacing y and σ²e = 1 in the mixed model equations.

Alternative sampling strategies for the fixed and random effects: 1. A joint multivariate draw from p(β, u | λ, σ²u, y): faster mixing, but computationally expensive. 2. Univariate draws from the FCD using partitioned matrix results (refer to slides #36, 37, 49, the earlier mixed-model FCD and univariate sampling slides): slower mixing.

Recall “binarized” RCBD

MCMC analysis: 5000 burn-in cycles; 500,000 additional cycles; saving every 10th gives 50,000 saved cycles. Full conditional univariate sampling on the fixed and random effects. "Known" σ²u = 0.50. Remember, there is no σ²e on the liability scale (it is fixed at 1).

Fixed effect comparison of inferences (conditional on "known" σ²u = 0.50).

MCMC (N = 50000)                           PROC GLIMMIX (Solutions for Fixed Effects)
Effect      Mean     Median   Std Dev      Estimate   Standard Error
Intercept   0.349    0.345    0.506        0.3097     0.4772
diet 1      -0.659   -0.654   0.64         -0.5935    0.5960
diet 2      0.761    0.75     0.682        0.6761     0.6408
diet 3      -1       -0.993   0.649        -0.9019    0.6104
diet 4      0.76     0.753    0.686        0.6775     0.6410
diet 5      (reference level, set to 0)    .          .

Marginal mean comparisons, based on K′β (MCMC, N = 50000) vs. PROC GLIMMIX least squares means.

Diet   MCMC Mean   MCMC Median   MCMC Std Dev   GLIMMIX Estimate   Standard Error
1      -0.31       -0.302        0.499          -0.2838            0.4768
2      1.11        1.097         0.562          0.9858             0.5341
3      -0.651      -0.644        0.515          -0.5922            0.4939
4      1.109       1.092         0.563          0.9872             0.5343
5      0.349       0.345         0.506          0.3097             0.4772

Diet 1 marginal mean (μ + α1).

Posterior density discrepancy between MCMC and empirical Bayes for the diet marginal means μi? Dotted lines: normal approximation based on PROC GLIMMIX; solid lines: MCMC. Do we run the risk of overstating precision with conventional methods?

How about probabilities of success, i.e., Φ(K′β), the normal CDF of the marginal means? (A conversion sketch follows below.)

MCMC (N = 20000)
Variable   Mean    Median   Std Dev
prob1      0.391   0.381    0.173
prob2      0.833   0.864    0.126
prob3      0.282   0.26     0.157
prob4      0.863
prob5      0.623   0.635

PROC GLIMMIX (delta method for the probability scale)
diet   Estimate   Standard Error   Mean     Standard Error
1      -0.2838    0.4768           0.3883   0.1827
2      0.9858     0.5341           0.8379   0.1311
3      -0.5922    0.4939           0.2769   0.1653
4      0.9872     0.5343           0.8382   0.1309
5      0.3097     0.4772           0.6216   0.1815
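
A minimal sketch of how such probability draws can be obtained from the saved marginal-mean samples (not from the original code; the data set soldata and the variable names mm1-mm5 are assumptions following the earlier slides):

data probs;
   set soldata;
   /* probability of success for each diet = normal CDF of its marginal-mean draw */
   prob1 = probnorm(mm1);
   prob2 = probnorm(mm2);
   prob3 = probnorm(mm3);
   prob4 = probnorm(mm4);
   prob5 = probnorm(mm5);
run;

proc means data=probs mean median std;
   var prob1-prob5;
run;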

Comparison of posterior densities for the diet marginal mean probabilities. Dotted lines: normal approximation based on PROC GLIMMIX; solid lines: MCMC. The largest discrepancies occur near the boundaries (probabilities close to 0 or 1).

Posterior densities of Φ(μ+α1) and Φ(μ+α2).

Posterior density of Φ(μ+α2) − Φ(μ+α1):

prob21_diff         Frequency   Percent
prob21_diff < 0     819         1.64
prob21_diff >= 0    49181       98.36

Probability(Φ(μ+α2) − Φ(μ+α1) < 0) = 0.0164; "two-tailed" P-value analogue = 2*0.0164 = 0.0328. (A computational sketch follows below.)
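
A minimal sketch of this tail-probability calculation from the saved draws (not from the original code; the data set and variable names are assumptions following the sketch above):

data diffs;
   set probs;                       /* from the earlier probability sketch */
   prob21_diff = prob2 - prob1;     /* draw-by-draw difference of probabilities */
run;

proc sql;
   /* posterior probability that the difference is negative */
   select mean(prob21_diff < 0) as post_prob_neg
   from diffs;
quit;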

How does that compare with PROC GLIMMIX?

Estimates
Label                Estimate   Standard Error   DF      t Value   Pr > |t|   Mean      Standard Error Mean
diet 1 lsmean        -0.2838    0.4768           10000   -0.60     0.5517     0.3883    0.1827
diet 2 lsmean        0.9858     0.5341                   1.85      0.0650     0.8379    0.1311
diet1 vs diet2 dif   -1.2697    0.6433                   -1.97     0.0484     Non-est   .

Recall, we assumed "known" σ²u, hence a normal rather than t-distributed test statistic would actually be appropriate.

What if the variance components are not known? Specify priors on the variance components. Options: 1. Conjugate (scaled inverted chi-square), denoted χ⁻²(νm, νm s²m). 2. Flat (and possibly bounded as well). 3. Gelman's (2006) prior (e.g., a uniform prior on the standard deviation).

Relationship between the scaled inverted chi-square and the inverted gamma: if σ² ~ χ⁻²(ν, s²), then equivalently σ² ~ Inverted-Gamma(shape = ν/2, scale = νs²/2). Gelman's (2006) prior, a uniform prior on the standard deviation σ, corresponds to the (improper) limiting case χ⁻²(ν = −1, s² = 0).
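
For reference (standard densities, not legible on the original slide), the scaled inverted chi-square density is

$$
p(\sigma^2 \mid \nu, s^2) = \frac{(\nu s^2/2)^{\nu/2}}{\Gamma(\nu/2)}\,(\sigma^2)^{-(\nu/2+1)}\exp\!\left(-\frac{\nu s^2}{2\sigma^2}\right),
$$

which is exactly the Inverted-Gamma(α = ν/2, β = νs²/2) density; a uniform prior on σ implies p(σ²) ∝ (σ²)^{-1/2}, the ν = −1, s² = 0 limit.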

Gibbs sampling and mixed effects models. Recall the hierarchical model: y | β, u, σ²e ~ N(Xβ + Zu, Iσ²e); u | σ²u ~ N(0, Iσ²u); p(β) ∝ 1; and scaled inverted chi-square (or Gelman) priors on σ²u and σ²e.

Joint posterior density and FCD. The FCD for β and u are the same as before (normal); the FCD for the variance components are scaled inverted chi-square (χ⁻²). (Explicit forms are sketched below.)
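
For completeness, the standard forms of these variance-component FCD under scaled inverted chi-square priors χ⁻²(νu, s²u) and χ⁻²(νe, s²e) (a standard result in the Gibbs-sampling literature for mixed models; not spelled out on the original slide) are:

$$
\sigma^2_u \mid \mathbf{u}, \mathbf{y} \sim \frac{\mathbf{u}'\mathbf{u} + \nu_u s^2_u}{\chi^2_{q+\nu_u}},
\qquad
\sigma^2_e \mid \boldsymbol\beta, \mathbf{u}, \mathbf{y} \sim \frac{\mathbf{e}'\mathbf{e} + \nu_e s^2_e}{\chi^2_{n+\nu_e}},
$$

where e = y − Xβ − Zu, q is the number of random effects, and n is the number of records.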

Back to the split plot in time example. Empirical Bayes (EGLS based on REML) via PROC MIXED versus fully Bayes MCMC: 5000 burn-in cycles, 200,000 subsequent cycles, saving every 10th post burn-in, using Gelman's prior on the variance components (full code available online).

title 'Split Plot in Time using Mixed';
title2 'UnKnown Variance Components';
proc mixed data=ear covtest;
   class trt time rabbit;
   model temp = trt time trt*time / solution;
   random rabbit(trt);
   ods output solutionf = solutionf;
run;
proc print data=solutionf;
   where estimate ne 0;
run;

Variance component inference.

PROC MIXED (Covariance Parameter Estimates)
Cov Parm       Estimate   Standard Error   Z Value   Pr > Z
rabbit(trt)    0.08336    0.09910          0.84      0.2001
Residual       0.5783     0.1363           4.24      <.0001

MCMC (N = 20000)
Variable   Mean    Median   Std Dev
sigmau     0.127   0.0869   0.141
sigmae     0.632   0.611    0.15

MCMC diagnostic plots for the random effects variance and the residual variance.

Estimated effects: PROC MIXED estimate ± SE vs. MCMC posterior mean, median and SD (MCMC N = 20000).

Effect          MIXED Estimate   StdErr   MCMC Mean   MCMC Median   MCMC Std Dev
Intercept       0.22             0.3638   0.217       0.214         0.388
trt 1           2.36             0.5145   2.363       2.368         0.55
trt 2           -0.22                     -0.22       -0.219
time 1          -0.9             0.481    -0.898      -0.893        0.499
time 2          0.02                      0.0206      0.0248        0.502
time 3          -0.64                     -0.64       -0.635        0.501
trt*time 1 1    -1.92            0.6802   -1.924      -1.931        0.708
trt*time 1 2    -1.22                     -1.222      -1.22         0.71
trt*time 1 3    -0.06                     -0.057                    0.715
trt*time 2 1    0.32                      0.318       0.315         0.711
trt*time 2 2    -0.54                     -0.54       -0.541
trt*time 2 3    0.58                      0.585       0.589

Marginal ("least squares") means with unknown variance components: PROC MIXED vs. MCMC.

Mean    MIXED Estimate   Standard Error   DF    MCMC Mean   MCMC Median   MCMC Std Dev
A1      1.4000           0.2135           12    1.399       1.401         0.24
A2      -0.2900                                 -0.292      -0.29         0.237
A3      -0.1600                                 -0.16       -0.161        0.236
B1      -0.5000          0.2100           36    -0.502      -0.501        0.224
B2      0.3667                                  0.364       0.363         0.222
B3      0.4667                                  0.467       0.466
B4      0.9333                                  0.934       0.936
A1B1    -0.2400          0.3638                 -0.244      -0.246        0.389
A1B2    1.3800                                  1.378       1.379         0.391
A1B3    1.8800                                  1.882       1.88
A1B4    2.5800                                  2.581       2.584
A2B1    -0.5800                                 -0.586                    0.393
A2B2    -0.5200                                 -0.526      -0.525        0.385
A2B3    -0.06000                                -0.058      -0.054        0.387
A2B4    4.44E-16                                0.0031      0.0017        0.386
A3B1    -0.6800                                 -0.676      -0.678        0.388
A3B2    0.2400                                  0.239       0.241
A3B3    -0.4200                                 -0.422      -0.427        0.392
A3B4    0.2200                                  0.219       0.216

Posterior densities of μA1, μB1, μA1B1. Dotted lines: t densities based on the estimates and standard errors from PROC MIXED; solid lines: MCMC.

How about fully Bayesian inference in generalized linear mixed models? For the probit link GLMM, the extensions to handle unknown variance components are exactly the same given the augmented liability variables; i.e., the scaled inverted chi-square is conjugate for σ²u. There is no "overdispersion" (σ²e) to contend with for binary data, but stay tuned for binomial/Poisson data!

Analysis of the "binarized" RCBD data: empirical Bayes (PROC GLIMMIX) versus fully Bayes MCMC with 10000 burn-in cycles, 200,000 cycles thereafter, saving every 10th, and Gelman's prior on the variance component.

title 'Posterior inference conditional on unknown VC';
proc glimmix data=binarize;
   class litter diet;
   model y = diet / covb solution dist=bin link=probit;
   random litter;
   lsmeans diet / diff ilink;
   estimate 'diet 1 lsmean' intercept 1 diet 1 0 0 0 0 / ilink;
   estimate 'diet 2 lsmean' intercept 1 diet 0 1 0 0 0 / ilink;
   estimate 'diet1 vs diet2 dif' intercept 0 diet 1 -1 0 0 0;
run;

Inferences on the variance component σ²u.

MCMC (N = 20000): mean 2.048, median 1.468, std dev 2.128.

PROC GLIMMIX Covariance Parameter Estimates
Method = RSPL:      Estimate 0.5783, Standard Error 0.5021
Method = Laplace:   Estimate 0.6488, Standard Error 0.6410
Method = Quad:      Estimate 0.6662, Standard Error 0.6573

Inferences on the marginal means (μ + αi): PROC GLIMMIX (Method = Laplace) least squares means vs. MCMC (N = 20000).

diet   GLIMMIX Estimate   Standard Error   DF    MCMC Mean   MCMC Median   MCMC Std Dev
1      -0.3024            0.5159           36    -0.297      -0.301        0.643
2      1.0929             0.5964                 1.322       1.283         0.716
3      -0.6428            0.5335                 -0.697      -0.69         0.662
4      1.0946             0.5976                 1.319       1.285         0.72
5      0.3519             0.5294                 0.465       0.442         0.671

The MCMC posterior standard deviations are larger: they take into account the uncertainty about the variance component.

Posterior densities of (μ + αi). Dotted lines: t36 densities based on the estimates and standard errors from PROC GLIMMIX (method=laplace); solid lines: MCMC.

MCMC inferences on the probabilities of "success", based on Φ(μ + αi).

MCMC inferences on marginal (population-averaged) probabilities, based on averaging Φ(·) over the random-effects distribution. There are potentially big issues with empirical Bayes inference here, dependent upon the quality of the VC inference and on asymptotics!

Inference on the Diet 1 vs. Diet 2 probabilities.

MCMC (N = 20000)
Variable (Prob)   Mean    Median   Std Dev
diet1             0.4     0.382    0.212
diet2             0.857   0.899    0.137
diff              0.457   0.464    0.207

PROC GLIMMIX (Estimates)
Label                Mean      Standard Error Mean
diet 1 lsmean        0.3812    0.1966
diet 2 lsmean        0.8628    0.1309
diet1 vs diet2 dif   Non-est   .                     (P-value = 0.0559)

MCMC
prob21_diff         Frequency   Percent
prob21_diff < 0     180         0.90
prob21_diff >= 0    19820       99.10

Probability(Φ(μ+α2) − Φ(μ+α1) < 0) = 0.0090 ("one-tailed").

Any formal comparisons between GLS/REML/EB (M/PQL) and MCMC for GLMM? Check Browne and Draper (2006). Normal data (LMM): generally, inferences based on GLS/REML and MCMC are sufficiently close; since GLS/REML is faster, it is the method of choice under classical assumptions. Non-normal data (GLMM): quasi-likelihood-based methods are particularly problematic in terms of bias of point estimates and interval coverage for variance components, with side effects on fixed-effects inference; Bayesian methods with diffuse priors are well calibrated for both properties for all parameters. Comparisons with Laplace have not been done yet.

A pragmatic take on using MCMC vs. PL for GLMM under classical assumptions? If datasets are too small to warrant asymptotic considerations, then the experiment is likely to be poorly powered; otherwise, PL inference might be approximately equal to MCMC inference. However, differences could depend on dimensionality, on how far the data distribution deviates from normal, and on the complexity of the design. The really big advantage of MCMC is multi-stage hierarchical models (see later).

Implications of design for fully Bayes vs. PL inference in GLMM? RCBD: it is known for the LMM that inferences on treatment differences in an RCBD are resilient to the estimate of the block VC; is inference on differences in treatment effects thereby insensitive to VC inference in the GLMM? Whole-plot treatment factor comparisons in split-plot designs? Greater sensitivity (to the whole-plot VC). And what about sensitivity of inference for conditional versus "population-averaged" probabilities?

Ordinal categorical data: back to the GF83 data. The Gibbs sampling strategy is laid out by Sorensen and Gianola (1995) and Albert and Chib (1993); it involves simple extensions to what was considered earlier for linear/probit mixed models.

Joint posterior density: built up in stages, as before. Stage 1a: the ordinal data given the liabilities and thresholds; Stage 1b: the liabilities given β and u; Stage 2: u given σ²u, together with a prior on the thresholds (or something diffuse); Stage 3: priors on β and σ²u. (A sketch of the category probabilities is given below.)
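
For reference (the standard threshold-model form, not legible on the original slide): with thresholds τ0 = −∞ < τ1 < τ2 < … < τC = ∞ and linear predictor ηi = xi′β + zi′u, the probability of response category k is

$$
P(Y_i = k \mid \boldsymbol\beta, \mathbf{u}, \boldsymbol\tau) = \Phi(\tau_k - \eta_i) - \Phi(\tau_{k-1} - \eta_i),
$$

equivalently, yi = k whenever the liability λi ~ N(ηi, 1) falls between τ_{k−1} and τ_k.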

Anything different in the FCD compared to the binary probit model? The liabilities are now drawn from normals truncated between the two thresholds bracketing the observed category, and the thresholds themselves have uniform FCD bounded by the adjacent liabilities. Sampling the thresholds this way leads to painfully slow mixing; a better strategy is based on Metropolis sampling (Cowles et al., 1996).

Fully Bayesian inference on GF83: 5000 burn-in samples, 50,000 samples post burn-in, saving every 10th. Diagnostic plots for σ²u.

Posterior summaries:

Variable         Mean     Median   Std Dev   5th Pctl   95th Pctl
intercept        -0.222   -0.198   0.669     -1.209     0.723
hy               0.236    0.223    0.396     -0.399     0.894
age              -0.036   -0.035   0.392     -0.69      0.598
sex              -0.172   -0.171   0.393     -0.818     0.48
sire1            -0.082   -0.042   0.587     -1         0.734
sire2            0.116    0.0491   0.572     -0.641     0.937
sire3            0.194    0.106    0.625     -0.64      1.217
sire4            -0.173   -0.11    0.606     -1.118     0.595
sigmau           1.362    0.202    8.658     0.0021     4.148
thresh2          0.83     0.804    0.302     0.383      1.366
probfemalecat1   0.609             0.188     0.265      0.885
probfemalecat2   0.827    0.864    0.148     0.53       0.986
probmalecat1     0.539    0.545    0.183     0.23       0.836
probmalecat2     0.79     0.821    0.154     0.491      0.974

Posterior densities of the sex-specific cumulative probabilities (first two categories). How would one interpret a "standard error" in this context?

Posterior densities of sex-specific probabilities (each category)

What if some FCD are not recognizable? Examples: Poisson mixed models, logistic mixed models. Hmmm, we need a different strategy: use Gibbs sampling whenever you can, and use Metropolis-Hastings sampling for the FCD that are not recognizable. NEXT!