# Imperial College, London

Introduction to Bayesian inference and computation for social science data analysis
Nicky Best
Imperial College, London

Outline
- Overview of Bayesian methods
  - Illustration of conjugate Bayesian inference
  - MCMC methods
- Examples illustrating:
  - Analysis using informative priors
  - Hierarchical priors, meta-analysis and evidence synthesis
  - Adjusting for data quality
  - Model uncertainty
- Discussion

Overview of Bayesian inference and computation

Overview of Bayesian methods
Bayesian methods have been widely applied in many areas:
- medicine / epidemiology / genetics
- ecology / environmental sciences
- finance
- archaeology
- political and social sciences, ...

Motivations for adopting a Bayesian approach vary:
- natural and coherent way of thinking about science and learning
- pragmatic choice that is suitable for the problem in hand

Overview of Bayesian methods
Medical context: FDA draft guidance: “Bayesian statistics…provides a coherent method for learning from evidence as it accumulates”

Evidence can accumulate in various ways:
- Sequentially
- Measurement of many ‘similar’ units (individuals, centres, sub-groups, areas, periods, ...)
- Measurement of different aspects of a problem

Evidence can take different forms:
- Data
- Expert judgement

Overview of Bayesian methods
- Bayesian approach also provides a formal framework for propagating uncertainty
  - Well suited to building complex models by linking together multiple sub-models
  - Can obtain estimates and uncertainty intervals for any parameter, function of parameters or predictive quantity of interest
- Bayesian inference doesn’t rely on asymptotics or analytic approximations
  - Arbitrarily wide range of models can be handled using the same inferential framework
  - Focus on specifying realistic models, not on choosing an analytically tractable approximation

Bayesian inference
Distinguish between
- x: known quantities (data)
- θ: unknown quantities (e.g. regression coefficients, future outcomes, missing observations)

Fundamental idea: use probability distributions to represent uncertainty about unknowns
- Likelihood, the model for the data: p(x | θ)
- Prior distribution, representing current uncertainty about unknowns: p(θ)
- Applying Bayes theorem gives the posterior distribution

$$p(\theta \mid x) = \frac{p(\theta)\, p(x \mid \theta)}{\int p(\theta)\, p(x \mid \theta)\, d\theta} \propto p(\theta)\, p(x \mid \theta)$$

Conjugate Bayesian inference
Example: election poll (from Franklin, 2004*)
- Imagine an election campaign where (for simplicity) we have just a Government/Opposition vote choice.
- We enter the campaign with a prior distribution for θ, the proportion supporting Government. This is p(θ).
- As the campaign begins, we get polling data. How should we change our estimate of Government’s support?

*Adapted from Charles Franklin’s Essex Summer School course slides:

Conjugate Bayesian inference
Data and likelihood
- Each poll consists of n voters, x of whom say they will vote for Government and n - x for the Opposition.
- If we assume we have no information to distinguish voters in their probability of supporting Government, then x has a binomial distribution, where θ is the proportion supporting Government:

$$p(x \mid \theta) = \binom{n}{x}\, \theta^{x} (1-\theta)^{n-x} \propto \theta^{x} (1-\theta)^{n-x}$$

- This binomial distribution is the likelihood p(x | θ)

Conjugate Bayesian inference
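Prior
- We need to specify a prior that
  - expresses our uncertainty about the election (before it begins)
  - conforms to the nature of the parameter θ, i.e. is continuous but bounded between 0 and 1
- A convenient choice is the Beta distribution

$$p(\theta) = \mathrm{Beta}(\theta;\, a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1} \propto \theta^{a-1} (1-\theta)^{b-1}$$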

Conjugate Bayesian inference
Beta(a, b) distribution can take a variety of shapes depending on its two parameters a and b
- Mean of Beta(a, b) distribution = a/(a+b)
- Variance of Beta(a, b) distribution = ab / [(a+b)²(a+b+1)]
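As a rough illustration of the two moment formulas, the Python sketch below computes the mean and variance of a Beta(a, b) distribution, using the standard variance ab / [(a+b)²(a+b+1)], and inverts them to recover a and b from a target mean and SD. The function names and the Beta(2, 6) example are illustrative assumptions, not part of the original slides.

```python
import math


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_var(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


def beta_from_moments(mean, sd):
    """Moment matching: find (a, b) giving the requested mean and SD."""
    total = mean * (1 - mean) / sd ** 2 - 1  # this is a + b
    return mean * total, (1 - mean) * total


# Hypothetical example: Beta(2, 6)
a, b = 2.0, 6.0
m = beta_mean(a, b)
s = math.sqrt(beta_var(a, b))
a_back, b_back = beta_from_moments(m, s)
print(m, s, a_back, b_back)
```

Moment matching of this kind is one simple way to turn a belief stated as "mean and SD" into Beta parameters.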

Conjugate Bayesian inference
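Posterior
- Combining a beta prior with the binomial likelihood gives a posterior distribution

$$p(\theta \mid x, n) \propto p(x \mid \theta, n)\, p(\theta) \propto \theta^{x}(1-\theta)^{n-x}\, \theta^{a-1}(1-\theta)^{b-1} = \theta^{x+a-1}(1-\theta)^{n-x+b-1}$$

- This is the kernel of a Beta(x + a, n - x + b) distribution
- When prior and posterior come from the same family, the prior is said to be conjugate to the likelihood
  - Occurs when prior and likelihood have the same ‘kernel’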

Conjugate Bayesian inference
- Suppose I believe that Government only has the support of half the population, and I think that estimate has a standard deviation of about 0.05
  - This is approximately a Beta(50, 50) distribution
- We observe a poll with 200 respondents, 120 of whom (60%) say they will vote for Government
- This produces a posterior which is a Beta(120+50, 80+50) = Beta(170, 130) distribution

Conjugate Bayesian inference
- Prior mean, E(θ) = 50/100 = 0.5
- Posterior mean, E(θ | x, n) = 170/300 = 0.57
- Posterior SD, √Var(θ | x, n) = 0.029
- Frequentist estimate is based only on the data:

$$\hat{\theta} = \frac{120}{200} = 0.6, \qquad SE(\hat{\theta}) = \sqrt{\frac{0.6 \times 0.4}{200}} = 0.035$$
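The numbers on this slide can be checked with a few lines of Python. This is a minimal sketch using the Beta(50, 50) prior and the 120-out-of-200 poll from the example; the helper name `beta_binomial_update` is an illustrative assumption.

```python
import math


def beta_binomial_update(a, b, x, n):
    """Posterior Beta parameters after x 'successes' in n binomial trials."""
    return a + x, b + (n - x)


# Prior Beta(50, 50); poll with 120 of 200 respondents supporting Government
a_post, b_post = beta_binomial_update(50, 50, 120, 200)
total = a_post + b_post
post_mean = a_post / total
post_sd = math.sqrt(a_post * b_post / (total ** 2 * (total + 1)))

# Frequentist estimate uses the poll data only
theta_hat = 120 / 200
se_hat = math.sqrt(theta_hat * (1 - theta_hat) / 200)

print(a_post, b_post)                            # posterior Beta parameters
print(round(post_mean, 2), round(post_sd, 3))    # posterior mean and SD
print(theta_hat, round(se_hat, 3))               # frequentist estimate and SE
```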

Conjugate Bayesian inference
A harder problem
- What is the probability that Government wins?
  - It is not .57 or .60. Those are expected vote shares, not the probability of winning. How to answer this?
- Frequentists have a hard time with this one. They can obtain a p-value for testing H0: θ > 0.5, but this isn’t the same as the probability that Government wins (it is actually the probability of observing data more extreme than 120 out of 200 if H0 is true)
- Easy from the Bayesian perspective: calculate Pr(θ > 0.5 | x, n), the posterior probability that θ > 0.5
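A minimal sketch of that calculation, assuming the Beta(170, 130) posterior from the poll example. The tail area is obtained by simple midpoint-rule integration of the Beta density; the function names and the number of integration steps are illustrative choices.

```python
import math


def beta_pdf(theta, a, b):
    """Beta(a, b) density, computed on the log scale to avoid overflow."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta)
                    + (b - 1) * math.log(1 - theta))


def prob_greater(cut, a, b, steps=20000):
    """Pr(theta > cut) by midpoint-rule integration over (cut, 1)."""
    width = (1 - cut) / steps
    return width * sum(beta_pdf(cut + (i + 0.5) * width, a, b)
                       for i in range(steps))


p_win = prob_greater(0.5, 170, 130)
print(round(p_win, 3))  # posterior probability that theta > 0.5
```

Under these assumptions the probability comes out at roughly 0.99, which is quite different from the expected vote shares of 0.57 or 0.60.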

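Bayesian computation
- All Bayesian inference is based on the posterior distribution
- Summarising posterior distributions involves integration

$$\text{Mean: } E(\theta \mid x) = \int \theta\, p(\theta \mid x)\, d\theta$$

$$\text{Prediction: } p(x_{new} \mid x) = \int p(x_{new} \mid \theta)\, p(\theta \mid x)\, d\theta$$

$$\text{Probabilities: } \Pr(\theta \in (c_1, c_2) \mid x) = \int_{c_1}^{c_2} p(\theta \mid x)\, d\theta$$

- Except for conjugate models, integrals are usually analytically intractable
- Use Monte Carlo (simulation) integration (MCMC)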

Bayesian computation Suppose we didn’t know how to analytically integrate the Beta(170, 130) posterior… ….but we do know how to simulate from a Beta
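A minimal sketch of this idea, using only the Python standard library: draw independent samples from the Beta(170, 130) posterior and summarise them. The seed and the number of draws are arbitrary choices.

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility only
n_draws = 100_000
draws = [random.betavariate(170, 130) for _ in range(n_draws)]

# Monte Carlo estimates of posterior summaries
mc_mean = sum(draws) / n_draws
mc_prob = sum(d > 0.5 for d in draws) / n_draws   # Pr(theta > 0.5 | x, n)
ordered = sorted(draws)
mc_interval = (ordered[int(0.025 * n_draws)], ordered[int(0.975 * n_draws)])

print(mc_mean, mc_prob, mc_interval)
```

The sample mean should be close to the exact posterior mean of 170/300, and the sample quantiles give an approximate 95% interval without any analytic integration.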

Bayesian computation Monte Carlo integration
- Suppose we have samples θ(1), θ(2), …, θ(n) from p(θ | x). Then

$$E(\theta \mid x) = \int \theta\, p(\theta \mid x)\, d\theta \approx \frac{1}{n} \sum_{i=1}^{n} \theta^{(i)}$$

- Can also use samples to estimate posterior tail area probabilities, percentiles, variances etc.
- Difficult to generate independent samples when the posterior is complex and high dimensional
- Instead, generate dependent samples from a Markov chain having p(θ | x) as its stationary distribution → Markov chain Monte Carlo (MCMC)
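To make the Markov chain idea concrete, here is a minimal random-walk Metropolis sketch targeting the posterior from the poll example (Beta(50, 50) prior, 120 of 200). Metropolis is one simple MCMC algorithm chosen for illustration; the step size, chain length, burn-in and seed are all assumptions, and the slides do not say which sampler was used.

```python
import math
import random


def log_post(theta, x=120, n=200, a=50, b=50):
    """Unnormalised log posterior: binomial likelihood times Beta prior."""
    if not 0 < theta < 1:
        return -math.inf
    return (x + a - 1) * math.log(theta) + (n - x + b - 1) * math.log(1 - theta)


def metropolis(n_iter, start=0.5, step=0.05, seed=1):
    """Random-walk Metropolis chain with a normal proposal."""
    rng = random.Random(seed)
    theta, chain = start, []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0, step)
        log_ratio = log_post(proposal) - log_post(theta)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        chain.append(theta)
    return chain


chain = metropolis(20_000)
kept = chain[2_000:]                 # discard burn-in
mcmc_mean = sum(kept) / len(kept)
print(mcmc_mean)
```

Because successive draws are dependent, more iterations are needed than with independent sampling to reach the same precision.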

Illustrative Examples

Borrowing strength
- Bayesian learning → borrowing “strength” (precision) from other sources of information
- Informative prior is one such source
  - “today’s posterior is tomorrow’s prior”
  - relevance of prior information to the current study must be justified

Informative priors Example 1: Western and Jackman (1994)*
- Example of regression analysis in comparative research
- What explains cross-national variation in union density?
  - Union density is defined as the percentage of the work force who belong to a labour union
- Two issues
  - Philosophical: data represent all available observations from a population → conventional (frequentist) analysis based on long-run behaviour of a repeatable data mechanism is not appropriate
  - Practical: small, collinear dataset yields imprecise estimates of regression effects

* Slides adapted from Jeff Grynaviski:

Informative priors Competing theories
- Wallerstein: union density depends on the size of the civilian labour force (LabF)
- Stephens: union density depends on industrial concentration (IndC)
- Note: these two predictors are correlated
- Control variable: presence of a left-wing government (LeftG)
- Sample: n = 20 countries with a continuous history of democracy since World War II
- Fit linear regression model to compare theories

$$\text{union density}_i \sim N(\mu_i, \sigma^2), \qquad \mu_i = \beta_0 + \beta_1 \text{LeftG}_i + \beta_2 \text{LabF}_i + \beta_3 \text{IndC}_i$$

Informative priors  point estimate ___ 95% CI
Results with non-informative priors on regression coefficients (numerically equivalent to OLS analysis)  point estimate ___ 95% CI

Informative priors
Motivation for Bayesian approach with informative priors
- Because of small sample size and multicollinear variables, not able to adjudicate between theories
- Data tend to favour Wallerstein (union density depends on labour force size), but neither coefficient is estimated very precisely
- Other historical data are available that could provide further relevant information
- Incorporation of prior information provides additional structure to the data, which helps to uniquely identify the two coefficients
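The gain in precision from an informative prior can be sketched for a single coefficient. This is a deliberate simplification and not the analysis in the slides: it assumes a normal prior, treats the data-only estimate as normal with a known standard error, and ignores the other coefficients. All numbers are hypothetical.

```python
import math


def combine_normal(prior_mean, prior_sd, est, se):
    """Normal prior combined with a normal data estimate (known SE).

    The posterior precision is the sum of the prior and data precisions,
    and the posterior mean is the precision-weighted average.
    """
    prior_prec = 1 / prior_sd ** 2
    data_prec = 1 / se ** 2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * est)
    return post_mean, math.sqrt(post_var)


# Hypothetical values: prior N(-5, 2.5^2), data-only estimate -6 with SE 3.5
post_mean, post_sd = combine_normal(-5.0, 2.5, -6.0, 3.5)
print(post_mean, post_sd)
```

The posterior SD is smaller than both the prior SD and the data-only SE, which is the sense in which prior information adds precision.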

Informative priors Prior distributions for regression coefficients
Wallerstein
- Believes in negative labour force effect
- Comparison of Sweden and Norway in 1950:
  - doubling of labour force corresponds to 3.5-4% drop in union density
  - on log scale, labour force effect size ≈ -3.5/log(2) ≈ -5
- Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0

$$\beta_2 \sim N(-5,\ 2.5^2)$$

Informative priors Prior distributions for regression coefficients
Stephens
- Believes in positive industrial concentration effect
- Decline in industrial concentration in UK in 1980s:
  - drop of 0.3 in industrial concentration corresponds to about 3% drop in union density
  - industrial concentration effect size ≈ 3/0.3 = 10
- Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0

$$\beta_3 \sim N(10,\ 5^2)$$

Informative priors Prior distributions for regression coefficients
Wallerstein and Stephens
- Both believe left-wing gov’ts assist union growth
- Assuming 1 year of left-wing gov’t increases union density by about 1% translates to effect size of 0.3
- Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0

$$\beta_1 \sim N(0.3,\ 0.15^2)$$

- Vague prior assumed for intercept

$$\beta_0 \sim N(0,\ 100^2)$$
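The claim that each informative prior has a 95% interval excluding 0 can be checked directly. This sketch uses the three prior means and SDs stated on the slides (0.3 and 0.15, -5 and 2.5, 10 and 5); the dictionary labels are illustrative.

```python
from statistics import NormalDist

# (prior mean, prior SD) for each regression coefficient
priors = {
    "LeftG": (0.3, 0.15),
    "LabF": (-5.0, 2.5),
    "IndC": (10.0, 5.0),
}

z = NormalDist().inv_cdf(0.975)  # about 1.96
intervals = {name: (m - z * s, m + z * s) for name, (m, s) in priors.items()}

for name, (lo, hi) in intervals.items():
    print(name, round(lo, 3), round(hi, 3), "excludes 0:", lo * hi > 0)
```

In each case the prior mean is two prior SDs away from zero, so the central 95% interval (about 1.96 SDs either side) just excludes 0.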

Informative priors
[Figure: panels for Ind Conc, Lab Force and Left Govt]

Informative priors Effects of LabF and IndC estimated more precisely
- Both sets of prior beliefs support inference that labour-force size decreases union density
- Only Stephens’ prior supports conclusion that industrial concentration increases union density
- Choice of prior is subjective: if no consensus, can we be satisfied that data have been interpreted “fairly”?
- Sensitivity analysis
  - Sensitivity to priors (e.g. repeat analysis using priors with increasing variance)
  - Sensitivity to data (e.g. residuals, influence diagnostics)

Hierarchical priors
- Hierarchical priors are another widely used approach for borrowing strength
- Useful when data are available on many “similar” units (individuals, areas, studies, subgroups, ...)
- Data xi and parameters θi for each unit i = 1, …, N
- Three different assumptions:
  - Independent parameters: units are unrelated, and each θi is estimated separately using data xi alone
  - Identical parameters: observations treated as coming from the same unit, with common parameter θ
  - Exchangeable parameters: units are “similar” (labels convey no information) → mathematically equivalent to assuming the θi’s are drawn from a common probability distribution with unknown parameters

Meta-analysis Example 2: Meta-analysis (Spiegelhalter et al 2004)
- 8 small RCTs of IV magnesium sulphate following acute myocardial infarction
- Data: xig = deaths, nig = patients in trial i, treatment group g (0 = control, 1 = magnesium)
- Model (likelihood):

$$x_{ig} \sim \mathrm{Binomial}(p_{ig}, n_{ig}), \qquad \mathrm{logit}(p_{ig}) = \phi_i + \theta_i\, g$$

- θi is the log odds ratio for the treatment effect
- If not willing to believe trials are identical, but no reason to believe they are systematically different → assume the θi’s are exchangeable with hierarchical prior

$$\theta_i \sim \mathrm{Normal}(\mu, \sigma^2)$$

- μ, σ² also treated as unknown with (vague) priors
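For a single trial analysed on its own, the log odds ratio and its approximate variance can be computed from the 2×2 counts. This sketch uses the usual large-sample variance formula (sum of reciprocal cell counts); the counts are hypothetical and are not the magnesium trial data.

```python
import math


def log_odds_ratio(x1, n1, x0, n0):
    """MLE of the log odds ratio (treatment vs control) and its variance.

    x1, n1: deaths and patients in the treatment group
    x0, n0: deaths and patients in the control group
    """
    lor = math.log((x1 / (n1 - x1)) / (x0 / (n0 - x0)))
    var = 1 / x1 + 1 / (n1 - x1) + 1 / x0 + 1 / (n0 - x0)
    return lor, var


# Hypothetical small trial: 2/40 deaths on treatment, 6/40 on control
lor, var = log_odds_ratio(2, 40, 6, 40)
print(lor, math.sqrt(var))
```

With so few events the standard error is large, which is why borrowing strength across trials through the hierarchical prior can matter.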

Meta-analysis Estimates and 95% intervals for treatment effect from independent MLE and hierarchical Bayesian analysis

Meta-analysis
Effective sample size
- n = sample size of trial
- V1 = variance of θi without borrowing (variance of MLE)
- V2 = variance of θi with borrowing (posterior variance of θi)

$$ESS = n \times \frac{V_1}{V_2}$$

[Table: for each trial, n, V(MLE), V(θi | x), V1/V2 and ESS]
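A small sketch of the effective sample size calculation, taking ESS = n × V1/V2 from the definitions above. To produce a value for V2 without fitting the full model, it assumes a normal approximation with the between-trial variance treated as known, which is a simplification of the hierarchical analysis. All numbers are hypothetical.

```python
def effective_sample_size(n, v_mle, v_post):
    """ESS = n * V1 / V2."""
    return n * v_mle / v_post


def conditional_posterior_var(v_mle, sigma2):
    """Posterior variance of theta_i given a known between-trial variance.

    Simplifying assumption: normal likelihood with variance v_mle and
    a Normal(mu, sigma2) prior whose parameters are treated as known.
    """
    return 1 / (1 / v_mle + 1 / sigma2)


# Hypothetical trial: n = 80 patients, V1 = 0.72, between-trial variance 0.25
n, v1 = 80, 0.72
v2 = conditional_posterior_var(v1, 0.25)
ess = effective_sample_size(n, v1, v2)
print(v2, ess)
```

Because borrowing strength makes V2 smaller than V1, the effective sample size exceeds the actual trial size.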

Meta-analysis
Example 3: Meta-analysis of effect of class size on educational achievement (Goldstein et al, 2000)
- 8 studies:
  - 1 RCT
  - 3 matched
  - 2 experimental
  - 2 observational

Meta-analysis
- Goldstein et al use maximum likelihood, with bootstrap CI due to small sample size
- Under-estimates uncertainty relative to Bayesian intervals
- Note that 95% CI for Bayesian estimate of effect of class size includes 0

Accounting for data quality
- Bayesian approach also provides a formal framework for propagating uncertainty about different quantities in a model
- Natural tool for explicitly modelling different aspects of data quality
  - Measurement error
  - Missing data

Accounting for data quality
Example 4: Accounting for population errors in small-area estimation and disease mapping (Best and Wakefield, 1999)
- Context: Mapping geographical variations in risk of breast cancer by electoral ward in SE England
- Typical model:

$$y_i \sim \mathrm{Poisson}(\lambda_i N_i)$$

- yi = number of breast cancer cases in area i
- Ni = Σt Nit = population-years at risk in area i
- λi is the area-specific rate of breast cancer: parameter of interest

Accounting for data quality
- Ni usually assumed to be known
  - Ignores uncertainty in small-area age/sex population counts in inter-census years
- B&W make use of additional data on Registrar General’s mid-year district age/sex population totals Ndt
- Model A: Nit = Ndt pit, where pit is the proportion of the annual district population in the particular age group of interest living in ward i
  - pit estimated by interpolating 1981 and 1991 census counts
- Model B: Allow for sampling variability in Nit
  - Nit ~ Multinomial(Ndt, [p1t, …, pKt])
- Model C: Allow for uncertainty in proportions pit
  - pit ~ informative Dirichlet prior distribution
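Models A and B can be sketched in a few lines. This is an illustration only: linear interpolation between the two census years is an assumption (the slides say only that the proportions are interpolated), and the district total and ward proportions are hypothetical.

```python
import random


def interpolate_proportion(p_1981, p_1991, year):
    """Linear interpolation between census proportions (assumed form)."""
    w = (year - 1981) / 10
    return (1 - w) * p_1981 + w * p_1991


def model_a(n_district, proportions):
    """Model A: ward populations as fixed shares of the district total."""
    return [n_district * p for p in proportions]


def model_b(n_district, proportions, rng):
    """Model B: multinomial draw of ward populations from the district total."""
    counts = [0] * len(proportions)
    wards = range(len(proportions))
    for ward in rng.choices(wards, weights=proportions, k=n_district):
        counts[ward] += 1
    return counts


# Hypothetical district with 3 wards
p81, p91 = [0.50, 0.30, 0.20], [0.40, 0.35, 0.25]
p86 = [interpolate_proportion(a, b, 1986) for a, b in zip(p81, p91)]
fixed = model_a(10_000, p86)
sampled = model_b(10_000, p86, random.Random(1))
print(p86, fixed, sampled)
```

Model C would add a further layer by drawing the proportions themselves from a Dirichlet distribution before the multinomial step.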

Accounting for data quality
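Random effects Poisson regression:

$$\log \lambda_i = \alpha_i + \beta X_i$$

- Xi = deprivation score for ward i

[Figure: graphical model with nodes μ, σ², β, Xi, αi, Ni, λi, yi and a plate over ward i; three nodes are marked with priors]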

Accounting for data quality
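[Figure: graphical model extended with the population model; nodes μ, σ², Ndt, β, pit, Nit, Xi, αi, Ni, λi, yi, with plates over year t and ward i; four nodes are marked with priors]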

Accounting for data quality
[Figure: RR of breast cancer for affluent vs deprived wards under Ni known and Models A, B, C; area-specific RR estimates, ward RR (assuming Ei known) against ward RR (modelled Ei)]

Model uncertainty
- Model uncertainty can be large for observational data studies
- In regression models:
  - What is the ‘best’ set of predictors for the response of interest?
  - Which confounders to control for?
  - Which interactions to include?
  - What functional form to use (linear, non-linear, ...)?

Model uncertainty
Example 5: Predictors of crime rates in US States (adapted from Raftery et al, 1997)
- Ehrlich (1973) developed and tested the theory that the decision to commit crime is a rational choice based on costs and benefits
  - Costs of crime related to probability of imprisonment and average length of time served in prison
  - Benefits of crime related to income inequalities and aggregate wealth of community
  - Net benefits of other (legitimate) activities related to employment rate and education levels in community
- Ehrlich analysed data from 47 US states in 1960, focusing on the relationship between crime rate and the 2 prison variables
- Up to 13 candidate control variables also considered

Model uncertainty
- y = log crime rate in 1960 in each of 47 US states
- Z1, Z2 = log prob. of prison, log av. time in prison
- X1, …, X13 = candidate control variables
- Fit Normal linear regression model
- Results sensitive to choice of control variables

[Table: models with different sets of control variables and the resulting estimates (SE) for the prison variables]
Table adapted from Table 2 in Raftery et al (1997)

Model uncertainty
- Using a Bayesian approach, can let the set of control variables be an unknown parameter of the model, θ
- Don't know (a priori) the number of covariates in the ‘best’ model → θ has unknown dimension → assign prior distribution
- Can handle such “trans-dimensional” (TD) models using “reversible jump” MCMC algorithms
- Normal linear regression model: yi ~ Normal(μi, σ²), i = 1, ..., 47
- Variable selection model:

$$\mu_i = Z_i \gamma + W_i \beta, \qquad W_i = (X_{i\theta_1}, X_{i\theta_2}, \ldots, X_{i\theta_k}), \qquad k = \text{number of currently selected variables}$$

Model uncertainty
[Figure: graphical model with nodes k, β, θ, γ, μi, Xi, Zi, σ², yi and a plate over state i]

Model uncertainty

Model uncertainty
[Figure: posterior probability that each control variable is in the model; posterior mean and 95% CI for effect (β) conditional on being in the model]

Model uncertainty
- Most likely (40%) set of control variables contains X4 (police expenditure in 1960) and X13 (income inequality)
- 2nd most likely (28%) set of control variables contains X5 (police expenditure in 1959) and X13 (income inequality)
- Control variables with >10% marginal probability of inclusion
  - X3: average years of schooling (18%)
  - X4: police expenditure in 1960 (56%)
  - X5: police expenditure in 1959 (40%)
  - X13: income inequality (94%)
- Posterior estimates of prison variables, averaged over models
  - log prob. of prison: (-0.55, -0.05)
  - log av. time in prison: (-0.69, 0.14)
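The marginal inclusion probability of a variable is the sum of the posterior probabilities of all models that contain it. The sketch below uses only the two model probabilities reported on this slide, so it yields lower bounds rather than the full marginal probabilities; the remaining models are not listed in the slides and are not invented here.

```python
# Posterior probabilities of the two most likely sets of control variables
top_models = {
    ("X4", "X13"): 0.40,
    ("X5", "X13"): 0.28,
}


def inclusion_lower_bound(variable, models):
    """Sum the probabilities of the listed models that include the variable."""
    return sum(prob for variables, prob in models.items() if variable in variables)


bounds = {v: inclusion_lower_bound(v, top_models) for v in ("X4", "X5", "X13")}
print(bounds)
```

Each bound is consistent with the reported marginal probabilities (56%, 40% and 94%), with the gap made up by the lower-probability models that are not shown.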

Discussion
- Bayesian approach provides a coherent framework for combining many sources of evidence in a statistical model
- Formal approach to “borrowing strength”
  - Improved precision/effective sample size
  - Fully accounts for uncertainty
- Relevance of different pieces of evidence is a judgement and must be justifiable
- Bayesian approach forces us to be explicit about model assumptions
- Sensitivity analysis to assumptions is crucial

Discussion
- Bayesian calculations are computationally intensive, but:
  - They provide exact inference; no asymptotics
  - MCMC offers huge flexibility to model complex problems
- All examples discussed here were fitted using free WinBUGS software
- Want to learn more about using Bayesian methods for social science data analysis?
  - Short course: Introduction to Bayesian inference and WinBUGS, Sept 19-20, Imperial College
  - See www.bias-project.org.uk for details

Thank you!

References
- Best, N. and Wakefield, J. (1999). Accounting for inaccuracies in population counts and case registration in cancer mapping studies. J Roy Statist Soc, Series A, 162:
- Goldstein, H., Yang, M., Omar, R., Turner, R. and Thompson, S. (2000). Meta-analysis using multilevel models with an application to the study of class size effects. Applied Statistics, 49:
- Raftery, A., Madigan, D. and Hoeting, J. (1997). Bayesian model averaging for linear regression models. J Am Statist Assoc, 92:
- Spiegelhalter, D., Abrams, K. and Myles, J. (2004). Bayesian Approaches to Clinical Trials and Health Care Evaluation, Wiley, Chichester.
- Western, B. and Jackman, S. (1994). Bayesian inference for comparative research. The American Political Science Review, 88: