Published by Daniel O'Keefe. Modified over 2 years ago.

1
Introduction to Bayesian inference and computation for social science data analysis Nicky Best Imperial College, London

2
Outline Overview of Bayesian methods –Illustration of conjugate Bayesian inference –MCMC methods Examples illustrating: –Analysis using informative priors –Hierarchical priors, meta-analysis and evidence synthesis –Adjusting for data quality –Model uncertainty Discussion

3
Overview of Bayesian inference and computation

4
Overview of Bayesian methods Bayesian methods have been widely applied in many areas –medicine / epidemiology / genetics –ecology / environmental sciences –finance –archaeology –political and social sciences, ……… Motivations for adopting Bayesian approach vary –natural and coherent way of thinking about science and learning –pragmatic choice that is suitable for the problem in hand

5
Overview of Bayesian methods Medical context: FDA draft guidance: Bayesian statistics … provides a coherent method for learning from evidence as it accumulates. Evidence can accumulate in various ways: –Sequentially –Measurement of many similar units (individuals, centres, sub-groups, areas, periods, …) –Measurement of different aspects of a problem. Evidence can take different forms: –Data –Expert judgement

6
Overview of Bayesian methods Bayesian approach also provides formal framework for propagating uncertainty –Well suited to building complex models by linking together multiple sub-models –Can obtain estimates and uncertainty intervals for any parameter, function of parameters or predictive quantity of interest. Bayesian inference doesn't rely on asymptotics or analytic approximations –Arbitrarily wide range of models can be handled using same inferential framework –Focus on specifying realistic models, not on choosing analytically tractable approximations

7
Bayesian inference Distinguish between $x$: known quantities (data) and $\theta$: unknown quantities (e.g. regression coefficients, future outcomes, missing observations). Fundamental idea: use probability distributions to represent uncertainty about unknowns. Likelihood – model for the data: $p(x \mid \theta)$. Prior distribution – representing current uncertainty about unknowns: $p(\theta)$. Applying Bayes' theorem gives posterior distribution $p(\theta \mid x) = \frac{p(\theta)\, p(x \mid \theta)}{\int p(\theta)\, p(x \mid \theta)\, d\theta} \propto p(\theta)\, p(x \mid \theta)$
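To make the posterior formula concrete, here is a minimal Python sketch that evaluates prior × likelihood on a grid of θ values and normalises by the sum, which stands in for the integral in the denominator. The function name `grid_posterior`, the flat prior and the data (7 successes in 10 trials) are illustrative assumptions, not taken from the slides.

```python
import math

def grid_posterior(log_prior, log_lik, grid):
    """Normalised posterior weights on a grid: prior x likelihood / sum."""
    log_post = [log_prior(t) + log_lik(t) for t in grid]
    m = max(log_post)                        # subtract the max for numerical stability
    w = [math.exp(lp - m) for lp in log_post]
    total = sum(w)                           # stands in for the integral in the denominator
    return [wi / total for wi in w]

# Hypothetical illustration: flat prior on (0, 1), 7 successes in 10 trials.
grid = [(i + 0.5) / 1000 for i in range(1000)]
weights = grid_posterior(
    lambda t: 0.0,                                    # log of a flat prior
    lambda t: 7 * math.log(t) + 3 * math.log(1 - t),  # binomial log-likelihood kernel
    grid,
)
post_mean = sum(t * w for t, w in zip(grid, weights))
```

A grid like this only works for one or two unknowns, which is one reason simulation methods are needed for realistic models.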

8
Conjugate Bayesian inference Example: election poll (from Franklin, 2004*) Imagine an election campaign where (for simplicity) we have just a Government/Opposition vote choice. We enter the campaign with a prior distribution for the proportion $\theta$ supporting Government. This is $p(\theta)$. As the campaign begins, we get polling data. How should we change our estimate of Government's support? *Adapted from Charles Franklin's Essex Summer School course slides:

9
Conjugate Bayesian inference Data and likelihood Each poll consists of $n$ voters, $x$ of whom say they will vote for Government and $n - x$ of whom will vote for the Opposition. If we assume we have no information to distinguish voters in their probability of supporting Government, then we have a binomial distribution for $x$. This binomial distribution is the likelihood: $p(x \mid \theta) = \binom{n}{x}\, \theta^{x} (1 - \theta)^{n - x} \propto \theta^{x} (1 - \theta)^{n - x}$

10
Conjugate Bayesian inference Prior We need to specify a prior that –expresses our uncertainty about the election (before it begins) –conforms to the nature of the parameter, i.e. is continuous but bounded between 0 and 1. A convenient choice is the Beta distribution: $p(\theta) = \mathrm{Beta}(a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, \theta^{a - 1} (1 - \theta)^{b - 1} \propto \theta^{a - 1} (1 - \theta)^{b - 1}$

11
Conjugate Bayesian inference Beta($a$, $b$) distribution can take a variety of shapes depending on its two parameters $a$ and $b$. Mean of Beta($a$, $b$) distribution $= a/(a+b)$. Variance of Beta($a$, $b$) distribution $= ab / [(a+b)^2 (a+b+1)]$

12
Conjugate Bayesian inference Posterior Combining a beta prior with the binomial likelihood gives a posterior distribution $p(\theta \mid x, n) \propto p(x \mid \theta, n)\, p(\theta) \propto \theta^{x} (1 - \theta)^{n - x}\, \theta^{a - 1} (1 - \theta)^{b - 1} = \theta^{x + a - 1} (1 - \theta)^{n - x + b - 1} \propto \mathrm{Beta}(x + a,\; n - x + b)$. When prior and posterior come from the same family, the prior is said to be conjugate to the likelihood –Occurs when prior and likelihood have the same kernel

13
Conjugate Bayesian inference Suppose I believe that Government only has the support of half the population, and I think that estimate has a standard deviation of about 0.05 –This is approximately a Beta(50, 50) distribution. We observe a poll with 200 respondents, 120 of whom (60%) say they will vote for Government. This produces a posterior which is a Beta(120+50, 80+50) = Beta(170, 130) distribution

14
Conjugate Bayesian inference Prior mean, $E(\theta) = 50/100 = 0.5$. Posterior mean, $E(\theta \mid x, n) = 170/300 = 0.57$. Posterior SD, $\sqrt{\mathrm{Var}(\theta \mid x, n)} = \sqrt{170 \times 130 / (300^2 \times 301)} \approx 0.029$. Frequentist estimate is based only on the data: $\hat{\theta} = 120/200 = 0.6$, $SE(\hat{\theta}) = \sqrt{(0.6 \times 0.4)/200} = 0.035$
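The poll example can be reproduced in a few lines of Python. This is a minimal sketch using the standard Beta mean and variance formulas; the function names `beta_binomial_update` and `beta_mean_sd` are illustrative choices, not from the slides.

```python
import math

def beta_binomial_update(a, b, x, n):
    """Beta(a, b) prior + x successes in n trials -> Beta(a + x, b + n - x) posterior."""
    return a + x, b + (n - x)

def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

# Beta(50, 50) prior and a poll with 120 of 200 respondents supporting Government.
a_post, b_post = beta_binomial_update(50, 50, x=120, n=200)
post_mean, post_sd = beta_mean_sd(a_post, b_post)

# Frequentist estimate based on the poll alone.
theta_hat = 120 / 200
se_hat = math.sqrt(theta_hat * (1 - theta_hat) / 200)
```

The posterior mean lies between the prior mean (0.5) and the poll proportion (0.6), and the posterior SD is smaller than the frequentist standard error because the prior contributes information.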

15
Conjugate Bayesian inference A harder problem: what is the probability that Government wins? –It is not 0.57 or 0.60. Those are expected votes but not the probability of winning. How to answer this? Frequentists have a hard time with this one. They can obtain a p-value for testing $H_0: \theta > 0.5$, but this isn't the same as the probability that Government wins –(it's actually the probability of observing data more extreme than 120 out of 200 if $H_0$ is true). Easy from Bayesian perspective: calculate $\Pr(\theta > 0.5 \mid x, n)$, the posterior probability that $\theta > 0.5$
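As a sketch of that calculation, the posterior tail probability for the Beta(170, 130) posterior can be approximated by numerically integrating the Beta density above 0.5. The midpoint rule, the step count and the function names are assumptions made for illustration; a statistics library's Beta CDF would normally be used instead.

```python
import math

def beta_log_pdf(t, a, b):
    """Log density of Beta(a, b) at t, for 0 < t < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t)

def beta_tail_prob(threshold, a, b, steps=20000):
    """Pr(theta > threshold) by midpoint-rule integration of the Beta density."""
    width = (1.0 - threshold) / steps
    total = 0.0
    for i in range(steps):
        t = threshold + (i + 0.5) * width   # midpoints stay strictly inside (0, 1)
        total += math.exp(beta_log_pdf(t, a, b))
    return total * width

p_win = beta_tail_prob(0.5, 170, 130)       # posterior probability that Government wins
```

The result is close to 0.99, which is a direct probability statement about θ rather than a statement about hypothetical repeated polls.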

16
Bayesian computation All Bayesian inference is based on the posterior distribution Summarising posterior distributions involves integration Except for conjugate models, integrals are usually analytically intractable Use Monte Carlo (simulation) integration (MCMC)

17
Bayesian computation Suppose we didn't know how to analytically integrate the Beta(170, 130) posterior… but we do know how to simulate from a Beta
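A minimal sketch of this idea in Python: draw independent samples from Beta(170, 130) with the standard library's `random.betavariate` and summarise them. The seed and the number of draws are arbitrary choices made here for reproducibility.

```python
import random

rng = random.Random(1)                      # fixed seed so the sketch is reproducible
draws = [rng.betavariate(170, 130) for _ in range(20000)]

mc_mean = sum(draws) / len(draws)                       # estimates the posterior mean
mc_p_win = sum(d > 0.5 for d in draws) / len(draws)     # estimates Pr(theta > 0.5 | x, n)
```

Any posterior summary (mean, tail area, percentile) becomes a simple summary of the simulated values, with accuracy controlled by the number of draws.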

18
Bayesian computation Monte Carlo integration –Suppose we have samples $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n)}$ from $p(\theta \mid x)$ –Then $E(\theta \mid x) = \int \theta\, p(\theta \mid x)\, d\theta \approx \frac{1}{n} \sum_{i=1}^{n} \theta^{(i)}$. Can also use samples to estimate posterior tail area probabilities, percentiles, variances etc. Difficult to generate independent samples when posterior is complex and high dimensional. Instead, generate dependent samples from a Markov chain having $p(\theta \mid x)$ as its stationary distribution: Markov chain Monte Carlo (MCMC)
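To illustrate the Markov chain idea, here is a minimal random-walk Metropolis sketch in Python targeting the Beta(170, 130) posterior from the election example. It needs only the unnormalised posterior kernel. The proposal SD, chain length, burn-in and seed are arbitrary assumptions, and this is not the sampler WinBUGS uses.

```python
import math
import random

def log_kernel(theta, a=170, b=130):
    """Log of the unnormalised Beta(a, b) density; -inf outside (0, 1)."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

def metropolis(n_iter, start=0.5, step=0.05, seed=1):
    """Random-walk Metropolis: dependent samples whose stationary law is the target."""
    rng = random.Random(seed)
    theta, chain = start, []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        log_ratio = log_kernel(proposal) - log_kernel(theta)
        # Accept with probability min(1, target(proposal) / target(theta)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        chain.append(theta)
    return chain

chain = metropolis(20000)[2000:]            # discard the first 2000 iterations as burn-in
mcmc_mean = sum(chain) / len(chain)
```

The chain's values are correlated, so more iterations are needed than with independent draws to reach the same Monte Carlo accuracy.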

19
Illustrative Examples

20
Borrowing strength Bayesian learning means borrowing strength (precision) from other sources of information. Informative prior is one such source –today's posterior is tomorrow's prior –relevance of prior information to current study must be justified

21
Informative priors Example 1: Western and Jackman (1994)* Example of regression analysis in comparative research. What explains cross-national variation in union density? –Union density is defined as the percentage of the workforce that belongs to a labour union. Two issues –Philosophical: data represent all available observations from a population, so conventional (frequentist) analysis based on long-run behaviour of a repeatable data mechanism is not appropriate –Practical: small, collinear dataset yields imprecise estimates of regression effects * Slides adapted from Jeff Grynaviski:

22
Informative priors Competing theories –Wallerstein: union density depends on the size of the civilian labour force (LabF) –Stephens: union density depends on industrial concentration (IndC) –Note: these two predictors correlate at [value missing in source]. Control variable: presence of a left-wing government (LeftG). Sample: n = 20 countries with a continuous history of democracy since World War II. Fit linear regression model to compare theories: $\text{union density}_i \sim N(\mu_i, \sigma^2)$, $\mu_i = \beta_0 + \beta_1\, \text{LeftG} + \beta_2\, \text{LabF} + \beta_3\, \text{IndC}$

23
Informative priors Results with non-informative priors on regression coefficients (numerically equivalent to OLS analysis) [Figure: point estimates and 95% CIs for the regression coefficients]

24
Informative priors Motivation for Bayesian approach with informative priors Because of small sample size and multicollinear variables, not able to adjudicate between theories Data tend to favour Wallerstein (union density depends on labour force size), but neither coefficient estimated very precisely Other historical data are available that could provide further relevant information Incorporation of prior information provides additional structure to the data, which helps to uniquely identify the two coefficients

25
Informative priors Prior distributions for regression coefficients: Wallerstein. Believes in negative labour force effect. Comparison of Sweden and Norway in 1950: doubling of labour force corresponds to 3.5-4% drop in union density; on log scale, labour force effect size $\approx -3.5/\log(2) \approx -5$. Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0: $\beta_2 \sim N(-5, \cdot)$ (prior variance missing in source)

26
Informative priors Prior distributions for regression coefficients: Stephens. Believes in positive industrial concentration effect. Decline in industrial concentration in UK in 1980s: drop of 0.3 in industrial concentration corresponds to about 3% drop in union density, so industrial concentration effect size $\approx 3/0.3 = 10$. Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0: $\beta_3 \sim N(10, 5^2)$
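The arithmetic behind these elicited priors can be checked with a short Python sketch. It recomputes the two effect sizes quoted on the slides and confirms that a N(10, 5²) prior has a central 95% interval excluding 0. The variable names are illustrative, and natural logarithms are assumed for the labour force calculation.

```python
import math
from statistics import NormalDist

# Effect sizes quoted on the slides.
labf_effect = -3.5 / math.log(2)    # labour force effect on the (natural) log scale
indc_effect = 3 / 0.3               # industrial concentration effect

# Stephens' prior for the industrial concentration coefficient: N(10, 5^2).
indc_prior = NormalDist(mu=10, sigma=5)
lower = indc_prior.inv_cdf(0.025)   # lower end of the central 95% prior interval
upper = indc_prior.inv_cdf(0.975)   # upper end of the central 95% prior interval
prob_positive = 1 - indc_prior.cdf(0)   # prior probability of a positive effect
```

With a prior SD of 5 the lower limit sits just above 0, which is how the slide's "95% interval that excludes 0" criterion pins down the prior SD.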

27
Informative priors Prior distributions for regression coefficients: Wallerstein and Stephens. Both believe left-wing govts assist union growth. Assuming 1 year of left-wing govt increases union density by about 1% translates to effect size of 0.3. Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0: $\beta_1 \sim N(0.3, \cdot)$ (prior variance missing in source). Vague prior $\beta_0 \sim N(0, \cdot)$ (prior variance missing in source) assumed for intercept

28
Informative priors [Figure: panels for Ind Conc, Lab Force and Left Govt]

29
Informative priors Effects of LabF and IndC estimated more precisely. Both sets of prior beliefs support inference that labour-force size decreases union density. Only Stephens' prior supports conclusion that industrial concentration increases union density. Choice of prior is subjective: if no consensus, can we be satisfied that data have been interpreted fairly? Sensitivity analysis –Sensitivity to priors (e.g. repeat analysis using priors with increasing variance) –Sensitivity to data (e.g. residuals, influence diagnostics)

30
Hierarchical priors Hierarchical priors are another widely used approach for borrowing strength. Useful when data available on many similar units (individuals, areas, studies, subgroups, …). Data $x_i$ and parameters $\theta_i$ for each unit $i = 1, \ldots, N$. Three different assumptions: –Independent parameters: units are unrelated, and each $\theta_i$ is estimated separately using data $x_i$ alone –Identical parameters: observations treated as coming from same unit, with common parameter $\theta$ –Exchangeable parameters: units are similar (labels convey no information); mathematically equivalent to assuming the $\theta_i$ are drawn from a common probability distribution with unknown parameters

31
Meta-analysis Example 2: Meta-analysis (Spiegelhalter et al 2004) 8 small RCTs of IV magnesium sulphate following acute myocardial infarction. Data: $x_{ig}$ = deaths, $n_{ig}$ = patients in trial $i$, treatment group $g$ (0 = control, 1 = magnesium). Model (likelihood): $x_{ig} \sim \mathrm{Binomial}(p_{ig}, n_{ig})$, $\mathrm{logit}(p_{ig}) = \phi_i + \theta_i \cdot g$, where $\phi_i$ is the control-group log odds in trial $i$ and $\theta_i$ is the log odds ratio for treatment effect. If not willing to believe trials are identical, but no reason to believe they are systematically different, assume the $\theta_i$ are exchangeable with hierarchical prior $\theta_i \sim \mathrm{Normal}(\mu, \sigma^2)$; $\mu$, $\sigma^2$ also treated as unknown with (vague) priors
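For comparison with the hierarchical model, the independent (per-trial) estimate of a log odds ratio and its usual large-sample variance can be sketched as below. The function name and the counts are hypothetical and are not the magnesium trial data.

```python
import math

def log_odds_ratio(x0, n0, x1, n1):
    """Per-trial log odds ratio (treatment vs control) and its approximate variance."""
    log_or = math.log(x1 / (n1 - x1)) - math.log(x0 / (n0 - x0))
    var = 1 / x1 + 1 / (n1 - x1) + 1 / x0 + 1 / (n0 - x0)
    return log_or, var

# Hypothetical counts: 10 deaths in 100 controls, 5 deaths in 100 treated patients.
est, var = log_odds_ratio(x0=10, n0=100, x1=5, n1=100)
ci = (est - 1.96 * math.sqrt(var), est + 1.96 * math.sqrt(var))
```

With few deaths per arm the variance is large, which is the situation in which borrowing strength across trials is most useful.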

32
Meta-analysis Estimates and 95% intervals for treatment effect from independent MLE and hierarchical Bayesian analysis

33
Meta-analysis Effective sample size: $n$ = sample size of trial; $V_1$ = variance of $\theta_i$ without borrowing (variance of MLE); $V_2$ = variance of $\theta_i$ with borrowing (posterior variance of $\theta_i$); $\mathrm{ESS} = n \times V_1 / V_2$
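A simplified sketch of how borrowing strength changes the variance and the effective sample size. It treats the population mean and between-trial variance (called `mu` and `tau2` here) as known, whereas the hierarchical analysis on the slides gives them priors. In this simplified normal model the posterior for a trial effect is a precision-weighted average. All numbers and function names are hypothetical.

```python
def shrink(y, v1, mu, tau2):
    """Posterior mean and variance of theta_i when mu and tau2 are treated as known."""
    b = v1 / (v1 + tau2)                # weight given to the population mean
    post_mean = b * mu + (1 - b) * y    # trial estimate pulled towards mu
    v2 = (1 - b) * v1                   # equals 1 / (1/v1 + 1/tau2)
    return post_mean, v2

def effective_sample_size(n, v1, v2):
    """ESS = n * V1 / V2, as defined on the slide."""
    return n * v1 / v2

# Hypothetical trial: estimate -0.75 with variance 0.32 from n = 200 patients.
post_mean, v2 = shrink(y=-0.75, v1=0.32, mu=-0.4, tau2=0.16)
ess = effective_sample_size(n=200, v1=0.32, v2=v2)
```

In this illustration the trial behaves as if it had three times as many patients, because the posterior variance is one third of the MLE variance.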

34
Meta-analysis Example 3: Meta-analysis of effect of class size on educational achievement (Goldstein et al, 2000) 8 studies: 1 RCT, 3 matched, 2 experimental, 2 observational

35
Meta-analysis Goldstein et al use maximum likelihood, with bootstrap CI due to small sample size Under-estimates uncertainty relative to Bayesian intervals Note that 95% CI for Bayesian estimate of effect of class size includes 0

36
Accounting for data quality Bayesian approach also provides formal framework for propagating uncertainty about different quantities in a model Natural tool for explicitly modelling different aspects of data quality –Measurement error –Missing data

37
Accounting for data quality Example 4: Accounting for population errors in small-area estimation and disease mapping (Best and Wakefield, 1999). Context: mapping geographical variations in risk of breast cancer by electoral ward in SE England. Typical model: $y_i \sim \mathrm{Poisson}(\lambda_i N_i)$, where $y_i$ = number of breast cancer cases in area $i$; $\lambda_i$ is the area-specific rate of breast cancer (parameter of interest); $N_i = \sum_t N_{it}$ = population-years at risk in area $i$

38
Accounting for data quality $N_i$ usually assumed to be known. Ignores uncertainty in small-area age/sex population counts in inter-census years. B&W make use of additional data on Registrar General's mid-year district age/sex population totals $N_{dt}$. Model A: $N_{it} = N_{dt}\, p_{it}$, where $p_{it}$ is proportion of annual district population in particular age group of interest living in ward $i$; $p_{it}$ estimated by interpolating 1981 and 1991 census counts. Model B: allow for sampling variability in $N_{it}$: $N_{it} \sim \mathrm{Multinomial}(N_{dt}, [p_{1t}, \ldots, p_{Kt}])$. Model C: allow for uncertainty in proportions $p_{it}$: $p_{it} \sim$ informative Dirichlet prior distribution
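The three population models can be sketched by simulating ward populations for one district and year. The district total, ward proportions and Dirichlet concentration below are hypothetical, and the sketch assumes Model C keeps Model B's multinomial step while drawing the proportions from a Dirichlet centred on the interpolated values.

```python
import random
from collections import Counter

def sample_dirichlet(alpha, rng):
    """Draw proportions from Dirichlet(alpha) by normalising gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_multinomial(n, probs, rng):
    """Allocate n people to len(probs) wards with the given probabilities."""
    counts = Counter(rng.choices(range(len(probs)), weights=probs, k=n))
    return [counts.get(i, 0) for i in range(len(probs))]

rng = random.Random(1)
n_dt = 5000                     # hypothetical district total for one year
p_hat = [0.5, 0.3, 0.2]         # hypothetical interpolated ward proportions

model_a = [n_dt * p for p in p_hat]                  # Model A: populations fixed
model_b = sample_multinomial(n_dt, p_hat, rng)       # Model B: sampling variability
alpha = [100 * p for p in p_hat]                     # hypothetical prior concentration
model_c = sample_multinomial(n_dt, sample_dirichlet(alpha, rng), rng)  # Model C
```

Repeating the Model B or Model C draws many times gives a distribution of ward populations, and that spread is what gets propagated into the uncertainty of the rate estimates.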

39
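Accounting for data quality [Figure: graphical model for ward $i$ linking $y_i$, $N_i$ and $X_i$ to the model parameters and their priors] Random effects Poisson regression: $\log \lambda_i = \alpha_i + \beta X_i$, with $\lambda_i$ the area-specific rate, $\alpha_i$ a ward-level random effect and $X_i$ = deprivation score for ward $i$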

40
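Accounting for data quality [Figure: graphical model extended with the population sub-model: $N_{it}$, $p_{it}$ and $N_{dt}$ for year $t$, with a prior on $p_{it}$, feeding into $N_i$ for ward $i$]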

41
Accounting for data quality [Figure: area-specific RR estimates, ward RR (assuming $E_i$ known) against ward RR (modelled $E_i$)] [Figure: RR of breast cancer for affluent vs deprived wards under $N_i$ known and Models A, B, C]

42
Model uncertainty Model uncertainty can be large for observational data studies. In regression models: –What is the best set of predictors for response of interest? –Which confounders to control for? –Which interactions to include? –What functional form to use (linear, non-linear, …)?

43
Model uncertainty Example 5: Predictors of crime rates in US States (adapted from Raftery et al, 1997) Ehrlich (1973) – developed and tested theory that decision to commit crime is rational choice based on costs and benefits Costs of crime related to probability of imprisonment and average length of time served in prison Benefits of crime related to income inequalities and aggregate wealth of community Net benefits of other (legitimate) activities related to employment rate and education levels in community Ehrlich analysed data from 47 US states in 1960, focusing on relationship between crime rate and the 2 prison variables Up to 13 candidate control variables also considered

44
Model uncertainty $y$ = log crime rate in 1960 in each of 47 US states; $Z_1$, $Z_2$ = log prob. of prison, log av. time in prison; $X_1, \ldots, X_{13}$ = candidate control variables. Fit Normal linear regression model. Results sensitive to choice of control variables. Table adapted from Table 2 in Raftery et al (1997)

45
Model uncertainty Using Bayesian approach, can let set of control variables be an unknown parameter of the model. Don't know (a priori) the number of covariates in the best model, so this parameter has unknown dimension; assign it a prior distribution. Can handle such trans-dimensional (TD) models using reversible jump MCMC algorithms. Normal linear regression model: $y_i \sim \mathrm{Normal}(\mu_i, \sigma^2)$, $i = 1, \ldots, 47$. Variable selection model: [equation missing in source]

46
Model uncertainty [Figure: graphical model for state $i$ linking $y_i$, $X_i$ and $Z_i$ to the model parameters]

47

48
Model uncertainty [Figure: posterior probability that each control variable is in the model; posterior mean and 95% CI for effect (b) conditional on being in model]

49
Model uncertainty Most likely (40%) set of control variables contains $X_4$ (police expenditure in 1960) and $X_{13}$ (income inequality). 2nd most likely (28%) set of control variables contains $X_5$ (police expenditure in 1959) and $X_{13}$ (income inequality). Control variables with >10% marginal probability of inclusion: –$X_3$: average years of schooling (18%) –$X_4$: police expenditure in 1960 (56%) –$X_5$: police expenditure in 1959 (40%) –$X_{13}$: income inequality (94%). Posterior estimates of prison variables, averaged over models: log prob. of prison: (-0.55, -0.05); log av. time in prison: (-0.69, 0.14)
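A marginal inclusion probability is the total posterior probability of all models that contain a given variable. The sketch below sums over only the two models whose probabilities are reported on the slide, so the results are lower bounds on the quoted marginal probabilities rather than the full calculation. The function name is illustrative.

```python
from collections import defaultdict

# Posterior model probabilities reported on the slide (top two models only).
top_models = {
    frozenset({"X4", "X13"}): 0.40,
    frozenset({"X5", "X13"}): 0.28,
}

def inclusion_mass(model_probs):
    """Sum posterior model probabilities over the models containing each variable."""
    mass = defaultdict(float)
    for variables, prob in model_probs.items():
        for v in variables:
            mass[v] += prob
    return dict(mass)

partial = inclusion_mass(top_models)    # lower bounds on the marginal probabilities
```

The partial sums (0.68 for X13, 0.40 for X4, 0.28 for X5) are consistent with the reported marginal probabilities of 94%, 56% and 40%; the remainder comes from less probable models not listed on the slide.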

50
Discussion Bayesian approach provides coherent framework for combining many sources of evidence in a statistical model Formal approach to borrowing strength –Improved precision/effective sample size –Fully accounts for uncertainty Relevance of different pieces of evidence is a judgement – must be justifiable Bayesian approach forces us to be explicit about model assumptions Sensitivity analysis to assumptions is crucial

51
Discussion Bayesian calculations are computationally intensive, but: –Provide exact inference; no asymptotics –MCMC offers huge flexibility to model complex problems. All examples discussed here were fitted using free WinBUGS software. Want to learn more about using Bayesian methods for social science data analysis? –Short course: Introduction to Bayesian inference and WinBUGS, Sept 19-20, Imperial College. See www.bias-project.org.uk for details

52
Thank you!

53
References
Best, N. and Wakefield, J. (1999). Accounting for inaccuracies in population counts and case registration in cancer mapping studies. J Roy Statist Soc, Series A, 162:
Goldstein, H., Yang, M., Omar, R., Turner, R. and Thompson, S. (2000). Meta-analysis using multilevel models with an application to the study of class size effects. Applied Statistics, 49:
Raftery, A., Madigan, D. and Hoeting, J. (1997). Bayesian model averaging for linear regression models. J Am Statist Assoc, 92:
Spiegelhalter, D., Abrams, K. and Myles, J. (2004). Bayesian Approaches to Clinical Trials and Health Care Evaluation, Wiley, Chichester.
Western, B. and Jackman, S. (1994). Bayesian inference for comparative research. The American Political Science Review, 88:
