Imperial College, London Introduction to Bayesian inference and computation for social science data analysis Nicky Best Imperial College, London www.bias-project.org.uk

Outline Overview of Bayesian methods Illustration of conjugate Bayesian inference MCMC methods Examples illustrating: Analysis using informative priors Hierarchical priors, meta-analysis and evidence synthesis Adjusting for data quality Model uncertainty Discussion

Overview of Bayesian inference and computation

Overview of Bayesian methods Bayesian methods have been widely applied in many areas medicine / epidemiology / genetics ecology / environmental sciences finance archaeology political and social sciences, ……… Motivations for adopting Bayesian approach vary natural and coherent way of thinking about science and learning pragmatic choice that is suitable for the problem in hand

Overview of Bayesian methods Medical context: FDA draft guidance www.fda.gov/cdrh/meetings/072706-bayesian.html: “Bayesian statistics…provides a coherent method for learning from evidence as it accumulates” Evidence can accumulate in various ways: Sequentially Measurement of many ‘similar’ units (individuals, centres, sub-groups, areas, periods…..) Measurement of different aspects of a problem Evidence can take different forms: Data Expert judgement

Overview of Bayesian methods Bayesian approach also provides formal framework for propagating uncertainty Well suited to building complex models by linking together multiple sub-models Can obtain estimates and uncertainty intervals for any parameter, function of parameters or predictive quantity of interest Bayesian inference doesn’t rely on asymptotics or analytic approximations Arbitrarily wide range of models can be handled using same inferential framework Focus on specifying realistic models, not on choosing analytically tractable approximation

Bayesian inference Distinguish between x: known quantities (data) θ: unknown quantities (e.g. regression coefficients, future outcomes, missing observations) Fundamental idea: use probability distributions to represent uncertainty about unknowns Likelihood – model for the data: p(x | θ) Prior distribution – representing current uncertainty about unknowns: p(θ) Applying Bayes theorem gives the posterior distribution: p(θ | x) = p(x | θ) p(θ) / ∫ p(x | θ) p(θ) dθ ∝ p(x | θ) p(θ)

Conjugate Bayesian inference Example: election poll (from Franklin, 2004*) Imagine an election campaign where (for simplicity) we have just a Government/Opposition vote choice. We enter the campaign with a prior distribution for the proportion supporting Government. This is p(θ) As the campaign begins, we get polling data. How should we change our estimate of Government's support? *Adapted from Charles Franklin's Essex Summer School course slides: http://www.polisci.wisc.edu/users/franklin/Content/Essex/Lecs/BayesLec01p6up.pdf

Conjugate Bayesian inference Data and likelihood Each poll consists of n voters, x of whom say they will vote for Government and n − x of whom will vote for the opposition. If we assume we have no information to distinguish voters in their probability of supporting government then we have a binomial distribution for x: p(x | θ) = (n choose x) θ^x (1 − θ)^(n−x) ∝ θ^x (1 − θ)^(n−x) This binomial distribution is the likelihood p(x | θ)

Conjugate Bayesian inference Prior We need to specify a prior that expresses our uncertainty about the election (before it begins) conforms to the nature of the θ parameter, i.e. is continuous but bounded between 0 and 1 A convenient choice is the Beta distribution: p(θ) = Beta(θ | a, b) ∝ θ^(a−1) (1 − θ)^(b−1)

Conjugate Bayesian inference Beta(a,b) distribution can take a variety of shapes depending on its two parameters a and b Mean of Beta(a,b) distribution = a/(a+b) Variance of Beta(a,b) distribution = ab / [(a+b)² (a+b+1)]

Conjugate Bayesian inference Posterior Combining a Beta(a,b) prior with the binomial likelihood gives a posterior distribution p(θ | x, n) ∝ θ^(x+a−1) (1 − θ)^(n−x+b−1) = Beta(x + a, n − x + b) When prior and posterior come from same family, the prior is said to be conjugate to the likelihood Occurs when prior and likelihood have the same ‘kernel’

Conjugate Bayesian inference Suppose I believe that Government only has the support of half the population, and I think that estimate has a standard deviation of about 0.05 This is approximately a Beta(50, 50) distribution We observe a poll with 200 respondents, 120 of whom (60%) say they will vote for Government This produces a posterior which is a Beta(120+50, 80+50) = Beta(170, 130) distribution
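The conjugate update above can be sketched in a few lines of plain Python (the deck's own examples were fitted in WinBUGS; this is just an illustrative check of the arithmetic): the Beta prior's parameters are simply incremented by the observed successes and failures.

```python
# Conjugate Beta-Binomial update for the election poll example.
# Prior: Beta(50, 50); data: x = 120 Government votes out of n = 200.
from math import sqrt

def beta_binomial_update(a, b, x, n):
    """Return posterior Beta parameters after observing x successes in n trials."""
    return a + x, b + (n - x)

def beta_mean_sd(a, b):
    """Mean and SD of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(var)

a_post, b_post = beta_binomial_update(50, 50, 120, 200)
mean, sd = beta_mean_sd(a_post, b_post)
print(a_post, b_post)     # 170 130
print(round(mean, 3))     # 0.567
print(round(sd, 3))       # 0.029
```

These match the posterior summaries reported on the next slide.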

Conjugate Bayesian inference Prior mean, E(θ) = 50/100 = 0.5 Posterior mean, E(θ | x, n) = 170/300 = 0.57 Posterior SD, √Var(θ | x, n) = 0.029 Frequentist estimate is based only on the data: θ̂ = 120/200 = 0.6; SE(θ̂) = √(0.6 × 0.4 / 200) = 0.035

Conjugate Bayesian inference A harder problem What is the probability that Government wins? It is not .57 or .60. Those are expected vote shares, not the probability of winning. How to answer this? Frequentists have a hard time with this one. They can obtain a p-value for testing H0: θ ≤ 0.5, but this isn't the same as the probability that Government wins (it's actually the probability of observing data at least as extreme as 120 out of 200 if H0 is true) Easy from Bayesian perspective – calculate Pr(θ > 0.5 | x, n), the posterior probability that θ > 0.5
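Pr(θ > 0.5 | x, n) can be approximated by simulation, anticipating the Monte Carlo slides that follow. A minimal sketch using only Python's standard library to draw from the Beta(170, 130) posterior:

```python
# Monte Carlo estimate of Pr(theta > 0.5 | x): the posterior probability
# that Government wins, using draws from the Beta(170, 130) posterior.
import random

random.seed(1)
draws = [random.betavariate(170, 130) for _ in range(100_000)]
p_win = sum(d > 0.5 for d in draws) / len(draws)
print(round(p_win, 3))   # close to 0.99: Government is very likely ahead
```

The point estimates 0.57 and 0.60 say nothing like this; the posterior tail probability answers the actual question asked.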

Bayesian computation All Bayesian inference is based on the posterior distribution Summarising posterior distributions involves integration Mean: E(θ | x) = ∫ θ p(θ | x) dθ Tail area: Pr(θ > c | x) = ∫ I(θ > c) p(θ | x) dθ Except for conjugate models, integrals are usually analytically intractable Use Monte Carlo (simulation) integration (MCMC)

Bayesian computation Suppose we didn’t know how to analytically integrate the Beta(170, 130) posterior… ….but we do know how to simulate from a Beta

Bayesian computation Monte Carlo integration Suppose we have samples θ(1), θ(2),…, θ(n) from p(θ | x) Then E(θ | x) = ∫ θ p(θ | x) dθ ≈ (1/n) Σi θ(i) Can also use samples to estimate posterior tail area probabilities, percentiles, variances etc. Difficult to generate independent samples when posterior is complex and high dimensional Instead, generate dependent samples from a Markov chain having p(θ | x) as its stationary distribution → Markov chain Monte Carlo (MCMC)
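A sketch of Monte Carlo integration in plain Python, using the conjugate example where the exact answer is known so the approximation can be checked:

```python
# Monte Carlo integration: approximate E(theta | x) by the average of
# posterior draws, and compare with the exact Beta(170, 130) mean.
import random

random.seed(42)
n_samples = 200_000
draws = [random.betavariate(170, 130) for _ in range(n_samples)]
mc_mean = sum(draws) / n_samples
exact_mean = 170 / (170 + 130)
print(round(mc_mean, 3), round(exact_mean, 3))   # both ~0.567
```

Here the draws are independent; the point of MCMC is that the same averaging works with dependent draws from a Markov chain whose stationary distribution is p(θ | x).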

Illustrative Examples

Borrowing strength Bayesian learning → borrowing “strength” (precision) from other sources of information Informative prior is one such source “today's posterior is tomorrow's prior” relevance of prior information to current study must be justified

Informative priors Example 1: Western and Jackman (1994)* Example of regression analysis in comparative research What explains cross-national variation in union density? Union density is defined as the percentage of the work force who belongs to a labour union Two issues Philosophical: data represent all available observations from a population → conventional (frequentist) analysis based on long-run behaviour of repeatable data mechanism not appropriate Practical: small, collinear dataset yields imprecise estimates of regression effects * Slides adapted from Jeff Grynaviski: http://home.uchicago.edu/~grynav/bayes/abs03.htm

Informative priors Competing theories Wallerstein: union density depends on the size of the civilian labour force (LabF) Stephens: union density depends on industrial concentration (IndC) Note: These two predictors correlate at −0.92. Control variable: presence of a left-wing government (LeftG) Sample: n = 20 countries with a continuous history of democracy since World War II Fit linear regression model to compare theories: union densityi ~ N(μi, σ²) μi = β0 + β1 LeftGi + β2 LabFi + β3 IndCi

Informative priors Results with non-informative priors on regression coefficients (numerically equivalent to OLS analysis) [figure: point estimates with 95% CIs for each coefficient]

Informative priors Motivation for Bayesian approach with informative priors Because of small sample size and multicollinear variables, not able to adjudicate between theories Data tend to favour Wallerstein (union density depends on labour force size), but neither coefficient estimated very precisely Other historical data are available that could provide further relevant information Incorporation of prior information provides additional structure to the data, which helps to uniquely identify the two coefficients

Informative priors Prior distributions for regression coefficients Wallerstein Believes in negative labour force effect Comparison of Sweden and Norway in 1950: doubling of labour force corresponds to 3.5–4% drop in union density on log scale, labour force effect size ≈ −3.5/log(2) ≈ −5 Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0 β2 ~ N(−5, 2.5²)

Informative priors Prior distributions for regression coefficients Stephens Believes in positive industrial concentration effect Decline in industrial concentration in UK in 1980s: drop of 0.3 in industrial concentration corresponds to about 3% drop in union density industrial concentration effect size ≈ 3/0.3 = 10 Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0 β3 ~ N(10, 5²)

Informative priors Prior distributions for regression coefficients Wallerstein and Stephens Both believe left-wing gov'ts assist union growth Assuming 1 year of left-wing gov't increases union density by about 1% translates to effect size of 0.3 Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0 β1 ~ N(0.3, 0.15²) Vague prior β0 ~ N(0, 100²) assumed for intercept

Informative priors [figure: prior densities for the IndC, LabF and LeftG coefficients]

Informative priors Effects of LabF and IndC estimated more precisely Both sets of prior beliefs support inference that labour-force size decreases union density Only Stephens’ prior supports conclusion that industrial concentration increases union density Choice of prior is subjective – if no consensus, can we be satisfied that data have been interpreted “fairly”? Sensitivity analysis Sensitivity to priors (e.g. repeat analysis using priors with increasing variance) Sensitivity to data (e.g. residuals, influence diagnostics)

Hierarchical priors Hierarchical priors are another widely used approach for borrowing strength Useful when data available on many “similar” units (individuals, areas, studies, subgroups,…) Data xi and parameters θi for each unit i = 1,…,N Three different assumptions: Independent parameters: units are unrelated, and each θi is estimated separately using data xi alone Identical parameters: observations treated as coming from same unit, with common parameter θ Exchangeable parameters: units are “similar” (labels convey no information) → mathematically equivalent to assuming θi's are drawn from common probability distribution with unknown parameters

Meta-analysis Example 2: Meta-analysis (Spiegelhalter et al 2004) 8 small RCTs of IV magnesium sulphate following acute myocardial infarction Data: xig = deaths, nig = patients in trial i, treatment group g (0 = control, 1 = magnesium) Model (likelihood): xig ~ Binomial(pig, nig) logit(pig) = φi + θi·g θi is log odds ratio for treatment effect If not willing to believe trials are identical, but no reason to believe they are systematically different → assume θi's are exchangeable with hierarchical prior θi ~ Normal(μ, σ²) μ, σ² also treated as unknown with (vague) priors
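The borrowing of strength induced by the exchangeable prior can be sketched with a simple empirical-Bayes approximation: each trial's estimate is pulled toward the overall mean by a precision-weighted average. The trial log odds ratios and variances below are hypothetical illustrations, not the magnesium data, and the between-trial variance is fixed rather than estimated; the deck's actual analysis fits the full hierarchical model by MCMC.

```python
# Empirical-Bayes sketch of the exchangeable model theta_i ~ N(mu, tau^2):
# shrink each trial's log odds ratio toward the overall mean.
# Trial estimates and variances are HYPOTHETICAL illustrations.
theta_hat = [-0.65, -1.02, -0.10, -0.74, -1.43, -0.40, -0.05, -0.90]  # MLE log ORs
V = [0.40, 0.55, 0.30, 0.45, 0.60, 0.35, 0.50, 0.42]                  # their variances

# Precision-weighted overall mean.
mu = sum(t / v for t, v in zip(theta_hat, V)) / sum(1 / v for v in V)
tau2 = 0.2   # between-trial variance (fixed here; estimated in a full analysis)

# Posterior mean for each trial: precision-weighted average of its own
# estimate and the overall mean -> "borrowing strength".
shrunk = [(t / v + mu / tau2) / (1 / v + 1 / tau2) for t, v in zip(theta_hat, V)]
for t, s in zip(theta_hat, shrunk):
    print(f"MLE {t:+.2f} -> shrunk {s:+.2f}")
```

Each shrunk estimate lies between the trial's own MLE and the pooled mean, which is exactly the pattern in the forest plot on the next slide.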

Meta-analysis Estimates and 95% intervals for treatment effect from independent MLE and hierarchical Bayesian analysis

Meta-analysis Effective sample size: ESS = n × V1 / V2 n = sample size of trial V1 = variance of θi without borrowing (variance of MLE) V2 = variance of θi with borrowing (posterior variance of θi)
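The effective sample size formula is a one-liner; the trial numbers below are illustrative only:

```python
# Effective sample size from borrowing strength: ESS = n * V1 / V2,
# where V1 is the variance of the trial's MLE and V2 the posterior
# variance after borrowing. Numbers are illustrative.
def effective_sample_size(n, v_mle, v_posterior):
    return n * v_mle / v_posterior

# e.g. a trial of 200 patients whose posterior variance is half its MLE variance:
print(effective_sample_size(200, 0.50, 0.25))   # 400.0 -> borrowing doubled the information
```

With no borrowing (V2 = V1), ESS reduces to the actual sample size n.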

Meta-analysis Example 3: Meta-analysis of effect of class size on educational achievement (Goldstein et al, 2000) 8 studies: 1 RCT 3 matched 2 experimental 2 observational

Meta-analysis Goldstein et al use maximum likelihood, with bootstrap CI due to small sample size Under-estimates uncertainty relative to Bayesian intervals Note that 95% CI for Bayesian estimate of effect of class size includes 0

Accounting for data quality Bayesian approach also provides formal framework for propagating uncertainty about different quantities in a model Natural tool for explicitly modelling different aspects of data quality Measurement error Missing data

Accounting for data quality Example 4: Accounting for population errors in small-area estimation and disease mapping (Best and Wakefield, 1999) Context: Mapping geographical variations in risk of breast cancer by electoral ward in SE England, 1981–1991 Typical model: yi ~ Poisson(λi Ni) yi = number of breast cancer cases in area i Ni = Σt Nit = population-years at risk in area i λi is the area-specific rate of breast cancer: parameter of interest

Accounting for data quality Ni usually assumed to be known Ignores uncertainty in small-area age/sex population counts in inter-census years B&W make use of additional data on Registrar General’s mid-year district age/sex population totals Ndt Model A: Nit = Ndt pit where pit is proportion of annual district population in particular age group of interest living in ward i pit estimated by interpolating 1981 and 1991 census counts Model B: Allow for sampling variability in Nit Nit ~ Multinomial(Ndt, [ p1t ,…, pKt ]) Model C: Allow for uncertainty in proportions pit pit ~ informative Dirichlet prior distribution
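Model B's sampling step can be sketched with the standard library; the district total and ward proportions below are hypothetical, and a real analysis would embed this draw inside the MCMC rather than run it once:

```python
# Sketch of Model B: propagate sampling variability in small-area
# populations by drawing N_it ~ Multinomial(N_dt, [p_1t,...,p_Kt]).
# District total and ward proportions are HYPOTHETICAL.
import random

random.seed(0)
N_dt = 10_000                     # district population total
p = [0.25, 0.40, 0.20, 0.15]      # interpolated ward proportions, sum to 1

def multinomial_draw(n, probs, rng=random):
    """One multinomial sample via repeated categorical draws (stdlib only)."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return counts

N_it = multinomial_draw(N_dt, p)
print(N_it, sum(N_it))            # ward counts always sum to the district total
```

Model C would go one step further and place an informative Dirichlet prior on the proportions p themselves, so that both sources of uncertainty feed through to the disease rates.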

Accounting for data quality Random effects Poisson regression: log λi = αi + β Xi Xi = deprivation score for ward i [graphical model: priors on μ, σ² and β; for each ward i, αi and Ni feed into λi, which generates yi]

Accounting for data quality [graphical model extended with the population sub-model: priors on μ, σ², β and on pit; for each year t, Ndt and pit generate Nit, which combine into Ni for ward i]

Accounting for data quality [figures: RR of breast cancer for affluent vs deprived wards under models “Ni known”, A, B and C; area-specific ward RR estimates assuming Ei known vs with modelled Ei]

Model uncertainty Model uncertainty can be large for observational data studies In regression models: What is the ‘best’ set of predictors for response of interest? Which confounders to control for? Which interactions to include? What functional form to use (linear, non-linear,….)?

Model uncertainty Example 5: Predictors of crime rates in US States (adapted from Raftery et al, 1997) Ehrlich (1973) – developed and tested theory that decision to commit crime is rational choice based on costs and benefits Costs of crime related to probability of imprisonment and average length of time served in prison Benefits of crime related to income inequalities and aggregate wealth of community Net benefits of other (legitimate) activities related to employment rate and education levels in community Ehrlich analysed data from 47 US states in 1960, focusing on relationship between crime rate and the 2 prison variables Up to 13 candidate control variables also considered

Model uncertainty y = log crime rate in 1960 in each of 47 US states Z1, Z2 = log prob. of prison, log av. time in prison X1,…, X13 = candidate control variables Fit Normal linear regression model Results sensitive to choice of control variables [Table adapted from Table 2 in Raftery et al (1997): estimates and standard errors of the prison-variable effects change with the set of control variables included]

Model uncertainty Using Bayesian approach, can let the set of control variables be an unknown parameter of the model, θ Don't know (a priori) number of covariates in ‘best’ model → θ has unknown dimension → assign prior distribution Can handle such “trans-dimensional” (TD) models using “reversible jump” MCMC algorithms Normal linear regression model: yi ~ Normal(μi, σ²), i = 1,…,47 Variable selection model: μi = Zi γ + Wθ βθ, where Wθ = (Xθ1,…, Xθk) and θ = (θ1,…, θk) indexes the k currently selected control variables, with k itself unknown
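When the candidate model set is small enough to enumerate, posterior model probabilities follow directly from Bayes theorem without reversible-jump machinery; this is a simplified sketch of that idea, with hypothetical log marginal likelihoods standing in for the real computation:

```python
# Simplified alternative to reversible-jump MCMC for small model spaces:
# posterior model probabilities from marginal likelihoods via Bayes theorem.
# The log marginal likelihoods below are HYPOTHETICAL placeholders.
from math import exp

models = {                      # log p(y | model) for candidate control-variable sets
    "X4, X13":     -40.1,
    "X5, X13":     -40.5,
    "X3, X4, X13": -42.0,
    "X13 only":    -44.7,
}
prior = 1 / len(models)         # uniform prior over models

max_log = max(models.values())  # subtract the max before exponentiating (stability)
weights = {m: exp(l - max_log) * prior for m, l in models.items()}
total = sum(weights.values())
post = {m: w / total for m, w in weights.items()}
for m, p in post.items():
    print(f"p({m} | y) = {p:.3f}")
```

Model-averaged estimates, like the prison-variable effects reported later, then weight each model's coefficient estimate by these posterior probabilities.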

Model uncertainty [graphical model: k, θ, βθ and γ determine μi; covariates Xi and Zi; μi and σ² generate yi for each state i]

Model uncertainty

Model uncertainty 0 0.4 0.8 probability 0 0.4 0.8 probability Posterior probability that control variable is in model Posterior mean and 95% CI for effect (b) conditional on being in model

Model uncertainty Most likely (40%) set of control variables contains X4 (police expenditure in 1960) and X13 (income inequality) 2nd most likely (28%) set of control variables contains X5 (police expenditure in 1959) and X13 (income inequality) Control variables with >10% marginal probability of inclusion X3 : average years of schooling (18%) X4 : police expenditure in 1960 (56%) X5 : police expenditure in 1959 (40%) X13 : income inequality (94%) Posterior estimates of prison variables, averaged over models log prob. of prison: -0.29 (-0.55, -0.05) log av. time in prison: -0.27 (-0.69, 0.14)

Discussion Bayesian approach provides coherent framework for combining many sources of evidence in a statistical model Formal approach to “borrowing strength” Improved precision/effective sample size Fully accounts for uncertainty Relevance of different pieces of evidence is a judgement – must be justifiable Bayesian approach forces us to be explicit about model assumptions Sensitivity analysis to assumptions is crucial

Discussion Bayesian calculations are computationally intensive, but: Provides exact inference; no asymptotics MCMC offers huge flexibility to model complex problems All examples discussed here were fitted using free WinBUGS software: www.mrc-bsu.cam.ac.uk Want to learn more about using Bayesian methods for social science data analysis? Short course: Introduction to Bayesian inference and WinBUGS, Sept 19–20, Imperial College See www.bias-project.org.uk for details

Thank you!

References Best, N. and Wakefield, J. (1999). Accounting for inaccuracies in population counts and case registration in cancer mapping studies. J Roy Statist Soc, Series A, 162: 363-382 Goldstein, H., Yang, M., Omar, R., Turner, R. and Thompson, S. (2000). Meta-analysis using multilevel models with an application to the study of class size effects. Applied Statistics, 49: 399-412. Raftery, A., Madigan, D. and Hoeting, J. (1997). Bayesian model averaging for linear regression models. J Am Statist Assoc, 92: 179-191. Spiegelhalter, D., Abrams, K. and Myles, J. (2004). Bayesian Approaches to Clinical Trials and Health Care Evaluation, Wiley, Chichester. Western, B. and Jackman, S. (1994). Bayesian inference for comparative research. The American Political Science Review, 88: 412-423.