Markov-Chain-Monte-Carlo (MCMC) & The Metropolis-Hastings Algorithm
P548: Intro Bayesian Stats with Psych Applications
Instructor: John Miyamoto
01/19/2016: Lecture 03-1

Presentation transcript:

Markov-Chain-Monte-Carlo (MCMC) & The Metropolis-Hastings Algorithm
P548: Intro Bayesian Stats with Psych Applications. Instructor: John Miyamoto. 01/19/2016: Lecture 03-1
Note: This PowerPoint presentation may contain macros that I wrote to help me create the slides. The macros aren't needed to view the slides. You can disable or delete the macros without any change to the presentation.

Outline
Overview of the JAGS approach to Bayesian computation
Metropolis-Hastings Algorithm - a basic tool for approximating a posterior distribution.
♦ This is the central step in computing a Bayesian analysis.

Outline
Metropolis-Hastings Algorithm - a basic tool for approximating a posterior distribution.
♦ This is the central step in computing a Bayesian analysis.
coda.samples and mcmc.list objects:
♦ Understanding the structure of the output that is returned by the coda.samples function.
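
For readers who want to see where coda.samples output and mcmc.list objects come from, here is a minimal sketch of the JAGS workflow in R. It assumes the rjags package and an illustrative beta-binomial model; the model string, data values (z = 14 successes in N = 20 trials), and object names are my own examples, not material from the lecture.

```r
## A minimal sketch of the JAGS workflow, assuming the rjags package and an
## illustrative beta-binomial model (model string, data, and names are examples).
library(rjags)

model.string <- "
model {
  theta ~ dbeta(1, 1)     # prior on the binomial parameter
  z ~ dbin(theta, N)      # likelihood: z successes in N trials
}"

data.list <- list(z = 14, N = 20)

jags.mod <- jags.model(textConnection(model.string),
                       data = data.list, n.chains = 3)
update(jags.mod, n.iter = 1000)                  # burn-in
post.samples <- coda.samples(jags.mod,
                             variable.names = "theta",
                             n.iter = 5000)

class(post.samples)    # "mcmc.list": a list with one mcmc object per chain
summary(post.samples)  # posterior mean, SD, and quantiles for theta
```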

Metropolis-Hastings Algorithm

Outline Metropolis-Hastings (MH) algorithm – one of the main tools for approximating a posterior distribution by means of Markov-Chain- Monte-Carlo (MCMC) General idea behind MH algorithm Kruschke's example for computing the posterior of a binomial parameter by means of MH algorithm Psych 548, Miyamoto, Win '16 5 General Strategy of Bayesian Statistical Inference

Three Strategies of Bayesian Statistical Inference
[Flowchart] Define the class of statistical models (reality is assumed to lie within this class of models). Define prior distributions, define likelihoods conditional on parameters, and obtain data. Then compute the posterior by one of three routes:
♦ Compute the posterior from conjugate priors (if possible)
♦ Compute an approximate posterior by an MCMC algorithm (if possible)
♦ Compute the posterior with a grid approximation (if practically possible)

General Strategy of Bayesian Statistical Inference
[Same flowchart as the previous slide, repeated to set up the idea of sampling from a posterior distribution.]
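
As an illustration of the grid-approximation route in the flowchart above, here is a minimal R sketch for a single binomial parameter theta. The data values (14 successes in 20 trials) and the Beta(1,1) prior are illustrative assumptions, not values from the lecture.

```r
## A minimal sketch of grid approximation for one binomial parameter theta,
## assuming 14 successes in 20 trials and a uniform Beta(1,1) prior.
theta.grid  <- seq(0, 1, length.out = 1001)           # grid over the parameter
prior       <- dbeta(theta.grid, 1, 1)                # prior at each grid point
likelihood  <- dbinom(14, size = 20, prob = theta.grid)
unnorm.post <- likelihood * prior
posterior   <- unnorm.post / sum(unnorm.post)         # normalize over the grid

sum(theta.grid * posterior)                           # approximate posterior mean
```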

MCMC Algorithm Samples from the Posterior Distribution
[Figure: an MCMC algorithm draws a sample of size K (a BIG NUMBER) from the posterior distribution.]

Validity of the MCMC Approximation
Theorem: Under very general mathematical conditions, as the sample size K gets very large, the sample distribution converges to the true posterior probability distribution.
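
One way to see what the theorem claims is to compare sample summaries against a case where the true posterior is known exactly. The sketch below assumes the conjugate beta-binomial case (a Beta(15, 7) posterior for 14 successes in 20 trials under a Beta(1,1) prior) and uses independent rbeta() draws as a stand-in for the K samples an MCMC algorithm would produce; the numbers are illustrative.

```r
## A sketch of the convergence claim, assuming the conjugate beta-binomial case
## where the true posterior is Beta(15, 7); rbeta() draws stand in for K MCMC samples.
set.seed(548)
for (K in c(100, 10000, 1000000)) {
  draws <- rbeta(K, 15, 7)
  cat(sprintf("K = %7d: sample mean = %.4f  (true posterior mean = %.4f)\n",
              K, mean(draws), 15 / 22))
}
```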

Before computing a Bayesian analysis, the researcher knows:
The true stat model belongs to an infinite class of models.
The true stat model is characterized by a vector of parameters, θ = (θ1, θ2, ..., θn).
♦ E.g., in a two-way normal model: θ = (θ1, θ2, θ3, θ4), where θ1 = mean of pop 1, θ2 = mean of pop 2, θ3 = variance of pop 1, θ4 = variance of pop 2.

Before computing a Bayesian analysis, the researcher knows:
The true stat model belongs to an infinite class of models.
The true stat model is characterized by a vector of parameters, θ = (θ1, θ2, ..., θn).
P(θ) = the prior probability distribution over the vector θ.
P(D | θ) = the likelihood of the data D given a specific θ.
Bayes Rule: P(θ | D) = P(D | θ) P(θ) / P(D)
♦ The terms P(D | θ) and P(θ) are known; P(D) has no simple math formula; the posterior P(θ | D) is the unknown we want to compute.

Before computing a Bayesian analysis, the researcher knows:
θ = (θ1, θ2, ..., θn) is a vector of parameters for a statistical model.
♦ E.g., in a one-way ANOVA with 3 groups, θ1 = mean 1, θ2 = mean 2, θ3 = mean 3, and θ4 = the common variance of the 3 populations.
P(θ) = the prior probability distribution over the vector θ.
P(D | θ) = the likelihood of the data D given any particular vector θ of parameters.
Bayes Rule: P(θ | D) = P(D | θ) P(θ) / P(D)

Why Is Bayes Rule Hard to Apply in Practice?
Fact #1: The product P(D | θ) P(θ) is easy to compute for individual cases (individual values of θ), but the posterior is hard to compute for an entire distribution because P(D) has no simple formula.
The Metropolis-Hastings Algorithm uses Fact #1 to compute an approximation to P(θ | D), where P(θ | D) = P(D | θ) P(θ) / P(D).
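
The sketch below illustrates Fact #1 for a single binomial parameter (illustrative data again: 14 successes in 20 trials, Beta(1,1) prior). Evaluating P(D | θ) P(θ) at one value of θ is a one-line computation, whereas P(D) requires integrating over all of θ; that integral is manageable in one dimension but becomes intractable when θ has many dimensions.

```r
## Fact #1 in miniature: the unnormalized posterior is easy to evaluate at any
## single theta; P(D) requires an integral over the whole parameter space.
unnorm.post <- function(theta) {
  dbinom(14, size = 20, prob = theta) * dbeta(theta, 1, 1)
}

unnorm.post(0.7)                                     # easy: one value of theta
p.D <- integrate(unnorm.post, lower = 0, upper = 1)$value
p.D                                                  # the hard part in general
```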

Reminder: Each Sample from the Posterior Depends Only on the Immediately Preceding Step

BIG PICTURE: Metropolis-Hastings Algorithm
At the k-th step, you have a current vector of parameter values (Iteration k). This is your current sample.
A "proposal function" F proposes a random new vector (Proposal k) based only on the values in Iteration k.
A "rejection rule" decides whether the proposal is acceptable or not.
♦ If it is acceptable: Iteration k + 1 = Proposal k
♦ If it is rejected: Iteration k + 1 = Iteration k
Repeat the process at the next step.

Metropolis-Hastings (M-H) Algorithm: The "Proposal" Density
Notation: Let θk = (θk,1, θk,2, ..., θk,n) be the vector of specific values for θ1, θ2, ..., θn that make up the k-th sample.
Choose a "proposal" density F(θ | θk), where for any specific θk, F(θ | θk) is a probability distribution over the θ ∈ Ω = the set of all parameter vectors.
Example: F(θ | θk) might be defined by: θ1 ~ N(θk,1, σ = 2), θ2 ~ N(θk,2, σ = 2), ..., θn ~ N(θk,n, σ = 2).
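
A one-line R version of this example proposal: each component of the candidate vector is drawn from a normal distribution centered at the corresponding component of the current sample. The current values below are made up for illustration.

```r
## Drawing one candidate from the example proposal density F(theta | theta.k):
## each component is Normal(corresponding component of theta.k, sd = 2).
theta.k <- c(0.3, 1.5, 2.0)        # current sample theta_k (illustrative values)
theta.c <- rnorm(length(theta.k), mean = theta.k, sd = 2)
theta.c                            # candidate vector proposed from theta.k
```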

MH Algorithm for the Case Where the Proposal Function Is Symmetric
Step 1: Choose starting values for θ, i.e., an initial vector that serves as the first sample.
Step 2: Draw a candidate θc from F(θ | θk), i.e., θc ~ F(θ | θk). (Remember that F(θ | θk) is the proposal distribution, which depends only on θk.)
Step 3: Compute the posterior odds:
R = P(θc | D) / P(θk | D) = [ P(D | θc) P(θc) ] / [ P(D | θk) P(θk) ]
Step 4: Draw u ~ Uniform(0, 1). If R > u, set θk+1 = θc. If R ≤ u, set θk+1 = θk.
Step 5: Set k = k + 1, and return to Step 2. Continue this process until you have a very large sample of θ.
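
Here is a minimal R sketch of Steps 1-5 for a single binomial parameter, in the spirit of Kruschke's example mentioned earlier. The data (14 successes in 20 trials), the Beta(1,1) prior, the proposal SD of 0.2, and the chain length are illustrative assumptions, not values from the lecture.

```r
## A minimal Metropolis-Hastings sampler for one binomial parameter theta,
## with a symmetric normal random-walk proposal (illustrative data and tuning).
set.seed(548)
unnorm.post <- function(theta) {
  if (theta < 0 || theta > 1) return(0)            # zero outside the parameter space
  dbinom(14, size = 20, prob = theta) * dbeta(theta, 1, 1)
}

K <- 50000
theta.chain <- numeric(K)
theta.chain[1] <- 0.5                              # Step 1: starting value

for (k in 1:(K - 1)) {
  theta.c <- rnorm(1, mean = theta.chain[k], sd = 0.2)       # Step 2: propose
  R <- unnorm.post(theta.c) / unnorm.post(theta.chain[k])    # Step 3: posterior odds
  u <- runif(1)                                              # Step 4: accept/reject
  theta.chain[k + 1] <- if (R > u) theta.c else theta.chain[k]
}

mean(theta.chain[-(1:1000)])      # posterior mean, discarding 1000 burn-in samples
```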

Closer Look at Steps 3 & 4 (Assuming a Symmetric Proposal Function)
Step 3: Compute the posterior odds R = P(θc | D) / P(θk | D), where θc = the "candidate" sample and θk = the previously accepted k-th sample.
Step 4: Draw a random u ~ Uniform(0, 1). If R > u, set θk+1 = θc. If R ≤ u, set θk+1 = θk.
♦ If P(θc | D) > P(θk | D), then R > 1.0, so it is certain that θk+1 = θc.
♦ If P(θc | D) < P(θk | D), then R is the probability that θk+1 = θc.
♦ Conclusion: The MCMC chain tends to jump towards high-probability regions of the posterior, but it can also jump to low-probability regions.
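
A quick numerical check of the second bullet: drawing u ~ Uniform(0, 1) and accepting when R > u accepts the candidate with probability R whenever R < 1 (and always when R ≥ 1). The value of R below is made up for illustration.

```r
## Accepting when u < R (with u ~ Uniform(0,1)) accepts with probability min(1, R).
R <- 0.3                           # illustrative posterior odds less than 1
mean(runif(100000) < R)            # proportion accepted: approximately 0.3
```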

Remainder of These Slides
The remainder of these slides were added after class on 1/20/2016. They are of interest only to students who want to delve deeper into the Metropolis-Hastings algorithm.
Scott Lynch has a nice description of the Metropolis-Hastings algorithm, including the role of asymmetric proposal functions:
Lynch, S. M. (2007). Introduction to Applied Bayesian Statistics and Estimation for Social Scientists. New York: Springer.

MH Algorithm for the Case Where the Proposal Function Is Asymmetric
When the proposal function is asymmetric, Step 3 must be modified, but all other steps remain exactly the same. (Kruschke only discusses the case of a symmetric proposal function. On the original slide, Steps 1, 2, 4, and 5 are shown in a smaller font because they are the same as before.)
Step 1: Choose starting values for θ, i.e., an initial vector that serves as the first sample.
Step 2: Draw a candidate θc from F(θ | θk), i.e., θc ~ F(θ | θk). (Remember that F(θ | θk) is the proposal distribution, which depends only on θk.)
Step 3 (the step that differs): Compute the criterion R:
R = { [ P(D | θc) P(θc) ] / [ P(D | θk) P(θk) ] } × [ F(θk | θc) / F(θc | θk) ]
Step 4: Draw u ~ Uniform(0, 1). If R > u, set θk+1 = θc. If R ≤ u, set θk+1 = θk.
Step 5: Set k = k + 1, and return to Step 2. Continue this process until you have a very large sample of θ.

(The previous slide is repeated here with one annotation attached to the criterion in Step 3: what is the ratio F(θk | θc) / F(θc | θk)? See the next slide.)

F(θc | θk) and F(θk | θc) for Symmetric & Asymmetric Proposal Functions
[Figure: two panels contrasting proposal densities. For symmetric, equal-variance proposal functions, the heights F(θc | θk) and F(θk | θc) are equal. For asymmetric proposal functions, the heights are unequal.]

(The same algorithm slide is repeated once more with one annotation attached to Step 3: the ratio F(θk | θc) / F(θc | θk) is the correction for the asymmetry of the proposal function.)
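
To make the correction concrete, here is a sketch of Step 3 with an asymmetric proposal. It assumes a hypothetical Beta proposal F(θ | θk) = Beta(c·θk, c·(1 − θk)) for a parameter in (0, 1), and it reuses the unnorm.post() function from the symmetric sketch above; the concentration constant c = 20 is made up.

```r
## Step 3 with the Hastings correction for an asymmetric Beta proposal
## (hypothetical proposal; unnorm.post() is the function defined earlier).
c.prm  <- 20
F.dens <- function(theta, given) dbeta(theta, c.prm * given, c.prm * (1 - given))

theta.k <- 0.5                                               # current sample (illustrative)
theta.c <- rbeta(1, c.prm * theta.k, c.prm * (1 - theta.k))  # Step 2: draw candidate

R <- (unnorm.post(theta.c) / unnorm.post(theta.k)) *   # posterior-odds part
     (F.dens(theta.k, given = theta.c) /               # correction for asymmetry:
      F.dens(theta.c, given = theta.k))                #   F(theta.k | theta.c) / F(theta.c | theta.k)
```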

Summary re Proposal Functions
Kruschke's examples use a symmetric proposal function. Therefore the criterion R is computed as:
R = [ P(D | θc) P(θc) ] / [ P(D | θk) P(θk) ] = P(θc | D) / P(θk | D)
For Psych 548, you don't have to worry about asymmetric proposal functions – just be aware that they are possible and they influence how the M-H algorithm works.