Remember that our objective is Bayesian inference for some density f(y | θ) for the observations, where y and θ are vectors of data and parameters, θ being sampled from a prior π(θ). The posterior is then

p(θ | y) = f(y | θ) π(θ) / ∫ f(y | θ′) π(θ′) dθ′

If we can get a formula for this (i.e. have it as an explicit function of y and θ), then we are done. But usually we cannot compute it, particularly when y and θ are multidimensional. So instead we will estimate the posterior distribution by taking lots of samples from it, and from these samples we can then do Bayesian estimation.
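Once we have draws from the posterior, any summary of interest is just an average over the draws. A minimal sketch in Python (the draws here are stand-ins generated from a normal so the snippet runs on its own; in practice they would come from the samplers below):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for draws from p(theta | y); in practice these come from MCMC.
theta_draws = rng.normal(loc=2.0, scale=0.5, size=10_000)

post_mean = theta_draws.mean()                          # posterior mean
ci_lo, ci_hi = np.percentile(theta_draws, [2.5, 97.5])  # 95% credible interval
print(f"mean={post_mean:.3f}, 95% interval=({ci_lo:.3f}, {ci_hi:.3f})")
```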

Idea of MCMC: We are going to set up a Markov chain which we can sample from and which has the posterior density as its stationary distribution. There are several different ways to do this, but in all cases there are two main issues:

1. Once we can be sure the chain indeed has the posterior as its stationary distribution, we need a stopping criterion to ensure we have actually reached stationarity (convergence diagnosis).
2. The samples from the chain are correlated, rather than being iid draws from the posterior, so we need some way to relate the variance of estimators computed from the chain (e.g. of moments of the posterior) to the variance under iid sampling from the actual posterior (variance estimation).
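To make the second issue concrete, one standard remedy is the method of batch means. A minimal sketch, with an AR(1) series standing in for MCMC output so the snippet is self-contained:

```python
import numpy as np

def batch_means_se(chain, n_batches=20):
    """Monte Carlo standard error of the chain's mean via
    non-overlapping batch means, which absorbs the autocorrelation."""
    n = len(chain) // n_batches * n_batches
    batches = np.asarray(chain[:n]).reshape(n_batches, -1).mean(axis=1)
    return batches.std(ddof=1) / np.sqrt(n_batches)

rng = np.random.default_rng(1)
x = np.empty(50_000)                 # stand-in correlated series (AR(1))
x[0] = 0.0
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + rng.normal()

naive_se = x.std(ddof=1) / np.sqrt(len(x))  # pretends the draws are iid
print(naive_se, batch_means_se(x))          # the batch-means SE is much larger
```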

How do we set up an appropriate Markov chain? Gibbs sampling: This is useful in the situation where we can sample from the conditional distributions of the posterior but do not have its full joint or marginal distributions. The assumption, usually justified, is that the full joint posterior distribution is determined uniquely by the complete set of conditional distributions, and from it we can find all marginal distributions as well. So for a posterior p(θ | y) with θ = (θ_1, …, θ_k), the conditional distributions are p(θ_i | θ_{j≠i}, y) for i = 1, …, k. If we know all of these conditional distributions, i.e. we can sample from them, then we can use them to obtain samples from the full joint posterior p(θ | y) and to sample from the marginal posteriors p(θ_i | y).

The Gibbs sampler: to build this we need a way to get a Markov chain, meaning an update from the current state of the chain to the next state. The update is not deterministic, since we are sampling from distributions; what we are doing is specifying the transition function of the chain using the conditional distributions of the posterior. Procedure (a code sketch follows below):

1. Start with initial values (θ_2^(0), …, θ_k^(0)) chosen from the prior density.
2. Draw θ_1^(1) from p(θ_1 | θ_2^(0), …, θ_k^(0), y).
3. Draw θ_2^(1) from p(θ_2 | θ_1^(1), θ_3^(0), …, θ_k^(0), y).
4. Proceed in the same way to obtain the full sample (θ_1^(1), θ_2^(1), …, θ_k^(1)).
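A minimal skeleton of this sweep in Python. Here sample_conditional is a hypothetical user-supplied function that draws θ_i from p(θ_i | θ_{j≠i}, y); the skeleton itself is just bookkeeping:

```python
import numpy as np

def gibbs(theta0, sample_conditional, y, T):
    """Run T Gibbs sweeps starting from theta0 (a length-k array).

    sample_conditional(i, theta, y) must return one draw of theta_i from
    p(theta_i | theta_{j != i}, y); it stands in for whatever conditional
    distributions the model admits.
    """
    k = len(theta0)
    chain = np.empty((T + 1, k))
    chain[0] = theta0
    theta = np.array(theta0, dtype=float)
    for t in range(1, T + 1):
        for i in range(k):
            # Each coordinate is updated using the latest values of the others.
            theta[i] = sample_conditional(i, theta, y)
        chain[t] = theta
    return chain
```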

So the above procedure provides a way to get a new update of the parameters from an initial set: the "state" of the chain is the vector (θ_1^(1), θ_2^(1), …, θ_k^(1)), and we continue this same procedure to update the state for t = 2, …, T. We then apply a stopping criterion saying that for t = t_0, …, T we are near the stationary distribution, so the vectors (θ^(t), t = t_0, …, T) are a sample from the stationary distribution of the chain, and thus a sample from the posterior (they are correlated, though, since they are taken from a single sample trajectory of the chain). So, for example, we could use (θ_1^(t), t = t_0, …, T) to get a histogram of values that are a sample from the marginal posterior density for θ_1, p(θ_1 | y), and from this get an estimate of the posterior mean.

There is no reason to limit this to a single chain, though: it would be typical to start several trajectories of the chain from different initial values. This is a standard approach for determining an appropriate burn-in period (i.e. the value t_0). You look at some estimate from each trajectory (such as the mean above) and compare the variation of this quantity within a single trajectory to the variation across the trajectories. Due to ergodicity of the chain, if we are near the stationary distribution, variation across trajectories should be identical to variation within trajectories (see the book for the scale reduction factor).
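The within-versus-across comparison is usually summarized by the (potential) scale reduction factor of Gelman and Rubin. A minimal sketch of the standard estimate for one scalar parameter, assuming chains holds m trajectories of n post-burn-in draws each:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one scalar parameter.

    chains: array of shape (m, n), m trajectories of n post-burn-in draws.
    Values near 1 suggest the trajectories agree, i.e. approximate stationarity.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)
```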

How many calculations does the above need? Each transition step of the chain requires k draws from the various conditional distributions. We do T of these steps, so a single trajectory needs kT draws from the conditionals; if we run m trajectories in total, each also needing k initial values sampled from the prior, then we have mk(T + 1) total draws to take, each of which might require several pseudo-random numbers.

Bivariate example, the case of k = 2 parameters a, b. Procedure (a concrete sketch follows below):

1. Start with initial values (a, b) chosen from the prior density.
2. Draw a′ from p(a | b, y).
3. Draw b′ from p(b | a′, y).

Then the new state of the chain is (a′, b′), and note that as we continue this, if we are at the stationary distribution, the probability of drawing a particular b′ given a′ will be the same across iterations.
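As a concrete toy case (not from the text): a standard bivariate normal target with correlation ρ, where both full conditionals are themselves normal, a | b ~ N(ρb, 1 − ρ²) and symmetrically for b | a:

```python
import numpy as np

def bivariate_gibbs(rho=0.8, T=10_000, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    a, b = 0.0, 0.0                      # arbitrary starting state
    sd = np.sqrt(1 - rho**2)             # conditional standard deviation
    chain = np.empty((T, 2))
    for t in range(T):
        a = rng.normal(rho * b, sd)      # draw a' from p(a | b)
        b = rng.normal(rho * a, sd)      # draw b' from p(b | a')
        chain[t] = a, b
    return chain

samples = bivariate_gibbs()
print(np.corrcoef(samples[1000:].T))     # after burn-in, off-diagonal ~ rho
```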

The Gibbs sampler has a variety of variations (see the handout), but all assume that we have a reasonably feasible way to sample from all the conditional distributions. When this is not feasible there are other approaches, and the Metropolis-Hastings sampler is one. The idea here is to add a rejection step in which we reject some proposed samples according to a criterion. This is a standard approach for using pseudo-random number generators to develop a sample from a probability distribution that we cannot easily sample from directly, and it goes back to von Neumann. A simple example: if you want to sample points uniformly in the unit circle, you independently choose two pseudo-random numbers x, y uniform on [-1, 1], accept the pair if x² + y² ≤ 1, and reject it otherwise.
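The unit-circle example is only a few lines of code. A minimal sketch:

```python
import numpy as np

def sample_unit_circle(n, seed=0):
    """Sample n points uniformly inside the unit circle by rejection:
    propose uniformly on the square [-1, 1]^2 and keep points with
    x^2 + y^2 <= 1."""
    rng = np.random.default_rng(seed)
    points = []
    while len(points) < n:
        x, y = rng.uniform(-1, 1, size=2)
        if x**2 + y**2 <= 1:             # accept; otherwise reject and retry
            points.append((x, y))
    return np.array(points)

pts = sample_unit_circle(1_000)
print(pts.shape, (pts**2).sum(axis=1).max())   # all squared radii <= 1
```

Since the circle has area π and the square has area 4, a fraction π/4 ≈ 0.785 of proposals is accepted.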

The Metropolis-Hastings algorithm: we wish to generate a Markov chain for the posterior p(θ), and we suppose there is a transition kernel (proposal density) q(θ, φ) which we know how to sample from. Procedure (a code sketch follows below):

1. Start with an initial value θ^(0) chosen from the prior density.
2. Generate a proposed value φ from q(θ^(0), ·).
3. Evaluate the acceptance probability α(θ^(0), φ), which for Metropolis-Hastings is α(θ, φ) = min{1, [p(φ) q(φ, θ)] / [p(θ) q(θ, φ)]}.
4. Put θ^(1) = φ with probability α(θ^(0), φ), and otherwise put θ^(1) = θ^(0).
5. Repeat.
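A minimal sketch of one Metropolis-Hastings transition using the unnormalized posterior (log densities are used for numerical stability; the function names are illustrative, not from the handout):

```python
import numpy as np

def mh_step(theta, log_post, propose, log_q, rng):
    """One Metropolis-Hastings transition.

    log_post(theta): log of the (possibly unnormalized) posterior density.
    propose(theta, rng): draw a proposal phi from q(theta, .).
    log_q(theta, phi): log of the proposal density q(theta, phi).
    """
    phi = propose(theta, rng)
    # log of p(phi) q(phi, theta) / (p(theta) q(theta, phi))
    log_alpha = (log_post(phi) - log_post(theta)
                 + log_q(phi, theta) - log_q(theta, phi))
    if np.log(rng.uniform()) < min(0.0, log_alpha):
        return phi                        # accept the proposal
    return theta                          # reject: the chain stays put

# Usage with a random-walk proposal targeting a standard normal. The
# proposal is symmetric, so log_q is constant and this reduces to the
# plain Metropolis sampler discussed below.
rng = np.random.default_rng(0)
log_post = lambda th: -0.5 * th**2
propose = lambda th, rng: th + rng.normal(scale=0.5)
log_q = lambda th, ph: 0.0
theta, chain = 0.0, []
for _ in range(20_000):
    theta = mh_step(theta, log_post, propose, log_q, rng)
    chain.append(theta)
print(np.mean(chain), np.std(chain))      # approximately 0 and 1
```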

The above procedure provides a Markov chain, and the handout shows that it has p(θ) as a stationary distribution. The trick is to choose the transition kernel; there are several simplified cases:

- Symmetric proposals: if we choose q(θ, φ) = q(φ, θ), this is the Metropolis sampler and acceptance depends only on the ratio p(φ)/p(θ): a step to higher density is always accepted, while a step to lower density is accepted with probability equal to the ratio of densities.
- Random walks: φ = θ^(j) + w_j, where the w_j are iid with some density.
- Independence chains: q(θ, φ) = f(φ) for some density f(·), so the proposal is independent of the state of the chain; if the prior is chosen as the proposal density, the acceptance criterion reduces to a ratio of likelihoods.