Markov Chain Monte Carlo


Bayesian Inference: Analytic vs. MCMC Methods
Analytic methods use conjugate priors and are explored in Ch. 6; use them when a conjugate prior is known and the joint distribution of the parameter space is tractable (few continuous parameters). Variational approximation is less accurate: use it as a last resort! The numeric methods, the Metropolis algorithm and Gibbs sampling, are explored in Ch. 7. Because many posteriors are intractable, MCMC revolutionized the field.

Numeric Methods: Metropolis

MCMC Overview
When the integral cannot be solved analytically (as with multiple continuous parameters), numeric approximation is a good approach; in practice, MCMC should be used only in that case. By the Law of Large Numbers (LLN), numerical sampling methods can approximate distributions. MCMC approximates a very important distribution: the posterior.

Metropolis Algorithm: Motivation
Scenario: a politician lives on an island chain and wants to spend time on each island in proportion to its population. He cannot learn every island's population a priori; he can only query the populations of neighboring islands. Let's design an algorithm...

Metropolis Algorithm Procedure
Generate a proposal distribution (e.g., move left vs. right). Select a proposal candidate (e.g., flip a coin). Move to the candidate according to this rule: accept the move with probability p(move) = min(1, P(candidate) / P(current)), where P = P(D|Θ) * P(Θ); there is no need to compute the normalizing integral! (Figures: island populations, moves, visit frequency.)
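The island walk can be sketched in a few lines of Python. The seven islands and their relative populations here are hypothetical stand-ins for the slide's example.

```python
import random

# Hypothetical relative island populations (need not be normalized).
populations = [1, 2, 3, 4, 5, 6, 7]

def metropolis_islands(n_steps, start=3, seed=0):
    """Metropolis walk: propose a neighbor by coin flip, accept the move
    with probability min(1, P(candidate) / P(current))."""
    rng = random.Random(seed)
    visits = [0] * len(populations)
    current = start
    for _ in range(n_steps):
        candidate = current + rng.choice([-1, 1])  # flip a coin: left or right
        if 0 <= candidate < len(populations):
            # Only the ratio of (unnormalized) plausibilities is needed.
            if rng.random() < min(1.0, populations[candidate] / populations[current]):
                current = candidate
        visits[current] += 1
    return visits

visits = metropolis_islands(100_000)
```

In the long run the visit frequencies approach populations / sum(populations), which is exactly why the normalizing integral never has to be computed.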

Metropolis: Migrating Location Distribution
That was one trial run; what can be said for all runs? Consider the probability distribution of the politician's location. One can prove that this "goal distribution" is stable (once achieved, it will not change), unique (there is only one such stable distribution), and achievable (real algorithms arrive at the solution). In short, the goal distribution is an attractor.

Proposal Distributions: stddev = Learning Rate
The island example used a discrete parameter (the island), and our proposal distribution was just { left, right }. For continuous parameters, we can use a normal proposal distribution Ɲ(μ, σ), where σ drives the learning rate (the step size of the walk).
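For a continuous parameter the same loop works with a normal proposal. This minimal sketch uses a standard-normal target as a stand-in for an arbitrary posterior and shows where σ enters.

```python
import math
import random

def metropolis_continuous(log_p, n_steps, sigma, start=0.0, seed=0):
    """Metropolis with proposal N(current, sigma); sigma is the step size:
    too small gives a clumpy walk, too large gives many rejections."""
    rng = random.Random(seed)
    chain = []
    current, log_p_current = start, log_p(start)
    for _ in range(n_steps):
        candidate = rng.gauss(current, sigma)
        log_p_candidate = log_p(candidate)
        # Accept with probability min(1, p(candidate) / p(current)),
        # computed on the log scale for numerical stability.
        if math.log(rng.random()) < log_p_candidate - log_p_current:
            current, log_p_current = candidate, log_p_candidate
        chain.append(current)
    return chain

# Illustrative target: log density of a standard normal, up to a constant.
chain = metropolis_continuous(lambda x: -0.5 * x * x, 50_000, sigma=1.0)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
```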

Example: Two Coins

Exact Solution
We must select a conjugate prior. Beta distributions are conjugate to binomial likelihoods (their product is integrable): p(Θ1, Θ2) = beta(Θ1|a, b) * beta(Θ2|a, b).
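The conjugate update itself is pure arithmetic. In this sketch the beta(2, 2) prior and the flip counts are made-up numbers, not values from the slides.

```python
def beta_posterior_params(a, b, heads, flips):
    """Conjugate update: a beta(a, b) prior times a binomial likelihood
    yields a beta(a + heads, b + flips - heads) posterior."""
    return a + heads, b + flips - heads

def beta_mean(a, b):
    """Mean of a beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical data: coin 1 shows 6 heads in 8 flips, coin 2 shows 2 in 7.
a1, b1 = beta_posterior_params(2, 2, 6, 8)
a2, b2 = beta_posterior_params(2, 2, 2, 7)
# With independent priors the joint posterior factorizes:
# p(theta1, theta2 | D) = beta(theta1 | a1, b1) * beta(theta2 | a2, b2).
```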

Metropolis Solution
For two parameters, our proposal distribution is a bivariate normal. If we set σ to 0.02, we get a fractal-looking walk; with a larger σ, the walk is less clumpy and more efficient. Effective sample size (ESS) is a measure of algorithmic efficiency (defined later).

Gibbs Sampling

Introducing Gibbs
Metropolis works best if the proposal distribution is properly tuned to the posterior. Gibbs sampling is more efficient, and well suited to hierarchical models. In Gibbs, parameters are selected one at a time and cycled through (Θ1, Θ2, ..., Θ1, Θ2, ...). The new proposal distribution for each parameter is its conditional posterior; e.g., with only two parameters, we alternate between sampling Θ1 from p(Θ1|Θ2, D) and Θ2 from p(Θ2|Θ1, D). Because the proposal distribution exactly mirrors the posterior for that parameter, the proposed move is always accepted.

Gibbs Sampling: Pros & Cons
(Figures: the walk with intermediate steps shown vs. whole steps shown.) Pros: no inefficiency from rejected proposals, and no need to tune proposal distributions. Cons: the one-dimensional conditional posteriors must be derivable, and progress can be stalled by highly correlated parameters. To motivate the second con, imagine a long, narrow, diagonal hallway: how does Gibbs differ from Metropolis there?
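Both points, the always-accepted conditional draws and the correlated-parameter stall, can be seen on a target whose conditionals are known exactly. The bivariate normal below is an illustrative stand-in (not the two-coin example), with the correlation ρ playing the role of the hallway's narrowness.

```python
import random

def gibbs_bivariate_normal(rho, n_steps, seed=0):
    """Gibbs sampling for a standard bivariate normal with correlation rho.
    Each conditional is exact, x | y ~ N(rho * y, 1 - rho**2), so every
    draw is accepted; large |rho| makes the conditionals narrow and
    progress slow (the diagonal-hallway problem)."""
    rng = random.Random(seed)
    sd = (1 - rho * rho) ** 0.5
    x = y = 0.0
    samples = []
    for _ in range(n_steps):
        x = rng.gauss(rho * y, sd)  # draw x from its conditional posterior
        y = rng.gauss(rho * x, sd)  # draw y given the new x
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(0.5, 50_000)
mean_x = sum(x for x, _ in samples) / len(samples)
mean_xy = sum(x * y for x, y in samples) / len(samples)
```

With ρ = 0.5 the chain mixes well and the sample moments match the target (mean 0, E[xy] = ρ); pushing ρ toward 1 would visibly stall it.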

MCMC Diagnostics

The Space of MCMC Algorithms
Monte Carlo: sampling random values from a distribution (inspired by the famous casino locale). Markov chain: any sequential process in which each step has no memory of the states before the current one. We have seen two specific Markov chain Monte Carlo (MCMC) methods, Metropolis and Gibbs; other members of the family include slice sampling and reversible jump.

MCMC Desiderata
First, the output must be representative: not unduly influenced by the starting value, and not "stuck". Second, the method must be repeatable: the central tendency and HDI should not differ much if the analysis is rerun. Third, the method must be efficient: good approximations of the posterior should emerge from minimal resources.

Representative: Density Plots
A density plot averages across overlapping intervals to show probability density. Here are three density plots for MCMC runs with three different starting conditions. Importantly, if we consider only the first 500 steps, we get distorted results. We therefore usually exclude the initial steps, called the burn-in period, from our diagnostics.

Representative: Trace Plots
We've seen trace plots before: they trace parameter location across time. Sticking with the same example, here is the trace plot for our one-dimensional parameter. All three runs show similar behavior: the algorithm converges. As before, the burn-in period should be excluded (irrelevant information).

Representative: Shrink Factor
A measure of convergence is the Gelman-Rubin statistic, i.e., the shrink factor. It relates the between-chain variance to the within-chain variance. If one or more chains get "stuck", the between-chain variance increases relative to the within-chain variance. If no chain gets stuck, the shrink factor converges to 1.0, as in our example.
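A minimal version of the shrink factor can be computed directly from the chains. The three N(0, 1) chains below are simulated stand-ins for real MCMC output.

```python
import random

def gelman_rubin(chains):
    """Shrink factor: compare between-chain variance B with within-chain
    variance W; values near 1.0 indicate the chains have converged."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * W + B / n  # pooled variance estimate
    return (var_hat / W) ** 0.5

rng = random.Random(0)
chains = [[rng.gauss(0, 1) for _ in range(5_000)] for _ in range(3)]
r_hat = gelman_rubin(chains)

# A "stuck" chain inflates B relative to W and pushes the factor above 1.
stuck = [[x + 2.0 for x in chains[0]]] + chains[1:]
r_hat_stuck = gelman_rubin(stuck)
```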

Autocorrelation Tutorial
Some chains are clumpier than others: successive steps do not provide independent information about the parameter distribution. To measure clumpiness, we use autocorrelation: the correlation of the chain values with the chain values k steps ahead. Consider ACF(k=10): at step 50 the chain value is 13, and at step 60 it is 16, so this pair is converted into the point (13, 16). Other value pairs are likewise converted into points, and the correlation of the resulting scatterplot is 0.55. We can likewise compute ACF(1), ACF(5), etc.
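The pairing described above is exactly what a lag-k autocorrelation computes. In this sketch the AR(1)-style chain is a synthetic stand-in whose true lag-1 correlation is 0.9.

```python
import random

def acf(chain, k):
    """Lag-k autocorrelation: correlate each value x[t] with x[t + k]
    (each pair forms one point of the slide's scatterplot)."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[t] - mean) * (chain[t + k] - mean)
              for t in range(n - k)) / n
    return cov / var

# A clumpy chain: each value is a noisy copy of the previous one,
# so nearby steps are correlated and the ACF decays with lag.
rng = random.Random(0)
chain = [0.0]
for _ in range(20_000):
    chain.append(0.9 * chain[-1] + rng.gauss(0, 1))
```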

Accuracy: Autocorrelation
After excluding the burn-in period, the ACFs of the different trials also match one another. How sharply the ACF curve drops off is a good measure of algorithmic efficiency.

ESS & MCSE
The effective sample size (ESS) shrinks as autocorrelation grows; it thereby captures the amount of independent information in the chain. (R command: the effectiveSize function in the coda package.) The Monte Carlo standard error (MCSE) is an analogue of the classical standard error SE = SD / √N: we simply replace N with ESS, giving MCSE = SD / √ESS.
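A crude version of both quantities can be computed by hand. The cutoff at ACF < 0.05 below is a simplification of the truncation rule (coda's effectiveSize uses a more careful spectral estimate), and the two chains are synthetic stand-ins.

```python
import random

def ess(chain, max_lag=200):
    """Effective sample size: N / (1 + 2 * sum of ACFs), truncated once
    the autocorrelation becomes negligible."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    tau = 1.0
    for k in range(1, max_lag + 1):
        rho = sum((chain[t] - mean) * (chain[t + k] - mean)
                  for t in range(n - k)) / (n * var)
        if rho < 0.05:  # crude truncation of the ACF sum
            break
        tau += 2 * rho
    return n / tau

def mcse(chain):
    """Monte Carlo standard error: SD / sqrt(ESS) instead of SD / sqrt(N)."""
    n = len(chain)
    mean = sum(chain) / n
    sd = (sum((x - mean) ** 2 for x in chain) / n) ** 0.5
    return sd / ess(chain) ** 0.5

rng = random.Random(0)
iid = [rng.gauss(0, 1) for _ in range(10_000)]  # independent draws
ar = [0.0]                                      # highly autocorrelated chain
for _ in range(10_000):
    ar.append(0.9 * ar[-1] + rng.gauss(0, 1))
```

On the independent chain the ESS stays close to N, while the autocorrelated chain retains only a fraction of that information, and its MCSE is correspondingly larger.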

The End