Markov Chain Monte Carlo (MCMC)
Chapter 7 of the Kruschke text. Darrell A. Worthy, Texas A&M University
MCMC
It is MCMC algorithms and software, along with fast computer hardware, that allow us to do Bayesian data analyses that would have been effectively impossible 30 years ago. We will use the example of inferring the underlying probability θ that a coin comes up heads, given an observed set of flips. Previously with this example we used numerical approximation to obtain the posterior distribution. In Chapter 6 Kruschke uses a beta prior distribution, the conjugate prior of the Bernoulli likelihood function, which yields a beta posterior distribution for θ.
MCMC
There are numerous situations where analytical solutions and numerical approximations will not work to solve Bayes' rule. Our beliefs about θ may not be representable by a beta distribution, or by any function that yields an analytically solvable posterior. We could use grid approximation in such situations, but none of these methods scale well as we add parameters. MCMC is an approach that overcomes these problems, and it is the underlying algorithmic machinery for inferring the posterior distribution in packages such as JASP, JAGS, and Stan. We will mostly focus on the one-parameter example because it is a simple setting in which to learn about MCMC.
MCMC
MCMC assumes that the prior distribution is specified by a function that is easily evaluated, and that the value of the likelihood function, p(D|θ), can be computed for any specified values of D and θ. The important thing about these requirements is that the method does not require evaluating the difficult integral in the denominator of Bayes' rule.
MCMC
What we get out of MCMC is an approximation of the posterior distribution, p(θ|D), in the form of a large number of θ values sampled from that distribution. This is conceptually similar to bootstrapping, another Monte Carlo method. This big pile of representative parameter (θ) values can then be used to estimate the central tendency of the posterior (usually the mode), the 95% HDI, and other statistics such as the Bayes factor. The random generation of values from the posterior distribution is analogous to the random events at casino games, which is where the "Monte Carlo" name comes from.
Approximating a Distribution
The concept of representing a distribution by a large representative sample is foundational to Bayesian analysis of complex models. The idea is applied intuitively and routinely in everyday life and in science; polls and surveys are founded on it. By randomly sampling a subset of people from a population, we can estimate the population's central tendency and variance. In MCMC we simply sample from a "population" that is a mathematically defined distribution, such as a posterior probability distribution.
Approximating a Distribution
This figure shows an example of approximating a mathematical distribution by a large sample. The top-left panel plots an exact beta distribution. The top-right panel shows a rough approximation with N = 500 samples. The bottom panels show samples of 5,000 and 50,000, which better approximate the true distribution.
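As a small sketch of the same idea in base R (the shape parameters 15 and 7 are arbitrary illustrative choices, not the figure's exact values):

# Sketch: approximating an exact beta density with random samples.
theta <- rbeta(50000, 15, 7)        # 50,000 representative values
hist(theta, breaks = 50, freq = FALSE,
     main = "Sample approximation of beta(15, 7)")
curve(dbeta(x, 15, 7), add = TRUE, lwd = 2)  # exact density overlaid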
The Metropolis Algorithm
Our goal in Bayesian inference is to get an accurate representation of the posterior distribution. One way to do that is to sample a large number of representative points from the posterior. Our question then becomes: how can we sample a large number of representative values from a distribution? For an answer, let's ask a politician…
The Metropolis Algorithm
Suppose an elected politician lives on a long chain of islands. He is constantly traveling from island to island, wanting to stay in the public eye. At the end of a long day of photo ops and fundraising he has to decide whether to (1) stay on the current island, (2) move to the island to the west, or (3) move to the island to the east. His goal is to visit all the islands in proportion to their relative populations, so that he spends the most time on the most populated islands.
The Metropolis Algorithm
Unfortunately, our politician holds his office despite not knowing the total population of the island chain, and he doesn't even know exactly how many islands there are! When his advisers are not busy fundraising, they can ask the mayor of the current island how many people live there. And when the politician proposes visiting an adjacent island, his advisers can ask that island's mayor how many people live on the proposed island.
The Metropolis Algorithm
The politician uses a simple heuristic to decide whether to travel to a proposed island. First he flips a fair coin to decide whether to propose the adjacent island to the east or the one to the west. If the proposed island has a larger population than his current island, he definitely goes there. Otherwise, if the proposed island has a smaller population, he goes there only probabilistically, in proportion to the ratio of the proposed island's population to the current island's. For example, if the proposed island is half as big as his current island, he goes there with probability .50.
The Metropolis Algorithm
The specific formula for the probability of moving to a smaller island is p(move) = Pop(proposed)/Pop(current). The politician has a random number generator that gives a number between 0 and 1; if the number is less than p(move), he moves to the proposed island. What's amazing is that this simple algorithm works: over many visits, the number of visits to each island becomes proportional to the island's relative population.
A random walk
Looking at this example in more detail, suppose there are seven islands in the chain, with the relative populations shown in the original figure (in Kruschke's example, relative population is simply proportional to the island's index θ). The islands are indexed by the value θ, and the uppercase P(θ) refers to the relative population of an island. The islands to the left of position 1 and to the right of position 7 have populations of zero, so proposals to move there are always declined.
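Below is a minimal R sketch of this walk, assuming (as noted above) that relative population is proportional to the island's index:

# Sketch: the island-hopping Metropolis walk over seven islands whose
# relative populations are proportional to their index, 1..7.
set.seed(7)
relPop <- function(theta) ifelse(theta >= 1 & theta <= 7, theta, 0)
nSteps <- 50000
position <- numeric(nSteps)
position[1] <- 4                              # arbitrary starting island
for (t in 2:nSteps) {
  current  <- position[t - 1]
  proposed <- current + sample(c(-1, 1), 1)   # fair coin: west or east
  pMove    <- min(1, relPop(proposed) / relPop(current))
  position[t] <- if (runif(1) < pMove) proposed else current
}
table(position) / nSteps   # visit frequencies approach 1/28, 2/28, ..., 7/28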
A random walk
The figure shows one possible trajectory of the "random walk" taken by our politician. Each time step on the y-axis is a draw from the proposal distribution, that is, a possible move to a different island. The politician is more likely to jump to islands with larger populations according to the rule, but he can also jump to less populated islands, probabilistically.
A random walk
This figure shows a histogram of the frequencies with which each position is visited during the sequence. Note that the sampled relative frequencies closely follow the actual relative populations of the islands we saw a couple of slides ago. The beauty of the Metropolis algorithm is that a sequence generated like this converges to an arbitrarily close approximation of the actual relative probabilities.
A random walk
This figure shows the probability of being in each position as a function of time, computed over many random walks. Note that at each time step the politician can move only one position left or right.
Summary of our random walk
We start at our current position. We flip a coin to propose moving left or right, and we then decide whether to accept this move. If the target distribution is greater at the proposed position than at the current one, we definitely accept the move (we always move higher, if possible). If the target distribution is less at the proposed position, we move there with probability p(move) = P(θ_proposed)/P(θ_current). Having proposed a move, we draw from a uniform distribution between 0 and 1; if that number is less than p(move), we move.
Why do we care?
Notice what we must be able to do in the random-walk process: (1) generate a random value from the proposal distribution, (2) evaluate the target distribution at any proposed position, and (3) generate a random value from a uniform distribution. Being able to do these three things lets us indirectly generate samples from the target distribution. Moreover, we can do all three without having the target distribution normalized (we don't have to know the total population); we just need the ratio of the islands' populations. This method obviates direct computation of the evidence p(D), which was that pesky integral in the denominator of Bayes' rule.
The Metropolis algorithm more generally
In the previous example we considered the simple case of (1) discrete positions (islands), (2) on a single dimension (the chain of islands), (3) with moves that go just one position left or right. The general algorithm applies to (1) continuous values, (2) on any number of dimensions, (3) with more general proposal distributions.
The Metropolis algorithm more generally
The essentials of the general method are the same as in the simple case. We have some target distribution, P(θ), over a multidimensional continuous parameter space, from which we want to generate sample values, and we must be able to compute the value of P(θ) for any candidate value. Sample values from the target distribution are generated by taking a random walk through the parameter space. Instead of a 50/50 discrete move left or right, we use a proposal distribution, such as a normal distribution centered at the current value.
The Metropolis algorithm more generally
The idea behind using a normal distribution as the proposal distribution is that the proposed move will typically be near the current position, with the probability of proposing a more distant position dropping off according to the normal curve. Computer programs such as R have built-in functions for generating pseudorandom values from a normal distribution: proposedJump = rnorm(1, mean = 0, sd = 0.2). The first argument indicates that we want 1 random value, rather than a vector.
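As a quick check of what this proposal distribution implies (a sketch; the 0.2 value matches the snippet above):

# With sd = 0.2, roughly 95% of proposed jumps fall within +/- 0.4
# (two standard deviations) of the current position.
jumps <- rnorm(10000, mean = 0, sd = 0.2)
mean(abs(jumps) < 0.4)   # approximately 0.95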
The likelihood function
We skipped Chapter 6 of the Kruschke text, which covers Bayesian inference of a binomial probability via exact mathematical analysis; this is the same problem we are solving with MCMC. One important section of Chapter 6 concerns the likelihood function, which must be specified in order to use Bayes' rule. In the case of estimating the bias of a coin, we use a beta prior distribution and a Bernoulli distribution as the likelihood function, which yields a posterior distribution that is also a beta distribution.
The likelihood function
In the Bernoulli distribution the outcome of a single flip is denoted γ and can take on values of 0 or 1. The probability of each outcome (0 or 1) is given as a function of the parameter θ: p(γ|θ) = θ^γ (1-θ)^(1-γ). This equation reduces to p(γ=1|θ) = θ, so the parameter θ can be interpreted as the underlying probability of heads for a coin flip. You can think of this equation with the data value γ fixed by an observation and the value of θ as variable.
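A one-line R sketch of this equation (the function name is our own):

# Sketch: probability of a single flip outcome gamma (0 or 1) given theta.
bern <- function(gamma, theta) theta^gamma * (1 - theta)^(1 - gamma)
bern(1, 0.7)   # 0.7: probability of heads when theta = 0.7
bern(0, 0.7)   # 0.3: probability of tails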
The likelihood function
Different values of θ yield different probabilities of the datum γ. Thought of this way, as a function of the parameter given fixed data, the equation on the previous slide is the likelihood function of θ. Note that the likelihood function is a function of the continuous value θ, whereas the Bernoulli distribution is a discrete distribution over the two possible values of γ. In Bayesian inference, the function p(γ|θ) is usually thought of with the data, γ, known and fixed, and the parameter, θ, uncertain and variable.
The likelihood function
In most cases we are interested in inferring the probability of a set of outcomes rather than a single outcome. In this case we use the general form, the Bernoulli likelihood function. If the outcomes are independent, and we denote the number of heads in a set of flips {γ_i} as z and the number of tails as N-z, then the likelihood function is: p({γ_i}|θ) = θ^z (1-θ)^(N-z). This general form is useful for applying Bayes' rule to large data sets, and it is what we will use to compute the probability of jumping to a different candidate parameter value in MCMC.
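A minimal R sketch of this likelihood (again, the function name is our own):

# Sketch: the Bernoulli likelihood of theta given z heads in N flips.
bernLik <- function(theta, z, N) theta^z * (1 - theta)^(N - z)
bernLik(0.5, z = 7, N = 10)   # likelihood of theta = 0.5
bernLik(0.7, z = 7, N = 10)   # larger: theta = 0.7 fits these data better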
The likelihood function
Likelihood functions are important for both Bayesian and traditional analyses. Often the natural log of the likelihood is more useful; multiplying the log-likelihood by -2 gives the deviance, which is used in likelihood-ratio tests. Multiplying the log-likelihood by any negative number, such as -1, allows function-minimization routines to search for the single best parameter set. Bayesian analysis seeks the distribution of θ, whereas traditional, frequentist statistics seek single best point estimates of the parameters (which may be less informative than a distribution).
The likelihood function
For a set of 10 coin flips, 7 of which were heads, the original slide plots the likelihood, log-likelihood, and deviance (G²) values across possible values of θ; the sketch below reproduces that computation.
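A base-R sketch (the grid of θ values is an arbitrary choice):

# Sketch: likelihood, log-likelihood, and deviance for z = 7 heads
# in N = 10 flips, over a grid of theta values.
theta  <- seq(0.1, 0.9, by = 0.1)
z <- 7; N <- 10
lik    <- theta^z * (1 - theta)^(N - z)
loglik <- log(lik)
dev    <- -2 * loglik            # deviance (G^2)
round(cbind(theta, lik, loglik, dev), 4)
theta[which.max(lik)]            # the likelihood peaks at theta = 0.7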
A description of credibilities: The Beta distribution
It is important to note that the likelihood function is not a probability distribution: its values do not sum or integrate to one, as a probability distribution's must. For our prior and posterior distributions we do need probability distributions, so that we can describe both the prior and posterior probabilities of each candidate parameter value, given the data. We could use any probability density function, but there are two desiderata for mathematical tractability.
A description of credibilities: The Beta distribution
First, we would like the product of the likelihood and the prior, p(γ|θ) * p(θ), the numerator of Bayes' rule, to result in a function of the same form as the prior, p(θ). Second, we would like the denominator of Bayes' rule, the integral, to be solvable analytically. The conjugate prior of the Bernoulli likelihood function is the beta distribution: when you multiply the Bernoulli likelihood function by its conjugate prior, the beta distribution, the result is a posterior distribution that is also a beta distribution.
A description of credibilities: The Beta distribution
The beta distribution is a probability density function with two parameters, a and b, which can be thought of as counts of the two possible outcomes. The density is defined as: p(θ|a,b) = θ^(a-1) (1-θ)^(b-1) / B(a,b), where B(a,b) is simply a normalizing constant that ensures the area under the beta density integrates to 1, as all probability density functions must. In a few slides we will use the prior beta(θ|1,1), where each "1" is the default prior value for a and b. This prior acts as if one outcome of each kind had been observed, making the two possibilities equiprobable.
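A short base-R sketch of this density (dbeta() and beta() are built-in; the θ value is arbitrary):

# Sketch: evaluating the beta density by the formula and via R's dbeta().
a <- 1; b <- 1
theta <- 0.6
theta^(a - 1) * (1 - theta)^(b - 1) / beta(a, b)  # density from the formula
dbeta(theta, a, b)                                # same value from R
curve(dbeta(x, 1, 1), from = 0, to = 1)           # beta(1,1) is flat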
Metropolis algorithm applied to Bernoulli likelihood and Beta prior
In the island-hopping politician scenario, the islands represented candidate parameter values and the relative populations represented relative posterior probabilities. In the coin-flipping scenario, the parameter θ takes values on a continuum from zero to one, and the relative posterior probability is computed as the likelihood times the prior probability.
Metropolis algorithm applied to Bernoulli likelihood and Beta prior
To apply the Metropolis algorithm, we can think of the parameter dimension as a dense chain of infinitesimally close islands, and the relative population of each island as its relative posterior probability density. Instead of the proposed jump being only to immediately adjacent islands, the proposed jump can be to islands farther away. We will use the familiar normal distribution to accomplish this, because it lets us visit any particular island (parameter value) on the continuum.
Metropolis algorithm applied to Bernoulli likelihood and Beta prior
We flip a coin N times and observe z heads. We use the Bernoulli likelihood function p(z,N|θ) = θ^z (1-θ)^(N-z) and start with a beta prior, p(θ) = beta(θ|a,b). We denote the proposed jump as Δθ ~ normal(μ=0, σ), where "~" means that a value is randomly sampled from the distribution. Thus θ_proposed = θ_current + Δθ: the proposed position is usually close to the current position, deviating from it by Δθ. (Δ is commonly used to represent a change in some quantity.)
Metropolis algorithm applied to Bernoulli likelihood and Beta prior
The Metropolis algorithm proceeds as follows. Start at an arbitrary initial value of θ; this is the current value. Generate a proposed jump by adding Δθ to the current value. Compute the probability of moving as the ratio of likelihood times prior at the proposed value to likelihood times prior at the current value (capped at 1). Accept the proposed parameter value if a random value sampled from a [0,1] uniform distribution is less than p(move); otherwise stay at the current value. Repeat these steps many times (usually at least 10K) until it is judged that a representative sample has been generated. A sketch of the whole procedure follows.
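Below is a minimal Metropolis sketch in R, assuming the settings used on the next slide: a beta(θ|1,1) prior, N = 20, z = 14, proposal σ = 0.2, and an arbitrary start at 0.01.

# Sketch: Metropolis sampling of the posterior for a coin's bias.
set.seed(47)                        # arbitrary seed, for reproducibility
z <- 14; N <- 20                    # observed flips: 14 heads in 20
a <- 1; b <- 1                      # beta prior parameters
targetRelProb <- function(theta) {
  # likelihood * prior; zero outside [0,1], so such proposals are rejected
  if (theta < 0 || theta > 1) return(0)
  theta^z * (1 - theta)^(N - z) * dbeta(theta, a, b)
}
nSteps <- 50000
chain <- numeric(nSteps)
chain[1] <- 0.01                    # arbitrary starting value
for (t in 2:nSteps) {
  current  <- chain[t - 1]
  proposed <- current + rnorm(1, mean = 0, sd = 0.2)
  pMove <- min(1, targetRelProb(proposed) / targetRelProb(current))
  chain[t] <- if (runif(1) < pMove) proposed else current
}
mean(chain)                         # posterior mean, near 14/20 = 0.7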
Metropolis algorithm applied to Bernoulli likelihood and Beta prior
The figure in the original slides shows the Metropolis algorithm applied to a case with a beta(θ|1,1) prior, N=20, and z=14. From left to right, the σ of the proposal distribution is 0.02, 0.2, and 2. In all cases the chain was arbitrarily started at 0.01.
Metropolis algorithm applied to Bernoulli likelihood and Beta prior
The middle column uses a moderately sized SD for the proposal distribution. The middle row shows the proportion of accepted jumps, which is about half when the SD is 0.2. This setting allows the chain to move around fairly efficiently.
Metropolis algorithm applied to Bernoulli likelihood and Beta prior
The upper row shows a histogram of the chain positions after 50K steps. The effective size of the chain is less than the number of steps because it takes into account clumpiness (autocorrelation) in the chain. The middle column, with a moderate σ, is clearly the most efficient.
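These diagnostics can be checked on the chain from the sketch above; effectiveSize() comes from the coda package (an assumption: install it if needed).

# Sketch: diagnostics for `chain` from the Metropolis sketch above.
mean(diff(chain) != 0)          # proportion of accepted proposals
library(coda)                   # assumed installed
effectiveSize(as.mcmc(chain))   # effective size, < nSteps when clumpy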
More on MCMC
The main benefit of MCMC algorithms like Metropolis is that we can estimate the posterior distribution for complex models without solving the integral in Bayes' rule. The Kruschke text (Chapter 7) offers much more on the mechanics of MCMC, discussing why it works as well as other MCMC algorithms such as Gibbs sampling. Chapter 8 covers JAGS (Just Another Gibbs Sampler), and Chapter 14 covers Stan, probably the most cutting-edge Bayesian sampler, which uses Hamiltonian Monte Carlo.