# Bayesian inference: Gil McVean, Department of Statistics, Monday 17th November 2008



Questions to ask… What is likelihood-based inference? What is Bayesian inference and why is it different? How do you estimate parameters in a Bayesian framework? How do you choose a suitable prior? How do you compare models in Bayesian inference?

A recap on likelihood. For any model, the maximum information about the model parameters is obtained by considering the likelihood function. The likelihood function is proportional to the probability of observing the data given a specified parameter value. The likelihood principle states that all information about the parameters of interest is contained in the likelihood function.

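An example. Suppose we have data generated from a Poisson distribution and we want to estimate the parameter $\lambda$ of the distribution. The probability of observing a particular value $x$ of the random variable is

$$P(X = x \mid \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$

If we have observed a series of iid Poisson RVs $x_1, \dots, x_n$, we obtain the joint likelihood by multiplying the individual probabilities together:

$$L(\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda} \lambda^{x_i}}{x_i!}$$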
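As a minimal sketch of the joint Poisson likelihood described above, the function below sums the log of each individual probability (working on the log scale avoids numerical underflow). The function name `poisson_log_likelihood` and the example counts are illustrative assumptions, not part of the original slides.

```python
import math

def poisson_log_likelihood(lam, counts):
    """Log of the joint likelihood of iid Poisson counts at rate lam."""
    # Each term is log(exp(-lam) * lam**x / x!), with log(x!) = lgamma(x + 1).
    return sum(-lam + x * math.log(lam) - math.lgamma(x + 1) for x in counts)

# Hypothetical observations, used only to exercise the function.
example_counts = [3, 0, 2, 4]

for lam in (1.0, 2.25, 4.0):
    print(lam, round(poisson_log_likelihood(lam, example_counts), 3))
```

The log-likelihood is largest at the sample mean of the hypothetical counts (2.25), which is the maximum likelihood estimate for a Poisson rate.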

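Relative likelihood. We can compare the evidence for different parameter values through their relative likelihood. For example, suppose we observe counts of 12, 22, 14 and 8 from a Poisson process. The maximum likelihood estimate of the Poisson parameter $\lambda$ is the sample mean, $\hat{\lambda} = 14$. The relative likelihood is given by

$$\frac{L(\lambda)}{L(\hat{\lambda})} = e^{-n(\lambda - \hat{\lambda})} \left(\frac{\lambda}{\hat{\lambda}}\right)^{\sum_i x_i} = e^{-4(\lambda - 14)} \left(\frac{\lambda}{14}\right)^{56}$$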
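The relative likelihood for the four counts quoted above can be evaluated directly. This is a small sketch; the function name `relative_likelihood` and the particular rates printed are illustrative choices.

```python
import math

counts = [12, 22, 14, 8]           # counts quoted in the slide
n, total = len(counts), sum(counts)
mle = total / n                    # sample mean, the Poisson mle

def relative_likelihood(lam):
    """L(lam) / L(mle) for iid Poisson counts; the x! terms cancel."""
    return math.exp(-n * (lam - mle) + total * math.log(lam / mle))

for lam in (10, 12, 14, 16, 18):
    print(lam, round(relative_likelihood(lam), 4))
```

The ratio equals 1 at the mle and falls away on either side, so it gives a direct measure of how much less well supported another rate is.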

Maximum likelihood estimation. The maximum likelihood estimate (mle) is the set of parameter values that maximise the probability of observing the data we got. The mle is consistent in that it converges to the truth as the sample size gets infinitely large. The mle is asymptotically efficient in that it achieves the minimum possible variance (the Cramér-Rao lower bound) as n→∞. However, the mle is often biased for finite sample sizes. For example, the mle for the variance parameter in a normal distribution is the sample variance computed with divisor n rather than n−1, which underestimates the true variance on average.
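The finite-sample bias of the variance mle can be illustrated by simulation. The sketch below assumes standard normal data, a sample size of 5 and 20,000 replicates; all three are arbitrary choices made for illustration.

```python
import random

random.seed(1)                      # fixed seed so the run is repeatable
n, reps = 5, 20000                  # assumed sample size and replicate count

mle_variances = []
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]   # true variance is 1
    mean = sum(xs) / n
    # The mle divides the sum of squares by n, not n - 1.
    mle_variances.append(sum((x - mean) ** 2 for x in xs) / n)

average_mle = sum(mle_variances) / reps
print(round(average_mle, 3), "expected value of the mle:", (n - 1) / n)
```

The average of the mle sits close to (n − 1)/n rather than 1, and the gap shrinks as n grows, which is consistent with the estimator being biased but consistent.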

Confidence intervals and likelihood. Thanks to the CLT, there is another useful result that allows us to define confidence intervals from the log-likelihood surface. Specifically, for a single parameter, the set of parameter values for which the log-likelihood is no more than 1.92 below its maximum defines an approximate 95% confidence interval. This follows because, in the limit of large sample size, the likelihood ratio test statistic is approximately chi-squared distributed under the null (1.92 is half of 3.84, the 95% point of the chi-squared distribution with one degree of freedom). This is a very useful result, but it shouldn't be assumed to hold: check with simulation.
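The 1.92 rule can be applied numerically to the Poisson counts of 12, 22, 14 and 8 used earlier. The sketch below scans a grid of candidate rates; the grid range and spacing are assumptions chosen for convenience.

```python
import math

counts = [12, 22, 14, 8]
n, total = len(counts), sum(counts)
mle = total / n

def log_lik(lam):
    """Poisson log-likelihood up to an additive constant."""
    return -n * lam + total * math.log(lam)

cutoff = log_lik(mle) - 1.92
grid = [i / 1000 for i in range(5000, 30001)]        # rates 5.000 to 30.000
inside = [lam for lam in grid if log_lik(lam) >= cutoff]
lower, upper = inside[0], inside[-1]
print("approximate 95% interval:", lower, upper)
```

The resulting interval is not symmetric about the mle, because the log-likelihood surface itself is not symmetric for a sample this small.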

Likelihood ratio tests. Suppose we have two models, H0 and H1, in which H0 is a special case of H1. We can compare the likelihoods at the MLEs for the two models; note that the maximised likelihood under H1 can be no worse than under H0. Theory shows that if H0 is true, then twice the difference in log-likelihood is asymptotically χ² distributed, with degrees of freedom equal to the difference in the number of parameters between H0 and H1. This is the likelihood ratio test. Theory also tells us that if H1 is true, then the likelihood ratio test is the most powerful test for discriminating between H0 and H1 (strictly, this holds for two fully specified hypotheses, by the Neyman-Pearson lemma). This is useful, though perhaps not as useful as it sounds.
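A likelihood ratio test can be sketched for the earlier Poisson counts. The null value of 10 for the rate is a hypothetical choice made only for illustration; H1 leaves the rate free, so the models differ by one parameter.

```python
import math

counts = [12, 22, 14, 8]
n, total = len(counts), sum(counts)

def log_lik(lam):
    """Poisson log-likelihood up to an additive constant."""
    return -n * lam + total * math.log(lam)

lam_null = 10.0                     # hypothetical H0 value (an assumption)
lam_hat = total / n                 # mle under H1

lrt_stat = 2 * (log_lik(lam_hat) - log_lik(lam_null))
# Upper tail of a chi-squared distribution with 1 degree of freedom:
# P(Z**2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
p_value = math.erfc(math.sqrt(lrt_stat / 2))
print(round(lrt_stat, 3), round(p_value, 4))
```

The chi-squared approximation is asymptotic, so with only four observations the p-value is best treated as a rough guide.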

Criticisms of the frequentist approach. The choice between models using P-values is focused on rejecting the null rather than proving the appropriateness of the alternative. Representing uncertainty through the use of confidence intervals is messy and unintuitive: we cannot say that the probability of the true parameter being within the interval is 0.95. The frequentist approach requires a predefined experimental approach that must be followed through to completion (at which point the data are analysed), whereas Bayesian inference naturally adapts to interim analysis, changes in stopping rules and combining data from different sources. Focusing on point estimation leads to models that are ‘over-fitted’ to the data.

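Bayesian estimators. Bayesian statistics aims to make statements about the probability attached to different parameter values given the data you have collected. It makes use of Bayes’ theorem:

$$P(\theta \mid D) = \frac{P(\theta)\, P(D \mid \theta)}{P(D)}$$

Here $P(\theta)$ is the prior, $P(D \mid \theta)$ is the likelihood, $P(\theta \mid D)$ is the posterior and $P(D)$ is the normalising constant.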
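For a parameter restricted to a handful of candidate values, Bayes' theorem reduces to multiplying prior by likelihood and renormalising. The sketch below assumes three hypothetical candidate values and prior weights; none of these numbers come from the slides.

```python
def posterior(prior, likelihood):
    """Posterior weights over a discrete set of parameter values."""
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalised)            # the normalising constant
    return [u / evidence for u in unnormalised]

thetas = [0.25, 0.5, 0.75]                  # hypothetical candidate values
prior = [0.25, 0.5, 0.25]                   # hypothetical prior weights
likelihood = [t ** 2 for t in thetas]       # two successes in two trials

post = posterior(prior, likelihood)
for t, w in zip(thetas, post):
    print(t, round(w, 4))
```

The data move weight towards the larger candidate values, but the prior keeps a substantial share on 0.5.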

Are parameters random variables? The single most important conceptual difference between Bayesian statistics and frequentist statistics is the notion that the parameters you are interested in are themselves random variables. This notion is encapsulated in the use of a subjective prior for your parameters. Remember that to construct a confidence interval we have to define the set of possible parameter values. A prior does the same thing, but also gives a weight to different values.

Example: coin tossing. I toss a coin twice and observe two heads. I want to perform inference about the probability of obtaining a head on a single throw for the coin in question. The MLE of the probability is 1.0, yet I have a very strong prior belief that the answer is 0.5. Bayesian statistics forces the researcher to be explicit about prior beliefs but, in return, can be very specific about what information has been gained by performing the experiment. It also provides a natural way of combining data from different experiments.
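One common way to make the coin-tossing example concrete is a beta prior, which combines with binomial data to give another beta distribution. The slides do not specify a prior for this example, so both priors below (a uniform Beta(1, 1) and a Beta(50, 50) concentrated near 0.5) are assumptions for illustration.

```python
def beta_update(a, b, heads, tails):
    """Beta(a, b) prior plus binomial data gives Beta(a + heads, b + tails)."""
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

priors = {"uniform": (1, 1), "strong belief in 0.5": (50, 50)}
for name, (a, b) in priors.items():
    a_post, b_post = beta_update(a, b, heads=2, tails=0)
    print(name, round(beta_mean(a_post, b_post), 3))
```

Under the uniform prior the posterior mean is 0.75, while under the strong prior it stays at about 0.51: two tosses carry very little information against a firm prior belief.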

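The posterior. Bayesian inference about parameters is contained in the posterior distribution. The posterior can be summarised in various ways, for example by the posterior mean or a credible interval.

[Figure: prior and posterior densities, with the posterior mean and credible interval marked]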

Choosing priors. A prior reflects your belief before the experiment. This might be relatively unfocused, for example a uniform distribution in the case of a single parameter, or a Jeffreys prior (or another ‘uninformative’ prior). Or it might be highly focused: in the coin-tossing experiment, most of my prior would be on P = 0.5; in an association study, my prior on a SNP being causal might be $1/10^{7}$.

Using posteriors. Posterior summaries provide statements about point estimates and certainty. Posterior prediction makes statements about future events. Posterior predictive simulation checks the fit of the model to data.

Bayes factors. Bayes factors can be used to compare the evidence for different models, which do not need to be nested. Bayes factors generalise the likelihood ratio by integrating the likelihood over the prior. Importantly, if model 2 is a subset of model 1, it does not follow that the Bayes factor in favour of model 1 is necessarily greater than 1: the subspace of model 1 that improves the likelihood may be very small, and the extra parameters carry an extra cost. It is generally accepted that a BF of 3 is worth a mention, a BF of 10 is strong evidence and a BF of 100 is decisive (Jeffreys).
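A Bayes factor can be worked through for coin tossing by comparing a model with the head probability fixed at 0.5 against one with a free probability. The uniform prior on the free probability is an assumption; under it, the binomial likelihood integrates to 1/(n + 1) for any number of heads.

```python
from math import comb

def marginal_fixed(k, n, p=0.5):
    """Probability of k heads in n tosses when p is fixed."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def marginal_uniform(k, n):
    """Binomial likelihood integrated over a Uniform(0, 1) prior on p."""
    return 1 / (n + 1)

def bayes_factor(k, n):
    """Evidence for the free-p model relative to the fixed p = 0.5 model."""
    return marginal_uniform(k, n) / marginal_fixed(k, n)

print(bayes_factor(2, 2))     # two heads in two tosses
print(bayes_factor(5, 10))    # five heads in ten tosses
```

Two heads in two tosses give a Bayes factor of 4/3, which is very weak evidence for the free-p model. Five heads in ten tosses give about 0.37, so the simpler model is favoured even though it is nested within the more flexible one.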

Example. Consider the crossing data of Bateson and Punnett, from which we want to estimate the recombination fraction. I will use a beta prior for the recombination fraction with parameters 3 and 7.

Bateson and Punnett experiment

| Phenotype and genotype | Observed | Expected from 9:3:3:1 ratio |
| --- | --- | --- |
| Purple, long (P_L_) | 284 | 216 |
| Purple, round (P_ll) | 21 | 72 |
| Red, long (ppL_) | 21 | 72 |
| Red, round (ppll) | 55 | 24 |

Conditional on the total sample size (381), the likelihood function is described by the multinomial distribution. We get the following posterior distribution.

[Figure: posterior distribution of the recombination fraction]

Comparing the model to one in which r = 0.5 gives a BF of 3.9. Posterior mean = 0.134; posterior mode = 0.13; 95% equal-tailed probability interval (ETPI) = 0.10–0.16.
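The posterior for the recombination fraction can be approximated on a grid. The transcript does not reproduce the slide's likelihood formula, so the cell probabilities below are an assumption: the standard coupling-phase intercross model, in which θ = (1 − r)² and the four phenotype classes have probabilities (2 + θ)/4, (1 − θ)/4, (1 − θ)/4 and θ/4. The grid spacing is also an arbitrary choice.

```python
import math

observed = {"P_L_": 284, "P_ll": 21, "ppL_": 21, "ppll": 55}

def log_posterior(r, a=3.0, b=7.0):
    """Unnormalised log posterior for r under a Beta(a, b) prior."""
    theta = (1.0 - r) ** 2           # assumed coupling-phase intercross model
    probs = {"P_L_": (2 + theta) / 4, "P_ll": (1 - theta) / 4,
             "ppL_": (1 - theta) / 4, "ppll": theta / 4}
    log_lik = sum(observed[k] * math.log(probs[k]) for k in observed)
    log_prior = (a - 1) * math.log(r) + (b - 1) * math.log(1 - r)
    return log_lik + log_prior

step = 0.0005
grid = [i * step for i in range(1, 2000)]          # r strictly inside (0, 1)
logs = [log_posterior(r) for r in grid]
peak = max(logs)
weights = [math.exp(v - peak) for v in logs]       # subtract peak for stability
norm = sum(weights)
post = [w / norm for w in weights]

post_mean = sum(r * p for r, p in zip(grid, post))
post_mode = grid[weights.index(max(weights))]

def quantile(q):
    """Smallest grid value whose cumulative posterior mass reaches q."""
    acc = 0.0
    for r, p in zip(grid, post):
        acc += p
        if acc >= q:
            return r
    return grid[-1]

interval = (quantile(0.025), quantile(0.975))      # equal-tailed interval
print(round(post_mean, 3), round(post_mode, 3), interval)
```

If the assumed model matches the one used in the slides, the summaries should land near the reported posterior mean, mode and interval; any difference would point to a different parameterisation of the likelihood.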

Bayesian inference and the notion of shrinkage. The notion of shrinkage is that you can obtain better estimates by assuming a certain degree of similarity among the things you want to estimate, and a lack of complexity. Practically, this means three things: borrowing information across observations, penalising inferences that are very different from anything else, and penalising more complex models. The notion of shrinkage is implicit in the use of priors in Bayesian statistics. There are also forms of frequentist inference where shrinkage is used, but NOT the MLE.
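A simple worked case shows how a prior produces shrinkage. Assume, for illustration only, a Beta(a, b) prior on a probability p and k successes in n trials; the symbols a, b, k, n and w are introduced here and are not from the slides. The posterior mean is then a weighted average of the prior mean and the mle:

```latex
\mathbb{E}[p \mid k]
  = \frac{a + k}{a + b + n}
  = w \cdot \frac{a}{a + b} + (1 - w) \cdot \frac{k}{n},
\qquad
w = \frac{a + b}{a + b + n}
```

The weight w on the prior mean falls towards zero as n grows, so the estimate is pulled towards the prior when data are scarce and approaches the mle when data are plentiful.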
