Bayesian inference
“Very much lies in the posterior distribution”
Bayesian definition of sufficiency: A statistic $T(x_1, \dots, x_n)$ is sufficient for $\theta$ if the posterior distribution of $\theta$ given the sample observations $x_1, \dots, x_n$ is the same as the posterior distribution of $\theta$ given $T$, i.e. $q(\theta \mid x_1, \dots, x_n) = q(\theta \mid T(x_1, \dots, x_n))$, where $q$ denotes the posterior density. The Bayesian definition is equivalent to the original definition (Theorem 7.1).
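Credible regions
Let $\Theta$ be the parameter space for $\theta$ and let $S_{\mathbf{x}}$ be a subset of $\Theta$. If $\int_{S_{\mathbf{x}}} q(\theta \mid \mathbf{x})\, d\theta = 1 - \alpha$, where $q(\theta \mid \mathbf{x})$ is the posterior density, then $S_{\mathbf{x}}$ is a $1 - \alpha$ credible region for $\theta$. For a scalar $\theta$ we refer to it as a credible interval.
The important difference compared to confidence regions: A credible region is a fixed region (conditional on the sample) to which the random variable $\theta$ belongs with probability $1 - \alpha$. A confidence region is a random region that covers the fixed $\theta$ with probability $1 - \alpha$.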
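Highest posterior density (HPD) regions
A $1 - \alpha$ credible region $S_{\mathbf{x}}$ is an HPD region if the posterior density is at least as high at every point inside the region as at any point outside it, i.e. $q(\theta_1 \mid \mathbf{x}) \geq q(\theta_2 \mid \mathbf{x})$ for all $\theta_1 \in S_{\mathbf{x}}$ and $\theta_2 \notin S_{\mathbf{x}}$.
Equal-tailed credible intervals
An equal-tailed $1 - \alpha$ credible interval is a $1 - \alpha$ credible interval $(a, b)$ such that $\Pr(\theta < a \mid \mathbf{x}) = \Pr(\theta > b \mid \mathbf{x}) = \alpha / 2$.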
Example: From a consignment of 5000 pills suspected to contain the drug Ecstasy, 10 pills are sampled for chemical analysis. Let $\theta$ be the unknown proportion of Ecstasy pills in the consignment and assume we have a high prior belief that this proportion is 100%. Such a high prior belief can be modelled with a Beta($\alpha$, $\beta$) density where $\beta$ is set to 1 and $\alpha$ is set to a fairly high value, say 20, i.e. Beta(20, 1).
Now suppose that, after chemical analysis, all 10 pills were shown to contain Ecstasy. The posterior density for $\theta$ is Beta($\alpha + x$, $\beta + n - x$) (conjugate prior with a binomial sampling distribution, as the population is large), i.e. Beta(20 + 10, 1 + 10 - 10) = Beta(30, 1). Then a lower-tail 99% credible interval $(a, 1)$ for $\theta$ satisfies $\int_a^1 30\,\theta^{29}\, d\theta = 1 - a^{30} = 0.99$, so $a = 0.01^{1/30} \approx 0.858$. Thus with 99% certainty we can state that at least 85.8% of the pills in the consignment contain Ecstasy.
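The bound in this example can be checked numerically. The following is a minimal sketch that relies on the fact that a Beta($\alpha$, 1) distribution has CDF $t^{\alpha}$ on $[0, 1]$; the function name `beta_a1_quantile` is introduced here for illustration only, and the equal-tailed interval is added purely for comparison.

```python
# Posterior from the example: Beta(30, 1), whose CDF is F(t) = t**30 on [0, 1].
ALPHA_POST = 30  # posterior first shape parameter (20 + 10)


def beta_a1_quantile(p, alpha):
    """Quantile of a Beta(alpha, 1) distribution: solve t**alpha = p for t."""
    return p ** (1.0 / alpha)


# One-sided interval (a, 1) with posterior probability 0.99: F(a) = 0.01.
lower_bound = beta_a1_quantile(0.01, ALPHA_POST)

# Equal-tailed 99% interval for comparison: 0.5% posterior mass in each tail.
equal_tailed = (beta_a1_quantile(0.005, ALPHA_POST),
                beta_a1_quantile(0.995, ALPHA_POST))

print(f"one-sided 99% lower bound: {lower_bound:.3f}")
print(f"equal-tailed 99% interval: ({equal_tailed[0]:.3f}, {equal_tailed[1]:.3f})")
```

Because the Beta(30, 1) density is increasing on (0, 1), the one-sided interval $(a, 1)$ is also the 99% HPD interval in this case.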
Comments: We have used the binomial distribution for the sample. More correct would be to use the hypergeometric distribution, but the binomial is a good approximation. For a smaller consignment (i.e. population) we can benefit from the result that the posterior for the number of pills containing Ecstasy in the rest of the consignment, after removing the sample, is beta-binomial. This would, however, give similar results.
If a sample consists 100% of one kind, how would a confidence interval for $\theta$ be obtained?
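The beta-binomial remark can be illustrated with a short calculation. This is a sketch under the assumption that the number $Y$ of Ecstasy pills among the 4990 uninspected pills is beta-binomial with parameters (4990, 30, 1), i.e. based on the Beta(30, 1) posterior from the example; the helper names are hypothetical.

```python
from math import exp, lgamma

N_TOTAL, N_SAMPLE = 5000, 10      # consignment and sample sizes from the example
A_POST, B_POST = 30.0, 1.0        # posterior Beta(30, 1) from the example
M_REST = N_TOTAL - N_SAMPLE       # pills not inspected


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(y, m, a, b):
    """Pr(Y = y) for Y ~ beta-binomial(m, a, b), computed on the log scale."""
    log_choose = lgamma(m + 1) - lgamma(y + 1) - lgamma(m - y + 1)
    return exp(log_choose + log_beta(y + a, m - y + b) - log_beta(a, b))


def lower_bound_count(m, a, b, level=0.99):
    """Largest y0 such that Pr(Y >= y0) >= level."""
    upper_tail = 0.0
    for y in range(m, -1, -1):
        upper_tail += beta_binomial_pmf(y, m, a, b)
        if upper_tail >= level:
            return y
    return 0


y0 = lower_bound_count(M_REST, A_POST, B_POST)
share = (N_SAMPLE + y0) / N_TOTAL  # the 10 sampled pills all contained Ecstasy
print(f"at least {y0} of the remaining {M_REST} pills; "
      f"share of whole consignment >= {share:.3f}")
```

The resulting share can be compared with the 85.8% obtained from the binomial approximation.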
Bayesian hypothesis testing
The issue is to test $H_0$ vs. $H_1$. Without specifying the hypotheses further, we seek to judge which of the two hypotheses is the more probable, conditional on the sample. Note the difference compared to the classical approach: there we seek to reject $H_0$ in favour of $H_1$ and never the opposite.
The aim of Bayesian hypothesis testing is to determine the posterior “odds” $Q^* = \Pr(H_0 \mid \mathbf{x}) / \Pr(H_1 \mid \mathbf{x})$. With $Q^* > 1$ we then say that, conditional on the sample, $H_0$ is $Q^*$ times more probable than $H_1$, and with $Q^* < 1$ the expression is reversed.
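Some review of probability theory: For two events $A$ and $B$ from a random experiment we have $\frac{\Pr(A \mid B)}{\Pr(\bar{A} \mid B)} = \frac{\Pr(B \mid A)}{\Pr(B \mid \bar{A})} \cdot \frac{\Pr(A)}{\Pr(\bar{A})}$, where $\bar{A}$ is the complement of $A$. This result is usually referred to as Bayes' theorem on odds form.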
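For completeness, the odds form follows from applying the ordinary Bayes' theorem to $A$ and to its complement $\bar{A}$ and dividing, so that $\Pr(B)$ cancels. A short derivation:

```latex
\begin{align*}
\Pr(A \mid B) &= \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}, &
\Pr(\bar{A} \mid B) &= \frac{\Pr(B \mid \bar{A})\,\Pr(\bar{A})}{\Pr(B)}, \\
\frac{\Pr(A \mid B)}{\Pr(\bar{A} \mid B)} &=
\frac{\Pr(B \mid A)}{\Pr(B \mid \bar{A})} \cdot \frac{\Pr(A)}{\Pr(\bar{A})}.
\end{align*}
```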
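Now, it is possible to replace $A$ with $H_0$, $\bar{A}$ with $H_1$ and $B$ with $\mathbf{x}$ (the sample): $Q^* = \frac{\Pr(H_0 \mid \mathbf{x})}{\Pr(H_1 \mid \mathbf{x})} = \frac{f(\mathbf{x} \mid H_0)}{f(\mathbf{x} \mid H_1)} \cdot \frac{\Pr(H_0)}{\Pr(H_1)} = B \cdot Q$, where $B$ is the Bayes factor and $Q$ is the prior “odds”. Note that $f(\mathbf{x} \mid H) = L(H; \mathbf{x})$, where $L(H; \mathbf{x})$ is the likelihood of $H$, so that $B = L(H_0; \mathbf{x}) / L(H_1; \mathbf{x})$. The concept of likelihood is not restricted to parameters. Note also that $f(\mathbf{x} \mid H_0)$ need not have the same functional form as $f(\mathbf{x} \mid H_1)$.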
To make the comparison with classical hypothesis testing more transparent, the posterior odds may be transformed to posterior probabilities for each of the two hypotheses: $\Pr(H_0 \mid \mathbf{x}) = \frac{Q^*}{1 + Q^*}$ and $\Pr(H_1 \mid \mathbf{x}) = \frac{1}{1 + Q^*}$. Now, if the posterior probability of $H_1$ is 0.95, this would be a result that could be compared with “$H_0$ is rejected at the 5% level of significance”. However, the two approaches cannot be made equivalent.
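The conversion from odds to probabilities can be sketched in a few lines. The function name and the numerical inputs below are hypothetical; the only assumptions are that the posterior odds are the Bayes factor times the prior odds and that $H_0$ and $H_1$ are the only two possibilities.

```python
def posterior_probabilities(prior_odds, bayes_factor):
    """Return (Pr(H0 | x), Pr(H1 | x)), assuming H0 and H1 are exhaustive."""
    posterior_odds = bayes_factor * prior_odds       # Q* = B * Q
    p_h0 = posterior_odds / (1.0 + posterior_odds)   # Pr(H0 | x) = Q* / (1 + Q*)
    return p_h0, 1.0 - p_h0


# Hypothetical inputs: even prior odds and a Bayes factor of 1/19 against H0.
p_h0, p_h1 = posterior_probabilities(prior_odds=1.0, bayes_factor=1.0 / 19.0)
print(f"Pr(H0 | x) = {p_h0:.2f}, Pr(H1 | x) = {p_h1:.2f}")
```

With these inputs the posterior odds are 1/19, which corresponds to the posterior probability 0.95 for $H_1$ mentioned above.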
Another example from forensic science
Assume a crime has been committed where a blood stain was left at the crime scene. A suspect is identified and a saliva sample is taken from this person. The DNA profiles of the saliva sample and the blood stain are compared and they appear to match (i.e. they are equal). Put
$H_0$: “The blood stain comes from the suspect”
$H_1$: “The blood stain comes from another person than the suspect”
The Bayes factor becomes $B = \frac{\Pr(\text{Matching DNA profiles} \mid H_0)}{\Pr(\text{Matching DNA profiles} \mid H_1)}$. Now, if laboratory mistakes can be ruled out, $\Pr(\text{Matching DNA profiles} \mid H_0) = 1$. How about the probability in the denominator of $B$?
This probability relates to how common the current profile is among people in general. Today's DNA analysis is such that if a full profile is obtained (i.e. if there are no missing DNA markers), the probability is very low, about 1 in 10 million. $B$ therefore becomes very large.
If the suspect was apprehended for reasons not related to the DNA analysis (as it should be), the prior odds $Q = \Pr(H_0) / \Pr(H_1)$ are probably greater than 1. If we calculate with $Q = 1$, then $Q^* = B$, which is thus very large. The DNA analysis very strongly supports $H_0$ being true.
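As a worked illustration, taking the quoted figure of about 1 in 10 million literally and setting $Q = 1$ (both simplifying assumptions), the arithmetic runs as follows:

```latex
\begin{align*}
B &= \frac{\Pr(\text{match} \mid H_0)}{\Pr(\text{match} \mid H_1)}
   \approx \frac{1}{10^{-7}} = 10^{7}, \\
Q^* &= B \cdot Q = 10^{7} \cdot 1 = 10^{7}, \\
\Pr(H_0 \mid \text{match}) &= \frac{Q^*}{1 + Q^*}
   = \frac{10^{7}}{10^{7} + 1} \approx 0.9999999 .
\end{align*}
```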
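Hypotheses expressed in terms of a parameter
For the sake of simplicity we express the parameter $\theta$ as a scalar, but the results also apply to multidimensional parameters.
Case 1: $H_0: \theta = \theta_0$, $H_1: \theta = \theta_1$. The Bayes factor is then the likelihood ratio $B = f(\mathbf{x}; \theta_0) / f(\mathbf{x}; \theta_1)$.
Example: Let $x$ be an observation from Bin($n$, $\theta$) and $H_0: \theta = \theta_0$ vs. $H_1: \theta = \theta_1$. Then $B = \frac{\theta_0^{x} (1 - \theta_0)^{n - x}}{\theta_1^{x} (1 - \theta_1)^{n - x}}$.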
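For two simple hypotheses about a binomial proportion, the Bayes factor is the ratio of the two binomial probabilities, in which the binomial coefficient cancels. The sketch below uses hypothetical values for $n$, $x$, $\theta_0$ and $\theta_1$; none of them come from the slides.

```python
from math import comb


def binomial_pmf(x, n, theta):
    """Pr(X = x) for X ~ Bin(n, theta)."""
    return comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)


def binomial_bayes_factor(x, n, theta0, theta1):
    """B = f(x; theta0) / f(x; theta1); the binomial coefficient cancels."""
    return (theta0 / theta1) ** x * ((1.0 - theta0) / (1.0 - theta1)) ** (n - x)


# Hypothetical data and hypotheses (for illustration only).
n_obs, x_obs = 20, 14
theta_0, theta_1 = 0.5, 0.7

bf = binomial_bayes_factor(x_obs, n_obs, theta_0, theta_1)
print(f"Bayes factor B = {bf:.3f}")  # B < 1 means the data favour H1 over H0
```

Multiplying `bf` by the prior odds $Q$ gives the posterior odds $Q^*$.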
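Case 3: $H_0: \theta = \theta_0$, $H_1: \theta \neq \theta_0$. It can be shown (Theorem 7.3) that in this case $B = \frac{f(\mathbf{x}; \theta_0)}{\int_{\theta \neq \theta_0} f(\mathbf{x}; \theta)\, p(\theta \mid H_1)\, d\theta}$, where $p(\theta \mid H_1)$ is the prior density of $\theta$ under $H_1$. (As $\theta \neq \theta_0$ defines the region for $H_1$, the conditioning on $H_1$ in the textbook seems redundant.)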
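Nuisance parameters and predictive distributions
If the parameters involved are two, $\theta$ and $\varphi$, and the parameter of interest is $\theta$, then $\varphi$ is referred to as a nuisance parameter. The marginal posterior density for $\theta$ is then obtained by integrating out $\varphi$ from the joint posterior density: $q(\theta \mid \mathbf{x}) = \int_{\Phi} q(\theta, \varphi \mid \mathbf{x})\, d\varphi$, where $\Phi$ is the parameter space for $\varphi$.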
Predictive distributions
Let $\mathbf{x}_n = (x_1, \dots, x_n)$ be a random sample from a distribution with p.d.f. $f(x; \theta)$. Suppose we will take a new observation $x_{n+1}$ and would like to make so-called predictive inference about it. In practice this means that we would like to express the uncertainty about it in terms of a prediction interval.
The marginal p.d.f. of $X_{n+1}$ is the same as that of each variable $X_i$ in the sample, i.e. $f(x; \theta)$. However, if we want to make use of the sample we should rather study the simultaneous density of $X_{n+1}$ and $\theta \mid \mathbf{x}_n$. Given $\theta$, $X_{n+1}$ and $\mathbf{X}_n$ are independent by definition, so conditioning on $\mathbf{x}_n$ affects only the distribution of $\theta$ and not the density $f(x_{n+1}; \theta)$ of $X_{n+1}$ given $\theta$.
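The simultaneous density of $X_{n+1}$ and $\theta \mid \mathbf{x}_n$ is $f(x_{n+1}; \theta)\, q(\theta \mid \mathbf{x}_n)$, where $q(\theta \mid \mathbf{x}_n)$ is the posterior density. Now treat $\theta$ (temporarily) as a nuisance parameter. The posterior predictive distribution (density) for $X_{n+1}$ given $\mathbf{x}_n$ is then $g(x_{n+1} \mid \mathbf{x}_n) = \int_{\Theta} f(x_{n+1}; \theta)\, q(\theta \mid \mathbf{x}_n)\, d\theta$.
$g$ can be used to
- find a point prediction of $X_{n+1}$ as
  - the mean of $g$ with quadratic loss, i.e. $E(X_{n+1} \mid \mathbf{x}_n) = \int x_{n+1}\, g(x_{n+1} \mid \mathbf{x}_n)\, dx_{n+1}$
  - the median of $g$ with absolute error loss
  - the mode of $g$ with zero-one loss
- compute a $1 - \alpha$ prediction interval $(c, d)$ for $X_{n+1}$ by solving $\int_c^d g(x_{n+1} \mid \mathbf{x}_n)\, dx_{n+1} = 1 - \alpha$ for $c$ and $d$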
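Example
The number of calls to a telephone exchange during an hour can usually be shown to follow a Poisson distribution with mean $\theta$. Assume that a prior for $\theta$ is a Gamma($a$, $b$) distribution, i.e. (taking $a$ as the shape and $b$ as the rate parameter) the prior density is $p(\theta) = \frac{b^{a}}{\Gamma(a)}\, \theta^{a - 1} e^{-b\theta}$, $\theta > 0$. Now assume that we have observed $x_1$, $x_2$ and $x_3$ calls during each of the three previous hours and we wish to make predictive inference about the number of calls $x_4$ during the current hour. The posterior distribution is also Gamma, with parameters $a + x_1 + x_2 + x_3$ and $b + 3$, and the posterior predictive distribution of $x_4$ is $g(x_4 \mid x_1, x_2, x_3) = \int_0^{\infty} \frac{\theta^{x_4} e^{-\theta}}{x_4!}\, q(\theta \mid x_1, x_2, x_3)\, d\theta$, where $q$ is the posterior density.
E.g. a point prediction under quadratic loss of the number of calls is obtained by the predictive mean $E(X_4 \mid x_1, x_2, x_3) = \frac{a + x_1 + x_2 + x_3}{b + 3}$.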
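The telephone example can be sketched numerically. The code below assumes the Gamma prior is parametrized by shape $a$ and rate $b$ (the slides do not show which parametrization is used), and the prior values and call counts are hypothetical. Under that assumption the posterior is Gamma with shape $a + \sum x_i$ and rate $b + 3$, and the posterior predictive distribution is negative binomial.

```python
from math import exp, lgamma, log

# Hypothetical values, not taken from the slides.
a_prior, b_prior = 2.0, 1.0    # Gamma prior: shape a, rate b (assumed parametrization)
calls = [4, 7, 5]              # x1, x2, x3: calls in the three previous hours

a_post = a_prior + sum(calls)  # posterior shape
b_post = b_prior + len(calls)  # posterior rate


def predictive_pmf(x_new, a, b):
    """Pr(X4 = x_new | x1, x2, x3): Poisson mixed over Gamma(a, b), a negative binomial."""
    log_p = (lgamma(a + x_new) - lgamma(a) - lgamma(x_new + 1)
             + a * log(b / (b + 1.0)) - x_new * log(b + 1.0))
    return exp(log_p)


# Point prediction under quadratic loss: the predictive mean,
# which equals the posterior mean of theta.
point_prediction = a_post / b_post

print(f"posterior: Gamma(shape={a_post}, rate={b_post})")
print(f"point prediction of x4 under quadratic loss: {point_prediction}")
```

Summing `predictive_pmf` over a range of values gives the predictive probability of that range, which is how a prediction interval $(c, d)$ would be found in the discrete case.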
Empirical Bayes
“Something between the Bayesian and the frequentist approach”
General idea: Use some of the available data (sample) to estimate the prior for $\theta$. Use the rest of the available data to make inference about $\theta$.
Parametric set-up: Let data point $i$ be represented by the bivariate random variable $(X_i, \theta_i)$, $i = 1, 2, \dots, k$. Let $f(x; \theta_i)$ be the p.d.f. for $X_i$ and $p(\theta)$ be the marginal density of $\theta_i$. $f$ is assumed to be known apart from $\theta_i$, and $p$ is assumed to be unknown.
$\theta_1, \dots, \theta_k$ are unobservable quantities. With Empirical Bayes we try to make inference about $\theta_k$: Use $x_1, \dots, x_{k-1}$ to estimate $p(\theta)$. Then make inference about $\theta_k$ through its posterior distribution conditional on $x_k$ in the usual way:
- point estimation under certain choices of loss functions
- credible regions
- hypothesis testing
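How to estimate $p(\theta)$? The marginal density for $X_i$ is $m(x) = \int_{\Theta} f(x; \theta)\, p(\theta)\, d\theta$. Let $p = p(\theta; \lambda)$, where $\lambda$ is a (multidimensional) parameter defining the prior, assuming the functional form of $p$ is known, so that the marginal density becomes $m(x; \lambda)$.
Method of moments: Set up the equations $\int x^{r}\, m(x; \lambda)\, dx = \frac{1}{k - 1} \sum_{i=1}^{k-1} x_i^{r}$, $r = 1, 2, \dots$ (as many as there are components of $\lambda$), from which an estimate of $\lambda$ can be deduced.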
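A method-of-moments step of this kind can be sketched for one concrete model. The model is an assumption made here for illustration, not one given on the slides: $X_i \mid \theta_i$ is Poisson($\theta_i$) and the prior is Gamma with shape $a$ and rate $b$, so that the marginal moments are $E(X) = a/b$ and $\mathrm{Var}(X) = a/b + a/b^2$. The data are hypothetical.

```python
from statistics import mean, variance

# Hypothetical counts x_1, ..., x_k; the last one belongs to the unit of interest.
counts = [3, 7, 2, 9, 4, 12, 5, 1, 8, 6]
history, x_k = counts[:-1], counts[-1]   # the first k - 1 values estimate the prior

# Assumed model (illustration only): X_i | theta_i ~ Poisson(theta_i),
# theta_i ~ Gamma(shape a, rate b). Marginal moments of X_i:
#   E(X) = a / b,   Var(X) = a / b + a / b**2.
m1, s2 = mean(history), variance(history)
if s2 <= m1:
    raise ValueError("moment estimates require sample variance > sample mean")

b_hat = m1 / (s2 - m1)   # solves the variance equation given E(X) = m1
a_hat = m1 * b_hat       # solves the mean equation

# Posterior of theta_k given x_k under the estimated prior:
# Gamma(a_hat + x_k, b_hat + 1); its mean is the estimate under quadratic loss.
posterior_mean = (a_hat + x_k) / (b_hat + 1.0)

print(f"estimated prior: shape={a_hat:.3f}, rate={b_hat:.3f}")
print(f"posterior mean of theta_k: {posterior_mean:.3f}")
```

Note that only the first $k - 1$ observations enter the prior estimate; $x_k$ is used only in the posterior step.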
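Maximum likelihood: With the prior written as $p(\theta; \lambda)$, the parameter $\lambda$ is estimated as $\hat{\lambda} = \arg\max_{\lambda} \prod_{i=1}^{k-1} m(x_i; \lambda)$, where $m(x; \lambda) = \int_{\Theta} f(x; \theta)\, p(\theta; \lambda)\, d\theta$ is the marginal density of $X_i$. Apparently, the textbook is not correct here, as the estimation of $p$ should be based on the first $k - 1$ observations and the $k$th should be excluded from that stage. See the textbook for simplifications of the MLE procedure.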