Lecture 3: STATISTICS BAYESIAN INFERENCE


1 Lecture 3: STATISTICS BAYESIAN INFERENCE
“The theory of probabilities is basically only common sense reduced to calculus.” (P.S. Laplace)
“To ask the right question is harder than to answer it.” (G. Cantor)
See the Lecture Notes (Chapter 2) at arXiv: v3 … plus examples, exercises and references.

2 The Reverend Thomas Bayes, F.R.S.
(1701?-1761). Bayes Theorem.
LII. An Essay towards solving a Problem in the Doctrine of Chances. By the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, M. A. and F. R. S.
Dear Sir, Read Dec. 23, I now send you an essay which I have found among the papers of our deceased friend Mr. Bayes, and which, in my opinion, has great merit, and well deserves to be preserved. Experimental philosophy, you will find, is nearly interested in the subject of it; and on this account there seems to be particular reason for thinking that a communication of it to the Royal Society cannot be improper.
… to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times. … some rule could be found, according to which we ought to estimate the chance that the probability for the happening of an event perfectly unknown, should lie between any two named degrees of probability, antecedently to any experiments made about it; …
Common sense is indeed sufficient to shew us that, from the observation of what has in former instances been the consequence of a certain cause or action, one may make a judgement what is likely to be the consequence of it another time.

3 Bayesian inference: Elements
Inferences on θ, the model,… via Bayes “rule”:
Posterior: quantifies the knowledge (“degree of belief”) we have on the parameter of interest θ, conditioned on (modified by) the observed data.
Likelihood: how the data modify our knowledge of the parameters. The experiment affects our knowledge of the parameters only through the likelihood (thus, same likelihoods, same inferences). It is defined up to a multiplicative constant.
Prior: knowledge (“degree of credibility”) about the parameters before the data are taken.
 Informative: includes prior knowledge (in particular, if there is trustable info from previous experiments).
 Non-Informative: relative ignorance before the experiment is done (… independent experiment) … the posterior is dominated by the likelihood.
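In formulas (a reconstruction of the slide's standard relation; x denotes the observed data):

```latex
% Bayes "rule" for a parameter theta given observed data x:
\pi(\theta \mid x) \;=\;
\frac{p(x \mid \theta)\,\pi(\theta)}{\int_\Theta p(x \mid \theta')\,\pi(\theta')\,d\theta'}
\;\propto\;
\underbrace{p(x \mid \theta)}_{\text{likelihood}}\;
\underbrace{\pi(\theta)}_{\text{prior}}
```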

4 EXCHANGEABLE Sequences of Observations:
A sequence of random quantities is exchangeable if the joint density is invariant under any permutation of the indices: there is symmetry in the observations, so the order in which they are taken is irrelevant.
iid ⟹ exchangeable (the product of densities is invariant under re-ordering), but exchangeability is less restrictive than iid … and more interesting…

5 De Finetti’s Theorem (1930s) (… extended by Hewitt and Savage, 1950s)
1) An infinite sequence is exchangeable if any finite sub-sequence is exchangeable.
2) If {x1, x2, …} is an exchangeable sequence, then any finite sequence is described by a model p(x|θ) and there is a prior function (… measure … density) π(θ) such that the joint density is a mixture of iid models.
… this justifies / leads to the formal use of Bayes Theorem.
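A compact statement of the representation the theorem provides (a sketch in standard notation):

```latex
% De Finetti representation: an exchangeable sequence behaves as
% conditionally iid given theta, mixed over a prior measure pi(theta):
p(x_1,\dots,x_n) \;=\; \int_{\Theta} \prod_{i=1}^{n} p(x_i \mid \theta)\;\pi(\theta)\,d\theta
```

The prior appears as the mixing measure, which is what leads to the formal use of Bayes Theorem.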

6 Bayesian Inference: Scheme
1) Experiment: n-fold repetition under the same conditions.
2) Analysis of the observed data:
2.1) Model to describe the data, with unknown parameters about which we want to make inferences from the observed data; usually parameter(s) of interest + nuisance parameter(s).
2.2) Full model: “prior (… density)” for the parameters.
2.3) Get the distribution of the unknown parameters conditioned on the observed data: Bayes “rule”.

7 2.4) Draw inferences on parameters of interest from “posterior” densities:
Integrate over the nuisance parameters, if any (otherwise the expression is simpler), keeping the parameter(s) of interest; … one may also get conditional densities if needed.
Normalisation: usually the domain does not depend on the parameters.

8 3) “Predictive Inferences”
If desired, the Bayesian approach also allows one to make “predictive inferences” from present knowledge: within the same model, predict new data yet to come (Bayes rule; independent samplings related through the common parameters) (… model checking).
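The predictive density in formulas (standard form; y denotes the new data, assumed independent of x given θ):

```latex
% Posterior predictive density for new data y, given observed data x:
p(y \mid x) \;=\; \int_{\Theta} p(y \mid \theta)\,\pi(\theta \mid x)\,d\theta
```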

9 Bayesian inference: Structures
One experiment (same conditions): a sequence of observations by the same detector.
Isolated experiments: same model, different and independent parameters. Example: incidence of the same disease in populations of different countries, ethnic groups, social statuses,…
Hierarchical: parameters come from a common source governed by hyperparameters; same source… different conditions…

10 EXAMPLE: … just to fix ideas. One experiment; same conditions…
Acceptance of events after cuts (or efficiency, or whatever):
 Total number of events: N
 Number of events observed to pass the cuts: n (0 ≤ n ≤ N)
An event is either accepted, with probability θ, or rejected, with probability 1−θ → Binomial model: p(n|N,θ) ∝ θ^n (1−θ)^(N−n).
Consider the proper (uniform) prior π(θ) = 1 on [0,1]; the posterior is then also proper: π(θ|n,N) ∝ θ^n (1−θ)^(N−n).
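A minimal numerical sketch of this example (assuming the uniform prior quoted on slide 14, so the posterior is Beta(n+1, N−n+1); the values of N and n below are made up for illustration):

```python
from scipy import stats

N, n = 100, 73                             # total events and events passing the cuts (illustrative)
posterior = stats.beta(n + 1, N - n + 1)   # Beta posterior from a uniform prior on [0, 1]

print("posterior mean      :", posterior.mean())            # (n+1)/(N+2)
print("posterior mode      :", n / N)                       # maximum of theta^n (1-theta)^(N-n)
print("68% central interval:", posterior.interval(0.68))
```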

11 Sufficient Minimal Statistics (… make life easier)
Statistic: a random vector t = t(x1,…,xm) built from the sample.
Sufficient: a statistic that carries all the relevant info from the experiment on the parameters of interest.
Sufficient and minimal: a sufficient statistic of smallest dimension k(m) (better if k(m) = k, independent of m).
If they exist, we do not have to work with the whole sample.
Example: e.g. for an iid Normal sample, (Σ xi, Σ xi²) is sufficient and minimal for (μ, σ).

12 Exponential Family
The only distributions that admit a fixed number of minimal sufficient statistics, independent of the sample size, belong to the exponential family (… but for some irregular cases).
A model belongs to the k-parametric exponential family if it has the form given below. In most cases we shall deal with the regular family, for which the support does not depend on the parameters (+ some other conditions).
… more than what you think … and less than what we would like…
Example: a model that is not regular → no sufficient and minimal statistics of fixed dimension → work with the whole sample.
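The family's standard form (a reconstruction; the notation f, g, φ_j, h_j is conventional, not from the slide):

```latex
% k-parametric exponential family:
p(x \mid \boldsymbol{\theta}) \;=\;
f(x)\, g(\boldsymbol{\theta})\,
\exp\Big\{ \sum_{j=1}^{k} c_j\,\phi_j(\boldsymbol{\theta})\,h_j(x) \Big\}
% For an iid sample x_1,...,x_n the quantities t_j = \sum_i h_j(x_i), j = 1,...,k,
% form a set of sufficient statistics of fixed dimension k.
```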

13 EXAMPLE: Exponential (1) (Life-Time of a Particle)
Model: p(t|τ) = (1/τ) e^(−t/τ), t ≥ 0.
Experiment: {t1, t2, …, tn}. Minimal sufficient statistic: t = Σ ti (show that, for instance from the Fourier Transform).
Bayes Rule with the prior measure π(τ) ∝ const (improper): the posterior π(τ|t) ∝ τ^(−n) e^(−t/τ) is proper (for n ≥ 2).
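A numerical sketch (assuming the uniform prior quoted on slide 14, the posterior above is an inverse-gamma density; the sample below is simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
tau_true = 2.0
sample = rng.exponential(tau_true, size=50)     # simulated life-times
n, t = len(sample), sample.sum()                # minimal sufficient statistic: (n, t)

# Uniform prior on tau -> posterior ∝ tau^(-n) exp(-t/tau): inverse-gamma(n-1, scale=t)
posterior = stats.invgamma(a=n - 1, scale=t)
print("posterior mean      :", posterior.mean())         # = t/(n-2)
print("68% central interval:", posterior.interval(0.68))
```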

14 For the Binomial and Exponential examples…
… we took a uniform prior density, motivated by the Bayes-Laplace postulate (“Principle of Insufficient Reason”): if there is no special reason, all possible outcomes are equally likely. This is not always the best choice:
1) What “prior” info do we have on the parameters of interest? Prior ~ knowledge (“degree of credibility”) about the parameters before the data are taken ~ a measure (De Finetti, Hewitt, Savage).
 Informative: include prior knowledge (… trustable info from previous experiments,…)
 Non-Informative: relative ignorance before the experiment is done (… independent analysis,…)
2) Consistency: given the model, we need a prior, and inferences come from the posterior; a non-linear one-to-one transformation of the parameter turns a uniform density into a non-uniform one (see below).
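Spelling out point 2) (the standard change-of-variables argument):

```latex
% If phi = g(theta), with g one-to-one, probability conservation gives
\pi_\phi(\phi) \;=\; \pi_\theta\big(g^{-1}(\phi)\big)\,\Big|\frac{d\theta}{d\phi}\Big|
% so pi(theta) = const implies pi(phi) != const unless g is linear:
% "uniform in theta" and "uniform in phi" are different states of knowledge.
```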

15 “Non-Informative” priors (all detailed in the notes, sect. 2.6.*)
Usually we are in a state of “relative ignorance” before the experiment is done, so eventually we would like to specify priors which provide little info relative to what is expected to be given by the experiment (“non-informative”).
● “Non-informative”… one is never in a state of complete ignorance; “knowing little a priori” is relative to the info provided by the experiment … the posterior is dominated by the likelihood (the data).
● “Used as standard” reference functions; usually they are improper densities: they do not quantify prior knowledge on the parameters in a pdf … (sequence of compact coverings, …) ● do we really need a proper prior?...
Reasonable and sound criteria to choose a prior density:
 Invariances (Jeffreys (1939); Jaynes (1964))
 Conjugated Priors (Raiffa + Schlaifer (1961))
 Probability Matching Priors (Welch + Peers (1963); … Ghosh, Datta, Mukerjee,…)
 Reference Priors (Bernardo (1979); Berger)
 Hierarchical Structures
 … more…

16 1) INVARIANCES under a Group of Transformations
Model: p(x|θ). Consider a group of transformations G that acts
 ● on the sample space: x → x′ = g(x)
 ● on the parametric space: θ → θ′ = g̃(θ)
The model M is invariant under G if the transformed random quantity x′ = g(x) is distributed as p(x′|g̃(θ)).
EXAMPLE: Exponential Model: a scale transformation t′ = a t maps the model into itself with τ′ = a τ.

17 Two simple and important cases: Position and scale parameters
SCHEME:
1) Identify the group of transformations on the sample space; this induces the action of the group G = (S, *) on the parameter space, under which the model is invariant.
2) Choose a prior density that is invariant under this group, so that:
 → initial transformations of the data under G make no difference to the inferences;
 → same prior beliefs on the original and the transformed parameters.
… there may be no obvious (or no) symmetry … and we may want to consider invariant measures under transformations that are not explicit in the model …
Usual transformations: translations, scalings, affine maps, matrix transforms,…
Position parameter: model p(x−μ), invariant under the Translation Group.
Scale parameter: model (1/σ) p(x/σ), invariant under the Multiplication Group.
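The two invariant measures this scheme yields (standard results, matching the groups named above):

```latex
% Position parameter: invariance under x' = x + a, mu' = mu + a gives
\pi(\mu)\,d\mu \;\propto\; d\mu
% Scale parameter: invariance under x' = a x, sigma' = a*sigma gives
\pi(\sigma)\,d\sigma \;\propto\; \frac{d\sigma}{\sigma}
```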

18 Scale Parameter
● Is there any transformation group acting on θ so that the model M is invariant? ● New random quantity: a reparameterization; prior considerations on θ and θ′…

19 Position parameter
The same rationale holds for position parameters (translation invariance) … consistent: π(μ) ∝ const for a position parameter.
When position and scale are considered independent: π(μ,σ) ∝ 1/σ.
… Many important models have LOCATION AND SCALE parameters (see the notes for further examples).

20 Example
1) Set of sufficient statistics (e.g. the sample mean and variance for a Normal model).
2) Location and scale parameters: improper prior π(μ,σ) ∝ 1/σ → inferences on μ; inferences on σ.
3) Comparing two samples:
 means: H1: same (but unknown) variances; H2: unknown and different variances (Behrens-Fisher problem);
 variances: …

21 INVARIANCE under reparameterisations (Sir H. Jeffreys; 1950)
Any criterion to specify a prior density for θ should give a consistent result for a parameter φ = g(θ), with g a single-valued function. Jeffreys' choice: π(θ) ∝ √det I(θ), with I(θ) Fisher's matrix. (Show this !!)
Fisher's matrix: I(θ)ij = E[ (∂ ln p(x|θ)/∂θi)(∂ ln p(x|θ)/∂θj) ] … it obviously exists for a well-behaved model: 1) the log-likelihood is differentiable in θ; 2) the expectations are finite; 3) the support does not depend on θ (the usual regularity conditions).
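A sketch of the “Show this !!” step in one dimension (standard argument):

```latex
% Fisher information under a one-to-one reparameterization phi = g(theta):
I(\phi) \;=\; E\Big[\Big(\frac{\partial \ln p(x|\phi)}{\partial\phi}\Big)^{2}\Big]
\;=\; I(\theta)\Big(\frac{d\theta}{d\phi}\Big)^{2}
% Hence, taking pi(theta) ∝ sqrt(I(theta)):
\sqrt{I(\phi)}\;d\phi \;=\; \sqrt{I(\theta)}\,\Big|\frac{d\theta}{d\phi}\Big|\,d\phi
\;=\; \sqrt{I(\theta)}\;d\theta
% so the Jeffreys prior transforms consistently under reparameterizations.
```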

22 Example / Exercise
Scale parameter !! → the (improper) invariant prior; with the sufficient statistic of the sample, the resulting posterior is proper.

23 EXAMPLE: Mixture Model
³He/⁴He flux ratio; kinetic energy per nucleon: 1–5 GeV/n. Both priors…

24 n-dimensional parameters
1) Fisher's matrix: under a smooth one-to-one transformation of the parameters it transforms like a covariant tensor (connection between geometry, information and probability)… but in several dimensions it may not be the best choice (as pointed out by Jeffreys himself).
2) (Haar) Invariant measures under a group of transformations: for n-dimensional parameters and non-abelian groups the LEFT and RIGHT Haar measures differ; consistency and convergence considerations favour the RIGHT one (OK in most cases).
3) Probability Matching Priors: a useful pragmatic/appeasement option. A prior function such that one-sided credible intervals derived from the posterior distribution coincide (to a certain degree of accuracy) with those derived from the frequentist approach → a differential equation involving Fisher's matrix. (See the example on the Correlation Coefficient of the Bivariate Normal Model.)

25 May set up more elaborate models…
4) Hierarchical Structures:
 1) choose a class of priors π(θ|λ) that reflects the structure of the model;
 2) choose a prior distribution for the “hyperparameters” (hyperprior) and marginalise for π(θ);
 3) posterior density.
Option 1): … choose a reference prior for the model. Options 2), 3): “reasonable vague hyperpriors” … or the “empirical method”.
5) Conjugated Priors (… closed under sampling → they simplify life).

26 Reference Priors (Bernardo, J.M. (1979))
Central idea: consider the expected amount of info on the parameter provided by k independent observations of the model, relative to the prior knowledge (the expected mutual information; the Kullback-Leibler discrepancy between two distributions). As k grows this tends to the maximum info on θ that one could expect to get from the model relative to the prior knowledge: the “perfect info”.
The reference prior relative to the model is the one for which the “perfect info” is maximal … the “less informative” prior for this model.

27 Explicit form of the reference prior
● Calculus of variations: implicit solutions, divergent → regularization,… (sect. 6.7)
● Berger, J.O., Bernardo, J.M., Sun, D. (2009), Ann. Stat., Vol. 37, No. 2:
 1) Consider a continuous, strictly positive function π*(θ) such that the corresponding posterior is proper (the easier the better);
 2) define from it the limiting sequence, with θ0 any interior point of the parameter space.
Under very general conditions the result is an admissible reference prior.

28 EXERCISE: Poisson and Binomial Distributions
Poisson distribution: take e.g. π*(λ) = const; show that the procedure gives π(λ) ∝ λ^(−1/2).
Binomial distribution: take e.g. π*(θ) = const and show that π(θ) ∝ θ^(−1/2)(1−θ)^(−1/2) (more involved; look at the notes…).
Show that both are Jeffreys' priors and Probability Matching Priors.
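A worked sketch of the last part of the exercise (standard Fisher-information computations):

```latex
% Poisson, P(n|lambda) = e^{-lambda} lambda^n / n! :
I(\lambda) \;=\; E\Big[\Big(\frac{n}{\lambda}-1\Big)^{2}\Big] \;=\; \frac{1}{\lambda}
\;\Rightarrow\; \pi(\lambda) \propto \lambda^{-1/2}
% Binomial, P(n|theta) = C(N,n) theta^n (1-theta)^{N-n} :
I(\theta) \;=\; \frac{N}{\theta(1-\theta)}
\;\Rightarrow\; \pi(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}
\;\; (\text{a Beta}(1/2,1/2)\ \text{density})
```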

29 Reference priors in n > 1 dimensions
1) Arrange the parameters in order of importance: an ordered parameterization (e.g. n = 2: parameter of interest first, nuisance second).
2) Proceed sequentially with the conditionals … check whether the conditionals are proper… (sequence of compact sets…) … see the notes for all of them...
However: in many cases sufficiently “vague” reasonable priors work fine (it is wise to check the dependence of the inferences on the prior). Example: Regression.

30 Regression Problems
Data: an exchangeable sequence of pairs {(x1,y1),…,(xn,yn)}.
1) Specify the Model. Examples:
 Normal model with a linear relation: if (a,b) were known, we would have position and scale parameters.
 Poisson model: number of counts at a given position, time,… (see the notes for examples…)
In general, for non-linear expressions, priors are a non-trivial task … but, usually, sufficiently vague priors work fine.

31 EXAMPLE: proton flux in cosmic rays
1) Specify a simple model.
Priors: Normal densities with large variances (σ >>), with support restricted to R+ where required.

32 Marginal and joint posterior densities for the model parameters (figure).

33 DECISION THEORY
Test of Hypothesis: … evaluate the evidence in favour of a scientific theory…
Point Estimation: … a “representative” numeric value for the parameters of interest…

34 DECISION THEORY
PROBLEM: how to choose the optimal action among a set of different alternatives (… Game Theory). For a given problem, we have to specify:
1) Sets:
 Set of all possible “states of nature” (Parameter Space)
 Set of all possible experimental outcomes (Sample Space)
 Set of all possible actions to be taken
Example: disease test
 states of nature: {healthy, sick}
 experimental outcomes: {test +, test −}
 actions: {apply treatment, do not apply treatment}

35 2) Loss Function
In non-trivial situations, we cannot take any action without potential losses. The basic element of Decision Theory is the loss function, which quantifies the loss associated with taking an action (or decision) when the “state of nature” is θ.
… What do we know about θ? The knowledge we have on the “state of nature” is quantified by the posterior density.
3) Risk Function: the risk associated with taking an action, having observed the data, is the posterior expectation of the loss (see below).
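In formulas (standard definitions; the symbols l and R are conventional, not from the slide):

```latex
% Loss l(a_i, theta): cost of taking action a_i when the state of nature is theta.
% Risk of action a_i given the observed data x:
R(a_i \mid x) \;=\; \int_{\Theta} l(a_i,\theta)\,\pi(\theta \mid x)\,d\theta
```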

36 4) Bayesian Decision
Take the action that minimises the Risk (minimum expected loss). Two types of problems:
 Hypothesis Testing: {accept, reject} a hypothesis.
 Inference: which statistic shall we take as an estimator of the parameter?

37 Hypothesis Testing: Example with two alternatives
The hypotheses H1, H2 are exclusive and exhaustive; Bayes' rule gives their posterior probabilities.
Possible actions: ai = the action to be taken … decide for hypothesis Hi (i.e. choose hypothesis Hi).
Loss function: lij = the loss incurred when we take action ai (choose hypothesis Hi) and the “state of nature” is Hj.

38 Bayesian Decision: Take the action that minimises the Risk
Risk function: R(ai|x) = Σj lij P(Hj|x).
Bayesian decision: take the action that minimises the risk, i.e. take action a1 (choose hypothesis H1) if R(a1|x) < R(a2|x) (same for a2).
If we take the 0-1 loss function, this reduces to choosing the hypothesis with the larger posterior probability.

39 Bayes Factor
Bayes Theorem: posterior odds = (evidence from the data) × (prior odds); the data change the prior beliefs…
The Bayes factor quantifies how strongly the data favour one model over the other; for simple hypotheses it is a ratio of likelihoods.
Take action a1 (decide for hypothesis H1) if the posterior odds exceed the corresponding ratio of losses. Usually: decide for H1 if B12 > 1 (… 0-1 loss function and equal prior odds) (may not be realistic !!).
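The odds form of Bayes Theorem that the slide decomposes (standard notation B12 for the Bayes factor):

```latex
\underbrace{\frac{P(H_1 \mid x)}{P(H_2 \mid x)}}_{\text{posterior odds}}
\;=\;
\underbrace{\frac{p(x \mid H_1)}{p(x \mid H_2)}}_{\text{Bayes factor } B_{12}}
\times
\underbrace{\frac{P(H_1)}{P(H_2)}}_{\text{prior odds}}
% With the 0-1 loss and equal prior odds: decide for H1 if B12 > 1.
```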

40 Quantify evidence for H1
If the hypotheses are exclusive and exhaustive, the posterior odds fix the posterior probability for H1.

41 HYPOTHESIS TESTING: General cases
Hypotheses:
 SIMPLE: specify everything about the model, including the values of the parameters.
 COMPOSITE: the models have parameters not specified by the hypothesis (if there are nuisance parameters… they are integrated out).
Cases: S (H1) vs S (H2); S (H1) vs C (H2); C (H1) vs C (H2).
Parameters not specified by the model → average likelihood (see below).
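The average (marginal) likelihood in formulas (standard form):

```latex
% For a composite hypothesis H with unspecified parameters theta (nuisance included):
p(x \mid H) \;=\; \int_{\Theta} p(x \mid \theta, H)\,\pi(\theta \mid H)\,d\theta
```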

42 1) “Little” problem… composite hypothesis
For a composite hypothesis the Bayes factor involves the average likelihood, and if the prior is improper this is not well defined (… it carries an arbitrary constant), even though improper priors are fine for inferences as long as the posterior is proper.
Ways out: … take sufficiently general proper priors (conjugated, for instance), but… 1) O'Hagan; Berger, Pericchi:
 1.1) take a minimal subset of the observed sample to render a proper posterior (with, for instance, reference priors);
 1.2) compute the Bayes Factor with the remaining subsample;
 1.3) dilute the effect of a particular “training” sample by evaluating the BF with all possible combinations (… median).

43 Second approach: 2) Schwarz + …
Quantify the evidence for a particular model avoiding the prior specification: the Bayes Information Criterion, based on the asymptotic normality of the likelihood (see notes). It favours fewer dimensions and is easy to evaluate; the difference of BICs quantifies the “preference” for H1.
See the notes for important examples: significance of an unexpected “signal”,…
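The criterion in a common convention (sign conventions differ across references):

```latex
% k free parameters, n observations, maximised likelihood \hat{L}:
\mathrm{BIC} \;=\; -2\ln\hat{L} \;+\; k\ln n
% From the asymptotic normality of the likelihood,
2\ln B_{12} \;\approx\; \mathrm{BIC}_2 - \mathrm{BIC}_1
% so the lower BIC indicates "preference" for that model, penalising extra dimensions.
```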

44 STATISTICAL INFERENCE
Point Estimation
“…when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind.” (W. Thomson, Lord Kelvin)

45 INFERENCE: (Point estimation)
Given the posterior density:
1) Quadratic loss → the posterior mean
2) Linear (absolute value) loss → the posterior median
3) Zero-one loss → the posterior mode
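Why each loss singles out its estimator; the quadratic case spelled out (a standard argument, the other two follow along the same lines):

```latex
% Quadratic loss l(\hat\theta, \theta) = (\hat\theta - \theta)^2:
R(\hat\theta \mid x) \;=\; E\big[(\hat\theta - \theta)^2 \mid x\big]
\;\Rightarrow\;
\frac{\partial R}{\partial \hat\theta} = 2\big(\hat\theta - E[\theta \mid x]\big) = 0
\;\Rightarrow\;
\hat\theta = E[\theta \mid x] \quad (\text{posterior mean})
% Linear loss |\hat\theta - \theta| -> posterior median; zero-one loss -> posterior mode.
```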

46 CREDIBLE REGIONS
Definition: a credible region with probability content α is a region R of the parameter space such that P(θ ∈ R|x) = α. In one dimension: an interval.
Credible regions are not unique … and all have the same probability content… all equally valid… We need a prescription to single out one “representative” region.

47 HPD (Highest Probability Density)
The smallest possible volume in parameter space with probability content α. Equivalent definitions:
1) if θ1 ∈ R and θ2 ∉ R, then π(θ1|x) ≥ π(θ2|x);
2) R = {θ : π(θ|x) ≥ c}, where c is the largest constant for which P(θ ∈ R|x) ≥ α.
Monte Carlo sampling may help (see the sketch below).
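A minimal sketch of the Monte Carlo route in one dimension: given samples from the posterior, scan all intervals with probability content α and keep the shortest (this approximates the HPD interval for a unimodal density; the function name is illustrative):

```python
import numpy as np

def hpd_interval(samples, alpha=0.68):
    """Shortest interval containing a fraction alpha of the samples."""
    x = np.sort(samples)
    m = int(np.ceil(alpha * len(x)))           # number of points inside the interval
    widths = x[m - 1:] - x[:len(x) - m + 1]    # width of every candidate interval
    i = np.argmin(widths)                      # index of the shortest one
    return x[i], x[i + m - 1]

# Example: samples from an asymmetric (Gamma) "posterior"
rng = np.random.default_rng(0)
print(hpd_interval(rng.gamma(3.0, 1.0, size=100_000)))
```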

48 Some properties
1) One dimension: if the distribution has one mode and is symmetric, the HPD interval is the symmetric, equal-tailed one.
2) HPD regions may not be connected (for instance, with more than one mode).
3) Points with equal posterior density are both included or both excluded.
4) Under a one-to-one transformation: a region with probability content α in θ has probability content α in φ … but in general it is not HPD, unless the relation is linear.
5) In general, equal-tailed credible intervals may not have the smallest size (they are not HPD).

49 Bayesian and Frequentist/Classical views
Back to the Exponential Model.

50 EXAMPLE: LIFE-TIME of a PARTICLE (n)
1) Model: p(t|τ) = (1/τ) e^(−t/τ).
2) Experiment: {t1,…,tn}; sufficient statistic: t = Σ ti.
B: τ is a scale parameter → prior π(τ) ∝ 1/τ; posterior (proper): π(τ|t) ∝ τ^(−(n+1)) e^(−t/τ).
F: maximise the likelihood (next slide).

51 Maximum Likelihood
Frequentist (classical) approach: maximise the likelihood function. Other options: equate sample moments to those of the model, minimise a “distance”,…
Why? … “It has good properties.”
Bayesian rationale: 1) uniform prior + 2) zero-one loss → the posterior mode.
The “good properties” hold under regularity conditions (which can be relaxed for some properties): (… different θ should not give the same sampling distribution… )

1) Consistency.
2) Usually it is (or leads to) a function of the sufficient statistic, and to unbiased estimators.
3) Asymptotic Normality.
4) Efficiency: it attains asymptotically the (Cramér-Rao) lower bound for unbiased estimators.
5) Invariance under monotonous 1-to-1 transformations … but if the transformation is not linear, the transformed estimator is usually biased.
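Properties 3) and 4) in formulas (standard statements under the regularity conditions above):

```latex
% Asymptotic Normality: for iid samples of size n, under regularity conditions,
\hat{\theta}_n \;\xrightarrow{\;d\;}\; N\big(\theta_0,\;[\,n\,I(\theta_0)\,]^{-1}\big)
% Efficiency (Cramer-Rao): for any unbiased estimator T(x) of theta,
V(T) \;\geq\; \frac{1}{n\,I(\theta)}
% and the ML estimator attains this bound asymptotically.
```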

53 …asymptotic properties…
F: the estimator is a statistic, so if I repeat the experiment many times I shall get a sequence of values nicely clustered around the true value.
B: those are “nice sampling properties” of the m.l. “estimator”, but we are interested in the parameter, not in the “estimator” (… besides the fact that we usually do not repeat the experiment). So… what can you say about the parameter?
F: … CONFIDENCE LIMITS … (… Bayesian: Credible Regions …). The observed value is a sampling of the random quantity, for a fixed value of the parameter and a specified CL.

54 For each possible value of the parameter…
… find an interval in the sample space with probability content β. As for credible regions, these intervals are not unique: central, smallest size,… For this example I took … β = 0.68 … but what about τ?

55 Random intervals
A particular observation (t) will single out one interval within the β (0.68) band.
B: this does not mean that τ has a 68% chance to lie in this interval!
F: they are random intervals: 1) if you repeat the experiment, most likely you will get different intervals, but τ will not change; 2) the construction carries over to invertible functions of the parameter (…).

56 What does β = 0.68 mean?
100 identical experiments → 100 intervals; in the figure, 34 do not contain the true value. The chance that the interval contains the true value of the parameter is 0.68: if you repeat the experiment 100 times under the same conditions, you will get 100 intervals… and ~68 will contain the true value of the parameter (… “constant coverage” !!).
You do the experiment once, so you pick one of these intervals at random. Does it contain the true value of the parameter?
Absolutely different philosophies:
F: given the parameters, how likely is the observed sample? … Great, but of hardly any interest; we are interested in the parameters.
B: given the data, draw inferences on the parameters of interest.
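A quick simulation of this statement for the exponential life-time example (a sketch; the central interval uses the exact sampling distribution 2Σti/τ ~ χ²(2n); all numbers are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
tau_true, n, beta = 2.0, 20, 0.68
lo_q, hi_q = stats.chi2.ppf([(1 - beta) / 2, (1 + beta) / 2], df=2 * n)

covered = 0
for _ in range(10_000):                       # "repeat the experiment" many times
    T = rng.exponential(tau_true, size=n).sum()
    lo, hi = 2 * T / hi_q, 2 * T / lo_q       # central interval from 2T/tau ~ chi2(2n)
    covered += (lo <= tau_true <= hi)

print("empirical coverage:", covered / 10_000)   # ~0.68 by construction
```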

57 One-sided intervals
See the notes for other important examples (i.e. the significance of an unexpected “signal”,…), and an example with one-sided intervals: isotropy of cosmic rays (or whatever).

58 Example: Isotropy of cosmic rays
Arrival directions of selected 16–350 GeV electrons (left) and positrons (right) in galactic coordinates (Hammer-Aitoff projection); the colour code reflects the number of events per bin (figure). J. Casaus et al.; 33rd ICRC-2013.
Model to describe the observations: a spherical harmonics expansion of the pdf, using a real basis on Ω.

59 Statistics from the experiment
1) Interest in l = 1: the dipole anisotropy. From the experiment:
2) for a given l, the experiment provides k = 2l+1 statistics, with similar precision and approximately independent.

60 PROBLEM:  1)  2)  3) Ordered parameterization
 1) independent  2) Parameterization of interest Spherical polar parameters  3) Ordered parameterization Parameter of interest Nuisance parameters independent priors Joint sampling density

61 4) Prior for the nuisance parameters
The Lebesgue measure on the (k−1)-dimensional sphere is the Haar invariant measure under rotations (see notes; Problem 2.5) → the uniform distribution on the (k−1)-dimensional sphere, given by the surface element (from the k-dimensional volume element).
Take the surface element as the prior density for the nuisance parameters:
 → a proper density;
 → it equals the determinant of the Fisher's submatrix for the angular part → Jeffreys' prior for the nuisance parameters.

62  6) and last  5) Integrate nuisance parameters and for
 5) Integrate nuisance parameters and for  6) and last 6.1) Guess what shall we get: Moments: (… for instance Mellin Transform) Jeffreys’ for Normal approximation

63 6.2) Reference Prior; 6.3) Jeffreys' Prior
6.2) Reference Prior: start with a non-negative function that renders a proper posterior and for which the integrals are simplified (Problem: do it with … or …); after some algebra, with any interior point as reference, and from the asymptotic behaviour of the Bessel functions, one obtains the reference prior.
6.3) Jeffreys' Prior (a bit more involved than 6.2): compute Fisher's information from the sampling distribution; for large b: … (regardless of …).

64 7) Posterior and Sampling Distribution
Let's derive the One-sided Upper Credible Region for the dipole anisotropy:

65 One-sided Upper Credible Region
 7) DIPOLE parameterization; sampling distribution; posterior distribution.
Bayesian one-sided upper credible region: given the data x, find the value of the dipole amplitude below which the posterior probability content is α …

66 Classical One-sided Interval: Neyman's construction
Upper (lower) bound on a parameter… same rationale; now there is no ambiguity in the interval definition. For a probability content β: for each possible value of the parameter, find the corresponding bound on the sample space.
1) The bounds get closer as x grows.
2.1) For x < 2 the bound is underestimated. 2.2) For x ≤ xc = … there is no solution … → an ordering prescription is needed:

67 Most interesting solution… Feldman-Cousins' construction
For each possible value of θ, find the region of the sample space with probability content β, ranking the values of x by the pdf ratio R(x) = p(x|θ)/p(x|θ̂(x)), where θ̂(x) is the “best” estimate of θ given x (usually the maximum-likelihood one; remember the properties of HPD regions). (Figure: pdf ratio for θ0 = 2.)
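A compact sketch of the construction for the canonical Poisson-signal-with-known-background case (grid-based and illustrative only; the parameter values are made up):

```python
import numpy as np
from scipy import stats

def fc_interval(n_obs, b, beta=0.90, mu_grid=np.linspace(0.0, 15.0, 301), n_max=50):
    """Feldman-Cousins interval for a Poisson signal mu over known background b (sketch)."""
    ns = np.arange(n_max + 1)
    accepted = []
    for mu in mu_grid:
        p = stats.poisson.pmf(ns, mu + b)
        mu_best = np.maximum(ns - b, 0.0)             # physically allowed best estimate of mu
        r = p / stats.poisson.pmf(ns, mu_best + b)    # likelihood-ratio ordering
        content, acc = 0.0, set()
        for k in np.argsort(-r):                      # include n values by decreasing rank
            acc.add(int(k))
            content += p[k]
            if content >= beta:                       # stop once the content reaches beta
                break
        if n_obs in acc:                              # mu is in the interval if n_obs is accepted
            accepted.append(mu)
    return min(accepted), max(accepted)

print(fc_interval(n_obs=4, b=3.0))                    # 90% CL interval for mu
```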

68 Properties
1) For large x the region becomes an interval; for values not favoured by the data, one may modify the ordering rule or use Neyman's construction to get an upper bound if desired.
2) It is easier to include constraints on the parameters … and nuisance parameters.
3) For discrete random quantities it may not be possible to satisfy the probability content exactly: e.g. Poisson counts with a known background parameter (see the notes for a detailed example).
4) They are frequentist intervals (… constant coverage…) and as such should be interpreted.


Download ppt "Lecture 3: STATISTICS BAYESIAN INFERENCE"

Similar presentations


Ads by Google