Why I am a Bayesian (and why you should become one, too) or Classical statistics considered harmful Kevin Murphy UBC CS & Stats 9 February 2005
Where does the title come from?
“Why I am not a Bayesian”, Glymour, 1981
“Why Glymour is a Bayesian”, Rosenkrantz, 1983
“Why isn’t everyone a Bayesian?”, Efron, 1986
“Bayesianism and causality, or, why I am only a half-Bayesian”, Pearl, 2001
Many other such philosophical essays…
Frequentist vs Bayesian
Frequentist: probability = objective relative frequencies. Parameters are fixed unknown constants, so we cannot write e.g. P(θ=0.5|D). Estimators should be good when averaged across many trials.
Bayesian: probability = degrees of belief (uncertainty). We can write P(anything|D). Estimators should be good for the available data.
Source: “All of Statistics”, Larry Wasserman
Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
(The following slides are from Tenenbaum & Griffiths.)
Hypotheses in coin flipping
Describe processes by which D could be generated; D = HHTHT
- Fair coin, P(H) = 0.5
- Coin with P(H) = p
- Markov model
- Hidden Markov model
- ...
These are generative (statistical) models.
Representing generative models
Graphical model notation (Pearl 1988; Jordan 1998): variables are nodes, edges indicate dependency, and directed edges show the causal process of data generation.
[Figure: fair coin, P(H) = 0.5, with independent nodes d1 d2 d3 d4; a Markov model with edges d1 → d2 → d3 → d4; the data HHTHT fills nodes d1 … d5]
Models with latent structure
Not all nodes in a graphical model need to be observed; some variables reflect latent structure, used in generating D but unobserved.
[Figure: coin with P(H) = p, where p is a latent parent of d1 … d4; a hidden Markov model with latent states s1 … s4 emitting d1 … d4; the data HHTHT fills d1 … d5]
How do we select the “best” model?
Bayes’ rule
P(H|D) = P(D|H) P(H) / Σ_H' P(D|H') P(H')
posterior probability ∝ likelihood × prior probability; the denominator sums over the space of hypotheses.
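The sum over the hypothesis space can be sketched in a few lines. This is a minimal illustration, not code from the talk; the names `posterior` and `lik` are hypothetical.

```python
def posterior(prior, likelihood, data):
    """Bayes' rule: P(h|D) = P(D|h) P(h) / sum over h' of P(D|h') P(h')."""
    unnorm = {h: likelihood(h, data) * p for h, p in prior.items()}
    z = sum(unnorm.values())            # P(D): sum over the space of hypotheses
    return {h: u / z for h, u in unnorm.items()}

def lik(p, data):
    """Likelihood of a coin-flip sequence under bias p (probability of heads)."""
    out = 1.0
    for c in data:
        out *= p if c == 'H' else 1.0 - p
    return out

# Two hypotheses about a coin: fair (p = 0.5) vs. always heads (p = 1.0)
print(posterior({0.5: 0.999, 1.0: 0.001}, lik, 'HHHHH'))
```

The returned dictionary is a proper distribution over the hypotheses: it sums to one because of the division by P(D).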
The origin of Bayes’ rule
A simple consequence of using probability to represent degrees of belief. For any two random variables A and B:
P(A|B) P(B) = P(A, B) = P(B|A) P(A), hence P(A|B) = P(B|A) P(A) / P(B).
Why represent degrees of belief with probabilities?
Good statistics: consistency and worst-case error bounds.
Cox axioms: necessary to cohere with common sense.
“Dutch book” + survival of the fittest: if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.
Provides a theory of incremental learning: a common currency for combining prior knowledge and the lessons of experience.
Hypotheses in Bayesian inference
Hypotheses H refer to processes that could have generated the data D.
Bayesian inference provides a distribution over these hypotheses, given D.
P(D|H) is the probability of D being generated by the process identified by H.
Hypotheses H are mutually exclusive: only one process could have generated D.
Coin flipping
Comparing two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
Comparing simple and complex hypotheses: P(H) = 0.5 vs. P(H) = p
Comparing two simple hypotheses
Contrast two simple hypotheses: H1: “fair coin”, P(H) = 0.5, vs. H2: “always heads”, P(H) = 1.0.
Apply Bayes’ rule; with two hypotheses, use the odds form.
Bayes’ rule in odds form
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
posterior odds = Bayes factor (likelihood ratio) × prior odds
Data = HHTHT
H1: “fair coin”; H2: “always heads”
P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 0, P(H2) = 1/1000
P(H1|D) / P(H2|D) = [P(D|H1) P(H1)] / [P(D|H2) P(H2)] = infinity
Data = HHHHH
H1: “fair coin”; H2: “always heads”
P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000
P(H1|D) / P(H2|D) ≈ 30
Data = HHHHHHHHHH
H1: “fair coin”; H2: “always heads”
P(D|H1) = 1/2^10, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000
P(H1|D) / P(H2|D) ≈ 1
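The three worked examples above can be reproduced in a few lines. A minimal sketch (the function name `posterior_odds` is hypothetical; the priors 999/1000 and 1/1000 are the ones on the slides):

```python
from math import inf

def posterior_odds(data, prior_h1=999/1000, prior_h2=1/1000):
    """Posterior odds P(H1|D)/P(H2|D): H1 = fair coin, H2 = always heads."""
    lik_h1 = 0.5 ** len(data)                      # fair coin: each flip prob 1/2
    lik_h2 = 1.0 if set(data) == {'H'} else 0.0    # always heads
    if lik_h2 == 0.0:
        return inf                                 # a single tail rules out H2
    return (lik_h1 * prior_h1) / (lik_h2 * prior_h2)

print(posterior_odds('HHTHT'))     # inf: H2 is ruled out outright
print(posterior_odds('HHHHH'))     # 999/32, roughly 30
print(posterior_odds('H' * 10))    # 999/1024, roughly 1
```

Note how the odds swing from infinite support for the fair coin to roughly even after ten heads, exactly as on the slides.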
Coin flipping
Comparing two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
Comparing simple and complex hypotheses: P(H) = 0.5 vs. P(H) = p
Comparing simple and complex hypotheses
[Figure: graphical model for a coin with latent parameter p, P(H) = p, vs. the fair coin, P(H) = 0.5]
Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?
Comparing simple and complex hypotheses
P(H) = p is more complex than P(H) = 0.5 in two ways:
- P(H) = 0.5 is a special case of P(H) = p
- for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
Comparing simple and complex hypotheses
[Figure: probability of the data as a function of p; for HHHHH the best fit is p = 1.0, for HHTHT it is p = 0.6]
Comparing simple and complex hypotheses
P(H) = p is more complex than P(H) = 0.5 in two ways:
- P(H) = 0.5 is a special case of P(H) = p
- for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
How can we deal with this?
- frequentist: hypothesis testing
- information theorist: minimum description length
- Bayesian: just use probability theory!
Comparing simple and complex hypotheses
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
Computing the likelihood P(D|H1) is easy: P(D|H1) = 1/2^N
Compute the marginal likelihood P(D|H2) by averaging the likelihood over the prior on p:
P(D|H2) = ∫ P(D|p) P(p|H2) dp
A simple method of specifying priors
Imagine some fictitious trials, reflecting a set of previous experiences (a strategy often used with neural networks).
e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair.
In fact, this is a sensible statistical idea...
Likelihood and prior
Likelihood: P(D | p) = p^NH (1-p)^NT
- NH: number of heads; NT: number of tails
Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1), i.e. Beta(FH, FT)
- FH: fictitious observations of heads; FT: fictitious observations of tails (“pseudo-counts”)
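With this prior, updating amounts to adding real counts to the pseudo-counts. A minimal sketch (the function name `beta_posterior` is hypothetical; the 1000/1000 prior and the 140/110 counts are the ones used elsewhere in the talk):

```python
def beta_posterior(FH, FT, NH, NT):
    """Beta(FH, FT) prior + (NH heads, NT tails) -> Beta(FH+NH, FT+NT) posterior.
    Returns the posterior pseudo-counts and the posterior mean of p."""
    a, b = FH + NH, FT + NT
    return a, b, a / (a + b)

# Strong fair-coin prior (1000 fictitious heads and tails), then observe
# 140 heads and 110 tails:
print(beta_posterior(1000, 1000, 140, 110))
```

The posterior mean stays close to 0.5: the 2000 fictitious trials dominate the 250 real ones.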
Posterior ∝ prior × likelihood
P(p|D) ∝ p^(NH+FH-1) (1-p)^(NT+FT-1), i.e. Beta(NH+FH, NT+FT): the same form as the prior!
Conjugate priors
Exist for many standard distributions; there is a general formula for the exponential family.
Define the prior in terms of fictitious observations: the Beta is conjugate to the Bernoulli (coin-flipping).
[Figure: Beta densities for FH = FT = 1, FH = FT = 3, and FH = FT = 1000]
Normalizing constants
Prior: P(p) = p^(FH-1) (1-p)^(FT-1) / B(FH, FT), where B(a, b) = Γ(a) Γ(b) / Γ(a+b) is the normalizing constant of the Beta distribution.
Posterior: P(p|D) = p^(NH+FH-1) (1-p)^(NT+FT-1) / B(NH+FH, NT+FT)
Hence the marginal likelihood is P(D) = B(NH+FH, NT+FT) / B(FH, FT).
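The ratio of Beta normalizing constants is easy to evaluate in log space. A sketch, assuming a Beta(FH, FT) prior (the function names are hypothetical):

```python
from math import lgamma, exp

def log_beta(a, b):
    """log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_likelihood(NH, NT, FH=1, FT=1):
    """P(D|H2) = B(NH+FH, NT+FT) / B(FH, FT)."""
    return exp(log_beta(NH + FH, NT + FT) - log_beta(FH, FT))

# Uniform prior, D = HHTHT (3 heads, 2 tails): 3! 2! / 6! = 1/60,
# smaller than P(D|H1) = 1/2^5 = 1/32, so the fair coin wins on this data.
print(marginal_likelihood(3, 2))
```

Working with `lgamma` rather than factorials keeps the computation stable for large counts.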
Comparing simple and complex hypotheses
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
Likelihood for H1: P(D|H1) = 1/2^N
Marginal likelihood (“evidence”) for H2: P(D|H2) = ∫ P(D|p) P(p|H2) dp
Marginal likelihood for H1 and H2
[Figure: probability of the data under H1 and H2; the marginal likelihood for H2 is an average over all values of p]
Sensitivity to hyper-parameters
Bayesian model selection
Simple and complex hypotheses can be compared directly using Bayes’ rule; this requires summing over latent variables.
Complex hypotheses are penalized for their greater flexibility: the “Bayesian Occam’s razor”.
Maximum likelihood cannot be used for model selection (it always prefers the hypothesis with the largest number of parameters).
Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?
Example: Belgian euro-coins
A Belgian euro spun N = 250 times came up heads X = 140 times.
“It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%” – Barry Blight, LSE (reported in The Guardian, 2002)
Source: MacKay, exercise 3.15
Classical hypothesis testing
Null hypothesis H0, e.g. p = 0.5 (unbiased coin).
For a classical analysis we don’t need to specify an alternative hypothesis, but later we will use H1: p ≠ 0.5.
We need a decision rule that maps data D to accept/reject of H0. Define a scalar measure of deviance d(D) from the null hypothesis, e.g. the number of heads NH or the χ² statistic.
P-values
The p-value of the observed data is P(d(D') ≥ d(D) | H0), where D' ranges over hypothetical data sets: intuitively, the probability of getting data at least as extreme as ours, given H0.
Usually the rejection threshold is chosen so that the false rejection rate of H0 is below the significance level α = 0.05.
Often an asymptotic approximation to the distribution of d(D) under H0 as N → ∞ is used.
P-value for euro coins
N = 250 trials, X = 140 heads: the p-value is “less than 7%”.
If N = 250 and X = 141, the p-value falls below 0.05, so we could reject the null hypothesis at the 5% significance level. But this does not mean P(H0|D) = 0.07!
Pval = (1 - binocdf(139, n, 0.5)) + binocdf(110, n, 0.5)   % two-sided: P(X >= 140) + P(X <= 110), n = 250
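The quoted number can be checked directly. A sketch in Python rather than the slide’s MATLAB; `binom_cdf` is a hand-rolled helper, not a library call:

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n = 250
# two-sided p-value for 140 heads: P(X >= 140) + P(X <= 110)
pval = (1 - binom_cdf(139, n)) + binom_cdf(110, n)
print(pval)   # just under 0.07: "less than 7%", but this is not P(H0|D)
```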
Bayesian analysis of euro-coin
Assume P(H0) = P(H1) = 0.5.
Assume P(p) ~ Beta(α, β); setting α = β = 1 yields a uniform (non-informative) prior.
Bayesian analysis of euro-coin
If α = β = 1, the Bayes factor is B = P(D|H1) / P(D|H0) ≈ 0.48, so H0 (unbiased) is (slightly) more probable than H1 (biased).
By varying α over a large range, the best we can do is make B = 1.9, which does not strongly support the biased-coin hypothesis.
Other priors yield similar results. The Bayesian analysis contradicts the classical analysis.
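The Bayes factor behind these conclusions can be sketched as follows, assuming a symmetric Beta(α, α) prior under H1 (the grid of α values below is an arbitrary illustration, not from the talk):

```python
from math import lgamma, exp, log

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

NH, NT = 140, 110                      # euro-coin data: heads, tails

def bayes_factor(alpha):
    """B = P(D|H1) / P(D|H0) with p ~ Beta(alpha, alpha) under H1."""
    log_m1 = log_beta(NH + alpha, NT + alpha) - log_beta(alpha, alpha)
    log_m0 = (NH + NT) * log(0.5)      # H0: p = 0.5 exactly
    return exp(log_m1 - log_m0)

print(bayes_factor(1.0))                                    # < 1: favours H0
print(max(bayes_factor(a) for a in (0.1, 1, 10, 50, 200)))  # stays below 2
```

Even the most favourable α on this grid leaves the evidence for bias weak, in sharp contrast to the “suspicious” p-value.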
Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?
- Violates likelihood principle
- Violates stopping rule principle
- Violates common sense
The likelihood principle
In order to choose between hypotheses H0 and H1 given observed data, one should ask how likely the observed data are; do not ask questions about data that we might have observed but did not.
This principle can be proved from two simpler principles, called conditionality and sufficiency.
Frequentist statistics violates the likelihood principle
“The use of P-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred.” – Jeffreys, 1961
Another example
Suppose X ~ N(θ, σ²); we observe x = 3.
Compare H0: θ = 0 with H1: θ > 0.
The p-value is P(X ≥ 3 | H0) ≈ 0.001, so we reject H0.
Bayesian approach: update P(θ|X) using conjugate analysis; compute the Bayes factor to compare H0 and H1.
When are P-values valid?
Suppose X ~ N(θ, σ²); we observe X = x.
One-sided hypothesis test: H0: θ ≤ 0 vs. H1: θ > 0.
If P(θ) ∝ 1 (a flat prior), then P(θ|x) = N(θ; x, σ²), so P(θ ≤ 0 | x) = P(X ≥ x | θ = 0): the p-value coincides with the posterior probability of H0 in this case, since the Gaussian is symmetric in its arguments.
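This coincidence is easy to verify numerically. A sketch with σ = 1; `phi` is a hand-rolled standard normal CDF:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

x, sigma = 3.0, 1.0
p_value = 1.0 - phi(x / sigma)     # P(X >= x | theta = 0)
posterior = phi(-x / sigma)        # P(theta <= 0 | x) under a flat prior
print(p_value, posterior)          # the two numbers agree
```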
Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?
- Violates likelihood principle
- Violates stopping rule principle
- Violates common sense
Stopping rule principle
Inferences you make should depend only on the observed data, not on the reasons why the data was collected.
If you look at your data to decide when to stop collecting, this should not change any conclusions you draw.
This follows from the likelihood principle.
Frequentist statistics violates the stopping rule principle
Observe D = HHHTHHHHTHHT (9 heads, 3 tails). Is there evidence of bias towards heads (Ph > Pt)?
Let X = 3 tails be the observed random variable and N = 12 trials be a fixed constant, and define H0: Ph = 0.5. Then, at the 5% level, there is no significant evidence of bias: the binomial p-value P(X ≤ 3 | H0) ≈ 0.073.
Frequentist statistics violates the stopping rule principle
Suppose instead the data was generated by tossing the coin until we got X = 3 tails. Now X = 3 is a fixed constant and N = 12 is a random variable, and there is significant evidence of bias: the negative-binomial p-value P(N ≥ 12 | H0) ≈ 0.033 < 0.05.
(The first n-1 trials contain x-1 tails; the last trial is always a tail.)
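Both p-values can be computed exactly from the same data. A sketch: the binomial model fixes N = 12, the negative-binomial model fixes the count of 3 tails:

```python
from math import comb

NH, NT = 9, 3                 # D = HHHTHHHHTHHT: 9 heads, 3 tails
N = NH + NT

# Binomial sampling model (N fixed): p-value = P(at most 3 tails | p = 0.5)
p_binom = sum(comb(N, k) * 0.5**N for k in range(NT + 1))

# Negative-binomial model (stop at the 3rd tail): p-value = P(N >= 12 | p = 0.5)
# P(N = n) = C(n-1, 2) * 0.5**n, so sum the complement over n = 3 .. 11.
p_negbinom = 1.0 - sum(comb(n - 1, NT - 1) * 0.5**n for n in range(NT, N))

print(p_binom)     # ~0.073: not significant at the 5% level
print(p_negbinom)  # ~0.033: "significant", from the identical data
```

The likelihoods of the observed sequence are proportional under the two models, so a Bayesian analysis is unaffected by the stopping rule.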
Ignoring the stopping criterion can mislead classical estimators
Let Xi ~ Bernoulli(θ). With a fixed number of tosses, the maximum likelihood estimator (the fraction of heads) is unbiased.
Now toss a coin; if heads, stop, else toss a second coin. The outcomes are P(H) = θ, P(TH) = (1-θ)θ, P(TT) = (1-θ)². Now the MLE is biased!
There are many classical rules for assessing significance when complex stopping rules are used.
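The bias can be exhibited exactly by enumerating the three outcomes (the function name `expected_mle` is hypothetical):

```python
def expected_mle(theta):
    """E[fraction of heads] under the rule: toss; stop on H, else toss once more.
    Outcomes: H  (prob theta,            MLE = 1)
              TH (prob (1-theta)*theta,  MLE = 1/2)
              TT (prob (1-theta)**2,     MLE = 0)"""
    return theta * 1.0 + (1.0 - theta) * theta * 0.5

print(expected_mle(0.5))   # 0.625, not 0.5: the stopping rule biases the MLE up
```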
Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?
- Violates likelihood principle
- Violates stopping rule principle
- Violates common sense
Confidence intervals
An interval (θmin(D), θmax(D)) is a 95% CI for θ if θ lies inside this interval 95% of the time across repeated draws D ~ P(·|θ).
This does not mean P(θ ∈ CI | D) = 0.95!
Source: MacKay, section 37.3
Example
Draw two integers x1, x2 independently, each equal to θ or θ+1 with probability 1/2.
If θ = 39, we would expect (x1, x2) to be (39,39), (39,40), (40,39) or (40,40), each with probability 1/4.
Define the confidence interval as CI(D) = (min(x1,x2), min(x1,x2)).
e.g. (x1,x2) = (40,39) gives CI = (39,39). 75% of the time, this CI will contain the true θ.
CIs violate common sense
If (x1,x2) = (39,39), then CI = (39,39) at level 75%. But clearly P(θ=39|D) = P(θ=38|D) = 0.5.
If (x1,x2) = (39,40), then CI = (39,39), but now clearly P(θ=39|D) = 1.0.
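A quick simulation confirms the 75% frequentist coverage, even though the interval is absurd on a case-by-case basis. This sketch assumes each observation equals θ or θ+1 with probability 1/2, which matches the slide’s numbers:

```python
import random

def ci(x1, x2):
    """The degenerate interval (m, m) with m = min(x1, x2)."""
    m = min(x1, x2)
    return (m, m)

rng = random.Random(0)
theta, trials = 39, 100_000
hits = 0
for _ in range(trials):
    x1 = theta + rng.randint(0, 1)   # each observation is theta or theta + 1
    x2 = theta + rng.randint(0, 1)
    if ci(x1, x2) == (theta, theta):
        hits += 1
print(hits / trials)   # close to 0.75: a valid 75% CI despite the silly cases
```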
What’s wrong with the classical approach?
Violates likelihood principle
Violates stopping rule principle
Violates common sense
What’s right about the Bayesian approach?
Simple and natural.
An optimal mechanism for reasoning under uncertainty.
A generalization of Aristotelian logic that reduces to deductive logic if our hypotheses are either true or false.
Supports interesting (human-like) kinds of learning.
Bayesian humor “A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.”