Why I am a Bayesian (and why you should become one, too) or Classical statistics considered harmful Kevin Murphy UBC CS & Stats 9 February 2005
Where does the title come from?
“Why I am not a Bayesian”, Glymour, 1981
“Why Glymour is a Bayesian”, Rosenkrantz, 1983
“Why isn’t everyone a Bayesian?”, Efron, 1986
“Bayesianism and causality, or, why I am only a half-Bayesian”, Pearl, 2001
Many other such philosophical essays…
Frequentist vs Bayesian
Frequentist: probability = objective relative frequencies. Parameters are fixed unknown constants, so we cannot write e.g. P(θ=0.5|D). Estimators should be good when averaged across many trials.
Bayesian: probability = degrees of belief (uncertainty). We can write P(anything|D). Estimators should be good for the available data.
Source: “All of Statistics”, Larry Wasserman
Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?
Coin flipping
HHTHT
HHHHH
What process produced these sequences?
(The following slides are from Tenenbaum & Griffiths.)
Hypotheses in coin flipping
Describe processes by which D could be generated; D = HHTHT
- Fair coin, P(H) = 0.5
- Coin with P(H) = p
- Markov model
- Hidden Markov model
- ...
These are generative (statistical) models.
Representing generative models
Graphical model notation (Pearl 1988; Jordan 1998): variables are nodes, edges indicate dependency, and directed edges show the causal process of data generation.
[Figure: fair coin, P(H) = 0.5, with independent nodes d1 d2 d3 d4; a Markov model with edges d1 → d2 → d3 → d4; the data HHTHT fills nodes d1 … d5]
Models with latent structure
Not all nodes in a graphical model need to be observed; some variables reflect latent structure, used in generating D but unobserved.
[Figure: coin with P(H) = p, where p is a latent parent of d1 … d4; a hidden Markov model with latent states s1 … s4 emitting d1 … d4; the data HHTHT fills d1 … d5]
How do we select the “best” model?
Bayes’ rule
P(H|D) = P(D|H) P(H) / Σ_H' P(D|H') P(H')
posterior probability ∝ likelihood × prior probability; the denominator sums over the space of hypotheses.
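The sum over the hypothesis space can be sketched in a few lines. This is a minimal illustration, not code from the talk; the names `posterior` and `lik` are hypothetical.

```python
def posterior(prior, likelihood, data):
    """Bayes' rule: P(h|D) = P(D|h) P(h) / sum over h' of P(D|h') P(h')."""
    unnorm = {h: likelihood(h, data) * p for h, p in prior.items()}
    z = sum(unnorm.values())            # P(D): sum over the space of hypotheses
    return {h: u / z for h, u in unnorm.items()}

def lik(p, data):
    """Likelihood of a coin-flip sequence under bias p (probability of heads)."""
    out = 1.0
    for c in data:
        out *= p if c == 'H' else 1.0 - p
    return out

# Two hypotheses about a coin: fair (p = 0.5) vs. always heads (p = 1.0)
print(posterior({0.5: 0.999, 1.0: 0.001}, lik, 'HHHHH'))
```

The returned dictionary is a proper distribution over the hypotheses: it sums to one because of the division by P(D).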
The origin of Bayes’ rule
A simple consequence of using probability to represent degrees of belief. For any two random variables A and B:
P(A|B) P(B) = P(A, B) = P(B|A) P(A), hence P(A|B) = P(B|A) P(A) / P(B).
Why represent degrees of belief with probabilities?
Good statistics: consistency and worst-case error bounds.
Cox axioms: necessary to cohere with common sense.
“Dutch book” + survival of the fittest: if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.
Provides a theory of incremental learning: a common currency for combining prior knowledge and the lessons of experience.
Hypotheses in Bayesian inference
Hypotheses H refer to processes that could have generated the data D.
Bayesian inference provides a distribution over these hypotheses, given D.
P(D|H) is the probability of D being generated by the process identified by H.
Hypotheses H are mutually exclusive: only one process could have generated D.
Coin flipping
Comparing two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
Comparing simple and complex hypotheses: P(H) = 0.5 vs. P(H) = p
Comparing two simple hypotheses
Contrast two simple hypotheses: H1: “fair coin”, P(H) = 0.5, vs. H2: “always heads”, P(H) = 1.0.
Apply Bayes’ rule; with two hypotheses, use the odds form.
Bayes’ rule in odds form
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
posterior odds = Bayes factor (likelihood ratio) × prior odds
Data = HHTHT
H1: “fair coin”; H2: “always heads”
P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 0, P(H2) = 1/1000
P(H1|D) / P(H2|D) = [P(D|H1) P(H1)] / [P(D|H2) P(H2)] = infinity
Data = HHHHH
H1: “fair coin”; H2: “always heads”
P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000
P(H1|D) / P(H2|D) ≈ 30
Data = HHHHHHHHHH
H1: “fair coin”; H2: “always heads”
P(D|H1) = 1/2^10, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000
P(H1|D) / P(H2|D) ≈ 1
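The three worked examples above can be reproduced in a few lines. A minimal sketch (the function name `posterior_odds` is hypothetical; the priors 999/1000 and 1/1000 are the ones on the slides):

```python
from math import inf

def posterior_odds(data, prior_h1=999/1000, prior_h2=1/1000):
    """Posterior odds P(H1|D)/P(H2|D): H1 = fair coin, H2 = always heads."""
    lik_h1 = 0.5 ** len(data)                      # fair coin: each flip prob 1/2
    lik_h2 = 1.0 if set(data) == {'H'} else 0.0    # always heads
    if lik_h2 == 0.0:
        return inf                                 # a single tail rules out H2
    return (lik_h1 * prior_h1) / (lik_h2 * prior_h2)

print(posterior_odds('HHTHT'))     # inf: H2 is ruled out outright
print(posterior_odds('HHHHH'))     # 999/32, roughly 30
print(posterior_odds('H' * 10))    # 999/1024, roughly 1
```

Note how the odds swing from infinite support for the fair coin to roughly even after ten heads, exactly as on the slides.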
Coin flipping
Comparing two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
Comparing simple and complex hypotheses: P(H) = 0.5 vs. P(H) = p
Comparing simple and complex hypotheses
[Figure: graphical model for a coin with latent parameter p, P(H) = p, vs. the fair coin, P(H) = 0.5]
Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?
Comparing simple and complex hypotheses
P(H) = p is more complex than P(H) = 0.5 in two ways:
- P(H) = 0.5 is a special case of P(H) = p
- for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
Comparing simple and complex hypotheses
[Figure: probability of the data as a function of p; for HHHHH the best fit is p = 1.0, for HHTHT it is p = 0.6]
Comparing simple and complex hypotheses
P(H) = p is more complex than P(H) = 0.5 in two ways:
- P(H) = 0.5 is a special case of P(H) = p
- for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5
How can we deal with this?
- frequentist: hypothesis testing
- information theorist: minimum description length
- Bayesian: just use probability theory!
Comparing simple and complex hypotheses
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
Computing the likelihood P(D|H1) is easy: P(D|H1) = 1/2^N
Compute the marginal likelihood P(D|H2) by averaging the likelihood over the prior on p:
P(D|H2) = ∫ P(D|p) P(p|H2) dp
A simple method of specifying priors
Imagine some fictitious trials, reflecting a set of previous experiences (a strategy often used with neural networks).
e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair.
In fact, this is a sensible statistical idea...
Likelihood and prior
Likelihood: P(D | p) = p^NH (1-p)^NT
- NH: number of heads; NT: number of tails
Prior: P(p) ∝ p^(FH-1) (1-p)^(FT-1), i.e. Beta(FH, FT)
- FH: fictitious observations of heads; FT: fictitious observations of tails (“pseudo-counts”)
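With this prior, updating amounts to adding real counts to the pseudo-counts. A minimal sketch (the function name `beta_posterior` is hypothetical; the 1000/1000 prior and the 140/110 counts are the ones used elsewhere in the talk):

```python
def beta_posterior(FH, FT, NH, NT):
    """Beta(FH, FT) prior + (NH heads, NT tails) -> Beta(FH+NH, FT+NT) posterior.
    Returns the posterior pseudo-counts and the posterior mean of p."""
    a, b = FH + NH, FT + NT
    return a, b, a / (a + b)

# Strong fair-coin prior (1000 fictitious heads and tails), then observe
# 140 heads and 110 tails:
print(beta_posterior(1000, 1000, 140, 110))
```

The posterior mean stays close to 0.5: the 2000 fictitious trials dominate the 250 real ones.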
Posterior ∝ prior × likelihood
P(p|D) ∝ p^(NH+FH-1) (1-p)^(NT+FT-1), i.e. Beta(NH+FH, NT+FT): the same form as the prior!
Conjugate priors
Exist for many standard distributions; there is a general formula for the exponential family.
Define the prior in terms of fictitious observations: the Beta is conjugate to the Bernoulli (coin-flipping).
[Figure: Beta densities for FH = FT = 1, FH = FT = 3, and FH = FT = 1000]
Normalizing constants
Prior: P(p) = p^(FH-1) (1-p)^(FT-1) / B(FH, FT), where B(a, b) = Γ(a) Γ(b) / Γ(a+b) is the normalizing constant of the Beta distribution.
Posterior: P(p|D) = p^(NH+FH-1) (1-p)^(NT+FT-1) / B(NH+FH, NT+FT)
Hence the marginal likelihood is P(D) = B(NH+FH, NT+FT) / B(FH, FT).
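The ratio of Beta normalizing constants is easy to evaluate in log space. A sketch, assuming a Beta(FH, FT) prior (the function names are hypothetical):

```python
from math import lgamma, exp

def log_beta(a, b):
    """log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_likelihood(NH, NT, FH=1, FT=1):
    """P(D|H2) = B(NH+FH, NT+FT) / B(FH, FT)."""
    return exp(log_beta(NH + FH, NT + FT) - log_beta(FH, FT))

# Uniform prior, D = HHTHT (3 heads, 2 tails): 3! 2! / 6! = 1/60,
# smaller than P(D|H1) = 1/2^5 = 1/32, so the fair coin wins on this data.
print(marginal_likelihood(3, 2))
```

Working with `lgamma` rather than factorials keeps the computation stable for large counts.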
Comparing simple and complex hypotheses
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
Likelihood for H1: P(D|H1) = 1/2^N
Marginal likelihood (“evidence”) for H2: P(D|H2) = ∫ P(D|p) P(p|H2) dp
Marginal likelihood for H1 and H2
[Figure: probability of the data under H1 and H2; the marginal likelihood for H2 is an average over all values of p]
Sensitivity to hyper-parameters
Bayesian model selection
Simple and complex hypotheses can be compared directly using Bayes’ rule; this requires summing over latent variables.
Complex hypotheses are penalized for their greater flexibility: the “Bayesian Occam’s razor”.
Maximum likelihood cannot be used for model selection (it always prefers the hypothesis with the largest number of parameters).
Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?
Example: Belgian euro-coins
A Belgian euro spun N = 250 times came up heads X = 140 times.
“It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%” – Barry Blight, LSE (reported in The Guardian, 2002)
Source: MacKay, exercise 3.15
Classical hypothesis testing
Null hypothesis H0, e.g. p = 0.5 (unbiased coin).
For a classical analysis we don’t need to specify an alternative hypothesis, but later we will use H1: p ≠ 0.5.
We need a decision rule that maps data D to accept/reject of H0. Define a scalar measure of deviance d(D) from the null hypothesis, e.g. the number of heads NH or the χ² statistic.
P-values
The p-value of the observed data is P(d(D') ≥ d(D) | H0), where D' ranges over hypothetical data sets: intuitively, the probability of getting data at least as extreme as ours, given H0.
Usually the rejection threshold is chosen so that the false rejection rate of H0 is below the significance level α = 0.05.
Often an asymptotic approximation to the distribution of d(D) under H0 as N → ∞ is used.
P-value for euro coins
N = 250 trials, X = 140 heads: the p-value is “less than 7%”.
If N = 250 and X = 141, the p-value falls below 0.05, so we could reject the null hypothesis at the 5% significance level. But this does not mean P(H0|D) = 0.07!
Pval = (1 - binocdf(139, n, 0.5)) + binocdf(110, n, 0.5)   % two-sided: P(X >= 140) + P(X <= 110), n = 250
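The quoted number can be checked directly. A sketch in Python rather than the slide’s MATLAB; `binom_cdf` is a hand-rolled helper, not a library call:

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n = 250
# two-sided p-value for 140 heads: P(X >= 140) + P(X <= 110)
pval = (1 - binom_cdf(139, n)) + binom_cdf(110, n)
print(pval)   # just under 0.07: "less than 7%", but this is not P(H0|D)
```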
Bayesian analysis of euro-coin
Assume P(H0) = P(H1) = 0.5.
Assume P(p) ~ Beta(α, β); setting α = β = 1 yields a uniform (non-informative) prior.
Bayesian analysis of euro-coin
If α = β = 1, the Bayes factor is B = P(D|H1) / P(D|H0) ≈ 0.48, so H0 (unbiased) is (slightly) more probable than H1 (biased).
By varying α over a large range, the best we can do is make B = 1.9, which does not strongly support the biased-coin hypothesis.
Other priors yield similar results. The Bayesian analysis contradicts the classical analysis.
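The Bayes factor behind these conclusions can be sketched as follows, assuming a symmetric Beta(α, α) prior under H1 (the grid of α values below is an arbitrary illustration, not from the talk):

```python
from math import lgamma, exp, log

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

NH, NT = 140, 110                      # euro-coin data: heads, tails

def bayes_factor(alpha):
    """B = P(D|H1) / P(D|H0) with p ~ Beta(alpha, alpha) under H1."""
    log_m1 = log_beta(NH + alpha, NT + alpha) - log_beta(alpha, alpha)
    log_m0 = (NH + NT) * log(0.5)      # H0: p = 0.5 exactly
    return exp(log_m1 - log_m0)

print(bayes_factor(1.0))                                    # < 1: favours H0
print(max(bayes_factor(a) for a in (0.1, 1, 10, 50, 200)))  # stays below 2
```

Even the most favourable α on this grid leaves the evidence for bias weak, in sharp contrast to the “suspicious” p-value.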
Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?
- Violates likelihood principle
- Violates stopping rule principle
- Violates common sense
The likelihood principle
In order to choose between hypotheses H0 and H1 given observed data, one should ask how likely the observed data are; do not ask questions about data that we might have observed but did not.
This principle can be proved from two simpler principles, called conditionality and sufficiency.
Frequentist statistics violates the likelihood principle
“The use of P-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred.” – Jeffreys, 1961
Another example
Suppose X ~ N(θ, σ²); we observe x = 3.
Compare H0: θ = 0 with H1: θ > 0.
The p-value is P(X ≥ 3 | H0) ≈ 0.001, so we reject H0.
Bayesian approach: update P(θ|X) using conjugate analysis; compute the Bayes factor to compare H0 and H1.
When are P-values valid?
Suppose X ~ N(θ, σ²); we observe X = x.
One-sided hypothesis test: H0: θ ≤ 0 vs. H1: θ > 0.
If P(θ) ∝ 1 (a flat prior), then P(θ|x) = N(θ; x, σ²), so P(θ ≤ 0 | x) = P(X ≥ x | θ = 0): the p-value coincides with the posterior probability of H0 in this case, since the Gaussian is symmetric in its arguments.
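This coincidence is easy to verify numerically. A sketch with σ = 1; `phi` is a hand-rolled standard normal CDF:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

x, sigma = 3.0, 1.0
p_value = 1.0 - phi(x / sigma)     # P(X >= x | theta = 0)
posterior = phi(-x / sigma)        # P(theta <= 0 | x) under a flat prior
print(p_value, posterior)          # the two numbers agree
```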
Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?
- Violates likelihood principle
- Violates stopping rule principle
- Violates common sense
Stopping rule principle
Inferences you make should depend only on the observed data, not on the reasons why the data was collected.
If you look at your data to decide when to stop collecting, this should not change any conclusions you draw.
This follows from the likelihood principle.
Frequentist statistics violates the stopping rule principle
Observe D = HHHTHHHHTHHT (9 heads, 3 tails). Is there evidence of bias towards heads (Ph > Pt)?
Let X = 3 tails be the observed random variable and N = 12 trials be a fixed constant, and define H0: Ph = 0.5. Then, at the 5% level, there is no significant evidence of bias: the binomial p-value P(X ≤ 3 | H0) ≈ 0.073.
Frequentist statistics violates the stopping rule principle
Suppose instead the data was generated by tossing the coin until we got X = 3 tails. Now X = 3 is a fixed constant and N = 12 is a random variable, and there is significant evidence of bias: the negative-binomial p-value P(N ≥ 12 | H0) ≈ 0.033 < 0.05.
(The first n-1 trials contain x-1 tails; the last trial is always a tail.)
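Both p-values can be computed exactly from the same data. A sketch: the binomial model fixes N = 12, the negative-binomial model fixes the count of 3 tails:

```python
from math import comb

NH, NT = 9, 3                 # D = HHHTHHHHTHHT: 9 heads, 3 tails
N = NH + NT

# Binomial sampling model (N fixed): p-value = P(at most 3 tails | p = 0.5)
p_binom = sum(comb(N, k) * 0.5**N for k in range(NT + 1))

# Negative-binomial model (stop at the 3rd tail): p-value = P(N >= 12 | p = 0.5)
# P(N = n) = C(n-1, 2) * 0.5**n, so sum the complement over n = 3 .. 11.
p_negbinom = 1.0 - sum(comb(n - 1, NT - 1) * 0.5**n for n in range(NT, N))

print(p_binom)     # ~0.073: not significant at the 5% level
print(p_negbinom)  # ~0.033: "significant", from the identical data
```

The likelihoods of the observed sequence are proportional under the two models, so a Bayesian analysis is unaffected by the stopping rule.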
Ignoring the stopping criterion can mislead classical estimators
Let Xi ~ Bernoulli(θ). With a fixed number of tosses, the maximum likelihood estimator (the fraction of heads) is unbiased.
Now toss a coin; if heads, stop, else toss a second coin. The outcomes are P(H) = θ, P(TH) = (1-θ)θ, P(TT) = (1-θ)². Now the MLE is biased!
There are many classical rules for assessing significance when complex stopping rules are used.
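The bias can be exhibited exactly by enumerating the three outcomes (the function name `expected_mle` is hypothetical):

```python
def expected_mle(theta):
    """E[fraction of heads] under the rule: toss; stop on H, else toss once more.
    Outcomes: H  (prob theta,            MLE = 1)
              TH (prob (1-theta)*theta,  MLE = 1/2)
              TT (prob (1-theta)**2,     MLE = 0)"""
    return theta * 1.0 + (1.0 - theta) * theta * 0.5

print(expected_mle(0.5))   # 0.625, not 0.5: the stopping rule biases the MLE up
```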
Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?
- Violates likelihood principle
- Violates stopping rule principle
- Violates common sense
Confidence intervals
An interval (θmin(D), θmax(D)) is a 95% CI for θ if θ lies inside this interval 95% of the time across repeated draws D ~ P(·|θ).
This does not mean P(θ ∈ CI | D) = 0.95!
Source: MacKay, section 37.3
Example
Draw two integers x1, x2 independently, each equal to θ or θ+1 with probability 1/2.
If θ = 39, we would expect (x1, x2) to be (39,39), (39,40), (40,39) or (40,40), each with probability 1/4.
Define the confidence interval as CI(D) = (min(x1,x2), min(x1,x2)).
e.g. (x1,x2) = (40,39) gives CI = (39,39). 75% of the time, this CI will contain the true θ.
CIs violate common sense
If (x1,x2) = (39,39), then CI = (39,39) at level 75%. But clearly P(θ=39|D) = P(θ=38|D) = 0.5.
If (x1,x2) = (39,40), then CI = (39,39), but now clearly P(θ=39|D) = 1.0.
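A quick simulation confirms the 75% frequentist coverage, even though the interval is absurd on a case-by-case basis. This sketch assumes each observation equals θ or θ+1 with probability 1/2, which matches the slide’s numbers:

```python
import random

def ci(x1, x2):
    """The degenerate interval (m, m) with m = min(x1, x2)."""
    m = min(x1, x2)
    return (m, m)

rng = random.Random(0)
theta, trials = 39, 100_000
hits = 0
for _ in range(trials):
    x1 = theta + rng.randint(0, 1)   # each observation is theta or theta + 1
    x2 = theta + rng.randint(0, 1)
    if ci(x1, x2) == (theta, theta):
        hits += 1
print(hits / trials)   # close to 0.75: a valid 75% CI despite the silly cases
```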
What’s wrong with the classical approach?
Violates likelihood principle
Violates stopping rule principle
Violates common sense
What’s right about the Bayesian approach?
Simple and natural.
An optimal mechanism for reasoning under uncertainty.
A generalization of Aristotelian logic that reduces to deductive logic if our hypotheses are either true or false.
Supports interesting (human-like) kinds of learning.
Bayesian humor “A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.”