Kevin Murphy UBC CS & Stats 9 February 2005


1 Why I am a Bayesian (and why you should become one, too), or: Classical statistics considered harmful
Kevin Murphy, UBC CS & Stats, 9 February 2005

2 Where does the title come from?
“Why I am not a Bayesian”, Glymour, 1981
“Why Glymour is a Bayesian”, Rosenkrantz, 1983
“Why isn’t everyone a Bayesian?”, Efron, 1986
“Bayesianism and causality, or, why I am only a half-Bayesian”, Pearl, 2001
Many other such philosophical essays…

3 Frequentist vs Bayesian
Frequentist:
  Prob = objective relative frequencies
  Params are fixed unknown constants, so cannot write e.g. P(θ=0.5|D)
  Estimators should be good when averaged across many trials
Bayesian:
  Prob = degrees of belief (uncertainty)
  Can write P(anything|D)
  Estimators should be good for the available data
Source: “All of Statistics”, Larry Wasserman

4 Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?

5 Coin flipping
HHTHT
HHHHH
What process produced these sequences?
The following slides are from Tenenbaum & Griffiths

6 Hypotheses in coin flipping
Describe processes by which D could be generated
D = HHTHT
Fair coin, P(H) = 0.5
Coin with P(H) = p
Markov model
Hidden Markov model
...
(statistical models)

7 Hypotheses in coin flipping
Describe processes by which D could be generated
D = HHTHT
Fair coin, P(H) = 0.5
Coin with P(H) = p
Markov model
Hidden Markov model
...
(generative models)

8 Representing generative models
Graphical model notation: Pearl (1988), Jordan (1998)
Variables are nodes, edges indicate dependency
Directed edges show causal process of data generation
[Figure: graphical models over nodes d1 d2 d3 d4 for the fair coin, P(H) = 0.5, and for the Markov model; the data HHTHT are labeled d1 d2 d3 d4 d5]

9 Models with latent structure
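Not all nodes in a graphical model need to be observed
Some variables reflect latent structure, used in generating D but unobserved
[Figure: the P(H) = p model, with latent node p and observed nodes d1 d2 d3 d4; the hidden Markov model, with latent states s1 s2 s3 s4 and observed nodes d1 d2 d3 d4; the data HHTHT are labeled d1 d2 d3 d4 d5]
How do we select the “best” model?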

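10 Bayes’ rule
$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{\sum_{H'} P(D \mid H')\,P(H')}$$
Posterior probability: P(H|D)
Likelihood: P(D|H)
Prior probability: P(H)
Denominator: sum over space of hypotheses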

11 The origin of Bayes’ rule
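A simple consequence of using probability to represent degrees of belief
For any two random variables A and B:
$$P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A) \quad\Rightarrow\quad P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$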

12 Why represent degrees of belief with probabilities?
Good statistics: consistency, and worst-case error bounds.
Cox Axioms: necessary to cohere with common sense.
“Dutch Book” + Survival of the Fittest: if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord.
Provides a theory of incremental learning: a common currency for combining prior knowledge and the lessons of experience.

13 Hypotheses in Bayesian inference
Hypotheses H refer to processes that could have generated the data D
Bayesian inference provides a distribution over these hypotheses, given D
P(D|H) is the probability of D being generated by the process identified by H
Hypotheses H are mutually exclusive: only one process could have generated D

14 Coin flipping
Comparing two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
Comparing simple and complex hypotheses: P(H) = 0.5 vs. P(H) = p

15 Coin flipping
Comparing two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
Comparing simple and complex hypotheses: P(H) = 0.5 vs. P(H) = p

16 Comparing two simple hypotheses
Contrast simple hypotheses:
H1: “fair coin”, P(H) = 0.5
H2: “always heads”, P(H) = 1.0
Bayes’ rule: $P(H \mid D) = P(D \mid H)\,P(H) / \sum_{H'} P(D \mid H')\,P(H')$
With two hypotheses, use odds form

17 Bayes’ rule in odds form
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
Posterior odds = Bayes factor (likelihood ratio) × prior odds

18 Data = HHTHT
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
D: HHTHT
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 0, P(H2) = 1/1000
P(H1|D) / P(H2|D) = infinity

19 Data = HHHHH
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
D: HHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000
P(H1|D) / P(H2|D) ≈ 30

20 Data = HHHHHHHHHH
$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
D: HHHHHHHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^10, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000
P(H1|D) / P(H2|D) ≈ 1
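
The three odds calculations on slides 18 to 20 can be reproduced with a short Python sketch. The function names are illustrative and not part of the original slides; exact fractions are used so the results can be compared with the slide values directly.

```python
from fractions import Fraction

def likelihood(data, p_heads):
    """Probability of an exact H/T sequence for a coin with P(H) = p_heads."""
    prob = Fraction(1)
    for flip in data:
        prob *= p_heads if flip == "H" else 1 - p_heads
    return prob

def posterior_odds(data, prior_h1=Fraction(999, 1000)):
    """Posterior odds P(H1|D) / P(H2|D); returns None when the odds are infinite."""
    like_h1 = likelihood(data, Fraction(1, 2))  # H1: fair coin
    like_h2 = likelihood(data, Fraction(1))     # H2: always heads
    prior_h2 = 1 - prior_h1
    if like_h2 == 0:
        return None  # a single tail rules out "always heads"
    return (like_h1 * prior_h1) / (like_h2 * prior_h2)

for d in ("HHTHT", "HHHHH", "HHHHHHHHHH"):
    odds = posterior_odds(d)
    print(d, "infinite" if odds is None else float(odds))
```

The exact odds are 999/32 for HHHHH and 999/1024 for ten heads, which the slides round to 30 and 1.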

21 Coin flipping
Comparing two simple hypotheses: P(H) = 0.5 vs. P(H) = 1.0
Comparing simple and complex hypotheses: P(H) = 0.5 vs. P(H) = p

22 Comparing simple and complex hypotheses
[Figure: the P(H) = p model (latent node p, observed nodes d1 d2 d3 d4) vs. the fair coin model, P(H) = 0.5 (observed nodes d1 d2 d3 d4)]
Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?

23 Comparing simple and complex hypotheses
P(H) = p is more complex than P(H) = 0.5 in two ways:
P(H) = 0.5 is a special case of P(H) = p
for any observed sequence X, we can choose p such that X is at least as probable as if P(H) = 0.5

24 Comparing simple and complex hypotheses
[Figure: probability plot]

25 Comparing simple and complex hypotheses
[Figure: probability plot; D = HHHHH, p = 1.0]

26 Comparing simple and complex hypotheses
[Figure: probability plot; D = HHTHT, p = 0.6]

27 Comparing simple and complex hypotheses
P(H) = p is more complex than P(H) = 0.5 in two ways:
P(H) = 0.5 is a special case of P(H) = p
for any observed sequence X, we can choose p such that X is at least as probable as if P(H) = 0.5
How can we deal with this?
frequentist: hypothesis testing
information theorist: minimum description length
Bayesian: just use probability theory!
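
The second point, that a free p can always match or beat the fair coin on the observed sequence, can be checked with a minimal Python sketch. The helper names are illustrative; the best-fitting p is taken to be the fraction of heads, which is the maximum likelihood value.

```python
def sequence_likelihood(data, p_heads):
    """Probability of an exact H/T sequence for a coin with P(H) = p_heads."""
    n_heads = data.count("H")
    n_tails = len(data) - n_heads
    return p_heads ** n_heads * (1 - p_heads) ** n_tails

def best_p(data):
    """Maximum likelihood value of p: the observed fraction of heads."""
    return data.count("H") / len(data)

for d in ("HHTHT", "HHHHH"):
    p_hat = best_p(d)
    print(d, p_hat, sequence_likelihood(d, p_hat), sequence_likelihood(d, 0.5))
```

For HHTHT the best fit is p = 0.6 and for HHHHH it is p = 1.0, matching the values shown on slides 25 and 26.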

28 Comparing simple and complex hypotheses
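$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
Computing P(D|H1) is easy: $P(D \mid H_1) = 1/2^N$
Compute P(D|H2) by averaging over p: $P(D \mid H_2) = \int_0^1 P(D \mid p)\,P(p)\,dp$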

29 Comparing simple and complex hypotheses
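$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
Computing P(D|H1) is easy: $P(D \mid H_1) = 1/2^N$
Compute P(D|H2) by averaging over p: $P(D \mid H_2) = \int_0^1 P(D \mid p)\,P(p)\,dp$
Marginal likelihood: P(D|H2); likelihood: P(D|p); prior: P(p)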

30 Likelihood and prior
Likelihood: $P(D \mid p) = p^{N_H} (1-p)^{N_T}$
NH: number of heads
NT: number of tails
Prior: $P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1}$ ?

31 A simple method of specifying priors
Imagine some fictitious trials, reflecting a set of previous experiences
strategy often used with neural networks
e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair
In fact, this is a sensible statistical idea...

32 Likelihood and prior
Likelihood: $P(D \mid p) = p^{N_H} (1-p)^{N_T}$
NH: number of heads
NT: number of tails
Prior: $P(p) \propto p^{F_H - 1} (1-p)^{F_T - 1}$, i.e. Beta(FH,FT)
FH: fictitious observations of heads
FT: fictitious observations of tails
(pseudo-counts)

33 Posterior ∝ prior x likelihood
$$P(p \mid D) \propto p^{N_H + F_H - 1} (1-p)^{N_T + F_T - 1}$$
Same form!

34 Conjugate priors
Exist for many standard distributions
formula for exponential family conjugacy
Define prior in terms of fictitious observations
Beta is conjugate to Bernoulli (coin-flipping)
[Figure: Beta priors for FH = FT = 1, FH = FT = 3, FH = FT = 1000]
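
As a rough illustration of conjugacy, the posterior for the coin is obtained by adding the real counts to the fictitious ones. The sketch below uses hypothetical helper names and, as an assumed example, the data HHHHH (5 heads, 0 tails) with the three prior strengths named on the slide.

```python
def beta_posterior(n_heads, n_tails, f_heads, f_tails):
    """Beta(FH, FT) prior plus Bernoulli data gives a Beta(NH + FH, NT + FT) posterior."""
    return n_heads + f_heads, n_tails + f_tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

n_heads, n_tails = 5, 0  # assumed example data: HHHHH
for f in (1, 3, 1000):
    a, b = beta_posterior(n_heads, n_tails, f, f)
    print(f, (a, b), beta_mean(a, b))
```

With FH = FT = 1 the posterior mean moves to 6/7, while with FH = FT = 1000 it barely leaves 0.5, which is the sense in which a large pseudo-count encodes a strong expectation of fairness.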

35 Normalizing constants
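Prior: $P(p) = \frac{1}{B(F_H, F_T)}\, p^{F_H - 1} (1-p)^{F_T - 1}$
Normalizing constant for Beta distribution: $B(a, b) = \int_0^1 p^{a-1} (1-p)^{b-1}\,dp = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}$
Posterior: $P(p \mid D) = \frac{1}{B(N_H + F_H,\, N_T + F_T)}\, p^{N_H + F_H - 1} (1-p)^{N_T + F_T - 1}$
Hence marginal likelihood is $P(D \mid H_2) = \frac{B(N_H + F_H,\, N_T + F_T)}{B(F_H, F_T)}$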

36 Comparing simple and complex hypotheses
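$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$
Computing P(D|H1) is easy: $P(D \mid H_1) = 1/2^N$
Compute P(D|H2) by averaging over p: $P(D \mid H_2) = \int_0^1 P(D \mid p)\,P(p)\,dp$
P(D|H1): likelihood for H1
P(D|H2): marginal likelihood (“evidence”) for H2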
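
A minimal Python sketch of this comparison, assuming a Beta(FH, FT) prior on p so that the average over p has the closed form B(NH + FH, NT + FT) / B(FH, FT). The function names and the default uniform prior (FH = FT = 1) are illustrative choices, not taken from the slides.

```python
from math import exp, lgamma

def log_beta(a, b):
    """log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_likelihood(data, f_heads=1.0, f_tails=1.0):
    """P(D|H2): likelihood of the sequence averaged over p ~ Beta(f_heads, f_tails)."""
    n_heads = data.count("H")
    n_tails = len(data) - n_heads
    return exp(log_beta(n_heads + f_heads, n_tails + f_tails) - log_beta(f_heads, f_tails))

def bayes_factor_fair_vs_unknown(data, f_heads=1.0, f_tails=1.0):
    """P(D|H1) / P(D|H2), where H1 is the fair coin and H2 is P(H) = p."""
    like_h1 = 0.5 ** len(data)
    return like_h1 / marginal_likelihood(data, f_heads, f_tails)

for d in ("HHTHT", "HHHHH"):
    print(d, bayes_factor_fair_vs_unknown(d))
```

Under the uniform prior the factor is 1.875 for HHTHT (favoring the fair coin) and 0.1875 for HHHHH (favoring the flexible model), so the flexible hypothesis only wins when the data actually call for it.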

37 Marginal likelihood for H1 and H2
[Figure: probability plot]
Marginal likelihood is an average over all values of p

38 Sensitivity to hyper-parameters

39 Bayesian model selection
Simple and complex hypotheses can be compared directly using Bayes’ rule
requires summing over latent variables
Complex hypotheses are penalized for their greater flexibility: “Bayesian Occam’s razor”
Maximum likelihood cannot be used for model selection (always prefers hypothesis with largest number of parameters)

40 Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?

41 Example: Belgian euro-coins
A Belgian euro spun N=250 times came up heads X=140 times.
“It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%” – Barry Blight, LSE (reported in Guardian, 2002)
Source: MacKay exercise 3.15

42 Classical hypothesis testing
Null hypothesis H0, e.g. θ = 0.5 (unbiased coin)
For classical analysis, don’t need to specify alternative hypothesis, but later we will use H1: θ ≠ 0.5
Need a decision rule that maps data D to accept/reject of H0.
Define a scalar measure of deviance d(D) from the null hypothesis, e.g., Nh or χ²

43 P-values
Define p-value of threshold τ as $\mathrm{pval}(\tau) = P(d(D) \ge \tau \mid H_0)$
Intuitively, p-value of data is probability of getting data at least that extreme given H0

44 P-values
Define p-value of threshold τ as $\mathrm{pval}(\tau) = P(d(D) \ge \tau \mid H_0)$
Intuitively, p-value of data is probability of getting data at least that extreme given H0
Usually choose τ so that false rejection rate of H0 is below significance level α = 0.05

45 P-values
Define p-value of threshold τ as $\mathrm{pval}(\tau) = P(d(D) \ge \tau \mid H_0)$
Intuitively, p-value of data is probability of getting data at least that extreme given H0
Usually choose τ so that false rejection rate of H0 is below significance level α = 0.05
Often use asymptotic approximation to distribution of d(D) under H0 as N → ∞

46 P-value for euro coins
N = 250 trials, X=140 heads
P-value is “less than 7%”:
Pval=(1-binocdf(139,n,0.5)) + binocdf(110,n,0.5)
If N=250 and X=141, the p-value falls below 0.05, so we can reject the null hypothesis at the significance level of 5%.
This does not mean P(H0|D)=0.07!
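
The two-sided tail sum in the slide's MATLAB expression can be sketched in plain Python as follows. Here `binom_cdf` is a hand-written stand-in for `binocdf`, and `two_sided_p_value` is an illustrative name; neither appears in the original slides.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); plays the role of MATLAB's binocdf."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))

def two_sided_p_value(x, n, p0=0.5):
    """P(X >= x) + P(X <= n - x) under H0: p = p0, intended for x above n / 2."""
    upper = 1.0 - binom_cdf(x - 1, n, p0)  # at least x heads
    lower = binom_cdf(n - x, n, p0)        # at most n - x heads (equally extreme)
    return upper + lower

n = 250
for x in (140, 141):
    print(x, two_sided_p_value(x, n))
```

For X = 140 this reproduces the "less than 7%" figure quoted by the slide, and one extra head lowers the p-value further.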

47 Bayesian analysis of euro-coin
Assume P(H0)=P(H1)=0.5
Assume P(p) ~ Beta(α,α)
Setting α=1 yields a uniform (non-informative) prior.

48 Bayesian analysis of euro-coin
If α=1, the Bayes factor B = P(D|H1)/P(D|H0) is below 1, so H0 (unbiased) is (slightly) more probable than H1 (biased).
By varying α over a large range, the best we can do is make B=1.9, which does not strongly support the biased coin hypothesis.
Other priors yield similar results.
Bayesian analysis contradicts classical analysis.
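
A hedged Python sketch of this Bayes factor, assuming B is defined as P(D|H1)/P(D|H0) with p ~ Beta(α, α) under H1 and p = 0.5 under H0. The function name and the grid of α values are illustrative choices rather than anything specified on the slides.

```python
from math import exp, lgamma, log

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def euro_bayes_factor(alpha, n_heads=140, n_tails=110):
    """B = P(D|H1) / P(D|H0) for the euro-coin data under a Beta(alpha, alpha) prior."""
    log_m1 = log_beta(n_heads + alpha, n_tails + alpha) - log_beta(alpha, alpha)
    log_m0 = (n_heads + n_tails) * log(0.5)
    return exp(log_m1 - log_m0)

# Scan alpha on a logarithmic grid from 0.01 to 10000 (an arbitrary illustrative range).
alphas = [10 ** (k / 10) for k in range(-20, 41)]
best_alpha = max(alphas, key=euro_bayes_factor)

print("uniform prior:", euro_bayes_factor(1.0))
print("largest B on grid:", euro_bayes_factor(best_alpha), "at alpha =", best_alpha)
```

Even the prior chosen after the fact to favor the biased-coin hypothesis yields only weak support for it.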

49 Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?

50 Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?
Violates likelihood principle
Violates stopping rule principle
Violates common sense

51 The likelihood principle
In order to choose between hypotheses H0 and H1 given observed data, one should ask how likely the observed data are; do not ask questions about data that we might have observed but did not.
This principle can be proved from two simpler principles called conditionality and sufficiency.

52 Frequentist statistics violates the likelihood principle
“The use of P-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred.” – Jeffreys, 1961

53 Another example
Suppose X ~ N(θ,σ²); we observe x=3
Compare H0: θ=0 with H1: θ>0
P-value = P(X ≥ 3|H0) = 0.001 (for σ=1), so reject H0
Bayesian approach: update P(θ|X) using conjugate analysis; compute Bayes factor to compare H0 and H1

54 When are P-values valid?
Suppose X ~ N(θ,σ²); we observe X=x.
One-sided hypothesis test: H0: θ ≤ 0 vs H1: θ > 0
If P(θ) ∝ 1, then P(θ|x) ~ N(x,σ²), so the posterior probability P(θ ≤ 0|x) is the same as the P-value in this case, since the Gaussian is symmetric in its arguments
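
The agreement between the one-sided p-value and the flat-prior posterior probability of H0 can be checked numerically. This is a sketch assuming σ = 1 and x = 3 (the values of the previous example); the function names are illustrative.

```python
from math import erfc, sqrt

def normal_upper_tail(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2))

def p_value(x, sigma=1.0):
    """Frequentist tail probability P(X >= x | theta = 0)."""
    return normal_upper_tail(x / sigma)

def posterior_prob_h0(x, sigma=1.0):
    """Bayesian P(theta <= 0 | x) under a flat prior, where theta | x ~ N(x, sigma^2)."""
    # P(theta <= 0 | x) is the normal CDF evaluated at (0 - x) / sigma.
    return 1.0 - normal_upper_tail((0.0 - x) / sigma)

print(p_value(3.0), posterior_prob_h0(3.0))
```

Both numbers are about 0.00135, which the earlier slide rounds to 0.001. The match depends on the flat prior and the one-sided null; it does not carry over to a point null such as θ = 0.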

55 Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?
Violates likelihood principle
Violates stopping rule principle
Violates common sense

56 Stopping rule principle
Inferences you make should only depend on the observed data, not the reasons why this data was collected.
If you look at your data to decide when to stop collecting, this should not change any conclusions you draw.
Follows from likelihood principle.

57 Frequentist statistics violates stopping rule principle
Observe D=TTTHTTTTHTTH. Is there evidence of bias (Pt > Ph)?
Let X=3 heads be observed random variable and N=12 trials be fixed constant. Define H0: Ph=0.5.
Then, at the 5% level, there is no significant evidence of bias:
$$P(X \le 3 \mid N = 12, H_0) = \sum_{x=0}^{3} \binom{12}{x}\, 0.5^{12} \approx 0.073$$

58 Frequentist statistics violates stopping rule principle
Suppose the data was generated by tossing coins until we got X=3 heads.
Now X=3 heads is a fixed constant and N=12 is a random variable.
Now there is significant evidence of bias!
$$P(N \ge 12 \mid X = 3, H_0) = \sum_{n=12}^{\infty} \binom{n-1}{2}\, 0.5^{n} \approx 0.033$$
First n-1 trials contain x-1 heads; last trial always heads
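
The two p-values for the same 12 tosses can be computed exactly with a small Python sketch. The function names are illustrative; the second function uses the fact that needing at least n tosses to reach x heads is the same event as seeing at most x-1 heads in the first n-1 tosses.

```python
from fractions import Fraction
from math import comb

def p_value_fixed_n(x, n):
    """P(X <= x) when the number of tosses n is fixed and X ~ Binomial(n, 1/2)."""
    return sum(Fraction(comb(n, k), 2**n) for k in range(x + 1))

def p_value_fixed_x(x, n):
    """P(N >= n) when tossing a fair coin until x heads have been seen."""
    # N >= n exactly when the first n - 1 tosses contain at most x - 1 heads.
    return sum(Fraction(comb(n - 1, k), 2 ** (n - 1)) for k in range(x))

print(float(p_value_fixed_n(3, 12)))  # stopping rule: toss 12 times
print(float(p_value_fixed_x(3, 12)))  # stopping rule: toss until 3 heads
```

The exact values are 299/4096 (about 0.073) and 67/2048 (about 0.033), so the same data fall on opposite sides of the 5% threshold depending on the experimenter's stopping rule.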

59 Ignoring stopping criterion can mislead classical estimators
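Let Xi ~ Bernoulli(θ)
Max lik. estimator: $\hat\theta = \frac{1}{N} \sum_i X_i$
MLE is unbiased: $E[\hat\theta] = \theta$
Toss coin; if head, stop, else toss second coin. P(H)=θ, P(TH)=θ(1-θ), P(TT)=(1-θ)².
Now MLE is biased! $E[\hat\theta] = \theta \cdot 1 + \theta(1-\theta) \cdot \tfrac{1}{2} + (1-\theta)^2 \cdot 0 = \theta + \tfrac{1}{2}\theta(1-\theta) \ne \theta$
There are many classical rules for assessing significance when complex stopping rules are used.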
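
The bias introduced by the stopping rule can be verified by enumerating outcomes. The sketch below compares a fixed design of two tosses with the stop-at-first-head design; the function names are illustrative, and heads is coded as 1.

```python
from itertools import product

def expected_mle_fixed_tosses(theta, n_tosses=2):
    """E[fraction of heads] when exactly n_tosses are made."""
    total = 0.0
    for outcome in product((1, 0), repeat=n_tosses):
        prob = 1.0
        for x in outcome:
            prob *= theta if x == 1 else 1 - theta
        total += prob * sum(outcome) / n_tosses
    return total

def expected_mle_stop_at_first_head(theta):
    """E[fraction of heads] when tossing stops after a head, or after two tosses."""
    outcomes = {
        (1,): theta,                   # H: estimate 1
        (0, 1): (1 - theta) * theta,   # TH: estimate 1/2
        (0, 0): (1 - theta) ** 2,      # TT: estimate 0
    }
    return sum(prob * sum(seq) / len(seq) for seq, prob in outcomes.items())

theta = 0.5
print(expected_mle_fixed_tosses(theta), expected_mle_stop_at_first_head(theta))
```

For θ = 0.5 the fixed design gives an expected estimate of 0.5, while the stopping design gives 0.625.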

60 Outline
Hypothesis testing – Bayesian approach
Hypothesis testing – classical approach
What’s wrong with the classical approach?
Violates likelihood principle
Violates stopping rule principle
Violates common sense

61 Confidence intervals
An interval (θmin(D), θmax(D)) is a 95% CI if θ lies inside this interval 95% of the time across repeated draws D~P(.|θ)
This does not mean P(θ ∈ CI|D) = 0.95!
MacKay sec 37.3

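62 Example
Draw 2 integers from $P(x \mid \theta) = \tfrac{1}{2}$ if $x = \theta$, $\tfrac{1}{2}$ if $x = \theta + 1$, and 0 otherwise
If θ=39, we would expect (x1,x2) to be (39,39), (39,40), (40,39) or (40,40), each with probability 1/4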

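63 Example
If θ=39, we would expect (x1,x2) to be (39,39), (39,40), (40,39) or (40,40), each with probability 1/4
Define confidence interval as [θmin(D), θmax(D)] = [min(x1,x2), min(x1,x2)]
e.g. (x1,x2)=(40,39), CI=(39,39)
75% of the time, this will contain the true θ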

64 CIs violate common sense
If θ=39, we would expect (x1,x2) to be (39,39), (39,40), (40,39) or (40,40), each with probability 1/4
If (x1,x2)=(39,39), then CI=(39,39) at level 75%. But clearly P(θ=39|D)=P(θ=38|D)=0.5
If (x1,x2)=(39,40), then CI=(39,39), but clearly P(θ=39|D)=1.0.
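
The 75% coverage and the two posteriors can be checked by enumeration. This sketch rests on assumptions inferred from the worked numbers rather than stated explicitly in the transcript: each draw equals θ or θ+1 with probability 1/2, the interval is (min(x1,x2), min(x1,x2)), and the prior over integer θ is flat. All function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def confidence_interval(x1, x2):
    """Degenerate interval at the smaller of the two draws (assumed definition)."""
    low = min(x1, x2)
    return (low, low)

def coverage(theta):
    """Fraction of the four equally likely draws whose interval contains theta."""
    draws = list(product((theta, theta + 1), repeat=2))
    hits = sum(1 for x1, x2 in draws if confidence_interval(x1, x2) == (theta, theta))
    return Fraction(hits, len(draws))

def posterior(x1, x2):
    """P(theta | x1, x2) under a flat prior over integers (an assumption)."""
    # Every theta consistent with both draws has the same likelihood, 1/4.
    candidates = [t for t in range(min(x1, x2) - 1, max(x1, x2) + 1)
                  if all(x in (t, t + 1) for x in (x1, x2))]
    return {t: Fraction(1, len(candidates)) for t in candidates}

print(coverage(39))        # long-run coverage of the interval
print(posterior(39, 39))   # two values of theta remain possible
print(posterior(39, 40))   # theta is pinned down exactly
```

The interval has 75% coverage across repeated draws, yet after seeing the data the probability that it contains θ is either 50% or 100%, never 75%.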

65 What’s wrong with the classical approach?
Violates likelihood principle Violates stopping rule principle Violates common sense

66 What’s right about the Bayesian approach?
Simple and natural
Optimal mechanism for reasoning under uncertainty
Generalization of Aristotelian logic that reduces to deductive logic if our hypotheses are either true or false
Supports interesting (human-like) kinds of learning

67

68 Bayesian humor “A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.”

