Kevin Murphy UBC CS & Stats 9 February 2005


Why I am a Bayesian (and why you should become one, too) or Classical statistics considered harmful

Where does the title come from? “Why I am not a Bayesian”, Glymour, 1981 “Why Glymour is a Bayesian”, Rosenkrantz, 1983 “Why isn’t everyone a Bayesian?”, Efron, 1986 “Bayesianism and causality, or, why I am only a half-Bayesian”, Pearl, 2001 Many other such philosophical essays…

Frequentist vs Bayesian
Frequentist: Prob = objective relative frequencies. Params are fixed unknown constants, so we cannot write e.g. P(θ=0.5|D). Estimators should be good when averaged across many trials.
Bayesian: Prob = degrees of belief (uncertainty). Can write P(anything|D). Estimators should be good for the available data.
Source: “All of Statistics”, Larry Wasserman

Outline
– Hypothesis testing – Bayesian approach
– Hypothesis testing – classical approach
– What’s wrong with the classical approach?

Coin flipping
HHTHT
HHHHH
What process produced these sequences? (The following slides are from Tenenbaum & Griffiths.)

Hypotheses in coin flipping
Describe processes by which D could be generated. D = HHTHT
– Fair coin, P(H) = 0.5
– Coin with P(H) = p
– Markov model
– Hidden Markov model
– ...
These are statistical (generative) models.

Representing generative models
Graphical model notation (Pearl 1988; Jordan 1998): variables are nodes, edges indicate dependency, and directed edges show the causal process of data generation.
[Figures: independent nodes d1 … d5 for the fair coin, P(H) = 0.5; a chain d1 → d2 → … → d5 for the Markov model; both generating HHTHT.]

Models with latent structure
Not all nodes in a graphical model need to be observed; some variables reflect latent structure, used in generating D but unobserved.
[Figures: a latent parameter p feeding each di (coin with P(H) = p); a hidden Markov model with hidden states s1 → s2 → … emitting d1 … d5 = HHTHT.]
How do we select the “best” model?

Bayes’ rule
P(h|D) = P(D|h) P(h) / Σ_{h′∈H} P(D|h′) P(h′)
Posterior probability = likelihood × prior probability, normalized by a sum over the space of hypotheses.

The origin of Bayes’ rule
A simple consequence of using probability to represent degrees of belief. For any two random variables:
P(A, B) = P(A|B) P(B) = P(B|A) P(A), hence P(A|B) = P(B|A) P(A) / P(B)

Why represent degrees of belief with probabilities?
– Good statistics: consistency, and worst-case error bounds.
– Cox axioms: necessary to cohere with common sense.
– “Dutch Book” + survival of the fittest: if your beliefs do not accord with the laws of probability, you can always be out-gambled by someone whose beliefs do.
– Provides a theory of incremental learning: a common currency for combining prior knowledge and the lessons of experience.

Hypotheses in Bayesian inference Hypotheses H refer to processes that could have generated the data D Bayesian inference provides a distribution over these hypotheses, given D P(D|H) is the probability of D being generated by the process identified by H Hypotheses H are mutually exclusive: only one process could have generated D

Coin flipping Comparing two simple hypotheses P(H) = 0.5 vs. P(H) = 1.0 Comparing simple and complex hypotheses P(H) = 0.5 vs. P(H) = p

Comparing two simple hypotheses
Contrast simple hypotheses: H1: “fair coin”, P(H) = 0.5; H2: “always heads”, P(H) = 1.0.
Bayes’ rule: with two hypotheses, use the odds form.

Bayes’ rule in odds form
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
Posterior odds = Bayes factor (likelihood ratio) × prior odds

Data = HHTHT
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 0, P(H2) = 1/1000
P(H1|D) / P(H2|D) = infinity

Data = HHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000
P(H1|D) / P(H2|D) ≈ 30

Data = HHHHHHHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^10, P(H1) = 999/1000
P(D|H2) = 1, P(H2) = 1/1000
P(H1|D) / P(H2|D) ≈ 1
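The three posterior-odds computations on these slides can be checked with a few lines of Python (a sketch, not from the slides; the priors 999/1000 and 1/1000 are the ones used above):

```python
# Posterior odds for H1 ("fair coin") vs H2 ("always heads"),
# using the priors from the slides: P(H1) = 999/1000, P(H2) = 1/1000.
def posterior_odds(data, prior1=999/1000, prior2=1/1000):
    # Likelihood under H1: each flip has probability 1/2.
    lik1 = (1/2) ** len(data)
    # Likelihood under H2: 1 if all heads, else 0.
    lik2 = 1.0 if all(c == 'H' for c in data) else 0.0
    if lik2 == 0.0:
        return float('inf')  # a single tail rules H2 out entirely
    return (lik1 * prior1) / (lik2 * prior2)

print(posterior_odds("HHTHT"))        # inf: one tail falsifies "always heads"
print(posterior_odds("HHHHH"))        # ~31.2: still favours the fair coin
print(posterior_odds("HHHHHHHHHH"))   # ~0.98: now roughly even
```

Note how each extra head halves the odds in favour of the fair coin: the prior of 999:1 is eroded by a Bayes factor of 1/2 per flip.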

Coin flipping Comparing two simple hypotheses P(H) = 0.5 vs. P(H) = 1.0 Comparing simple and complex hypotheses P(H) = 0.5 vs. P(H) = p

Comparing simple and complex hypotheses
[Figures: the fair-coin model (independent d1 … d4) vs. the model with a latent parameter p feeding each di.]
Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?

Comparing simple and complex hypotheses
P(H) = p is more complex than P(H) = 0.5 in two ways:
– P(H) = 0.5 is a special case of P(H) = p
– for any observed sequence X, we can choose p such that X is more probable than under P(H) = 0.5

Comparing simple and complex hypotheses
[Figures: probability of the observed sequence as a function of p. For HHHHH the maximizing value is p = 1.0; for HHTHT it is p = 0.6.]

Comparing simple and complex hypotheses
P(H) = p is more complex than P(H) = 0.5 in two ways:
– P(H) = 0.5 is a special case of P(H) = p
– for any observed sequence X, we can choose p such that X is more probable than under P(H) = 0.5
How can we deal with this?
– frequentist: hypothesis testing
– information theorist: minimum description length
– Bayesian: just use probability theory!

Comparing simple and complex hypotheses
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
Computing P(D|H1) is easy: P(D|H1) = 1/2^N
Compute P(D|H2) by averaging over p (the marginal likelihood):
P(D|H2) = ∫₀¹ P(D|p) P(p) dp

Likelihood and prior
Likelihood: P(D|p) = p^{N_H} (1−p)^{N_T} (N_H: number of heads; N_T: number of tails)
Prior: P(p) ∝ p^{F_H−1} (1−p)^{F_T−1} — but what should F_H and F_T be?

A simple method of specifying priors
Imagine some fictitious trials, reflecting a set of previous experiences — a strategy often used with neural networks. E.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair. In fact, this is a sensible statistical idea...

Likelihood and prior
Likelihood: P(D|p) = p^{N_H} (1−p)^{N_T} (N_H: number of heads; N_T: number of tails)
Prior: P(p) ∝ p^{F_H−1} (1−p)^{F_T−1} (F_H, F_T: fictitious observations of heads and tails — pseudo-counts). This is the Beta(F_H, F_T) distribution.

Posterior ∝ prior × likelihood
P(p|D) ∝ p^{N_H+F_H−1} (1−p)^{N_T+F_T−1}
Same form as the prior!
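Because the posterior has the same functional form as the prior, the update is just pseudo-count addition. A minimal sketch (the helper `update` is mine, not from the slides):

```python
# Beta-Bernoulli conjugate update via pseudo-counts: the posterior over p
# is Beta(N_H + F_H, N_T + F_T) -- the same family as the Beta prior.
def update(data, F_H=1, F_T=1):
    N_H = data.count('H')
    N_T = data.count('T')
    return N_H + F_H, N_T + F_T  # parameters of the Beta posterior

a, b = update("HHTHT", F_H=1, F_T=1)        # uniform prior
print(a, b)              # 4 3
print(a / (a + b))       # posterior mean of p = 4/7 ~ 0.571

a, b = update("HHTHT", F_H=1000, F_T=1000)  # strong "fair coin" prior
print(a / (a + b))       # ~0.5002: five flips barely move the belief
```

This makes the role of the fictitious trials concrete: 2000 pseudo-counts swamp 5 real observations.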

Conjugate priors
Exist for many standard distributions (there is a general formula for exponential-family conjugacy). Define the prior in terms of fictitious observations. The Beta is conjugate to the Bernoulli (coin flipping).
[Figures: Beta densities for F_H = F_T = 1 (uniform), F_H = F_T = 3, and F_H = F_T = 1000 (sharply peaked at 0.5).]

Normalizing constants
Prior: P(p) = p^{F_H−1} (1−p)^{F_T−1} / B(F_H, F_T), where B(a, b) = Γ(a)Γ(b)/Γ(a+b) is the normalizing constant for the Beta distribution.
Posterior: P(p|D) = p^{N_H+F_H−1} (1−p)^{N_T+F_T−1} / B(N_H+F_H, N_T+F_T)
Hence the marginal likelihood is P(D) = B(N_H+F_H, N_T+F_T) / B(F_H, F_T).
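The ratio of Beta normalizers is easy to compute in log space with the standard log-gamma function (a sketch; function names are mine):

```python
import math

# Marginal likelihood of a Bernoulli sequence under a Beta(F_H, F_T) prior:
# P(D) = B(N_H+F_H, N_T+F_T) / B(F_H, F_T), computed via lgamma for stability.
def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(N_H, N_T, F_H=1, F_T=1):
    return log_beta(N_H + F_H, N_T + F_T) - log_beta(F_H, F_T)

# D = HHTHT: N_H = 3, N_T = 2, uniform prior.
m2 = math.exp(log_marginal(3, 2))   # marginal likelihood under H2
m1 = 0.5 ** 5                       # P(D|H1), fair coin
print(m2, m1)   # ~0.0167 vs 0.03125: the fair coin wins on this short sequence
```

For HHTHT the Bayes factor m1/m2 = 60/32 ≈ 1.9 favours the simple hypothesis — the Bayesian Occam's razor of the next slide in action.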

Comparing simple and complex hypotheses
P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]
Likelihood for H1 is easy: P(D|H1) = 1/2^N
Marginal likelihood (“evidence”) for H2: average over p, P(D|H2) = ∫₀¹ P(D|p) P(p) dp

Marginal likelihood for H1 and H2
[Figure: probability of the data under each model as a function of p.] The marginal likelihood is an average over all values of p.

Sensitivity to hyper-parameters

Bayesian model selection Simple and complex hypotheses can be compared directly using Bayes’ rule requires summing over latent variables Complex hypotheses are penalized for their greater flexibility: “Bayesian Occam’s razor” Maximum likelihood cannot be used for model selection (always prefers hypothesis with largest number of parameters)

Outline
– Hypothesis testing – Bayesian approach
– Hypothesis testing – classical approach
– What’s wrong with the classical approach?

Example: Belgian euro-coins
A Belgian euro spun N = 250 times came up heads X = 140 times.
“It looks very suspicious to me. If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%” – Barry Blight, LSE (reported in The Guardian, 2002)
Source: MacKay, exercise 3.15

Classical hypothesis testing
Null hypothesis H0, e.g. θ = 0.5 (unbiased coin).
For a classical analysis we don’t need to specify an alternative hypothesis, but later we will use H1: θ ≠ 0.5.
Need a decision rule that maps data D to accept/reject of H0. Define a scalar measure of deviance d(D) from the null hypothesis, e.g. N_h or χ².

P-values
Define the rejection region R = {D : d(D) ≥ τ}. Intuitively, the p-value of the data is the probability, under H0, of getting data at least as extreme as what was observed: p(D) = P(d(D′) ≥ d(D) | H0).
Usually choose τ so that the false rejection rate of H0 is below the significance level α = 0.05.
Often use an asymptotic approximation to the distribution of d(D) under H0 as N → ∞.

P-value for euro coins
N = 250 trials, X = 140 heads. The p-value is “less than 7%”: in MATLAB, pval = (1 - binocdf(139, 250, 0.5)) + binocdf(110, 250, 0.5) ≈ 0.066.
If N = 250 and X = 141, pval = 0.0497, so we could reject the null hypothesis at the 5% significance level. But this does not mean P(H0|D) = 0.07!
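The same two-sided tail sum can be written without any stats toolbox (a sketch in Python rather than the slide's MATLAB; `two_sided_pval` is my name for it):

```python
from math import comb

# Two-sided p-value for X heads in N fair-coin spins under H0: theta = 0.5,
# i.e. P(X' >= X) + P(X' <= N - X).
def binom_pmf(k, n):
    return comb(n, k) * 0.5 ** n

def two_sided_pval(x, n):
    upper = sum(binom_pmf(k, n) for k in range(x, n + 1))
    lower = sum(binom_pmf(k, n) for k in range(0, n - x + 1))
    return upper + lower

print(two_sided_pval(140, 250))  # ~0.066: "less than 7%", but above 0.05
print(two_sided_pval(141, 250))  # ~0.0497: one more head crosses the 5% line
```

The brittleness of the 5% threshold is visible here: a single extra head flips the verdict.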

Bayesian analysis of euro-coin
Assume P(H0) = P(H1) = 0.5. Under H1, assume p ~ Beta(α, α); setting α = 1 yields a uniform (non-informative) prior.

Bayesian analysis of euro-coin
If α = 1, the Bayes factor B = P(D|H1)/P(D|H0) ≈ 0.48, so H0 (unbiased) is (slightly) more probable than H1 (biased).
By varying α over a large range, the best we can do is make B ≈ 1.9, which does not strongly support the biased-coin hypothesis. Other priors yield similar results.
The Bayesian analysis contradicts the classical analysis.
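The α = 1 Bayes factor can be verified directly: with a uniform prior on p, the marginal distribution of the head count X is uniform on {0, …, N}, so P(X|H1) = 1/(N+1). A sketch (not from the slides):

```python
import math

# Bayes factor B = P(X|H1)/P(X|H0) for the euro coin, uniform prior on p.
# Under H1 with a Beta(1,1) prior, the marginal of X is uniform: 1/(N+1).
N, X = 250, 140
log_p_h0 = (math.lgamma(N + 1) - math.lgamma(X + 1) - math.lgamma(N - X + 1)
            + N * math.log(0.5))        # log C(N, X) + N log(1/2)
p_h1 = 1 / (N + 1)
B = p_h1 / math.exp(log_p_h0)
print(B)   # ~0.48: the data slightly favour the unbiased coin
```

So the same data that the p-value calls "suspicious" give mild evidence for fairness.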

Outline
– Hypothesis testing – Bayesian approach
– Hypothesis testing – classical approach
– What’s wrong with the classical approach?
  – Violates likelihood principle
  – Violates stopping rule principle
  – Violates common sense

The likelihood principle
In order to choose between hypotheses H0 and H1 given observed data, one should ask how likely the observed data are; do not ask questions about data that we might have observed but did not, such as P(d(D′) ≥ d(D) | H0).
This principle can be proved from two simpler principles, called conditionality and sufficiency.

Frequentist statistics violates the likelihood principle “The use of P-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred.” – Jeffreys, 1961

Another example
Suppose X ~ N(θ, σ²); we observe x = 3.
Compare H0: θ = 0 with H1: θ > 0.
P-value = P(X ≥ 3 | H0) = 0.001, so reject H0.
Bayesian approach: update P(θ|X) using conjugate analysis; compute the Bayes factor to compare H0 and H1.

When are P-values valid?
Suppose X ~ N(θ, σ²); we observe X = x.
One-sided hypothesis test: H0: θ ≤ 0 vs H1: θ > 0.
If P(θ) ∝ 1 (a flat prior), then P(θ|x) = N(x, σ²), so P(θ ≤ 0 | x) = P(X ≥ x | θ = 0): the posterior probability of H0 equals the p-value in this case, since the Gaussian is symmetric in its arguments.
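The coincidence is easy to see numerically (a sketch assuming σ = 1 and x = 3, as in the previous slide's example):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

x, sigma = 3.0, 1.0
# Frequentist one-sided p-value: P(X >= x | theta = 0)
pval = 1 - Phi(x / sigma)
# Bayesian posterior probability of H0 under a flat prior: P(theta <= 0 | x)
post = Phi(-x / sigma)
print(pval, post)   # both ~0.00135: they agree by symmetry of the Gaussian
```

This agreement is special to the one-sided Gaussian case; it is not a general license to read p-values as posterior probabilities.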

Outline
– Hypothesis testing – Bayesian approach
– Hypothesis testing – classical approach
– What’s wrong with the classical approach?
  – Violates likelihood principle
  – Violates stopping rule principle
  – Violates common sense

Stopping rule principle Inferences you make should only depend on the observed data, not the reasons why this data was collected. If you look at your data to decide when to stop collecting, this should not change any conclusions you draw. Follows from likelihood principle.

Frequentist statistics violates the stopping rule principle
Observe D = HHHTHHHHTHHT (9 heads, 3 tails). Is there evidence of bias (P_h > P_t)?
Treat N = 12 trials as a fixed constant and the number of heads as the observed random variable. Define H0: P_h = 0.5. Then, at the 5% level, there is no significant evidence of bias: P(≥ 9 heads in 12 | H0) ≈ 0.073.

Frequentist statistics violates the stopping rule principle
Suppose instead the data was generated by tossing the coin until we got X = 3 tails. Now X = 3 tails is a fixed constant and N = 12 is the random variable (the first n−1 trials contain x−1 tails; the last trial is always a tail). Now there is significant evidence of bias: P(N ≥ 12 | H0) ≈ 0.033 < 0.05.
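Both tail probabilities for this classic example (usually stated, as here, with sampling stopped at the third tail) can be computed exactly (a sketch, not from the slides):

```python
from math import comb

# Same data (9 heads, 3 tails in 12 tosses), two stopping rules, two p-values.

# (a) Binomial model: N = 12 fixed, number of heads observed.
#     p = P(>= 9 heads in 12 | theta = 0.5)
p_binomial = sum(comb(12, k) for k in range(9, 13)) / 2 ** 12

# (b) Negative binomial model: toss until the 3rd tail; N is random.
#     p = P(N >= 12) = P(at most 2 tails in the first 11 tosses)
p_negbinom = sum(comb(11, k) for k in range(0, 3)) / 2 ** 11

print(p_binomial)  # ~0.073: not significant at the 5% level
print(p_negbinom)  # ~0.033: "significant" -- same data, opposite verdict
```

The likelihoods under the two models differ only by a constant factor, so a Bayesian analysis gives the same answer either way; the p-values do not.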

Ignoring the stopping criterion can mislead classical estimators
Let X_i ~ Bernoulli(θ). With N fixed, the maximum-likelihood estimator θ̂ = (1/N) Σ X_i is unbiased: E[θ̂] = θ.
Now toss a coin; if heads, stop, else toss a second coin. The outcomes are H, TH, TT with P(H) = θ, P(TH) = θ(1−θ), P(TT) = (1−θ)². Under this rule, the MLE is biased!
There are many classical rules for assessing significance when complex stopping rules are used.
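The bias can be verified by enumerating the three outcomes of the stop-after-a-head rule (a sketch; `expected_mle` is my name):

```python
# Exact expected value of the MLE under the "stop after a head" rule.
# Outcomes: H (prob t), TH (prob t(1-t)), TT (prob (1-t)^2);
# the MLE of t is the observed fraction of heads: 1, 1/2, 0 respectively.
def expected_mle(t):
    return t * 1.0 + t * (1 - t) * 0.5 + (1 - t) ** 2 * 0.0

print(expected_mle(0.5))  # 0.625, not 0.5: the MLE is biased upward
```

The estimator itself is unchanged — only the data-collection rule changed — yet its sampling-theory properties are ruined, which is exactly the stopping-rule problem.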

Outline Hypothesis testing – Bayesian approach Hypothesis testing – classical approach What’s wrong the classical approach? Violates likelihood principle Violates stopping rule principle Violates common sense

Confidence intervals
An interval (min(D), max(D)) is a 95% CI if θ lies inside this interval 95% of the time across repeated draws D ~ P(·|θ).
This does not mean P(θ ∈ CI | D) = 0.95!
Source: MacKay, sec. 37.3

Example
Draw two integers x1, x2 independently from P(x|θ) = 1/2 for x = θ and 1/2 for x = θ + 1.
If θ = 39, we would expect the outcomes (39,39), (39,40), (40,39), (40,40), each with probability 1/4.
Define the confidence interval as the single point min(x1, x2): e.g. (x1, x2) = (40,39) gives CI = (39,39).
75% of the time, this interval will contain the true θ.

CIs violate common sense
If (x1, x2) = (39,39), then CI = (39,39) at level 75%. But clearly P(θ=39|D) = P(θ=38|D) = 0.5.
If (x1, x2) = (39,40), then CI = (39,39), but clearly P(θ=39|D) = 1.0.
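A quick simulation confirms the 75% frequentist coverage while the example shows the interval can be certainly right or only 50/50 depending on the particular data (a sketch, assuming the sampling model P(x|θ) = 1/2 for x ∈ {θ, θ+1} described above):

```python
import random

# Each observation is theta or theta+1, each with probability 1/2;
# the "75% CI" is the single point min(x1, x2).
random.seed(0)
theta = 39
trials = 100_000
hits = 0
for _ in range(trials):
    x1 = theta + random.randint(0, 1)
    x2 = theta + random.randint(0, 1)
    if min(x1, x2) == theta:   # CI misses only when both draws are theta+1
        hits += 1
print(hits / trials)  # ~0.75 coverage across repeated experiments
```

The 75% is a pre-data average over hypothetical repetitions; it says nothing about what the observed (x1, x2) actually tells us about θ.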

What’s wrong with the classical approach?
– Violates the likelihood principle
– Violates the stopping rule principle
– Violates common sense

What’s right about the Bayesian approach?
– Simple and natural
– Optimal mechanism for reasoning under uncertainty
– Generalization of Aristotelian logic that reduces to deductive logic when our hypotheses are either true or false
– Supports interesting (human-like) kinds of learning

Bayesian humor “A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.”