Flipping A Biased Coin Suppose you have a coin with an unknown bias, θ ≡ P(head). You flip the coin multiple times and observe the outcomes. From these observations, you can infer the bias of the coin.

Maximum Likelihood Estimate Sequence of observations: H T T H T T T H. Maximum likelihood estimate? θ = 3/8. What about this sequence: T T T T T H H H? Again 3 heads out of 8 flips, so again θ = 3/8. What assumption makes order unimportant? Independent, Identically Distributed (IID) draws.
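
A minimal sketch of this computation (plain Python; the flip strings are the two example sequences above):

```python
# ML estimate of theta = P(head): the fraction of flips that came up heads.
def mle_theta(flips):
    n_heads = sum(1 for f in flips if f == "H")
    return n_heads / len(flips)

print(mle_theta("HTTHTTTH"))  # 0.375, i.e. 3/8
print(mle_theta("TTTTTHHH"))  # also 0.375: under IID, order does not matter
```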

The Likelihood Independent flips mean the likelihood of a specific sequence is P(D | θ) = θ^NH (1-θ)^NT, so NH and NT are sufficient statistics. How do we compute the maximum likelihood solution? The binomial distribution gives the probability of getting NH heads out of NH + NT flips; it sums over orderings, whereas here we have a specific sequence (the two differ only by a combinatorial factor that does not involve θ). Maximum likelihood solution: take the log likelihood and set its derivative to zero.
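
A short worked version of the log-likelihood derivation referenced above, using the NH, NT notation from the slides (standard result):

```latex
\begin{align*}
\log P(D \mid \theta) &= N_H \log\theta + N_T \log(1-\theta) \\
\frac{d}{d\theta}\log P(D \mid \theta) &= \frac{N_H}{\theta} - \frac{N_T}{1-\theta} = 0
\quad\Rightarrow\quad \hat\theta_{\mathrm{ML}} = \frac{N_H}{N_H + N_T}
\end{align*}
```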

Bayesian Hypothesis Evaluation: Two Alternatives Two hypotheses ('h' is for hypothesis, not head!): h0: θ = 0.5 and h1: θ = 0.9. The role of the priors diminishes as the number of flips increases. Note the weirdness that each hypothesis has an associated probability, and each hypothesis itself specifies a probability: probabilities of probabilities! Setting a prior to zero amounts to narrowing the hypothesis space. Whether we observe a particular sequence or just the counts of heads and tails, we get the same posterior, because the combinatorial factor drops out.
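
A minimal sketch of the two-hypothesis comparison (an illustration, not the original slide's code; the counts and prior below are made-up inputs):

```python
# Posterior probability of h0 (theta = 0.5) versus h1 (theta = 0.9),
# given N_H heads and N_T tails and a prior P(h0).
def posterior_h0(n_h, n_t, prior_h0=0.5):
    like_h0 = 0.5 ** n_h * 0.5 ** n_t
    like_h1 = 0.9 ** n_h * 0.1 ** n_t
    num0 = prior_h0 * like_h0
    num1 = (1 - prior_h0) * like_h1
    return num0 / (num0 + num1)

print(posterior_h0(2, 2, prior_h0=0.1))    # few flips: the prior still matters
print(posterior_h0(50, 50, prior_h0=0.1))  # many flips: the data dominate
```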

Bayesian Hypothesis Evaluation: Many Alternatives Eleven hypotheses: h0: θ = 0.0, h1: θ = 0.1, …, h10: θ = 1.0. Uniform priors: P(hi) = 1/11.

MATLAB Code
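
The MATLAB listing did not come through in the transcript; as a stand-in, here is a minimal Python sketch of the grid posterior over the 11 hypotheses above (the counts are made-up inputs):

```python
import numpy as np

# Posterior over theta in {0.0, 0.1, ..., 1.0} with a uniform prior of 1/11.
def grid_posterior(n_h, n_t, thetas=np.linspace(0.0, 1.0, 11)):
    prior = np.ones_like(thetas) / len(thetas)
    likelihood = thetas ** n_h * (1.0 - thetas) ** n_t
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()

print(grid_posterior(3, 5))  # e.g. 3 heads and 5 tails
```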

Infinite Hypothesis Spaces Consider all values of θ, 0 ≤ θ ≤ 1. Inferring θ is just like any other sort of Bayesian inference. The likelihood is as before: P(D | θ) = θ^NH (1-θ)^NT. The normalization term is P(D) = ∫ P(D | θ) p(θ) dθ, integrating θ from 0 to 1. With a uniform prior on θ, the posterior is proportional to θ^NH (1-θ)^NT. This is a beta distribution: Beta(NH+1, NT+1).

Beta Distribution Beta(θ; a, b) = Γ(a+b) / (Γ(a) Γ(b)) · θ^(a-1) (1-θ)^(b-1). Gamma function: a generalization of the factorial, with Γ(n) = (n-1)! for positive integers n.

Incorporating Priors Suppose we have a Beta prior, θ ~ Beta(VH, VT). We can compute the posterior analytically, and the posterior is also Beta distributed: Beta(VH + NH, VT + NT).
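
A minimal sketch of the conjugate update using SciPy (the prior pseudo-counts and observed counts are made-up inputs):

```python
from scipy.stats import beta

# Beta(V_H, V_T) prior plus N_H heads and N_T tails gives a
# Beta(V_H + N_H, V_T + N_T) posterior.
def beta_posterior(n_h, n_t, v_h=1.0, v_t=1.0):
    return beta(v_h + n_h, v_t + n_t)

post = beta_posterior(n_h=3, n_t=5, v_h=2.0, v_t=2.0)
print(post.mean())          # posterior mean of theta
print(post.interval(0.95))  # central 95% credible interval
```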

Imaginary Counts VH and VT can be thought of as the outcomes of coin flipping experiments, either in one's imagination or in past experience. Equivalent sample size = VH + VT. The larger the equivalent sample size, the more confident we are about our prior beliefs, and the more evidence we need to overcome the prior.

Regularization Suppose we flip the coin once and get a tail, i.e., NT = 1, NH = 0. What is the maximum likelihood estimate of θ? Zero. What if we toss in imaginary counts VH = VT = 1, i.e., effective NT = 2, NH = 1? The estimate becomes 1/3. What if we toss in imaginary counts VH = VT = 2, i.e., effective NT = 3, NH = 2? The estimate becomes 2/5. Imaginary counts smooth the estimates so that small data sets do not produce extreme values. This is an issue in text processing, where some words never appear in the training corpus. Note that for V ≥ 1, the ML estimate here is the same as the MAP estimate with V as prior counts.
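
A minimal sketch of the text-processing case mentioned above, with imaginary counts acting as add-V smoothing (the toy corpus and V are made up):

```python
from collections import Counter

# Add V imaginary counts to every word so unseen words get nonzero probability.
def smoothed_probs(tokens, vocab, v=1.0):
    counts = Counter(tokens)
    total = len(tokens) + v * len(vocab)
    return {w: (counts[w] + v) / total for w in vocab}

corpus = "the coin came up heads then tails then heads".split()
vocab = set(corpus) | {"edge"}                # "edge" never appears in the corpus
print(smoothed_probs(corpus, vocab)["edge"])  # nonzero despite a zero count
```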

Prediction Using Posterior Given some sequence of n coin flips (e.g., HTTHH), what's the probability of heads on the next flip? It is the expectation of θ under the posterior, i.e., the mean of a beta distribution.
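
Spelled out with a Beta(VH, VT) prior as on the earlier slides (this is the standard posterior-predictive result):

```latex
P(\mathrm{head} \mid D) = \int_0^1 \theta \, p(\theta \mid D)\, d\theta
= \mathbb{E}[\theta \mid D]
= \frac{N_H + V_H}{N_H + N_T + V_H + V_T}
```

With a uniform prior (VH = VT = 1) this reduces to Laplace's rule of succession, (NH + 1)/(NH + NT + 2).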

Summary So Far Beta prior on θ; binomial likelihood for the observations; Beta posterior on θ. Conjugate priors: the Beta distribution is the conjugate prior of a binomial or Bernoulli distribution.

Notable: Gaussian prior, Gaussian likelihood, Gaussian posterior (this combination appeared in the Weiss model).

Conjugate Mixtures If a distribution Q is a conjugate prior for likelihood R, then so is a distribution that is a mixture of Q's, e.g., a mixture of Betas. After observing 20 heads and 10 tails, note the change in the mixture weights. Example from Murphy (Fig. 5.10).
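
A minimal sketch of the mixture update (the component parameters and weights are assumptions for illustration, not necessarily those in Murphy's figure):

```python
import numpy as np
from scipy.special import betaln

# Prior: 0.5 * Beta(20, 20) + 0.5 * Beta(30, 10); data: 20 heads, 10 tails.
weights = np.array([0.5, 0.5])
a = np.array([20.0, 30.0])
b = np.array([20.0, 10.0])
n_h, n_t = 20, 10

# The posterior mixture weight of component k is proportional to
# w_k * B(a_k + N_H, b_k + N_T) / B(a_k, b_k), its marginal likelihood.
log_marg = betaln(a + n_h, b + n_t) - betaln(a, b)
new_weights = weights * np.exp(log_marg)
new_weights /= new_weights.sum()

print(new_weights)       # weights shift toward the better-fitting component
print(a + n_h, b + n_t)  # each component's posterior Beta parameters
```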

Dirichlet-Multinomial Model We've been talking about the Beta-Binomial model, where observations are binary (1-of-2 possibilities). What if observations are 1-of-K possibilities? K-sided dice, K English words, K nationalities.

Multinomial RV Variable X with values x1, x2, …, xK. Likelihood, given Nk observations of xk: P(D | θ) ∝ ∏k θk^Nk, analogous to the binomial draw. Here θ specifies a probability mass function (pmf).

Dirichlet Distribution The conjugate prior of a multinomial likelihood: p(θ) = Γ(Σk αk) / ∏k Γ(αk) · ∏k θk^(αk - 1) for θ in the K-dimensional probability simplex, and 0 otherwise. The Dirichlet is a distribution over probability mass functions (pmfs). Compare the {αk} to VH and VT. From Frigyik, Kapila, & Gupta (2010).
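
A minimal sketch of the Dirichlet-multinomial conjugate update (the pseudo-counts and observed counts are made-up inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 1.0, 1.0])  # Dirichlet pseudo-counts for K = 3 outcomes
counts = np.array([5, 2, 3])       # N_k observations of each outcome

posterior_alpha = alpha + counts   # conjugacy: Dir(alpha) -> Dir(alpha + N)
theta_samples = rng.dirichlet(posterior_alpha, size=1000)

print(posterior_alpha / posterior_alpha.sum())  # posterior mean pmf
print(theta_samples.mean(axis=0))               # sample mean is close to it
```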

Hierarchical Bayes Consider a generative model for the multinomial: one of K alternatives is chosen by drawing alternative k with probability θk. But when we have uncertainty in the {θk}, we must first draw the pmf θ from a Dirichlet with hyperparameters {αk}; the {θk} are the parameters of the multinomial, and the {αk} sit one level above as hyperparameters.

Hierarchical Bayes Whenever you have a parameter you don't know, instead of arbitrarily picking a value for that parameter, pick a distribution over it. This is a weaker assumption than selecting a single parameter value. It requires hyperparameters (hyper^n-parameters), but results are typically less sensitive to the hyper^n-parameters than to the hyper^(n-1)-parameters.

Example Of Hierarchical Bayes: Modeling Student Performance Collect data from S students on performance on N test items. There is variability from student to student and from item to item: a distribution over students and a distribution over items.

Item-Response Theory Parameters for student ability and item difficulty: P(correct) = logistic(Ability_s - Difficulty_i). We need a different ability parameter for each student and a different difficulty parameter for each item. But can we benefit from the fact that students in the population share some characteristics, and likewise for items?
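
A minimal sketch of the response model above (a Rasch-style formulation; the ability and difficulty values are made up):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

ability = np.array([-0.5, 0.0, 1.2])          # one parameter per student
difficulty = np.array([-1.0, 0.3, 0.8, 2.0])  # one parameter per item

# S x N matrix of predicted probabilities of a correct response.
p_correct = logistic(ability[:, None] - difficulty[None, :])
print(p_correct.round(2))
```

A hierarchical treatment would additionally place population-level priors on the ability and difficulty parameters, which is what the preceding slides motivate.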