Chapter Two Probability Distributions: Discrete Variables

Chapter Two Probability Distributions: Discrete Variables
• Distributions: Relationships
• Binary Variables: Bernoulli, Binomial and Beta
• Multinomial Variables: Generalized Bernoulli and Dirichlet

Distributions: Landscape
• Discrete, binary: Bernoulli, Binomial, Beta
• Discrete, multi-valued: Multinomial, Dirichlet
• Continuous: Gaussian, Wishart, Student's-t, Gamma, Exponential
• Angular: Von Mises
• Uniform

Distributions: Relationships
• Discrete, binary
  – Bernoulli: single binary variable x ∈ {0,1}
  – Binomial: N samples of a Bernoulli (the Bernoulli is the N=1 case)
  – Beta: conjugate prior of the Bernoulli/Binomial; continuous variable in [0,1]
• Discrete, multi-valued
  – Multinomial: one of K values, represented as a K-dimensional binary vector (K=2 gives the Binomial)
  – Dirichlet: conjugate prior of the Multinomial; K random variables in [0,1]
• Continuous
  – Gaussian: limit of the Binomial for large N
  – Student's-t: generalization of the Gaussian, robust to outliers; an infinite mixture of Gaussians
  – Gamma: conjugate prior of the univariate Gaussian precision
  – Exponential: special case of the Gamma
  – Wishart: conjugate prior of the multivariate Gaussian precision matrix
  – Gaussian-Gamma: conjugate prior of the univariate Gaussian with unknown mean and precision
  – Gaussian-Wishart: conjugate prior of the multivariate Gaussian with unknown mean and precision matrix
• Angular: Von Mises
• Uniform

Binary Variables: Bernoulli, Binomial and Beta

Bernoulli Distribution
• Expresses the distribution of a single binary-valued random variable x ∈ {0,1}
• The probability of x=1 is denoted by the parameter μ, i.e., p(x=1|μ) = μ, and therefore p(x=0|μ) = 1 − μ
• The probability distribution has the form Bern(x|μ) = μ^x (1−μ)^(1−x)
• The mean is E[x] = μ and the variance is var[x] = μ(1−μ)
• The likelihood of N observations D = {x_1, ..., x_N} drawn independently from p(x|μ) is p(D|μ) = ∏_n μ^(x_n) (1−μ)^(1−x_n)
• The log-likelihood is ln p(D|μ) = Σ_n [x_n ln μ + (1−x_n) ln(1−μ)]
• The maximum likelihood estimator, obtained by setting the derivative of ln p(D|μ) with respect to μ equal to zero, is μ_ML = (1/N) Σ_n x_n
• If the number of observations with x=1 is m, then μ_ML = m/N
Jacob Bernoulli (1654-1705)
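
The estimate μ_ML = m/N is easy to check numerically. Below is a minimal sketch using numpy; the seed, sample size, and "true" parameter value are illustrative choices, not from the slides.

```python
# Minimal sketch of Bernoulli maximum likelihood estimation (illustrative values).
import numpy as np

rng = np.random.default_rng(seed=0)
mu_true = 0.25                                # assumed "unknown" parameter
D = rng.binomial(n=1, p=mu_true, size=100)    # N = 100 Bernoulli observations

m = int(D.sum())      # number of observations with x = 1
N = D.size
mu_ml = m / N         # maximum likelihood estimate: mu_ML = m / N
print(f"m = {m}, N = {N}, mu_ML = {mu_ml:.3f}")
```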

Binomial Distribution
• Related to the Bernoulli distribution: it expresses the distribution of m, the number of observations (out of N) for which x=1 (m heads, N−m tails)
• It is proportional to Bern(x|μ), summed over all ways of obtaining m heads in N flips:
  Bin(m|N,μ) = (N choose m) μ^m (1−μ)^(N−m)
• The mean and variance are N times those of the Bernoulli: E[m] = Nμ and var[m] = Nμ(1−μ)
• [Figure: histogram of the binomial distribution for N=10 and μ=0.25]
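
To make the formula concrete, here is a hedged sketch that evaluates the binomial pmf for the slide's example (N=10, μ=0.25) with scipy.stats.binom; this SciPy cross-check is our addition, not part of the original deck.

```python
# Sketch: binomial pmf, mean, and variance for N = 10, mu = 0.25.
from scipy.stats import binom

N, mu = 10, 0.25
for m in range(N + 1):
    print(f"p(m = {m:2d}) = {binom.pmf(m, N, mu):.4f}")

# mean = N*mu, variance = N*mu*(1 - mu)
print("mean:", binom.mean(N, mu), " variance:", binom.var(N, mu))
```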

Bayesian Inference with Beta
• The MLE of μ in the Bernoulli, μ_ML = argmax_μ p(D|μ), is the fraction of observations with x=1
  – It is severely over-fitted for small data sets
• The likelihood function takes products of factors of the form μ^x (1−μ)^(1−x)
• If the prior distribution of μ is chosen to be proportional to powers of μ and (1−μ), the posterior will have the same functional form as the prior
  – This property is called conjugacy
• The Beta distribution has a form suitable for a prior distribution p(μ):
  posterior p(μ|D) ∝ likelihood p(D|μ) × prior p(μ)
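
As a sanity check on the conjugacy claim, the sketch below multiplies a Bernoulli likelihood by a Beta(a,b) prior pointwise on a grid and verifies that the product is proportional to Beta(a+x, b+1−x); the hyperparameter values are illustrative assumptions.

```python
# Numeric sketch of conjugacy: (Bernoulli likelihood) x (Beta prior)
# is proportional to another Beta density.
import numpy as np
from scipy.stats import beta

a, b, x = 2.0, 3.0, 1                # assumed prior hyperparameters; one observation x = 1
mu = np.linspace(0.01, 0.99, 5)      # grid of mu values

unnorm = (mu**x * (1 - mu)**(1 - x)) * beta.pdf(mu, a, b)   # likelihood * prior
post = beta.pdf(mu, a + x, b + (1 - x))                     # Beta(a + x, b + 1 - x)
print(np.round(unnorm / post, 6))    # constant ratio => same functional form
```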

Beta Distribution
• The Beta distribution is
  Beta(μ|a,b) = [Γ(a+b) / (Γ(a)Γ(b))] μ^(a−1) (1−μ)^(b−1)
  where the Gamma function is defined as Γ(x) = ∫_0^∞ u^(x−1) e^(−u) du
• a and b are hyperparameters that control the distribution of the parameter μ
• Mean and variance: E[μ] = a/(a+b) and var[μ] = ab / ((a+b)^2 (a+b+1))
• [Figure: the Beta distribution as a function of μ for hyperparameter settings (a=0.1, b=0.1), (a=1, b=1), (a=2, b=3), (a=8, b=4)]
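
A quick way to verify the mean and variance formulas is to compare them against scipy.stats.beta for one of the slide's settings (here a=2, b=3); this cross-check is our addition.

```python
# Sketch: Beta mean/variance formulas vs. scipy.stats.beta (a = 2, b = 3).
from scipy.stats import beta

a, b = 2.0, 3.0
print("mean:", beta.mean(a, b), " formula:", a / (a + b))
print("var :", beta.var(a, b),
      " formula:", a * b / ((a + b)**2 * (a + b + 1)))
```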

Bayesian Inference with Beta
posterior p(μ|D) ∝ likelihood p(D|μ) × prior p(μ)
• The posterior, obtained by multiplying the Beta prior by the binomial likelihood, is
  p(μ|m, l, a, b) ∝ μ^(m+a−1) (1−μ)^(l+b−1)
  where m is the number of heads and l = N − m is the number of tails
• It is another Beta distribution: the data effectively increase the value of a by m and of b by l
• As the number of observations increases, the distribution becomes more sharply peaked
• [Figure: illustration of one step of the process — a Beta prior with a=2, b=2 combined with a single observation x=1 (N=m=1) yields a posterior with a=3, b=2]
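
The one-step update in the figure is just hyperparameter addition, as this minimal sketch of the same case shows (prior Beta(2,2), one observed head):

```python
# Sketch of the slide's one-step update: Beta(2, 2) prior plus one head
# (N = m = 1, l = 0) gives a Beta(3, 2) posterior.
from scipy.stats import beta

a, b = 2, 2            # prior hyperparameters
m, l = 1, 0            # m heads, l = N - m tails
a_post, b_post = a + m, b + l
print(f"posterior: Beta(a = {a_post}, b = {b_post})")
print("posterior mean:", beta.mean(a_post, b_post))   # 3 / 5 = 0.6
```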

Predicting the Next Trial Outcome
• We need the predictive distribution of x given the observed data D; from the sum and product rules,
  p(x=1|D) = ∫_0^1 p(x=1|μ) p(μ|D) dμ = ∫_0^1 μ p(μ|D) dμ = E[μ|D]
• The expected value of the posterior distribution can be shown to be
  p(x=1|D) = (m+a) / (m+a+l+b)
  which is the fraction of observations, both fictitious (prior) and real, that correspond to x=1
• The maximum likelihood and Bayesian results agree in the limit of infinitely many observations
• On average, uncertainty (variance) decreases as more data are observed
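
The sketch below contrasts the ML estimate m/N with the Bayesian predictive (m+a)/(m+a+l+b) on simulated data; the true parameter, prior, and seed are illustrative assumptions, but the two estimates visibly converge as N grows.

```python
# Sketch: ML vs. Bayesian predictive estimates of p(x = 1) as N grows.
import numpy as np

rng = np.random.default_rng(1)
a, b, mu_true = 2, 2, 0.7            # assumed prior and true parameter
for N in (1, 10, 100, 10_000):
    x = rng.binomial(1, mu_true, size=N)
    m = int(x.sum()); l = N - m
    print(f"N = {N:6d}  ML: {m / N:.3f}  "
          f"Bayes predictive: {(m + a) / (m + a + l + b):.3f}")
```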

Summary
• The distribution of a single binary variable is represented by the Bernoulli
• The binomial is related to the Bernoulli: it expresses the distribution of the number of occurrences of either 1 or 0 in N trials
• The Beta distribution is a conjugate prior for the Bernoulli: both have the same functional form

Multinomial Variables: Generalized Bernoulli and Dirichlet

Generalization of the Bernoulli
• A discrete variable that takes one of K values (instead of 2)
• Represent it with a 1-of-K scheme:
  – Represent x as a K-dimensional vector with a single element equal to 1 and all others 0
  – If K=6 and the variable takes its third value, we represent it as x = (0,0,1,0,0,0)^T
  – Such vectors satisfy Σ_k x_k = 1
• If the probability of x_k=1 is denoted μ_k, the distribution of x is given by the generalized Bernoulli
  p(x|μ) = ∏_(k=1..K) μ_k^(x_k)
  where μ = (μ_1, ..., μ_K)^T, with μ_k ≥ 0 and Σ_k μ_k = 1
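
Here is a minimal sketch of the 1-of-K representation and the generalized Bernoulli p(x|μ) = ∏_k μ_k^(x_k); the parameter vector is an illustrative assumption.

```python
# Sketch: 1-of-K encoding and the generalized Bernoulli p(x|mu).
import numpy as np

K = 6
mu = np.array([0.1, 0.1, 0.3, 0.2, 0.2, 0.1])   # assumed parameters, sum to 1

x = np.zeros(K)
x[2] = 1                      # variable takes its third value: (0,0,1,0,0,0)^T
p = float(np.prod(mu**x))     # product picks out mu_3 = 0.3
print(p)
```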

Likelihood Function
• Given a data set D of N independent observations x_1, ..., x_N, the likelihood function has the form
  p(D|μ) = ∏_(n=1..N) ∏_(k=1..K) μ_k^(x_nk) = ∏_(k=1..K) μ_k^(m_k)
  where m_k = Σ_n x_nk is the number of observations with x_k=1
• The maximum likelihood solution (obtained from the log-likelihood and a derivative set to zero, subject to the constraint Σ_k μ_k = 1) is
  μ_k^(ML) = m_k / N
  which is the fraction of the N observations for which x_k=1
• Example: a data set D with N=5, K=6, containing observations such as (0,0,1,0,0,0)^T, (1,0,0,0,0,0)^T, (0,0,0,0,1,0)^T
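
A sketch of the ML solution μ_k = m_k/N on a small 1-of-K data set; the last two rows are made-up observations added so that N=5, since the slide lists only three.

```python
# Sketch: multinomial ML estimate mu_k = m_k / N from 1-of-K data.
import numpy as np

# Rows are 1-of-K observations (K = 6); the last two rows are illustrative.
D = np.array([[0, 0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0]])

m = D.sum(axis=0)          # counts m_k = sum_n x_nk
mu_ml = m / D.shape[0]     # fraction of observations with x_k = 1
print("m_k  :", m)
print("mu_ML:", mu_ml)
```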

Generalized Binomial Distribution
• The multinomial distribution is
  Mult(m_1, ..., m_K | μ, N) = (N choose m_1 m_2 ... m_K) ∏_(k=1..K) μ_k^(m_k)
• The normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m_1, ..., m_K, given by
  (N choose m_1 m_2 ... m_K) = N! / (m_1! m_2! ... m_K!), where Σ_k m_k = N
• Example data sets: D with N=5, K=6 containing (0,0,1,0,0,0)^T, (1,0,0,0,0,0)^T, (0,0,0,0,1,0)^T; and D with N=7, K=6 containing (1,0,0,0,0,0)^T, (0,1,0,0,0,0)^T
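
The normalization coefficient is a plain factorial ratio, so a short sketch suffices (multinomial_coef is a hypothetical helper name, not from the slides):

```python
# Sketch: multinomial coefficient N! / (m_1! m_2! ... m_K!).
from math import factorial, prod

def multinomial_coef(counts):
    """Number of ways of partitioning N = sum(counts) objects into K groups."""
    return factorial(sum(counts)) // prod(factorial(m) for m in counts)

print(multinomial_coef([2, 1, 1, 1, 0, 0]))   # N = 5, K = 6: 5!/2! = 60
```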

Dirichlet Distribution
• A family of prior distributions for the parameters μ_k of the multinomial distribution
• By inspection of the multinomial, the form of the conjugate prior is
  p(μ|α) ∝ ∏_(k=1..K) μ_k^(α_k − 1), with 0 ≤ μ_k ≤ 1 and Σ_k μ_k = 1
• The normalized form of the Dirichlet distribution is
  Dir(μ|α) = [Γ(α_0) / (Γ(α_1) ... Γ(α_K))] ∏_(k=1..K) μ_k^(α_k − 1)
  where α_0 = Σ_k α_k
Lejeune Dirichlet (1805-1859)
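
A hedged sketch of evaluating the normalized Dirichlet density with scipy.stats.dirichlet; the concentration parameters and the evaluation point are illustrative.

```python
# Sketch: Dirichlet density and mean via scipy.stats.dirichlet.
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 5.0])   # assumed concentration parameters (K = 3)
mu = np.array([0.2, 0.3, 0.5])      # a point on the simplex (sums to 1)

print("pdf :", dirichlet.pdf(mu, alpha))
print("mean:", dirichlet.mean(alpha))   # E[mu_k] = alpha_k / alpha_0
```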

Dirichlet over 3 Variables
• Due to the summation constraint Σ_k μ_k = 1, the distribution over the space of {μ_k} is confined to a simplex of dimensionality K−1
• For K=3 the simplex is a triangle
• [Figure: plots of the Dirichlet distribution over the simplex for parameter settings α_k = 0.1, α_k = 1, and α_k = 10]
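
Since the plots themselves are not reproduced here, this small sketch draws a sample for each of the slide's three settings to show how α_k controls concentration on the simplex (the seed is an illustrative choice):

```python
# Sketch: effect of alpha_k on Dirichlet samples over the K = 3 simplex.
import numpy as np

rng = np.random.default_rng(0)
for a in (0.1, 1.0, 10.0):
    sample = rng.dirichlet([a, a, a])    # one draw; sums to 1
    print(f"alpha_k = {a:5}: sample = {np.round(sample, 3)}")
```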

Dirichlet Posterior Distribution
• Multiplying the prior by the likelihood gives
  p(μ|D, α) ∝ p(D|μ) p(μ|α) ∝ ∏_(k=1..K) μ_k^(α_k + m_k − 1)
• This has the form of a Dirichlet distribution:
  p(μ|D, α) = Dir(μ|α + m) = [Γ(α_0 + N) / (Γ(α_1 + m_1) ... Γ(α_K + m_K))] ∏_(k=1..K) μ_k^(α_k + m_k − 1)
  where m = (m_1, ..., m_K)^T
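
The conjugate update is just α → α + m, as in this minimal sketch (the prior and counts are illustrative assumptions):

```python
# Sketch of the Dirichlet conjugate update alpha -> alpha + m.
import numpy as np

alpha = np.ones(6)                       # assumed uniform prior, alpha_k = 1
m = np.array([2, 1, 1, 1, 0, 0])         # observed counts m_k (N = 5)

alpha_post = alpha + m                   # posterior is Dir(mu | alpha + m)
print("posterior alpha:", alpha_post)
print("posterior mean :", alpha_post / alpha_post.sum())
```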

Summary
• The multinomial is a generalization of the Bernoulli
  – The variable takes one of K values instead of 2
• The conjugate prior of the multinomial is the Dirichlet distribution