Ch 2. Probability Distributions (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized by Yung-Kyun Noh and Joo-kyung Kim, Biointelligence Laboratory, Seoul National University


2.1. Binary Variables
- The beta distribution
2.2. Multinomial Variables
- The Dirichlet distribution
2.3. The Gaussian Distribution
- Conditional Gaussian distributions
- Marginal Gaussian distributions
- Bayes' theorem for Gaussian variables
- Maximum likelihood for the Gaussian
- Sequential estimation

Density Estimation
Modeling the probability distribution p(x) of a random variable x, given a finite set x_1, ..., x_N of observations.
- We will assume that the data points are i.i.d.
Density estimation is fundamentally ill-posed:
- There are infinitely many probability distributions that could have given rise to the observed finite data set.
- The issue of choosing an appropriate distribution relates to the problem of model selection.
The chapter begins by considering parametric distributions:
- binomial, multinomial, and Gaussian
- governed by a small number of adaptive parameters, such as the mean and variance in the case of a Gaussian.

Frequentist and Bayesian Treatments of Density Estimation
Frequentist:
- Choose specific values for the parameters by optimizing some criterion, such as the likelihood function.
Bayesian:
- Introduce prior distributions over the parameters and then use Bayes' theorem to compute the corresponding posterior distribution given the observed data.

Bernoulli Distribution
- Considers a single binary random variable x ∈ {0, 1}.
Frequentist treatment:
- Likelihood function: suppose we have a data set D = {x_1, ..., x_N} of observed values of x.
- Maximum likelihood estimator: if we flip a coin 3 times and happen to observe 3 heads, the ML estimate of the probability of heads is 1.
- This is an extreme example of the overfitting associated with maximum likelihood.
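The formulas on this slide were images in the original deck and did not survive in the transcript; for reference, the standard forms (Bishop, Section 2.1) are:

```latex
\[
\mathrm{Bern}(x\mid\mu) = \mu^{x}(1-\mu)^{1-x},
\qquad
p(\mathcal{D}\mid\mu) = \prod_{n=1}^{N}\mu^{x_n}(1-\mu)^{1-x_n},
\qquad
\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}x_n = \frac{m}{N}
\]
```

where m is the number of observations of x = 1.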

Binomial Distribution
- The distribution of the number m of observations of x = 1, given that the data set has size N.
- Figure: histogram plot of the binomial distribution for N = 10, μ = 0.25.
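A minimal sketch (assuming SciPy is available) that reproduces the numbers behind the histogram described above:

```python
from scipy.stats import binom

N, mu = 10, 0.25          # parameters used in the slide's histogram
for m in range(N + 1):
    # Bin(m | N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)
    print(m, round(binom.pmf(m, N, mu), 4))
```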

Beta Distribution
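The body of this slide was a formula/figure panel; for reference, the standard beta density and its mean (Bishop eqs. 2.13 and 2.15) are:

```latex
\[
\mathrm{Beta}(\mu\mid a,b)
= \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,\mu^{a-1}(1-\mu)^{b-1},
\qquad
\mathbb{E}[\mu] = \frac{a}{a+b}
\]
```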

Bernoulli & Binomial Distribution - Bayesian Treatment (1/3)
We need to introduce a prior distribution.
Conjugacy:
- The posterior distribution has the same functional form as the prior.
- Beta (prior) × Binomial (likelihood) → Beta (posterior)
- Dirichlet (prior) × Multinomial (likelihood) → Dirichlet (posterior)
- Gaussian (prior) × Gaussian (likelihood) → Gaussian (posterior)
When the beta distribution is used as the prior:
- The posterior distribution of μ is obtained by multiplying the beta prior by the binomial likelihood function and normalizing.
- It has the same functional dependence on μ as the prior distribution, reflecting the conjugacy property.

Bernoulli & Binomial Distribution - Bayesian Treatment (2/3)
Because of the normalization property of the beta distribution, the posterior is simple to normalize:
- It is simply another beta distribution (see below).
- Observing a data set with m observations of x = 1 and l observations of x = 0 increases the value of a by m and the value of b by l.
- This allows a simple interpretation of the hyperparameters a and b in the prior as effective numbers of prior observations of x = 1 and x = 0, respectively.
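For reference, the posterior referred to above (Bishop eqs. 2.17-2.18), where m and l are the numbers of observations of x = 1 and x = 0:

```latex
\[
p(\mu \mid m, l, a, b) \;\propto\; \mu^{m+a-1}(1-\mu)^{l+b-1}
\;\;\Longrightarrow\;\;
p(\mu \mid m, l, a, b) = \mathrm{Beta}(\mu \mid a+m,\; b+l)
\]
```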

Bernoulli & Binomial Distribution - Bayesian Treatment (3/3)
The posterior distribution can act as the prior if we subsequently observe additional data.
Prediction of the outcome of the next trial:
- If m, l → ∞, the result reduces to the maximum likelihood result.
- The Bayesian and maximum likelihood (frequentist) results agree in the limit of an infinitely large data set.
- For a finite data set, the posterior mean of μ always lies between the prior mean and the maximum likelihood estimate μ_ML, which corresponds to the relative frequencies of events.
As the number of observations increases, the posterior distribution becomes more sharply peaked (its variance is reduced).
Figure: illustration of one step of sequential Bayesian inference.
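A minimal sketch of the update and predictive probability described above (the hyperparameter and count values are illustrative only):

```python
# Beta-binomial Bayesian update: prior Beta(a, b), observe m ones and l zeros.
a, b = 2.0, 2.0                        # illustrative prior hyperparameters
m, l = 7, 3                            # observed counts of x = 1 and x = 0

a_post, b_post = a + m, b + l          # posterior is Beta(a + m, b + l)
p_next_is_one = a_post / (a_post + b_post)   # predictive p(x = 1 | D), Bishop eq. 2.20
mu_ml = m / (m + l)                    # maximum likelihood estimate

# The posterior mean lies between the prior mean and mu_ML,
# and approaches mu_ML as m, l grow large.
print(p_next_is_one, mu_ml)
```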

Multinomial Variables (1/2)
We use a 1-of-K coding scheme:
- The variable is represented by a K-dimensional vector x in which one of the elements x_k equals 1 and all remaining elements equal 0.
- Example: x = (0, 0, 1, 0, 0, 0)^T
- We consider a data set D of N independent observations.

Multinomial Variables (2/2)
Maximizing the log-likelihood using a Lagrange multiplier (to enforce the constraint that the μ_k sum to one); see the steps below.
Multinomial distribution:
- The joint distribution of the quantities m_1, ..., m_K, conditioned on the parameters μ and on the total number N of observations.
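The equations on this slide were images; the standard steps being summarized (Bishop eqs. 2.31-2.34) are:

```latex
\[
\sum_{k=1}^{K} m_k \ln \mu_k + \lambda\Bigl(\sum_{k=1}^{K}\mu_k - 1\Bigr)
\;\;\Longrightarrow\;\;
\mu_k^{\mathrm{ML}} = \frac{m_k}{N},
\qquad
\mathrm{Mult}(m_1,\dots,m_K\mid\boldsymbol{\mu},N)
= \binom{N}{m_1\,m_2\cdots m_K}\prod_{k=1}^{K}\mu_k^{m_k}
\]
```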

Dirichlet Distribution (1/2)
Dirichlet distribution:
- The relation between the multinomial and Dirichlet distributions is the same as that between the binomial and beta distributions.
- Prior and posterior: both Dirichlet (conjugacy); the forms are given below.
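The prior and posterior referred to above were formula images; the standard forms (Bishop eqs. 2.38 and 2.41) are:

```latex
\[
\mathrm{Dir}(\boldsymbol{\mu}\mid\boldsymbol{\alpha})
= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}
  \prod_{k=1}^{K}\mu_k^{\alpha_k-1},
\quad \alpha_0 = \sum_{k=1}^{K}\alpha_k,
\qquad
p(\boldsymbol{\mu}\mid\mathcal{D},\boldsymbol{\alpha})
= \mathrm{Dir}(\boldsymbol{\mu}\mid\boldsymbol{\alpha} + \mathbf{m})
\]
```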

Dirichlet Distribution (2/2)
Figure: the Dirichlet distribution over three variables.
- Confined to a simplex because of the constraints Σ_k μ_k = 1 and 0 ≤ μ_k ≤ 1.
- The two horizontal axes span the simplex; the vertical axis corresponds to the density (plots for α_k = 0.1, 1, 10, respectively).

The Gaussian Distribution
- In the case of a single variable x (univariate form, below).
- For a D-dimensional vector x (multivariate form, below).
- The Gaussian is the distribution that maximizes the entropy for a given mean and variance.
The central limit theorem:
- The sum of a set of random variables has a distribution that becomes increasingly Gaussian as the number of terms in the sum increases.
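The density formulas on this slide were images; the standard univariate and multivariate forms (Bishop eqs. 2.42-2.43) are:

```latex
\[
\mathcal{N}(x\mid\mu,\sigma^2)
= \frac{1}{(2\pi\sigma^2)^{1/2}}
  \exp\Bigl\{-\frac{1}{2\sigma^2}(x-\mu)^2\Bigr\},
\qquad
\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})
= \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}
  \exp\Bigl\{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}
  \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\Bigr\}
\]
```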

The Geometrical Form of the Gaussian Distribution (1/3)
The functional dependence of the Gaussian on x is through the quadratic form Δ² in the exponent.
- Δ is called the Mahalanobis distance; it reduces to the Euclidean distance when Σ is the identity matrix I.
- Σ can be taken to be symmetric, because any antisymmetric component would disappear from the exponent.
The eigenvector equation (below):
- Choose the eigenvectors to form an orthonormal set.
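For reference, the quadratic form and eigenvector equation referred to above (Bishop eqs. 2.44-2.46):

```latex
\[
\Delta^2 = (\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}
\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}),
\qquad
\boldsymbol{\Sigma}\mathbf{u}_i = \lambda_i\mathbf{u}_i,
\qquad
\mathbf{u}_i^{\mathrm{T}}\mathbf{u}_j = \delta_{ij}
\]
```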

The Geometrical Form of the Gaussian Distribution (2/3)
- The covariance matrix can be expressed as an expansion in terms of its eigenvectors.
- The functional dependence of the Gaussian on x can then be rewritten in these eigenvector coordinates.
- We can interpret {y_i} as a new coordinate system, defined by the orthonormal vectors u_i, that is shifted and rotated relative to the original x coordinates.

The Geometrical Form of the Gaussian Distribution (3/3)
y = U(x − μ), where U is a matrix whose rows are given by u_i^T:
- U is an orthogonal matrix.
- For the density to be well defined, Σ should be positive definite (all eigenvalues strictly positive).
The determinant |Σ| of the covariance matrix can be written as the product of its eigenvalues.
In the y coordinates the Gaussian distribution takes the form of a product of D independent univariate Gaussian distributions (below).
Normalization of p(y) confirms that the multivariate Gaussian is indeed normalized.
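For reference, the transformed coordinates and factorized density referred to above (Bishop eqs. 2.50-2.56):

```latex
\[
y_i = \mathbf{u}_i^{\mathrm{T}}(\mathbf{x}-\boldsymbol{\mu}),
\qquad
|\boldsymbol{\Sigma}| = \prod_{j=1}^{D}\lambda_j,
\qquad
p(\mathbf{y}) = \prod_{j=1}^{D}
\frac{1}{(2\pi\lambda_j)^{1/2}}
\exp\Bigl\{-\frac{y_j^{2}}{2\lambda_j}\Bigr\}
\]
```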

1st Moment
- The first moment of the multivariate Gaussian: integrating (using the change of variables z = x − μ and symmetry) gives E[x] = μ.

2nd Moment
- The second moment of the multivariate Gaussian: the cross terms vanish by integration (odd symmetry), giving E[xx^T] = μμ^T + Σ and hence cov[x] = Σ.

Covariance Matrix Forms for the Gaussian Distribution
For large D, the total number of parameters, D(D+3)/2, grows quadratically with D.
One way to reduce the computational cost is to restrict the form of the covariance matrix:
- (a) general (full) covariance
- (b) diagonal covariance
- (c) isotropic covariance (proportional to the identity matrix)
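A small sketch (with illustrative numbers) contrasting the three covariance forms and their parameter counts:

```python
import numpy as np

D = 4
rng = np.random.default_rng(0)

# (a) general: any symmetric positive-definite matrix -> D(D+1)/2 free parameters
A = rng.standard_normal((D, D))
sigma_full = A @ A.T + D * np.eye(D)

# (b) diagonal: an independent variance per dimension -> D free parameters
sigma_diag = np.diag(rng.uniform(0.5, 2.0, size=D))

# (c) isotropic: a single shared variance -> 1 free parameter
sigma_iso = 1.5 * np.eye(D)

# Total Gaussian parameter counts (including the D-dimensional mean):
print(D * (D + 3) // 2,   # full:      D + D(D+1)/2
      2 * D,              # diagonal:  D + D
      D + 1)              # isotropic: D + 1
```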

Conditional & Marginal Gaussian Distributions (1/2)
If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian:
- The mean of the conditional distribution p(x_a|x_b) is a linear function of x_b, and the covariance does not depend on the conditioning value x_b.
- This is an example of a linear-Gaussian model.
If a joint distribution p(x_a, x_b) is Gaussian, then the marginal distribution is also Gaussian:
- Proved by completing the square in the exponent and integrating out x_b.

Conditional & Marginal Gaussian Distributions (2/2)
Figure: the contours of a Gaussian distribution p(x_a, x_b) over two variables, together with the marginal distribution p(x_a) and the conditional distribution p(x_a|x_b).

Conditional Gaussian Distributions (1/3)
Find p(x_a|x_b) from the joint distribution p(x_a, x_b):
- Rearrange the multivariate Gaussian with respect to x_a.
- Define the partitioned mean, covariance, and precision matrix (see below).
- Caution: the precision block Λ_aa is not simply the inverse of the covariance block Σ_aa.
- Partitioning the quadratic part of the Gaussian again gives a quadratic form with respect to x_a.
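The definitions referenced above were formula images; the standard partitioned quantities (Bishop eqs. 2.65-2.69) are:

```latex
\[
\mathbf{x} = \begin{pmatrix}\mathbf{x}_a\\ \mathbf{x}_b\end{pmatrix},\quad
\boldsymbol{\mu} = \begin{pmatrix}\boldsymbol{\mu}_a\\ \boldsymbol{\mu}_b\end{pmatrix},\quad
\boldsymbol{\Sigma} = \begin{pmatrix}\boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab}\\
\boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb}\end{pmatrix},\quad
\boldsymbol{\Lambda} \equiv \boldsymbol{\Sigma}^{-1}
= \begin{pmatrix}\boldsymbol{\Lambda}_{aa} & \boldsymbol{\Lambda}_{ab}\\
\boldsymbol{\Lambda}_{ba} & \boldsymbol{\Lambda}_{bb}\end{pmatrix},
\qquad
\boldsymbol{\Lambda}_{aa} \neq (\boldsymbol{\Sigma}_{aa})^{-1}\ \text{in general}
\]
```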

Conditional Gaussian Distributions (2/3)
Fix x_b and complete the square in x_a:
- The second-order terms in x_a determine the conditional covariance.
- The linear terms in x_a determine the conditional mean.

Conditional Gaussian Distributions (3/3)
Using the identity for the inverse of a partitioned matrix gives the conditional mean and conditional covariance (below).
Note:
- The mean of the conditional distribution is a linear function of x_b (a linear-Gaussian model).
- The covariance does not depend on the conditioning value x_b.
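For reference, the conditional mean and covariance referred to above, in both covariance and precision form (Bishop eqs. 2.73, 2.75, 2.81-2.82):

```latex
\[
\boldsymbol{\mu}_{a\mid b} = \boldsymbol{\mu}_a
+ \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}(\mathbf{x}_b - \boldsymbol{\mu}_b)
= \boldsymbol{\mu}_a
- \boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}(\mathbf{x}_b - \boldsymbol{\mu}_b),
\qquad
\boldsymbol{\Sigma}_{a\mid b} = \boldsymbol{\Sigma}_{aa}
- \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba}
= \boldsymbol{\Lambda}_{aa}^{-1}
\]
```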

Marginal Gaussian Distributions (1/2)
Find p(x_a) from the joint distribution:
- Again, start from the partitioned quadratic form.
- Collect the terms that involve x_b; they are integrated out when marginalizing over x_b.

Marginal Gaussian Distributions (2/2)
After marginalization, compare the result with the standard quadratic form of a Gaussian in x_a.
We obtain the intuitively satisfying result that the marginal distribution has the mean and covariance given below.
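For reference, the marginal result referred to above (Bishop eqs. 2.92, 2.98):

```latex
\[
p(\mathbf{x}_a) = \int p(\mathbf{x}_a, \mathbf{x}_b)\,\mathrm{d}\mathbf{x}_b
= \mathcal{N}(\mathbf{x}_a \mid \boldsymbol{\mu}_a, \boldsymbol{\Sigma}_{aa}),
\qquad
\mathbb{E}[\mathbf{x}_a] = \boldsymbol{\mu}_a,
\qquad
\mathrm{cov}[\mathbf{x}_a] = \boldsymbol{\Sigma}_{aa}
\]
```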

Bayes' Theorem for Gaussian Variables (1/2)
Setting:
- A Gaussian marginal p(x) and a Gaussian conditional p(y|x), shown below.
- The mean of y is a linear function of x, and its covariance is independent of x (a linear-Gaussian model).
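For reference, the setting referred to above (Bishop eqs. 2.99-2.100):

```latex
\[
p(\mathbf{x}) = \mathcal{N}\bigl(\mathbf{x}\mid\boldsymbol{\mu},\,\boldsymbol{\Lambda}^{-1}\bigr),
\qquad
p(\mathbf{y}\mid\mathbf{x}) = \mathcal{N}\bigl(\mathbf{y}\mid
\mathbf{A}\mathbf{x}+\mathbf{b},\,\mathbf{L}^{-1}\bigr)
\]
```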

Bayes' Theorem for Gaussian Variables (2/2)
By a similar completing-the-square process we obtain the marginal distribution p(y) and the posterior distribution p(x|y) (below).
- Note that the marginal covariance of y combines the noise covariance L^{-1} with the covariance of x propagated through A.
From these equations, together with the mean and covariance equations for the conditional Gaussian distribution, we can also recover these results.
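For reference, the results referred to above (Bishop eqs. 2.115-2.117):

```latex
\[
p(\mathbf{y}) = \mathcal{N}\bigl(\mathbf{y}\mid
\mathbf{A}\boldsymbol{\mu}+\mathbf{b},\;
\mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\bigr),
\qquad
p(\mathbf{x}\mid\mathbf{y}) = \mathcal{N}\bigl(\mathbf{x}\mid
\boldsymbol{\Sigma}\{\mathbf{A}^{\mathrm{T}}\mathbf{L}(\mathbf{y}-\mathbf{b})
+ \boldsymbol{\Lambda}\boldsymbol{\mu}\},\;\boldsymbol{\Sigma}\bigr),
\qquad
\boldsymbol{\Sigma} = (\boldsymbol{\Lambda}
+ \mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}
\]
```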

Maximum Likelihood for the Gaussian (1/2)
Maximizing the likelihood with respect to μ gives the sample mean; maximizing with respect to Σ gives the sample covariance about μ_ML.
- The maximization with respect to Σ is carried out imposing the symmetry and positive definiteness constraints.
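A minimal numerical sketch of the ML estimates referred to above (synthetic data; the parameter values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
true_mu = np.array([1.0, -2.0])
true_sigma = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_sigma, size=N)

mu_ml = X.mean(axis=0)               # (1/N) * sum_n x_n
diff = X - mu_ml
sigma_ml = diff.T @ diff / N         # (1/N) * sum_n (x_n - mu_ml)(x_n - mu_ml)^T

print(mu_ml)
print(sigma_ml)
```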

Maximum Likelihood for the Gaussian (2/2)
Evaluating the expectations of the ML solutions under the true distribution:
- The expectation of μ_ML equals the true mean, but the ML estimate for the covariance has an expectation that is less than the true value (it is biased).
- The bias can be corrected by using an estimator that divides by N − 1 instead of N (below).
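For reference, the bias and the corrected estimator referred to above (Bishop eqs. 2.122, 2.125):

```latex
\[
\mathbb{E}[\boldsymbol{\Sigma}_{\mathrm{ML}}] = \frac{N-1}{N}\,\boldsymbol{\Sigma},
\qquad
\widetilde{\boldsymbol{\Sigma}} = \frac{1}{N-1}\sum_{n=1}^{N}
(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})
(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}},
\qquad
\mathbb{E}[\widetilde{\boldsymbol{\Sigma}}] = \boldsymbol{\Sigma}
\]
```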

Sequential Estimation (1/4)
Sequential methods allow data points to be processed one at a time and then discarded.
The Robbins-Monro algorithm:
- A more general formulation of sequential learning.
- Consider a pair of random variables θ and z governed by a joint distribution p(z, θ).

Sequential Estimation (2/4)
- The regression function is the conditional expectation f(θ) = E[z|θ].
- Our goal is to find the root θ* at which f(θ*) = 0.
- We observe values of z one at a time and wish to find a corresponding sequential estimation scheme for θ*.

Sequential Estimation (3/4)
In the case of a Gaussian distribution, θ corresponds to μ.
The Robbins-Monro procedure defines a sequence of successive estimates of the root θ* (see the sketch below).
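A minimal sketch, assuming the standard Robbins-Monro update θ^(N) = θ^(N−1) + a_{N−1} z(θ^(N−1)) with step sizes a_{N−1} = 1/N, applied to sequential estimation of a Gaussian mean (the true mean and sample size are illustrative):

```python
import numpy as np

# Sequential (one-pass) ML estimate of a Gaussian mean, Bishop eq. 2.126:
#   mu_N = mu_{N-1} + (1/N) * (x_N - mu_{N-1})
# This is a Robbins-Monro update with z = x - theta and a_{N-1} = 1/N.

rng = np.random.default_rng(42)
true_mu = 3.0
mu = 0.0                       # initial estimate (illustrative)
for n, x in enumerate(rng.normal(true_mu, 1.0, size=10_000), start=1):
    mu += (x - mu) / n         # process one point, then discard it

print(mu)                      # close to true_mu = 3.0
```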

Sequential Estimation (4/4)
For a general maximum likelihood problem, finding the maximum likelihood solution corresponds to finding the root of a regression function (the expected gradient of the log-likelihood).
For the Gaussian, the maximum likelihood solution for μ is the value of μ that makes E[z|μ] = 0, where z is the derivative of the log-likelihood with respect to μ.