LECTURE 23: CONJUGATE PRIORS AND MAP VARIANCE ESTIMATION
ECE 8443 – Pattern Recognition / ECE 8423 – Adaptive Signal Processing
Objectives: Conjugate Priors, Multinomial, Gaussian, MAP Variance Estimation, Example
Resources: Wiki: Conjugate Priors; GN: Bayesian Regression; DF: Conjugate Priors and Posteriors; GK: MAP Estimation of Parameters; PML: Bayesian Statistics: An Introduction
URL: .../publications/courses/ece_8423/lectures/current/lecture_23.ppt
MP3: .../publications/courses/ece_8423/lectures/current/lecture_23.mp3

ECE 8423: Lecture 23, Slide 1 — Introduction
Recall our maximum a posteriori (MAP) formulation. The likelihood function is considered fixed; it is usually well-determined from a statement of the data-generating process. We have discussed the need to assume a form for the prior distribution, g(θ), that makes the mathematics tractable and yet is a reasonable model. A conjugate prior is a choice of prior that makes the posterior have the same form (or family) as the prior, but with different parameters. A conjugate prior is an algebraic convenience: without it, a difficult numerical integration may be necessary, or a complex set of equations may result. Another important reason to use a conjugate prior relates to the way we iteratively estimate parameters: it is common to assume a prior, estimate a posterior, and then use that posterior as the prior when estimating the posterior for additional data. With a conjugate prior, we are assured that the form of the distribution will not change from one iteration to the next.
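For reference, the standard forms referenced on this slide can be written as follows (in LaTeX, assuming θ denotes the parameter, x the observed data, and g(θ) the prior):

  \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\; p(x \mid \theta)\, g(\theta)

  p(\theta \mid x_1, x_2) \propto p(x_2 \mid \theta)\, p(\theta \mid x_1)

With a conjugate prior, p(θ | x₁) has the same functional form as g(θ), so the same algebra applies at every update.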

ECE 8423: Lecture 23, Slide 2 — Conjugate Prior Examples
Recall that when we developed the MAP estimate of the mean of a Gaussian distribution, we assumed a Gaussian prior and observed that the posterior is also Gaussian (with a different mean and variance); hence the Gaussian prior (unknown mean) is a conjugate prior. Next, consider a binomial distribution (k outcomes, or "successes," in n trials). The conjugate prior for this distribution is a Beta distribution, where B(·,·) is the Beta function (defined in terms of the Gamma function). The posterior is again a Beta distribution with updated parameters.
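For reference, a sketch of the standard Beta-binomial forms being described, assuming the Beta prior parameters are denoted α and β (in LaTeX):

  p(k \mid n, \theta) = \binom{n}{k}\, \theta^{k} (1-\theta)^{n-k}

  g(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}

  p(\theta \mid k, n) \propto \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1},

i.e., a Beta(α + k, β + n − k) density. The conjugate update can also be checked numerically; the following MATLAB fragment is a hypothetical sketch (not part of the original lecture), with assumed prior parameters and success probability:

% Beta-binomial conjugate update: the posterior parameters are simple counts,
% and each posterior becomes the prior for the next batch of data.
a = 2; b = 2;                      % assumed Beta prior parameters
theta_true = 0.7;                  % assumed "success" probability
for batch = 1:3
  x = rand(1,50) < theta_true;     % 50 Bernoulli trials
  k = sum(x); n = length(x);
  a = a + k; b = b + (n - k);      % conjugate update
end
theta_map = (a - 1) / (a + b - 2)  % MAP estimate of theta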

ECE 8423: Lecture 23, Slide 3 — Conjugate Prior for Multinomial and Multivariate Gaussian
We can define a generalization of the binomial, known as the multinomial distribution, that describes the probability of k possible outcomes in n independent trials, where x_i is the number of times the i-th outcome is observed. The conjugate prior for this distribution is known as a Dirichlet density. Note that a Dirichlet prior was used for the mixture weights in MAP estimation of the mixture weights of an HMM. For a multivariate Gaussian distribution with unknown mean and precision (the precision is the inverse of the covariance), the conjugate prior is a Normal-Wishart distribution. This was used to model the priors for the means and covariances of the Gaussian mixture distributions in an HMM.
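For reference, the standard multinomial and Dirichlet forms, assuming k categories with probabilities θ_1, …, θ_k and Dirichlet parameters α_1, …, α_k (in LaTeX):

  p(x_1,\ldots,x_k \mid n, \boldsymbol{\theta}) = \frac{n!}{x_1! \cdots x_k!} \prod_{i=1}^{k} \theta_i^{x_i}, \qquad \sum_{i=1}^{k} x_i = n

  g(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1}

The posterior is again a Dirichlet density, with parameters α_i + x_i.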

ECE 8423: Lecture 23, Slide 4 — MAP Variance Adaptation (Scalar Gaussian RV)
Given a set of new observations, we would like to update our estimates of the mean and variance using MAP adaptation. We write the likelihood of the variance given the data and simplify this expression by defining some constants. The posterior is the likelihood multiplied by the prior. Once again, we face the usual problem of what to assume for the prior distribution. In the case of a Gaussian likelihood with an unknown mean, a Gaussian prior for the mean was a conjugate prior. In this case, however, a Gaussian prior does not satisfy our condition for a conjugate prior, and even if we persevered, the resulting equations would be too complicated (requiring a numerical solution). Most likely, the solution would also not deliver significantly better performance. Is there a better choice?
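A sketch of the likelihood being described, assuming the mean μ is treated as known (or replaced by its current estimate) and S denotes the sum of squared deviations, one of the constants defined on this slide (in LaTeX):

  p(x_1,\ldots,x_n \mid \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{S}{2\sigma^2}\right), \qquad S = \sum_{i=1}^{n} (x_i - \mu)^2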

ECE 8423: Lecture 23, Slide 5 — Prior Distribution: Normal-Wishart Conjugate Prior
In the scalar case, the Normal-Wishart form for the precision (the inverted form for the variance) reduces to an inverse Chi-squared distribution. S_0 represents an estimate of the variance computed from the prior data, or it can be optimized based on prior beliefs about the data. (This topic is again part of the parameter estimation material we discuss in Pattern Recognition.) ν is a parameter we introduce to allow the prior to have a form similar to the likelihood, while also giving us some freedom to optimize the prior to better match our prior beliefs about the data; for example, we could estimate ν from prior training data. For reasons that will make sense shortly, we also introduce one additional definition purely for convenience; it does not alter the model in any way. Combining all of these results, we can write our posterior.
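A sketch of the scalar inverse Chi-squared prior and the resulting posterior, assuming the parameterization implied by the MATLAB example at the end of this lecture (degrees of freedom ν and scale S_0; in LaTeX):

  g(\sigma^2) \propto (\sigma^2)^{-\left(\frac{\nu}{2} + 1\right)} \exp\!\left(-\frac{S_0}{2\sigma^2}\right)

  p(\sigma^2 \mid x_1,\ldots,x_n) \propto (\sigma^2)^{-\left(\frac{n+\nu}{2} + 1\right)} \exp\!\left(-\frac{S_0 + S}{2\sigma^2}\right)

The prior has the same algebraic form as the likelihood, which is exactly the conjugacy property we were after.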

ECE 8423: Lecture 23, Slide 6 — Maximizing the Posterior Distribution
Our posterior for the variance resembles an inverse Chi-squared (χ²) distribution, while the posterior for the inverse of the variance, which we call the precision, resembles a Chi-squared distribution. We can differentiate with respect to the precision to obtain the MAP solution (which is also the MAP solution for σ²).
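As a sketch of the maximization, differentiating the log posterior for σ² and setting the result to zero gives (in LaTeX):

  \frac{\partial}{\partial \sigma^2}\left[ -\left(\frac{n+\nu}{2} + 1\right)\ln\sigma^2 \;-\; \frac{S_0 + S}{2\sigma^2} \right] = 0
  \;\;\Rightarrow\;\;
  \hat{\sigma}^2_{\mathrm{MAP}} = \frac{S_0 + S}{n + \nu + 2}

This matches the update var_hat = (S0+S)/(v+n+2) used in the MATLAB code at the end of this lecture.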

ECE 8423: Lecture 23, Slide 7 — Dealing with the Parameters of the Prior Distribution
We have two parameters that remain unresolved: S_0 and ν. If we have prior knowledge of the mean and variance of the variance, based on estimates from the training data, we can equate these with the corresponding moments of the inverse Chi-squared distribution and solve for S_0 and ν. The mean and variance of this distribution are well known, so we can solve for both parameters. To recap the adaptation process (the moment-matching relations are sketched below):
(1) For a training data set, estimate the expected value of the variance (the mean of the variance) and the variance of the variance.
(2) Estimate ν and S_0 from these moments.
(3) Estimate S from the new data.
(4) Estimate σ² using the MAP solution.
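A sketch of the moment-matching step, using the mean and variance of the (scaled) inverse Chi-squared prior; these relations are consistent with the initialization in the MATLAB code (v = 2*mu_var^2/vari_var + 4; S0 = mu_var*(v-2)):

  E[\sigma^2] = \frac{S_0}{\nu - 2}, \qquad
  \mathrm{Var}[\sigma^2] = \frac{2 S_0^2}{(\nu - 2)^2 (\nu - 4)} = \frac{2\, E[\sigma^2]^2}{\nu - 4}

  \;\Rightarrow\;
  \nu = \frac{2\, E[\sigma^2]^2}{\mathrm{Var}[\sigma^2]} + 4, \qquad
  S_0 = E[\sigma^2]\,(\nu - 2)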

ECE 8423: Lecture 23, Slide 8 — Example: MAP Adaptation Convergence
(Convergence plots for the mean and variance estimates, generated by the MATLAB code on the following slides.)

ECE 8423: Lecture 23, Slide 9 — Summary
Introduced the concept of a conjugate prior.
Introduced several priors that are commonly used in statistical modeling and signal processing.
Derived adaptation of the variance of a scalar Gaussian random variable using MAP.
Demonstrated the convergence properties of the estimate on a simple example using MATLAB.
Next: adaptation using discriminative training.

ECE 8423: Lecture 23, Slide 10 — MATLAB
%
% map_gauss.m
%
% MAP Adaptation for Gaussian r.v. for:
%  1) Mean with Gaussian prior assuming known variance
%  2) Variance with inverse Chi-Squared prior
%     assuming known mean.
%
clear;
N = 5000; M = 2000;

% Training data and parameters
%
mu_train = 2.0; var_train = 1.1;
x_train = mu_train + randn(1,N) * sqrt(var_train);

% Test data and parameters
%
mu_test = 2.5; var_test = 1.5;
x_test = mu_test + randn(1,M) * sqrt(var_test);

ECE 8423: Lecture 23, Slide 11 — MATLAB (Cont.)
% Initialization
%
x = x_test; mu_hat = []; var_hat = [];

% Prior for mean estimation
%
mu_m = mu_train; var_m = 5.0;

% Prior for variance estimation
%
mu_var = var_train; vari_var = 10;
v = 2*mu_var^2 / vari_var + 4;
S0 = mu_var*(v-2);

for i = 2:1:length(x)
  x_test = x(1:i);
  n = length(x_test);
  m = mean(x_test);
  var1 = var(x_test);

ECE 8423: Lecture 23, Slide 12 — MATLAB (Cont.)
  % MAP adaptation for mean
  %
  vari = 1/(1/var_m + n/var1);
  mu_hat = [mu_hat vari*(mu_m/var_m + m/(var1/n))];

  % MAP adaptation for variance
  %
  S = sum((x_test - m).^2);
  var_hat = [var_hat (S0+S)/(v+n+2)];
end;

subplot(211); line([1,M],[mu_train, mu_train],'linestyle','-.'); hold on;
subplot(211); line([1,M],[m, m],'linestyle','-.','color','r');
subplot(211); plot([2:M],mu_hat);
text(-260, mu_train, 'train \mu'); text(-260, m, 'adapt \mu');
title('Mean Adaptation Convergence');
subplot(212); line([1,M],[var_train, var_train],'linestyle','-.'); hold on;
subplot(212); line([1,M],[S/M, S/M],'linestyle','-.','color','r');
subplot(212); plot([2:M],var_hat);
text(-285, var_train, 'train \sigma^2'); text(-285, S/M, 'adapt \sigma^2');
title('Variance Adaptation Convergence');