ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition Objectives: Jensen’s Inequality (Special Case) EM Theorem.

Presentation transcript:

ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 10: EXPECTATION MAXIMIZATION (EM) Objectives: Jensen’s Inequality (Special Case); EM Theorem Proof; EM Example – Missing Data; Application: Hidden Markov Models. Resources: Wiki: EM History; T.D.: Brown CS Tutorial; UIUC: Tutorial; F.J.: Statistical Methods.

ECE 8527: Lecture 10, Slide 1 The Expectation Maximization Algorithm (Preview)

ECE 8527: Lecture 10, Slide 2 The Expectation Maximization Algorithm (Cont.)

ECE 8527: Lecture 10, Slide 3 The Expectation Maximization Algorithm

ECE 8527: Lecture 10, Slide 4 Synopsis Expectation maximization (EM) is a widely used approach for finding maximum likelihood estimates of parameters in probabilistic models. It is an iterative optimization method that estimates unknown parameters from measurement data, and it is used in a variety of contexts to estimate missing data or discover hidden variables. The intuition behind EM is an old one: alternate between estimating the hidden variables given the current parameters and estimating the parameters given the hidden variables. This idea has been around for a long time. However, in 1977, Dempster, et al., proved convergence and explained the relationship to maximum likelihood estimation. EM alternates between an expectation (E) step, which computes an expectation of the log likelihood by including the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected log likelihood found on the E step. The parameters found on the M step are then used to begin another E step, and the process is repeated. This approach is the cornerstone of important algorithms such as hidden Markov modeling and discriminative training, and has been applied to fields including human language technology and image processing.

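ECE 8527: Lecture 10, Slide 5 Special Case of Jensen’s Inequality Lemma: If p(x) and q(x) are two discrete probability distributions, then:

$$\sum_x p(x) \log p(x) \ge \sum_x p(x) \log q(x)$$

with equality if and only if p(x) = q(x) for all x. Proof:

$$\sum_x p(x) \log p(x) - \sum_x p(x) \log q(x) = -\sum_x p(x) \log \frac{q(x)}{p(x)} \ge -\sum_x p(x) \left( \frac{q(x)}{p(x)} - 1 \right)$$

The last step follows using a bound for the natural logarithm: $\ln x \le x - 1$.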

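ECE 8527: Lecture 10, Slide 6 Special Case of Jensen’s Inequality Continuing in efforts to simplify:

$$\sum_x p(x) \log p(x) - \sum_x p(x) \log q(x) \ge -\sum_x p(x) \left( \frac{q(x)}{p(x)} - 1 \right) = \sum_x p(x) - \sum_x q(x) = 0$$

We note that since both of these functions are probability distributions, they must sum to 1.0, so the right-hand side is zero. Therefore, the inequality holds. The general form of Jensen’s inequality relates a convex function of an integral to the integral of the convex function (for convex f, $f(E[x]) \le E[f(x)]$) and is used extensively in information theory.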
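
The lemma can be checked numerically. The following is a minimal sketch, assuming two made-up distributions over three outcomes (the values of `p` and `q` and the helper name `cross_term` are illustrative and not from the lecture). The gap between the two sums is the quantity usually called the Kullback-Leibler divergence.

```python
import math

def cross_term(p, q):
    """Return sum_x p(x) * log q(x), skipping outcomes where p(x) is zero."""
    return sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions over three outcomes (illustrative values only).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

lhs = cross_term(p, p)  # sum_x p(x) log p(x)
rhs = cross_term(p, q)  # sum_x p(x) log q(x)
gap = lhs - rhs         # nonnegative by the lemma

print(f"sum p log p = {lhs:.4f}")
print(f"sum p log q = {rhs:.4f}")
print(f"gap         = {gap:.4f}")
```

Replacing `q` with a copy of `p` drives the gap to zero, which is the equality case of the lemma.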

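ECE 8527: Lecture 10, Slide 7 The EM Theorem Theorem: If

$$\sum_t P_{\theta'}(t \mid y) \log P_{\theta}(t, y) > \sum_t P_{\theta'}(t \mid y) \log P_{\theta'}(t, y)$$

then $P_{\theta}(y) > P_{\theta'}(y)$. Proof: Let y denote observable data. Let $P_{\theta'}(y)$ be the probability distribution of y under some model whose parameters are denoted by $\theta'$. Let $P_{\theta}(y)$ be the corresponding distribution under a different setting $\theta$. Our goal is to prove that y is more likely under $\theta$ than $\theta'$. Let t denote some hidden, or latent, parameters that are governed by the values of $\theta$. Because $P_{\theta'}(t \mid y)$ is a probability distribution that sums to 1, we can write:

$$\log P_{\theta}(y) - \log P_{\theta'}(y) = \sum_t P_{\theta'}(t \mid y) \log P_{\theta}(y) - \sum_t P_{\theta'}(t \mid y) \log P_{\theta'}(y)$$

Because $P_{\theta}(y) = P_{\theta}(t, y) / P_{\theta}(t \mid y)$, we can exploit the dependence of y on t using well-known properties of a conditional probability distribution.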

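ECE 8527: Lecture 10, Slide 8 Proof Of The EM Theorem We can multiply each term by “1”, writing $P_{\theta}(y) = P_{\theta}(t, y) / P_{\theta}(t \mid y)$, where $\theta'$ denotes the old parameters and $\theta$ the new ones:

$$\log P_{\theta}(y) - \log P_{\theta'}(y) = \sum_t P_{\theta'}(t \mid y) \log \frac{P_{\theta}(t, y)}{P_{\theta}(t \mid y)} - \sum_t P_{\theta'}(t \mid y) \log \frac{P_{\theta'}(t, y)}{P_{\theta'}(t \mid y)}$$

$$= \sum_t P_{\theta'}(t \mid y) \log P_{\theta}(t, y) - \sum_t P_{\theta'}(t \mid y) \log P_{\theta'}(t, y) + \sum_t P_{\theta'}(t \mid y) \log P_{\theta'}(t \mid y) - \sum_t P_{\theta'}(t \mid y) \log P_{\theta}(t \mid y)$$

$$\ge \sum_t P_{\theta'}(t \mid y) \log P_{\theta}(t, y) - \sum_t P_{\theta'}(t \mid y) \log P_{\theta'}(t, y)$$

where the inequality follows from our lemma, applied to the last two sums with $p = P_{\theta'}(t \mid y)$ and $q = P_{\theta}(t \mid y)$. Explanation: What exactly have we shown? If the last quantity is greater than zero, then the new model will be better than the old model. This suggests a strategy for finding the new parameters, θ: choose them to make the last quantity positive!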
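
The bound in the proof can be checked on a toy model. The sketch below assumes a hypothetical model that is not from the lecture: a binary hidden variable t with P(t = 1) = pi, and a binary observation y with P(y = 1 | t) = a_t. The parameter values in `theta_old` and `theta_new` are arbitrary. It compares the gain in the expected log joint probability with the gain in the log likelihood of the observed data.

```python
import math

def joint(theta, t, y):
    """P_theta(t, y) for a hypothetical binary hidden t and binary observed y."""
    pi, a0, a1 = theta
    p_t = pi if t == 1 else 1.0 - pi
    a = a1 if t == 1 else a0
    p_y_given_t = a if y == 1 else 1.0 - a
    return p_t * p_y_given_t

def marginal(theta, y):
    """P_theta(y), obtained by summing the joint over the hidden variable."""
    return sum(joint(theta, t, y) for t in (0, 1))

def posterior(theta, t, y):
    """P_theta(t | y)."""
    return joint(theta, t, y) / marginal(theta, y)

def expected_log_joint(theta, theta_old, y):
    """sum_t P_theta_old(t | y) * log P_theta(t, y)."""
    return sum(posterior(theta_old, t, y) * math.log(joint(theta, t, y))
               for t in (0, 1))

# Arbitrary illustrative parameter settings: (pi, a0, a1).
theta_old = (0.5, 0.2, 0.6)
theta_new = (0.6, 0.3, 0.8)
y = 1  # the single observed value

q_gain = (expected_log_joint(theta_new, theta_old, y)
          - expected_log_joint(theta_old, theta_old, y))
ll_gain = math.log(marginal(theta_new, y)) - math.log(marginal(theta_old, y))

print(f"gain in expected log joint: {q_gain:.4f}")
print(f"gain in log likelihood:     {ll_gain:.4f}")
```

For these settings the expected log joint improves, so the theorem says the log likelihood must improve by at least as much, which is what the two printed values show.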

ECE 8527: Lecture 10, Slide 9 Discussion If we start with the parameter setting $\theta'$, and find a parameter setting $\theta$ for which our inequality holds, then the observed data, y, will be more probable under $\theta$ than $\theta'$. The name Expectation Maximization comes about because we take the expectation of $\log P_{\theta}(t, y)$ with respect to the old distribution $P_{\theta'}(t \mid y)$ and then maximize the expectation as a function of the argument $\theta$. Critical to the success of the algorithm is the choice of the proper intermediate variable, t, that will allow finding the maximum of the expectation of $\log P_{\theta}(t, y)$. Perhaps the most prominent use of the EM algorithm in pattern recognition is to derive the Baum-Welch reestimation equations for a hidden Markov model. Many other reestimation algorithms have been derived using this approach.
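
Written compactly, one EM iteration can be summarized as below. The shorthand Q and the superscript "new" are notational assumptions introduced here, with $\theta'$ denoting the old parameters and $\theta$ a candidate setting.

```latex
\begin{align*}
\text{E step:}\quad & Q(\theta, \theta') = \sum_t P_{\theta'}(t \mid y)\, \log P_{\theta}(t, y) \\
\text{M step:}\quad & \theta^{\text{new}} = \arg\max_{\theta}\, Q(\theta, \theta')
\end{align*}
```

Since the maximizer satisfies $Q(\theta^{\text{new}}, \theta') \ge Q(\theta', \theta')$, the same argument used to prove the theorem gives $P_{\theta^{\text{new}}}(y) \ge P_{\theta'}(y)$, so an iteration never makes the observed data less likely.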

ECE 8527: Lecture 10, Slide 10 Example: Estimating Missing Data Consider a data set with a missing element: Let us estimate the value of the missing point assuming a Gaussian model with a diagonal covariance and arbitrary means: Expectation step: Assuming normal distributions as initial conditions, this can be simplified to:
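
A minimal sketch of this kind of estimate follows. The data values, the starting point, and the function name are hypothetical and do not reproduce the slide's data set. The model is a two-dimensional Gaussian with diagonal covariance, and one point is missing its first coordinate. With a diagonal covariance the two coordinates are independent, so the E step only needs the current mean and variance of the first coordinate.

```python
def em_missing_first_coordinate(observed, x2_of_incomplete,
                                mu1=0.0, var1=1.0, iters=50):
    """EM for a 2-D diagonal Gaussian when one point lacks its first coordinate.

    observed: list of complete (x1, x2) points.
    x2_of_incomplete: second coordinate of the point whose x1 is missing.
    mu1, var1: arbitrary starting guesses for the first coordinate.
    """
    n = len(observed) + 1
    x1_obs = [pt[0] for pt in observed]
    x2_all = [pt[1] for pt in observed] + [x2_of_incomplete]

    # The second coordinate is fully observed: ordinary ML estimates apply.
    mu2 = sum(x2_all) / n
    var2 = sum((v - mu2) ** 2 for v in x2_all) / n

    for _ in range(iters):
        # E step: expected sufficient statistics of the missing x1 under the
        # current parameters (independent of x2 for a diagonal covariance).
        e_x1 = mu1
        e_x1_sq = mu1 ** 2 + var1
        # M step: ML estimates using observed values plus the expectations.
        mu1 = (sum(x1_obs) + e_x1) / n
        var1 = (sum(v * v for v in x1_obs) + e_x1_sq) / n - mu1 ** 2

    return (mu1, mu2), (var1, var2)

# Hypothetical data: three complete points and one point with x1 missing.
observed = [(1.0, 3.0), (2.0, 1.0), (3.0, 2.0)]
means, variances = em_missing_first_coordinate(observed, 5.0)
print("means:", means)
print("variances:", variances)
```

In this sketch the first-coordinate estimates settle at the mean and ML variance of the three observed x1 values, which is what one would expect when the missing coordinate carries no information under a diagonal covariance.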

ECE 8527: Lecture 10, Slide 11 Example: Gaussian Mixtures An excellent tutorial on Gaussian mixture estimation can be found at J. Bilmes, EM Estimation. An interactive demo showing convergence of the estimate can be found at I. Dinov, Demonstration.
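
For orientation, here is a small sketch of EM for a one-dimensional mixture of two Gaussians. It is not taken from the tutorial or the demo: the data, the initial parameters, and the function names are all illustrative assumptions. The hidden variable is the component that generated each point.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """Log likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_step(data, weights, means, variances):
    """One EM iteration for a univariate Gaussian mixture."""
    # E step: posterior probability of each component for each point.
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M step: reestimate parameters from the weighted data.
    n = len(data)
    new_w, new_m, new_v = [], [], []
    for j in range(len(weights)):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / n)
        new_m.append(mean)
        new_v.append(var)
    return new_w, new_m, new_v

# Hypothetical data drawn by hand around -2 and 3, with arbitrary initial guesses.
data = [-2.1, -1.9, -2.4, -1.6, -2.0, 2.9, 3.2, 3.0, 2.7, 3.3]
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

history = [log_likelihood(data, weights, means, variances)]
for _ in range(20):
    weights, means, variances = em_step(data, weights, means, variances)
    history.append(log_likelihood(data, weights, means, variances))

print("weights:", weights)
print("means:", means)
print("log likelihood per iteration:", [round(h, 3) for h in history])
```

The recorded log likelihood should not decrease from one iteration to the next, in line with the EM theorem. A practical implementation would also guard against a component variance collapsing toward zero, which this sketch does not do.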

ECE 8527: Lecture 10, Slide 12 Introduction To Hidden Markov Models

ECE 8527: Lecture 10, Slide 13 Introduction To Hidden Markov Models (Cont.)

ECE 8527: Lecture 10, Slide 14 Introduction To Hidden Markov Models (Cont.)

ECE 8527: Lecture 10, Slide 15 Summary Expectation Maximization (EM) Algorithm: a generalization of Maximum Likelihood Estimation (MLE) to models with hidden variables or missing data, based on iteratively maximizing the expected log likelihood of the data under a model. The proof that EM works relies on a special case of Jensen’s inequality. Jensen’s Inequality: describes a relationship between two probability distributions in terms of an entropy-like quantity. It is a key tool in proving that EM estimation converges. The EM Theorem: proved that estimating a model’s parameters using an iteration of EM increases the probability of the observed data under the model. Demonstrated an application of the EM Theorem to the problem of estimating a missing data point. Explained how EM can be used to reestimate parameters in a pattern recognition system. Introduced the concept of a hidden Markov model and explained how we will use EM to estimate the parameters of this model.