Expectation-Maximization (EM) Algorithm
Md. Rezaul Karim
Professor, Department of Statistics
University of Rajshahi, Bangladesh
September 21, 2012

2 Basic Concept (1)
• EM stands for the "Expectation-Maximization" algorithm.
• It is a parameter estimation method that falls into the general framework of maximum-likelihood estimation (MLE).
• The general form was given by Dempster, Laird, and Rubin (1977), although the essence of the algorithm had appeared previously in various forms.

3 Basic Concept (2)
• The EM algorithm is a broadly applicable iterative procedure for computing maximum likelihood estimates in problems with incomplete data.
• Each iteration of the EM algorithm consists of two conceptually distinct steps:
  o the Expectation or E-step, and
  o the Maximization or M-step.
• Details can be found in Hartley (1958), Dempster et al. (1977), Little and Rubin (1987) and McLachlan and Krishnan (1997).

4 Formulation of the EM Algorithm (1)
• Y = (Y_obs, Y_mis), the complete data (what we would like to have)
• Y_obs, the observed data (what we have)
• Y_mis, the missing data (incomplete/unobserved)
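A sketch of the standard formulation in the notation above, following Dempster et al. (1977) rather than reproducing the slide equations: at iteration k, the E-step forms

  Q(θ; θ^(k)) = E[ log L_c(θ; Y) | Y_obs, θ^(k) ],

the conditional expectation of the complete-data log likelihood given the observed data and the current parameter value, and the M-step updates the parameter by

  θ^(k+1) = argmax over θ of Q(θ; θ^(k)).

The two steps are repeated until the change in θ (or in the observed-data log likelihood) falls below a tolerance.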

5 Formulation of the EM Algorithm (2)

6 Formulation of the EM Algorithm (3)

7 Formulation of the EM Algorithm (4)
[Flowchart: start from an initial guess of the unknown parameters; E-step: using the observed data structure, form a guess of the unknown/hidden data structure and the Q function; M-step: update the guess of the unknown parameters; repeat]

8 Formulation of the EM Algorithm (5)

9 Formulation of the EM Algorithm (6)

10 Formulation of the EM Algorithm (7)

11 Formulation of the EM Algorithm (8)

12 Multinomial Example (1)
[Table with columns: Observed data, Probability; the full table appears on slide 15 below]

13 Multinomial Example (2)

14 Multinomial Example (3)

15 Multinomial Example (4)
Observed data, cell probabilities, and the missing-data split (n = 197):
  y1 = 125, probability 1/2 + θ/4; split into missing data y11 (probability 1/2) and y12 (probability θ/4)
  y2 = 18,  probability (1-θ)/4
  y3 = 20,  probability (1-θ)/4
  y4 = 34,  probability θ/4
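For reference (not shown explicitly here, but consistent with the table above and with the log likelihood computed in the R code later in the deck), the observed-data log likelihood is, up to an additive constant,

  log L(θ; y) = y1 log(2 + θ) + (y2 + y3) log(1 - θ) + y4 log(θ),

obtained from the multinomial likelihood with cell probabilities (1/2 + θ/4, (1-θ)/4, (1-θ)/4, θ/4).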

16 Multinomial Example (5)

17 Multinomial Example (6)
  y1 = 125 is split into y11, with probability 1/2, and y12, with probability θ/4
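A sketch of the E- and M-step formulas for this example, written to be consistent with the R functions shown on the later slides: the E-step replaces the unobserved y12 by its conditional expectation given the observed count y1 and the current estimate θ^(k),

  y12^(k) = E[y12 | y1, θ^(k)] = y1 · (θ^(k)/4) / (1/2 + θ^(k)/4),

since, given y1, y12 follows a binomial distribution with size y1 and probability (θ^(k)/4) / (1/2 + θ^(k)/4). The M-step then maximizes the complete-data log likelihood with y12 fixed at y12^(k), giving

  θ^(k+1) = (y12^(k) + y4) / (y12^(k) + y2 + y3 + y4).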

18 Multinomial Example (7)

19 Multinomial Example (8)

20 Flowchart for EM Algorithm
[Flowchart: initialize θ; E-step; M-step; check convergence (No: return to the E-step; Yes: stop)]

21 R function for the Example (1)
(y1, y2, y3, y4 are the observed frequencies)

EM.Algo = function(y1, y2, y3, y4, tol, start0) {
  n = y1 + y2 + y3 + y4
  theta.current = start0; theta.last = 0; theta = theta.current
  while (abs(theta.last - theta) > tol) {
    y12 = E.step(theta.current, y1)        # E-step: expected missing count y12
    theta = M.step(y12, y2, y3, y4, n)     # M-step: updated estimate of theta
    theta.last = theta.current
    theta.current = theta
    # observed-data log likelihood (up to an additive constant)
    log.lik = y1*log(2 + theta.current) + (y2 + y3)*log(1 - theta.current) +
              y4*log(theta.current)
    cat(c(theta.current, log.lik), '\n')
  }
}

22 R function for the Example (2)

M.step = function(y12, y2, y3, y4, n) {
  # complete-data MLE of theta given the expected split y12
  return((y12 + y4) / (y12 + y2 + y3 + y4))
}

E.step = function(theta.current, y1) {
  # conditional expectation of the missing count y12 given y1 and theta
  y12 = y1 * (theta.current/4) / (0.5 + theta.current/4)
  return(y12)
}

# Results:
EM.Algo(125, 18, 20, 34, 10^(-7), 0.50)
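As a quick check (not part of the original slides), the observed-data log likelihood for this example can also be maximized directly with R's optimize() and compared with the value the EM iterations converge to; the helper name loglik below is illustrative:

loglik = function(theta, y) {
  # observed-data log likelihood, up to an additive constant
  y[1]*log(2 + theta) + (y[2] + y[3])*log(1 - theta) + y[4]*log(theta)
}
optimize(loglik, interval = c(0, 1), maximum = TRUE, y = c(125, 18, 20, 34))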

23 R function for the Example (3)
[Table: EM iterations, showing the iteration number k, the estimate θ^(k), and the observed-data log likelihood at each step]

24 Monte Carlo EM (1)
In an EM algorithm, the E-step may be difficult to implement because the conditional expectation of the complete-data log likelihood is hard to compute. Wei and Tanner (1990a, 1990b) suggest a Monte Carlo approach: on the E-step of the (k+1)th iteration, simulate the missing data z_1, ..., z_m from the conditional distribution k(z | y, θ^(k)).

25 Monte Carlo EM (2)
The M-step then maximizes the Monte Carlo approximation to the conditional expectation of the complete-data log likelihood,

  Q_m(θ; θ^(k)) = (1/m) Σ_{j=1}^{m} log L_c(θ; y, z_j).

The limiting form of Q_m as m tends to ∞ is the actual Q(θ; θ^(k)).

26 Monte Carlo EM (3)
Application of MCEM in the previous example: a Monte Carlo EM solution replaces the expectation E[y12 | y1, θ^(k)] with the empirical average

  y12^(k) = (1/m) Σ_{j=1}^{m} z_j,

where the z_j are simulated from a binomial distribution with size y1 and probability (θ^(k)/4) / (1/2 + θ^(k)/4).

27 Monte Carlo EM (4)
Application of MCEM in the previous example: the R code for the E-step becomes

E.step = function(theta.current, y1) {
  # conditional probability that a cell-1 observation belongs to the theta/4 part
  bprob = (theta.current/4) / (0.5 + theta.current/4)
  zm = rbinom(10000, y1, bprob)   # m = 10000 Monte Carlo draws of the missing count
  y12 = sum(zm) / 10000           # empirical average approximating E[y12 | y1, theta]
  return(y12)
}
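One possible way to run the Monte Carlo version (not shown on the slides) is to reuse the same EM.Algo driver with this Monte Carlo E.step; a fixed seed makes the run reproducible, and the stopping tolerance should be looser than in the exact-EM run because the E-step is now random:

set.seed(2012)                           # illustrative seed, for reproducibility
EM.Algo(125, 18, 20, 34, 10^(-3), 0.50)  # tolerance no tighter than the Monte Carlo noise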

28 Applications of EM algorithm (1)
The EM algorithm is frequently used for:
• Data clustering (the assignment of a set of observations into subsets, called clusters, so that observations in the same cluster are similar in some sense), used in many fields including machine learning, computer vision, data mining, pattern recognition, image analysis, information retrieval, and bioinformatics
• Natural language processing (NLP is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages)

29 Applications of EM algorithm (2)
• Psychometrics (the field of study concerned with the theory and technique of educational and psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits)
• Medical image reconstruction, especially in positron emission tomography (PET) and single photon emission computed tomography (SPECT)

30 Applications of EM algorithm (3)
More applications regarding data analysis examples are:
• Multivariate Data with Missing Values
  o Example: Bivariate Normal Data with Missing Values
• Least Squares with Missing Data
  o Example: Linear Regression with Missing Dependent Values
  o Example: Missing Values in a Latin Square Design
• Example: Multinomial with Complex Cell Structure
• Example: Analysis of PET and SPECT Data
• Example: Mixture distributions
• Example: Grouped, Censored and Truncated Data
  o Example: Grouped Log Normal Data
  o Example: Lifetime distributions for censored data

31 Advantages of EM algorithm (1)
• The EM algorithm is numerically stable, with each EM iteration increasing the likelihood.
• Under fairly general conditions, the EM algorithm has reliable global convergence (depending on the initial value and the likelihood); convergence is nearly always to a local maximizer.
• The EM algorithm is typically easy to implement, because it relies on complete-data computations.
• The EM algorithm is generally easy to program, since neither the likelihood nor its derivatives need to be evaluated.

32 Advantages of EM algorithm (2)
• The EM algorithm requires little storage space and can generally be carried out on a small computer (it does not have to store the information matrix or its inverse at any iteration).
• The M-step can often be carried out using standard statistical packages in situations where the complete-data MLEs do not exist in closed form.
• By watching the monotone increase in the likelihood over iterations, it is easy to monitor convergence and to detect programming errors.
• The EM algorithm can be used to provide estimated values of the "missing" data.

33 Criticisms of EM algorithm
• Unlike Fisher's scoring method, it does not have a built-in procedure for producing an estimate of the covariance matrix of the parameter estimates.
• The EM algorithm may converge slowly, even in some seemingly innocuous problems, and in problems where there is too much 'incomplete information'.
• Like Newton-type methods, the EM algorithm does not guarantee convergence to the global maximum when there are multiple maxima (in this case, the estimate obtained depends on the initial value).
• In some problems the E-step may be analytically intractable, although in such situations there is the possibility of effecting it via a Monte Carlo approach.

34 References (1)
1. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J Royal Statist Soc B 39:1-38
2. Hartley HO (1958) Maximum likelihood estimation from incomplete data. Biometrics 14
3. Little RJA, Rubin DB (1987) Statistical Analysis with Missing Data. John Wiley & Sons, Inc., New York
4. Louis TA (1982) Finding the observed information matrix when using the EM algorithm. J Royal Statist Soc B 44:226-233
5. McLachlan GJ, Krishnan T (1997) The EM Algorithm and Extensions. John Wiley & Sons, Inc., New York

35 References (2)
6. Meng XL, Rubin DB (1991) Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm. J Am Statist Assoc 86
7. Oakes D (1999) Direct calculation of the information matrix via the EM algorithm. J Royal Statist Soc B 61
8. Rao CR (1972) Linear Statistical Inference and its Applications. John Wiley & Sons, Inc., New York
9. Redner RA, Walker HF (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev 26

36 References (3)
10. Wei GCG, Tanner MA (1990a) A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. J Am Statist Assoc 85
11. Wei GCG, Tanner MA (1990b) Posterior computations for censored regression data. J Am Statist Assoc 85
12. Wu CFJ (1983) On the convergence properties of the EM algorithm. Ann Statist 11:95-103

37 Thank You