Maximum Likelihood and Expectation Maximization. Lecture Notes for CMPUT 466/551. Nilanjan Ray
MLE and EM. Maximum Likelihood Estimation (MLE) and Expectation Maximization (EM) are two very important tools in machine learning. Essentially, you use them to estimate probability distributions in a learning algorithm; we have already seen one such example: in logistic regression we used MLE. We will revisit MLE here and see certain difficulties with it. Then Expectation Maximization will rescue us.
Probability Density Estimation: Quick Points. Two different routes. Parametric: provide a parametrized class of density functions. Tools: –Maximum likelihood estimation –Expectation Maximization –Sampling techniques –… Non-parametric: the density is modeled by samples. Tools: –Kernel methods –Sampling techniques –…
Revisiting Maximum Likelihood. The data come from a known probability distribution. The probability distribution has some parameters that are unknown to you. Example: the data are distributed as Gaussian, $y_i \sim N(\mu, \sigma^2)$, so the unknown parameters here are $\theta = (\mu, \sigma^2)$. MLE is a tool that estimates the unknown parameters of the probability distribution from data.
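MLE: Recapitulation. Assume the observations $y_i$, $i = 1, \ldots, N$, are independent. Form the likelihood: $L(\theta; y) = \prod_{i=1}^{N} p(y_i; \theta)$. Form the log-likelihood: $\ell(\theta; y) = \sum_{i=1}^{N} \log p(y_i; \theta)$. To find the unknown parameter values, maximize the log-likelihood with respect to the unknown parameters: $\hat{\theta} = \arg\max_{\theta} \ell(\theta; y)$.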
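For the Gaussian example, the maximization can be carried out in closed form. The sketch below is an illustration only: the function names `gaussian_log_likelihood` and `gaussian_mle` and the toy data are assumptions, not part of the lecture. It computes the log-likelihood of independent observations and the maximizing values of the mean and variance (the sample mean and the 1/N sample variance).

```python
import math

def gaussian_log_likelihood(ys, mu, sigma2):
    """Log-likelihood of independent observations ys under N(mu, sigma2)."""
    n = len(ys)
    squared_error = sum((y - mu) ** 2 for y in ys)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - squared_error / (2.0 * sigma2)

def gaussian_mle(ys):
    """Closed-form maximizer: sample mean and 1/N (biased) sample variance."""
    n = len(ys)
    mu = sum(ys) / n
    sigma2 = sum((y - mu) ** 2 for y in ys) / n
    return mu, sigma2

# Toy data, chosen only for illustration.
ys = [2.1, 1.9, 3.2, 2.8, 2.5]
mu_hat, sigma2_hat = gaussian_mle(ys)
```

Evaluating `gaussian_log_likelihood` at `(mu_hat, sigma2_hat)` and at nearby parameter values is a quick way to check that the closed-form estimates do sit at the maximum.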
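MLE: A Challenging Example. Observation data: histogram. The indicator variable $\Delta \in \{0, 1\}$ selects the component: $\pi = \Pr(\Delta = 1)$ is the probability with which the observation is chosen from density 2, and $(1 - \pi)$ is the probability with which the observation is chosen from density 1. Mixture model: $g_Y(y) = (1 - \pi)\,\phi_{\theta_1}(y) + \pi\,\phi_{\theta_2}(y)$, where $\phi_{\theta_j}$ is the Gaussian density with parameters $\theta_j = (\mu_j, \sigma_j^2)$. Source: Department of Statistics, CMU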
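MLE: A Challenging Example (continued). Maximum likelihood fitting for the parameters $\theta = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$: maximize $\ell(\theta; y) = \sum_{i=1}^{N} \log\left[(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)\right]$, where $\phi_{\theta_j}$ is the Gaussian density of component $j$. Because the sum over components sits inside the logarithm, this is challenging to solve numerically, and of course analytically too.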
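Expectation Maximization: A Rescuer. EM augments the data space: it assumes some latent data. Here the latent data are indicators $\Delta_i \in \{0, 1\}$, with $\Delta_i = 1$ if $y_i$ comes from density 2 and $\Delta_i = 0$ if it comes from density 1. Source: Department of Statistics, CMU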
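EM: A Rescuer (continued). If the latent indicators $\Delta_i$ were known, the complete-data log-likelihood would be $\ell_0(\theta; y, \Delta) = \sum_{i=1}^{N} \left[(1 - \Delta_i) \log \phi_{\theta_1}(y_i) + \Delta_i \log \phi_{\theta_2}(y_i)\right] + \sum_{i=1}^{N} \left[(1 - \Delta_i) \log(1 - \pi) + \Delta_i \log \pi\right]$. Maximizing this form of log-likelihood is now tractable. Note that we cannot analytically maximize the observed-data log-likelihood, in which the logarithm acts on a sum over the two components. Source: Department of Statistics, CMU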
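EM: The Complete Data Likelihood. By simple differentiations we have: $\hat{\mu}_1 = \frac{\sum_i (1 - \Delta_i)\, y_i}{\sum_i (1 - \Delta_i)}$, $\hat{\sigma}_1^2 = \frac{\sum_i (1 - \Delta_i)(y_i - \hat{\mu}_1)^2}{\sum_i (1 - \Delta_i)}$, $\hat{\mu}_2 = \frac{\sum_i \Delta_i\, y_i}{\sum_i \Delta_i}$, $\hat{\sigma}_2^2 = \frac{\sum_i \Delta_i (y_i - \hat{\mu}_2)^2}{\sum_i \Delta_i}$, $\hat{\pi} = \frac{1}{N} \sum_i \Delta_i$. So, maximization of the complete data likelihood is much easier! How do we get the latent variables?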
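Obtaining Latent Variables. The latent variables are computed as expected values given the data and parameters: $\gamma_i(\theta) = E(\Delta_i \mid \theta, y_i) = \Pr(\Delta_i = 1 \mid \theta, y_i)$. Apply Bayes' rule: $\gamma_i(\theta) = \frac{\pi\,\phi_{\theta_2}(y_i)}{(1 - \pi)\,\phi_{\theta_1}(y_i) + \pi\,\phi_{\theta_2}(y_i)}$, where $\phi_{\theta_j}$ is the Gaussian density of component $j$ and $\pi$ is the probability of component 2.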
EM for Two-component Gaussian Mixture. Initialize $\mu_1, \sigma_1, \mu_2, \sigma_2, \pi$. Iterate until convergence: –Expectation of latent variables –Maximization for finding parameters
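The two steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the lecture's own implementation: the function names, the initialization (means at the smallest and largest observations, both variances at the overall variance, mixing probability 0.5), the stopping tolerance, and the synthetic data are all choices made here for the example. The E-step computes, via Bayes' rule, the expected value of each latent indicator (the probability that the observation came from density 2); the M-step recomputes the parameters as averages weighted by those values.

```python
import math
import random

def normal_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at y."""
    return math.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def log_likelihood(ys, mu1, s1, mu2, s2, pi):
    """Observed-data log-likelihood of the two-component mixture."""
    return sum(math.log((1 - pi) * normal_pdf(y, mu1, s1) + pi * normal_pdf(y, mu2, s2))
               for y in ys)

def em_two_gaussians(ys, n_iter=200, tol=1e-8):
    """EM for a two-component 1-D Gaussian mixture; s1 and s2 are variances."""
    n = len(ys)
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n
    # Assumed initialization, for illustration only.
    mu1, mu2, s1, s2, pi = min(ys), max(ys), var, var, 0.5
    history = [log_likelihood(ys, mu1, s1, mu2, s2, pi)]
    for _ in range(n_iter):
        # E-step: expected value of each latent indicator given current parameters.
        gam = []
        for y in ys:
            p1 = (1 - pi) * normal_pdf(y, mu1, s1)
            p2 = pi * normal_pdf(y, mu2, s2)
            gam.append(p2 / (p1 + p2))
        # M-step: weighted means, weighted variances, and mixing probability.
        w2 = sum(gam)
        w1 = n - w2
        mu1 = sum((1 - g) * y for g, y in zip(gam, ys)) / w1
        mu2 = sum(g * y for g, y in zip(gam, ys)) / w2
        s1 = sum((1 - g) * (y - mu1) ** 2 for g, y in zip(gam, ys)) / w1
        s2 = sum(g * (y - mu2) ** 2 for g, y in zip(gam, ys)) / w2
        pi = w2 / n
        history.append(log_likelihood(ys, mu1, s1, mu2, s2, pi))
        # Stop once the log-likelihood no longer improves noticeably.
        if history[-1] - history[-2] < tol:
            break
    return (mu1, s1, mu2, s2, pi), history

# Synthetic data: 300 draws from N(0, 1) and 200 draws from N(5, 1).
rng = random.Random(0)
ys = [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(5.0, 1.0) for _ in range(200)]
params, history = em_two_gaussians(ys)
```

The `history` list records the observed-data log-likelihood after every iteration, which makes it easy to check that each EM iteration does not decrease it. The sketch does not guard against a component collapsing onto a single point (a variance shrinking toward zero), which a more careful implementation would need to handle.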
EM for Mixture of K Gaussians. Initialize mean vectors, covariance matrices, and mixing probabilities: $\mu_k, \Sigma_k, \pi_k$, $k = 1, 2, \ldots, K$. Expectation step: compute responsibilities. Maximization step: update parameters. Iterate the Expectation and Maximization steps until convergence.
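The slide's own formulas are not preserved in this transcript, so the block below writes out the standard updates for a mixture of K Gaussians as a reference sketch. The notation is an assumption made here: $y_i$ are the observations, $\gamma_{ik}$ is the responsibility of component $k$ for observation $i$, $N(\cdot \mid \mu_k, \Sigma_k)$ is the multivariate Gaussian density, and $N_k$ is the effective number of observations assigned to component $k$.

```latex
% E-step: responsibility of component k for observation y_i (Bayes' rule)
\gamma_{ik} = \frac{\pi_k \, N(y_i \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j \, N(y_i \mid \mu_j, \Sigma_j)}

% M-step: responsibility-weighted updates, with N_k = \sum_{i=1}^{N} \gamma_{ik}
\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, y_i, \qquad
\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} \, (y_i - \mu_k)(y_i - \mu_k)^{\top}, \qquad
\pi_k = \frac{N_k}{N}
```

With $K = 2$ and scalar observations these reduce to the two-component updates, where the single mixing probability plays the role of $\pi_2$ and $1 - \pi$ the role of $\pi_1$.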
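EM Algorithm in General. $T = (Z, Z^m)$ is the complete data; we only observe $Z$, and $Z^m$ is missing. Since $\Pr(Z \mid \theta') = \Pr(T \mid \theta') / \Pr(Z^m \mid Z, \theta')$, taking the logarithm gives $\ell(\theta'; Z) = \ell_0(\theta'; T) - \ell_1(\theta'; Z^m \mid Z)$. Because we have access to the previous parameter values $\theta$, we can do better: taking the conditional expectation given $Z$ and $\theta$ gives $\ell(\theta'; Z) = Q(\theta', \theta) - R(\theta', \theta)$, with $Q(\theta', \theta) = E[\ell_0(\theta'; T) \mid Z, \theta]$ and $R(\theta', \theta) = E[\ell_1(\theta'; Z^m \mid Z) \mid Z, \theta]$. Let us now consider the expression $\ell(\theta'; Z) - \ell(\theta; Z) = [Q(\theta', \theta) - Q(\theta, \theta)] - [R(\theta', \theta) - R(\theta, \theta)]$. It can be shown that $R(\theta', \theta) \le R(\theta, \theta)$ for every $\theta'$. Thus if $\theta'$ maximizes $Q(\theta', \theta)$, then $\ell(\theta'; Z) \ge \ell(\theta; Z)$. The bound on $R$ follows from Jensen's inequality.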
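The Jensen step can be written out as follows. This is a sketch under assumed notation, since the slide's equations are not preserved: $Q$ and $R$ are the conditional expectations of the complete-data log-likelihood $\ell_0$ and of the log conditional density $\ell_1$ of the missing data, both taken given $Z$ under the previous parameter values $\theta$.

```latex
% Assumed definitions
Q(\theta', \theta) = E\left[\ell_0(\theta'; T) \mid Z, \theta\right], \qquad
R(\theta', \theta) = E\left[\ell_1(\theta'; Z^m \mid Z) \mid Z, \theta\right], \qquad
\ell(\theta'; Z) = Q(\theta', \theta) - R(\theta', \theta)

% Jensen's inequality, using that log is concave
R(\theta', \theta) - R(\theta, \theta)
  = E\left[ \log \frac{\Pr(Z^m \mid Z, \theta')}{\Pr(Z^m \mid Z, \theta)} \,\middle|\, Z, \theta \right]
  \le \log E\left[ \frac{\Pr(Z^m \mid Z, \theta')}{\Pr(Z^m \mid Z, \theta)} \,\middle|\, Z, \theta \right]
  = \log \int \Pr(z^m \mid Z, \theta') \, dz^m
  \le \log 1 = 0

% Consequence: increasing Q cannot decrease the observed-data log-likelihood
\ell(\theta'; Z) - \ell(\theta; Z)
  = \left[Q(\theta', \theta) - Q(\theta, \theta)\right] - \left[R(\theta', \theta) - R(\theta, \theta)\right]
  \ge Q(\theta', \theta) - Q(\theta, \theta)
```

The integral runs over the values of $z^m$ that have positive probability under $\theta$, which is why it is bounded by 1 rather than always equal to 1.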
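EM Algorithm in General. Start with initial parameter values $\theta^{(0)}$; set $t = 1$. Expectation step: compute $Q(\theta', \theta^{(t-1)}) = E[\ell_0(\theta'; T) \mid Z, \theta^{(t-1)}]$, where $\ell_0$ is the complete-data log-likelihood. Maximization step: $\theta^{(t)} = \arg\max_{\theta'} Q(\theta', \theta^{(t-1)})$. Set $t = t + 1$ and iterate.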
EM Algorithm: Summary. Augment the original data space with latent/hidden/missing data. Frame a suitable probability model for the augmented data space. In the EM iterations, first assume initial values for the parameters, then iterate the Expectation and Maximization steps. In the Expectation step, find the expected values of the latent variables (here you need to use the current parameter values). In the Maximization step, first plug the expected values of the latent variables into the log-likelihood of the augmented data, then maximize this log-likelihood to re-estimate the parameters. Iterate the last two steps until convergence.
Applications of EM –Mixture models –HMMs –PCA –Latent variable models –Missing data problems –many computer vision problems –…
References. The EM Algorithm and Extensions, by Geoffrey J. McLachlan and Thriyambakam Krishnan. For a non-parametric density estimate by EM, look at: http://bioinformatics.uchc.edu/LectureNotes_2006/Tools_EM_SA_2006_files/frame.htm
EM: Important Issues. Is the convergence of the algorithm guaranteed? Does the outcome of EM depend on the initial choice of the parameter values? How about the speed of convergence? How easy or difficult could it be to compute the expected values of the latent variables?