Expectation-Maximization

Expectation-Maximization. Markoviana Reading Group, Fatih Gelgi, ASU, 2005.

Outline: What is EM? Intuitive explanation; example: Gaussian mixture; the algorithm; generalized EM; applications (HMM: Baum-Welch, K-means); discussion.

What is EM? It has two main applications: (1) the data has missing values, due to problems with or limitations of the observation process; (2) optimizing the likelihood function is extremely hard, but the likelihood function can be simplified by assuming the existence of, and values for, additional missing or hidden parameters.

Key idea: The observed data U is generated by some distribution and is called the incomplete data. Assume that a complete data set Z = (U, J) exists, where J is the missing or hidden data. The goal is to maximize the posterior probability of the parameters $\Theta$ given the data U, marginalizing over J:

$$\Theta^{*} = \arg\max_{\Theta} \sum_{J} P(\Theta, J \mid U).$$
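For a concrete reading of this notation, the Gaussian-mixture example used later in the deck can be mapped onto it roughly as follows (an illustration; the symbols $\pi_k, \mu_k, \Sigma_k$ are not defined in the slides themselves):

$$U = \{u_1,\dots,u_N\} \ (\text{observed points}), \qquad J = \{j_1,\dots,j_N\},\ j_n \in \{1,\dots,K\} \ (\text{unobserved component labels}), \qquad \Theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K} \ (\text{mixture weights, means, covariances}).$$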

Intuitive explanation of EM: Alternate between estimating the unknowns $\Theta$ and the hidden variables J. In each iteration, instead of finding the single best value of J, compute a distribution over the space of possible J. EM is a lower-bound maximization process (Minka, 1998): the E-step constructs a local lower bound to the posterior distribution, and the M-step optimizes that bound.
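In the notation introduced on the following slides (a compact restatement, with $B$ the bound, $f^t$ the distribution over hidden variables, and $\Theta^t$ the current guess):

$$\text{E-step: choose } f^t(J) \text{ to form a bound } B(\Theta;\Theta^t) \le \log P(U,\Theta) \text{ that is tight at } \Theta=\Theta^t; \qquad \text{M-step: } \Theta^{t+1} = \arg\max_{\Theta} B(\Theta;\Theta^t).$$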

Intuitive explanation of EM (figure): a lower-bound approximation method, which sometimes provides faster convergence than gradient descent and Newton's method.

Example: Mixture Components (figure).

Example (cont'd): True Likelihood of Parameters (figure).

Example (cont'd): Iterations of EM (figure).
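The figures on these example slides are not reproduced in the transcript; the following is a minimal, self-contained sketch of the iterations for a two-component one-dimensional Gaussian mixture (function and variable names are illustrative, not taken from the slides):

```python
import numpy as np

def em_gmm_1d(x, n_iter=50, seed=0):
    """Minimal EM for a 1-D mixture of two Gaussians (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initial guess Theta^0: mixture weights, means, variances.
    pi = np.array([0.5, 0.5])
    mu = rng.choice(x, size=2, replace=False)
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: responsibilities f^t(J), i.e. P(component k | x_n, Theta^t).
        dens = (pi / np.sqrt(2 * np.pi * var)) * \
               np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)   # shape (N, 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log-likelihood Q(Theta; Theta^t).
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# Example usage on synthetic data drawn from two Gaussians.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(em_gmm_1d(data))
```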

Lower-bound maximization. The quantity to maximize is the posterior probability of the parameters or, up to the constant $\log P(U)$, the logarithm of the joint distribution, $\log P(U,\Theta) = \log \sum_J P(U, J, \Theta)$; this is difficult because the logarithm acts on a sum over the hidden variables. Idea: start with a guess $\Theta^t$, compute an easily computed lower bound $B(\Theta; \Theta^t)$ to the function $\log P(\Theta \mid U)$, and maximize the bound instead.

Lower-bound maximization (cont.): Construct a tractable lower bound $B(\Theta; \Theta^t)$ that contains a sum of logarithms rather than the logarithm of a sum. Let $f^t(J)$ be an arbitrary probability distribution over J. By Jensen's inequality,

$$\log P(U,\Theta) = \log \sum_J f^t(J)\,\frac{P(U,J,\Theta)}{f^t(J)} \;\ge\; \sum_J f^t(J)\,\log \frac{P(U,J,\Theta)}{f^t(J)} \;=:\; B(\Theta;\Theta^t).$$

Optimal bound: $B(\Theta; \Theta^t)$ should touch the objective function $\log P(U,\Theta)$ at $\Theta = \Theta^t$. Maximize $B(\Theta^t; \Theta^t)$ with respect to $f^t(J)$, introducing a Lagrange multiplier $\lambda$ to enforce the constraint $\sum_J f^t(J) = 1$:

$$\mathcal{L} = \sum_J f^t(J)\,\log \frac{P(U,J,\Theta^t)}{f^t(J)} + \lambda\Big(1 - \sum_J f^t(J)\Big).$$

Optimal bound (cont.): Setting the derivative with respect to $f^t(J)$ to zero,

$$\frac{\partial \mathcal{L}}{\partial f^t(J)} = \log P(U,J,\Theta^t) - \log f^t(J) - 1 - \lambda = 0,$$

the bound is maximized at

$$f^t(J) = P(J \mid U, \Theta^t).$$

Maximizing the bound: Rewrite $B(\Theta;\Theta^t)$ in terms of expectations under $f^t(J) = P(J \mid U, \Theta^t)$:

$$B(\Theta;\Theta^t) = \mathbb{E}_{J \mid U,\Theta^t}\big[\log P(U,J,\Theta)\big] - \mathbb{E}_{J \mid U,\Theta^t}\big[\log f^t(J)\big],$$

where $Q(\Theta;\Theta^t) := \mathbb{E}_{J \mid U,\Theta^t}\big[\log P(U,J,\Theta)\big]$ is the only term that depends on $\Theta$. Finally,

$$\Theta^{t+1} = \arg\max_{\Theta} B(\Theta;\Theta^t) = \arg\max_{\Theta} Q(\Theta;\Theta^t).$$

EM algorithm: EM converges to a local maximum of $\log P(U,\Theta)$, and hence to a local maximum of $\log P(\Theta \mid U)$, since the two differ only by the constant $\log P(U)$.
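Putting the pieces together, the iteration can be summarized as (a restatement of the preceding slides):

$$\text{E-step: } f^t(J) = P(J \mid U, \Theta^t), \quad Q(\Theta;\Theta^t) = \mathbb{E}_{J \mid U,\Theta^t}\big[\log P(U,J,\Theta)\big]; \qquad \text{M-step: } \Theta^{t+1} = \arg\max_{\Theta} Q(\Theta;\Theta^t);$$

repeated until $\log P(U,\Theta^t)$ stops increasing.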

A relation to the log-posterior: An alternative is to maximize the expected log-posterior,

$$\Theta^{t+1} = \arg\max_{\Theta}\; \mathbb{E}_{J \mid U,\Theta^t}\big[\log P(\Theta, J \mid U)\big],$$

which is the same maximization with respect to $\Theta$, since $\log P(\Theta, J \mid U) = \log P(U,J,\Theta) - \log P(U)$ and $\log P(U)$ does not depend on $\Theta$.

Generalized EM: Assume $\log P(\Theta \mid U)$ and the bound $B$ are differentiable in $\Theta$. EM then converges to a stationary point, i.e. a point where $\partial \log P(\Theta \mid U) / \partial \Theta = 0$. GEM: instead of setting $\Theta^{t+1} = \arg\max_{\Theta} B(\Theta;\Theta^t)$, just find a $\Theta^{t+1}$ such that $B(\Theta^{t+1};\Theta^t) > B(\Theta^t;\Theta^t)$. GEM is also guaranteed to converge.
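A tiny sketch of the GEM idea in Python (the helper q_grad, returning the gradient of $Q(\cdot;\Theta^t)$, is hypothetical and assumed to be supplied by the model):

```python
import numpy as np

def gem_partial_m_step(theta_t, q_grad, step=1e-2):
    """GEM-style partial M-step: take one gradient-ascent step on Q(., theta_t)
    instead of a full argmax; for a small enough step this still increases the bound."""
    return np.asarray(theta_t) + step * np.asarray(q_grad(theta_t))
```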

HMM – Baum-Welch revisited: Estimate the parameters $(a, b, \pi)$ such that the expected number of correct individual states is maximized. $\gamma_t(i)$ is the probability of being in state $S_i$ at time $t$; $\xi_t(i,j)$ is the probability of being in state $S_i$ at time $t$ and in state $S_j$ at time $t+1$.

Baum-Welch: E-step.
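In standard forward-backward notation (forward variables $\alpha_t(i)$, backward variables $\beta_t(i)$, observation sequence $O$), the E-step computes:

$$\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j} \alpha_t(j)\,\beta_t(j)}, \qquad \xi_t(i,j) = \frac{\alpha_t(i)\,a_{ij}\,b_j(O_{t+1})\,\beta_{t+1}(j)}{\sum_{k}\sum_{l} \alpha_t(k)\,a_{kl}\,b_l(O_{t+1})\,\beta_{t+1}(l)}.$$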

Baum-Welch: M-step.
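The M-step then re-estimates the parameters with the usual Baum-Welch update formulas (written here for a discrete observation alphabet $\{v_k\}$):

$$\bar{\pi}_i = \gamma_1(i), \qquad \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad \bar{b}_j(k) = \frac{\sum_{t\,:\,O_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}.$$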

K-means. Problem: given data X and the number of clusters K, find the clusters. Clustering is based on centroids: a point belongs to the cluster with the closest centroid. In EM terms, the cluster assignments play the role of the hidden variables J, and the centroids of the clusters play the role of the parameters $\Theta$.

K-means (cont.): Start with initial centroids $\Theta^0$. E-step: split the data into K clusters according to the distances to the centroids (this plays the role of computing the distribution $f^t(J)$, here a hard assignment). M-step: update the centroids (compute $\Theta^{t+1}$).
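A minimal K-means sketch in this E-step/M-step form (illustrative; the function and variable names are not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means in EM form: hard E-step (assignments), M-step (centroid update)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # Theta^0
    for _ in range(n_iter):
        # E-step: assign each point to the nearest centroid (hard version of f^t(J)).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points (Theta^{t+1}).
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels
```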

K-means example (K=2), figure: pick seeds; reassign clusters; compute centroids; iterate until converged.

Discussion: Is EM a primal-dual algorithm?

References:
A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1 (1977), pp. 1-38.
F. Dellaert, "The Expectation Maximization Algorithm", Tech. Rep. GIT-GVU-02-20, Georgia Institute of Technology, 2002.
T. Minka, "Expectation-Maximization as Lower Bound Maximization", 1998.
Y. Chang and M. Kölsch, presentation: "Expectation Maximization", UCSB, 2002.
K. Andersson, presentation: "Model Optimization Using the EM Algorithm", COSC 7373, 2001.

Thanks!