The EM algorithm (Lecture #11). Acknowledgement: some slides of this lecture are due to Nir Friedman.

2 Expectation Maximization (EM) for Bayesian networks

Intuition (as before):
- When we have access to all counts, we can find the ML estimate of all parameters in all local tables directly by counting.
- However, missing values do not allow us to perform such counts.
- So instead, we compute the expected counts using the current parameter assignment, and then use them to compute the maximum likelihood estimate.

[Example network over A, B, C, D with local tables P(A=a|θ), P(B=b|A=a,θ), P(C=c|A=a,θ), P(D=d|b,c,θ).]

3 Expectation Maximization (EM)

[Example: data over variables X, Y, Z in which X is fully observed ("HTHHTHTHHT") while Y and Z contain missing entries ("?"). Using the current parameters, e.g. P(Y=H|X=T,θ) = 0.4 and P(Y=H|X=H,Z=T,θ) = 0.3, the missing entries are filled in fractionally to produce expected count tables N(X,Y) and N(X,Z).]
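A minimal sketch of how such expected counts are accumulated, assuming for illustration that Y depends only on X; the table p_y_given_x and the data rows below are hypothetical current parameters and observations, not the slide's exact numbers. Observed rows contribute whole counts, rows with Y missing contribute fractional counts P(Y=y|X=x,θ').

```python
from collections import defaultdict

# Hypothetical current parameters: P(Y=H | X=x, theta') for a model where Y depends only on X
# (e.g. P(Y=H|X=T,theta') = 0.4 as on the slide).
p_y_given_x = {"H": 0.3, "T": 0.4}

# Data: X fully observed, Y partially observed (None = missing), as in the slide's table.
data = [("H", "H"), ("T", None), ("H", "T"), ("T", "T"), ("H", None)]

expected_counts = defaultdict(float)   # expected counts N(X=x, Y=y)
for x, y in data:
    if y is not None:                  # fully observed row: add a whole count
        expected_counts[(x, y)] += 1.0
    else:                              # missing Y: distribute the count over both values
        p_h = p_y_given_x[x]
        expected_counts[(x, "H")] += p_h
        expected_counts[(x, "T")] += 1.0 - p_h

print(dict(expected_counts))
```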

4 EM (cont.)

[Diagram: starting from an initial network (G, θ') over X1, X2, X3, Y, Z1, Z2, Z3 and the training data, the E-step computes the expected counts N(X1), N(X2), N(X3), N(Y, X1, X2, X3), N(Z1, Y), N(Z2, Y), N(Z3, Y); the M-step reparameterizes to obtain the updated network (G, θ); then reiterate.]

Note: this EM iteration corresponds to the non-homogeneous HMM iteration. When parameters are shared across local probability tables or are functions of each other, changes are needed.

5 EM in Practice

Initial parameters:
- Random parameter setting
- "Best" guess from another source

Stopping criteria:
- Small change in the likelihood of the data
- Small change in parameter values

Avoiding bad local maxima:
- Multiple restarts
- Early "pruning" of unpromising runs
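A schematic driver reflecting these practical choices, sketched under the assumption that the model supplies `e_step`, `m_step`, `log_likelihood`, and `random_init` callables (all hypothetical names): it stops on a small likelihood change and keeps the best of several random restarts.

```python
import numpy as np

def run_em(e_step, m_step, log_likelihood, init_params, max_iter=200, tol=1e-6):
    """Run EM from one starting point until the log-likelihood change is small."""
    params = init_params
    prev_ll = ll = log_likelihood(params)
    for _ in range(max_iter):
        expected_counts = e_step(params)        # E-step: expected sufficient statistics
        params = m_step(expected_counts)        # M-step: re-estimate parameters
        ll = log_likelihood(params)
        if abs(ll - prev_ll) < tol:             # stopping criterion: small change in likelihood
            break
        prev_ll = ll
    return params, ll

def em_with_restarts(e_step, m_step, log_likelihood, random_init, n_restarts=10, seed=0):
    """Multiple random restarts; keep the run that reaches the highest likelihood."""
    rng = np.random.default_rng(seed)
    best_params, best_ll = None, -np.inf
    for _ in range(n_restarts):
        params, ll = run_em(e_step, m_step, log_likelihood, random_init(rng))
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll
```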

6 Relative Entropy – a measure of difference between distributions

We define the relative entropy H(P||Q) for two probability distributions P and Q of a variable X (with x being a value of X) as

H(P||Q) = Σ_x P(x) log₂( P(x) / Q(x) )

This is a measure of the difference between P(x) and Q(x). It is not a symmetric function. The distribution P(x) is assumed to be the "true" distribution and is used for taking the expectation of the log ratio. The following property holds:

H(P||Q) ≥ 0, with equality if and only if P(x) = Q(x) for all x.
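A small sketch computing H(P||Q) in bits for two made-up distributions over the same sample space, illustrating nonnegativity, asymmetry, and the equality condition:

```python
import math

def relative_entropy(p, q):
    """H(P||Q) = sum_x P(x) * log2(P(x)/Q(x)); terms with P(x)=0 contribute 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.25, "b": 0.25, "c": 0.5}

print(relative_entropy(p, q))   # >= 0
print(relative_entropy(q, p))   # generally different: H(P||Q) != H(Q||P)
print(relative_entropy(p, p))   # == 0 exactly when P = Q
```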

7 Average Score for sequence comparisons

Recall that we have defined the scoring function via

s(a,b) = log₂( P(a,b) / (Q(a)Q(b)) )

so the average score is Σ_{a,b} P(a,b) s(a,b). Note that the average score is the relative entropy H(P||Q) where Q(a,b) = Q(a) Q(b). Relative entropy also arises when choosing amongst competing models.
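A sketch of this connection with a made-up joint distribution P(a,b) over a two-letter alphabet and background frequencies Q(a): the expected log-odds score under P coincides with H(P||Q) when Q(a,b) = Q(a)Q(b).

```python
import math

# Hypothetical target (joint) and background (independent) distributions over letter pairs.
P = {("x", "x"): 0.4, ("x", "y"): 0.1, ("y", "x"): 0.1, ("y", "y"): 0.4}
Q = {"x": 0.5, "y": 0.5}

def score(a, b):
    """Log-odds score s(a,b) = log2( P(a,b) / (Q(a)Q(b)) )."""
    return math.log2(P[(a, b)] / (Q[a] * Q[b]))

avg_score = sum(p_ab * score(a, b) for (a, b), p_ab in P.items())
kl = sum(p_ab * math.log2(p_ab / (Q[a] * Q[b])) for (a, b), p_ab in P.items())
print(avg_score, kl)   # identical: the average score is H(P || Q(a)Q(b))
```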

8 The setup of the EM algorithm

We start with a likelihood function parameterized by θ. The observed quantity is denoted X = x. It is often a vector x1,…,xL of observations (e.g., evidence for some nodes in a Bayesian network). The hidden quantity is a vector Y = y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x,y|θ) would be easy to maximize.

The log-likelihood of an observation x has the form:

log P(x|θ) = log P(x,y|θ) − log P(y|x,θ)

(because P(x,y|θ) = P(x|θ) P(y|x,θ)).

9 The goal of the EM algorithm

The log-likelihood of ONE observation x has the form:

log P(x|θ) = log P(x,y|θ) − log P(y|x,θ)

The goal: starting with a current parameter vector θ', EM seeks a new vector θ such that P(x|θ) > P(x|θ'), with the largest possible improvement.

The result: after enough iterations, EM reaches a local maximum of the likelihood P(x|θ).

10 The Expectation Operator

Recall that the expectation of a random variable Y with probability distribution p(y) is given by E[Y] = Σ_y y p(y). The expectation of a function L(Y) is given by E[L(Y)] = Σ_y p(y) L(y).

An example used by the EM algorithm:

E_θ'[log p(x,y|θ)] = Σ_y p(y|x,θ') log p(x,y|θ) ≡ Q(θ|θ')

The expectation operator E is linear: for two random variables X, Y and constants a, b, we have E[aX+bY] = a E[X] + b E[Y].

11 Improving the likelihood

Starting with log P(x|θ) = log P(x,y|θ) − log P(y|x,θ), multiplying both sides by P(y|x,θ'), and summing over y, yields

log P(x|θ) = Σ_y P(y|x,θ') log P(x,y|θ) − Σ_y P(y|x,θ') log P(y|x,θ)

where the first term is E_θ'[log P(x,y|θ)] = Q(θ|θ'). We now observe that

Δ = log P(x|θ) − log P(x|θ') = Q(θ|θ') − Q(θ'|θ') + Σ_y P(y|x,θ') log [P(y|x,θ') / P(y|x,θ)]

The last sum is a relative entropy, hence ≥ 0, so Δ ≥ Q(θ|θ') − Q(θ'|θ'). Choosing θ* = argmax_θ Q(θ|θ') therefore maximizes this lower bound and guarantees Δ ≥ 0, and repeating the process leads to a local maximum of log P(x|θ).
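A numerical check of this bound on a toy model, chosen only for illustration (a hypothetical two-component mixture with a hidden indicator y and fixed emission probabilities b): for any θ, the true improvement Δ is at least Q(θ|θ') − Q(θ'|θ').

```python
import math

# Toy mixture (hypothetical): hidden y ~ Bernoulli(theta), observed x | y ~ Bernoulli(b[y]).
b = {0: 0.2, 1: 0.9}
x = 1                                   # a single observation

def p_xy(x, y, theta):                  # complete-data likelihood P(x, y | theta)
    p_y = theta if y == 1 else 1.0 - theta
    p_x_given_y = b[y] if x == 1 else 1.0 - b[y]
    return p_y * p_x_given_y

def p_x(x, theta):                      # observed-data likelihood P(x | theta)
    return sum(p_xy(x, y, theta) for y in (0, 1))

def Q(theta, theta_prime):              # Q(theta|theta') = E_{y ~ P(y|x,theta')}[log P(x,y|theta)]
    return sum(p_xy(x, y, theta_prime) / p_x(x, theta_prime) * math.log(p_xy(x, y, theta))
               for y in (0, 1))

theta_prime = 0.3
for theta in (0.1, 0.3, 0.5, 0.7, 0.9):
    delta = math.log(p_x(x, theta)) - math.log(p_x(x, theta_prime))
    bound = Q(theta, theta_prime) - Q(theta_prime, theta_prime)
    assert delta >= bound - 1e-12       # Delta is never smaller than the Q-improvement
    print(f"theta={theta:.1f}  Delta={delta:+.4f}  bound={bound:+.4f}")
```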

12 The EM algorithm

Input: a likelihood function P(x,y|θ) parameterized by θ.
Initialization: fix an arbitrary starting value θ'.
Repeat:
  E-step: compute Q(θ|θ') = E_θ'[log P(x,y|θ)]
  M-step: θ' ← argmax_θ Q(θ|θ')
Until Δ = log P(x|θ) − log P(x|θ') < ε

Comment: at the M-step one can actually choose any θ that improves Q (so that Δ > 0), not necessarily the argmax. This change yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute.
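A compact, self-contained instance of these two steps, sketched on a two-coin mixture (not the lecture's example): each draw picks coin 0 or coin 1 with probabilities (1−λ, λ) and flips it; λ and the two biases are estimated from the flips alone, with the identity of the coin hidden. All numbers below are made up for illustration.

```python
import math
import random

random.seed(1)
true_lam, true_p = 0.6, (0.2, 0.8)
# Simulated observations: which coin was used is hidden, only the flip is seen.
flips = [int(random.random() < true_p[int(random.random() < true_lam)]) for _ in range(2000)]

def log_likelihood(lam, p):
    return sum(math.log((1 - lam) * p[0] ** x * (1 - p[0]) ** (1 - x)
                        + lam * p[1] ** x * (1 - p[1]) ** (1 - x)) for x in flips)

lam, p = 0.5, [0.3, 0.6]                      # arbitrary starting point theta'
prev_ll = log_likelihood(lam, p)
while True:
    # E-step: posterior responsibility P(coin = 1 | x, theta') for every flip.
    resp = []
    for x in flips:
        w0 = (1 - lam) * p[0] ** x * (1 - p[0]) ** (1 - x)
        w1 = lam * p[1] ** x * (1 - p[1]) ** (1 - x)
        resp.append(w1 / (w0 + w1))
    # M-step: re-estimate theta from the expected counts.
    lam = sum(resp) / len(flips)
    p[1] = sum(r * x for r, x in zip(resp, flips)) / sum(resp)
    p[0] = sum((1 - r) * x for r, x in zip(resp, flips)) / sum(1 - r for r in resp)
    ll = log_likelihood(lam, p)
    if ll - prev_ll < 1e-8:                   # Delta below epsilon: stop
        break
    prev_ll = ll

print(f"lambda={lam:.3f}  p0={p[0]:.3f}  p1={p[1]:.3f}  (a local maximum of the likelihood)")
```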

13 The EM algorithm (with multiple independent samples)

Recall that the log-likelihood of an observation x has the form:

log P(x|θ) = log P(x,y|θ) − log P(y|x,θ)

For independent samples (x_i, y_i), i = 1,…,m, we can write:

Σ_i log P(x_i|θ) = Σ_i log P(x_i,y_i|θ) − Σ_i log P(y_i|x_i,θ)

E-step: compute Q(θ|θ') = E_θ'[Σ_i log P(x_i,y_i|θ)] = Σ_i E_θ'[log P(x_i,y_i|θ)], so each sample is completed separately.
M-step: θ' ← argmax_θ Q(θ|θ')

14 MLE from Incomplete Data

- Finding MLE parameters is a nonlinear optimization problem.
- Expectation Maximization (EM): use the "current point" θ' to construct an alternative function, E_θ'[log P(x,y|θ)], which is "nice" to maximize.
- Guarantee: the maximum of the new function has a higher likelihood than the current point.

[Figure: the log-likelihood curve log P(x|θ) together with the surrogate E_θ'[log P(x,y|θ)] constructed at the current point.]

15 Gene Counting Revisited (as EM)

The observations: the variables X = (N_A, N_B, N_AB, N_O) with a specific assignment x = (n_A, n_B, n_AB, n_O).
The hidden quantity: the variables Y = (N_a/a, N_a/o, N_b/b, N_b/o) with a specific assignment y = (n_a/a, n_a/o, n_b/b, n_b/o).
The parameters: θ = {θ_a, θ_b, θ_o}.

The likelihood of the completed data of n points:

P(x,y|θ) = P(n_AB, n_O, n_a/a, n_a/o, n_b/b, n_b/o | θ)
         ∝ (θ_a²)^{n_a/a} (2θ_a θ_o)^{n_a/o} (θ_b²)^{n_b/b} (2θ_b θ_o)^{n_b/o} (2θ_a θ_b)^{n_AB} (θ_o²)^{n_O}

16 The E-step of Gene Counting

The likelihood of the hidden data given the observed data of n points factors as:

P(y|x,θ') = P(n_a/a, n_a/o, n_b/b, n_b/o | n_A, n_B, n_AB, n_O, θ')
          = P(n_a/a, n_a/o | n_A, θ'_a, θ'_o) · P(n_b/b, n_b/o | n_B, θ'_b, θ'_o)

This is exactly the E-step we used earlier!

17 The M-step of Gene Counting

The log-likelihood of the completed data of n points (up to an additive constant):

log P(x,y|θ) = n_a/a log(θ_a²) + n_a/o log(2θ_a θ_o) + n_b/b log(θ_b²) + n_b/o log(2θ_b θ_o) + n_AB log(2θ_a θ_b) + n_O log(θ_o²)

Taking expectation with respect to Y = (N_a/a, N_a/o, N_b/b, N_b/o) and using linearity of E yields the function Q(θ|θ') which we need to maximize:

Q(θ|θ') = E_θ'[N_a/a] log(θ_a²) + E_θ'[N_a/o] log(2θ_a θ_o) + E_θ'[N_b/b] log(θ_b²) + E_θ'[N_b/o] log(2θ_b θ_o) + n_AB log(2θ_a θ_b) + n_O log(θ_o²)

18 The M-step of Gene Counting (Cont.)

We need to maximize the function Q(θ|θ') above under the constraint θ_a + θ_b + θ_o = 1. The solution (obtained using Lagrange multipliers) is given by:

θ_a = (2 E_θ'[N_a/a] + E_θ'[N_a/o] + n_AB) / (2n)
θ_b = (2 E_θ'[N_b/b] + E_θ'[N_b/o] + n_AB) / (2n)
θ_o = (E_θ'[N_a/o] + E_θ'[N_b/o] + 2 n_O) / (2n)

which matches the M-step we used earlier!
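Putting the two steps together for the ABO example, as a sketch with made-up phenotype counts n_A, n_B, n_AB, n_O; the E-step splits the ambiguous phenotypes into expected genotype counts and the M-step applies the allele-frequency updates above.

```python
# Hypothetical observed phenotype counts (n_A, n_B, n_AB, n_O).
n_A, n_B, n_AB, n_O = 500, 200, 50, 250
n = n_A + n_B + n_AB + n_O

theta_a, theta_b, theta_o = 1 / 3, 1 / 3, 1 / 3     # arbitrary starting point
for _ in range(100):
    # E-step: split the ambiguous phenotype counts into expected genotype counts.
    e_aa = n_A * theta_a**2 / (theta_a**2 + 2 * theta_a * theta_o)
    e_ao = n_A - e_aa
    e_bb = n_B * theta_b**2 / (theta_b**2 + 2 * theta_b * theta_o)
    e_bo = n_B - e_bb
    # M-step: allele frequencies = expected allele counts / total number of alleles (2n).
    theta_a = (2 * e_aa + e_ao + n_AB) / (2 * n)
    theta_b = (2 * e_bb + e_bo + n_AB) / (2 * n)
    theta_o = (e_ao + e_bo + 2 * n_O) / (2 * n)

print(f"theta_a={theta_a:.4f}  theta_b={theta_b:.4f}  theta_o={theta_o:.4f}")
```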

19 Outline for a different derivation of Gene Counting as an EM algorithm

Define a variable X with values x_A, x_B, x_AB, x_O.
Define a variable Y with values y_a/a, y_a/o, y_b/b, y_b/o, y_a/b, y_o/o.
Examine the two-node Bayesian network Y → X.

The local probability table for Y is P(y_a/a|θ) = θ_a θ_a, P(y_a/o|θ) = 2 θ_a θ_o, etc.
The local probability table for X given Y is P(x_A | y_a/a, θ) = 1, P(x_A | y_a/o, θ) = 1, P(x_A | y_b/o, θ) = 0, etc.; it contains only 0's and 1's.

Homework: write down for yourself the likelihood function for n independent points (x_i, y_i), and check that the EM equations match the gene counting equations.
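A sketch of that two-node network's tables for the homework, with hypothetical parameter values; P(Y|θ) follows the genotype probabilities above, P(X|Y) is the deterministic genotype-to-phenotype map, and P(x|θ) is obtained by summing out Y.

```python
theta = {"a": 0.3, "b": 0.1, "o": 0.6}          # hypothetical parameter values

def p_y(g1, g2):
    """P(Y | theta): genotype probability (homozygous theta^2, heterozygous 2*theta*theta)."""
    return theta[g1] * theta[g2] if g1 == g2 else 2 * theta[g1] * theta[g2]

genotypes = [("a", "a"), ("a", "o"), ("b", "b"), ("b", "o"), ("a", "b"), ("o", "o")]
phenotype = {("a", "a"): "A", ("a", "o"): "A", ("b", "b"): "B",
             ("b", "o"): "B", ("a", "b"): "AB", ("o", "o"): "O"}

def p_x(x):
    """P(X | Y) is deterministic (only 0s and 1s), so P(x|theta) sums over compatible genotypes."""
    return sum(p_y(*g) for g in genotypes if phenotype[g] == x)

for x in ("A", "B", "AB", "O"):
    print(x, p_x(x))                             # the four phenotype probabilities sum to 1
```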