1. The EM algorithm, Lecture #11. Acknowledgement: Some slides of this lecture are due to Nir Friedman.

2. Expectation Maximization (EM) for Bayesian networks

Intuition (as before):
- When we have access to all counts, we can find the ML estimate of all parameters in all local tables directly by counting (see the sketch below).
- However, missing values do not allow us to perform such counts.
- So instead, we compute the expected counts using the current parameter assignment, and then use them to compute the maximum likelihood estimate.

[Figure: a Bayesian network over A, B, C, D with local tables P(A=a|θ), P(B=b|A=a,θ), P(C=c|A=a,θ), P(D=d|b,c,θ).]
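To make the first bullet concrete, here is a minimal sketch of ML estimation by counting when the data are complete, for a made-up two-variable network (A with child B); the records and names are illustrative, not the lecture's example.

```python
from collections import Counter

# Complete data: ML estimates of local tables are just normalized counts.
# Illustrative records over two binary variables A and B, where B depends on A.
records = [("a0", "b1"), ("a0", "b0"), ("a1", "b1"), ("a1", "b1"), ("a0", "b1")]

joint = Counter(records)                    # N(A, B)
marginal = Counter(a for a, _ in records)   # N(A)

# ML estimate of the local table P(B | A): divide each joint count by the parent count.
cpt = {(a, b): joint[(a, b)] / marginal[a] for (a, b) in joint}

for (a, b), p in sorted(cpt.items()):
    print(f"P(B={b} | A={a}) = {p:.2f}")
```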

3. Expectation Maximization (EM)

[Figure: a worked example. Data: ten observations of three binary (H/T) variables, X = HTHHTHTHHT (fully observed), Y = ??HTT??HTT and Z = T??THT??TH (partially observed, ? = missing value). Using the current parameters, e.g. P(Y=H|X=T,θ) = 0.4 and P(Y=H|X=H,Z=T,θ) = 0.3, the missing entries are completed probabilistically, giving expected count tables N(X,Y) and N(X,Z) with fractional entries such as 1.3, 0.4, 1.7, 1.6.]
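A minimal sketch of how such expected counts are formed, assuming the per-record posteriors P(Y=H | evidence, θ') are already available; here they are hard-coded illustrative values rather than the slide's exact model.

```python
from collections import defaultdict

# Observed data: X is fully observed, Y is partially observed (None = missing).
X = list("HTHHTHTHHT")
Y = [None, None, "H", "T", "T", None, None, "H", "T", "T"]

# Current-parameter posteriors P(Y=H | X=x, other evidence, θ').
# These numbers are illustrative stand-ins, not the slide's exact values.
posterior_y_given_x = {"H": 0.3, "T": 0.4}

# E-step idea: expected counts N(X,Y) replace hard counts when Y is missing.
N = defaultdict(float)
for x, y in zip(X, Y):
    if y is not None:
        N[(x, y)] += 1.0                    # observed case: full count
    else:
        p = posterior_y_given_x[x]
        N[(x, "H")] += p                    # missing case: fractional counts
        N[(x, "T")] += 1.0 - p

for (x, y), c in sorted(N.items()):
    print(f"N(X={x}, Y={y}) = {c:.2f}")
```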

4. EM (cont.)

[Figure: the EM cycle for a network over X1, X2, X3, Y, Z1, Z2, Z3. Starting from the initial network (G, θ') and the training data, the E-step computes the expected counts N(X1), N(X2), N(X3), N(Y, X1, X2, X3), N(Z1, Y), N(Z2, Y), N(Z3, Y); the M-step reparameterizes, giving the updated network (G, θ); then reiterate.]

Note: This EM iteration corresponds to the non-homogeneous HMM iteration. When parameters are shared across local probability tables or are functions of each other, changes are needed.

5. EM in Practice

Initial parameters:
- Random parameter setting
- "Best" guess from another source

Stopping criteria:
- Small change in the likelihood of the data
- Small change in parameter values

Avoiding bad local maxima (see the sketch below):
- Multiple restarts
- Early "pruning" of unpromising runs
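A sketch of the multiple-restarts idea: a small harness that launches EM from several random parameter settings and keeps the best local maximum. Here `run_em` stands for a hypothetical model-specific EM routine; a trivial stand-in is supplied only so the harness runs.

```python
import random

def em_restarts(run_em, n_restarts=10, seed=0, tol=1e-6):
    """Run EM from several random starts and keep the best local maximum.

    run_em(theta0, tol) is assumed to return (theta, log_likelihood); it is a
    placeholder for whatever model-specific EM routine is being used.
    """
    rng = random.Random(seed)
    best_theta, best_ll = None, float("-inf")
    for _ in range(n_restarts):
        theta0 = rng.random()                 # random parameter setting
        theta, ll = run_em(theta0, tol)
        if ll > best_ll:                      # keep the most likely local maximum
            best_theta, best_ll = theta, ll
    return best_theta, best_ll

# Trivial stand-in for a model-specific EM routine, only to make the harness runnable.
def toy_run_em(theta0, tol):
    return theta0, -(theta0 - 0.7) ** 2       # pretend log-likelihood peaks at 0.7

print(em_restarts(toy_run_em))
```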

6. Relative Entropy – a measure of difference between distributions

We define the relative entropy H(P||Q) for two probability distributions P and Q of a variable X (with x_i being a value of X) as follows:

H(P||Q) = Σ_i P(x_i) log2( P(x_i) / Q(x_i) )

This is a measure of difference between P(x) and Q(x). It is not a symmetric function. The distribution P(x) is assumed to be the "true" distribution and is used for taking the expectation of the log of the ratio. The following properties hold (see the code check below):

H(P||Q) ≥ 0, and equality holds if and only if P(x) = Q(x) for all x.
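A small check of the definition and its properties; the two distributions are illustrative.

```python
import math

def relative_entropy(P, Q):
    """H(P||Q) = sum_i P(x_i) * log2(P(x_i) / Q(x_i)), with 0 * log(0/q) taken as 0."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

# Illustrative distributions over a 3-valued variable (not from the lecture).
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

print(relative_entropy(P, Q))   # >= 0
print(relative_entropy(Q, P))   # generally different: H(P||Q) is not symmetric
print(relative_entropy(P, P))   # 0, since equality holds iff P = Q
```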

7. Average Score for sequence comparisons

Recall that we have defined the scoring function as the log-odds ratio σ(a,b) = log2( P(a,b) / (Q(a)Q(b)) ). Note that the average score, Σ_{a,b} P(a,b) σ(a,b), is the relative entropy H(P||Q) where Q(a,b) = Q(a)Q(b). Relative entropy also arises when choosing amongst competing models.
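A quick numeric illustration of this claim, using a made-up joint distribution P(a,b) over a two-letter alphabet and background frequencies Q(a); averaging the log-odds score under P reproduces H(P||Q) with Q(a,b) = Q(a)Q(b).

```python
import math

# Illustrative aligned-pair distribution P(a,b) and background frequencies Q(a);
# these numbers are made up for the check, not taken from the lecture.
pairs = {("A", "A"): 0.3, ("A", "B"): 0.2, ("B", "A"): 0.2, ("B", "B"): 0.3}
Q = {"A": 0.5, "B": 0.5}

# Log-odds scores sigma(a,b) = log2( P(a,b) / (Q(a)Q(b)) ).
score = {ab: math.log2(p / (Q[ab[0]] * Q[ab[1]])) for ab, p in pairs.items()}

# Average score under P ...
average_score = sum(p * score[ab] for ab, p in pairs.items())

# ... equals the relative entropy between P(a,b) and Q(a)Q(b).
H = sum(p * math.log2(p / (Q[ab[0]] * Q[ab[1]])) for ab, p in pairs.items())

print(average_score, H)
```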

8. The setup of the EM algorithm

We start with a likelihood function parameterized by θ. The observed quantity is denoted X=x; it is often a vector x_1,…,x_L of observations (e.g., evidence for some nodes in a Bayesian network). The hidden quantity is a vector Y=y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x,y|θ) would be easy to maximize.

The log-likelihood of an observation x has the form:

log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)

(because P(x,y|θ) = P(x|θ) P(y|x,θ)).
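A tiny numeric check of this decomposition, using an arbitrary made-up joint distribution; note that the identity holds for any choice of the hidden value y.

```python
import math

# Tiny illustrative joint P(x, y | θ) over binary X and Y (values made up, sum to 1).
joint = {("x0", "y0"): 0.1, ("x0", "y1"): 0.3, ("x1", "y0"): 0.4, ("x1", "y1"): 0.2}

x, y = "x0", "y1"

p_x = sum(p for (xv, _), p in joint.items() if xv == x)   # P(x|θ)
p_xy = joint[(x, y)]                                      # P(x,y|θ)
p_y_given_x = p_xy / p_x                                  # P(y|x,θ)

# log P(x|θ) = log P(x,y|θ) - log P(y|x,θ), for any choice of y.
print(math.log(p_x), math.log(p_xy) - math.log(p_y_given_x))
```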

9. The goal of the EM algorithm

The log-likelihood of ONE observation x has the form:

log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)

The goal: starting with a current parameter vector θ', EM's goal is to find a new vector θ such that P(x|θ) > P(x|θ'), with the highest possible difference.

The result: after enough iterations, EM reaches a local maximum of the likelihood P(x|θ).

10. The Expectation Operator

Recall that the expectation of a random variable Y with a pdf p(y) is given by E[Y] = Σ_y y p(y). The expectation of a function L(Y) is given by E[L(Y)] = Σ_y p(y) L(y).

An example used by the EM algorithm:

Q(θ|θ') ≡ E_θ'[log P(x,y|θ)] = Σ_y P(y|x,θ') log P(x,y|θ)

The expectation operator E is linear: for two random variables X, Y and constants a, b, E[aX+bY] = a E[X] + b E[Y].
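A sketch of computing Q(θ|θ') for one observation, under an assumed toy model (hidden Y ~ Bernoulli(θ), observed X with known emission probabilities); the model and the numbers are illustrative, not part of the lecture.

```python
import math

EMIT = {1: 0.9, 0: 0.2}          # P(X=1 | Y=y), assumed known

def joint(x, y, theta):          # P(x, y | theta)
    p_y = theta if y == 1 else 1 - theta
    p_x = EMIT[y] if x == 1 else 1 - EMIT[y]
    return p_y * p_x

def posterior(x, theta):         # P(Y=1 | x, theta)
    num = joint(x, 1, theta)
    return num / (num + joint(x, 0, theta))

def Q(theta, theta_prime, x):    # Q(theta | theta') = E_theta'[ log P(x, Y | theta) ]
    w = posterior(x, theta_prime)
    return w * math.log(joint(x, 1, theta)) + (1 - w) * math.log(joint(x, 0, theta))

print(Q(0.6, 0.5, x=1))
```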

11. Improving the likelihood

Starting with log P(x|θ) = log P(x,y|θ) – log P(y|x,θ), multiplying both sides by P(y|x,θ'), and summing over y, yields

log P(x|θ) = Σ_y P(y|x,θ') log P(x,y|θ) – Σ_y P(y|x,θ') log P(y|x,θ)

where the first sum is E_θ'[log P(x,y|θ)] = Q(θ|θ').

We now observe that

Δ = log P(x|θ) – log P(x|θ') = Q(θ|θ') – Q(θ'|θ') + Σ_y P(y|x,θ') log [ P(y|x,θ') / P(y|x,θ) ]

The last sum is a relative entropy, hence ≥ 0, so Δ ≥ Q(θ|θ') – Q(θ'|θ'). So choosing θ* = argmax_θ Q(θ|θ') guarantees Δ ≥ 0 (it maximizes this lower bound on Δ), and repeating this process leads to a local maximum of log P(x|θ).

12. The EM algorithm

Input: a likelihood function P(x,y|θ) parameterized by θ.
Initialization: fix an arbitrary starting value θ'.
Repeat
  E-step: compute Q(θ|θ') = E_θ'[log P(x,y|θ)]
  M-step: θ' ← argmax_θ Q(θ|θ')
Until Δ = log P(x|θ) – log P(x|θ') < ε

Comment: at the M-step one can actually choose any θ as long as Δ > 0. This change yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute. A runnable sketch of this loop is given below.
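A self-contained sketch of the loop for an assumed toy model in which the M-step has a closed form (hidden Y_i ~ Bernoulli(θ), observed X_i with known emission probabilities); the model, data, and tolerance are illustrative choices, not the lecture's.

```python
import math

EMIT = {1: 0.9, 0: 0.2}                      # P(X=1 | Y=y), assumed known
data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]        # observed x_1..x_n (illustrative)

def log_likelihood(theta):
    ll = 0.0
    for x in data:
        p1 = theta * (EMIT[1] if x == 1 else 1 - EMIT[1])
        p0 = (1 - theta) * (EMIT[0] if x == 1 else 1 - EMIT[0])
        ll += math.log(p1 + p0)              # log P(x|theta) = log sum_y P(x,y|theta)
    return ll

theta, eps = 0.5, 1e-8                       # arbitrary starting value, tolerance
prev = log_likelihood(theta)
while True:
    # E-step: posterior weights w_i = P(Y_i=1 | x_i, theta')
    w = []
    for x in data:
        p1 = theta * (EMIT[1] if x == 1 else 1 - EMIT[1])
        p0 = (1 - theta) * (EMIT[0] if x == 1 else 1 - EMIT[0])
        w.append(p1 / (p1 + p0))
    # M-step: argmax of Q has a closed form here: the average posterior weight
    theta = sum(w) / len(w)
    curr = log_likelihood(theta)
    if curr - prev < eps:                    # stop when the likelihood gain is tiny
        break
    prev = curr

print(theta, curr)
```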

13. The EM algorithm (with multiple independent samples)

Recall that the log-likelihood of an observation x has the form:

log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)

For independent samples (x_i, y_i), i = 1,…,m, we can write:

Σ_i log P(x_i|θ) = Σ_i log P(x_i,y_i|θ) – Σ_i log P(y_i|x_i,θ)

E-step: compute Q(θ|θ') = E_θ'[Σ_i log P(x_i,y_i|θ)] = Σ_i E_θ'[log P(x_i,y_i|θ)], i.e., each sample is completed separately.
M-step: θ' ← argmax_θ Q(θ|θ')

14. MLE from Incomplete Data

Finding MLE parameters is a nonlinear optimization problem.

Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice"). Guarantee: the maximum of the new function has a higher likelihood than the current point.

[Figure: the log-likelihood log P(x|θ) as a function of θ, together with the lower-bounding surrogate built from E_θ'[log P(x,y|θ)] at the current point.]

15. Gene Counting Revisited (as EM)

The observations: the variables X = (N_A, N_B, N_AB, N_O) with a specific assignment x = (n_A, n_B, n_AB, n_O).

The hidden quantity: the variables Y = (N_a/a, N_a/o, N_b/b, N_b/o) with a specific assignment y = (n_a/a, n_a/o, n_b/b, n_b/o).

The parameters: θ = {θ_a, θ_b, θ_o}.

The likelihood of the completed data of n points:

P(x,y|θ) = P(n_AB, n_O, n_a/a, n_a/o, n_b/b, n_b/o | θ)
         = [n! / (n_AB! n_O! n_a/a! n_a/o! n_b/b! n_b/o!)] (2θ_aθ_b)^n_AB (θ_o²)^n_O (θ_a²)^n_a/a (2θ_aθ_o)^n_a/o (θ_b²)^n_b/b (2θ_bθ_o)^n_b/o

16. The E-step of Gene Counting

The likelihood of the hidden data given the observed data of n points:

P(y|x,θ') = P(n_a/a, n_a/o, n_b/b, n_b/o | n_A, n_B, n_AB, n_O, θ') = P(n_a/a, n_a/o | n_A, θ'_a, θ'_o) · P(n_b/b, n_b/o | n_B, θ'_b, θ'_o)

This is exactly the E-step we used earlier!
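A sketch of this E-step under the usual ABO genotype probabilities (as in slide 19): the observed counts n_A and n_B are split into expected genotype counts. The phenotype counts and the current parameters θ' below are illustrative inputs, not values from the lecture.

```python
# Observed phenotype counts (illustrative) and current allele frequencies θ'.
n_A, n_B, n_AB, n_O = 186, 38, 13, 284
theta_a, theta_b, theta_o = 0.3, 0.1, 0.6

# E-step: E[N_a/a | n_A, θ'] and E[N_a/o | n_A, θ'], and similarly for B.
p_aa = theta_a ** 2
p_ao = 2 * theta_a * theta_o
E_aa = n_A * p_aa / (p_aa + p_ao)
E_ao = n_A * p_ao / (p_aa + p_ao)

p_bb = theta_b ** 2
p_bo = 2 * theta_b * theta_o
E_bb = n_B * p_bb / (p_bb + p_bo)
E_bo = n_B * p_bo / (p_bb + p_bo)

print(E_aa, E_ao, E_bb, E_bo)
```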

17. The M-step of Gene Counting

The log-likelihood of the completed data of n points (dropping terms that do not depend on θ):

log P(x,y|θ) = const + (2n_a/a + n_a/o + n_AB) log θ_a + (2n_b/b + n_b/o + n_AB) log θ_b + (n_a/o + n_b/o + 2n_O) log θ_o

Taking the expectation with respect to Y = (N_a/a, N_a/o, N_b/b, N_b/o) and using the linearity of E yields the function Q(θ|θ') which we need to maximize:

Q(θ|θ') = const + (2E_θ'[N_a/a] + E_θ'[N_a/o] + n_AB) log θ_a + (2E_θ'[N_b/b] + E_θ'[N_b/o] + n_AB) log θ_b + (E_θ'[N_a/o] + E_θ'[N_b/o] + 2n_O) log θ_o

18. The M-step of Gene Counting (Cont.)

We need to maximize the function Q(θ|θ') above under the constraint θ_a + θ_b + θ_o = 1. The solution (obtained using Lagrange multipliers) is given by

θ_a = (2E_θ'[N_a/a] + E_θ'[N_a/o] + n_AB) / (2n)
θ_b = (2E_θ'[N_b/b] + E_θ'[N_b/o] + n_AB) / (2n)
θ_o = (E_θ'[N_a/o] + E_θ'[N_b/o] + 2n_O) / (2n)

which matches the M-step we used earlier!
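Putting the two steps together, here is a sketch of the full gene-counting EM iteration; the phenotype counts passed in at the end are illustrative.

```python
def gene_counting_em(n_A, n_B, n_AB, n_O, iters=50):
    """EM for ABO allele frequencies: alternate expected genotype counts and reestimation."""
    n = n_A + n_B + n_AB + n_O
    theta_a, theta_b, theta_o = 1 / 3, 1 / 3, 1 / 3     # arbitrary starting value θ'
    for _ in range(iters):
        # E-step: expected genotype counts given the phenotype counts and θ'.
        E_aa = n_A * theta_a**2 / (theta_a**2 + 2 * theta_a * theta_o)
        E_ao = n_A - E_aa
        E_bb = n_B * theta_b**2 / (theta_b**2 + 2 * theta_b * theta_o)
        E_bo = n_B - E_bb
        # M-step: reestimate allele frequencies from the expected counts.
        theta_a = (2 * E_aa + E_ao + n_AB) / (2 * n)
        theta_b = (2 * E_bb + E_bo + n_AB) / (2 * n)
        theta_o = (E_ao + E_bo + 2 * n_O) / (2 * n)
    return theta_a, theta_b, theta_o

print(gene_counting_em(n_A=186, n_B=38, n_AB=13, n_O=284))
```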

19. Outline for a different derivation of Gene Counting as an EM algorithm

Define a variable X with values x_A, x_B, x_AB, x_O. Define a variable Y with values y_a/a, y_a/o, y_b/b, y_b/o, y_a/b, y_o/o. Examine the Bayesian network Y → X. The local probability table for Y is P(y_a/a|θ) = θ_a·θ_a, P(y_a/o|θ) = 2θ_a·θ_o, etc. The local probability table for X given Y is P(x_A|y_a/a,θ) = 1, P(x_A|y_a/o,θ) = 1, P(x_A|y_b/o,θ) = 0, etc.; it contains only 0's and 1's.

Homework: write down for yourself the likelihood function for n independent points x_i, y_i, and check that the EM equations match the gene counting equations.

