1. The EM algorithm, Lecture #11. Acknowledgement: Some slides of this lecture are due to Nir Friedman.

2. Expectation Maximization (EM) for Bayesian networks

Intuition (as before):
- When we have access to all counts, we can find the ML estimate of all parameters in all local tables directly by counting (see the sketch below).
- However, missing values do not allow us to perform such counts.
- So instead, we compute the expected counts using the current parameter assignment, and then use them to compute the maximum likelihood estimate.

[Figure: a Bayesian network over A, B, C, D with local tables P(A=a|θ), P(B=b|A=a,θ), P(C=c|A=a,θ), P(D=d|b,c,θ).]
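To make the first bullet concrete, here is a minimal sketch of ML estimation by counting when the data are complete, for a made-up two-variable network (A with child B); the records and names are illustrative, not the lecture's example.

```python
from collections import Counter

# Complete data: ML estimates of local tables are just normalized counts.
# Illustrative records over two binary variables A and B, where B depends on A.
records = [("a0", "b1"), ("a0", "b0"), ("a1", "b1"), ("a1", "b1"), ("a0", "b1")]

joint = Counter(records)                    # N(A, B)
marginal = Counter(a for a, _ in records)   # N(A)

# ML estimate of the local table P(B | A): divide each joint count by the parent count.
cpt = {(a, b): joint[(a, b)] / marginal[a] for (a, b) in joint}

for (a, b), p in sorted(cpt.items()):
    print(f"P(B={b} | A={a}) = {p:.2f}")
```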

3. Expectation Maximization (EM)

[Figure: a worked example. Data: ten observations of three binary (H/T) variables, X = HTHHTHTHHT (fully observed), Y = ??HTT??HTT and Z = T??THT??TH (partially observed, ? = missing value). Using the current parameters, e.g. P(Y=H|X=T,θ) = 0.4 and P(Y=H|X=H,Z=T,θ) = 0.3, the missing entries are completed probabilistically, giving expected count tables N(X,Y) and N(X,Z) with fractional entries such as 1.3, 0.4, 1.7, 1.6.]
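A minimal sketch of how such expected counts are formed, assuming the per-record posteriors P(Y=H | evidence, θ') are already available; here they are hard-coded illustrative values rather than the slide's exact model.

```python
from collections import defaultdict

# Observed data: X is fully observed, Y is partially observed (None = missing).
X = list("HTHHTHTHHT")
Y = [None, None, "H", "T", "T", None, None, "H", "T", "T"]

# Current-parameter posteriors P(Y=H | X=x, other evidence, θ').
# These numbers are illustrative stand-ins, not the slide's exact values.
posterior_y_given_x = {"H": 0.3, "T": 0.4}

# E-step idea: expected counts N(X,Y) replace hard counts when Y is missing.
N = defaultdict(float)
for x, y in zip(X, Y):
    if y is not None:
        N[(x, y)] += 1.0                    # observed case: full count
    else:
        p = posterior_y_given_x[x]
        N[(x, "H")] += p                    # missing case: fractional counts
        N[(x, "T")] += 1.0 - p

for (x, y), c in sorted(N.items()):
    print(f"N(X={x}, Y={y}) = {c:.2f}")
```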

4. EM (cont.)

[Figure: the EM cycle for a network over X1, X2, X3, Y, Z1, Z2, Z3. Starting from the initial network (G, θ') and the training data, the E-step computes the expected counts N(X1), N(X2), N(X3), N(Y, X1, X2, X3), N(Z1, Y), N(Z2, Y), N(Z3, Y); the M-step reparameterizes, giving the updated network (G, θ); then reiterate.]

Note: This EM iteration corresponds to the non-homogeneous HMM iteration. When parameters are shared across local probability tables or are functions of each other, changes are needed.

5. EM in Practice

Initial parameters:
- Random parameter setting
- "Best" guess from another source

Stopping criteria:
- Small change in the likelihood of the data
- Small change in parameter values

Avoiding bad local maxima (see the sketch below):
- Multiple restarts
- Early "pruning" of unpromising runs
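A sketch of the multiple-restarts idea: a small harness that launches EM from several random parameter settings and keeps the best local maximum. Here `run_em` stands for a hypothetical model-specific EM routine; a trivial stand-in is supplied only so the harness runs.

```python
import random

def em_restarts(run_em, n_restarts=10, seed=0, tol=1e-6):
    """Run EM from several random starts and keep the best local maximum.

    run_em(theta0, tol) is assumed to return (theta, log_likelihood); it is a
    placeholder for whatever model-specific EM routine is being used.
    """
    rng = random.Random(seed)
    best_theta, best_ll = None, float("-inf")
    for _ in range(n_restarts):
        theta0 = rng.random()                 # random parameter setting
        theta, ll = run_em(theta0, tol)
        if ll > best_ll:                      # keep the most likely local maximum
            best_theta, best_ll = theta, ll
    return best_theta, best_ll

# Trivial stand-in for a model-specific EM routine, only to make the harness runnable.
def toy_run_em(theta0, tol):
    return theta0, -(theta0 - 0.7) ** 2       # pretend log-likelihood peaks at 0.7

print(em_restarts(toy_run_em))
```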

6. Relative Entropy – a measure of difference between distributions

We define the relative entropy H(P||Q) for two probability distributions P and Q of a variable X (with x_i being a value of X) as follows:

H(P||Q) = Σ_i P(x_i) log2( P(x_i) / Q(x_i) )

This is a measure of difference between P(x) and Q(x). It is not a symmetric function. The distribution P(x) is assumed to be the "true" distribution and is used for taking the expectation of the log of the ratio. The following properties hold (see the code check below):

H(P||Q) ≥ 0, and equality holds if and only if P(x) = Q(x) for all x.
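A small check of the definition and its properties; the two distributions are illustrative.

```python
import math

def relative_entropy(P, Q):
    """H(P||Q) = sum_i P(x_i) * log2(P(x_i) / Q(x_i)), with 0 * log(0/q) taken as 0."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

# Illustrative distributions over a 3-valued variable (not from the lecture).
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

print(relative_entropy(P, Q))   # >= 0
print(relative_entropy(Q, P))   # generally different: H(P||Q) is not symmetric
print(relative_entropy(P, P))   # 0, since equality holds iff P = Q
```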

7. Average Score for sequence comparisons

Recall that we have defined the scoring function as the log-odds ratio σ(a,b) = log2( P(a,b) / (Q(a)Q(b)) ). Note that the average score, Σ_{a,b} P(a,b) σ(a,b), is the relative entropy H(P||Q) where Q(a,b) = Q(a)Q(b). Relative entropy also arises when choosing amongst competing models.
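A quick numeric illustration of this claim, using a made-up joint distribution P(a,b) over a two-letter alphabet and background frequencies Q(a); averaging the log-odds score under P reproduces H(P||Q) with Q(a,b) = Q(a)Q(b).

```python
import math

# Illustrative aligned-pair distribution P(a,b) and background frequencies Q(a);
# these numbers are made up for the check, not taken from the lecture.
pairs = {("A", "A"): 0.3, ("A", "B"): 0.2, ("B", "A"): 0.2, ("B", "B"): 0.3}
Q = {"A": 0.5, "B": 0.5}

# Log-odds scores sigma(a,b) = log2( P(a,b) / (Q(a)Q(b)) ).
score = {ab: math.log2(p / (Q[ab[0]] * Q[ab[1]])) for ab, p in pairs.items()}

# Average score under P ...
average_score = sum(p * score[ab] for ab, p in pairs.items())

# ... equals the relative entropy between P(a,b) and Q(a)Q(b).
H = sum(p * math.log2(p / (Q[ab[0]] * Q[ab[1]])) for ab, p in pairs.items())

print(average_score, H)
```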

8. The setup of the EM algorithm

We start with a likelihood function parameterized by θ. The observed quantity is denoted X=x; it is often a vector x_1,…,x_L of observations (e.g., evidence for some nodes in a Bayesian network). The hidden quantity is a vector Y=y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x,y|θ) would be easy to maximize.

The log-likelihood of an observation x has the form:

log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)

(because P(x,y|θ) = P(x|θ) P(y|x,θ)).
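A tiny numeric check of this decomposition, using an arbitrary made-up joint distribution; note that the identity holds for any choice of the hidden value y.

```python
import math

# Tiny illustrative joint P(x, y | θ) over binary X and Y (values made up, sum to 1).
joint = {("x0", "y0"): 0.1, ("x0", "y1"): 0.3, ("x1", "y0"): 0.4, ("x1", "y1"): 0.2}

x, y = "x0", "y1"

p_x = sum(p for (xv, _), p in joint.items() if xv == x)   # P(x|θ)
p_xy = joint[(x, y)]                                      # P(x,y|θ)
p_y_given_x = p_xy / p_x                                  # P(y|x,θ)

# log P(x|θ) = log P(x,y|θ) - log P(y|x,θ), for any choice of y.
print(math.log(p_x), math.log(p_xy) - math.log(p_y_given_x))
```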

9. The goal of the EM algorithm

The log-likelihood of ONE observation x has the form:

log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)

The goal: starting with a current parameter vector θ', EM's goal is to find a new vector θ such that P(x|θ) > P(x|θ'), with the highest possible difference.

The result: after enough iterations, EM reaches a local maximum of the likelihood P(x|θ).

10. The Expectation Operator

Recall that the expectation of a random variable Y with a pdf p(y) is given by E[Y] = Σ_y y p(y). The expectation of a function L(Y) is given by E[L(Y)] = Σ_y p(y) L(y).

An example used by the EM algorithm:

Q(θ|θ') ≡ E_θ'[log P(x,y|θ)] = Σ_y P(y|x,θ') log P(x,y|θ)

The expectation operator E is linear: for two random variables X, Y and constants a, b, E[aX+bY] = a E[X] + b E[Y].
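A sketch of computing Q(θ|θ') for one observation, under an assumed toy model (hidden Y ~ Bernoulli(θ), observed X with known emission probabilities); the model and the numbers are illustrative, not part of the lecture.

```python
import math

EMIT = {1: 0.9, 0: 0.2}          # P(X=1 | Y=y), assumed known

def joint(x, y, theta):          # P(x, y | theta)
    p_y = theta if y == 1 else 1 - theta
    p_x = EMIT[y] if x == 1 else 1 - EMIT[y]
    return p_y * p_x

def posterior(x, theta):         # P(Y=1 | x, theta)
    num = joint(x, 1, theta)
    return num / (num + joint(x, 0, theta))

def Q(theta, theta_prime, x):    # Q(theta | theta') = E_theta'[ log P(x, Y | theta) ]
    w = posterior(x, theta_prime)
    return w * math.log(joint(x, 1, theta)) + (1 - w) * math.log(joint(x, 0, theta))

print(Q(0.6, 0.5, x=1))
```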

11. Improving the likelihood

Starting with log P(x|θ) = log P(x,y|θ) – log P(y|x,θ), multiplying both sides by P(y|x,θ'), and summing over y, yields

log P(x|θ) = Σ_y P(y|x,θ') log P(x,y|θ) – Σ_y P(y|x,θ') log P(y|x,θ)

where the first sum is E_θ'[log P(x,y|θ)] = Q(θ|θ').

We now observe that

Δ = log P(x|θ) – log P(x|θ') = Q(θ|θ') – Q(θ'|θ') + Σ_y P(y|x,θ') log [ P(y|x,θ') / P(y|x,θ) ]

The last sum is a relative entropy, hence ≥ 0, so Δ ≥ Q(θ|θ') – Q(θ'|θ'). So choosing θ* = argmax_θ Q(θ|θ') guarantees Δ ≥ 0 (it maximizes this lower bound on Δ), and repeating this process leads to a local maximum of log P(x|θ).

12. The EM algorithm

Input: a likelihood function P(x,y|θ) parameterized by θ.
Initialization: fix an arbitrary starting value θ'.
Repeat
  E-step: compute Q(θ|θ') = E_θ'[log P(x,y|θ)]
  M-step: θ' ← argmax_θ Q(θ|θ')
Until Δ = log P(x|θ) – log P(x|θ') < ε

Comment: at the M-step one can actually choose any θ as long as Δ > 0. This change yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute. A runnable sketch of this loop is given below.
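A self-contained sketch of the loop for an assumed toy model in which the M-step has a closed form (hidden Y_i ~ Bernoulli(θ), observed X_i with known emission probabilities); the model, data, and tolerance are illustrative choices, not the lecture's.

```python
import math

EMIT = {1: 0.9, 0: 0.2}                      # P(X=1 | Y=y), assumed known
data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]        # observed x_1..x_n (illustrative)

def log_likelihood(theta):
    ll = 0.0
    for x in data:
        p1 = theta * (EMIT[1] if x == 1 else 1 - EMIT[1])
        p0 = (1 - theta) * (EMIT[0] if x == 1 else 1 - EMIT[0])
        ll += math.log(p1 + p0)              # log P(x|theta) = log sum_y P(x,y|theta)
    return ll

theta, eps = 0.5, 1e-8                       # arbitrary starting value, tolerance
prev = log_likelihood(theta)
while True:
    # E-step: posterior weights w_i = P(Y_i=1 | x_i, theta')
    w = []
    for x in data:
        p1 = theta * (EMIT[1] if x == 1 else 1 - EMIT[1])
        p0 = (1 - theta) * (EMIT[0] if x == 1 else 1 - EMIT[0])
        w.append(p1 / (p1 + p0))
    # M-step: argmax of Q has a closed form here: the average posterior weight
    theta = sum(w) / len(w)
    curr = log_likelihood(theta)
    if curr - prev < eps:                    # stop when the likelihood gain is tiny
        break
    prev = curr

print(theta, curr)
```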

13. The EM algorithm (with multiple independent samples)

Recall that the log-likelihood of an observation x has the form:

log P(x|θ) = log P(x,y|θ) – log P(y|x,θ)

For independent samples (x_i, y_i), i = 1,…,m, we can write:

Σ_i log P(x_i|θ) = Σ_i log P(x_i,y_i|θ) – Σ_i log P(y_i|x_i,θ)

E-step: compute Q(θ|θ') = E_θ'[Σ_i log P(x_i,y_i|θ)] = Σ_i E_θ'[log P(x_i,y_i|θ)], i.e., each sample is completed separately.
M-step: θ' ← argmax_θ Q(θ|θ')

14. MLE from Incomplete Data

Finding MLE parameters is a nonlinear optimization problem.

Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice"). Guarantee: the maximum of the new function has a higher likelihood than the current point.

[Figure: the log-likelihood log P(x|θ) as a function of θ, together with the lower-bounding surrogate built from E_θ'[log P(x,y|θ)] at the current point.]

15. Gene Counting Revisited (as EM)

The observations: the variables X = (N_A, N_B, N_AB, N_O) with a specific assignment x = (n_A, n_B, n_AB, n_O).

The hidden quantity: the variables Y = (N_a/a, N_a/o, N_b/b, N_b/o) with a specific assignment y = (n_a/a, n_a/o, n_b/b, n_b/o).

The parameters: θ = {θ_a, θ_b, θ_o}.

The likelihood of the completed data of n points:

P(x,y|θ) = P(n_AB, n_O, n_a/a, n_a/o, n_b/b, n_b/o | θ)
         = [n! / (n_AB! n_O! n_a/a! n_a/o! n_b/b! n_b/o!)] (2θ_aθ_b)^n_AB (θ_o²)^n_O (θ_a²)^n_a/a (2θ_aθ_o)^n_a/o (θ_b²)^n_b/b (2θ_bθ_o)^n_b/o

16. The E-step of Gene Counting

The likelihood of the hidden data given the observed data of n points:

P(y|x,θ') = P(n_a/a, n_a/o, n_b/b, n_b/o | n_A, n_B, n_AB, n_O, θ') = P(n_a/a, n_a/o | n_A, θ'_a, θ'_o) · P(n_b/b, n_b/o | n_B, θ'_b, θ'_o)

This is exactly the E-step we used earlier!
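A sketch of this E-step under the usual ABO genotype probabilities (as in slide 19): the observed counts n_A and n_B are split into expected genotype counts. The phenotype counts and the current parameters θ' below are illustrative inputs, not values from the lecture.

```python
# Observed phenotype counts (illustrative) and current allele frequencies θ'.
n_A, n_B, n_AB, n_O = 186, 38, 13, 284
theta_a, theta_b, theta_o = 0.3, 0.1, 0.6

# E-step: E[N_a/a | n_A, θ'] and E[N_a/o | n_A, θ'], and similarly for B.
p_aa = theta_a ** 2
p_ao = 2 * theta_a * theta_o
E_aa = n_A * p_aa / (p_aa + p_ao)
E_ao = n_A * p_ao / (p_aa + p_ao)

p_bb = theta_b ** 2
p_bo = 2 * theta_b * theta_o
E_bb = n_B * p_bb / (p_bb + p_bo)
E_bo = n_B * p_bo / (p_bb + p_bo)

print(E_aa, E_ao, E_bb, E_bo)
```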

17. The M-step of Gene Counting

The log-likelihood of the completed data of n points (dropping terms that do not depend on θ):

log P(x,y|θ) = const + (2n_a/a + n_a/o + n_AB) log θ_a + (2n_b/b + n_b/o + n_AB) log θ_b + (n_a/o + n_b/o + 2n_O) log θ_o

Taking the expectation with respect to Y = (N_a/a, N_a/o, N_b/b, N_b/o) and using the linearity of E yields the function Q(θ|θ') which we need to maximize:

Q(θ|θ') = const + (2E_θ'[N_a/a] + E_θ'[N_a/o] + n_AB) log θ_a + (2E_θ'[N_b/b] + E_θ'[N_b/o] + n_AB) log θ_b + (E_θ'[N_a/o] + E_θ'[N_b/o] + 2n_O) log θ_o

18. The M-step of Gene Counting (Cont.)

We need to maximize the function Q(θ|θ') above under the constraint θ_a + θ_b + θ_o = 1. The solution (obtained using Lagrange multipliers) is given by

θ_a = (2E_θ'[N_a/a] + E_θ'[N_a/o] + n_AB) / (2n)
θ_b = (2E_θ'[N_b/b] + E_θ'[N_b/o] + n_AB) / (2n)
θ_o = (E_θ'[N_a/o] + E_θ'[N_b/o] + 2n_O) / (2n)

which matches the M-step we used earlier!
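Putting the two steps together, here is a sketch of the full gene-counting EM iteration; the phenotype counts passed in at the end are illustrative.

```python
def gene_counting_em(n_A, n_B, n_AB, n_O, iters=50):
    """EM for ABO allele frequencies: alternate expected genotype counts and reestimation."""
    n = n_A + n_B + n_AB + n_O
    theta_a, theta_b, theta_o = 1 / 3, 1 / 3, 1 / 3     # arbitrary starting value θ'
    for _ in range(iters):
        # E-step: expected genotype counts given the phenotype counts and θ'.
        E_aa = n_A * theta_a**2 / (theta_a**2 + 2 * theta_a * theta_o)
        E_ao = n_A - E_aa
        E_bb = n_B * theta_b**2 / (theta_b**2 + 2 * theta_b * theta_o)
        E_bo = n_B - E_bb
        # M-step: reestimate allele frequencies from the expected counts.
        theta_a = (2 * E_aa + E_ao + n_AB) / (2 * n)
        theta_b = (2 * E_bb + E_bo + n_AB) / (2 * n)
        theta_o = (E_ao + E_bo + 2 * n_O) / (2 * n)
    return theta_a, theta_b, theta_o

print(gene_counting_em(n_A=186, n_B=38, n_AB=13, n_O=284))
```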

19. Outline for a different derivation of Gene Counting as an EM algorithm

Define a variable X with values x_A, x_B, x_AB, x_O. Define a variable Y with values y_a/a, y_a/o, y_b/b, y_b/o, y_a/b, y_o/o. Examine the Bayesian network Y → X. The local probability table for Y is P(y_a/a|θ) = θ_a·θ_a, P(y_a/o|θ) = 2θ_a·θ_o, etc. The local probability table for X given Y is P(x_A|y_a/a,θ) = 1, P(x_A|y_a/o,θ) = 1, P(x_A|y_b/o,θ) = 0, etc.; it contains only 0's and 1's.

Homework: write down for yourself the likelihood function for n independent points x_i, y_i, and check that the EM equations match the gene counting equations.

