Slide 1: Learning Bayesian networks. Most slides by Nir Friedman, some by Dan Geiger.

Slide 2: Known Structure, Incomplete Data
[Figure: an Inducer takes the fixed structure E -> A <- B plus data with missing entries and must produce the CPT P(A | E, B); the slide shows a filled CPT (entries .9/.1, .7/.3, .99/.01, .8/.2) next to the corresponding unknown CPT marked "??".]
- Network structure is specified.
- Data contains missing values.
- We consider assignments to the missing values of E, B, A.
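As a minimal sketch of this setting (not from the slides; the records are made up): the structure E -> A <- B is fixed, but some records leave E or B unobserved, so the CPT P(A | E, B) can no longer be filled in by direct counting.

```python
# Hypothetical records for the fixed structure E -> A <- B; None marks a missing value.
records = [
    {"E": 1, "B": 0, "A": 1},
    {"E": 0, "B": None, "A": 1},   # B is unobserved in this record
    {"E": None, "B": 1, "A": 0},   # E is unobserved in this record
]

# With complete data each CPT row P(A | E=e, B=b) would come from simple counts;
# a record with a missing parent value cannot be assigned to a single row.
counts = {(e, b): [0, 0] for e in (0, 1) for b in (0, 1)}   # [N(A=0), N(A=1)] per row
for r in records:
    if r["E"] is None or r["B"] is None:
        continue   # skipped here; EM (later slides) replaces this with expected counts
    counts[(r["E"], r["B"])][r["A"]] += 1
print(counts)
```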

Slide 3: Learning Parameters from Incomplete Data
Incomplete data:
- Posterior distributions can become interdependent.
- Consequence:
  - ML parameters cannot be computed separately for each multinomial.
  - The posterior is not a product of independent posteriors.
[Figure: plate model with parameters θ_X, θ_Y|X=H, θ_Y|X=T and observed nodes X[m], Y[m].]

Slide 4: Learning Parameters from Incomplete Data (cont.)
- In the presence of incomplete data, the likelihood can have multiple global maxima.
- Example: we can rename the values of the hidden variable H; if H has two values, the likelihood has two global maxima.
- Similarly, local maxima are also replicated.
- Many hidden variables ⇒ a serious problem.
[Figure: two-node network H -> Y.]

Slide 5: Expectation Maximization (EM)
- A general-purpose method for learning from incomplete data.
Intuition:
- If we had access to complete counts, we could estimate the parameters.
- However, missing values do not allow us to perform the counts.
- So we "complete" the counts using the current parameter assignment.

Slide 6: Expectation Maximization (EM)
[Figure: a data table over X, Y, Z with several missing Y entries ("?"), the current model over X, Y, Z, and the resulting expected counts N(X, Y) (1.3, 0.4, 1.7, 1.6). Under the current model, P(Y=H | X=H, Z=T, θ) = 0.3 and P(Y=H | X=T, Z=T, θ) = 0.4. These numbers are placed for illustration; they have not been computed.]
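A hedged sketch of this "complete the counts" step on a dataset of the same shape. The two conditional probabilities are the illustrative values quoted on the slide; the data rows and the remaining probabilities are assumptions made up for the example.

```python
data = [  # (X, Y, Z); None marks a missing Y value (all rows made up)
    ("H", "H", "T"), ("T", None, "T"), ("H", "T", "H"),
    ("T", "T", "T"), ("H", None, "T"),
]

# Current model: P(Y=H | X, Z). Only the two entries quoted on the slide are
# taken from it; the other two are assumptions for the sketch.
p_y_h_given = {("H", "T"): 0.3, ("T", "T"): 0.4, ("H", "H"): 0.5, ("T", "H"): 0.5}

expected_counts = {}   # expected N(X, Y)
for x, y, z in data:
    if y is not None:               # fully observed row contributes a count of 1
        expected_counts[(x, y)] = expected_counts.get((x, y), 0.0) + 1.0
    else:                           # missing Y: split the count by P(Y | X, Z, theta)
        p_h = p_y_h_given[(x, z)]
        expected_counts[(x, "H")] = expected_counts.get((x, "H"), 0.0) + p_h
        expected_counts[(x, "T")] = expected_counts.get((x, "T"), 0.0) + (1.0 - p_h)

print(expected_counts)
```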

Slide 7: EM (cont.)
[Figure: the EM cycle for a network with hidden variable H, parents X1, X2, X3 and children Y1, Y2, Y3.]
- Start from an initial network (G, Θ0) and the training data.
- E-step: compute the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H).
- M-step: reparameterize to obtain the updated network (G, Θ1).
- Reiterate.
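Below is a minimal sketch, not the slides' own code, of this E-step/M-step cycle for the simplest instance of the picture: one hidden binary variable H with a single observed binary child Y, so the expected counts are N(H) and N(Y, H). The data and the starting parameters are made up.

```python
# Hypothetical two-node network H -> Y with H always hidden, Y observed.
# theta = (P(H=1), P(Y=1 | H=0), P(Y=1 | H=1)).
def e_step(theta, data):
    p_h, p_y_h0, p_y_h1 = theta
    n_h = [0.0, 0.0]                                        # expected N(H)
    n_yh = {(y, h): 0.0 for y in (0, 1) for h in (0, 1)}    # expected N(Y, H)
    for y in data:
        # joint P(H=h, Y=y | theta) for h = 0, 1
        joint = [
            (1 - p_h) * (p_y_h0 if y == 1 else 1 - p_y_h0),
            p_h * (p_y_h1 if y == 1 else 1 - p_y_h1),
        ]
        z = sum(joint)
        for h in (0, 1):
            post = joint[h] / z                             # P(H=h | Y=y, theta)
            n_h[h] += post
            n_yh[(y, h)] += post
    return n_h, n_yh

def m_step(n_h, n_yh):
    p_h = n_h[1] / (n_h[0] + n_h[1])
    p_y_h0 = n_yh[(1, 0)] / (n_yh[(0, 0)] + n_yh[(1, 0)])
    p_y_h1 = n_yh[(1, 1)] / (n_yh[(0, 1)] + n_yh[(1, 1)])
    return (p_h, p_y_h0, p_y_h1)

data = [1, 1, 0, 1, 0, 0, 1, 1]        # observed Y values (made up)
theta = (0.5, 0.3, 0.8)                # arbitrary starting point
for _ in range(20):                    # reiterate E-step + M-step
    theta = m_step(*e_step(theta, data))
print(theta)
```

Because H is never observed, its two values are interchangeable here: different starting points can reach relabelled versions of the same solution, which is exactly the replication of maxima described on slide 4.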

Slide 8: MLE from Incomplete Data
- Finding the MLE parameters is a nonlinear optimization problem.
- Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice").
- Guarantee: the maximum of the new function scores better than the current point.
[Figure: the likelihood surface L(Θ | D) with the current point and the surrogate function.]

Slide 9: EM in Practice
Initial parameters:
- Random parameter setting.
- "Best" guess from another source.
Stopping criteria:
- Small change in the likelihood of the data.
- Small change in the parameter values.
Avoiding bad local maxima:
- Multiple restarts.
- Early "pruning" of unpromising ones.
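A small sketch of the restart-and-stop logic above (early pruning omitted), written around a generic em_step/log_lik pair; these names are hypothetical, and any concrete EM update such as the sketches elsewhere in this transcript could be plugged in. Thresholds and the number of restarts are arbitrary.

```python
def run_em(em_step, log_lik, init_theta, data, tol=1e-6, max_iter=200):
    """Run EM from one starting point until the change in log-likelihood is small."""
    theta, ll = init_theta, log_lik(init_theta, data)
    for _ in range(max_iter):
        theta = em_step(theta, data)
        new_ll = log_lik(theta, data)
        if new_ll - ll < tol:          # stopping criterion: small change in likelihood
            ll = new_ll
            break
        ll = new_ll
    return theta, ll

def em_with_restarts(em_step, log_lik, sample_theta, data, restarts=10):
    """Multiple random restarts; keep the highest-scoring local maximum found."""
    runs = (run_em(em_step, log_lik, sample_theta(), data) for _ in range(restarts))
    return max(runs, key=lambda pair: pair[1])
```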

Slide 10: The setup of the EM algorithm
We start with a likelihood function parameterized by θ. The observed quantity is denoted X = x; it is often a vector x_1, ..., x_L of observations (e.g., evidence for some nodes in a Bayesian network). The hidden quantity is a vector Y = y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x, y | θ) would be easy to maximize.
The log-likelihood of an observation x has the form:
log P(x | θ) = log P(x, y | θ) - log P(y | x, θ)
(because P(x, y | θ) = P(x | θ) P(y | x, θ)).
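A tiny numeric check of this identity, using an arbitrary made-up joint distribution in place of P(x, y | θ); nothing here comes from the slides.

```python
import math

# Arbitrary joint over binary x and y, standing in for P(x, y | theta).
p_xy = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

x, y = 1, 0
p_x = sum(p_xy[(x, yy)] for yy in (0, 1))        # P(x | theta)
p_y_given_x = p_xy[(x, y)] / p_x                 # P(y | x, theta)

lhs = math.log(p_x)                              # log P(x | theta)
rhs = math.log(p_xy[(x, y)]) - math.log(p_y_given_x)
print(abs(lhs - rhs) < 1e-12)                    # True: the identity holds
```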

Slide 11: The goal of the EM algorithm
The log-likelihood of an observation x has the form:
log P(x | θ) = log P(x, y | θ) - log P(y | x, θ)
The goal: starting with a current parameter vector θ', EM looks for a new vector θ such that P(x | θ) > P(x | θ'), with the highest possible difference.
The result: after enough iterations, EM reaches a local maximum of the likelihood P(x | θ).
For independent points (x_i, y_i), i = 1, ..., m, we can similarly write:
Σ_i log P(x_i | θ) = Σ_i log P(x_i, y_i | θ) - Σ_i log P(y_i | x_i, θ)
We will stick to one observation in the derivation, recalling that all derived equations can be modified by summing over the observations.

Slide 12: The mathematics involved
Recall that the expectation of a random variable Y with pdf p(y) is given by E[Y] = Σ_y y p(y). The expectation of a function L(Y) is given by E[L(Y)] = Σ_y L(y) p(y).
A slightly harder example:
E_θ'[log p(x, y | θ)] = Σ_y p(y | x, θ') log p(x, y | θ) ≡ Q(θ | θ')
The expectation operator E is linear: for two random variables X, Y and constants a, b,
E[aX + bY] = a E[X] + b E[Y].

Slide 13: The mathematics involved (cont.)
Starting from log P(x | θ) = log P(x, y | θ) - log P(y | x, θ), multiplying both sides by P(y | x, θ'), and summing over y yields
log P(x | θ) = Σ_y P(y | x, θ') log P(x, y | θ) - Σ_y P(y | x, θ') log P(y | x, θ)
where the first term is E_θ'[log p(x, y | θ)] = Q(θ | θ').
We now observe that
Δ = log P(x | θ) - log P(x | θ') = Q(θ | θ') - Q(θ' | θ') + Σ_y P(y | x, θ') log [P(y | x, θ') / P(y | x, θ)]
and the last sum is a relative entropy, hence ≥ 0.
So choosing θ* = argmax_θ Q(θ | θ') maximizes this lower bound on the difference Δ and guarantees Δ ≥ 0; repeating the process leads to a local maximum of log P(x | θ).

Slide 14: The EM algorithm itself
Input: a likelihood function p(x, y | θ) parameterized by θ.
Initialization: fix an arbitrary starting value θ'.
Repeat:
  E-step: compute Q(θ | θ') = E_θ'[log P(x, y | θ)].
  M-step: θ' ← argmax_θ Q(θ | θ').
Until Δ = log P(x | θ) - log P(x | θ') < ε.
Comment: at the M-step one can in fact choose any θ with Q(θ | θ') > Q(θ' | θ') (which still guarantees Δ ≥ 0), rather than the exact argmax. This yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute.
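A hedged sketch of the algorithm as stated, for a deliberately tiny model so the M-step argmax of Q(θ | θ') is available in closed form: a hidden bit y with P(y = 1) = θ to be learned, and an observed noisy copy x with a fixed flip model P(x = y) = 0.8. The data and all numbers are assumptions for the example.

```python
import math

NOISE = 0.8   # fixed P(x = y); only theta = P(y = 1) is being learned

def posterior_y1(x, theta):
    """P(y=1 | x, theta) for the toy model."""
    p1 = theta * (NOISE if x == 1 else 1 - NOISE)
    p0 = (1 - theta) * (NOISE if x == 0 else 1 - NOISE)
    return p1 / (p0 + p1)

def log_lik(theta, data):
    """log P(x | theta), summed over independent observations."""
    return sum(
        math.log(theta * (NOISE if x == 1 else 1 - NOISE)
                 + (1 - theta) * (NOISE if x == 0 else 1 - NOISE))
        for x in data
    )

def em(data, theta=0.3, eps=1e-9):
    while True:
        # E-step: the posteriors P(y | x, theta') are all that Q(theta | theta') needs here.
        w = [posterior_y1(x, theta) for x in data]
        # M-step: argmax_theta Q(theta | theta') has this closed form (mean posterior).
        new_theta = sum(w) / len(w)
        # Stop when delta = log P(x | new_theta) - log P(x | theta) < eps.
        if log_lik(new_theta, data) - log_lik(theta, data) < eps:
            return new_theta
        theta = new_theta

data = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]   # made-up observations of x
print(em(data))
```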

Slide 15: Haplotyping
[Figure: two parallel hidden chains H_1, ..., H_L (the two haplotypes) and observed genotypes G_1, ..., G_L, where each G_i depends on the two hidden letters at position i.]
Every G_i is an unordered pair of letters {aa, ab, bb}. The source of one letter is the first chain and the source of the other letter is the second chain. Which letter comes from which chain? (Is it paternal or maternal DNA?)
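As a small illustration of the hidden quantity in this application (not from the slides; the function name and example genotypes are made up): for each unordered genotype, the hidden phase says which letter came from which chain, so a heterozygous site has two consistent assignments and a homozygous site has one.

```python
from itertools import product

def phase_assignments(genotypes):
    """Enumerate (first-chain, second-chain) haplotype pairs consistent with the
    observed unordered genotypes, e.g. 'ab' can be phased as (a, b) or (b, a)."""
    per_site = []
    for g in genotypes:
        a, b = g[0], g[1]
        per_site.append({(a, b)} if a == b else {(a, b), (b, a)})
    for combo in product(*per_site):
        first = "".join(site[0] for site in combo)
        second = "".join(site[1] for site in combo)
        yield first, second

print(list(phase_assignments(["ab", "aa", "ab"])))   # 4 consistent phasings
```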

Slide 16: Expectation Maximization (EM)
- In practice, EM converges quickly at the start but slows down near the (possibly local) maximum.
- Hence, EM is often run for a few iterations and then gradient ascent steps are applied.

Slide 17: MLE from Incomplete Data
- Finding the MLE parameters is a nonlinear optimization problem.
- Gradient ascent: follow the gradient of the likelihood with respect to the parameters.
[Figure: the likelihood surface L(Θ | D) with gradient steps.]

Slide 18: MLE from Incomplete Data
Both ideas find local maxima only, and both require multiple restarts to approximate the global maximum.

Slide 19: Gradient Ascent
- Main result (Theorem GA):
  ∂ log P(D | Θ) / ∂θ_{x_i, pa_i} = Σ_m P(x_i, pa_i | o[m], Θ) / θ_{x_i, pa_i}
- This requires computing P(x_i, pa_i | o[m], Θ) for all i, m.
- Inference replaces taking derivatives.

Slide 20: Gradient Ascent (cont.)
Proof: since the data records are independent,
∂ log P(D | Θ) / ∂θ_{x_i, pa_i} = Σ_m ∂ log P(o[m] | Θ) / ∂θ_{x_i, pa_i} = Σ_m [1 / P(o[m] | Θ)] ∂ P(o[m] | Θ) / ∂θ_{x_i, pa_i}
How do we compute ∂ P(o[m] | Θ) / ∂θ_{x_i, pa_i}?

Slide 21: Gradient Ascent (cont.)
Since P(o | Θ) = Σ_{x'_i, pa'_i} P(x'_i, pa'_i, o | Θ), and each term factorizes as
P(x'_i, pa'_i, o | Θ) = P(o^d | x'_i, pa'_i, o^nd, Θ) · P(x'_i | pa'_i) · P(pa'_i, o^nd | Θ)
where o^d and o^nd denote the evidence on descendants and non-descendants of X_i and P(x'_i | pa'_i) = θ_{x'_i, pa'_i}, the parameter θ_{x_i, pa_i} appears linearly in exactly one term of the sum (the other two factors do not depend on it). Hence
∂ P(o | Θ) / ∂θ_{x_i, pa_i} = P(o^d | x_i, pa_i, o^nd, Θ) · P(pa_i, o^nd | Θ) = P(x_i, pa_i, o | Θ) / θ_{x_i, pa_i}.

Slide 22: Gradient Ascent (cont.)
Putting it all together we get
∂ log P(D | Θ) / ∂θ_{x_i, pa_i} = Σ_m P(x_i, pa_i | o[m], Θ) / θ_{x_i, pa_i}
which is Theorem GA: each term divides a posterior probability, obtained by inference on the observed record o[m], by the current parameter value.
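As a hedged sanity check of this result (not from the slides), the sketch below applies the formula to a made-up two-node network X1 -> X2 in which X2 is sometimes missing, and compares it with a finite-difference derivative of the log-likelihood, treating the CPT entry θ_{X2=1|X1=1} as a free table entry exactly as the theorem does.

```python
import math

# Hypothetical two-node network X1 -> X2, both binary. Parameters are stored
# as explicit table entries (the form the theorem differentiates with respect to).
theta1 = {1: 0.6, 0: 0.4}                         # P(X1)
theta2 = {(1, 1): 0.7, (0, 1): 0.3,               # P(X2 = x2 | X1 = x1), keyed (x2, x1)
          (1, 0): 0.2, (0, 0): 0.8}

data = [(1, 1), (1, None), (0, 0), (1, 0), (0, None)]   # (x1, x2); None = missing x2

def record_prob(x1, x2, t2):
    """P(o[m] | Theta): sum over completions of the missing value."""
    x2_values = (0, 1) if x2 is None else (x2,)
    return sum(theta1[x1] * t2[(v, x1)] for v in x2_values)

def log_lik(t2):
    return sum(math.log(record_prob(x1, x2, t2)) for x1, x2 in data)

# Analytic gradient from the theorem: sum_m P(x2=1, x1=1 | o[m], Theta) / theta_{x2=1|x1=1}
grad = 0.0
for x1, x2 in data:
    joint = theta1[1] * theta2[(1, 1)] if x1 == 1 and x2 in (1, None) else 0.0
    posterior = joint / record_prob(x1, x2, theta2)     # P(x2=1, x1=1 | o[m], Theta)
    grad += posterior / theta2[(1, 1)]

# Numerical check: finite difference on the single table entry theta_{x2=1|x1=1}.
eps = 1e-6
bumped = dict(theta2)
bumped[(1, 1)] += eps
numeric = (log_lik(bumped) - log_lik(theta2)) / eps
print(grad, numeric)    # the two values should agree closely
```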

