Learning Bayesian networks


1 Learning Bayesian networks
Slides by Nir Friedman and Dan Geiger.

2 Known Structure - Complete Data
X, Y pairs: <h,t>, <h,h>, <h,h>, <t,t>, <h,t>, <t,h>, <h,h>, <h,h>, …   N = 10
Counts for the net X → Y: NX=h = 7, NX=t = 3; NY=h|X=h = 5, NY=t|X=h = 2; NY=h|X=t = 1, NY=t|X=t = 2.
Let Θ = {θx, θy|x, θy|~x} be the parameters of the Bayes net.
Each parameter is estimated separately, e.g. via the maximum likelihood principle, or using a prior Beta(e, e):
θx = NX=h/(NX=h+NX=t) = 7/(7+3), or (7+e)/(7+3+2e),
θy|x = NY=h|X=h/(NY=h|X=h+NY=t|X=h) = 5/(5+2), or (5+e)/(5+2+2e),
θy|~x = NY=h|X=t/(NY=h|X=t+NY=t|X=t) = 1/(1+2), or (1+e)/(1+2+2e).
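The estimates on this slide can be checked with a small sketch that works directly from the counts (the function name is illustrative; `e = 0` gives the plain MLE, `e > 0` the Beta(e, e)-smoothed estimate):

```python
from fractions import Fraction

def estimate(n_heads, n_tails, e=0):
    """MLE when e == 0; posterior mean under a Beta(e, e) prior otherwise."""
    return Fraction(n_heads + e, n_heads + n_tails + 2 * e)

theta_x       = estimate(7, 3)        # NX=h = 7, NX=t = 3          -> 7/10
theta_y_x     = estimate(5, 2)        # NY=h|X=h = 5, NY=t|X=h = 2  -> 5/7
theta_y_not_x = estimate(1, 2)        # NY=h|X=t = 1, NY=t|X=t = 2  -> 1/3
theta_x_sm    = estimate(7, 3, e=1)   # smoothed: (7+1)/(10+2) = 2/3
```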

3 Why independent estimations?
Θ = {θx, θy|x, θy|~x} are the parameters of the Bayes net X → Y:
P(X,Y) = P(X) P(Y|X), so P(Θ,X,Y) = P(Θ) P(X|Θ) P(Y|X,Θ).
Definition: Global parameter independence means that all parameters of vertex X are marginally independent of all parameters of vertex Y, for all vertices X, Y:
P(Θ) = P(θx, θy|x, θy|~x) = P(θx) P(θy|x, θy|~x)
Definition: Local parameter independence means that all parameters of a vertex are marginally independent of each other, for every vertex:
P(θy|x, θy|~x) = P(θy|x) P(θy|~x)

4 Why independent estimations?
Θ = {θx, θy|x, θy|~x} are the parameters of the Bayes net X → Y:
P(Θ, X=x, Y=y) = P(Θ) P(X=x|Θ) P(Y=y|X=x, Θ)
P(Θ) = P(θx, θy|x, θy|~x) = P(θx) P(θy|x) P(θy|~x)
P(X=x | Θ) = P(X=x | θx) = θx
P(Y=y | X=x, Θ) = P(Y=y | X=x, θy|x, θy|~x) = θy|x
P(Θ, X=x, Y=y) = [P(θx) θx] · [P(θy|x) θy|x] · [P(θy|~x)]

5 Why independent estimations?
Θ = {θx, θy|x, θy|~x} are the parameters of the Bayes net X → Y:
P(Θ, X=~x, Y=y) = P(Θ) P(X=~x|Θ) P(Y=y|X=~x, Θ)
P(Θ) = P(θx, θy|x, θy|~x) = P(θx) P(θy|x) P(θy|~x)
P(X=~x | Θ) = P(X=~x | θx) = 1−θx
P(Y=y | X=~x, Θ) = P(Y=y | X=~x, θy|x, θy|~x) = θy|~x
P(Θ, X=~x, Y=y) = [P(θx) (1−θx)] · [P(θy|x)] · [P(θy|~x) θy|~x]

6 Known Structure - Complete Data
Complete data = {(X1,Y1), (X2,Y2), …, (Xn,Yn)}
P(Θ|Data) = K P(Θ) P(Data|Θ) = K P(θx) P(θy|x) P(θy|~x) Πi=1..n P(Xi,Yi|Θ)
P(Θ|Data) = K P(θx) P(θy|x) P(θy|~x) Πi=1..n P(Xi|θx) P(Yi|Xi, θy|x, θy|~x)

7 Why independent estimations?
Complete data points = {(X1,Y1), (X2,Y2), …, (Xn,Yn)}
Counts: NX=x, NX=~x, NY=y|X=x, NY=~y|X=x, NY=y|X=~x, NY=~y|X=~x
P(Θ|Data) = K · P(θx) · P(θy|x) · P(θy|~x) · Πi=1..n P(Xi|θx) P(Yi|Xi, θy|x, θy|~x)
P(Θ|Data) = K · [P(θx) · Πi=1..n P(Xi|θx)] · [P(θy|x) · Πi:Xi=x P(Yi|Xi, θy|x)] · [P(θy|~x) · Πi:Xi=~x P(Yi|Xi, θy|~x)]
P(Θ|Data) = K · [P(θx) · θx^NX=x · (1−θx)^NX=~x] · [P(θy|x) · θy|x^NY=y|X=x · (1−θy|x)^NY=~y|X=x] · [P(θy|~x) · θy|~x^NY=y|X=~x · (1−θy|~x)^NY=~y|X=~x]

8 Summary for independent estimations
P(|Data) = K ∙[ P(x) ∙[x] NX=x ∙ [1-x] NX=~x]∙ [P(y|x) ∙ [y|x] NY=y|X=x ∙ [1-y|x] NY=~y|X=x ]∙ [P(y|~x) ∙ [y|~x] NY=y|X=~x ∙ [1-y|~x] NY=~y|X=~x ] Three Conjugate Beta priors with hyper-parameters, Imaginary Counts: nX=x ,nX=~x ,nY=y|X=x ,nY=~y|X=x ,nY=y|X=~x ,nY=~y|X=~x P(|Data) = K∙[ Beta(x; nX=x ,nX=~x) ∙[x] NX=x ∙ [1-x] NX=~x]∙ [Beta(y|x; nY=y|X=x ,nY=~y|X=x) ∙ [y|x] NY=y|X=x ∙ [1-y|x] NY=~y|X=x ]∙ [Beta(y|~x; nY=y|X=~x ,nY=~y|X=~x) ∙ [y|~x] NY=y|X=~x ∙ [1-y|~x] NY=~y|X=~x ] Maximum Likelihood Estimates: MaximumP(Data |) =[ [x] NX=x ∙ [1-x] NX=~x]∙ [ [y|x] NY=y|X=x ∙ [1-y|x] NY=~y|X=x ]∙ [ [y|~x] NY=y|X=~x ∙ [1-y|~x] NY=~y|X=~x ]

9 Known Structure - Complete Data
(Recap of slide 2.) Same data and counts as before: NX=h = 7, NX=t = 3; NY=h|X=h = 5, NY=t|X=h = 2; NY=h|X=t = 1, NY=t|X=t = 2 (N = 10). With Θ = {θx, θy|x, θy|~x}, each parameter is estimated separately, by maximum likelihood or with a Beta(e, e) prior:
θx = 7/(7+3), or (7+e)/(7+3+2e),
θy|x = 5/(5+2), or (5+e)/(5+2+2e),
θy|~x = 1/(1+2), or (1+e)/(1+2+2e).

10 Known Structure - Incomplete Data
Incomplete Data = { (?,Y1), (X2,?),…, (Xn,Yn)}

11 Learning Parameters from Incomplete Data
X Y|X=H m X[m] Y[m] Y|X=T Incomplete data: Posterior distributions can become dependent Consequence: ML parameters can not be computed separately Posterior is not a product of independent posteriors No Longer A unimodal “Nice” Likelihood function or Posterior

12 Known Structure -- Incomplete Data
[Figure: records over E, B, A with missing values, <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, …, <?,Y,Y>, are fed to an inducer that fills in the conditional probability table P(A | E, B) for the network E → A ← B.]
The network structure is specified, but the data contain missing values.
We will consider filling in assignments to the missing values.

13 Expectation Maximization (EM)
A general-purpose method for learning from incomplete data.
Intuition: if we had access to the counts, we could estimate the parameters. However, missing values do not allow us to compute the counts directly.
Idea: "complete" the counts using the current parameter assignment.

14 Expectation Maximization (EM)
[Figure: data over X, Y, Z with missing values, together with the current model Θ, yield a table of expected counts N(X, Y).]
For a record whose Y value is missing, the current model supplies fractional counts, e.g. P(Y=H | X=H, Z=T, Θ) = 0.3 and P(Y=H | X=T, Z=T, Θ) = 0.4. (These numbers are placed for illustration; they have not been computed.)
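The fractional-count idea can be sketched for a two-node net X → Y with some Y values missing. A missing Y contributes P(Y=h | X, Θ) instead of a hard 0/1 count (the parameter values and data below are illustrative, not the numbers on the slide):

```python
theta = {"x": 0.6, "y|x": 0.7, "y|~x": 0.2}   # current parameter guess

def p_y_heads(x, theta):
    """P(Y=h | X=x) under the current parameters."""
    return theta["y|x"] if x == "h" else theta["y|~x"]

data = [("h", "h"), ("h", None), ("t", "t"), ("t", None)]  # None = Y missing

expected_N = 0.0   # expected count N(X=h, Y=h)
for x, y in data:
    if x != "h":
        continue
    if y is None:
        expected_N += p_y_heads(x, theta)   # fractional count, here 0.7
    elif y == "h":
        expected_N += 1.0                   # hard count from an observed Y
```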

15 EM (cont.)
[Figure: the EM loop. An initial network (G, Θ0) and the training data enter the computation of expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H) for the network with X1, X2, X3 as parents of H and Y1, Y2, Y3 as children of H (E-step); reparameterizing yields an updated network (G, Θ1) (M-step); then reiterate.]

16 MLE from Incomplete Data
Finding the MLE parameters is a nonlinear optimization problem. Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice"). Guarantee: the maximum of the new function scores better on the likelihood L(Θ|D) than the current point.

17 EM in Practice
Initial parameters: random parameter setting, or a "best" guess from another source.
Stopping criteria: small change in the likelihood of the data, or small change in parameter values.
Avoiding bad local maxima: multiple restarts; early "pruning" of unpromising ones.

18 The setup of the EM algorithm
We start with a likelihood function parameterized by Θ.
The observed quantity is denoted X=x. It is often a vector x1,…,xL of observations (e.g., evidence for some nodes in a Bayesian network).
The hidden quantity is a vector Y=y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x,y|Θ) would be easy to maximize.
The log-likelihood of an observation x has the form:
log P(x|Θ) = log P(x,y|Θ) − log P(y|x,Θ)
(because P(x,y|Θ) = P(x|Θ) P(y|x,Θ)).

19 The goal of EM algorithm
The log-likelihood of an observation x has the form:
log P(x|Θ) = log P(x,y|Θ) − log P(y|x,Θ)
The goal: starting with a current parameter vector Θ', EM's goal is to find a new vector Θ such that P(x|Θ) > P(x|Θ'), with the highest possible difference.
The result: after enough iterations, EM reaches a local maximum of the likelihood P(x|Θ).

20 The Mathematics involved
Recall that the expectation of a random variable Y with pdf p(y) is E[Y] = Σy y p(y). The expectation of a function L(Y) is E[L(Y)] = Σy L(y) p(y).
A somewhat harder example (where we choose L(Y) ≡ log p(x,Y|Θ), with x and Θ held constant):
EΘ'[log p(x,Y|Θ)] = Σy p(y|x,Θ') log p(x,y|Θ) ≡ Q(Θ|Θ')
The expectation operator E is linear: for two random variables X, Y and constants a, b, we have E[aX+bY] = a E[X] + b E[Y].

21 The Mathematics involved (Cont.)
Starting with log P(x| ) = log P(x, y| ) – log P(y|x, ), multiplying both sides by P(y|x ,’), and summing over y, yields Log P(x |) =  P(y|x, ’) log P(x ,y|) -  P(y|x, ’) log P(y |x, ) y = E’[log p(x,y|)] = Q( |’) We now observe that = log P(x| ) – log P(x|’) = Q( | ’) – Q(’ | ’) +  P(y|x, ’) log [P(y |x, ’) / P(y |x, )] y 0 (relative entropy) So choosing * = argmax Q(| ’) maximizes the difference , and repeating this process leads to a local maximum of log P(x| ).

22 The EM algorithm itself
Input: a likelihood function P(x,y|Θ) parameterized by Θ.
Initialization: fix an arbitrary starting value Θ'.
Repeat:
E-step: compute Q(Θ|Θ') = EΘ'[log P(x,y|Θ)]
M-step: Θ' ← argmaxΘ Q(Θ|Θ')
until Δ = log P(x|Θ) − log P(x|Θ') < ε.
Comment: at the M-step one can actually choose any Θ' as long as Δ > 0. This change yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute.
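The E-step/M-step loop can be sketched end-to-end for the two-node network X → Y from the earlier slides, with some Y values missing at random. The data and the starting point below are illustrative; note that on this particular model the loop converges to the MLE computed from the fully observed pairs:

```python
def em(data, theta, iters=100):
    """EM for X -> Y with Y possibly missing (None). Values are 'h'/'t'."""
    for _ in range(iters):
        # E-step: expected sufficient statistics under the current theta
        Nx = {"h": 0.0, "t": 0.0}
        Ny = {(x, y): 0.0 for x in "ht" for y in "ht"}
        for x, y in data:
            Nx[x] += 1.0
            if y is None:
                p = theta["y|x"] if x == "h" else theta["y|~x"]
                Ny[(x, "h")] += p           # fractional "completed" counts
                Ny[(x, "t")] += 1.0 - p
            else:
                Ny[(x, y)] += 1.0
        # M-step: re-estimate each parameter as a ratio of expected counts
        theta = {
            "x":    Nx["h"] / (Nx["h"] + Nx["t"]),
            "y|x":  Ny[("h", "h")] / (Ny[("h", "h")] + Ny[("h", "t")]),
            "y|~x": Ny[("t", "h")] / (Ny[("t", "h")] + Ny[("t", "t")]),
        }
    return theta

data = [("h", "h"), ("h", "h"), ("h", "t"), ("h", None), ("h", None),
        ("t", "h"), ("t", "t"), ("t", None)]
theta = em(data, {"x": 0.5, "y|x": 0.5, "y|~x": 0.5})
```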

23 Comment on the proof of EM
We used the log-likelihood of one observation x, of the form:
log P(x|Θ) = log P(x,y|Θ) − log P(y|x,Θ)
For independent points (xi, yi), i = 1,…,m, we can similarly write:
Σi log P(xi|Θ) = Σi log P(xi,yi|Θ) − Σi log P(yi|xi,Θ)
We have stuck to one observation in our derivation, but all the derived equations carry over to a set of points by summing over the observations.

24 Expectation Maximization (EM)
In practice, EM converges rather quickly at the start but slowly near the (possibly local) maximum. Hence EM is often run for a few iterations, and then gradient-ascent steps are applied.

25 MLE from Incomplete Data
Finding the MLE parameters is a nonlinear optimization problem. Gradient ascent: follow the gradient of the likelihood L(Θ|D) with respect to the parameters.

26 MLE from Incomplete Data
Both ideas find local maxima only, and both require multiple restarts to approximate the global maximum.

27 Gradient Ascent
Main result (Theorem GA):
∂ log P(D|Θ) / ∂θxi,pai = Σm P(xi, pai | o[m], Θ) / θxi,pai
This requires computing P(xi, pai | o[m], Θ) for all i, m. Inference replaces taking derivatives.
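The theorem can be sanity-checked numerically on the two-node net X → Y, treating the table entries as free parameters and comparing the formula against a finite-difference derivative (all numbers and names below are illustrative):

```python
import math

P_X_H, P_YT_XH = 0.6, 0.3          # fixed table entries (illustrative)

def loglik(p):                     # p = theta for (Y=h, X=h)
    # Two records: o[1] = (X=h, Y=h) fully observed, o[2] = (X=h, Y=?).
    return math.log(P_X_H * p) + math.log(P_X_H * (p + P_YT_XH))

p = 0.7
# Theorem GA: gradient = sum_m P(Y=h, X=h | o[m], Theta) / theta
#   o[1]: posterior is 1;  o[2]: posterior is p / (p + P_YT_XH)
grad = 1.0 / p + (p / (p + P_YT_XH)) / p

eps = 1e-6                         # central finite-difference comparison
num = (loglik(p + eps) - loglik(p - eps)) / (2 * eps)
```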

28 Gradient Ascent (cont)
Proof:
∂ log P(D|Θ) / ∂θxi,pai = Σm ∂ log P(o[m] | Θ) / ∂θxi,pai = Σm [1 / P(o[m] | Θ)] · ∂P(o[m] | Θ) / ∂θxi,pai
How do we compute ∂P(o[m] | Θ) / ∂θxi,pai ?

29 Gradient Ascent (cont)
Since P(o|Θ) = Σx'i,pa'i P(o, x'i, pa'i | Θ) = Σx'i,pa'i P(o | x'i, pa'i, Θ) θx'i,pa'i P(pa'i | Θ), and only the term with x'i = xi, pa'i = pai depends (linearly) on θxi,pai, we get
∂P(o|Θ) / ∂θxi,pai = P(o | xi, pai, Θ) P(pai | Θ) = P(o, xi, pai | Θ) / θxi,pai

30 Gradient Ascent (cont)
Putting it all together, we get
∂ log P(D|Θ) / ∂θxi,pai = Σm P(o[m], xi, pai | Θ) / [P(o[m] | Θ) · θxi,pai] = Σm P(xi, pai | o[m], Θ) / θxi,pai

