Learning Bayesian networks


1 Learning Bayesian networks
Slides by Nir Friedman and Dan Geiger.

2 Known Structure - Complete Data
X, Y pairs: <h,t>, <h,h>, <h,h>, <t,t>, <h,t>, <t,h>, <h,h>, <h,h>, …   N = 10
Counts for the net X → Y: NX=h = 7, NX=t = 3; NY=h|X=h = 5, NY=t|X=h = 2; NY=h|X=t = 1, NY=t|X=t = 2.
Let Θ = {θx, θy|x, θy|~x} be the parameters of the Bayes net.
Each parameter is estimated separately, e.g. via the maximum likelihood principle, or using a prior Beta(e, e):
θx = NX=h/(NX=h+NX=t) = 7/(7+3), or (7+e)/(7+3+2e),
θy|x = NY=h|X=h/(NY=h|X=h+NY=t|X=h) = 5/(5+2), or (5+e)/(5+2+2e),
θy|~x = NY=h|X=t/(NY=h|X=t+NY=t|X=t) = 1/(1+2), or (1+e)/(1+2+2e).
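The estimates on this slide can be checked with a small sketch that works directly from the counts (the function name is illustrative; `e = 0` gives the plain MLE, `e > 0` the Beta(e, e)-smoothed estimate):

```python
from fractions import Fraction

def estimate(n_heads, n_tails, e=0):
    """MLE when e == 0; posterior mean under a Beta(e, e) prior otherwise."""
    return Fraction(n_heads + e, n_heads + n_tails + 2 * e)

theta_x       = estimate(7, 3)        # NX=h = 7, NX=t = 3          -> 7/10
theta_y_x     = estimate(5, 2)        # NY=h|X=h = 5, NY=t|X=h = 2  -> 5/7
theta_y_not_x = estimate(1, 2)        # NY=h|X=t = 1, NY=t|X=t = 2  -> 1/3
theta_x_sm    = estimate(7, 3, e=1)   # smoothed: (7+1)/(10+2) = 2/3
```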

3 Why independent estimations?
Θ = {θx, θy|x, θy|~x} are the parameters of the Bayes net X → Y:
P(X,Y) = P(X) P(Y|X), so P(Θ,X,Y) = P(Θ) P(X|Θ) P(Y|X,Θ).
Definition: Global parameter independence means that all parameters of vertex X are marginally independent of all parameters of vertex Y, for all vertices X, Y:
P(Θ) = P(θx, θy|x, θy|~x) = P(θx) P(θy|x, θy|~x)
Definition: Local parameter independence means that all parameters of a vertex are marginally independent of each other, for every vertex:
P(θy|x, θy|~x) = P(θy|x) P(θy|~x)

4 Why independent estimations?
Θ = {θx, θy|x, θy|~x} are the parameters of the Bayes net X → Y:
P(Θ, X=x, Y=y) = P(Θ) P(X=x|Θ) P(Y=y|X=x, Θ)
P(Θ) = P(θx, θy|x, θy|~x) = P(θx) P(θy|x) P(θy|~x)
P(X=x | Θ) = P(X=x | θx) = θx
P(Y=y | X=x, Θ) = P(Y=y | X=x, θy|x, θy|~x) = θy|x
P(Θ, X=x, Y=y) = [P(θx) θx] · [P(θy|x) θy|x] · [P(θy|~x)]

5 Why independent estimations?
Θ = {θx, θy|x, θy|~x} are the parameters of the Bayes net X → Y:
P(Θ, X=~x, Y=y) = P(Θ) P(X=~x|Θ) P(Y=y|X=~x, Θ)
P(Θ) = P(θx, θy|x, θy|~x) = P(θx) P(θy|x) P(θy|~x)
P(X=~x | Θ) = P(X=~x | θx) = 1−θx
P(Y=y | X=~x, Θ) = P(Y=y | X=~x, θy|x, θy|~x) = θy|~x
P(Θ, X=~x, Y=y) = [P(θx) (1−θx)] · [P(θy|x)] · [P(θy|~x) θy|~x]

6 Known Structure - Complete Data
Complete data = {(X1,Y1), (X2,Y2), …, (Xn,Yn)}
P(Θ|Data) = K P(Θ) P(Data|Θ) = K P(θx) P(θy|x) P(θy|~x) Πi=1..n P(Xi,Yi|Θ)
P(Θ|Data) = K P(θx) P(θy|x) P(θy|~x) Πi=1..n P(Xi|θx) P(Yi|Xi, θy|x, θy|~x)

7 Why independent estimations?
Complete data points = {(X1,Y1), (X2,Y2), …, (Xn,Yn)}
Counts: NX=x, NX=~x, NY=y|X=x, NY=~y|X=x, NY=y|X=~x, NY=~y|X=~x
P(Θ|Data) = K · P(θx) · P(θy|x) · P(θy|~x) · Πi=1..n P(Xi|θx) P(Yi|Xi, θy|x, θy|~x)
P(Θ|Data) = K · [P(θx) · Πi=1..n P(Xi|θx)] · [P(θy|x) · Πi:Xi=x P(Yi|Xi, θy|x)] · [P(θy|~x) · Πi:Xi=~x P(Yi|Xi, θy|~x)]
P(Θ|Data) = K · [P(θx) · θx^NX=x · (1−θx)^NX=~x] · [P(θy|x) · θy|x^NY=y|X=x · (1−θy|x)^NY=~y|X=x] · [P(θy|~x) · θy|~x^NY=y|X=~x · (1−θy|~x)^NY=~y|X=~x]

8 Summary for independent estimations
P(|Data) = K ∙[ P(x) ∙[x] NX=x ∙ [1-x] NX=~x]∙ [P(y|x) ∙ [y|x] NY=y|X=x ∙ [1-y|x] NY=~y|X=x ]∙ [P(y|~x) ∙ [y|~x] NY=y|X=~x ∙ [1-y|~x] NY=~y|X=~x ] Three Conjugate Beta priors with hyper-parameters, Imaginary Counts: nX=x ,nX=~x ,nY=y|X=x ,nY=~y|X=x ,nY=y|X=~x ,nY=~y|X=~x P(|Data) = K∙[ Beta(x; nX=x ,nX=~x) ∙[x] NX=x ∙ [1-x] NX=~x]∙ [Beta(y|x; nY=y|X=x ,nY=~y|X=x) ∙ [y|x] NY=y|X=x ∙ [1-y|x] NY=~y|X=x ]∙ [Beta(y|~x; nY=y|X=~x ,nY=~y|X=~x) ∙ [y|~x] NY=y|X=~x ∙ [1-y|~x] NY=~y|X=~x ] Maximum Likelihood Estimates: MaximumP(Data |) =[ [x] NX=x ∙ [1-x] NX=~x]∙ [ [y|x] NY=y|X=x ∙ [1-y|x] NY=~y|X=x ]∙ [ [y|~x] NY=y|X=~x ∙ [1-y|~x] NY=~y|X=~x ]

9 Known Structure - Complete Data
(Recap of slide 2.) Same data and counts as before: NX=h = 7, NX=t = 3; NY=h|X=h = 5, NY=t|X=h = 2; NY=h|X=t = 1, NY=t|X=t = 2 (N = 10). With Θ = {θx, θy|x, θy|~x}, each parameter is estimated separately, by maximum likelihood or with a Beta(e, e) prior:
θx = 7/(7+3), or (7+e)/(7+3+2e),
θy|x = 5/(5+2), or (5+e)/(5+2+2e),
θy|~x = 1/(1+2), or (1+e)/(1+2+2e).

10 Known Structure - Incomplete Data
Incomplete Data = { (?,Y1), (X2,?),…, (Xn,Yn)}

11 Learning Parameters from Incomplete Data
X Y|X=H m X[m] Y[m] Y|X=T Incomplete data: Posterior distributions can become dependent Consequence: ML parameters can not be computed separately Posterior is not a product of independent posteriors No Longer A unimodal “Nice” Likelihood function or Posterior

12 Known Structure -- Incomplete Data
[Figure: records over E, B, A with missing values, <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, …, <?,Y,Y>, are fed to an inducer that fills in the conditional probability table P(A | E, B) for the network E → A ← B.]
The network structure is specified, but the data contain missing values.
We will consider filling in assignments to the missing values.

13 Expectation Maximization (EM)
A general-purpose method for learning from incomplete data.
Intuition: if we had access to the counts, we could estimate the parameters. However, missing values do not allow us to compute the counts directly.
Idea: "complete" the counts using the current parameter assignment.

14 Expectation Maximization (EM)
[Figure: data over X, Y, Z with missing values, together with the current model Θ, yield a table of expected counts N(X, Y).]
For a record whose Y value is missing, the current model supplies fractional counts, e.g. P(Y=H | X=H, Z=T, Θ) = 0.3 and P(Y=H | X=T, Z=T, Θ) = 0.4. (These numbers are placed for illustration; they have not been computed.)
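The fractional-count idea can be sketched for a two-node net X → Y with some Y values missing. A missing Y contributes P(Y=h | X, Θ) instead of a hard 0/1 count (the parameter values and data below are illustrative, not the numbers on the slide):

```python
theta = {"x": 0.6, "y|x": 0.7, "y|~x": 0.2}   # current parameter guess

def p_y_heads(x, theta):
    """P(Y=h | X=x) under the current parameters."""
    return theta["y|x"] if x == "h" else theta["y|~x"]

data = [("h", "h"), ("h", None), ("t", "t"), ("t", None)]  # None = Y missing

expected_N = 0.0   # expected count N(X=h, Y=h)
for x, y in data:
    if x != "h":
        continue
    if y is None:
        expected_N += p_y_heads(x, theta)   # fractional count, here 0.7
    elif y == "h":
        expected_N += 1.0                   # hard count from an observed Y
```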

15 EM (cont.)
[Figure: the EM loop. An initial network (G, Θ0) and the training data enter the computation of expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H) for the network with X1, X2, X3 as parents of H and Y1, Y2, Y3 as children of H (E-step); reparameterizing yields an updated network (G, Θ1) (M-step); then reiterate.]

16 MLE from Incomplete Data
Finding the MLE parameters is a nonlinear optimization problem. Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice"). Guarantee: the maximum of the new function scores better on the likelihood L(Θ|D) than the current point.

17 EM in Practice
Initial parameters: random parameter setting, or a "best" guess from another source.
Stopping criteria: small change in the likelihood of the data, or small change in parameter values.
Avoiding bad local maxima: multiple restarts; early "pruning" of unpromising ones.

18 The setup of the EM algorithm
We start with a likelihood function parameterized by Θ.
The observed quantity is denoted X=x. It is often a vector x1,…,xL of observations (e.g., evidence for some nodes in a Bayesian network).
The hidden quantity is a vector Y=y (e.g., states of unobserved variables in a Bayesian network). The quantity y is defined such that, if it were known, the likelihood of the completed data point P(x,y|Θ) would be easy to maximize.
The log-likelihood of an observation x has the form:
log P(x|Θ) = log P(x,y|Θ) − log P(y|x,Θ)
(because P(x,y|Θ) = P(x|Θ) P(y|x,Θ)).

19 The goal of EM algorithm
The log-likelihood of an observation x has the form:
log P(x|Θ) = log P(x,y|Θ) − log P(y|x,Θ)
The goal: starting with a current parameter vector Θ', EM's goal is to find a new vector Θ such that P(x|Θ) > P(x|Θ'), with the highest possible difference.
The result: after enough iterations, EM reaches a local maximum of the likelihood P(x|Θ).

20 The Mathematics involved
Recall that the expectation of a random variable Y with pdf p(y) is E[Y] = Σy y p(y). The expectation of a function L(Y) is E[L(Y)] = Σy L(y) p(y).
A somewhat harder example (where we choose L(Y) ≡ log p(x,Y|Θ), with x and Θ held constant):
EΘ'[log p(x,Y|Θ)] = Σy p(y|x,Θ') log p(x,y|Θ) ≡ Q(Θ|Θ')
The expectation operator E is linear: for two random variables X, Y and constants a, b, we have E[aX+bY] = a E[X] + b E[Y].

21 The Mathematics involved (Cont.)
Starting with log P(x| ) = log P(x, y| ) – log P(y|x, ), multiplying both sides by P(y|x ,’), and summing over y, yields Log P(x |) =  P(y|x, ’) log P(x ,y|) -  P(y|x, ’) log P(y |x, ) y = E’[log p(x,y|)] = Q( |’) We now observe that = log P(x| ) – log P(x|’) = Q( | ’) – Q(’ | ’) +  P(y|x, ’) log [P(y |x, ’) / P(y |x, )] y 0 (relative entropy) So choosing * = argmax Q(| ’) maximizes the difference , and repeating this process leads to a local maximum of log P(x| ).

22 The EM algorithm itself
Input: a likelihood function P(x,y|Θ) parameterized by Θ.
Initialization: fix an arbitrary starting value Θ'.
Repeat:
E-step: compute Q(Θ|Θ') = EΘ'[log P(x,y|Θ)]
M-step: Θ' ← argmaxΘ Q(Θ|Θ')
until Δ = log P(x|Θ) − log P(x|Θ') < ε.
Comment: at the M-step one can actually choose any Θ' as long as Δ > 0. This change yields the so-called Generalized EM algorithm. It is important when the argmax is hard to compute.
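The E-step/M-step loop can be sketched end-to-end for the two-node network X → Y from the earlier slides, with some Y values missing at random. The data and the starting point below are illustrative; note that on this particular model the loop converges to the MLE computed from the fully observed pairs:

```python
def em(data, theta, iters=100):
    """EM for X -> Y with Y possibly missing (None). Values are 'h'/'t'."""
    for _ in range(iters):
        # E-step: expected sufficient statistics under the current theta
        Nx = {"h": 0.0, "t": 0.0}
        Ny = {(x, y): 0.0 for x in "ht" for y in "ht"}
        for x, y in data:
            Nx[x] += 1.0
            if y is None:
                p = theta["y|x"] if x == "h" else theta["y|~x"]
                Ny[(x, "h")] += p           # fractional "completed" counts
                Ny[(x, "t")] += 1.0 - p
            else:
                Ny[(x, y)] += 1.0
        # M-step: re-estimate each parameter as a ratio of expected counts
        theta = {
            "x":    Nx["h"] / (Nx["h"] + Nx["t"]),
            "y|x":  Ny[("h", "h")] / (Ny[("h", "h")] + Ny[("h", "t")]),
            "y|~x": Ny[("t", "h")] / (Ny[("t", "h")] + Ny[("t", "t")]),
        }
    return theta

data = [("h", "h"), ("h", "h"), ("h", "t"), ("h", None), ("h", None),
        ("t", "h"), ("t", "t"), ("t", None)]
theta = em(data, {"x": 0.5, "y|x": 0.5, "y|~x": 0.5})
```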

23 Comment on the proof of EM
We used the log-likelihood of one observation x, of the form:
log P(x|Θ) = log P(x,y|Θ) − log P(y|x,Θ)
For independent points (xi, yi), i = 1,…,m, we can similarly write:
Σi log P(xi|Θ) = Σi log P(xi,yi|Θ) − Σi log P(yi|xi,Θ)
We have stuck to one observation in our derivation, but all the derived equations carry over to a set of points by summing over the observations.

24 Expectation Maximization (EM)
In practice, EM converges rather quickly at the start but slowly near the (possibly local) maximum. Hence EM is often run for a few iterations, and then gradient-ascent steps are applied.

25 MLE from Incomplete Data
Finding the MLE parameters is a nonlinear optimization problem. Gradient ascent: follow the gradient of the likelihood L(Θ|D) with respect to the parameters.

26 MLE from Incomplete Data
Both ideas find local maxima only, and both require multiple restarts to approximate the global maximum.

27 Gradient Ascent
Main result (Theorem GA):
∂ log P(D|Θ) / ∂θxi,pai = Σm P(xi, pai | o[m], Θ) / θxi,pai
This requires computing P(xi, pai | o[m], Θ) for all i, m. Inference replaces taking derivatives.
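The theorem can be sanity-checked numerically on the two-node net X → Y, treating the table entries as free parameters and comparing the formula against a finite-difference derivative (all numbers and names below are illustrative):

```python
import math

P_X_H, P_YT_XH = 0.6, 0.3          # fixed table entries (illustrative)

def loglik(p):                     # p = theta for (Y=h, X=h)
    # Two records: o[1] = (X=h, Y=h) fully observed, o[2] = (X=h, Y=?).
    return math.log(P_X_H * p) + math.log(P_X_H * (p + P_YT_XH))

p = 0.7
# Theorem GA: gradient = sum_m P(Y=h, X=h | o[m], Theta) / theta
#   o[1]: posterior is 1;  o[2]: posterior is p / (p + P_YT_XH)
grad = 1.0 / p + (p / (p + P_YT_XH)) / p

eps = 1e-6                         # central finite-difference comparison
num = (loglik(p + eps) - loglik(p - eps)) / (2 * eps)
```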

28 Gradient Ascent (cont)
Proof:
∂ log P(D|Θ) / ∂θxi,pai = Σm ∂ log P(o[m] | Θ) / ∂θxi,pai = Σm [1 / P(o[m] | Θ)] · ∂P(o[m] | Θ) / ∂θxi,pai
How do we compute ∂P(o[m] | Θ) / ∂θxi,pai ?

29 Gradient Ascent (cont)
Since P(o|Θ) = Σx'i,pa'i P(o, x'i, pa'i | Θ) = Σx'i,pa'i P(o | x'i, pa'i, Θ) θx'i,pa'i P(pa'i | Θ), and only the term with x'i = xi, pa'i = pai depends (linearly) on θxi,pai, we get
∂P(o|Θ) / ∂θxi,pai = P(o | xi, pai, Θ) P(pai | Θ) = P(o, xi, pai | Θ) / θxi,pai

30 Gradient Ascent (cont)
Putting it all together, we get
∂ log P(D|Θ) / ∂θxi,pai = Σm P(o[m], xi, pai | Θ) / [P(o[m] | Θ) · θxi,pai] = Σm P(xi, pai | o[m], Θ) / θxi,pai

