Learning Bayesian networks

1 Learning Bayesian networks
Slides by Nir Friedman.

2 Learning Bayesian networks
[Figure: an Inducer takes Data + Prior information and outputs a Bayesian network over E, R, B, A, C together with its CPTs, e.g. a table for P(A | E,B).]

3 Known Structure -- Incomplete Data
Network structure is specified; the inducer must fill in the CPTs, e.g. P(A | E,B).
Data contains missing values, e.g. records over (E, B, A) such as <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>.
We consider assignments to missing values.

4 Known Structure / Complete Data
Given a network structure G and a choice of parametric family for P(Xi | Pai), learn the parameters of the network from complete data.
Goal: construct a network that is "closest" to the probability distribution that generated the data.

5 Maximum Likelihood Estimation in Binomial Data
Applying the MLE principle we get the estimate NH / (NH + NT), which coincides with what one would expect.
Example: (NH, NT) = (3, 2); the MLE estimate is 3/5 = 0.6.
[Figure: plot of the likelihood L(θ:D) as a function of θ, peaking at 0.6.]
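As a concrete check of the estimate above, here is a minimal Python sketch (not part of the original slides; the function name is ours) that computes the binomial MLE from the observed counts:

```python
# Minimal sketch: the MLE of the heads probability for a binomial/coin model
# is simply the fraction of heads among the observed tosses.

def binomial_mle(n_heads: int, n_tails: int) -> float:
    """Return the maximum-likelihood estimate of P(heads) from counts."""
    return n_heads / (n_heads + n_tails)

# The slide's example: (NH, NT) = (3, 2) gives 3/5 = 0.6.
print(binomial_mle(3, 2))  # 0.6
```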

6 Learning Parameters for a Bayesian Network
Training data has the form D = { <E[1], B[1], A[1], C[1]>, ..., <E[M], B[M], A[M], C[M]> }: M complete joint assignments to the network variables E, B, A, C.

7 Learning Parameters for a Bayesian Network
Since we assume i.i.d. samples, the likelihood function is L(Θ:D) = Π_m P(E[m], B[m], A[m], C[m] : Θ).

8 Learning Parameters for a Bayesian Network
By the definition of the network (its factorization into families), we get L(Θ:D) = Π_m P(E[m]:Θ) · P(B[m]:Θ) · P(A[m] | B[m], E[m] : Θ) · P(C[m] | A[m] : Θ).

9 Learning Parameters for a Bayesian Network
Rewriting (grouping) terms, we get L(Θ:D) = [Π_m P(E[m]:Θ)] · [Π_m P(B[m]:Θ)] · [Π_m P(A[m] | B[m], E[m] : Θ)] · [Π_m P(C[m] | A[m] : Θ)], i.e. one local likelihood per family.

10 General Bayesian Networks
Generalizing for any Bayesian network:
L(Θ:D) = Π_m P(x1[m], ..., xn[m] : Θ)   (i.i.d. samples)
       = Π_m Π_i P(xi[m] | pai[m] : Θi)   (network factorization)
       = Π_i Li(Θi : D)
The likelihood decomposes according to the structure of the network.
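The decomposition can be verified numerically. The following Python sketch is ours, not the slides': it uses an invented set of CPTs for the E, B, A, C network and a few invented complete samples, and checks that the joint log-likelihood equals the sum of the per-family log-likelihoods.

```python
import math

# Hypothetical CPTs (all variables binary with values 0/1); these numbers are
# made up and are not the CPTs from the slides.
P_E = {0: 0.9, 1: 0.1}                                          # P(E)
P_B = {0: 0.8, 1: 0.2}                                          # P(B)
P_A1 = {(0, 0): 0.01, (0, 1): 0.7, (1, 0): 0.9, (1, 1): 0.95}   # P(A=1 | E, B)
P_C1 = {0: 0.05, 1: 0.8}                                        # P(C=1 | A)

def bern(p_one, value):
    """Probability of `value` under a Bernoulli distribution with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

# A few invented complete samples (E, B, A, C).
data = [(1, 0, 1, 1), (0, 0, 0, 0), (0, 1, 1, 0)]

# Log-likelihood computed from the joint probability of each sample...
joint_ll = sum(
    math.log(P_E[e] * P_B[b] * bern(P_A1[(e, b)], a) * bern(P_C1[a], c))
    for e, b, a, c in data)

# ...equals the sum of the per-family (local) log-likelihoods.
family_ll = (sum(math.log(P_E[e]) for e, b, a, c in data)
             + sum(math.log(P_B[b]) for e, b, a, c in data)
             + sum(math.log(bern(P_A1[(e, b)], a)) for e, b, a, c in data)
             + sum(math.log(bern(P_C1[a], c)) for e, b, a, c in data))

assert abs(joint_ll - family_ll) < 1e-9
print(joint_ll, family_ll)
```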

11 General Bayesian Networks (Cont.)
Complete Data → Decomposition → Independent Estimation Problems.
If the parameters for each family are not related, then they can be estimated independently of each other. (This is not true in genetic linkage analysis.)

12 Learning Parameters: Summary
For multinomial CPTs we collect sufficient statistics, which are simply the counts N(xi, pai).
Parameter estimation:
MLE: θ(xi|pai) = N(xi, pai) / N(pai)
Bayesian (Dirichlet prior): θ(xi|pai) = (N(xi, pai) + α(xi, pai)) / (N(pai) + α(pai))
Bayesian methods also require a choice of priors.
Both MLE and Bayesian estimates are asymptotically equivalent and consistent.
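A hedged Python sketch (ours; the counts, value names, and function name are invented for illustration) of both estimators computed from the sufficient statistics N(xi, pai):

```python
from collections import defaultdict

def estimate_cpt(counts, alpha=None):
    """Estimate P(x | pa) from counts {(x, pa): N(x, pa)}.

    With alpha=None this is the MLE N(x, pa) / N(pa); with a dict of
    pseudo-counts alpha it is the Bayesian (Dirichlet-prior) estimate
    (N(x, pa) + alpha(x, pa)) / (N(pa) + alpha(pa)).
    """
    totals = defaultdict(float)
    for (x, pa), n in counts.items():
        totals[pa] += n + (alpha.get((x, pa), 0.0) if alpha else 0.0)
    return {(x, pa): (n + (alpha.get((x, pa), 0.0) if alpha else 0.0)) / totals[pa]
            for (x, pa), n in counts.items()}

# Invented counts N(A, pa) for the single parent configuration pa = (E=1, B=0).
counts = {("a1", ("e1", "b0")): 9, ("a0", ("e1", "b0")): 1}
print(estimate_cpt(counts))                                  # MLE: 0.9 and 0.1
print(estimate_cpt(counts, alpha={k: 1.0 for k in counts}))  # Dirichlet(1,1): 10/12 and 2/12
```

Note that values of xi that never occur in the data are absent from counts and therefore receive no estimate in this simplified sketch.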

13 Known Structure -- Incomplete Data
Network structure is specified; the inducer must fill in the CPTs, e.g. P(A | E,B).
Data contains missing values, e.g. records over (E, B, A) such as <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>.
We consider assignments to missing values.

14 Learning Parameters from Incomplete Data
X Y|X=H m X[m] Y[m] Y|X=T Incomplete data: Posterior distributions can become interdependent Consequence: ML parameters can not be computed separately for each multinomial Posterior is not a product of independent posteriors

15 Learning Parameters from Incomplete Data (cont.)
In the presence of incomplete data, the likelihood can have multiple global maxima.
Example: the network H → Y, where H is hidden. We can rename the values of the hidden variable H; if H has two values, the likelihood has two global maxima.
Similarly, local maxima are also replicated.
Many hidden variables → a serious problem.
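The renaming argument can be checked numerically. This Python sketch (ours; parameters and data are made up) evaluates the observed-data log-likelihood of an H → Y model before and after swapping the two values of the hidden H, and finds them identical:

```python
import math

def log_lik_y(y_data, p_h1, p_y1_given_h):
    """Log-likelihood of observed Y data, marginalizing out the hidden H."""
    ll = 0.0
    for y in y_data:
        p_y1 = p_h1 * p_y1_given_h[1] + (1 - p_h1) * p_y1_given_h[0]
        ll += math.log(p_y1 if y == 1 else 1 - p_y1)
    return ll

y_data = [1, 0, 1, 1, 0]
theta = (0.3, {0: 0.2, 1: 0.9})
relabeled = (0.7, {0: 0.9, 1: 0.2})   # rename H's values: H=0 <-> H=1
assert abs(log_lik_y(y_data, *theta) - log_lik_y(y_data, *relabeled)) < 1e-12
```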

16 MLE from Incomplete Data
Finding the MLE parameters is a nonlinear optimization problem.
Gradient ascent: follow the gradient of the likelihood L(Θ|D) with respect to the parameters.

17 MLE from Incomplete Data
Finding the MLE parameters is a nonlinear optimization problem.
Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice").
Guarantee: the maximum of the new function scores better on L(Θ|D) than the current point.

18 MLE from Incomplete Data
Both ideas find local maxima only and require multiple restarts to find an approximation to the global maximum.

19 Gradient Ascent: Main Result
Theorem GA: ∂ log P(D|Θ) / ∂θ(xi|pai) = (1/θ(xi|pai)) · Σ_m P(xi, pai | o[m], Θ)
This requires computing P(xi, pai | o[m], Θ) for all i, m; inference replaces taking derivatives.

20 Gradient Ascent (cont)
Proof: ∂ log P(D|Θ) / ∂θ(xi|pai) = Σ_m ∂ log P(o[m]|Θ) / ∂θ(xi|pai) = Σ_m (1/P(o[m]|Θ)) · ∂P(o[m]|Θ) / ∂θ(xi|pai).
How do we compute ∂P(o[m]|Θ) / ∂θ(xi|pai)?

21 Gradient Ascent (cont)
Since P(o|Θ) = Σ_{x'i, pa'i} P(o, x'i, pa'i | Θ), and θ(xi|pai) appears as a single multiplicative factor only in the term with x'i = xi and pa'i = pai, we get
∂P(o|Θ) / ∂θ(xi|pai) = ∂P(o, xi, pai | Θ) / ∂θ(xi|pai) = P(o, xi, pai | Θ) / θ(xi|pai).

22 Gradient Ascent (cont)
Putting it all together we get ∂ log P(D|Θ) / ∂θ(xi|pai) = Σ_m P(o[m], xi, pai | Θ) / (P(o[m]|Θ) · θ(xi|pai)) = (1/θ(xi|pai)) · Σ_m P(xi, pai | o[m], Θ), which is Theorem GA.
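To make the result concrete, here is a small Python sketch (our illustration, not code from the slides) of Theorem GA for the two-node network X → Y, with the posterior P(xi, pai | o[m], Θ) obtained by brute-force enumeration; all parameter values and records are invented.

```python
import itertools

# Made-up current parameters for the network X -> Y (both binary).
theta_x1 = 0.6                   # P(X=1)
theta_y1 = {0: 0.3, 1: 0.7}      # P(Y=1 | X)

def joint(x, y):
    """P(X=x, Y=y) under the current parameters."""
    px = theta_x1 if x == 1 else 1 - theta_x1
    py = theta_y1[x] if y == 1 else 1 - theta_y1[x]
    return px * py

def posterior(x, y, obs):
    """P(X=x, Y=y | obs), where obs maps 'X'/'Y' to an observed value or None."""
    def consistent(xx, yy):
        return ((obs["X"] is None or obs["X"] == xx)
                and (obs["Y"] is None or obs["Y"] == yy))
    if not consistent(x, y):
        return 0.0
    evidence = sum(joint(xx, yy)
                   for xx, yy in itertools.product([0, 1], repeat=2)
                   if consistent(xx, yy))
    return joint(x, y) / evidence

# Partially observed records o[m]; None marks a missing value.
data = [{"X": 1, "Y": None}, {"X": 0, "Y": 1}, {"X": None, "Y": 0}]

# Theorem GA for the entry theta(Y=1 | X=1): sum the posteriors, divide by theta.
x, y = 1, 1
grad = sum(posterior(x, y, o) for o in data) / theta_y1[x]
print(grad)
```

For larger networks the same posteriors would be computed by a proper inference algorithm rather than enumeration.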

23 Expectation Maximization (EM)
A general purpose method for learning from incomplete data.
Intuition: if we had access to counts, then we could estimate parameters; however, missing values do not allow us to perform the counts.
Idea: "complete" the counts using the current parameter assignment.

24 Expectation Maximization (EM)
Data over (X, Y, Z), with "?" marking missing values:
  X Y Z
  H ? T
  T ? T
  H H ?
  H T T
  T T H
Using the current model, the missing entries are completed in expectation, e.g. P(Y=H | X=H, Z=T, Θ) = 0.3 and P(Y=H | X=T, Z=T, Θ) = 0.4, and these posteriors are accumulated into expected counts N(X, Y) over the four combinations (H,H), (T,H), (H,T), (T,T). (These numbers are placed for illustration; they have not been computed.)
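A minimal Python sketch of this count completion (ours), using only the two posteriors quoted on the slide and the first few records: fully observed records contribute hard counts, while records with a missing Y contribute fractional counts weighted by the posterior.

```python
from collections import defaultdict

# Records (X, Y, Z) with "?" marking a missing Y, as in the slide's first rows.
records = [("H", "?", "T"), ("T", "?", "T"), ("H", "T", "T")]

# Posteriors of the missing Y under the current model; these are the slide's
# illustrative numbers, not values computed from real CPTs.
p_y_h = {("H", "T"): 0.3, ("T", "T"): 0.4}     # P(Y=H | X, Z, Theta)

expected_counts = defaultdict(float)           # expected N(X, Y)
for x, y, z in records:
    if y != "?":
        expected_counts[(x, y)] += 1.0         # fully observed: a hard count
    else:
        p = p_y_h[(x, z)]
        expected_counts[(x, "H")] += p         # fractional count for Y=H
        expected_counts[(x, "T")] += 1.0 - p   # fractional count for Y=T

print(dict(expected_counts))
# {('H', 'H'): 0.3, ('H', 'T'): 1.7, ('T', 'H'): 0.4, ('T', 'T'): 0.6}
```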

25 EM (cont.)
The EM loop for a network over X1, X2, X3, H, Y1, Y2, Y3 (with H hidden):
Start from an initial network (G, Θ0).
E-step (Computation): from the training data and the current network, compute the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H).
M-step (Reparameterize): re-estimate the parameters from these expected counts to obtain the updated network (G, Θ1).
Reiterate.
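A compact Python sketch (ours, not the slides' algorithm statement) of this E-step / M-step loop for the simplest hidden-variable network H → Y, with both variables binary; the data and starting parameters are invented.

```python
def em(y_data, p_h, p_y_given_h, iterations=20):
    """EM for H -> Y with H hidden; p_h = P(H=1), p_y_given_h[h] = P(Y=1 | H=h)."""
    for _ in range(iterations):
        # E-step: expected counts N(H) and N(Y, H) under the current parameters.
        n_h = {0: 0.0, 1: 0.0}
        n_yh = {(y, h): 0.0 for y in (0, 1) for h in (0, 1)}
        for y in y_data:
            # Posterior P(H=h | Y=y, theta) by enumerating the two values of H.
            w = {h: (p_h if h == 1 else 1 - p_h)
                    * (p_y_given_h[h] if y == 1 else 1 - p_y_given_h[h])
                 for h in (0, 1)}
            z = w[0] + w[1]
            for h in (0, 1):
                n_h[h] += w[h] / z
                n_yh[(y, h)] += w[h] / z
        # M-step: MLE from the expected counts.
        p_h = n_h[1] / len(y_data)
        p_y_given_h = {h: n_yh[(1, h)] / n_h[h] for h in (0, 1)}
    return p_h, p_y_given_h

# Only Y is ever observed; H is hidden in every record.
print(em([1, 1, 1, 0, 0, 1, 0, 1], p_h=0.6, p_y_given_h={0: 0.3, 1: 0.8}))
```

Because only Y is observed here, the model has the relabeling symmetry discussed earlier, so different restarts may converge to mirror-image parameter settings.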

26 Expectation Maximization (EM)
In practice, EM converges rather quickly at the start but converges slowly near the (possibly local) maximum. Hence, EM is often run for a few iterations and then gradient ascent steps are applied.

27 Final Homework
Question 1: Develop an algorithm that, given a pedigree as input, provides the most probable haplotype of each individual in the pedigree. Use the Bayesian network model of Superlink to formulate the problem exactly as a query. Specify the algorithm at length, discussing as many details as you can. Analyze its efficiency. Devote time to illuminating notation and presentation.
Question 2: Specialize the formula given in Theorem GA for θ in genetic linkage analysis. In particular, assume exactly 3 loci: Marker 1, Disease 2, Marker 3, with θ being the recombination between loci 2 and 1 and 0.1 − θ being the recombination between loci 3 and 2. Specify the formula for a pedigree with two parents and two children. Extend the formula to arbitrary pedigrees. Note that θ is the same in many local probability tables.

