# Learning Bayesian networks (Slides by Nir Friedman)


Learning Bayesian networks

- The Inducer takes Data + Prior information and outputs a Bayesian network: a structure over the variables E, R, B, A, C together with conditional probability tables, e.g. P(A | E, B):

| E | B | P(a \| E, B) | P(¬a \| E, B) |
|---|---|---|---|
| e | b | .9 | .1 |
| e | ¬b | .7 | .3 |
| ¬e | b | .8 | .2 |
| ¬e | ¬b | .99 | .01 |

Known Structure -- Incomplete Data

- Network structure is specified.
- Data contains missing values; we consider assignments to the missing values of E, B, A.
- The Inducer starts from a CPT P(A | E, B) whose entries are unknown ("?") and fills them in from the incomplete data.

Known Structure / Complete Data

- Given a network structure G and a choice of parametric family for P(X_i | Pa_i), learn the parameters of the network from complete data.
- Goal: construct a network that is "closest" to the probability distribution that generated the data.

Maximum Likelihood Estimation in Binomial Data

- Applying the MLE principle to a sequence of coin tosses with N_H heads and N_T tails, we get (which coincides with what one would expect):

  θ̂ = N_H / (N_H + N_T)

- Example: (N_H, N_T) = (3, 2); the MLE estimate is 3/5 = 0.6.

(Figure: the likelihood L(θ : D) plotted over θ ∈ [0, 1].)
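
The estimate above can be checked in a couple of lines. This is an illustrative sketch, not part of the original deck; the function name `binomial_mle` is my own.

```python
# Illustrative sketch (not from the slides): the MLE for binomial data
# is simply the observed fraction of heads.

def binomial_mle(n_heads, n_tails):
    """Maximum-likelihood estimate of P(heads) from counts."""
    return n_heads / (n_heads + n_tails)

print(binomial_mle(3, 2))  # the slide's example: 3/5 = 0.6
```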

Learning Parameters for a Bayesian Network (E, B, A, C)

- Training data has the form: D = { (E[1], B[1], A[1], C[1]), …, (E[M], B[M], A[M], C[M]) }

Learning Parameters for a Bayesian Network (E, B, A, C)

- Since we assume i.i.d. samples, the likelihood function is

  L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)

Learning Parameters for a Bayesian Network (E, B, A, C)

- By the definition of the network, we get

  L(Θ : D) = ∏_m P(E[m] : Θ) · P(B[m] : Θ) · P(A[m] | B[m], E[m] : Θ) · P(C[m] | A[m] : Θ)

Learning Parameters for a Bayesian Network (E, B, A, C)

- Rewriting terms, we get one product per family:

  L(Θ : D) = [∏_m P(E[m] : Θ)] · [∏_m P(B[m] : Θ)] · [∏_m P(A[m] | B[m], E[m] : Θ)] · [∏_m P(C[m] | A[m] : Θ)]

General Bayesian Networks

Generalizing to any Bayesian network:

  L(Θ : D) = ∏_m P(x_1[m], …, x_n[m] : Θ)          (i.i.d. samples)
           = ∏_i ∏_m P(x_i[m] | Pa_i[m] : Θ_i)      (network factorization)

- The likelihood decomposes according to the structure of the network.
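
The factorization can be verified numerically on a tiny network. The following is my own toy example (the two-node network and all numbers are assumptions, not from the slides): the product of full-joint terms equals the product of per-family terms.

```python
import math

# Sketch (assumed toy example, not from the slides): for the network
# X1 -> X2, the likelihood factors into one product per family.
p_x1 = {0: 0.3, 1: 0.7}                       # P(X1)
p_x2_x1 = {0: {0: 0.9, 1: 0.1},               # P(X2 | X1 = 0)
           1: {0: 0.4, 1: 0.6}}               # P(X2 | X1 = 1)

data = [(0, 0), (1, 1), (1, 0), (0, 0)]       # i.i.d. samples (x1, x2)

# Joint likelihood: product over samples of the full joint probability.
joint = math.prod(p_x1[a] * p_x2_x1[a][b] for a, b in data)

# Decomposed likelihood: product of the X1 family times product of the X2 family.
fam_x1 = math.prod(p_x1[a] for a, _ in data)
fam_x2 = math.prod(p_x2_x1[a][b] for a, b in data)

print(joint, fam_x1 * fam_x2)                 # equal, by the factorization
```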

General Bayesian Networks (Cont.)

Complete Data ⇒ Decomposition ⇒ Independent Estimation Problems

If the parameters of each family are not related to one another, they can be estimated independently of each other. (This does not hold in genetic linkage analysis, where parameters are shared across families.)

Learning Parameters: Summary

- For multinomials we collect sufficient statistics, which are simply the counts N(x_i, pa_i).
- Parameter estimation:

  MLE:                        θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
  Bayesian (Dirichlet prior): θ̂_{x_i|pa_i} = (N(x_i, pa_i) + α(x_i, pa_i)) / (N(pa_i) + α(pa_i))

- Bayesian methods also require a choice of priors.
- MLE and Bayesian estimates are asymptotically equivalent and consistent.
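
As a concrete illustration of the summary, here is a small sketch (my own example; the function, variable names, and data are assumptions, not from the slides) that collects the counts N(x_i, pa_i) for a single family A | E, B and produces the MLE, or the Bayesian estimate when a Dirichlet pseudo-count is supplied.

```python
from collections import Counter

# Sketch (assumed example, not from the slides): sufficient statistics and
# parameter estimation for one family A | E, B of a discrete network.
def estimate(samples, alpha=0.0):
    """samples: list of (a, e, b) tuples. alpha: Dirichlet pseudo-count per cell.
    alpha = 0 gives the MLE; alpha > 0 gives the Bayesian (Dirichlet) estimate."""
    n_family = Counter((e, b, a) for a, e, b in samples)    # N(a, e, b)
    n_parents = Counter((e, b) for _, e, b in samples)      # N(e, b)
    values = sorted({a for a, _, _ in samples})             # observed values of A
    return {
        (a, e, b): (n_family[(e, b, a)] + alpha) /
                   (n_parents[(e, b)] + alpha * len(values))
        for (e, b) in n_parents for a in values
    }

data = [(1, 1, 1), (0, 1, 1), (1, 1, 1), (1, 0, 0)]
theta = estimate(data)
print(theta[(1, 1, 1)])     # MLE of P(A=1 | E=1, B=1): 2 of 3 matching samples
```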


Learning Parameters from Incomplete Data

- With incomplete data, posterior distributions over parameters can become interdependent.
- Consequence:
  - ML parameters cannot be computed separately for each multinomial.
  - The posterior is not a product of independent posteriors.

(Plate diagram: parameters θ_X, θ_{Y|X=H}, θ_{Y|X=T}, with observations X[m], Y[m] for each sample m.)

Learning Parameters from Incomplete Data (cont.)

- In the presence of incomplete data, the likelihood can have multiple global maxima.
- Example (network H → Y with H hidden):
  - We can rename the values of the hidden variable H.
  - If H has two values, the likelihood has two global maxima.
- Similarly, local maxima are also replicated.
- Many hidden variables ⇒ a serious problem.

MLE from Incomplete Data

- Finding the MLE parameters is a nonlinear optimization problem.
- Gradient Ascent: follow the gradient of the likelihood L(Θ | D) with respect to the parameters.

L(  |D) Expectation Maximization (EM): Use “current point” to construct alternative function (which is “nice”) Guaranty: maximum of new function is better scoring than the current point MLE from Incomplete Data u Finding MLE parameters: nonlinear optimization problem 

MLE from Incomplete Data

Both ideas find local maxima only, and require multiple restarts to find an approximation to the global maximum.

Gradient Ascent

- Main result (Theorem GA):

  ∂ log P(D | Θ) / ∂θ_{x_i, pa_i} = (1 / θ_{x_i, pa_i}) · Σ_m P(x_i, pa_i | o[m], Θ)

- Requires computing P(x_i, pa_i | o[m], Θ) for all i, m: inference replaces taking derivatives.

Gradient Ascent (cont)

How do we compute ∂ log P(D | Θ) / ∂θ_{x_i, pa_i}?

Proof:

  ∂ log P(D | Θ) / ∂θ_{x_i, pa_i} = Σ_m ∂ log P(o[m] | Θ) / ∂θ_{x_i, pa_i}
                                  = Σ_m (1 / P(o[m] | Θ)) · ∂ P(o[m] | Θ) / ∂θ_{x_i, pa_i}

Gradient Ascent (cont)

For a single sample o, writing o^d for the part of o at descendants of X_i and o^nd for the part at non-descendants:

  P(o | Θ) = Σ_{x'_i, pa'_i} P(x'_i, pa'_i, o | Θ)
           = Σ_{x'_i, pa'_i} P(o^d | x'_i, pa'_i, o^nd) · P(x'_i | pa'_i) · P(pa'_i, o^nd)

Only the term with x'_i = x_i, pa'_i = pa_i depends on θ_{x_i, pa_i}, and P(x_i | pa_i) = θ_{x_i, pa_i}, so

  ∂ P(o | Θ) / ∂θ_{x_i, pa_i} = P(o^d | x_i, pa_i, o^nd) · P(pa_i, o^nd) = P(x_i, pa_i, o | Θ) / θ_{x_i, pa_i}

Gradient Ascent (cont)

- Putting it all together, we get

  ∂ log P(D | Θ) / ∂θ_{x_i, pa_i} = Σ_m P(x_i, pa_i, o[m] | Θ) / (P(o[m] | Θ) · θ_{x_i, pa_i}) = (1 / θ_{x_i, pa_i}) · Σ_m P(x_i, pa_i | o[m], Θ)
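
Theorem GA can be sanity-checked numerically on a toy network. The sketch below is my own construction (the network X → Y, all numbers, and all names are assumptions, not from the slides). As in the derivation, θ_{y=1|x=1} is treated as a free parameter: its sibling entry θ_{y=0|x=1} is held fixed rather than renormalized.

```python
import math

# Toy numeric check (assumed example, not from the slides) of Theorem GA:
#   d log P(D|Theta) / d theta_{x_i,pa_i}
#     = (1 / theta_{x_i,pa_i}) * sum_m P(x_i, pa_i | o[m], Theta)
# Network: X -> Y, with X hidden and Y observed in each sample.

p_x1 = 0.6          # P(X = 1)
p_y1_x0 = 0.2       # P(Y = 1 | X = 0)
theta = 0.7         # theta_{y=1, x=1} = P(Y = 1 | X = 1): the parameter we vary
theta_y0_x1 = 0.3   # P(Y = 0 | X = 1), held fixed under the free-parameter view

def p_obs(y, t):
    """P(Y = y) when theta_{y=1|x=1} = t (other parameters fixed)."""
    if y == 1:
        return p_x1 * t + (1 - p_x1) * p_y1_x0
    return p_x1 * theta_y0_x1 + (1 - p_x1) * (1 - p_y1_x0)

obs = [1, 1, 0]     # observed Y values; X is always missing

# Theorem GA: P(X=1, Y=1 | o[m]) is nonzero only when o[m] = 1.
grad_theorem = sum(
    (p_x1 * theta / p_obs(1, theta)) / theta for y in obs if y == 1
)

# Finite-difference derivative of the log-likelihood, for comparison.
eps = 1e-6
def loglik(t):
    return sum(math.log(p_obs(y, t)) for y in obs)
grad_numeric = (loglik(theta + eps) - loglik(theta - eps)) / (2 * eps)

print(grad_theorem, grad_numeric)  # the two should agree closely
```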

Expectation Maximization (EM)

- A general-purpose method for learning from incomplete data.
- Intuition:
  - If we had access to the counts, we could estimate the parameters.
  - However, missing values do not allow us to compute the counts.
  - So we "complete" the counts using the current parameter assignment.

Expectation Maximization (EM)

Data over X, Y, Z (some values missing):

| X | Y | Z |
|---|---|---|
| H | ? | T |
| T | ? | T |
| H | H | ? |
| H | T | T |
| T | T | H |

Current model: P(Y=H | X=H, Z=T, Θ) = 0.3, P(Y=H | X=T, Z=T, Θ) = 0.4.

Expected counts N(X, Y):

| X | Y | N(X, Y) |
|---|---|---|
| H | H | 1.3 |
| T | H | 0.4 |
| H | T | 1.7 |
| T | T | 1.6 |

(These numbers are placed for illustration; they have not been computed.)

EM (cont.)

1. Start from an initial network (G, Θ⁰) over X₁, X₂, X₃, hidden H, and Y₁, Y₂, Y₃, plus the training data.
2. E-step: compute the expected counts N(X₁), N(X₂), N(X₃), N(H, X₁, X₂, X₃), N(Y₁, H), N(Y₂, H), N(Y₃, H).
3. M-step: reparameterize from these counts to obtain the updated network (G, Θ¹).
4. Reiterate.
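
The E-step/M-step loop can be sketched for a minimal model. This is my own toy example (a two-node network X → Y with some Y values missing), not the slides' three-layer network; names and data are assumptions.

```python
# Minimal EM sketch (assumed toy example, not from the slides): estimate
# P(Y=H | X) for the network X -> Y when some Y values are missing.

data = [('H', 'H'), ('H', None), ('T', 'T'),
        ('T', None), ('H', 'H'), ('T', 'H')]    # (x, y); None = missing y

theta = {'H': 0.5, 'T': 0.5}                    # initial P(Y=H | X=x)
for _ in range(50):
    # E-step: "complete" the counts with expectations under the current model.
    n_xh = {'H': 0.0, 'T': 0.0}                 # expected N(X=x, Y=H)
    n_x = {'H': 0.0, 'T': 0.0}                  # N(X=x)
    for x, y in data:
        n_x[x] += 1
        if y == 'H':
            n_xh[x] += 1
        elif y is None:
            n_xh[x] += theta[x]                 # expected contribution of missing Y
    # M-step: MLE from the completed counts.
    theta = {x: n_xh[x] / n_x[x] for x in theta}

print(theta)  # converges to P(Y=H | X=H) = 1.0, P(Y=H | X=T) = 0.5 on this data
```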

Expectation Maximization (EM)

- In practice, EM converges quickly at the start but slowly near the (possibly local) maximum.
- Hence, EM is often run for a few iterations, after which Gradient Ascent steps are applied.

Final Homework

Question 1: Develop an algorithm that, given a pedigree as input, provides the most probable haplotype of each individual in the pedigree. Use the Bayesian network model of Superlink to formulate the problem exactly as a query. Specify the algorithm at length, discussing as many details as you can. Analyze its efficiency. Devote time to illuminating notation and presentation.

Question 2: Specialize the formula given in Theorem GA for θ in genetic linkage analysis. In particular, assume exactly 3 loci: Marker 1, Disease 2, Marker 3, with θ being the recombination fraction between loci 1 and 2, and 0.1 − θ being the recombination fraction between loci 2 and 3.
1. Specify the formula for a pedigree with two parents and two children.
2. Extend the formula to arbitrary pedigrees. Note that θ is the same in many local probability tables.