
1 Structure Learning in Bayesian Networks
Eran Segal, Weizmann Institute

2 Structure Learning Motivation
Network structure is often unknown. Purposes of structure learning:
- Discover the dependency structure of the domain
  Goes beyond statistical correlations between individual variables and detects direct vs. indirect correlations
  Set expectations: at best, we can recover the structure up to its I-equivalence class
- Density estimation
  Estimate a statistical model of the underlying distribution and use it for reasoning and for predicting new instances

3 Advantages of Accurate Structure
[Figure: a true network X1 → Y ← X2, a variant with a missing edge, and a variant with a spurious edge]
- Spurious edge: increases the number of fitted parameters; wrong causality and domain structure assumptions
- Missing edge: cannot be compensated for by parameter estimation; wrong causality and domain structure assumptions

4 Structure Learning Approaches
- Constraint based methods
  View the Bayesian network as representing dependencies; find a network that best explains the dependencies
  Limitation: sensitive to errors in single dependencies
- Score based approaches
  View learning as a model selection problem: define a scoring function specifying how well the model fits the data, then search for a high-scoring network structure
  Limitation: super-exponential search space
- Bayesian model averaging methods
  Average predictions across all possible structures; can be done exactly (in some cases) or approximately

5 Constraint Based Approaches
Goal: find the best minimal I-map for the domain
- G is an I-map for P if I(G) ⊆ I(P)
- G is a minimal I-map if deleting any edge from G renders it not an I-map
- G is a P-map for P if I(G) = I(P)
Strategy:
- Query the distribution for independence relationships that hold between sets of variables
- Construct a network which is the best minimal I-map for P

6 Constructing Minimal I-Maps
Reverse factorization theorem: if P factorizes according to G, then G is an I-map of P
Algorithm for constructing a minimal I-map:
- Fix an ordering of the nodes X1,…,Xn
- Select the parents of Xi as a minimal subset Pa(Xi) of {X1,…,Xi-1} such that Ind(Xi ; {X1,…,Xi-1} − Pa(Xi) | Pa(Xi))
(Outline of) proof that this yields a minimal I-map:
- I-map: the factorization above holds by construction
- Minimal: by construction, removing any edge destroys the factorization
Limitations:
- Independence queries involve a large number of variables
- Construction involves a large number of queries (2^(i-1) subsets for node Xi)
- We do not know the ordering, and the resulting network is sensitive to it
Notes: the reverse factorization theorem says that if P decomposes according to the graph structure, then G is an I-map. This suggests an algorithm for finding minimal I-maps: fix an ordering and construct the graph so that this property holds, selecting a minimal parent set for each node. It is easy to show that this results in a minimal I-map, since both the I-map property and minimality follow from the construction of G.
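
To make the construction concrete, here is a minimal sketch in Python, assuming a hypothetical independence oracle ind(x, rest, z) that answers Ind(x ; rest | z) for the underlying distribution (in practice the oracle is replaced by the statistical tests of slide 10):

```python
from itertools import combinations

def build_minimal_imap(variables, ind):
    """Construct a minimal I-map for a fixed ordering (sketch).

    `variables` is a list in the chosen ordering X1,…,Xn and
    `ind(x, rest, z)` is a hypothetical oracle for Ind(x ; rest | z).
    Scanning candidate parent sets from smallest to largest guarantees
    minimality: no proper subset of the chosen set could have worked,
    or it would have been found earlier. Note the 2^(i-1) candidate
    subsets per node, as flagged in the limitations above.
    """
    parents = {}
    for i, x in enumerate(variables):
        predecessors = variables[:i]
        for size in range(len(predecessors) + 1):
            chosen = None
            for pa in combinations(predecessors, size):
                rest = [v for v in predecessors if v not in pa]
                # ind(x, [], pa) is trivially True, so the scan terminates.
                if ind(x, rest, list(pa)):
                    chosen = list(pa)
                    break
            if chosen is not None:
                parents[x] = chosen
                break
    return parents
```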

7 Constructing P-Maps
Simplifying assumptions:
- The network has bounded in-degree d per node
- An oracle can answer independence queries over up to 2d+2 variables
- The distribution P has a P-map
Algorithm:
- Step I: find the skeleton
- Step II: find the immoralities (v-structures)
- Step III: direct the constrained edges

8 Step I: Identifying the Skeleton
- For each pair X,Y, query all candidate sets Z for Ind(X;Y | Z); the edge X–Y is in the skeleton if no such Z is found
- If the graph's in-degree is bounded by d, the running time is O(n^(2d+2)), since if no direct edge exists, Ind(X;Y | Pa(X), Pa(Y))
Reminder: if there is no Z for which Ind(X;Y | Z) holds, then X→Y or Y→X is in G*
Proof: assume no such Z exists, yet G* has neither X→Y nor Y→X. Then we can find a set Z such that every path from X to Y is blocked, so G* implies Ind(X;Y | Z); since G* is a P-map, such a Z would have been found. Contradiction.
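
A sketch of Step I under these assumptions, again with a hypothetical oracle ind(x, y, z) for Ind(x ; y | z); since each node has at most d parents, a separating set of size at most 2d suffices:

```python
from itertools import combinations

def find_skeleton(variables, ind, d):
    """Step I (sketch): recover the undirected skeleton of a P-map.

    Keeps the edge X-Y iff no witness set Z with Ind(X ; Y | Z) exists.
    With in-degree bounded by d, checking witness sets of size <= 2d
    is enough, since Ind(X ; Y | Pa(X), Pa(Y)) holds for non-edges.
    """
    skeleton = set()
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        separated = any(
            ind(x, y, list(z))
            for size in range(2 * d + 1)
            for z in combinations(others, size)
        )
        if not separated:
            skeleton.add(frozenset((x, y)))
    return skeleton
```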

9 Step II: Identifying Immoralities
For each pair X,Y query candidate triplets X,Y,Z XZY if no W is found that contains Z and Ind(X;Y | W) If graph in-degree bounded by d  running time O(n2d+3) If W exists, Ind(X;Y|W), and XZY not immoral, then ZW Reminder If there is no W such that Z is in W and Ind(X;Y | W), then XZY is an immorality Proof: Assume no W exists but X–Z–Y is not an immorality Then, either XZY or XZY or XZY exists But then, we can block X–Z–Y by Z Then, since X and Y are not connected, can find W that includes Z such that Ind(X;Y | W) Contradiction
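
Step II, continuing the same sketch: for each non-adjacent pair X, Y with a common neighbor Z in the skeleton, declare X→Z←Y an immorality when no witness set containing Z separates X and Y:

```python
from itertools import combinations

def find_immoralities(variables, skeleton, ind, d):
    """Step II (sketch): detect immoralities X -> Z <- Y.

    `skeleton` is the edge set from find_skeleton, `ind(x, y, z)` the
    assumed independence oracle, and d the in-degree bound.
    """
    immoralities = set()
    for x, y in combinations(variables, 2):
        if frozenset((x, y)) in skeleton:
            continue  # adjacent pairs cannot form an immorality
        for z in variables:
            if z in (x, y):
                continue
            if not (frozenset((x, z)) in skeleton
                    and frozenset((y, z)) in skeleton):
                continue
            others = [v for v in variables if v not in (x, y, z)]
            witness_found = any(
                ind(x, y, [z] + list(w))
                for size in range(2 * d + 1)
                for w in combinations(others, size)
            )
            if not witness_found:
                immoralities.add((x, z, y))  # oriented as x -> z <- y
    return immoralities
```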

10 Answering Independence Queries
Basic query: determine whether two variables are independent, a well-studied question in statistics
- Common to frame the query as hypothesis testing, with null hypothesis H0: the data was sampled from P*(X,Y) = P*(X)P*(Y)
- Need a procedure that will accept or reject the hypothesis
- χ² test to assess the deviance of the data from the hypothesis
- Alternatively, use the mutual information between X and Y
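
For example, a χ² independence test on paired samples can be run with scipy (a sketch; the helper name and the threshold alpha are illustrative):

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_independent(x, y, alpha=0.05):
    """Accept H0 (X and Y independent) iff the chi-squared p-value > alpha."""
    xi = np.unique(x, return_inverse=True)[1]
    yi = np.unique(y, return_inverse=True)[1]
    table = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(table, (xi, yi), 1)  # contingency table of joint counts
    _, p_value, _, _ = chi2_contingency(table)
    return p_value > alpha

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=1000)
y = np.where(rng.random(1000) < 0.9, x, 1 - x)  # Y is a noisy copy of X
print(chi2_independent(x, y))                    # False: dependence detected
```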

11 Score Based Approaches
Strategy:
- Define a scoring function for each candidate structure
- Search for a high-scoring structure
Key: the choice of scoring function
- Likelihood based scores
- Bayesian based scores

12 Likelihood Scores
Goal: find (G, θ) that maximize the likelihood
- score_L(G : D) = log P(D | G, θ̂_G), where θ̂_G is the MLE of the parameters for G
- Find G that maximizes score_L(G : D)

13 Example
[Figure: comparing the likelihood of G0 (X and Y disconnected) with G1 (X → Y)]

14 General Decomposition
The likelihood score decomposes as:
score_L(G : D) = M Σ_i I_P̂(Xi ; Pa(Xi)) − M Σ_i H_P̂(Xi)
where P̂ is the empirical distribution of the M samples, I denotes mutual information, and H denotes entropy. A derivation sketch follows.
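
Proof sketch, writing θ̂ for the MLE parameters (so θ̂ equals the empirical conditional P̂(x_i | pa_i)):

```latex
\begin{align*}
\mathrm{score}_L(G:D)
  &= \sum_{m=1}^{M}\sum_{i=1}^{n} \log \hat{P}\bigl(x_i[m] \mid pa_i[m]\bigr)
   = M \sum_{i=1}^{n} \sum_{x_i,\, pa_i} \hat{P}(x_i, pa_i) \log \hat{P}(x_i \mid pa_i) \\
  &= M \sum_{i=1}^{n} \sum_{x_i,\, pa_i} \hat{P}(x_i, pa_i)
       \left[ \log \frac{\hat{P}(x_i, pa_i)}{\hat{P}(x_i)\,\hat{P}(pa_i)} + \log \hat{P}(x_i) \right]
   = M \sum_{i=1}^{n} \mathbf{I}_{\hat{P}}(X_i ; \mathrm{Pa}_{X_i})
     - M \sum_{i=1}^{n} \mathbf{H}_{\hat{P}}(X_i)
\end{align*}
```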

15 General Decomposition
The likelihood score decomposes as above:
- The second (entropy) term does not depend on the network structure and is thus irrelevant for selecting between two structures
- The score increases as the mutual information, i.e., the strength of dependence between connected variables, increases
- After some manipulation, one can show that the score measures to what extent the implied Markov assumptions are valid

16 Limitations of Likelihood Score
[Figure: G0 with X and Y disconnected vs. G1 with X → Y]
- Since I_P̂(X;Y) ≥ 0, we have score_L(G1 : D) ≥ score_L(G0 : D)
- Adding arcs always helps: maximal scores are attained for fully connected networks
- Such networks overfit the data (i.e., fit the noise in the data)

17 Avoiding Overfitting
A classical problem in machine learning. Solutions:
- Restricting the hypothesis space
  Limits the overfitting capability of the learner; example: restrict the number of parents or the number of parameters
- Minimum description length
  Description length measures complexity; prefer models that compactly describe the training data
- Bayesian methods
  Average over all possible parameter values; use prior knowledge

18 Bayesian Score
P(G | D) = P(D | G) P(G) / P(D)
- P(D | G): the marginal likelihood
- P(G): the prior over structures
- P(D): the marginal probability of the data, which does not depend on the network
Bayesian score: score_B(G : D) = log P(D | G) + log P(G)

19 Marginal Likelihood of Data Given G
P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ
- P(D | G, θ): the likelihood
- P(θ | G): the prior over parameters
Note the similarity to the maximum likelihood score, with the key difference that ML takes the maximum of the likelihood, while here we average it over the parameter space

20 Marginal Likelihood: Binomial Case
Assume a sequence of m coin tosses. By the chain rule for probabilities:
P(x[1],…,x[m]) = P(x[1]) P(x[2] | x[1]) ⋯ P(x[m] | x[1],…,x[m−1])
Recall that for Dirichlet priors:
P(x[m+1] = H | x[1],…,x[m]) = (αH + M_m^H) / (α + m)
where M_m^H is the number of heads in the first m examples and α = αH + αT

21 Marginal Likelihood: Binomial Case
Multiplying the chain-rule terms and simplifying using Γ(x+1) = xΓ(x):
P(D) = [Γ(α) / Γ(α + M)] · [Γ(αH + M_H) / Γ(αH)] · [Γ(αT + M_T) / Γ(αT)]
For multinomials with a Dirichlet prior:
P(D) = [Γ(α) / Γ(α + M)] · Π_k [Γ(α_k + M_k) / Γ(α_k)]
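
A sketch of this computation in Python, using log-Gamma for numerical stability (the function name is illustrative):

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(counts, alphas):
    """Dirichlet-multinomial log marginal likelihood log P(D).

    Implements log[ Γ(α)/Γ(α+M) · Π_k Γ(α_k+M_k)/Γ(α_k) ]
    with α = Σ_k α_k and M = Σ_k M_k.
    """
    counts = np.asarray(counts, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    return (gammaln(alphas.sum()) - gammaln(alphas.sum() + counts.sum())
            + np.sum(gammaln(alphas + counts) - gammaln(alphas)))

# Beta(1,1) prior, 3 heads and 2 tails: P(D) = Γ(2)/Γ(7) · Γ(4)Γ(3) = 1/60.
print(np.exp(log_marginal_likelihood([3, 2], [1, 1])))  # ≈ 0.0167
```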

22 Marginal Likelihood Example
[Plot: (log P(D)) / M versus the number of samples M, for an actual experiment with P(H) = 0.25, under Dirichlet(0.5, 0.5), Dirichlet(1, 1), and Dirichlet(5, 5) priors]

23 Marginal Likelihood: BayesNets
The network structure determines the form of the marginal likelihood
[Table: seven H/T observations of X and Y]
Network 1: X and Y independent
- Two Dirichlet marginal likelihoods: P(X[1],…,X[7]) and P(Y[1],…,Y[7])

24 Marginal Likelihood: BayesNets
The network structure determines the form of the marginal likelihood
[Table: the same seven H/T observations of X and Y]
Network 2: X → Y
- Three Dirichlet marginal likelihoods: P(X[1],…,X[7]); one for Y in the samples where X = H, P(Y[1], Y[4], Y[6], Y[7]); and one where X = T, P(Y[2], Y[3], Y[5])

25 Idealized Experiment
P(X = H) = 0.5, P(Y = H | X = H) = 0.5 + p, P(Y = H | X = T) = 0.5 − p
[Plot: (log P(D)) / M versus M (1 to 1000), for the independent model and for p = 0.05, 0.10, 0.15, 0.20]

26 Marginal Likelihood: BayesNets
The marginal likelihood has the form:
P(D | G) = Π_i Π_{pa_i} [Γ(α(pa_i)) / Γ(α(pa_i) + M(pa_i))] · Π_{x_i} [Γ(α(x_i, pa_i) + M(x_i, pa_i)) / Γ(α(x_i, pa_i))]
where the M(..) are counts from the data and the α(..) are hyperparameters for each family. Each inner factor is a Dirichlet marginal likelihood for the sequence of values of Xi observed when Xi's parents take a particular value.
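
Continuing the sketch from slide 21, a per-family version; the full log P(D | G) is the sum of this term over all families. Names are illustrative, the arrays are assumed to be numpy integer arrays, and for simplicity the same hyperparameter vector is used for every parent configuration:

```python
import numpy as np

def family_log_marginal(child_vals, parent_cols, alpha):
    """One family's term in the network marginal likelihood (sketch).

    `child_vals`: the child's observed values; `parent_cols`: one array
    per parent (possibly none); `alpha`: Dirichlet hyperparameters over
    the child's values. Sums a Dirichlet marginal likelihood (the
    log_marginal_likelihood sketch above) over observed parent configs.
    """
    if parent_cols:
        # Encode each joint parent configuration as a single integer id.
        configs = np.unique(np.stack(parent_cols, axis=1),
                            axis=0, return_inverse=True)[1].ravel()
    else:
        configs = np.zeros(len(child_vals), dtype=int)
    total = 0.0
    for u in np.unique(configs):
        counts = np.bincount(child_vals[configs == u], minlength=len(alpha))
        total += log_marginal_likelihood(counts, alpha)
    return total
```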

27 Priors: Structure Prior P(G)
- Uniform prior: P(G) ∝ constant
- Prior penalizing the number of edges: P(G) ∝ c^|G|, with 0 < c < 1 and |G| the number of edges
- The normalizing constant is the same across networks and can thus be ignored
Bayesian score: score_B(G : D) = log P(D | G) + log P(G)

28 Priors: Parameter Prior P(θ|G)
BDe prior:
- M0: equivalent sample size
- B0: a network representing the prior probability of events
- Set α(xi, pai^G) = M0 · P(xi, pai^G | B0)
  Note: pai^G are not necessarily the same as the parents of Xi in B0; compute P(xi, pai^G | B0) using standard inference in B0
- BDe has the desirable property that I-equivalent networks have the same Bayesian score when using the BDe prior, for some M' and P'

29 Bayesian Score: Asymptotic Behavior
For M → ∞, a network G with Dirichlet priors satisfies
log P(D | G) = ℓ(θ̂_G : D) − (log M / 2) · Dim(G) + O(1)
where Dim(G) is the number of independent parameters in G. The approximation is called the BIC score.
- The score exhibits a tradeoff between fit to the data and model complexity
- The mutual information term grows linearly with M, while the complexity term grows logarithmically with M
- As M grows, more emphasis is given to the fit to the data

30 Bayesian Score: Asymptotic Behavior
For M → ∞, a network G with Dirichlet priors satisfies the BIC approximation above, so the Bayesian score is consistent:
- As M → ∞, the true structure G* maximizes the score
- Spurious edges do not contribute to the likelihood and are penalized
- Required edges are added, since the likelihood term grows linearly in M while the model complexity penalty grows only logarithmically

31 Summary: Network Scores
- Likelihood, MDL, and (log) BDe scores all decompose into a sum of per-family terms
- BDe requires assessing a prior network; it can naturally incorporate prior knowledge
- BDe is consistent and asymptotically equivalent (up to a constant) to BIC/MDL
- All are score-equivalent: if G is I-equivalent to G', then score(G) = score(G')

32 Optimization Problem
Input:
- Training data
- A scoring function (including priors, if needed)
- The set of possible structures, including prior knowledge about structure
Output:
- A network (or networks) that maximizes the score
Key property:
- Decomposability: the score of a network is a sum of per-family terms

33 Learning Trees
Trees: at most one parent per variable
Why trees?
- Elegant math: we can solve the optimization problem efficiently (with a greedy algorithm)
- Sparse parameterization: avoids overfitting while adapting to the data

34 Learning Trees
Let p(i) denote the parent of Xi, or 0 if Xi has no parent. We can write the score as
score(G : D) = Σ_{i : p(i) > 0} [score(Xi | Xp(i)) − score(Xi)] + Σ_i score(Xi)
i.e., score = sum of edge scores + constant: each edge score is the improvement over the "empty" network, and the constant is the score of the "empty" network.

35 Learning Trees: Algorithm
- Construct a graph with vertices 1,…,n
- Set w(i→j) = score(Xj | Xi) − score(Xj)
- Find the tree (or forest) with maximal weight; this can be done by standard algorithms in low-order polynomial time, building the tree greedily (Kruskal's maximum spanning tree algorithm)
Theorem: the procedure finds the tree with maximal score
When the score is the likelihood, w(i→j) is proportional to I(Xi; Xj); this is known as the Chow & Liu method. A sketch follows.
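
A sketch of the Chow & Liu procedure for the likelihood score, using empirical mutual information as edge weights; it assumes discrete data coded as small non-negative integers, and networkx supplies the maximum spanning tree:

```python
import numpy as np
import networkx as nx

def chow_liu_tree(data):
    """Chow & Liu (sketch): maximum-likelihood tree over discrete data.

    `data` is an (M, n) array of integer codes. Edge weights are the
    empirical pairwise mutual informations; the optimal tree is a
    maximum-weight spanning tree over them (Kruskal).
    """
    m, n = data.shape
    g = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            # Empirical joint distribution of (X_i, X_j).
            joint = np.zeros((data[:, i].max() + 1, data[:, j].max() + 1))
            np.add.at(joint, (data[:, i], data[:, j]), 1)
            joint /= m
            marg = np.outer(joint.sum(axis=1), joint.sum(axis=0))
            nz = joint > 0
            mi = np.sum(joint[nz] * np.log(joint[nz] / marg[nz]))
            g.add_edge(i, j, weight=mi)
    return nx.maximum_spanning_tree(g)  # undirected: directions are arbitrary
```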

36 Learning Trees: Example
[Figure: the original Alarm network next to a tree learned from data generated by it, with correct and spurious edges marked]
- Not every edge in the tree is in the original network
- The tree's edge directions are arbitrary: we cannot learn arc directions

37 Beyond Trees
The problem is not easy for more complex networks:
- Example: when allowing two parents, the greedy algorithm is no longer guaranteed to find the optimal network
- In fact, no efficient algorithm exists
Theorem: finding the maximal scoring network structure with at most k parents per variable is NP-hard for k > 1

38 Fixed Ordering
For any decomposable scoring function score(G : D) and ordering ≺, the maximal scoring network has
Pa(Xi) = argmax_{U ⊆ {X : X ≺ Xi}, |U| ≤ d} FamScore(Xi | U : D)
- For a fixed ordering we thus have independent per-variable problems, since the choice at Xi does not constrain the other choices
- If we bound the in-degree per variable by d, the complexity is exponential in d
A sketch of this search follows.

39 Heuristic Search
We address the problem by using heuristic search. Define a search space where:
- nodes are possible structures
- edges denote adjacency of structures
Traverse this space looking for high-scoring structures. Search techniques:
- Greedy hill-climbing
- Best-first search
- Simulated annealing
- ...

40 Heuristic Search
Typical operations on a current structure (illustrated over variables A, B, C, D):
- Add an edge (e.g., add B→D)
- Delete an edge (e.g., delete B→C)
- Reverse an edge (e.g., reverse C→B)

41 Exploiting Decomposability
Caching: to update the score after a local change, we only need to re-score the families that changed in the last move

42 Greedy Hill Climbing
The simplest heuristic local search:
- Start with a given network: the empty network, the best tree, or a random network
- At each iteration, evaluate all possible changes and apply the change that leads to the best improvement in score; reiterate
- Stop when no modification improves the score
Each step requires evaluating O(n) new changes. A sketch follows.
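
A minimal sketch with add/delete moves only; acyclicity checks, edge reversal, and the caching of slide 41 are omitted for brevity, and score is a hypothetical decomposable network score over edge sets:

```python
def greedy_hill_climb(variables, score, initial_edges=frozenset()):
    """Greedy hill-climbing over single-edge changes (sketch)."""
    edges = set(initial_edges)
    current = score(frozenset(edges))
    while True:
        best_move, best_score = None, current
        # Evaluate all single-edge changes of the current network.
        for x in variables:
            for y in variables:
                if x == y:
                    continue
                candidate = frozenset(edges ^ {(x, y)})  # toggle edge x -> y
                s = score(candidate)
                if s > best_score:
                    best_move, best_score = candidate, s
        if best_move is None:
            return edges, current  # local maximum: no change improves the score
        edges, current = set(best_move), best_score
```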

43 Greedy Hill Climbing: Pitfalls
Greedy hill-climbing can get stuck in:
- Local maxima: all one-edge changes reduce the score
- Plateaus: some one-edge changes leave the score unchanged; this happens because equivalent networks receive the same score and are neighbors in the search space
Both occur during structure search. Standard heuristics can escape both:
- Random restarts
- TABU search

44 Equivalence Class Search
Idea: search the space of equivalence classes. Equivalence classes can be represented by PDAGs (partially directed acyclic graphs)
Advantages:
- The PDAG space has fewer local maxima and plateaus
- There are fewer PDAGs than DAGs

45 Equivalence Class Search
Drawbacks:
- Evaluating changes is more expensive: in addition to searching, we need to score a consistent network
- These algorithms are more complex to implement
[Figure: applying the operation "add Y—Z" to an original PDAG over X, Y, Z, extracting a consistent DAG, and scoring it]

46 Learning Example: Alarm Network
[Plot: KL divergence from the true distribution versus the number of samples M (500 to 5000), comparing learning with the true structure against learning with unknown structure, both with a BDe prior, M' = 10]

47 Model Selection
So far, we focused on a single model:
- Find the best scoring model and use it to predict the next example
- Implicit assumption: the best scoring model dominates the weighted sum; valid with many data instances
Pros:
- We get a single structure, allowing efficient use in our tasks
Cons:
- We commit to the independencies of a particular structure
- Other structures might be as probable given the data

48 Model Selection
Density estimation:
- One structure may suffice, if its joint distribution is similar to that of other high scoring structures
Structure discovery:
- Define features f(G) (e.g., an edge, a sub-structure, a d-separation query)
- Compute P(f | D) = Σ_G f(G) P(G | D)
- Still requires summing over exponentially many structures

49 Model Averaging Given an Order
Assumptions:
- Known total order ≺ of the variables
- Maximum in-degree d for all variables
Marginal likelihood:
P(D | ≺) = Σ_{G consistent with ≺} P(G | ≺) P(D | G)
Using a decomposability assumption on the prior P(G | ≺), and since given the ordering ≺ the parent choices are independent, the sum rewrites as a product of per-variable sums over candidate parent sets:
P(D | ≺) = Π_i Σ_{U ⊆ {X : X ≺ Xi}, |U| ≤ d} FamScore(Xi, U : D)
Cost per family: O(n^d); total cost: O(n^(d+1))

50 Model Averaging Given an Order
Posterior probability of a general feature:
P(f | D, ≺) = Σ_G f(G) P(G | D, ≺)
Posterior probability of a particular choice of parents:
P(Pa(Xi) = U | D, ≺) = FamScore(Xi, U : D) / Σ_{U'} FamScore(Xi, U' : D)
Posterior probability of a particular edge choice Xj → Xi: sum the above over all parent sets U containing Xj; all other terms cancel out

51 Model Averaging
In general we cannot assume that the order is known. Solution:
- Sample from the posterior distribution P(G | D)
- Estimate the feature probability by P(f | D) ≈ (1/K) Σ_k f(G_k) over samples G_1,…,G_K
- Sampling can be done by MCMC, and can be done over orderings rather than structures

52 Notes on Learning Local Structures
- Define the score with local structures; example: with tree CPDs, the score decomposes by leaves
- The prior may need to be extended; example: with tree CPDs, a penalty for the tree structure of each CPD
- Extend the search operators to local structure; example: with tree CPDs, we also need to search over tree structures
  This can be done by a local encapsulated search or by defining new global operations

53 Search: Summary
- A discrete optimization problem; in general NP-hard, so we need to resort to heuristic search
- In practice, search is relatively fast (~100 variables in ~10 minutes), thanks to decomposability and sufficient statistics
- In some cases, we can reduce the search problem to an easy optimization problem; example: learning trees

