Slide 1: Finding Optimal Bayesian Networks with Greedy Search
Max Chickering
Slide 2: Outline
- Bayesian-network definitions
- Learning
- Greedy Equivalence Search (GES)
- Optimality of GES
Slide 3: Bayesian Networks
Use B = (S, θ) to represent p(X1, …, Xn): a structure S (a DAG over the variables) and parameters θ (one conditional distribution per node given its parents).
Slide 4: Markov Conditions
From the factorization: I(X, ND | Par(X)) -- each variable X is independent of its non-descendants (ND) given its parents Par(X).
[Figure: a node X with its parents, non-descendants, and descendants marked.]
The Markov conditions plus the graphoid axioms characterize all independencies.
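The factorization behind the Markov conditions can be made concrete in a few lines. A minimal sketch: the network shape (A -> B, A -> C) and every CPT value below are illustrative assumptions, not numbers from the talk.

```python
# Toy network A -> B, A -> C over binary variables.
# All CPT numbers below are made up for illustration.
cpt_a = {0: 0.6, 1: 0.4}                             # p(A)
cpt_b = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p(B | A): cpt_b[a][b]
cpt_c = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # p(C | A): cpt_c[a][c]

def joint(a, b, c):
    """p(a, b, c) = p(a) * p(b | a) * p(c | a) -- the BN factorization."""
    return cpt_a[a] * cpt_b[a][b] * cpt_c[a][c]

# The factorization defines a proper distribution: the entries sum to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Because each CPT row sums to 1, the product automatically normalizes over all joint configurations.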
Slide 5: Structure/Distribution Inclusion
A distribution p is included in a structure S if there exist parameters θ such that B = (S, θ) defines p.
[Figure: the set of all distributions, with the subset included in S containing p.]
Slide 6: Structure/Structure Inclusion (T ≤ S)
T is included in S if every distribution included in T is also included in S (S is an I-map of T).
[Figure: the distributions of T drawn as a subset of those of S, with example three-node structures.]
Slide 7: Structure/Structure Equivalence (T ≈ S)
T and S are equivalent if each includes the other. Equivalence is reflexive, symmetric, and transitive.
[Figure: S and T covering the same set of distributions.]
Slide 8: Equivalence Theorem (Verma and Pearl, 1990)
S ≈ T if and only if S and T have the same skeletons and the same v-structures.
[Figure: two DAGs over A, B, C, D sharing a skeleton; a v-structure B -> D <- C.]
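The theorem yields a direct equivalence test: compare skeletons and v-structures. A small sketch (the parents-dict graph encoding is my own convention, not from the talk):

```python
from itertools import combinations

def skeleton(dag):
    """Undirected adjacencies of a DAG given as {node: set(parents)}."""
    return {frozenset((p, c)) for c in dag for p in dag[c]}

def v_structures(dag):
    """Colliders X -> Z <- Y where X and Y are non-adjacent."""
    skel = skeleton(dag)
    return {
        (min(x, y), z, max(x, y))
        for z in dag
        for x, y in combinations(sorted(dag[z]), 2)
        if frozenset((x, y)) not in skel
    }

def equivalent(g, h):
    """Verma-Pearl (1990): same skeleton and same v-structures."""
    return skeleton(g) == skeleton(h) and v_structures(g) == v_structures(h)

# X -> Y -> Z and X <- Y <- Z are equivalent; the collider X -> Y <- Z is not,
# even though all three share the skeleton X - Y - Z.
g1 = {"X": set(), "Y": {"X"}, "Z": {"Y"}}
g2 = {"Z": set(), "Y": {"Z"}, "X": {"Y"}}
g3 = {"X": set(), "Z": set(), "Y": {"X", "Z"}}
```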
Slide 9: Learning Bayesian Networks
A generative distribution p* produces iid samples (the observed data); from the data we learn a model: learn the structure, then estimate the conditional distributions.
[Figure: generative distribution -> observed data table over X, Y, Z -> learned model.]
Slide 10: Learning Structure
- Scoring criterion F(D, S)
- Search procedure
Goal: identify one or more structures with high values of the scoring criterion.
Slide 11: Properties of Scoring Criteria
- Consistent
- Locally consistent
- Score equivalent
Slide 12: Consistent Criterion
The criterion favors (in the limit of large data) the simplest model that includes the generative distribution p*:
- If S includes p* and T does not, then F(S, D) > F(T, D).
- If both include p* and S has fewer parameters, then F(S, D) > F(T, D).
Slide 13: Locally Consistent Criterion
Suppose S and T differ by a single edge Y -> X (absent in S, present in T). Then:
- If I(X, Y | Par(X)) holds in p*, then F(S, D) > F(T, D) (the simpler structure wins).
- Otherwise, F(S, D) < F(T, D) (the extra edge improves the score).
Slide 14: Score-Equivalent Criterion
If S ≈ T then F(S, D) = F(T, D). For example, X -> Y and X <- Y are equivalent and must receive the same score.
Slide 15: Bayesian Criterion (consistent, locally consistent, and score equivalent)
S^h: the hypothesis that the generative distribution p* has the same independence constraints as S.
F_Bayes(S, D) = log p(S^h | D) = k + log p(D | S^h) + log p(S^h)
where log p(S^h) is the structure prior (e.g. prefer simple structures) and log p(D | S^h) is the marginal likelihood (closed form under assumptions).
Slide 16: Search Procedure
- Set of states
- Representation for the states
- Operators to move between states
- Systematic search algorithm
Slide 17: Greedy Equivalence Search
- Set of states: equivalence classes of DAGs
- Representation: essential graphs
- Operators: forward and backward operators
- Search algorithm: two-phase greedy
Slide 18: Representation: Essential Graphs
An essential graph directs every compelled edge (oriented the same way in all DAGs of the class) and leaves every reversible edge undirected.
[Figure: a six-node example over A-F with compelled and reversible edges marked.]
Slide 19: GES Operators
- Forward direction: single-edge additions
- Backward direction: single-edge deletions
Slide 20: Two-Phase Greedy Algorithm
Phase 1: Forward Equivalence Search (FES)
- Start with the all-independence model
- Run greedy search using the forward operators
Phase 2: Backward Equivalence Search (BES)
- Start from the local maximum found by FES
- Run greedy search using the backward operators
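GES proper searches over equivalence classes represented as essential graphs. As a rough sketch of the two-phase shell only, here is a DAG-space approximation: the operators act on single DAGs rather than equivalence classes, and the scoring function is a stand-in, so this is an illustration of the control flow, not the algorithm from the talk.

```python
def acyclic(dag):
    """DFS cycle check on a graph given as {node: set(parents)}."""
    done, stack = set(), set()
    def visit(n):
        if n in stack:
            return False          # back edge -> cycle
        if n in done:
            return True
        stack.add(n)
        ok = all(visit(p) for p in dag[n])
        stack.discard(n)
        done.add(n)
        return ok
    return all(visit(n) for n in dag)

def greedy_phase(state, neighbors, score):
    """Hill-climb: move to the best-scoring neighbor until no improvement."""
    best = score(state)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(state):
            s = score(cand)
            if s > best:
                state, best, improved = cand, s, True
    return state

def additions(dag):
    """Forward operators (DAG-level): all acyclic single-edge additions."""
    for y in dag:
        for x in dag:
            if x != y and x not in dag[y] and y not in dag[x]:
                cand = {n: set(ps) for n, ps in dag.items()}
                cand[y].add(x)
                if acyclic(cand):
                    yield cand

def deletions(dag):
    """Backward operators (DAG-level): all single-edge deletions."""
    for y in dag:
        for x in dag[y]:
            cand = {n: set(ps) for n, ps in dag.items()}
            cand[y].remove(x)
            yield cand

def two_phase_greedy(nodes, score):
    """FES from the all-independence (empty) model, then BES."""
    empty = {n: set() for n in nodes}
    return greedy_phase(greedy_phase(empty, additions, score), deletions, score)

# Demo with a toy score (made up) that rewards exactly the edge A -> B
# and penalizes every other edge.
target = {("A", "B")}
def edge_score(dag):
    return sum(1 if (p, c) in target else -1 for c in dag for p in dag[c])

result = two_phase_greedy(["A", "B", "C"], edge_score)
```

With this score, the forward phase adds A -> B and nothing else, and the backward phase has nothing profitable to delete.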
Slide 21: Forward Operators
- Consider all DAGs in the current state (equivalence class)
- For each DAG, consider all single-edge additions that keep the graph acyclic
- Take the union of the resulting equivalence classes
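The "union of the resulting equivalence classes" step can be illustrated by grouping single-edge-addition DAGs by their Verma-Pearl signature (skeleton plus v-structures). This sketch applies the operator to one DAG only rather than to every member of the class, so it is a simplified stand-in for the true GES forward operator; the example graphs are chosen for illustration.

```python
from itertools import combinations

def eq_class_key(dag):
    """Verma-Pearl signature of an equivalence class: (skeleton, v-structures)."""
    skel = frozenset(frozenset((p, c)) for c in dag for p in dag[c])
    vs = frozenset(
        (min(x, y), c, max(x, y))
        for c in dag
        for x, y in combinations(sorted(dag[c]), 2)
        if frozenset((x, y)) not in skel
    )
    return skel, vs

def creates_cycle(dag, parent, child):
    """Would adding parent -> child create a cycle? Yes iff child is already
    an ancestor of parent (walk upward through parent sets)."""
    frontier, seen = set(dag[parent]), set()
    while frontier:
        n = frontier.pop()
        if n == child:
            return True
        seen.add(n)
        frontier |= dag[n] - seen
    return False

def forward_classes(dag):
    """Distinct equivalence classes reachable by one acyclic edge addition."""
    classes = set()
    for y in dag:
        for x in dag:
            if (x != y and x not in dag[y] and y not in dag[x]
                    and not creates_cycle(dag, x, y)):
                cand = {n: set(ps) for n, ps in dag.items()}
                cand[y].add(x)
                classes.add(eq_class_key(cand))
    return classes

# From the empty graph over A, B, C there are 6 addition DAGs but only
# 3 distinct classes (the two orientations of each edge are equivalent).
empty = {"A": set(), "B": set(), "C": set()}
one_edge = {"A": set(), "B": {"A"}, "C": set()}
```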
Slide 22: Forward-Operators Example
[Figure: the current state over A, B, C; every DAG in the class; all DAGs obtained by a single edge addition; and the union of the corresponding essential graphs.]
Slide 23: Forward-Operators Example (continued)
[Figure: the resulting neighbor essential graphs.]
Slide 24: Backward Operators
- Consider all DAGs in the current state
- For each DAG, consider all single-edge deletions
- Take the union of the resulting equivalence classes
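The backward operator is simpler than the forward one because removing an edge can never introduce a cycle, so no acyclicity check is needed. A DAG-level sketch (the example graph is chosen for illustration):

```python
def edge_deletions(dag):
    """All DAGs reachable by deleting a single edge from
    {node: set(parents)}; deletions cannot create cycles."""
    for child in dag:
        for parent in dag[child]:
            cand = {n: set(ps) for n, ps in dag.items()}
            cand[child].remove(parent)
            yield cand

# The complete DAG A -> B, A -> C, B -> C has three edges,
# hence exactly three single-edge deletions.
full = {"A": set(), "B": {"A"}, "C": {"A", "B"}}
neighbors = list(edge_deletions(full))
```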
Slide 25: Backward-Operators Example
[Figure: the current state over A, B, C; every DAG in the class; all DAGs obtained by a single edge deletion; and the union of the corresponding essential graphs.]
Slide 26: Backward-Operators Example (continued)
[Figure: the resulting neighbor essential graphs.]
Slide 27: DAG-Perfect Distributions
A distribution p is DAG-perfect if there exists a DAG G such that I(X, Y | Z) holds in p if and only if I(X, Y | Z) holds in G.
[Figure: a DAG-perfect distribution p with its DAG, and a non-DAG-perfect distribution q over A, B, C, D satisfying both I(A, D | B, C) and I(B, C | A, D), which no single DAG encodes exactly.]
Slide 28: DAG-Perfect Consequence: The Composition Axiom Holds in p*
If ¬I(X, Y | Z), then ¬I(X, y | Z) for some singleton y ∈ Y. (Equivalently: if X is independent of each member of Y given Z, then X is independent of all of Y given Z.)
Slide 29: Optimality of GES
If p* is DAG-perfect with respect to some G*, then given n iid samples from p*, GES returns a state S with S = S* for large enough n, where S* is the equivalence class of G*.
[Figure: p* generates the data; GES maps the data to S = S*.]
Slide 30: Optimality of GES -- Proof Outline
- Start from the all-independence model.
- After the first phase (FES), the current state includes S*.
- After the second phase (BES), the current state equals S*.
Slide 31: FES Maximum Includes S*
Assume the local maximum S does NOT include S*, and take any DAG G from S. Because the Markov conditions characterize the independencies, in p* there exists a node X that is not independent of its non-descendants given its parents, e.g. ¬I(X, {A, B, C, D} | E). Since p* is DAG-perfect, the composition axiom holds, so some singleton is dependent, e.g. ¬I(X, C | E). By local consistency, adding the edge C -> X improves the score, and the resulting equivalence class is a neighbor -- contradicting that S is a local maximum.
Slide 32: BES Identifies S*
- The current state always includes S*: follows from local consistency of the criterion.
- The local optimum reached by BES is S*: follows from Meek's conjecture.
Slide 33: Meek's Conjecture
For any pair of DAGs G, H such that H includes G (G ≤ H), there exists a sequence of
(1) covered edge reversals in G, and
(2) single-edge additions to G,
such that G ≤ H holds after each change, and G = H after all changes.
Slide 34: Meek's Conjecture -- Example
[Figure: a DAG G over A, B, C, D satisfying I(A, B) and I(C, B | A, D), transformed into H by a sequence of covered edge reversals and single-edge additions.]
Slide 35: Meek's Conjecture and BES (S* ≤ S)
Assume the local maximum S is not S*. Take any DAG H from S and any DAG G from S*. By Meek's conjecture there is a sequence of single-edge additions and covered reversals transforming G into H.
[Figure: the sequence Add, Rev, Rev, Add, Rev from G to H.]
Slide 36: Meek's Conjecture and BES (S* ≤ S, continued)
Reversing that sequence transforms H into G using single-edge deletions and covered reversals.
[Figure: the reversed sequence Del, Rev, Rev, Del, Rev from H to G.]
Slide 37: Meek's Conjecture and BES (S* ≤ S, continued)
The first deletion in the reversed sequence yields a state that still includes S* and is a neighbor of S in BES -- so S cannot be a local maximum, a contradiction.
Slide 38: Discussion Points
- In practice, GES is as fast as DAG-based search: the neighborhood of an essential graph can be generated and scored very efficiently.
- When the DAG-perfect assumption fails, we still get optimality guarantees: as long as composition holds in the generative distribution, any local maximum is inclusion-minimal.
Slide 39: Thanks!
My home page: http://research.microsoft.com/~dmax
Relevant papers:
- "Optimal Structure Identification with Greedy Search" (JMLR submission): detailed proofs of Meek's conjecture and the optimality of GES.
- "Finding Optimal Bayesian Networks" (UAI 2002, with Chris Meek): extends the optimality results of GES to the non-DAG-perfect case.
Slide 41: Bayesian Criterion is Locally Consistent
The Bayesian score approaches BIC plus a constant. BIC is decomposable: the difference in score is the same for any pair of DAGs that differ by a Y -> X edge, as long as X has the same other parents.
[Figure: X -> Y vs. X, Y unconnected; the complete network always includes p*.]
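Decomposability can be shown directly with a small BIC-style score: the total is a sum of per-family terms, so two DAGs that differ only in X's parent set differ only in X's local term. The score below uses maximum-likelihood multinomial estimates; the data rows and their dict encoding are made up for illustration.

```python
import math
from collections import Counter

def local_bic(data, child, parents):
    """BIC term for one family: ML multinomial log-likelihood of
    p(child | parents) minus (free parameters / 2) * log N."""
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    ll = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    states = {row[child] for row in data}
    n_params = (len(states) - 1) * len(marg)
    return ll - 0.5 * n_params * math.log(n)

def bic(data, dag):
    """Decomposable: the total score is a sum of per-family local terms."""
    return sum(local_bic(data, x, sorted(ps)) for x, ps in dag.items())

# Toy data (made up): each row maps variable name -> observed value.
data = [{"X": 0, "Y": 0}, {"X": 1, "Y": 1}, {"X": 0, "Y": 0},
        {"X": 1, "Y": 0}, {"X": 0, "Y": 1}]
g_with = {"Y": set(), "X": {"Y"}}      # Y -> X
g_without = {"Y": set(), "X": set()}   # no edge
```

The score difference between `g_with` and `g_without` reduces to the difference in X's local term alone, since Y's term is identical in both.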
Slide 42: Bayesian Criterion is Consistent
Assume the conditionals are unconstrained multinomials or linear regressions. Geiger, Heckerman, King and Meek (2001): such network structures are curved exponential models. Haughton (1988): the Bayesian criterion is consistent for curved exponential models.
Slide 43: Bayesian Criterion is Score Equivalent
If S ≈ T then F(S, D) = F(T, D): for X -> Y (structure S) and X <- Y (structure T), the hypotheses S^h and T^h both assert no independence constraints, so S^h = T^h.
Slide 44: Active Paths
Z-active path between X and Y (non-standard definition):
- Neither X nor Y is in Z
- Every pair of colliding edges meets at a member of Z
- No other pair of edges meets at a member of Z
If G ≤ H and there is a Z-active path between X and Y in G, then there is a Z-active path between X and Y in H.
Slide 45: Active Paths -- Properties
- A path X-Y can be out-of X and in-to Y; a path X-W can be out-of both endpoints.
- Any sub-path between A, B ∈ Z is also active.
- For consecutive segments A-B and B-C, at least one is out-of B, yielding an active path between A and C.
[Figure: example path through X, A, Z, W, B, Y.]
Slide 46: Simple Active Paths
When an active path contains the edge Y-X, either
(1) the edge appears exactly once, or
(2) the edge appears exactly twice.
To simplify the discussion, assume case (1) only; the proofs for case (2) are almost identical.
Slide 47: Typical Argument: Combining Active Paths
[Figure: graphs G, H, and G', with a sink node Z adjacent to both X and Y.]
Suppose there is an active path in G' with X not in the conditioning set and no corresponding active path in H. Then Z is not in the conditioning set.
Slide 48: Proof Sketch
Given two DAGs G, H with G < H, identify either:
- a covered edge X -> Y in G that has the opposite orientation in H, or
- a new edge X -> Y to add to G such that G remains included in H.
Slide 49: The Transformation
Choose any node Y that is a sink in H. Cases:
- 1a: Y is a sink in G, and some X ∈ Par_H(Y) is missing from Par_G(Y)
- 1b: Y is a sink in G with the same parents in both graphs
- 2a: there is an X such that Y -> X is covered
- 2b: there is an X such that Y -> X, and some W is a parent of Y but not of X
- 2c: for every edge Y -> X, Par(Y) ⊆ Par(X)
Slide 50: Preliminaries (G ≤ H)
- The adjacencies in G are a subset of the adjacencies in H.
- If X -> Y <- Z is a v-structure in G but not in H, then X and Z are adjacent in H.
- Any new active path that results from adding X -> Y to G includes X -> Y.
Slide 51: Proof Sketch: Case 1 (Y is a sink in G)
Case 1a: some X ∈ Par_H(Y) with X ∉ Par_G(Y); add X -> Y to G. Suppose there is a new active path between A and B not present in H. Y is a sink in G, so Y must be in the conditioning set, while neither X nor the next node Z is. In H, combine the active paths AP(A, Z) and AP(X, B) through Z -> Y <- X.
Case 1b: the parents are identical. Remove Y from both graphs; the proof is similar.
Slide 52: Proof Sketch: Case 2 (Y is not a sink in G)
Case 2a: there is a covered edge Y -> X: reverse it.
Case 2b: there is a non-covered edge Y -> X such that W is a parent of Y but not of X; add W -> X. Suppose there is a new active path between A and B not present in H. Y must be in the conditioning set, else replace W -> X by W -> Y -> X (not new). If X is not in the conditioning set, then in H the combination A-W, W -> Y <- X, X-B is active.
Slide 53: Case 2c: The Difficult Case
All non-covered edges Y -> Z have Par(Y) ⊆ Par(Z).
[Figure: G and H over W1, W2, Y, Z1, Z2.]
Adding W1 -> Y breaks G ≤ H (it creates a Z2-active path between W1 and W2); adding W2 -> Y preserves G ≤ H.
Slide 54: Choosing Z
[Figure: G and H with Y, the descendants of Y in G, and the nodes D and Z marked.]
D is the maximal G-descendant of Y in H; Z is any maximal child of Y such that D is a descendant of Z in G.
Slide 55: Choosing Z -- Example
The descendants of Y in G are Y, Z1, Z2. The maximal descendant in H is D = Z2, and the maximal child of Y in G that has D = Z2 as a descendant is Z2. Add the edge W2 -> Y.
Slide 56: Difficult Case: Proof Intuition
[Figure: G and H over W, A, B, Y, Z, D, with paths ending at B or at the conditioning set (CS).]
1. W is not in the conditioning set.
2. Y is not in the conditioning set, else the path is active in H.
3. In G, the subsequent edges must point away from Y until B or the conditioning set is reached.
4. In G, neither Z nor any descendant of Z is in the conditioning set, else the path was active before the addition.
5. From (1), (2), and (4), there are active paths (A, D) and (B, D) in H.
6. By the choice of D, there is a directed path from D to B or to the conditioning set in H.
Slide 58: Optimality of GES -- Definitions
Definition: p is DAG-perfect wrt G if the independence constraints in p are precisely those in G.
Assumption: the generative distribution p* is perfect wrt some G* defined over the observable variables. Let S* be the equivalence class containing G*. Under this DAG-perfect assumption, GES results in S*.
Slide 59: Important Definitions
- Bayesian networks
- Markov conditions
- Distribution/structure inclusion
- Structure/structure inclusion