Bayesian Network Representation Continued

1 Bayesian Network Representation Continued
Eran Segal Weizmann Institute

2 Last Time
Conditional independence
Conditional parameterization
Naïve Bayes model
Bayesian network: directed acyclic graph G with local probability models
I-maps: G is an I-map of P if I(G) ⊆ I(P)

3 Bayesian Network (Informal)
Directed acyclic graph G
Nodes represent random variables
Edges represent direct influences between random variables
Local probability models
[Figures: two example networks over I, S, G, C, and the Naïve Bayes structure with features X1, X2, …, Xn]

4 Bayesian Network Structure
Directed acyclic graph G
Nodes X1,…,Xn represent random variables
G encodes local Markov assumptions: Xi is independent of its non-descendants given its parents
Formally: (Xi ⊥ NonDesc(Xi) | Pa(Xi))
[Figure: example graph over A,…,G in which E ⊥ {A,C,D,F} | B]

5 Independency Mappings (I-Maps)
Let P be a distribution over X
Let I(P) be the independencies (X ⊥ Y | Z) in P
A Bayesian network structure G is an I-map (independency mapping) of P if I(G) ⊆ I(P)

Example: the disconnected graph G1 over I, S has I(G1) = {I ⊥ S}; the graph G2 with I → S has I(G2) = ∅

I     S     P(I,S)        I     S     P(I,S)
i0    s0    0.25          i0    s0    0.4
i0    s1    0.25          i0    s1    0.3
i1    s0    0.25          i1    s0    0.2
i1    s1    0.25          i1    s1    0.1
I(P) = {I ⊥ S}            I(P) = ∅
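The I-map example can be checked numerically. Below is a minimal sketch (the function name and the two example tables are mine, reconstructing the slide's values): the uniform table satisfies I ⊥ S, the skewed one does not.

```python
from itertools import product

def independent(joint, tol=1e-9):
    """joint maps (i, s) -> P(I=i, S=s); True iff P(i,s) = P(i)P(s) everywhere."""
    i_vals = {i for i, _ in joint}
    s_vals = {s for _, s in joint}
    p_i = {i: sum(joint[i, s] for s in s_vals) for i in i_vals}  # marginal P(I)
    p_s = {s: sum(joint[i, s] for i in i_vals) for s in s_vals}  # marginal P(S)
    return all(abs(joint[i, s] - p_i[i] * p_s[s]) < tol
               for i, s in product(i_vals, s_vals))

uniform = {(i, s): 0.25 for i in ('i0', 'i1') for s in ('s0', 's1')}
skewed = {('i0', 's0'): 0.4, ('i0', 's1'): 0.3,
          ('i1', 's0'): 0.2, ('i1', 's1'): 0.1}
```

For `uniform`, `independent` returns True, so the disconnected graph is an I-map; for `skewed` it returns False, and only the connected graph (which asserts nothing) is an I-map.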

6 Factorization Theorem
G is an I-map of P ⟺ P factorizes over G: P(X1,…,Xn) = ∏i P(Xi | Pa(Xi))

7 Factorization Theorem
If G is an I-map of P, then P(X1,…,Xn) = ∏i P(Xi | Pa(Xi))
Proof: wlog. X1,…,Xn is an ordering consistent with G
By chain rule: P(X1,…,Xn) = ∏i P(Xi | X1,…,Xi-1)
From the consistent ordering: Pa(Xi) ⊆ {X1,…,Xi-1} and {X1,…,Xi-1} − Pa(Xi) ⊆ NonDesc(Xi)
Since G is an I-map, (Xi ⊥ NonDesc(Xi) | Pa(Xi)) ∈ I(P), so P(Xi | X1,…,Xi-1) = P(Xi | Pa(Xi))
*** The concept of I-maps can really help us simplify the distribution we get from a BN
*** The factorization theorem holds if G is an I-map for P
*** Proof: the chain rule holds without making any assumptions
*** Assume an ordering consistent with G
*** From the assumption, the parents are already contained among the predecessors, and the remaining predecessors are non-descendants
*** Because G is an I-map, it encodes the independence assumptions that let us drop the non-descendants, giving the factorization we wanted
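The factorization can be made concrete. A minimal sketch (the chain A → B → C and all CPD values are my own illustration, not from the slides): the joint probability of a full assignment is a product of local CPD entries, one per node.

```python
from itertools import product

def joint_prob(assignment, cpds, parents):
    """P(x1,...,xn) = prod_i P(xi | pa(xi)) for one full assignment.
    cpds[X][(x, pa_vals)] = P(X=x | Pa(X)=pa_vals)."""
    p = 1.0
    for var, val in assignment.items():
        pa_vals = tuple(assignment[pa] for pa in parents[var])
        p *= cpds[var][val, pa_vals]
    return p

parents = {'A': [], 'B': ['A'], 'C': ['B']}  # chain A -> B -> C
cpds = {
    'A': {(0, ()): 0.6, (1, ()): 0.4},
    'B': {(0, (0,)): 0.7, (1, (0,)): 0.3, (0, (1,)): 0.2, (1, (1,)): 0.8},
    'C': {(0, (0,)): 0.9, (1, (0,)): 0.1, (0, (1,)): 0.5, (1, (1,)): 0.5},
}
total = sum(joint_prob(dict(zip('ABC', vals)), cpds, parents)
            for vals in product((0, 1), repeat=3))
```

The product of local CPDs defines a proper joint: `total` comes out to 1.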

8 Factorization Implies I-Map
 G is an I-Map of P Proof: Need to show (Xi; NonDesc(Xi)| Pa(Xi))I(P) or that P(Xi | NonDesc(Xi)) = P(Xi | Pa(Xi)) wlog. X1,…,Xn is an ordering consistent with G *** The reverse also holds *** Proof we need to show that for every property this will hold

9 Bayesian Network Definition
A Bayesian network is a pair (G,P)
P factorizes over G
P is specified as a set of CPDs associated with G's nodes
Parameters (binary variables):
Full joint distribution: 2^n
Bayesian network (in-degree bounded by k): n·2^k
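The parameter savings can be sketched for binary variables (the function name and the three-node chain are my illustration):

```python
def bn_param_count(parents):
    """Free parameters of a BN over binary variables: each node needs
    one free parameter per configuration of its parents, i.e. 2^|Pa|."""
    return sum(2 ** len(pa) for pa in parents.values())

chain = {'A': [], 'B': ['A'], 'C': ['B']}  # A -> B -> C
```

For the chain, `bn_param_count` gives 5, versus 2^3 = 8 entries (7 free parameters) for the full joint table; with in-degree bounded by k the BN count is at most n·2^k.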

10 Bayesian Network Design
Variable considerations
Clarity test: can an omniscient being determine its value?
Hidden variables?
Irrelevant variables
Structure considerations
Causal order of variables
Which independencies (approximately) hold?
Probability considerations
Zero probabilities
Orders of magnitude
Relative values
*** There are many design criteria when constructing a BN
*** Picking variables is a problem; variables should pass a clarity test. For instance, "risk of heart attack" is ill-defined, but "heart attack" is good
*** When picking the structure, we might want to follow causality; causal structures tend to work better. We must consider connectivity and dependencies
*** Probability considerations are important: a zero probability never allows an event to happen, no matter how much evidence, and relative magnitudes matter

11 Independencies in a BN
G encodes local Markov assumptions: Xi is independent of its non-descendants given its parents
Formally: (Xi ⊥ NonDesc(Xi) | Pa(Xi))
Does G encode other independence assumptions that hold in every distribution P that factorizes over G?
Goal: devise a procedure to find all independencies in G
*** Last time we saw that G encodes specific local Markov assumptions: a variable is independent of its non-descendants given its parents
*** Is there anything else?
*** This is important because we will use these concepts in later algorithms
*** To answer, we devise a procedure that finds all independencies that hold

12 d-Separation
Goal: a procedure d-sep(X;Y | Z, G) that returns "true" iff Ind(X;Y | Z) follows from the Markov independence assumptions in G
Strategy: since influence must "flow" along paths in G, consider reasoning patterns between X, Y, and Z in various structures in G
Active path: creates dependencies between nodes
Inactive path: cannot create dependencies
*** The goal is to have a procedure that we can query
*** First we gain intuition by considering reasoning patterns in G

13 Direct Connection
X and Y directly connected in G ⟹ no Z exists for which Ind(X;Y | Z) holds in every factorizing distribution
Example: deterministic function Y = f(X)
*** If X and Y are connected directly in G, then we can always find a P in which X and Y are correlated, e.g., via a deterministic function

14 Indirect Connection
Case 1: Indirect causal effect X → Z → Y: active when Z is not observed, blocked when Z is observed
Case 2: Indirect evidential effect X ← Z ← Y: active when Z is not observed, blocked when Z is observed
Case 3: Common cause X ← Z → Y: active when Z is not observed, blocked when Z is observed
Case 4: Common effect (v-structure) X → Z ← Y: blocked when Z is unobserved, active when Z or one of its descendants is observed
*** Case 1 active: knowing X changes our beliefs over Z, and in turn our beliefs over Y
*** Case 1 inactive: Y depends only on Z, so if we know Z then knowing X does not help
*** Case 2 active: knowing Y changes our beliefs over Z, and in turn our beliefs over X
*** Case 2 inactive: X depends only on Z, so if we know Z then knowing Y does not help
*** Case 3 active: knowing X might change our beliefs about the cause Z, which in turn changes our beliefs over Y
*** Case 3 inactive: knowing Z removes the influence, as in the Naïve Bayes model
*** Case 4 active: "explaining away" happens here. Otherwise the independence follows from the Markov independence assumptions in G. THIS IS A V-STRUCTURE. We showed this holds by the reverse factorization: if P factorizes over G, then G is an I-map
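The four cases reduce to a simple rule, sketched below (the function and its encoding of the cases are mine): a non-v-structure triple is active exactly when the middle node is unobserved, while a v-structure needs the middle node or one of its descendants observed.

```python
def triple_active(kind, z_observed, descendant_observed=False):
    """Activity of a two-edge trail X-Z-Y through middle node Z.
    kind: 'causal' (X->Z->Y), 'evidential' (X<-Z<-Y),
          'common_cause' (X<-Z->Y), or 'common_effect' (X->Z<-Y)."""
    if kind == 'common_effect':  # v-structure: Z or a descendant must be observed
        return z_observed or descendant_observed
    return not z_observed        # cases 1-3: blocked exactly when Z is observed
```

For example, `triple_active('causal', True)` is False (observing Z blocks a chain), while `triple_active('common_effect', False, descendant_observed=True)` is True (a descendant activates the v-structure).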

15 The General Case
Let G be a Bayesian network structure
Let X1–…–Xn be a trail in G
Let E be a subset of evidence nodes in G
The trail X1–…–Xn is active given evidence E if:
For every v-structure Xi-1 → Xi ← Xi+1 along the trail, Xi or one of its descendants is observed
No other node along the trail is in E
*** The general case defines a trail; the trail is active if for every v-structure the middle node or one of its descendants is observed, and no other node along the trail is observed

16 d-Separation
X and Y are d-separated in G given Z, denoted d-sepG(X;Y | Z), if there is no active trail between any node X ∈ X and any node Y ∈ Y in G
I(G) = {(X ⊥ Y | Z) : d-sepG(X;Y | Z)}
*** We define that X and Y are d-separated in G given Z if there is no active trail between any nodes in the two sets X and Y
*** Then the set of independencies I(G) is all of those for which d-separation holds

17 Examples
[Figure: graph over A,…,E with a v-structure into D and E a descendant of D]
d-sep(B;C) = yes
d-sep(B;C | D) = no
d-sep(B;C | A,D) = yes
*** The v-structure blocks the path, and thus there is d-separation
*** The v-structure can be activated by evidence
*** The path can then be blocked again by observing the parent

18 d-Separation: Soundness
Theorem: if G is an I-map of P and d-sepG(X;Y | Z) = yes, then P satisfies Ind(X;Y | Z)
(Proof deferred)
*** The above was based on intuition; we need to show it formally
*** Soundness means that if we say d-separation holds, then P satisfies the independence assumption
*** We will prove this later, because we need to build up more tools

19 d-Separation: Completeness
Theorem: if d-sepG(X;Y | Z) = no, then there exists P such that G is an I-map of P and P does not satisfy Ind(X;Y | Z)
Proof outline:
Construct a distribution P in which the independence does not hold
Since there is no d-separation, there is an active path
For each interaction along the path, correlate the variables through the CPDs
Set all other CPDs to uniform, ensuring that influence flows only along a single path and cannot be canceled out
The detailed distribution construction is quite involved
*** For completeness, we need to show that if there is no d-separation then there exists P for which the independence assumption does not hold
*** Note that this cannot be shown for ALL distributions P, because a distribution P might have a spurious independence that does not follow from the structure
*** The proof constructs a distribution by setting a correlation at every interaction along a path that must be active; by setting all other CPDs to uniform we guarantee no canceling-out effects

20 Algorithm for d-Separation
Goal: answer whether d-sep(X;Y | Z, G)
Algorithm:
Mark all nodes that are in Z or have descendants in Z
BFS traverse G from X
Stop traversal at blocked nodes:
A node in the middle of a v-structure that is not in the marked set
Any other node that is in Z
If we reach any node in Y, there is an active path, and thus d-sep(X;Y | Z, G) does not hold
Theorem: the algorithm finds all nodes reachable from X via trails that are active in G
*** Our goal is to answer a specific d-separation query
*** We first mark the nodes with descendants in Z, for efficiency
*** We then traverse the graph
*** We stop at a blocked node: either it is in the middle of a v-structure but neither it nor any of its descendants is in Z, or it is not such a node but is in Z and thus blocks the path
*** If we reach any node in Y then we can stop, as we found an active path, and thus d-separation does not hold
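The marking plus traversal can be sketched as follows. This is an assumption-laden sketch, not the slide's exact pseudocode: it additionally tracks the direction from which each node is reached (the standard refinement needed to treat chains, forks, and v-structures correctly), and the example graph in the usage note is my own.

```python
from collections import deque

def d_separated(children, X, Y, Z):
    """True iff X and Y are d-separated given evidence set Z.
    children: dict node -> list of children (a DAG)."""
    parents = {n: set() for n in children}
    for n, chs in children.items():
        for c in chs:
            parents[c].add(n)
    # Phase 1: mark Z and every node with a descendant in Z
    marked, frontier = set(), deque(Z)
    while frontier:
        n = frontier.popleft()
        if n not in marked:
            marked.add(n)
            frontier.extend(parents[n])
    # Phase 2: BFS over (node, direction); 'up' means we arrived from a child
    visited, frontier = set(), deque([(X, 'up')])
    while frontier:
        n, direction = frontier.popleft()
        if (n, direction) in visited:
            continue
        visited.add((n, direction))
        if n == Y and n not in Z:
            return False                      # active trail reached Y
        if direction == 'up' and n not in Z:  # chain/fork through n stays open
            frontier.extend((p, 'up') for p in parents[n])
            frontier.extend((c, 'down') for c in children[n])
        elif direction == 'down':
            if n not in Z:                    # continue downward through n
                frontier.extend((c, 'down') for c in children[n])
            if n in marked:                   # v-structure at n is activated
                frontier.extend((p, 'up') for p in parents[n])
    return True
```

On the illustrative graph B → A → D ← C with D → E, this gives yes for d-sep(B;C), no given {D} or {E} (a descendant activates the v-structure), and yes again given {A,D} (the parent A re-blocks the trail).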

21 I-Equivalence Between Graphs
I(G) describes all conditional independencies in G
Different Bayesian networks can encode the same independencies
Equivalence class I: Y ← X → Z, Y → X → Z, Y ← X ← Z, each encoding Ind(Y;Z | X)
Equivalence class II: Y → X ← Z, encoding Ind(Y;Z)
Two BN graphs G1 and G2 are I-equivalent if I(G1) = I(G2)
*** I(G) specifies the set of conditional independencies that hold in a graph, so we can use it to abstract away from the details of the graph structure and view the graph purely as a specification of conditional independence assumptions
*** Different BNs can encode the same conditional independence assumptions
*** The examples show this: the three networks on the left have the same conditional independence assumptions
*** This leads to the notion of I-equivalence over graphs: two BNs are I-equivalent if they encode the exact same set of independence assumptions

22 I-Equivalence Between Graphs
If P factorizes over a graph in an I-equivalence class, then P factorizes over all other graphs in the same class
P cannot distinguish one I-equivalent graph from another
Implications for structure learning
Test for I-equivalence: d-separation
*** If P factorizes over one graph in the class then it factorizes over any other graph in the I-equivalence class, because the factorization proof relies only on the conditional independence assumptions
*** This means we cannot identify the 'correct' structure from within an equivalence class
*** Implications for structure learning: we cannot find the correct structure; we will discuss this later in the course
*** A test for I-equivalence is d-separation: check whether all independence assumptions hold in both graphs

23 Test for I-Equivalence
Necessary condition: same graph skeleton
Otherwise, we can find an active path in one graph but not the other
But not sufficient: v-structures
Sufficient condition: same skeleton and same v-structures
But not necessary: complete graphs
Define X → Z ← Y as an immorality if X and Y are not connected
Necessary and sufficient: same skeleton and same set of immoralities
*** A necessary condition for I-equivalence is having the same graph skeleton; otherwise, a direct edge in one graph that is not present in the other can imply an active path that exists in only one graph. But this is not sufficient, due to v-structures
*** A sufficient condition is having the same skeleton and v-structures, but it is not necessary; consider complete graphs: they encode no independence assumptions, yet their edge directions can differ
*** Define a v-structure with no direct connection between the parents as an immorality; graphs with the same skeleton and the same immoralities are I-equivalent, and this is if and only if
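The necessary-and-sufficient condition translates directly into code. A minimal sketch (the parent-dict graph encoding and the names are my assumptions):

```python
from itertools import combinations

def skeleton(g):
    """g: dict node -> list of parents; returns the set of undirected edges."""
    return {frozenset((x, p)) for x in g for p in g[x]}

def immoralities(g):
    """Unordered pairs of non-adjacent parents sharing a child."""
    skel = skeleton(g)
    return {(frozenset((a, b)), z)
            for z in g
            for a, b in combinations(sorted(g[z]), 2)
            if frozenset((a, b)) not in skel}

def i_equivalent(g1, g2):
    """I-equivalence test: same skeleton and same immoralities."""
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)
```

The chains X → Y → Z and Z → Y → X come out I-equivalent; either chain versus the v-structure X → Y ← Z does not.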

24 Constructing Graphs for P
Can we construct a graph for a distribution P?
Any graph which is an I-map for P
But this is not so useful: complete graphs
A DAG is complete if adding any edge creates a cycle
Complete graphs imply no independence assumptions
Thus, they are I-maps of any distribution
*** Can we construct a graph that represents the independence assumptions that hold in P?
*** Of course, any graph which is an I-map will do
*** But this is not so useful, since complete graphs always work

25 Minimal I-Maps
A graph G is a minimal I-map for P if:
G is an I-map for P
Removing any edge from G renders it not an I-map
[Figure: a four-node I-map over X, Y, Z, W; the four graphs obtained by removing a single edge are not I-maps]
*** A more useful notion is that of minimal I-maps: DAGs that are no longer I-maps if any single edge is deleted
*** The example shows what this means: removing any edge from a minimal I-map immediately renders it not an I-map

26 BayesNet Definition Revisited
A Bayesian network is a pair (G,P)
P factorizes over G
P is specified as a set of CPDs associated with G's nodes
Additional requirement: G is a minimal I-map for P
*** We had this definition of a Bayesian network, requiring P to factorize over G
*** It is standard to additionally require the graph G to be a minimal I-map of P

27 Constructing Minimal I-Maps
Reverse factorization theorem: P(X1,…,Xn) = ∏i P(Xi | Pa(Xi)) ⟹ G is an I-map of P
Algorithm for constructing a minimal I-map:
Fix an ordering of nodes X1,…,Xn
Select parents of Xi as a minimal subset of {X1,…,Xi-1} such that Ind(Xi ; {X1,…,Xi-1} − Pa(Xi) | Pa(Xi))
(Outline of) proof of minimal I-map:
I-map: the factorization above holds by construction
Minimal: by construction, removing one edge destroys the factorization
*** The reverse factorization theorem says that if P decomposes according to the graph structure then G is an I-map
*** This suggests an algorithm for finding minimal I-maps: fix an ordering and construct the graph so the property holds, selecting a minimal parent subset
*** It is easy to show this yields a minimal I-map: both the I-map property and minimality follow from the construction of G
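The construction can be sketched with an independence oracle standing in for queries to P (the oracle interface and all names here are my assumptions; smallest candidate parent sets are tried first):

```python
from itertools import combinations

def minimal_imap(order, ind):
    """Minimal I-map for a fixed ordering. ind(x, rest, cond) is an oracle
    returning True iff Ind(x ; rest | cond) holds in P."""
    parents = {}
    for i, x in enumerate(order):
        preds = order[:i]
        # choose a minimal parent set U with Ind(x ; preds - U | U)
        for k in range(len(preds) + 1):
            chosen = next((U for U in combinations(preds, k)
                           if ind(x, [v for v in preds if v not in U], list(U))),
                          None)
            if chosen is not None:
                parents[x] = list(chosen)
                break
    return parents
```

With an oracle for the v-structure A → C ← B (whose only nontrivial independence is Ind(A;B)), the ordering A,B,C recovers exactly that structure.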

28 Non-Uniqueness of Minimal I-Map
Applying the same I-map construction process with different orders can lead to different structures
Assume: I(G) = I(P) for the original graph over A, B, C, D, E
Order: E,C,D,A,B
[Figure: the original graph and the minimal I-map constructed under this order]
The resulting graph encodes different independence assumptions (different skeletons; e.g., Ind(A;B) holds only in the graph on the left)
*** The ordering can result in different I-maps, with different conditional independence assumptions
*** Example: assume this graph really represents all the conditional independencies in P
*** Assume we select the node ordering E,C,D,A,B
*** We start by adding C, and since there is an active path from C to E we must add the edge
*** Then we add D, and must again add direct edges from C and E
*** Then we add A, so we must add edges from C and D
*** Then we add B, and must add an edge from D but also from A because of the v-structure
*** The resulting model does not have the same independence assumptions: the skeletons differ; look at Ind(A;B), which holds in the graph on the left

29 Choosing Order
Drastic effects on the complexity of the minimal I-map graph
Heuristic: use causal order
*** Choosing the order can have dramatic effects on the complexity of the resulting graph
*** We normally want to choose an ordering consistent with the causal ordering of the variables; we will revisit this issue in later classes

30 Perfect Maps
G is a perfect map (P-map) for P if I(P) = I(G)
Does every distribution have a P-map?
No: independencies may be encoded in the CPD, e.g., Ind(X;Y | Z=1)
No: some structures cannot be represented in a BN
Example: independencies in P are Ind(A;D | B,C) and Ind(B;C | A,D)
[Figure left: BN over A,B,C,D in which Ind(B;C | A,D) does not hold]
[Figure right: BN over A,B,C,D in which Ind(A;D) also holds]
*** P-maps capture all the independencies in the distribution
*** Does every distribution have a P-map?
*** One reason why not: independencies encoded in the CPD structure
*** Another reason why not: some structures cannot be represented in a BN
*** The example on the left does not satisfy one of the independence assumptions in P
*** The example on the right satisfies the independence assumptions but also forces another one that we do not want

31 Finding a Perfect Map
If P has a P-map, can we find it?
Not uniquely, since I-equivalent graphs are indistinguishable
Thus, represent the I-equivalence class and return it
Recall I-equivalence
Necessary and sufficient: same skeleton and same set of immoralities
Finding P-maps
Step I: Find the skeleton
Step II: Find the immoralities
Step III: Direct constrained edges
*** P-maps do not always exist, but when they do, can we find them?
*** Not uniquely, because of I-equivalence
*** We need a representation of I-equivalence classes of graphs, and the algorithm returns this as output
*** Recall that for I-equivalence we had a necessary and sufficient condition: same skeleton and same set of immoralities
*** So the algorithm for finding a P-map starts by finding the skeleton and then the immoralities

32 Step I: Identifying the Skeleton
Query P for Ind(X;Y | Z)
If there is no Z for which Ind(X;Y | Z) holds, then X→Y or Y→X is in G*
Proof:
Assume no such Z exists, and G* has neither X→Y nor Y→X
Then we can find a set Z such that every path from X to Y is blocked
Then G* implies Ind(X;Y | Z), and since G* is a P-map, Ind(X;Y | Z) holds in P
Contradiction
Algorithm: for each pair X,Y query all Z; X–Y is in the skeleton if no separating Z is found
If graph in-degree is bounded by d, running time is O(n^(2d+2)), since if no direct edge exists then Ind(X;Y | Pa(X), Pa(Y))
*** We can query P for independence relationships
*** If we query all possible Z and find no Z for which the independence holds, then X and Y must be connected
*** Otherwise, if they are not connected, we can find a blocking set, which implies independence in G* and hence in P, contradicting the assumption
*** This suggests an algorithm: for each pair, query all Z and connect X and Y if no separating Z is found
*** If each variable has in-degree at most d, the total running time is as shown: there are O(n^2) pairs, and for each we consider all candidate sets of size at most 2d out of the remaining n-2 variables
*** NOTE: the running time bound holds because, if X and Y are non-adjacent, we are guaranteed to find the independence once we hit the set of parents of both
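Step I can be sketched with the same oracle interface (names and the separating-set size cap are my assumptions; per the slide, sets of size up to 2d suffice when in-degree is bounded by d):

```python
from itertools import combinations

def find_skeleton(nodes, ind, max_sep):
    """X-Y is a skeleton edge iff no conditioning set Z with |Z| <= max_sep
    gives Ind(X;Y | Z). ind(x, y, Z) is an independence oracle for P."""
    edges = set()
    for x, y in combinations(nodes, 2):
        others = [n for n in nodes if n not in (x, y)]
        separated = any(ind(x, y, set(Z))
                        for k in range(max_sep + 1)
                        for Z in combinations(others, k))
        if not separated:
            edges.add(frozenset((x, y)))
    return edges
```

For a chain A → B → C, where only Ind(A;C | B) holds, the recovered skeleton is {A–B, B–C}.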

33 Step II: Identifying Immoralities
Find all X–Z–Y triplets where X–Y is not in the skeleton
X → Z ← Y is a potential immorality
If there is no W such that Z ∈ W and Ind(X;Y | W), then X → Z ← Y is an immorality
Proof:
Assume no such W exists but X–Z–Y is not an immorality
Then either X → Z → Y or X ← Z ← Y or X ← Z → Y exists
But then we can block X–Z–Y by observing Z
Then, since X and Y are not connected, we can find a W that includes Z such that Ind(X;Y | W)
Contradiction
Algorithm: for each candidate triplet, declare X → Z ← Y an immorality if no W is found that contains Z with Ind(X;Y | W)
If graph in-degree is bounded by d, running time is O(n^(2d+3))
If some W exists with Ind(X;Y | W) and X → Z ← Y is not immoral, then Z ∈ W
*** Find all potential triplets
*** The claim: if there is no W that contains Z and separates X and Y, then we must have an immorality
*** Otherwise it is one of the other three structures, in which case by d-separation the path is blocked by Z, and because X and Y are not connected we can find a separating set that includes Z
*** This contradicts what we started with, since we found such a W
*** The algorithm finds all candidate triplets and searches for the separating sets; with in-degree bounded by d this gives the stated bound
*** NOTE: the reason for the bound is that if W exists and there is no immorality, then W must contain Z, otherwise the path through Z is active. So it is enough to find any separating W, since it will contain Z, and we know the parents suffice
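Step II can be sketched on top of the recovered skeleton (again with a hypothetical oracle interface and size cap of my choosing):

```python
from itertools import combinations

def find_immoralities(nodes, skel, ind, max_sep):
    """For each X-Z-Y with X,Y non-adjacent in skel: it is an immorality
    X->Z<-Y iff no W with Z in W and |W| <= max_sep gives Ind(X;Y | W).
    skel: set of frozenset edges; ind(x, y, W) is an independence oracle."""
    result = []
    for z in nodes:
        nbrs = [n for n in nodes if frozenset((n, z)) in skel]
        for x, y in combinations(nbrs, 2):
            if frozenset((x, y)) in skel:
                continue                      # X,Y adjacent: not a candidate
            others = [n for n in nodes if n not in (x, y)]
            sep_through_z = any(z in W and ind(x, y, set(W))
                                for k in range(max_sep + 1)
                                for W in combinations(others, k))
            if not sep_through_z:
                result.append((x, z, y))
    return result
```

On the v-structure A → C ← B, where A and B are separated only by the empty set, the triple A–C–B is correctly reported as an immorality.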

34 Step III: Direct Constrained Edges
If the skeleton has k undirected edges, there are at most 2^k graphs
Given the skeleton and immoralities, are there additional constraints on the edges?
[Figure: an original BN over A,B,C,D, one I-equivalent graph, and one non-equivalent graph; here the equivalence class is a singleton]
*** As it turns out, we can sometimes direct additional edges beyond those in the immoralities
*** The example shows a case where the equivalence class is a singleton

35 Step III: Direct Constrained Edges
Local constraints for directing edges
[Figure: three local rules over triples of nodes A, B, C (and D)]
*** The first constraint arises because otherwise we would create another immorality
*** The second constraint arises because otherwise we would create a cycle
*** The third arises because if we instead direct the edge from D to A, then to avoid cycles we must direct B to A and also C to A, which would create an immorality B → A ← C

36 Step III: Direct Constrained Edges
Algorithm: iteratively direct edges by the three local rules
Guaranteed to converge, since each step directs an edge
The algorithm is sound and complete
Proof strategy for completeness: show that for any edge left undirected, we can find two I-equivalent graphs, one for each possible direction
*** Based on these constraints, design an algorithm that repeatedly applies the rules and directs edges accordingly
*** Perhaps surprisingly, the algorithm is complete (soundness is clear, because every operation is locally sound)
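A sketch of the rule-application loop, under my own encoding and covering only the first two rules (avoid a new immorality; avoid a cycle), with the third rule omitted for brevity:

```python
def orient_edges(directed, undirected):
    """directed: set of (a, b) meaning a -> b; undirected: set of frozensets.
    Repeatedly applies two local rules until no further edge can be directed."""
    def adjacent(u, v):
        return ((u, v) in directed or (v, u) in directed
                or frozenset((u, v)) in undirected)
    changed = True
    while changed:
        changed = False
        for e in list(undirected):
            for x, y in (tuple(e), tuple(e)[::-1]):
                # Rule 1: a -> x with a,y non-adjacent => x -> y (no new immorality)
                r1 = any(a != y and not adjacent(a, y)
                         for (a, t) in directed if t == x)
                # Rule 2: x -> w -> y already directed => x -> y (no cycle)
                r2 = any((x, w) in directed and (w, y) in directed
                         for w in {t for (_, t) in directed})
                if r1 or r2:
                    undirected.discard(e)
                    directed.add((x, y))
                    changed = True
                    break
    return directed, undirected
```

For example, given the immorality A → C ← B and an undirected edge C–D, rule 1 forces C → D, since D → C would create a new immorality at C.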

37 Summary
Local Markov assumptions: the basic BN independencies
d-separation: all independencies via the graph structure
G is an I-map of P if and only if P factorizes over G
I-equivalence: graphs with identical independencies
Minimal I-map
All distributions have I-maps (sometimes more than one)
A minimal I-map does not capture all independencies in P
Perfect map: not every distribution P has one
PDAGs
Compact representation of I-equivalence classes
Algorithm for finding PDAGs

