Dependency Models – abstraction of Probability distributions

1 Dependency Models – abstraction of Probability distributions
A dependency model M over a finite set of elements U is a rule that assigns truth values to the predicate IM(X,Z,Y), where X, Y, and Z are disjoint subsets of U.
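As a data structure, a dependency model is nothing but a truth assignment to triplets of disjoint subsets. A minimal sketch in Python (the class and its extensional storage scheme are illustrative, not from the lecture):

class DependencyModel:
    """A dependency model M over U: a rule assigning truth values to I_M(X, Z, Y)."""
    def __init__(self, universe):
        self.universe = frozenset(universe)
        self.true_triplets = set()   # triplets (X, Z, Y) for which I_M holds

    def assert_independence(self, X, Z, Y):
        X, Z, Y = frozenset(X), frozenset(Z), frozenset(Y)
        assert not (X & Y or X & Z or Y & Z), "X, Y, Z must be disjoint"
        self.true_triplets.add((X, Z, Y))
        self.true_triplets.add((Y, Z, X))   # close under symmetry

    def I(self, X, Z, Y):
        """Truth value of the predicate I_M(X, Z, Y)."""
        return (frozenset(X), frozenset(Z), frozenset(Y)) in self.true_triplets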

2 Important Properties of Independence
๐ผ ๐‘‹,๐‘,๐‘Œ โ†’ ๐ผ ๐‘‹,๐‘,๐‘Œโˆช๐‘Š โ†”๐ผ(๐‘‹,๐‘โˆช๐‘Œ,๐‘Š) ๐ผ ๐‘‹,๐‘,๐‘Œ โ†’{ ๐ผ ๐‘‹,๐‘,๐‘Œโˆช๐‘Š โ†’๐ผ(๐‘‹,๐‘โˆช๐‘Œ,๐‘Š) โˆง [ ยฌ๐ผ ๐‘‹,๐‘,๐‘Œโˆช๐‘Š โ†’ยฌ๐ผ ๐‘‹,๐‘โˆช๐‘Œ,๐‘Š ]} When learning an irrelevant fact Y, the relevance status of every fact W remains unchanged.

3 Undirected Graphs can represent Independence
Let G be an undirected graph (V,E). Define IG(X,Z,Y), for disjoint sets of nodes X, Y, and Z, to hold if and only if every path between a node in X and a node in Y passes through a node in Z. The textbook also uses the notation <X|Z|Y>G.
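Vertex separation can be checked mechanically: delete Z from the graph and test whether X can still reach Y. A minimal sketch in Python (adjacency-dictionary representation; the example graph assumes the picture on the next slide is the four-cycle M1-F1-M2-F2):

from collections import deque

def separated(adj, X, Z, Y):
    """I_G(X, Z, Y): True iff every path from X to Y passes through Z.
    adj maps each node to the set of its neighbors."""
    X, Z, Y = set(X), set(Z), set(Y)
    visited, queue = set(X), deque(X)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in Z or v in visited:   # Z blocks the path at v
                continue
            if v in Y:
                return False             # unblocked path reached Y
            visited.add(v)
            queue.append(v)
    return True

adj = {"M1": {"F1", "F2"}, "M2": {"F1", "F2"},
       "F1": {"M1", "M2"}, "F2": {"M1", "M2"}}
assert separated(adj, {"M1"}, {"F1", "F2"}, {"M2"})
assert separated(adj, {"F1"}, {"M1", "M2"}, {"F2"})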

4 M = { IG(M1,{F1,F2},M2), IG(F1,{M1,M2},F2) + symmetry }

5 Definitions: 1. G=(U,E) is an I-map of a model M over U if IG(X,Z,Y) ⇒ IM(X,Z,Y) for all disjoint subsets X, Y, Z of U. 2. G is a perfect map of M if IG(X,Z,Y) ⇔ IM(X,Z,Y) for all disjoint subsets X, Y, Z of U. 3. M is graph-isomorph if there exists a graph G such that G is a perfect map of M.

6 Undirected Graphs cannot always be perfect maps
Strong Union: IG(X, Z, Y) ⇒ IG(X, Z∪W, Y). This property holds for graph separation but not for conditional independence in probability. So if G is an I-map of P, it can represent IP(X, Z, Y) but cannot represent the negation ¬IP(X, Z∪W, Y). Needed property of separation for Theorem 3 below: IG(X, S, Y) ⇒ [ IG(X, S∪Y, δ) or IG(Y, S∪X, δ) ].
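A standard counterexample (not from the slide): let X and Y be independent fair coins and W = X xor Y. Then IP(X, ∅, Y) holds while IP(X, {W}, Y) fails, so strong union is violated by probabilistic independence. A quick numerical check in Python:

from itertools import product

# Joint distribution over (x, y, w) with X, Y fair coins and W = X xor Y.
P = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

def p(**fixed):
    """Marginal probability of the partial assignment in `fixed`."""
    keys = ("x", "y", "w")
    return sum(pr for point, pr in P.items()
               if all(point[keys.index(k)] == v for k, v in fixed.items()))

# I_P(X, {}, Y): X and Y are marginally independent.
assert abs(p(x=1, y=1) - p(x=1) * p(y=1)) < 1e-12
# But not I_P(X, {W}, Y): P(x,y|w) differs from P(x|w) * P(y|w).
assert abs(p(x=1, y=1, w=0) / p(w=0)
           - (p(x=1, w=0) / p(w=0)) * (p(y=1, w=0) / p(w=0))) > 0.1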

7 Undirected Graphs as I-map
Definition: An undirected graph G is a minimal I-map of a dependency model M if deleting any edge of G makes G cease to be an I-map of M. Such a graph is called a Markov network of M.

8 THEOREM 3 [Pearl and Paz 1985]: Every dependency model M satisfying symmetry, decomposition, and intersection has a unique minimal I-map G0 = (U, E0), where the vertices U are the elements of M and the edges E0 are defined by (α,β) ∉ E0 ⇔ IM(α, U-{α,β}, β) (3.11). Proof: by descending induction, as done for THEOREM 2 in the HMWs.
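Eq. (3.11) translates directly into a construction procedure: one independence query per pair of elements, keeping the edge exactly when the pairwise statement fails. A sketch in Python, assuming a hypothetical oracle I_M(X, Z, Y) for the model M:

from itertools import combinations

def markov_network(U, I_M):
    """Build the edge set E0 of the minimal I-map G0 via Eq. (3.11)."""
    U = frozenset(U)
    edges = set()
    for a, b in combinations(sorted(U), 2):
        # (a, b) is an edge iff a and b are dependent given all other elements.
        if not I_M({a}, U - {a, b}, {b}):
            edges.add(frozenset({a, b}))
    return edges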

9 Proof: First we prove that G0 is an I-map of M, namely, that
for every three disjoint non-empty subsets of U, IG(X,S,Y) ⇒ IM(X,S,Y) (Eq. 3.II).
(i) Let n = |U|. For |S| = n-2, Eq. 3.II follows from (3.11).
(ii) Assume the claim holds for every S' of size k ≤ |S'| ≤ n-2, and let S be any set with |S| = k-1 such that IG(X,S,Y). We consider two cases: X∪S∪Y equals U, or X∪S∪Y is a proper subset of U.
(iii) If X∪S∪Y = U, then X or Y has at least two elements. Assume it is Y, and write Y = Y' ∪ {γ}. From IG(X,S,Y) we get, by graph separation, IG(X, S∪{γ}, Y') and IG(X, S∪Y', {γ}). The separating sets have size at least k, so by the induction hypothesis IM(X, S∪{γ}, Y') and IM(X, S∪Y', {γ}), which imply, by Intersection, IM(X,S,Y), as claimed.

10 (iv) If X∪S∪Y ≠ U, then there exists an element δ not in X∪S∪Y.
From IG(X,S,Y) we get IG(X, S∪{δ}, Y), and also [ IG(X, S∪Y, {δ}) or IG({δ}, S∪X, Y) ].
The separating sets all have size at least k, and hence by the induction hypothesis either IM(X, S∪{δ}, Y) and IM(X, S∪Y, {δ}), or IM(X, S∪{δ}, Y) and IM({δ}, S∪X, Y).
Applying Intersection and then Decomposition in either case yields IM(X,S,Y), as claimed.

11 Next we prove that the graph G0 is minimal, namely, that no edge can be
removed from G0 without it ceasing to be an I-map of M. Deleting an edge (α,β) leaves α separated from β by U-{α,β}. So if the remaining graph were still an I-map, then IM(α, U-{α,β}, β) would hold, and by the definition of G0 (Eq. 3.11) the edge (α,β) would not have been in G0 in the first place. Hence no edge can be removed, and G0 is edge-minimal.
Finally, we claim that G0 is the unique Markov network of M. Let G be a minimal I-map of M. Every edge (α,β) satisfying Eq. 3.11 must be removed from G, since G0 shows the graph remains an I-map without it, so a minimal I-map cannot keep it. No further edge can be removed without violating the I-map property, because deleting an edge (α,β) with ¬IM(α, U-{α,β}, β) would display a separation that does not hold in M. Hence G has exactly the edge set E0, and G0 is the unique Markov network of M.

12 Pairwise Basis of a Markov Network
The set ๏“ of all independence statements defined by ฮฑ,ฮฒ โˆ‰ ๐ธ0 โ†” IM(ฮฑ, U-{ฮฑ, ฮฒ}, ฮฒ) (3.11) is called the pairwise basis of G. This set consists of at most n(n-1) independence statements, one per missing edge, that define the Markov network of M.

13 Neighboring Basis of a Markov Network
The set ๏“ of all independence statements defined by (3.12) is called the neighboring basis of G. This set consists of n independence statements, one per vertex, that define the neighbors of each vertex and hence defines a graph. Is this graph the Unique Markov network G0 of M ???

14 Alternative Construction of the Markov Network
THEOREM 4 [Pearl and Paz 1985]: Every element α ∈ U in a dependency model M satisfying symmetry, decomposition, intersection, and weak union has a unique Markov boundary BI(α). Moreover, BI(α) equals the set BG0(α) of vertices neighboring α in the minimal I-map G0 (the Markov network). Proof: (i) First we show that BI(α) is unique. Take two Markov blankets B1 and B2, so that I(α, B1, U-B1-{α}) and I(α, B2, U-B2-{α}). By Intersection, also I(α, B1∩B2, U-(B1∩B2)-{α}) {HMW!}. Hence the blankets are closed under intersection, and BI(α), the intersection of the set BL*I(α) of all blankets, is itself a blanket, the unique minimal one.

15 Proof continues: It remains to show the second claim, that the graph G1, constructed by connecting each vertex α to its boundary BI(α), is the same as G0, constructed from the edge definition (Eq. 3.11). (ii) For every Markov boundary BI(α) and every element β not in it, weak union gives: I(α, BI(α), {β} ∪ rest-of-vertices) ⇒ I(α, U-{α,β}, β). Hence every edge absent from G1 is also absent from G0; in other words, the set of neighbors of each vertex α in G0 is a subset of the set of neighbors in G1: BG0(α) ⊆ BI(α). However, equality actually holds: BG0(α) is by itself a Markov blanket of α, due to the I-map property of G0, and the boundary BI(α) is the intersection of all blankets, so BI(α) ⊆ BG0(α). Hence equality holds. Thus G0 = G1.

16 Insufficiency of local tests for non-strictly-positive probability distributions
Consider the case X=Y=Z=W (four identical variables). What is a Markov network for it? Is it unique? The Intersection property is critical!

17 Markov Networks with probabilities
1. Define for each (maximal) clique Ci a non-negative function g(Ci), called the compatibility function. 2. Take the product Πi g(Ci) over all cliques. 3. Define P(x1,…,xn) = K·Πi g(Ci), where K is a normalizing constant (the inverse of the sum of the product over all value assignments).
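A minimal sketch of steps 1-3 in Python for three binary variables; the chain-shaped clique structure and the numeric compatibility values are illustrative assumptions, not from the slides:

from itertools import product

variables = ("X1", "X2", "X3")
# Maximal cliques of an assumed chain X1 - X2 - X3.
cliques = (("X1", "X2"), ("X2", "X3"))
# Step 1: an arbitrary non-negative compatibility function per clique.
g = {("X1", "X2"): lambda a: 2.0 if a["X1"] == a["X2"] else 1.0,
     ("X2", "X3"): lambda a: 3.0 if a["X2"] == a["X3"] else 1.0}

def unnormalized(a):
    prod = 1.0
    for C in cliques:          # step 2: product over all cliques
        prod *= g[C](a)
    return prod

assignments = [dict(zip(variables, vals))
               for vals in product((0, 1), repeat=len(variables))]
K = 1.0 / sum(unnormalized(a) for a in assignments)   # step 3: normalize

P = {tuple(a[v] for v in variables): K * unnormalized(a) for a in assignments}
assert abs(sum(P.values()) - 1.0) < 1e-12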

18 P(๏ก, BG(๏ก), U- ๏ก -BG(๏ก)) = f1(๏ก,BG(๏ก)) f2 (U-๏ก) (*)
Theorem 6 [Hammersley and Clifford 1971]: If a probability function P is formed by a normalized product of non-negative functions on the cliques of G, then G is an I-map of P.
Proof: It suffices to show (Theorem 4) that the neighborhood basis of G holds in P, namely, that I(α, BG(α), U-α-BG(α)) holds in P, or just that:
P(α, BG(α), U-α-BG(α)) = f1(α, BG(α))·f2(U-α) (*)
Let Jα stand for the set of indices marking all cliques in G that include α. Then
P(x1,…,xn) = K · Π{j∈Jα} g(Cj) · Π{j∉Jα} g(Cj) = f1(α, BG(α))·f2(U-α).
The first product contains only α and variables adjacent to α, because each Cj with j ∈ Jα is a clique containing α. The second product does not contain α. Hence (*) holds.

19 Note: The theorem and its converse hold also for extreme probabilities but the presented proof does not apply due to the use of Intersection in Theorem 4. Theorem X: Every undirected graph G has a distribution P such that G is a perfect map of P. (In light of previous notes, it must have the form of a product over cliques).

20 Proof Sketch of Theorem X
Theorem Y (Completeness): Given a graph G, for every independence statement σ = I(α,Z,β) that does NOT hold in G, there exists a probability distribution Pσ that satisfies all independence statements that hold in the graph G and does not satisfy σ = I(α,Z,β). Proof of Theorem Y: Pick a path in G between α and β that does not contain a node from Z. Define a probability distribution that is a perfect map of this path (a chain) and multiply it by arbitrary marginal probabilities on all other nodes not on the path, forming Pσ. Sketch for Theorem X (Strong Completeness): "Multiply" all the Pσ (via an Armstrong relation) to obtain a P that is a perfect map of G. (Continue here with "Proof by intimidation".)

21 Interesting conclusion of Theorem Y:
All independence statements that follow, for strictly positive probability, from the neighborhood basis are derivable via symmetry, decomposition, intersection, and weak union. These axioms are (sound and) complete for neighborhood bases. These axioms are (sound and) complete also for pairwise bases. In fact, for saturated statements (those whose span of variables is all of U), conditional independence and vertex separation have exactly the same axioms. Isn't that amazing? (See paper P2.)

22 Drawback: Interpreting the Links is not simple
Another drawback is the difficulty with extreme probabilities: there is no local test for I-mapness. Both drawbacks disappear in the class of decomposable models, which are a special case of Bayesian networks.

23 Decomposable Models Example: Markov Chains and Markov Trees
Assume the chain X1 - X2 - X3 - X4 below is an I-map of some P(x1,x2,x3,x4) and was constructed using the methods we just described. The "compatibility functions" on the links can be easily interpreted in the case of chains, and the same holds for trees. This idea actually works for all chordal graphs.
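For instance, assuming the pictured chain is X1 - X2 - X3 - X4, the chain's Markov property gives
P(x1,x2,x3,x4) = P(x1,x2) · P(x3|x2) · P(x4|x3),
so the links can carry g(x1,x2) = P(x1,x2), g(x2,x3) = P(x3|x2), g(x3,x4) = P(x4|x3): every compatibility function is a (conditional) marginal, and the product needs no further normalization (K = 1).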

24 Chordal Graphs

25 Interpretation of the links
[Figure: a chordal graph with three overlapping cliques, Clique 1, Clique 2, Clique 3.]
A probability distribution that can be written as a product of low-order marginals divided by a product of low-order marginals is said to be decomposable.
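Concretely, if the three pictured cliques form a join tree with separators S12 = C1 ∩ C2 and S23 = C2 ∩ C3 (an illustrative reading of the figure), the decomposable form is
P(U) = P(C1) · P(C2) · P(C3) / ( P(S12) · P(S23) ).
For the chain X1 - X2 - X3 - X4 this reads P = P(x1,x2) · P(x2,x3) · P(x3,x4) / ( P(x2) · P(x3) ), which agrees with the factorization on slide 23.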

26 Importance of Decomposability
When assigning compatibility functions, it suffices to use marginal probabilities on the cliques, making sure only that they are locally consistent. Marginals can be assessed by experts or estimated directly from data.


28 Main results on d-separation
The definition of ID(X, Z, Y) (every path between X and Y is blocked by Z: it contains either a non-converging node that is in Z, or a converging node that is not in Z and has no descendant in Z) is such that:
Soundness [Theorem 9]: ID(X, Z, Y) = yes implies that IP(X, Z, Y) follows from the boundary Basis(D).
Completeness [Theorem 10]: ID(X, Z, Y) = no implies that IP(X, Z, Y) does not follow from the boundary Basis(D).
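One mechanical test for ID(X, Z, Y), equivalent to d-separation though not the procedure on these slides, is the ancestral moral graph criterion: restrict the DAG to the ancestors of X∪Y∪Z, moralize it (connect every pair of parents of a common child and drop edge directions), delete Z, and test ordinary vertex separation. A sketch in Python, reusing separated() from the sketch under slide 3:

from itertools import combinations

def d_separated(parents, X, Z, Y):
    """I_D(X, Z, Y) in the DAG given by `parents` (node -> set of parents)."""
    X, Z, Y = set(X), set(Z), set(Y)
    # Ancestral set of X ∪ Y ∪ Z.
    anc, stack = set(), list(X | Y | Z)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents[v])
    # Moralize: link each node to its parents, and marry co-parents.
    adj = {v: set() for v in anc}
    for v in anc:
        for p in parents[v]:
            adj[v].add(p); adj[p].add(v)
        for p, q in combinations(parents[v], 2):
            adj[p].add(q); adj[q].add(p)
    return separated(adj, X, Z, Y)   # vertex separation in the moral graph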

Claim 1: Each vertex Xi in a Bayesian network is d-separated from all its non-descendants, given its parents pai. Proof: Each vertex Xi is connected to its non-descendants via its parents or via its descendants. All paths via its parents are blocked because pai are given, and all paths via descendants are blocked because they pass through converging edges → Z ← where Z is not given. Hence, by the definition of d-separation, the claim holds: ID(Xi, pai, non-descendantsi).

30 Claim 2: Each topological order d in a BN entails the same set of independence assumptions.
Proof: By Claim 1, ID(Xi, pai, non-descendantsi) holds. For each topological order d on {1,…,n}, ID(Xd(i), pad(i), non-descendantsd(i)) holds as well. From soundness (Theorem 9), IP(Xd(i), pad(i), non-descendantsd(i)) holds as well. By the decomposition property of conditional independence, IP(Xd(i), pad(i), S) holds for every S that is a subset of non-descendantsd(i). Hence, Xi is independent, given its parents, also of S = {all variables that precede Xi in an arbitrary topological order d}.

