
1 Exact Inference on Graphical Models Samson Cheung

2 Outline: What is inference? Overview. Preliminaries. Three general algorithms for inference: Elimination Algorithm, Belief Propagation, Junction Tree.

3 What is inference? Given a fully specified joint distribution (a "database"), inference means querying information about some random variables given knowledge of others. Evidence: observed values x̄_E; query: what do we know about X_F given x̄_E?
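
As a concrete illustration, here is a minimal brute-force sketch in Python/NumPy (the joint table and variable names are invented placeholders, not from the talk): store a small joint distribution as an array and answer marginal and conditional queries by summing out variables. This exhaustive approach is exactly what the rest of the talk tries to beat, since the table has N^K entries for K variables with N symbols each.

```python
import numpy as np

# Hypothetical joint P(X1, X2, X3) over three binary variables,
# stored as a 2x2x2 array that sums to 1.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# Marginal query: P(x1) = sum over x2, x3 of P(x1, x2, x3).
p_x1 = joint.sum(axis=(1, 2))

# Conditional query with evidence X3 = 1:
# P(x1 | x3=1) = P(x1, x3=1) / P(x3=1).
p_x1_given_x3 = joint[:, :, 1].sum(axis=1)
p_x1_given_x3 /= p_x1_given_x3.sum()

print(p_x1, p_x1_given_x3)
```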

4 Conditional/Marginal Probability. Example: visual tracking - you compute the conditional distribution of the target state to quantify the uncertainty in your tracking. Evidence: x̄_E; query: the conditional of X_F.

5 Maximum A Posteriori (MAP) Estimate. Evidence: x̄_E; query: the most likely value of X_F. Example: error control in communication - we care only about the decoded symbol, since computing the full error probability is impractical at high bandwidth.

6 Inference is not easy. Computing marginals or the MAP requires global communication! Marginal: P(p,q) = Σ_{G\{p,q}} p(G). Potential: ψ(p,q) = exp(-|p-q|). (Figure: a grid of pixels with one evidence node.)

7 Outline: What is inference? Overview. Preliminaries. Three general algorithms for inference: Elimination Algorithm, Belief Propagation, Junction Tree.

8 Inference Algorithms (taxonomy, flattened from the slide's diagram). General inference algorithms are either EXACT or APPROXIMATE. Exact: the Elimination Algorithm and the Junction Tree for general graphs (NP-hard in general), and Belief Propagation for polytrees; typical at 10-100 nodes (expert systems, diagnostics, simulation). Approximate: 1. Iterated Conditional Modes 2. EM 3. Mean field 4. Variational techniques 5. Structured variational techniques 6. Monte Carlo 7. Expectation Propagation 8. Loopy belief propagation; typical at >1000 nodes (image processing, vision, physics).

9 Outline: What is inference? Overview. Preliminaries. Three general algorithms for inference: Elimination Algorithm, Belief Propagation, Junction Tree.

10 Introducing evidence. Inference = summing or maxing over "part" of the joint distribution. In order not to be sidetracked by the evidence nodes, we roll them into the joint by considering p(x) δ(x_E, x̄_E), i.e. multiplying in an indicator that clamps X_E to its observed value x̄_E. Hence we can sum or max over the entire joint distribution; calculating a marginal then yields p(x_F, x̄_E), which is proportional to the conditional p(x_F | x̄_E).
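
A minimal sketch of this trick, assuming the same toy table representation as before (the helper name clamp_evidence is hypothetical): clamp the evidence by multiplying the joint with an indicator, then sum over the entire joint as if there were no evidence.

```python
import numpy as np

def clamp_evidence(joint, axis, observed):
    """Multiply the joint by delta(x_E = observed) along one axis."""
    delta = np.zeros(joint.shape[axis])
    delta[observed] = 1.0
    shape = [1] * joint.ndim
    shape[axis] = -1
    return joint * delta.reshape(shape)

# Toy joint over (X1, X2, X3); evidence X3 = 1.
rng = np.random.default_rng(1)
joint = rng.random((2, 2, 2))
joint /= joint.sum()
clamped = clamp_evidence(joint, axis=2, observed=1)

# Summing the *entire* clamped joint over x2, x3 yields p(x1, x3=1);
# normalizing gives p(x1 | x3=1).
p = clamped.sum(axis=(1, 2))
print(p / p.sum())
```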

11 Moralization. Every directed graph can be converted to an undirected one by linking up parents that share a common child ("marrying the parents") and dropping edge directions, so from here on we deal only with undirected graphs. (Figure: the directed graph on X1..X6 with factors P(X1) P(X2|X1) P(X3|X1) P(X4|X1) P(X5|X2,X3) P(X6|X3,X4) becomes the moral graph with potentials ψ(X1,X2,X3) ψ(X1,X3,X4) ψ(X2,X3,X5) ψ(X3,X4,X6).)
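
A small sketch of moralization (the dict-of-parents representation is my own choice, not from the slides): for each child, marry all pairs of its parents, then keep every edge undirected.

```python
from itertools import combinations

def moralize(parents):
    """parents: dict node -> list of parent nodes (a DAG).
    Returns the moral graph as a set of undirected edges."""
    edges = set()
    for child, pa in parents.items():
        # Keep the original edges, now undirected.
        for p in pa:
            edges.add(frozenset((p, child)))
        # Marry the parents: connect every pair of co-parents.
        for u, v in combinations(pa, 2):
            edges.add(frozenset((u, v)))
    return edges

# The six-node example from the slide.
dag = {1: [], 2: [1], 3: [1], 4: [1], 5: [2, 3], 6: [3, 4]}
print(sorted(tuple(sorted(e)) for e in moralize(dag)))
```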

12 Adding edges is "okay". The distribution of an undirected graph can ALWAYS be expressed on the same graph with extra edges added. A graph with more edges loses conditional independence information (okay for inference, not good for parameter estimation) and uses more storage, because the potential tables over the larger cliques are bigger. (Figure: merging ψ(X1,X2,X3) and ψ(X1,X3,X4) into a single potential ψ(X1,X2,X3,X4).)

13 Undirected graph and clique graph. Clique graph: each node is a clique from the parametrization; there is an edge between two cliques if they share common variables. (Figure: the graph on X1..X9 with potentials ψ_C1(X1,X2,X3), ψ_C2(X1,X3,X4), ψ_C3(X2,X3,X5), ψ_C4(X3,X4,X6), ψ_C5(X7,X8,X9), ψ_C6(X1,X7), and its clique graph.) Separator: C1 ∩ C3 = {X2, X3}.

14 Outline: What is inference? Overview. Preliminaries. Three general algorithms for inference: Elimination Algorithm, Belief Propagation, Junction Tree.

15 Computing a Marginal. To get P(x1) we need to marginalize out x2, x3, x4, x5: done naively, that is on the order of N^5 operations (N = the number of symbols for each r.v.). Can we do better? (Figure: the five-node example graph on X1..X5.)

16 Elimination (Marginalization) Order. Marginalize in the order x5, x4, x3, x2. Per-step costs: eliminating x5 takes O(N^3) time and O(N^2) storage; x4: O(N^2) and O(N); x3: O(N^3) and O(N^2); x2: O(N^2) and O(N). Overall: complexity O(K N^3), storage O(N^2), where K = number of r.v.s.
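
A runnable variable-elimination sketch of exactly this ordering, on the five-node example graph with edges (1,2), (1,3), (2,5), (3,5), (2,4); the potential values are random placeholders:

```python
import numpy as np

N = 3  # symbols per variable
rng = np.random.default_rng(2)
# Pairwise potentials of the example graph.
psi = {e: rng.random((N, N)) for e in [(1, 2), (1, 3), (2, 5), (3, 5), (2, 4)]}

# Eliminate x5: m5(x2, x3) = sum_x5 psi(x2, x5) psi(x3, x5)
m5 = np.einsum('be,ce->bc', psi[(2, 5)], psi[(3, 5)])
# Eliminate x4: m4(x2) = sum_x4 psi(x2, x4)
m4 = psi[(2, 4)].sum(axis=1)
# Eliminate x3: m3(x1, x2) = sum_x3 psi(x1, x3) m5(x2, x3)
m3 = np.einsum('ac,bc->ab', psi[(1, 3)], m5)
# Eliminate x2: m2(x1) = sum_x2 psi(x1, x2) m4(x2) m3(x1, x2)
m2 = np.einsum('ab,b,ab->a', psi[(1, 2)], m4, m3)

p_x1 = m2 / m2.sum()  # normalize to get P(x1)
print(p_x1)
```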

17 MAP is the same: just replace summation with max. Note: all the m's differ from the marginal case, and you need to remember the best configuration (the argmax) as you go.

18 Graphical Interpretation. Track the list of active potential functions on the example graph as each node is eliminated:
Start: ψ_C1(X1,X2), ψ_C2(X1,X3), ψ_C3(X2,X5), ψ_C4(X3,X5), ψ_C5(X2,X4).
Kill X5: ψ_C1(X1,X2), ψ_C2(X1,X3), ψ_C5(X2,X4), m5(X2,X3).
Kill X4: ψ_C1(X1,X2), ψ_C2(X1,X3), m5(X2,X3), m4(X2).
Kill X3: ψ_C1(X1,X2), m4(X2), m3(X1,X2).
Kill X2: m2(X1).

19 First real link to graph theory. Reconstituted graph = the graph that contains all the extra edges added during elimination. It depends on the elimination order! The complexity of graph elimination is O(N^W), where W is the size of the largest clique in the reconstituted graph. Proof: exercise. (Figure: the reconstituted graph of the five-node example.)

20 Finding the optimal order. Minimizing the largest clique size turns out to be NP-hard [1]. Greedy algorithm [2]: 1. Find the node v in G with the fewest neighbours. 2. Eliminate v and connect all its neighbours. 3. Go back to 1 until G becomes a clique. Current best techniques use simulated annealing [3] or approximation algorithms [4]. [1] S. Arnborg, D.G. Corneil, A. Proskurowski, "Complexity of finding embeddings in a k-tree," SIAM J. Algebraic and Discrete Methods 8 (1987) 277–284. [2] D. Rose, "Triangulated graphs and the elimination process," J. Math. Anal. Appl. 32 (1974) 597–609. [3] U. Kjærulff, "Triangulation of graphs - algorithms giving small total state space," Technical Report R 90-09, Department of Mathematics and Computer Science, Aalborg University, Denmark, 1990. [4] A. Becker, D. Geiger, "A sufficiently fast algorithm for finding close to optimal clique trees," Artificial Intelligence 125 (2001) 3–17.
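
A sketch of the greedy (min-degree) algorithm above, on a plain adjacency-set representation of the example graph (my own encoding, not from the slides):

```python
def min_degree_order(adj):
    """adj: dict node -> set of neighbours (undirected graph).
    Returns an elimination order chosen greedily by smallest degree."""
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    order = []
    while adj:
        # 1. Pick the node with the fewest neighbours.
        v = min(adj, key=lambda u: len(adj[u]))
        nbrs = adj.pop(v)
        order.append(v)
        # 2. Eliminate v: connect all its remaining neighbours pairwise.
        for u in nbrs:
            adj[u].discard(v)
            adj[u] |= (nbrs - {u})
    return order

graph = {1: {2, 3}, 2: {1, 4, 5}, 3: {1, 5}, 4: {2}, 5: {2, 3}}
print(min_degree_order(graph))  # [4, 1, 2, 3, 5] with this tie-breaking
```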

21 This is serious. One of the most commonly used graphical models in vision is the Markov Random Field over pixels I(x,y), with pairwise potential ψ(p,q) = exp(-|p-q|). Try to find an elimination order for this model: in the small grid shown the largest clique created already has 4 nodes, and it grows linearly with the grid dimension.

22 Outline: What is inference? Overview. Preliminaries. Three general algorithms for inference: Elimination Algorithm, Belief Propagation, Junction Tree.

23 What about other marginals? We have just computed P(X1). What if we also need P(X2) or P(X5)? Definitely, some of the calculation can be reused: e.g. the message m5(X2,X3) is the same when computing P(X1) and P(X2)! (Figure: the five-node example graph.)

24 Focus on trees. We now focus on tree-like structures (why trees? see the next slide). A directed tree becomes the same undirected tree after moralization: every node has at most one parent, so there are no parents to marry.

25 Why trees? No moralization is necessary. There is a natural elimination ordering with the query node as root: depth-first search, eliminating all children before their parent. All sub-trees with no evidence nodes can be ignored. (Why? Exercise for the undirected graph.)

26 Elimination on trees. When we eliminate node j, the new potential function must be a function of x_i alone, where i is j's parent. Why nothing else? Nothing in the sub-tree below j (already eliminated); nothing from other sub-trees, since the graph is a tree; only i survives, through ψ_ij, which relates i and j. Think of the new potential function as a message m_ji(x_i) from node j to node i.

27 What is in the message? The message is created by summing over x_j the product of the local evidence ψ^E(x_j), the edge potential ψ(x_i, x_j), and all earlier messages m_kj(x_j) sent to j:
m_ji(x_i) = Σ_{x_j} ψ^E(x_j) ψ(x_i, x_j) Π_{k ∈ c(j)} m_kj(x_j),
where c(j) = children of node j, and ψ^E(x_j) = δ(x_j, x̄_j) if j is an evidence node, 1 otherwise.

28 Elimination = passing messages upward. After passing the messages up to the query (root) node r, we compute the conditional p(x_r | x̄_E) ∝ ψ^E(x_r) Π_{k ∈ c(r)} m_kr(x_r). What about answering other queries? (Figure: a tree whose query node needs 3 incoming messages.)

29 Messages are reused! We can compute all possible messages in only double the work of a single query, then take the product of the relevant messages to get any marginal. Even though the naive approach (rerunning Elimination per query) computes N(N-1) messages to find the marginals of all N nodes, there are only 2(N-1) distinct messages: one per direction per edge.

30 Computing all possible messages. Idea: respect the following Message-Passing Protocol: a node can send a message to a neighbour only when it has received messages from all its other neighbours. The protocol is realizable: designate one node (arbitrarily) as the root, collect messages inward to the root, then distribute them back out to the leaves.

31 Belief Propagation. (Figure: a node j with neighbours i, k, l; messages m_kj and m_lj flow into j, m_ji flows on to i on the collect pass, and m_ij, m_jk, m_jl flow back on the distribute pass.)

32 Belief Propagation (sum-product). 1. Choose a root node (arbitrarily or as the first query node). 2. If j is an evidence node, ψ^E(x_j) = δ(x_j, x̄_j), else ψ^E(x_j) = 1. 3. Pass messages from the leaves up to the root and then back down using m_ji(x_i) = Σ_{x_j} ψ^E(x_j) ψ(x_i, x_j) Π_{k ∈ N(j)\i} m_kj(x_j). 4. Given the messages, compute marginals using p(x_i | x̄_E) ∝ ψ^E(x_i) Π_{k ∈ N(i)} m_ki(x_i).
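
A compact sketch of sum-product BP on an undirected tree; the data structures and the two-pass scheduling are my own rendering of the four steps above, and the chain example at the bottom is invented:

```python
import numpy as np

def belief_propagation(adj, psi, psi_e, root):
    """adj: dict node -> list of neighbours (a tree).
    psi[(i, j)]: edge potential, array indexed [x_i, x_j].
    psi_e[i]: local evidence vector for node i.
    Returns all marginals p(x_i | evidence)."""
    msgs = {}  # (j, i) -> message m_ji(x_i)

    def send(j, i):  # m_ji(xi) = sum_xj psiE(xj) psi(xi,xj) prod_k m_kj(xj)
        prod = psi_e[j].copy()
        for k in adj[j]:
            if k != i:
                prod *= msgs[(k, j)]
        edge = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
        msgs[(j, i)] = edge @ prod

    def collect(j, parent):  # leaves-to-root pass
        for k in adj[j]:
            if k != parent:
                collect(k, j)
        if parent is not None:
            send(j, parent)

    def distribute(j, parent):  # root-to-leaves pass
        for k in adj[j]:
            if k != parent:
                send(j, k)
                distribute(k, j)

    collect(root, None)
    distribute(root, None)
    marginals = {}
    for i in adj:
        b = psi_e[i].copy()
        for k in adj[i]:
            b *= msgs[(k, i)]
        marginals[i] = b / b.sum()
    return marginals

# Tiny chain 1-2-3 with random potentials and evidence clamping X3 = 0:
adj = {1: [2], 2: [1, 3], 3: [2]}
rng = np.random.default_rng(3)
psi = {(1, 2): rng.random((2, 2)), (2, 3): rng.random((2, 2))}
psi_e = {1: np.ones(2), 2: np.ones(2), 3: np.array([1.0, 0.0])}
print(belief_propagation(adj, psi, psi_e, root=1))
```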

33 MAP is the same (max-product). 1. Choose a root node arbitrarily. 2. If j is an evidence node, ψ^E(x_j) = δ(x_j, x̄_j), else ψ^E(x_j) = 1. 3. Pass messages from leaves up to root using m_ji(x_i) = max_{x_j} ψ^E(x_j) ψ(x_i, x_j) Π_{k ∈ c(j)} m_kj(x_j). 4. Remember which choice x_j = x_j* yielded the maximum. 5. Given the messages, compute the max value at any node i: max_{x_i} ψ^E(x_i) Π_{k ∈ N(i)} m_ki(x_i). 6. Retrace steps from the root back to the leaves, recalling the best x_j, to get the maximizing argument (configuration) x*.
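
Only the message update changes relative to the sum-product sketch: replace the sum over x_j with a max and record the argmax for the backtrace. A hypothetical fragment (a drop-in replacement for the send function above):

```python
import numpy as np

def send_max(msgs, adj, psi, psi_e, j, i):
    """Max-product version of the message update: replace the
    sum over x_j with a max, and record the argmax for backtracking."""
    prod = psi_e[j].copy()
    for k in adj[j]:
        if k != i:
            prod *= msgs[(k, j)]
    edge = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
    scores = edge * prod[None, :]        # scores[x_i, x_j]
    msgs[(j, i)] = scores.max(axis=1)    # m_ji(x_i) = max_xj ...
    return scores.argmax(axis=1)         # best x_j for each x_i (backtrace)
```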

34 "Tree"-like graphs work too. Pearl (1988) showed that BP works on factor trees; see Jordan, Chapter 4, for details. (Figure: a graph that is not a directed tree, its moralized version, and the corresponding factor graph, which IS a tree.)

35 Outline: What is inference? Overview. Preliminaries. Three general algorithms for inference: Elimination Algorithm, Belief Propagation, Junction Tree.

36 What about arbitrary graphs? BP only works on tree-like graphs. Question: is there an algorithm for general graphs? Also, after BP we get the marginal of each INDIVIDUAL random variable, but the graph is characterized by its cliques. Question: can we get the marginal of every clique?

37 Mini-outline. Back to the reconstituted graph. Three equivalent concepts: triangulated graph (easy to validate), decomposable graph (link to probability), junction tree (computational inference). Then the Junction Tree Algorithm, and an example.

38 Back to the reconstituted graph. The reconstituted graph is a very important type of graph: a triangulated (chordal) graph. Definition: a graph is triangulated if every loop with 4 or more nodes has a chord. All trees are triangulated; all cliques are triangulated. (Figure: a triangulated and a non-triangulated example.)
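
Chordality is indeed easy to check. One simple (not the fastest) sketch: a graph is triangulated iff we can repeatedly delete a simplicial vertex, i.e. one whose neighbourhood is a clique, until the graph is empty; production code would use maximum cardinality search instead. The encoding is my own:

```python
from itertools import combinations

def is_chordal(adj):
    """adj: dict node -> set of neighbours.
    Greedily eliminate simplicial vertices (neighbourhood is a clique);
    the graph is chordal iff this empties the graph."""
    adj = {v: set(nb) for v, nb in adj.items()}
    changed = True
    while adj and changed:
        changed = False
        for v in list(adj):
            nbrs = adj[v]
            if all(u in adj[w] for u, w in combinations(nbrs, 2)):
                for u in nbrs:          # v is simplicial: remove it
                    adj[u].discard(v)
                del adj[v]
                changed = True
    return not adj

# A 4-cycle has no chord, so it is not triangulated:
c4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(is_chordal(c4))                          # False
c4[1].add(3); c4[3].add(1)                     # add the chord 1-3
print(is_chordal(c4))                          # True
```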

39 Proof. Claim: for any N-node graph, the reconstituted graph after elimination is triangulated. Proof by induction: 1. N = 1: trivial. 2. Assume the claim holds for N = k. 3. For N = k+1, let v be the first node eliminated; eliminating v connects all of v's neighbours into a clique. The rest of the run is an elimination of the remaining k nodes, whose reconstituted graph is triangulated by the assumption, and any cycle of length 4 or more through v passes through two of v's neighbours, which are now adjacent, so it has a chord.

40 Lessons from graph theory. Graph coloring problem: find the smallest number of vertex colors such that adjacent vertices receive different colors; that number is the chromatic number. Sample application 1, scheduling: nodes = tasks, edges = pairs of incompatible tasks; each color class is a set of tasks that can run in parallel. Sample application 2, communication: nodes = symbols, edges = pairs of symbols that may produce the same output due to transmission error; the largest set of vertices with the same color (an independent set) = the number of symbols that can be sent reliably.

41 Lesson from graph theory. Determining the chromatic number χ is NP-hard in general, but not for the class of graphs called perfect graphs, in which χ equals ω, the size of the largest clique. Triangulated graphs are an important type of perfect graph. (The Strong Perfect Graph Conjecture was proved in 2002, with a 148-page proof!) Bottom line: triangulated graphs are "algorithmically friendly" - it is very easy to check whether a graph is triangulated and to compute such properties on it.

42 Link to Probability: Graph Decomposition. Definition: given a graph G, a triple (A, B, S) with Vertex(G) = A ∪ B ∪ S is a decomposition of G if 1. S separates A and B (i.e. every path from a ∈ A to b ∈ B must pass through S), and 2. S is a clique. Definition: G is decomposable if 1. G is complete, or 2. there exists a decomposition (A, B, S) of G such that A ∪ S and B ∪ S are decomposable. (Figure: sets A and B separated by S.)

43 What’s the big deal? Decomposable graph can be parametrized by marginals! If G is decomposable, then where C 1,C 2, …,C N are cliques in G, and S 1,S 2, …,S N-1 are (special) separators between cliques. Notice there are one less separators than cliques. Equivalently, we can say that G can parameterized by marginals p(x C ) and ratios of marginals, p(x C )/p(x S )

44 This is not true in general. If a joint could be expressed as a product of marginals and ratios of marginals, at least one of the potentials would itself be a marginal. For the four-node cycle on A, B, C, D (which is not decomposable), forcing such a form leaves a residual factor f(X_AB) that is not a constant, so the parametrization by marginals fails. (Figure: the cycle A-B-C-D.)

45 Proof. Proof by induction: G can be decomposed into (A, B, S), where A ∪ S and B ∪ S are decomposable, S separates A and B, and S is complete. Key fact: every clique of G is a subset of either A ∪ S or B ∪ S. (Figure: the decomposition (A, B, S).)

46 Continued: recursively apply the factorization to A ∪ S and B ∪ S, using the induction assumption.

47 So what? Triangulated graphs are nice algorithmically; decomposable graphs are parametrized by marginals. It turns out that Triangulated Graph ⇔ Decomposable Graph.

48 Decomposable ⇒ Triangulated. Prove by induction: if G is complete, it is triangulated. Otherwise, take a decomposition (A, B, S). By the induction assumption, G_{A∪S} and G_{B∪S} are triangulated, so every cycle inside either of them has a chord. The remaining case is a cycle that spans A, B and S. Such a cycle must cross S at two non-consecutive vertices, and since S is complete, those two vertices are joined by an edge, i.e. a chord! QED

49 Triangulated ⇒ Decomposable. Prove by induction. Let G be a triangulated graph with N nodes; we show G can be decomposed into (A, B, S). If G is complete, we are done. If not, choose non-adjacent nodes a and b and let S = a smallest set that intersects every path between a and b, A = all nodes of G\S reachable from a, B = all nodes of G\S reachable from b. Clearly A and B are separated by S.

50 Triangulated ⇒ Decomposable (continued). It remains to show that S is complete. Consider arbitrary c, d ∈ S. There is a path a-...-c-...-b on which c is the only node from S (if not, S would not be minimal, since c could be moved into A or B); similarly there is such a path through d. Joining the two paths gives a cycle through c and d. Since G is triangulated, the cycle has a chord, and since S separates A and B, the chord lies entirely inside A ∪ S or B ∪ S. Use the chord to shrink the cycle and repeat: eventually there must be a chord between c and d themselves, hence every pair in S is adjacent and S is complete.

51 Recap. The reconstituted graph is triangulated. Triangulated = decomposable. The joint probability on a decomposable graph can be factorized into marginals and ratios of marginals. Not very constructive so far, though: how can we get from LOCAL POTENTIALS to the GLOBAL MARGINAL parametrization?

52 How to get from a local description to a global description? Consider a decomposable graph with decomposition (V\S, W\S, S). At the beginning we have the local representation p(x) ∝ ψ(X_V) ψ(X_W) / φ(X_S); we want to transform the potentials so that ψ*(X_V) = p(X_V), ψ*(X_W) = p(X_W), φ*(X_S) = p(X_S).

53 Message passing, Phase 1 (Collect). Initialization: φ(X_S) = 1. Collect step:
φ*(X_S) = Σ_{V\S} ψ(X_V), then ψ*(X_W) = ψ(X_W) φ*(X_S) / φ(X_S).
Why? P(X_W) ∝ Σ_{V\S} ψ(X_W) ψ(X_V) / φ(X_S) = ψ(X_W) Σ_{V\S} ψ(X_V) / φ(X_S) = ψ(X_W) φ*(X_S) / φ(X_S) = ψ*(X_W).
The joint is preserved: ψ*(X_W) ψ(X_V) / φ*(X_S) = [ψ(X_W) φ*(X_S) / φ(X_S)] ψ(X_V) / φ*(X_S) = ψ(X_W) ψ(X_V) / φ(X_S) = the joint distribution.

54 Message Passing, Phase 2 (Distribute). Distribute step:
φ**(X_S) = Σ_{W\S} ψ*(X_W) ∝ P(X_S), then ψ*(X_V) = ψ(X_V) φ**(X_S) / φ*(X_S).
Why? P(X_V) ∝ Σ_{W\S} ψ(X_W) ψ(X_V) / φ(X_S) = Σ_{W\S} ψ*(X_W) ψ(X_V) / φ*(X_S) = ψ(X_V) φ**(X_S) / φ*(X_S) = ψ*(X_V).
The joint is again preserved: ψ*(X_W) ψ*(X_V) / φ**(X_S) = ψ*(X_W) ψ(X_V) / φ*(X_S) = ψ(X_W) ψ(X_V) / φ(X_S) = the joint distribution.

55 Relating the Local Description to Message Passing. How do we extend this message passing to a general graph (in terms of cliques)? We need a recursive decomposition in terms of cliques. Answer: the Junction Tree.

56 A decomposable graph induces a tree on the clique graph. Let C_1, C_2, …, C_N be the maximal cliques in G. Every C_i must lie entirely inside V or inside W. Since all C_i are maximal, there are cliques C_j ⊆ V and C_k ⊆ W such that S ⊆ C_j and S ⊆ C_k; put an edge between C_j and C_k. Recursively decompose V and W; no loop can form, because of the separation property. The final clique graph is a tree, called a Junction Tree.

57 Properties of a Junction Tree. For any two cliques C_i and C_j, every clique on the unique path between them in the junction tree contains C_i ∩ C_j: each branch along the path decomposes the graph, so the separator S on the branch must contain C_i ∩ C_j, and so must the clique nodes on either side of the branch. Equivalently, for any variable X, the clique nodes containing X induce a (connected) sub-tree of the junction tree.

58 Junction Tree ⇒ Decomposable Graph. Definition: a junction tree is a spanning tree of the clique graph such that all nodes on the path between any two cliques C, D contain C ∩ D. Claim: if a graph has a junction tree, it is decomposable. Prove by induction (simple base case). For any separator S, the right and left sub-trees of S, R and L, are junction trees, so the corresponding subgraphs are decomposable by the induction assumption. S is complete, so it remains to show that S separates R and L. If not, there exists an edge (X, Y) with X ∈ R and Y ∈ L but X, Y ∉ S. But (X, Y) must belong to some clique, hence Y ∈ R or X ∈ L; then by the junction tree property, Y ∈ S or X ∈ S. Contradiction.

59 How to find a junction tree? It is not easy to construct one directly from the definition or the decomposition. Define the edge weight w(s) = number of variables in the separator s. Each variable X induces a sub-tree of the junction tree, so the separators containing X correspond to the edges of that sub-tree, and there are (number of cliques containing X) - 1 of them. The total weight of a junction tree is therefore
Σ_X [Σ_C 1{X ∈ C} − 1] = Σ_X Σ_C 1{X ∈ C} − N = Σ_C Σ_X 1{X ∈ C} − N = Σ_C |C| − N,
where N is the number of variables.

60 A junction tree is a maximum-weight spanning clique tree. Consider any spanning clique tree; its total weight is
Σ_S |S| = Σ_S Σ_X 1{X ∈ S} = Σ_X Σ_S 1{X ∈ S} ≤ Σ_X [Σ_C 1{X ∈ C} − 1] = Σ_C |C| − N = the weight of a junction tree.
The inequality holds per variable X: all separators containing X must lie on edges of the subgraph spanned by the cliques containing X, and any tree can contain at most Σ_C 1{X ∈ C} − 1 edges from this subgraph. Hence a maximum-weight spanning tree of the clique graph attains the bound and is a junction tree.
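
This suggests the standard construction (the code is my own sketch): weight each clique-graph edge by the separator size |C_i ∩ C_j| and extract a maximum-weight spanning tree, here with Kruskal's algorithm on the clique list from slide 13:

```python
def junction_tree(cliques):
    """cliques: list of frozensets. Returns the edges of a
    maximum-weight spanning tree of the clique graph,
    weighted by separator size |Ci & Cj|."""
    parent = list(range(len(cliques)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    # All candidate edges, heaviest separators first.
    edges = sorted(
        ((len(ci & cj), i, j)
         for i, ci in enumerate(cliques)
         for j, cj in enumerate(cliques) if i < j and ci & cj),
        reverse=True)
    tree = []
    for w, i, j in edges:              # Kruskal's algorithm
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j, cliques[i] & cliques[j]))
    return tree

cliques = [frozenset(c) for c in
           [{1, 2, 3}, {1, 3, 4}, {2, 3, 5}, {3, 4, 6}, {7, 8, 9}, {1, 7}]]
print(junction_tree(cliques))
```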

61 Example. (Figure: the nine-node graph from slide 13 with cliques C1 = {X1,X2,X3}, C2 = {X1,X3,X4}, C3 = {X2,X3,X5}, C4 = {X3,X4,X6}, C5 = {X7,X8,X9}, C6 = {X1,X7}; its clique graph; and the resulting junction tree, in which the cliques containing X3 form a connected sub-tree.)

62 So what? The Junction Tree Algorithm: 1. Moralize if needed. 2. Triangulate using any triangulation algorithm. 3. Form the clique graph (clique nodes and separator nodes). 4. Compute the junction tree. 5. Initialize all separator potentials to 1. 6. Phase 1: collect from the children. Message from child C: φ*(X_S) = Σ_{C\S} ψ(X_C). Update at parent P: ψ*(X_P) = ψ(X_P) Π_S φ*(X_S) / φ(X_S). 7. Phase 2: distribute to the children. Message from parent P: φ**(X_S) = Σ_{P\S} ψ*(X_P). Update at child C: ψ**(X_C) = ψ*(X_C) φ**(X_S) / φ*(X_S). A sketch of one collect/distribute exchange follows.
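
A two-clique sketch of one collect/distribute exchange in this style (HUGIN-style updates; the array shapes and variable names are my own illustrative choices), matching the (V, W, S) derivation on slides 53-54:

```python
import numpy as np

# Cliques V = (A, B) and W = (B, C) with separator S = {B}; all binary.
rng = np.random.default_rng(4)
psi_V = rng.random((2, 2))   # psi_V[a, b]
psi_W = rng.random((2, 2))   # psi_W[b, c]
phi_S = np.ones(2)           # separator potential, initialized to 1

# Phase 1 (collect): message from V into S, then update W.
phi_S_star = psi_V.sum(axis=0)                     # sum over V\S = {A}
psi_W = psi_W * (phi_S_star / phi_S)[:, None]
# Phase 2 (distribute): message from W back into S, then update V.
phi_S_2star = psi_W.sum(axis=1)                    # sum over W\S = {C}
psi_V = psi_V * (phi_S_2star / phi_S_star)[None, :]

# After both phases the potentials are the (unnormalized) marginals:
joint = psi_V[:, :, None] * psi_W[None, :, :] / phi_S_2star[None, :, None]
assert np.allclose(psi_V, joint.sum(axis=2))       # psi_V = p(A, B)
assert np.allclose(psi_W, joint.sum(axis=0))       # psi_W = p(B, C)
```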

63 Example: the CHILD network, a Bayesian network for diagnosing congenital heart disease in newborns (figure only).

64 Step 1: Moralization (figure only).

65 Step 2: Triangulation (figure only).

66 Step 3: Form the junction tree (figure only).

67 Step 5: Two-phase propagation, with evidence "LVH report = Yes" (figure only).

68 Conclusions. Inference: marginals and MAP. Elimination works one node at a time; its complexity is a function of the size of the largest clique in the reconstituted graph, and finding a triangulation that yields small cliques is NP-hard. Belief Propagation handles all nodes at once and is exact on trees. Junction Tree ⇔ Decomposable graph ⇔ Triangulated graph.

