
1 Global Approximate Inference Eran Segal Weizmann Institute

2 General Approximate Inference Strategy
- Define a class of simpler distributions Q
- Search for a particular instance in Q that is "close" to P
- Answer queries using inference in Q

3 Cluster Graph
A cluster graph K for factors F is an undirected graph:
- Nodes are associated with subsets of variables C_i ⊆ U
- The graph is family preserving: each factor φ ∈ F is associated with one node C_i such that Scope[φ] ⊆ C_i
- Each edge C_i – C_j is associated with a sepset S_ij = C_i ∩ C_j
A cluster tree over factors F that satisfies the running intersection property is called a clique tree.

4 Clique Tree Inference
[Figure: the Student Bayesian network over C, D, I, G, S, L, J, H with CPDs P(C), P(D|C), P(G|I,D), P(I), P(S|I), P(L|G), P(J|L,S), P(H|G,J), and a clique tree with cliques 1:{C,D}, 2:{G,I,D}, 3:{G,S,I}, 4:{G,J,S,L}, 5:{H,G,J} and sepsets {D}, {G,I}, {G,S}, {G,J}]
Verify: the graph is a tree and family preserving, and it satisfies the running intersection property.

5 Message Passing: Belief Propagation
Initialize the clique tree:
- For each clique C_i set β_i ← ∏_{φ: α(φ)=i} φ
- For each edge C_i – C_j set μ_ij ← 1
While uninformed cliques exist:
- Select an edge C_i – C_j and send a message from C_i to C_j:
  - Marginalize the clique over the sepset: σ_i→j ← Σ_{C_i \ S_ij} β_i
  - Update the belief at C_j: β_j ← β_j · σ_i→j / μ_ij
  - Update the sepset at C_i – C_j: μ_ij ← σ_i→j
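A runnable sketch of one calibration pass of this scheme on a two-clique tree {A,B} – {B,C} with sepset {B}; the tree, the table values, and the helper functions are invented for illustration, not taken from the slides:

```python
# Belief-update message passing on a tiny clique tree; variables binary.
import numpy as np

def marginalize(table, vars_, keep):
    """Sum out every variable not in `keep`; returns (table, kept vars)."""
    axes = tuple(i for i, v in enumerate(vars_) if v not in keep)
    return table.sum(axis=axes), [v for v in vars_ if v in keep]

def multiply_in(table, vars_, msg, msg_vars):
    """Multiply a smaller message into a clique table by broadcasting."""
    order = [msg_vars.index(v) for v in vars_ if v in msg_vars]
    shape = [table.shape[i] if v in msg_vars else 1
             for i, v in enumerate(vars_)]
    return table * msg.transpose(order).reshape(shape)

# beta_i <- product of the factors assigned to clique i; mu_{1,2} <- 1
beta1, vars1 = np.array([[0.5, 1.0], [2.0, 1.5]]), ['A', 'B']
beta2, vars2 = np.array([[1.0, 0.5], [0.5, 2.0]]), ['B', 'C']
mu = np.ones(2)

# Message C1 -> C2: sigma = marginal of beta1 over the sepset {B},
# then beta2 <- beta2 * sigma / mu and mu <- sigma
sigma, svars = marginalize(beta1, vars1, {'B'})
beta2 = multiply_in(beta2, vars2, sigma / mu, svars)
mu = sigma

# Message C2 -> C1: the same update in the other direction
sigma, svars = marginalize(beta2, vars2, {'B'})
beta1 = multiply_in(beta1, vars1, sigma / mu, svars)
mu = sigma

# Calibration: both cliques now agree on the marginal over B
print(marginalize(beta1, vars1, {'B'})[0])  # [3.75 6.25]
print(marginalize(beta2, vars2, {'B'})[0])  # [3.75 6.25]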

6 Clique Tree Invariant
Belief propagation can be viewed as reparameterizing the joint distribution:
- Upon calibration we showed that the clique and sepset beliefs satisfy the invariant P̃_F(U) = ∏_i β_i / ∏_ij μ_ij (reconstructed below)
- Initially this invariant holds, since each β_i is the product of its assigned factors and each μ_ij = 1
- At each update step the invariant is also maintained:
  - A message only changes β_j and μ_ij, so most terms remain unchanged
  - We need to show that the ratio β_j / μ_ij is unchanged by the update
  - But this is exactly the message passing step
⇒ Belief propagation reparameterizes P̃_F at each step
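The invariant's equations did not survive the transcript; in standard notation (P̃_F the unnormalized product of factors, β_i clique beliefs, μ_ij sepset beliefs) they read:

```latex
\tilde{P}_F(\mathcal{U}) \;=\; \frac{\prod_i \beta_i}{\prod_{(i,j)} \mu_{i,j}},
\qquad\text{initially}\quad
\beta_i = \prod_{\phi:\,\alpha(\phi)=i}\phi, \quad \mu_{i,j} = 1,
```

and for a message from C_i to C_j with σ_i→j = Σ_{C_i \ S_ij} β_i:

```latex
\frac{\beta_j^{\text{new}}}{\mu_{i,j}^{\text{new}}}
\;=\; \frac{\beta_j \cdot \sigma_{i\to j}/\mu_{i,j}}{\sigma_{i\to j}}
\;=\; \frac{\beta_j}{\mu_{i,j}} .
```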

7 Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation
  - Define algorithm
  - Constructing cluster graphs
  - Analyze approximation guarantees
- Propagation with approximate messages
  - Factorized messages
  - Approximate message propagation
- Structured variational approximations

8 The Energy Functional
Suppose we want to approximate P with Q:
- Represent P by its factors: P_F(U) = (1/Z) ∏_{φ∈F} φ
- Define the energy functional F[P_F,Q] (reconstructed below)
Then:
- Minimizing D(Q||P_F) is equivalent to maximizing F[P_F,Q]
- ln Z ≥ F[P_F,Q] (since D(Q||P_F) ≥ 0)
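The functional's definition is missing from the transcript; its standard form (writing P̃_F for the unnormalized product of factors and H_Q for the entropy of Q) is:

```latex
F[\tilde{P}_F, Q]
\;=\; \mathbb{E}_Q\big[\ln \tilde{P}_F(\mathcal{U})\big] + H_Q(\mathcal{U})
\;=\; \sum_{\phi\in F}\mathbb{E}_Q[\ln\phi] + H_Q(\mathcal{U}),
\qquad
\ln Z \;=\; F[\tilde{P}_F, Q] + D(Q\,\|\,P_F).
```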

9 Inference as Optimization
We show that inference can be viewed as maximizing the energy functional F[P_F,Q]:
- Define a distribution Q over clique potentials
- Transform F[P_F,Q] to an equivalent factored form F'[P_F,Q]
- Show that if Q maximizes F'[P_F,Q] subject to constraints under which Q represents calibrated potentials, then there exist factors that satisfy the inference message passing equations

10 Defining Q
Recall that throughout BP the invariant P̃_F(U) = ∏_i β_i / ∏_ij μ_ij holds. Define Q as the reparameterization of P given by the beliefs, Q = {β_i} ∪ {μ_ij}, so that Q(U) ∝ ∏_i β_i / ∏_ij μ_ij. Since then D(Q||P_F) = 0, we show that calibrating Q is equivalent to maximizing F[P_F,Q].

11 Factored Energy Functional
Define the factored energy functional F'[P_F,Q] as a sum of local expectation and entropy terms over the cliques and sepsets (reconstructed below).
Theorem: if Q is a set of calibrated potentials for clique tree T, then F[P_F,Q] = F'[P_F,Q].
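The slide's formula did not survive; the standard definition, with ψ_i the initial potential of clique C_i, is:

```latex
F'[\tilde{P}_F, Q]
\;=\; \sum_i \mathbb{E}_{C_i\sim\beta_i}[\ln\psi_i]
\;+\; \sum_i H_{\beta_i}(C_i)
\;-\; \sum_{(i,j)\in T} H_{\mu_{i,j}}(S_{i,j}),
\qquad
\psi_i = \prod_{\phi:\,\alpha(\phi)=i}\phi .
```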

12 Inference as Optimization
Optimization task: find Q that maximizes F'[P_F,Q] subject to the constraints that the sepset beliefs are the marginals of the clique beliefs and that each belief is normalized (see below).
Theorem: the fixed points of Q for the above optimization are exactly the BP message passing equations.
- Suggests an iterative optimization procedure
- Identical to belief propagation!
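Reconstructed in standard notation, the constraints and the fixed-point characterization are:

```latex
\max_Q\; F'[\tilde{P}_F,Q]
\quad\text{s.t.}\quad
\mu_{i,j} = \sum_{C_i\setminus S_{i,j}} \beta_i \;\;\forall (i,j)\in T,
\qquad
\sum_{C_i} \beta_i = 1 \;\;\forall i,
```

with fixed points given by messages

```latex
\delta_{i\to j} \;\propto\; \sum_{C_i\setminus S_{i,j}} \psi_i
\prod_{k\in \mathrm{Nb}_i\setminus\{j\}} \delta_{k\to i},
\qquad
\beta_i \propto \psi_i \prod_{k\in \mathrm{Nb}_i}\delta_{k\to i},
\qquad
\mu_{i,j} = \delta_{i\to j}\,\delta_{j\to i} .
```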

13 Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation
  - Define algorithm
  - Constructing cluster graphs
  - Analyze approximation guarantees
- Propagation with approximate messages
  - Factorized messages
  - Approximate message propagation
- Structured variational approximations

14 Generalized Belief Propagation
Perform belief propagation in a cluster graph with loops.
Strategy: replace the exact clique tree by a loopy cluster graph and pass messages as usual.
[Figure: a Bayesian network over A, B, C, D; its exact cluster tree {A,B,D} – {B,C,D} with sepset {B,D}; and a loopy cluster graph with clusters {A,B}, {B,C}, {C,D}, {A,D} and sepsets B, C, D, A]

15 Generalized Belief Propagation
Perform belief propagation in a cluster graph with loops.
[Figure: loopy cluster graph with clusters {A,B}, {B,C}, {C,D}, {A,D} and sepsets B, C, D, A]
Inference may be incorrect due to double counting of evidence. Unlike BP on trees:
- Convergence is not guaranteed
- The potentials in the calibrated graph are not guaranteed to be marginals of P

16 Generalized Belief Propagation
Perform belief propagation in a cluster graph with loops.
[Figure: the same loopy cluster graph with clusters {A,B}, {B,C}, {C,D}, {A,D}]

17 Generalized Cluster Graph
Recall that a cluster graph K for factors F is an undirected graph in which nodes are associated with subsets C_i ⊆ U, the graph is family preserving (each factor φ ∈ F is associated with one node C_i such that Scope[φ] ⊆ C_i), and each edge C_i – C_j is associated with a sepset S_ij = C_i ∩ C_j.
A generalized cluster graph K for factors F is defined identically, except that each edge C_i – C_j is associated with a subset S_ij ⊆ C_i ∩ C_j.

18 Generalized Cluster Graph
A generalized cluster graph obeys the running intersection property if for each X ∈ C_i and X ∈ C_j, there is exactly one path between C_i and C_j for which X ∈ S for every subset S along the path.
⇒ All edges associated with X form a tree that spans all the clusters that contain X
Note: some of these clusters may be connected by more than one path
[Figure: loopy cluster graph with clusters {A,B}, {B,C}, {C,D}, {A,D} and sepsets B, C, D, A]

19 Calibrated Cluster Graph
A generalized cluster graph is calibrated if for each edge C_i – C_j the beliefs agree on the sepset: Σ_{C_i \ S_ij} β_i = Σ_{C_j \ S_ij} β_j.
- This is weaker than in clique trees, since S_ij is only a subset of the intersection between C_i and C_j
- If a calibrated cluster graph satisfies the running intersection property, then the marginal on any variable X is the same in every cluster that contains X

20 GBP is Efficient
[Figure: a 3×3 Markov grid network over X_11,...,X_33 and its cluster graph, with one cluster per grid edge (e.g. {X_11,X_12}, {X_11,X_21}, ...) and singleton sepsets]
Note: the cliques of a clique tree for an n×n grid are exponential in n, whereas a round of GBP on this cluster graph passes a constant-size message per edge, i.e. time linear in the size of the grid.

21 Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation
  - Define algorithm
  - Constructing cluster graphs
  - Analyze approximation guarantees
- Propagation with approximate messages
  - Factorized messages
  - Approximate message propagation
- Structured variational approximations

22 Constructing Cluster Graphs
When constructing clique trees, all constructions give the same inference results and differ only in computational complexity. In GBP, different cluster graphs can vary in both computational complexity and approximation quality.

23 Transforming Pairwise MNs
A pairwise Markov network over a graph H has:
- A set of node potentials {φ[X_i] : i = 1,...,n}
- A set of edge potentials {φ[X_i,X_j] : (X_i,X_j) ∈ H}
It can be transformed into a generalized cluster graph with one cluster per edge potential and one singleton cluster per node potential, connecting each edge cluster {X_i,X_j} to the singleton clusters {X_i} and {X_j}; see the sketch below.
Example: [Figure: a 3×3 grid network over X_11,...,X_33 and the resulting cluster graph of pairwise clusters {X_11,X_21}, {X_12,X_22}, ... connected through singleton clusters]
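A small sketch of this transformation for an n×n grid; the names and the dict/list representation are illustrative, not canonical:

```python
# Build the generalized cluster graph of a pairwise Markov network on an
# n x n grid: large clusters {Xi,Xj} for edge potentials, singleton
# clusters {Xi} for node potentials, each edge cluster linked to its two
# singleton clusters with singleton sepsets.
from itertools import product

def grid_cluster_graph(n):
    nodes = [(i, j) for i, j in product(range(n), repeat=2)]
    # grid edges: right and down neighbors
    mn_edges = [((i, j), (i, j + 1)) for i in range(n) for j in range(n - 1)]
    mn_edges += [((i, j), (i + 1, j)) for i in range(n - 1) for j in range(n)]

    clusters = [frozenset([v]) for v in nodes]            # singleton clusters
    clusters += [frozenset([u, v]) for u, v in mn_edges]  # edge clusters
    # cluster-graph edges: (cluster, cluster, sepset)
    cg_edges = []
    for u, v in mn_edges:
        cg_edges.append((frozenset([u, v]), frozenset([u]), frozenset([u])))
        cg_edges.append((frozenset([u, v]), frozenset([v]), frozenset([v])))
    return clusters, cg_edges

clusters, cg_edges = grid_cluster_graph(3)
print(len(clusters), "clusters,", len(cg_edges), "cluster-graph edges")
# For each variable X, the edges whose sepset contains X form a star
# centered at X's singleton cluster, i.e. a tree, so the running
# intersection property holds.
```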

24 Transforming Bayesian Networks
Example (the Bethe approximation):
- One "large" cluster per CPD (here {A,B,C}, {A,B,D}, {B,D,F})
- One singleton cluster per variable (A, B, C, D, F)
- Connect a singleton cluster to a large cluster if the variable appears in the CPD
⇒ The resulting graph obeys the running intersection property
[Figure: a Bayesian network over A, B, C, D, F and the corresponding two-layer cluster graph]

25 Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation
  - Define algorithm
  - Constructing cluster graphs
  - Analyze approximation guarantees
- Propagation with approximate messages
  - Factorized messages
  - Approximate message propagation
- Structured variational approximations

26 Generalized Belief Propagation
GBP maintains the distribution invariant P̃_F(U) = ∏_i β_i / ∏_ij μ_ij (since each message passing step maintains the invariant, exactly as in clique trees).

27 Generalized Belief Propagation
If GBP converges, the cluster graph K is calibrated, and each subtree T of K is calibrated with respect to the distribution P_T(U) that its potentials define: the edge potentials correspond to marginals of P_T(U) (since a calibrated tree's potentials are the marginals of the distribution it represents).

28 Generalized Belief Propagation
⇒ Calibrated cluster graph potentials are not P_F(U) marginals
[Figure: a loopy cluster graph with clusters 1:{A,B}, 2:{B,C}, 3:{C,D}, 4:{A,D}, contrasted with the tree 1:{A,B}, 2:{B,C}, 3:{C,D}]

29 Inference as Optimization (recap)
Optimization task: find Q that maximizes F'[P_F,Q] subject to the calibration constraints.
Theorem: the fixed points of Q for the above optimization are the BP message passing equations.
- Suggests an iterative optimization procedure
- Identical to belief propagation!

30 GBP as Optimization
Optimization task: find Q that maximizes F'[P_F,Q] subject to the marginalization constraints μ_ij = Σ_{C_i \ S_ij} β_i.
Theorem: the fixed points of Q for the above optimization are the GBP message passing equations.
- Note: S_ij is only a subset of the intersection between C_i and C_j
- The iterative optimization procedure is GBP

31 GBP as Optimization
Clique trees:
- F[P_F,Q] = F'[P_F,Q]
- Iterative procedure (BP) guaranteed to converge
- Convergence point represents marginal distributions of P_F
Cluster graphs:
- F[P_F,Q] = F'[P_F,Q] does not hold!
- Iterative procedure (GBP) not guaranteed to converge
- Convergence point does not represent marginal distributions of P_F

32 GBP in Practice
Dealing with non-convergence:
- Often small portions of the network do not converge ⇒ stop inference and use the current beliefs
- Use intelligent message passing scheduling:
  - Tree reparameterization (TRP) selects entire trees and calibrates them while keeping all other beliefs fixed
  - Focus attention on uncalibrated regions of the graph

33 Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation
  - Define algorithm
  - Constructing cluster graphs
  - Analyze approximation guarantees
- Propagation with approximate messages
  - Factorized messages
  - Approximate message propagation
- Structured variational approximations

34 Propagation with Approximate Messages
General idea: perform BP (or GBP) as before, but propagate messages that are only approximate.
Modular approach:
- The general inference scheme remains the same
- Many different approximate message computations can be plugged in

35 Factorized Messages
[Figure: a 3×3 Markov grid network over X_11,...,X_33 and a clique tree with three cliques (1, 2, 3) whose sepsets are full grid columns of three variables]
- Keep the internal structure of the clique tree cliques
- Calibration involves sending messages that are joint over three variables
- Idea: simplify messages using a factored representation
- Example: approximate the column message by a product of univariate factors, e.g. σ_1→2[X_12,X_22,X_32] ≈ φ[X_12] · φ[X_22] · φ[X_32]

36 Computational Savings
Answering queries in cluster 2:
- Exact inference: exponential in the joint space of cluster 2
- Approximate inference with factored messages: notice that the subnetwork of cluster 2 together with its factored incoming messages is a tree, so we can perform efficient exact inference on this subtree to answer queries
[Figure: cluster 2 of the grid clique tree, with factored messages attached to its individual variables]

37 Factor Sets
A factor set φ̃ = {φ_1,...,φ_k} provides a compact representation for the high-dimensional factor φ_1 · ... · φ_k.
Belief propagation with factor sets:
- Multiplication of factor sets is easy: simply take the union of the factors in the factor sets being multiplied
- Marginalization of a factor set requires inference in the simplified network
- Example: computing the message δ_2→3 requires marginalizing cluster 2's factor set, i.e. inference in cluster 2's subnetwork
[Figure: clusters 1, 2, 3 of the grid clique tree with factored messages]
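A minimal sketch of factor-set operations over binary variables; the dict-based factor representation, the potentials, and the variable names are invented for illustration:

```python
# A factor is (scope, table): a tuple of variable names plus a dict from
# assignments to values. A factor set is a list of factors standing for
# their product, which is never expanded unless marginalization forces it.
from itertools import product

def factor_product(f, g):
    """Multiply two factors into one over the union of their scopes."""
    (sf, tf), (sg, tg) = f, g
    scope = sf + tuple(v for v in sg if v not in sf)
    table = {}
    for assign in product([0, 1], repeat=len(scope)):
        a = dict(zip(scope, assign))
        table[assign] = (tf[tuple(a[v] for v in sf)] *
                         tg[tuple(a[v] for v in sg)])
    return scope, table

def sum_out(f, var):
    """Marginalize one variable out of a single factor."""
    scope, table = f
    i = scope.index(var)
    out = {}
    for assign, val in table.items():
        key = assign[:i] + assign[i + 1:]
        out[key] = out.get(key, 0.0) + val
    return scope[:i] + scope[i + 1:], out

vals = list(product([0, 1], repeat=2))
phi1 = (('A', 'B'), dict(zip(vals, [1.0, 2.0, 0.5, 1.5])))
phi2 = (('B', 'C'), dict(zip(vals, [2.0, 1.0, 1.0, 3.0])))

# Multiplying factor sets is just the union of the lists:
factor_set = [phi1, phi2]          # represents phi1 * phi2 compactly
# Summing out A stays factored, since only phi1 mentions A:
fs_without_A = [sum_out(phi1, 'A'), phi2]
# Summing out B needs inference: B is shared, so combine its factors first:
scope, table = sum_out(factor_product(phi1, phi2), 'B')
print(fs_without_A[0][0], scope)   # ('B',) ('A', 'C')
```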

38 Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation
  - Define algorithm
  - Constructing cluster graphs
  - Analyze approximation guarantees
- Propagation with approximate messages
  - Factorized messages
  - Approximate message propagation
- Structured variational approximations

39 Approximate Message Propagation
Input:
- A clique tree (or cluster graph)
- An assignment of the original factors to clusters/cliques
- The factorized form of each cluster/clique, which can be represented by a network for each edge C_i – C_j that specifies the factorization (in the previous examples we assumed an empty network)
Two strategies for approximate message propagation:
- Sum-product message passing
- Belief update messages

40 Sum-Product Propagation
Same propagation scheme as in exact inference:
- Select a root
- Propagate messages towards the root: each cluster collects messages from its neighbors and sends its outgoing message when possible
- Propagate messages back from the root
Each message passing step performs inference on a cluster, and the procedure terminates in a fixed number of iterations.
Note: the final marginals at each variable are not exact.

41 Message Passing: Belief Propagation
Same as BP but with approximate messages:
- Initialize the clique tree: for each clique C_i set β_i ← ∏_{φ: α(φ)=i} φ, and for each edge C_i – C_j set μ_ij ← 1
- While uninformed cliques exist, select an edge C_i – C_j and send a message from C_i to C_j:
  - Marginalize the clique over the sepset: σ_i→j ← Σ_{C_i \ S_ij} β_i (this is the approximation step)
  - Update the belief at C_j: β_j ← β_j · σ_i→j / μ_ij
  - Update the sepset at C_i – C_j: μ_ij ← σ_i→j
The two message passing schemes differ in where approximate inference is performed.

42 Global Approximate Inference
- Inference as optimization
- Generalized Belief Propagation
  - Define algorithm
  - Constructing cluster graphs
  - Analyze approximation guarantees
- Propagation with approximate messages
  - Factorized messages
  - Approximate message propagation
- Structured variational approximations

43 Structured Variational Approximations
- Select a simple family of distributions Q
- Find the Q in this family that maximizes F[P_F,Q]

44 Mean Field Approximation
Q(X) = ∏_i Q(X_i)
- Q loses much of the information of P_F
- The approximation is computationally attractive: every query in Q is simple to compute, and Q is easy to represent
[Figure: P_F is a 3×3 Markov grid network over X_11,...,X_33; Q is the fully disconnected mean field network over the same variables]

45 Mean Field Approximation
The energy functional is easy to compute, even for networks where inference is complex, because under the fully factored Q every expectation decomposes into small local sums (see below).
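The formula on the slide is missing; under Q(X) = ∏_i Q(X_i) the standard form of the energy functional is:

```latex
F[\tilde{P}_F, Q]
\;=\; \sum_{\phi\in F}\mathbb{E}_Q[\ln\phi] \;+\; \sum_i H_{Q}(X_i),
\qquad
\mathbb{E}_Q[\ln\phi]
= \sum_{\boldsymbol{c}\in \mathrm{Val}(\mathrm{Scope}[\phi])}
\Big(\prod_{X_i\in \mathrm{Scope}[\phi]} Q(c_i)\Big)\,\ln\phi(\boldsymbol{c}) .
```

Each expectation is a sum over the (small) scope of a single factor, and the entropy decomposes over individual variables.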

46 Mean Field Maximization
Maximizing the energy functional of mean field: find Q(X) = ∏_i Q(X_i) that maximizes F[P_F,Q], subject to, for all i: Σ_{x_i} Q(x_i) = 1.

47 Mean Field Maximization
Theorem: Q(X_i) is a stationary point of the mean field optimization given Q(X_1),...,Q(X_{i-1}),Q(X_{i+1}),...,Q(X_n) if and only if it satisfies the fixed-point equation below.
Proof sketch: to optimize Q(X_i), define the Lagrangian L_i, where λ_i corresponds to the constraint that Q(X_i) is a distribution. We now compute the derivative of L_i.
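The Lagrangian and the fixed-point equation, reconstructed in standard form (E_Q[ln φ | x_i] denotes the expectation of ln φ with X_i fixed to x_i and the remaining variables in Scope[φ] distributed according to Q; Z_i is a local normalization constant):

```latex
L_i \;=\; \sum_{\phi}\mathbb{E}_Q[\ln\phi] + \sum_j H_Q(X_j)
+ \lambda_i\Big(\sum_{x_i} Q(x_i) - 1\Big),
\qquad
Q(x_i) \;=\; \frac{1}{Z_i}\exp\Big\{\sum_{\phi:\,X_i\in\mathrm{Scope}[\phi]}
\mathbb{E}_Q[\ln\phi \mid x_i]\Big\} .
```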

48–50 Mean Field Maximization
We compute the derivative of the Lagrangian L_i with respect to Q(x_i); the only terms that involve Q(x_i) are the expected log-factors in which X_i appears, the entropy H_Q(X_i), and the normalization constraint. Setting the derivative to zero, and rearranging terms, we get:
ln Q(x_i) = Σ_{φ: X_i ∈ Scope[φ]} E_Q[ln φ | x_i] + λ_i − 1
Taking exponents of both sides we get:
Q(x_i) = (1/Z_i) exp{ Σ_{φ: X_i ∈ Scope[φ]} E_Q[ln φ | x_i] }

51 Mean Field Maximization: Intuition
We can thus rewrite Q(x_i) as:
Q(x_i) = (1/Z_i) exp{ E_Q[ln P(x_i | U − {X_i})] }
since the terms of ln P̃_F that do not involve x_i are constant in x_i and fold into Z_i.

52 Mean Field Maximization: Intuition
- Q(x_i) is the geometric average of P(x_i | U − {X_i}), taken relative to the probability distribution Q
- In this sense, each marginal is "consistent" with the other marginals
- In P_F we can also represent marginals this way: P(x_i) is the arithmetic average of P(x_i | U − {X_i}) with respect to P_F

53 Mean Field: Algorithm
Simplify the fixed-point update from
Q(x_i) = (1/Z_i) exp{ E_Q[ln P(x_i | U − {X_i})] }
to:
Q(x_i) = (1/Z_i) exp{ Σ_{φ: X_i ∈ Scope[φ]} E_Q[ln φ | x_i] }
since terms that do not involve x_i can be absorbed into the normalizing constant.
- Note: Q(x_i) does not appear on the right hand side, so we can solve for the optimal Q(X_i) in one step
- Note: the step is only optimal given all the other marginals Q(X_j), j ≠ i
- Suggests an iterative (coordinate ascent) algorithm, as sketched below
- Convergence to a local maximum is guaranteed, since each step improves the energy functional
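A compact sketch of the resulting coordinate ascent on an invented 3-node pairwise chain X0 – X1 – X2 of binary variables; the potentials, tolerance, and names are illustrative, not from the slides:

```python
# Coordinate-ascent mean field for a pairwise Markov network.
import numpy as np

node_pot = {v: np.array([1.0, 2.0]) for v in range(3)}       # phi[X_i]
edge_pot = {(0, 1): np.array([[3.0, 1.0], [1.0, 3.0]]),      # phi[X_i,X_j]
            (1, 2): np.array([[3.0, 1.0], [1.0, 3.0]])}
nbrs = {0: [(0, 1)], 1: [(0, 1), (1, 2)], 2: [(1, 2)]}

Q = {v: np.full(2, 0.5) for v in range(3)}   # uniform initialization

for sweep in range(100):
    delta = 0.0
    for i in range(3):
        # ln Q(x_i) = ln phi[X_i](x_i)
        #           + sum over neighbors j of E_{Q(X_j)}[ln phi[X_i,X_j]]
        log_q = np.log(node_pot[i]).copy()
        for (a, b) in nbrs[i]:
            j = b if a == i else a
            table = edge_pot[(a, b)] if a == i else edge_pot[(a, b)].T
            log_q += np.log(table) @ Q[j]     # expectation over Q(X_j)
        new_q = np.exp(log_q - log_q.max())   # subtract max for stability
        new_q /= new_q.sum()                  # local normalization Z_i
        delta = max(delta, float(np.abs(new_q - Q[i]).max()))
        Q[i] = new_q
    if delta < 1e-8:   # each step improves F, so this reaches a local max
        break

for i in range(3):
    print(f"Q(X{i}=1) = {Q[i][1]:.4f}")
```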

54 Markov Network Approximations
- Can use families Q that are increasingly complex
- As long as Q is easy (i.e., inference in Q is feasible), efficient update equations can be derived
[Figure: P_F, a 3×3 Markov grid network over X_11,...,X_33, and Q, the approximating network over the same variables]

