1 Machine Learning: Algorithms and Theory of Approximate Inference
Eric Xing, Lecture 15, August 15, 2010
Reading:

2 Inference Problems
Compute the likelihood of observed data.
Compute the marginal distribution over a particular subset of nodes.
Compute the conditional distribution for disjoint subsets A and B.
Compute a mode of the density.
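
A compact restatement of the four problems (the notation x_E for the evidence nodes and A, B for node subsets is assumed here, not taken from the slide):

```latex
\text{Likelihood:}\quad p(x_E) = \sum_{x_{V\setminus E}} p(x_{V\setminus E}, x_E)
\qquad
\text{Marginal:}\quad p(x_A) = \sum_{x_{V\setminus A}} p(x)
```
```latex
\text{Conditional:}\quad p(x_A \mid x_B) = \frac{p(x_A, x_B)}{\sum_{x_A} p(x_A, x_B)}
\qquad
\text{Mode:}\quad \hat{x} = \arg\max_{x} p(x)
```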

3 Inference in GM
[Figure: an HMM chain over x_1, ..., x_T with per-step variables y_1, ..., y_T, and a general Bayesian network over nodes A through H.]

4 Inference Problems
Compute the likelihood of observed data.
Compute the marginal distribution over a particular subset of nodes.
Compute the conditional distribution for disjoint subsets A and B.
Compute a mode of the density.
Methods we have: brute force (individual computations are independent); elimination and message passing (forward-backward, max-product/BP, junction tree), which share intermediate terms.

5 Recall forward-backward on HMM
Forward algorithm: recursively compute alpha_t(k), proportional to p(x_1:t, y_t = k).
Backward algorithm: recursively compute beta_t(k), proportional to p(x_{t+1:T} | y_t = k).
[Figure: the HMM chain from slide 3.]
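
A minimal sketch of the two recursions for a discrete HMM (the names A, B, pi and the scaling scheme are assumptions for illustration, not taken from the slide):

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Forward-backward for a discrete HMM.
    A:   (K, K) transition matrix, A[i, j] = p(y_t = j | y_{t-1} = i)
    B:   (K, V) emission matrix,  B[j, v] = p(x_t = v | y_t = j)
    pi:  (K,)  initial state distribution
    obs: list of observed symbol indices x_1 ... x_T
    Returns the posterior marginals p(y_t | x_1:T)."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))          # alpha_t(k) ∝ p(x_1:t, y_t = k)
    beta = np.zeros((T, K))           # beta_t(k)  ∝ p(x_{t+1}:T | y_t = k)
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()        # rescale for numerical stability
    for t in range(1, T):             # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):    # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta              # unnormalized posterior marginals
    return gamma / gamma.sum(axis=1, keepdims=True)
```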

6 Message passing for trees
Let m_ij(x_i) denote the factor resulting from eliminating variables from below up to i, which is a function of x_i.
This is reminiscent of a message sent from j to i: m_ij(x_i) represents a "belief" about x_i from x_j!
[Figure: a tree fragment with nodes i, j, k, l and a factor f.]
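
In symbols, such a message is computed by summing out x_j against the local and pairwise potentials and the messages arriving at j from its other neighbors (a standard form; the potentials psi are assumed notation, not taken from the slide):

```latex
m_{ij}(x_i) \;=\; \sum_{x_j} \psi_j(x_j)\, \psi_{ij}(x_i, x_j) \prod_{k \in N(j)\setminus\{i\}} m_{jk}(x_j)
```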

7 The General Sum-Product Algorithm
Tree-structured GMs.
Message passing on trees: on trees, the messages converge to a unique fixed point after a finite number of iterations.

8 Junction Tree Revisited
General algorithm on graphs with cycles.
Steps: triangulation => construct junction trees => message passing on clique trees.

9 Local Consistency
Given a set of functions associated with the cliques and separator sets, they are locally consistent if the clique marginals agree on every separator.
For junction trees, local consistency is equivalent to global consistency!

10 An Ising model on a 2-D image
Nodes encode hidden information (patch identity). They receive local information from the image (brightness, color). Information is propagated through the graph over its edges.
Edges encode 'compatibility' between nodes.
[Figure: a grid-structured model over image patches; is a given patch air or water?]

11 Why Approximate Inference?
Why can't we just run junction tree on this graph?
For an NxN grid, the tree-width is at least N, and N can be a huge number (~1000s of pixels).
If N ~ O(1000), we get junction tree cliques with at least 2^100 entries.

12 Solution 1: Belief Propagation on loopy graphs
BP message-update rules: the message M_ki combines the external evidence at node k with the edge compatibilities (interactions) and the messages arriving at k from its other neighbors.
May not converge, or may converge to a wrong solution.
[Figure: node i receiving messages M_ki from its neighbors k, with node j downstream.]
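
A minimal sketch of these updates on a pairwise MRF (the data structures phi, psi, neighbors and the synchronous schedule are assumptions for illustration; nothing here is prescribed by the slide):

```python
import numpy as np

def loopy_bp(phi, psi, neighbors, n_iters=50):
    """Synchronous loopy BP on a pairwise MRF.
    phi[i]       : (K,) node potential (external evidence) for node i
    psi[(i, j)]  : (K, K) edge compatibility, indexed [x_i, x_j]
    neighbors[i] : list of neighbors of node i
    Returns approximate node marginals; may fail to converge on loopy graphs."""
    # initialize every message m_{k->i} to uniform
    msgs = {(k, i): np.ones(len(phi[i])) for i in neighbors for k in neighbors[i]}
    for _ in range(n_iters):
        new_msgs = {}
        for (j, i) in msgs:
            # evidence at j times messages into j from everyone except i
            belief_j = phi[j].copy()
            for k in neighbors[j]:
                if k != i:
                    belief_j *= msgs[(k, j)]
            # sum over x_j through the edge compatibility, leaving a function of x_i
            edge = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
            m = edge @ belief_j
            new_msgs[(j, i)] = m / m.sum()   # normalize for stability
        msgs = new_msgs
    beliefs = {}
    for i in neighbors:
        b = phi[i].copy()
        for k in neighbors[i]:
            b *= msgs[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs
```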

13 Recall BP on trees
The same BP message-update rules (external evidence times compatibilities times incoming messages).
BP on trees always converges to the exact marginals.
[Figure: the same message-passing picture as on slide 12, now on a tree.]

14 Solution 2: The naive mean field approximation
Approximate p(X) by a fully factorized q(X) = \prod_i q_i(X_i).
For the Boltzmann distribution p(X) = \exp\{\sum_{i<j} \theta_{ij} X_i X_j + \sum_i \theta_{i0} X_i\}/Z, the mean field equation is
q_i(X_i) \propto \exp\{\theta_{i0} X_i + \sum_{j \in N_i} \theta_{ij} \langle X_j \rangle_{q_j} X_i\}.
\langle X_j \rangle_{q_j} resembles a "message" sent from node j to i.
\{\langle X_j \rangle_{q_j} : j \in N_i\} forms the "mean field" applied to X_i from its neighborhood.

15 Recall Gibbs sampling
Gibbs predictive distribution: p(X_i \mid X_{N_i}) \propto \exp\{\theta_{i0} X_i + \sum_{j \in N_i} \theta_{ij} X_i X_j\}.
For the Boltzmann distribution p(X) = \exp\{\sum_{i<j} \theta_{ij} X_i X_j + \sum_i \theta_{i0} X_i\}/Z, each X_i is resampled in turn given the current values of its neighbors; contrast this with the fully factorized q(X) = \prod_i q_i(X_i) used by mean field.
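
For comparison, a minimal Gibbs sampler for this Boltzmann distribution with x_i in {-1, +1} (the parameter names theta, theta0 and the sweep schedule are assumptions for illustration):

```python
import numpy as np

def gibbs_ising(theta, theta0, n_sweeps=1000, rng=None):
    """Gibbs sampling for p(x) ∝ exp{ sum_{i<j} theta[i,j] x_i x_j + sum_i theta0[i] x_i },
    with x_i in {-1, +1}.  theta is a symmetric coupling matrix with zero diagonal."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(theta0)
    x = rng.choice([-1, 1], size=n)
    samples = []
    for _ in range(n_sweeps):
        for i in range(n):
            # the Gibbs predictive distribution depends only on the neighborhood of i
            field = theta0[i] + theta[i] @ x - theta[i, i] * x[i]
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))   # p(x_i = +1 | x_-i)
            x[i] = 1 if rng.random() < p_plus else -1
        samples.append(x.copy())
    return np.array(samples)
```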

16 Summary So Far
Exact inference methods are limited to tree-structured graphs.
Junction tree methods are exponentially expensive in the tree-width.
Message passing methods can be applied to loopy graphs, but lack analysis!
Mean field is convergent, but can get stuck in local optima.
Where do these two algorithms come from? Do they make sense?

17 Next Step ...
Develop a general theory of variational inference.
Introduce some approximate inference methods.
Provide a deeper understanding of some popular methods.

18 Exponential Family GMs
Canonical parameterization: p_\theta(x) = \exp\{\langle \theta, \phi(x)\rangle - A(\theta)\}, with canonical parameters \theta, sufficient statistics \phi(x), and log-normalization function A(\theta).
Effective canonical parameters.
Regular family: the set \{\theta : A(\theta) < \infty\} is open.
Minimal representation: there does not exist a nonzero vector a such that \langle a, \phi(x)\rangle is a constant.

19 Examples
Ising model (binary r.v.: {-1, +1}).
Gaussian MRF.
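
For instance, the Ising model fits the canonical form with sufficient statistics {x_s, x_s x_t} (standard parameterization; the exact symbols on the slide are not visible in the transcript):

```latex
p_\theta(x) \;=\; \exp\Big\{ \sum_{s \in V} \theta_s x_s \;+\; \sum_{(s,t) \in E} \theta_{st}\, x_s x_t \;-\; A(\theta) \Big\},
\qquad x_s \in \{-1, +1\}
```

A Gaussian MRF uses the same pair of statistics with continuous variables, p_\theta(x) \propto \exp\{\langle \theta, x\rangle - \tfrac{1}{2} x^{\top}\Theta x\} for x \in \mathbb{R}^m.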

20 Mean Parameterization
The mean parameter associated with a sufficient statistic \phi_\alpha is defined as \mu_\alpha = E_p[\phi_\alpha(X)].
Realizable mean parameter set: M = \{\mu : \exists\, p \text{ such that } E_p[\phi(X)] = \mu\}.
A convex subset of R^d.
It is the convex hull of \{\phi(x)\} in the discrete case, hence a convex polytope when the state space is finite.

21 Convex Polytope
Convex hull representation.
Half-plane based representation.
Minkowski-Weyl theorem: any polytope can be characterized by a finite collection of linear inequality constraints.

22 Example: Two-node Ising Model
Convex hull representation.
Half-plane representation, where each inequality follows from probability theory (non-negativity of the joint probabilities).
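
As a concrete instance, taking x_1, x_2 in {0, 1} (an assumption; the slide's state space is not shown in the transcript), the two representations of the marginal polytope over (\mu_1, \mu_2, \mu_{12}) = (E[X_1], E[X_2], E[X_1 X_2]) are:

```latex
\mathcal{M} \;=\; \mathrm{conv}\{(0,0,0),\,(1,0,0),\,(0,1,0),\,(1,1,1)\}
\;=\; \{(\mu_1,\mu_2,\mu_{12}) :\; \mu_{12} \ge 0,\;\; \mu_1 \ge \mu_{12},\;\; \mu_2 \ge \mu_{12},\;\; 1 + \mu_{12} - \mu_1 - \mu_2 \ge 0\}
```

Each half-plane constraint is the non-negativity of one joint probability, e.g. \mu_1 - \mu_{12} = P(X_1 = 1, X_2 = 0) \ge 0.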

23 Marginal Polytope
Canonical parameterization: indicator sufficient statistics for each node value and each edge configuration.
Mean parameterization: the mean parameters are marginal distributions over nodes and edges, \mu_i(x_i) = P(X_i = x_i) and \mu_{ij}(x_i, x_j) = P(X_i = x_i, X_j = x_j).
The realizable set is called the marginal polytope.

24 Conjugate Duality
Duality between MLE and max-entropy: A^*(\mu) = \sup_{\theta}\{\langle \theta, \mu\rangle - A(\theta)\}.
For all \mu in the interior of M, there is a unique canonical parameter \theta(\mu) satisfying \mu = E_{\theta(\mu)}[\phi(X)], and A^*(\mu) is the negative entropy of p_{\theta(\mu)}.
The log-partition function has the variational form A(\theta) = \sup_{\mu \in M}\{\langle \theta, \mu\rangle - A^*(\mu)\}. (*)
For all \theta, the supremum in (*) is attained uniquely at the \mu in the interior of M specified by the moment-matching conditions \mu = E_\theta[\phi(X)].
This gives a bijection between \theta and \mu for a minimal exponential family.

25 Roles of Mean Parameters
Forward mapping: from the canonical parameters \theta to the mean parameters \mu; a fundamental class of inference problems in exponential family models.
Backward mapping: parameter estimation, i.e., learning the unknown \theta.

26 Example: Bernoulli
Forward mapping: \mu = e^{\theta}/(1 + e^{\theta}).
Reverse mapping: if \mu \in (0, 1), then \theta = \log(\mu/(1 - \mu)), and it is unique!
If \mu \in \{0, 1\}, there is no gradient stationary point in the optimization problem (**); the supremum is only approached at the boundary.
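
A small numerical check of the two mappings for the Bernoulli case (a sketch; the function names are mine):

```python
import numpy as np

def forward_map(theta):
    """Forward mapping: canonical parameter -> mean parameter, mu = E[X] = sigmoid(theta)."""
    return 1.0 / (1.0 + np.exp(-theta))

def reverse_map(mu):
    """Reverse mapping: mean parameter -> canonical parameter, defined only for mu in (0, 1)."""
    return np.log(mu / (1.0 - mu))

theta = 0.7
mu = forward_map(theta)
assert np.isclose(reverse_map(mu), theta)   # bijection on the interior of [0, 1]
```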

27 Variational Inference In General
An umbrella term that refers to various mathematical tools for optimization-based formulations of problems, as well as associated techniques for their solution.
General idea: express a quantity of interest as the solution of an optimization problem.
The optimization problem can be relaxed in various ways: approximate the function to be optimized, or approximate the set over which the optimization takes place.
It develops in parallel with MCMC.

28 A Tree-Based Outer Bound: the Locally Consistent (Pseudo-)Marginal Polytope
L(G) = \{\tau \ge 0 : \sum_{x_i} \tau_i(x_i) = 1 \text{ (normalization)},\; \sum_{x_j} \tau_{ij}(x_i, x_j) = \tau_i(x_i) \text{ (marginalization)}\}.
Relation to the marginal polytope: M(G) \subseteq L(G) holds for any graph; M(G) = L(G) holds for tree-structured graphs.

29 An Example
A three-node cycle (binary r.v.).
For any \tau \in M(G), the local consistency constraints hold, so M(G) \subseteq L(G).
For this graph one can construct a \tau \in L(G) that lies outside M(G), so the inclusion is strict (an exercise).

30 Bethe Entropy Approximation
Approximate the negative entropy A^*(\mu), which does not have a closed form for a general graph.
Entropy on a tree (in terms of exact marginals): recall H = \sum_{i \in V} H_i(\mu_i) - \sum_{(i,j) \in E} I_{ij}(\mu_{ij}), a sum of singleton entropies minus edge-wise mutual informations.
Bethe entropy approximation (applied to pseudo-marginals): H_{Bethe}(\tau) = \sum_{i \in V} H_i(\tau_i) - \sum_{(i,j) \in E} I_{ij}(\tau_{ij}).

31 Bethe Variational Problem (BVP)
We already have: a convex (polyhedral) outer bound L(G), and the Bethe approximate entropy H_{Bethe}.
Combining the two ingredients gives a simple structured problem (differentiable, and the constraint set is a simple polytope): \max_{\tau \in L(G)}\{\langle \theta, \tau\rangle + H_{Bethe}(\tau)\}.
Sum-product is the solver!
(The approximation is named after Hans Bethe, Nobel Prize in Physics, 1967.)

32 Connection to Sum-Product Alg.
Lagrangian method for the BVP: sum-product and Bethe variational (Yedidia et al., 2002).
For any graph G, any fixed point of the sum-product updates specifies a pair (\tau^*, \lambda^*) that is a stationary point of the Lagrangian of the BVP.
For a tree-structured MRF, the solution is unique; \tau^* corresponds to the exact singleton and pairwise marginal distributions of the MRF, and the optimal value of the BVP is equal to the log-partition function A(\theta).

33 Proof

34 Discussions
The connection provides a principled basis for applying the sum-product algorithm to loopy graphs.
However, this connection provides no guarantees on the convergence of the sum-product algorithm on loopy graphs.
The Bethe variational problem is usually non-convex, so there are no guarantees on reaching the global optimum.
Generally, there is also no guarantee that the optimal Bethe value is a lower bound on A(\theta).
However, the connection and understanding suggest a number of avenues for improving upon the ordinary sum-product algorithm, via progressively better approximations to the entropy function and outer bounds on the marginal polytope!

35 Inexactness of Bethe and Sum-Product
Inexactness comes from two sources: the Bethe entropy approximation, and the pseudo-marginal outer bound, since M(G) is strictly included in L(G) for graphs with cycles.
Example: a four-node cycle.
[Figure: a cycle on nodes 1, 2, 3, 4.]

36 Summary of LBP
Variational methods in general turn inference into an optimization problem.
However, both the objective function and the constraint set are hard to deal with.
The Bethe variational approximation is a tree-based approximation to both the objective function and the marginal polytope.
Belief propagation is a Lagrangian-based solver for the BVP.
Generalized BP extends BP to solve the generalized, hyper-tree-based variational approximation problem.

37 Tractable Subgraph
Given a GM with graph G, a subgraph F is tractable if we can perform exact inference on it.
Example: [Figure: examples of tractable subgraphs.]

38 Mean Parameterization
For an exponential family GM defined with graph G and sufficient statistics \phi, the realizable mean parameter set is M(G; \phi) = \{\mu : \exists\, p \text{ such that } E_p[\phi(X)] = \mu\}.
For a given tractable subgraph F, a subset of mean parameters is of interest: M_F(G; \phi), the mean parameters realizable by distributions that factorize according to F.
M_F(G; \phi) \subseteq M(G; \phi) is an inner approximation.

39 Optimizing a Lower Bound
Any mean parameter \mu \in M_F(G; \phi) yields a lower bound on the log-partition function: A(\theta) \ge \langle \theta, \mu\rangle + H(q_\mu).
Moreover, equality holds iff \theta and \mu are dually coupled, i.e., \mu = E_\theta[\phi(X)].
Proof idea: Jensen's inequality (sketched below).
Optimizing the lower bound gives \max_{\mu \in M_F}\{\langle \theta, \mu\rangle + H(q_\mu)\}. This is an inference problem!
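
The proof idea written out (a standard derivation; q denotes the tractable distribution with mean parameter \mu, and the inequality is Jensen's applied to the concave logarithm):

```latex
A(\theta) \;=\; \log \sum_x \exp\langle \theta, \phi(x)\rangle
\;=\; \log \sum_x q(x)\,\frac{\exp\langle \theta, \phi(x)\rangle}{q(x)}
\;\ge\; \sum_x q(x)\big[\langle \theta, \phi(x)\rangle - \log q(x)\big]
\;=\; \langle \theta, \mu\rangle + H(q)
```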

40 Mean Field Methods In General
However, the lower bound can't be explicitly evaluated in general, because the dual function A^* typically lacks an explicit form.
Mean field methods: approximate the lower bound, and approximate the realizable mean parameter set.
The MF optimization problem: \max_{\mu \in M_F(G)}\{\langle \theta, \mu\rangle - A^*_F(\mu)\}, where A^*_F is the dual function (negative entropy) restricted to the tractable set, which does have an explicit form.
Is it still a lower bound?

41 KL-divergence
Kullback-Leibler divergence between two exponential family distributions with the same sufficient statistics, written in three equivalent ways (with \mu_1 = E_{\theta_1}[\phi(X)]):
Primal form: D(\theta_1 \,\|\, \theta_2) = A(\theta_2) - A(\theta_1) - \langle \mu_1, \theta_2 - \theta_1\rangle.
Mixed form: D(\mu_1 \,\|\, \theta_2) = A(\theta_2) + A^*(\mu_1) - \langle \mu_1, \theta_2\rangle.
Dual form: D(\mu_1 \,\|\, \mu_2) = A^*(\mu_1) - A^*(\mu_2) - \langle \theta_2, \mu_1 - \mu_2\rangle, with \theta_2 = \nabla A^*(\mu_2).

42 Mean Field and KL-divergence
Optimizing the lower bound \max_{\mu \in M_F}\{\langle \theta, \mu\rangle + H(q_\mu)\} is equivalent to minimizing a KL-divergence: the gap A(\theta) - \langle \theta, \mu\rangle - H(q_\mu) equals D(q_\mu \,\|\, p_\theta).
Therefore, we are doing KL(q \,\|\, p) minimization over the tractable family.

43 Naïve Mean Field
Fully factorized variational distribution: q(x) = \prod_{i \in V} q_i(x_i).

44 Naïve Mean Field for Ising Model
Sufficient statistics and mean parameters (taking x_i \in \{0, 1\} here): \phi = \{x_i\} \cup \{x_i x_j\}, with \mu_i = E[X_i] and \mu_{ij} = E[X_i X_j].
Naïve mean field: q(x) = \prod_i q_i(x_i), so \mu_{ij} = \mu_i \mu_j.
Realizable mean parameter subset: M_F = \{\mu : 0 \le \mu_i \le 1,\; \mu_{ij} = \mu_i \mu_j\}.
Entropy: H(q) = -\sum_i [\mu_i \log \mu_i + (1 - \mu_i)\log(1 - \mu_i)].
Optimization problem: \max_{\mu \in M_F}\{\sum_i \theta_i \mu_i + \sum_{(i,j) \in E} \theta_{ij} \mu_i \mu_j + H(q)\}.

45 Naïve Mean Field for Ising Model
Optimization problem: \max_{\mu \in M_F}\{\sum_i \theta_i \mu_i + \sum_{(i,j) \in E} \theta_{ij} \mu_i \mu_j + H(q)\}.
Update rule (coordinate ascent): \mu_i \leftarrow \sigma\big(\theta_i + \sum_{j \in N(i)} \theta_{ij} \mu_j\big).
\mu_j resembles a "message" sent from node j to node i.
\{\mu_j : j \in N(i)\} forms the "mean field" applied to X_i from its neighborhood.
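
A minimal sketch of these coordinate-ascent updates (it assumes the {0, 1} parameterization used above; the data structures theta_node, theta_edge, neighbors are hypothetical names, not from the slide):

```python
import numpy as np

def naive_mean_field_ising(theta_node, theta_edge, neighbors, n_iters=100):
    """Naive mean field for an Ising model with x_i in {0, 1}.
    theta_node[i]      : singleton canonical parameter for node i
    theta_edge[(i, j)] : pairwise canonical parameter (stored once per edge)
    neighbors[i]       : list of neighbors of node i
    Returns mu[i] ≈ q_i(X_i = 1)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    mu = {i: 0.5 for i in neighbors}           # uniform initialization
    for _ in range(n_iters):
        for i in neighbors:                    # coordinate ascent, one node at a time
            # the "mean field" applied to node i: expected contribution of its neighbors
            field = theta_node[i] + sum(
                theta_edge.get((i, j), theta_edge.get((j, i))) * mu[j]
                for j in neighbors[i]
            )
            mu[i] = sigmoid(field)             # the update rule from the slide
    return mu
```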

46 Non-Convexity of Mean Field
Mean field optimization is always non-convex for any exponential family in which the state space is finite.
Reason: M(G) is a finite convex hull containing all of its extreme points; if the inner set M_F were convex and contained all the extreme points, it would have to be all of M(G). Since M_F is a strict subset, it cannot be convex.
Mean field has nevertheless been used successfully in practice.

47 Structured Mean Field
Mean field theory applies to any tractable subgraph.
Naïve mean field is based on the fully disconnected subgraph.
Variants based on structured subgraphs can be derived.

48 Topic models
[Figure: graphical model with topic parameters \beta, per-document variables \mu and \Sigma, topic indicators z, and words w.]
Optimization problem: approximate the posterior; approximate the integral.
Solve for \mu^*, \Sigma^*, \phi_{1:n}^*.
[Figure: the variational approximation, with z and w governed by \mu^*, \Sigma^*, \Phi^*.]

49 Fully Factored Distribution
[Figure: the same model with a fully factored variational distribution over \mu, \Sigma, and z.]
Fixed point equations; Laplace approximation.
Variational Inference with no Tears [Ahmed and Xing, 2006; Xing et al., 2003].

50 Summary of GMF
Message-passing algorithms (e.g., belief propagation, mean field) are solving approximate versions of an exact variational principle in exponential families.
There are two distinct components to the approximations: one can use either inner or outer bounds to the marginal polytope M, and various approximations to the entropy function.
BP: polyhedral outer bound and non-convex Bethe approximation.
MF: non-convex inner bound and exact form of entropy.
Kikuchi: tighter polyhedral outer bound and better entropy approximation.

