Inference in Gaussian and Hybrid Bayesian Networks ICS 275B

Gaussian Distribution

Figure: univariate Gaussian densities gaussian(x,0,1) and gaussian(x,1,1), i.e. N(μ, σ) with different means.

Figure: univariate Gaussian densities gaussian(x,0,1) and gaussian(x,0,2), i.e. N(μ, σ) with different standard deviations.

Multivariate Gaussian Definition: Let X1,…,Xn be a set of random variables. A multivariate Gaussian distribution over X1,…,Xn is parameterized by an n-dimensional mean vector μ and an n x n positive definite covariance matrix Σ. It defines a joint density via: p(x) = (2π)^(-n/2) |Σ|^(-1/2) exp( -0.5 (x-μ)' Σ^-1 (x-μ) ).
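
As a quick numeric illustration of this density, here is a minimal NumPy sketch; the function name and the example numbers are illustrative, not from the slides.

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x; sigma must be positive definite."""
    x, mu = np.atleast_1d(x).astype(float), np.atleast_1d(mu).astype(float)
    n = mu.size
    diff = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm)

# A 2-dimensional example with correlated components.
mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(multivariate_gaussian_pdf([0.2, 0.8], mu, sigma))
```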

Multivariate Gaussian

Linear Gaussian Distribution Definition: Let Y be a continuous node with continuous parents X1,…,Xk. We say that Y has a linear Gaussian model if it can be described using parameters β0,…,βk and σ² such that: P(y | x1,…,xk) = N(β0 + β1·x1 + … + βk·xk ; σ²), i.e. the CPD is given by the coefficient vector [β0, β1,…, βk] and the variance σ².

Figure: a two-node linear Gaussian network A → B.

Linear Gaussian Network Definition: A linear Gaussian Bayesian network is a Bayesian network all of whose variables are continuous and where all of the CPTs are linear Gaussians. Linear Gaussian BN ≡ Multivariate Gaussian => a linear Gaussian BN is a compact representation of a multivariate Gaussian.
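
To make the equivalence concrete, here is a small sketch (the function name and toy parameters are my own) that compiles a two-node linear Gaussian network A → B, with P(a) = N(mu_a, var_a) and P(b|a) = N(b0 + b1*a ; var_b), into the joint multivariate Gaussian over (A, B).

```python
import numpy as np

def joint_of_two_node_lg(mu_a, var_a, b0, b1, var_b):
    """Joint N(mean, cov) over (A, B) for A ~ N(mu_a, var_a), B|A ~ N(b0 + b1*A, var_b)."""
    mean = np.array([mu_a, b0 + b1 * mu_a])
    cov = np.array([[var_a,       b1 * var_a],
                    [b1 * var_a,  var_b + b1 ** 2 * var_a]])
    return mean, cov

mean, cov = joint_of_two_node_lg(mu_a=1.0, var_a=4.0, b0=0.5, b1=2.0, var_b=1.0)
print(mean)  # [1.  2.5]
print(cov)   # [[ 4.  8.]
             #  [ 8. 17.]]
```

Going the other direction, from a joint Gaussian back to a linear Gaussian BN, is also always possible, which is the sense in which the two representations are equivalent.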

Inference in Continuous Networks (example: the two-node network A → B).

Marginalization

Problem: when we multiply two arbitrary Gaussians! The inverses of K and M are always well defined; however, the inverse of their combination (K^-1 + M^-1) is not.

Theoretical explanation: why is this the case? The inverse of an n x n matrix exists only when the matrix has rank n. If all sigmas and w's are assumed to be 1, (K^-1 + M^-1) has rank 2 and so is not invertible.

Density vs conditional. However, Theorem: if the product of the Gaussians represents a multivariate Gaussian density, then the inverse always exists. For example, for P(A|B)*P(B) = P(A,B) = N(c,C), the inverse of C always exists, since P(A,B) is a multivariate Gaussian (density). But for P(A|B)*P(B|X) = P(A,B|X) = N(c,C), the inverse of C may not exist, since P(A,B|X) is a conditional Gaussian.

Inference: a general algorithm (computing the marginal of a given variable, say Z). Step 1: Convert all conditional Gaussians to canonical form, i.e. a triple (g, h, K) representing the function exp(g + h'x - 0.5 x'Kx).
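
A hedged sketch of Step 1 for a moment-form Gaussian density N(μ, Σ): the canonical parameters are K = Σ^-1, h = Σ^-1 μ, and g is the log normalizing constant. The function name is mine; conditional Gaussians are converted analogously by expanding the exponent of P(y|x) as a quadratic in (y, x).

```python
import numpy as np

def to_canonical(mu, sigma):
    """Moment form N(mu, sigma) -> canonical form (g, h, K),
    where density = exp(g + h'x - 0.5 x'Kx)."""
    mu = np.atleast_1d(mu).astype(float)
    K = np.linalg.inv(sigma)
    h = K @ mu
    n = mu.size
    g = -0.5 * mu @ h - 0.5 * (n * np.log(2 * np.pi) - np.log(np.linalg.det(K)))
    return g, h, K

print(to_canonical(np.array([0.0, 1.0]),
                   np.array([[1.0, 0.5],
                             [0.5, 2.0]])))
```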

Inference: a general algorithm (computing the marginal of a given variable, say Z). Step 2: Extend all g's, h's and K's to the same domain by adding 0's.

Inference: a general algorithm (computing the marginal of a given variable, say Z). Step 3: Add all g's, all h's and all K's. Step 4: Convert the result back to moment form; the variables involved in the computation give P(X1,X2,…,Xk,Z) = N(μ, Σ), with Σ = K^-1 and μ = K^-1 h.

Inference: a general algorithm (computing the marginal of a given variable, say Z). Step 5: Extract the marginal of Z from the corresponding entries of μ and Σ.
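
A small sketch of Steps 4 and 5, assuming the combined canonical form (g, h, K) from Step 3 represents a density: convert back to moment form via Σ = K^-1, μ = Σh, then read off the entries for Z. The numbers below are made up.

```python
import numpy as np

def canonical_to_moments(h, K):
    """Canonical form (h, K) of a density -> moment form (mu, sigma)."""
    sigma = np.linalg.inv(K)
    return sigma @ h, sigma

def marginal_of(mu, sigma, idx):
    """Marginal over the variables at positions idx: pick the sub-vector and sub-block."""
    idx = list(idx)
    return mu[idx], sigma[np.ix_(idx, idx)]

# Toy combined form over (X1, X2, Z); Z sits at position 2.
h = np.array([1.0, 0.0, 0.5])
K = np.array([[ 2.0, -0.5,  0.0],
              [-0.5,  1.5, -0.3],
              [ 0.0, -0.3,  1.0]])
mu, sigma = canonical_to_moments(h, K)
print(marginal_of(mu, sigma, [2]))
```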

Inference: computing the marginal of a given variable. For a continuous Gaussian Bayesian network, inference is polynomial, O(N^3), the complexity of matrix inversion, so algorithms like belief propagation are not generally used when all variables are Gaussian. Can we do better than N^3? Use bucket elimination.

Bucket elimination: algorithm elim-bel (Dechter 1996). Query P(a|e=0), evidence e=0, ordering A, E, D, C, B (bucket B processed first):
bucket B: P(b|a), P(d|b,a), P(e|b,c)
bucket C: P(c|a)
bucket D:
bucket E: e=0
bucket A: P(a)
Each bucket is processed with a multiplication operator (combine its functions) and a marginalization operator (eliminate the bucket's variable). W*=4, the "induced width" (max clique size).

Multiplication Operator: Convert all functions to canonical form if necessary, extend all functions to the same variables, then (g1,h1,K1)*(g2,h2,K2) = (g1+g2, h1+h2, K1+K2).
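
A sketch of this operator, assuming each canonical form carries its list of variables; the extension step just pads h with zeros and K with zero rows and columns. Names and numbers are illustrative.

```python
import numpy as np

def extend(g, h, K, scope, full_scope):
    """Extend (g, h, K) over `scope` to `full_scope` by adding 0's."""
    idx = [full_scope.index(v) for v in scope]
    h_ext = np.zeros(len(full_scope))
    K_ext = np.zeros((len(full_scope), len(full_scope)))
    h_ext[idx] = h
    K_ext[np.ix_(idx, idx)] = K
    return g, h_ext, K_ext

def multiply(f1, f2):
    """(g1,h1,K1)*(g2,h2,K2) = (g1+g2, h1+h2, K1+K2) over the union of the scopes."""
    (g1, h1, K1, s1), (g2, h2, K2, s2) = f1, f2
    scope = s1 + [v for v in s2 if v not in s1]
    g1, h1, K1 = extend(g1, h1, K1, s1, scope)
    g2, h2, K2 = extend(g2, h2, K2, s2, scope)
    return g1 + g2, h1 + h2, K1 + K2, scope

fA  = (0.0, np.array([1.0]), np.array([[2.0]]), ['a'])
fAB = (0.0, np.array([0.5, 0.5]), np.array([[1.0, -0.5],
                                            [-0.5, 1.0]]), ['a', 'b'])
print(multiply(fA, fAB))
```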

Again our problem! Applying the same bucket scheme to compute P(e) (functions P(a), P(c|a), P(b|a), P(d|b,a), P(e|b,c); buckets B, C, D, E, A; W*=4): the intermediate function h(a,d,c,e) generated in bucket B does not represent a density and so cannot be computed in our usual form N(μ,σ).

Solution: marginalize in canonical form. Although the intermediate functions computed in bucket elimination are conditional, we can marginalize directly in canonical form, which eliminates the problem of the non-existent inverse completely.
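
A hedged sketch of marginalization in canonical form: integrating the variables at positions `drop` out of a canonical form (g, h, K) only requires inverting the block Kyy being removed, not the whole matrix. The interface (position lists) and the toy numbers are my own.

```python
import numpy as np

def marginalize_canonical(g, h, K, keep, drop):
    """Integrate out the variables at positions `drop`, keeping positions `keep`:
       K' = Kxx - Kxy Kyy^-1 Kyx
       h' = hx  - Kxy Kyy^-1 hy
       g' = g + 0.5*( |y| log(2*pi) - log|Kyy| + hy' Kyy^-1 hy )."""
    Kxx, Kxy = K[np.ix_(keep, keep)], K[np.ix_(keep, drop)]
    Kyy = K[np.ix_(drop, drop)]
    hx, hy = h[keep], h[drop]
    Kyy_inv_hy = np.linalg.solve(Kyy, hy)
    K_new = Kxx - Kxy @ np.linalg.solve(Kyy, Kxy.T)
    h_new = hx - Kxy @ Kyy_inv_hy
    g_new = g + 0.5 * (len(drop) * np.log(2 * np.pi)
                       - np.log(np.linalg.det(Kyy))
                       + hy @ Kyy_inv_hy)
    return g_new, h_new, K_new

# Marginalize the second variable out of a 2-D canonical form.
g, h, K = 0.0, np.array([1.0, 0.5]), np.array([[2.0, -0.5],
                                               [-0.5, 1.0]])
print(marginalize_canonical(g, h, K, keep=[0], drop=[1]))
```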

Algorithm: in each bucket, convert all functions to canonical form if necessary, multiply them, and marginalize out the bucket's variable as shown in the previous slide. Theorem: the computed P(A) is a density and is correct. Complexity: time and space O((w+1)^3), where w is the width of the ordering used.
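
Putting the pieces together, here is a self-contained sketch of this bucket-elimination loop over canonical-form factors (g, h, K, scope); the helper functions condense the multiplication and marginalization rules from the previous slides, and the toy two-variable network at the bottom is made up.

```python
import numpy as np

def extend(g, h, K, scope, full):
    """Pad (g, h, K) over `scope` with zeros so it is expressed over `full`."""
    idx = [full.index(v) for v in scope]
    h2 = np.zeros(len(full)); h2[idx] = h
    K2 = np.zeros((len(full), len(full))); K2[np.ix_(idx, idx)] = K
    return g, h2, K2, full

def multiply(f1, f2):
    """Combine two canonical forms: add g's, h's and K's over the union scope."""
    full = f1[3] + [v for v in f2[3] if v not in f1[3]]
    g1, h1, K1, _ = extend(*f1, full)
    g2, h2, K2, _ = extend(*f2, full)
    return g1 + g2, h1 + h2, K1 + K2, full

def integrate_out(f, var):
    """Marginalize the continuous variable `var` out of a canonical form."""
    g, h, K, scope = f
    d = [scope.index(var)]
    k = [i for i in range(len(scope)) if i not in d]
    Kyy, Kxy = K[np.ix_(d, d)], K[np.ix_(k, d)]
    s = np.linalg.solve(Kyy, h[d])
    g_new = g + 0.5 * (np.log(2 * np.pi) - np.log(np.linalg.det(Kyy)) + h[d] @ s)
    h_new = h[k] - Kxy @ s
    K_new = K[np.ix_(k, k)] - Kxy @ np.linalg.solve(Kyy, Kxy.T)
    return g_new, h_new, K_new, [scope[i] for i in k]

def bucket_elimination(factors, order, query):
    """For each bucket: combine the factors mentioning the bucket variable,
    integrate that variable out, and pass the result to a lower bucket."""
    for var in order:
        if var == query:
            continue
        bucket = [f for f in factors if var in f[3]]
        factors = [f for f in factors if var not in f[3]]
        if bucket:
            prod = bucket[0]
            for f in bucket[1:]:
                prod = multiply(prod, f)
            factors.append(integrate_out(prod, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result  # canonical form of the marginal over `query`

# Toy network A -> B in canonical form (all numbers are made up).
fA  = (0.0, np.array([1.0]), np.array([[2.0]]), ['a'])
fAB = (0.0, np.array([0.5, 0.5]), np.array([[1.0, -0.5],
                                            [-0.5, 1.0]]), ['a', 'b'])
print(bucket_elimination([fA, fAB], order=['a', 'b'], query='b'))
```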

Continuous Node, Discrete Parents. Definition: Let X be a continuous node, and let U={U1,U2,…,Un} be its discrete parents and Y={Y1,Y2,…,Yk} its continuous parents. We say that X has a conditional linear Gaussian (CLG) CPT if, for every value u ∈ D(U), we have a set of (k+1) coefficients a_u,0, a_u,1, …, a_u,k and a variance σ_u² such that: P(x | u, y1,…,yk) = N(a_u,0 + a_u,1·y1 + … + a_u,k·yk ; σ_u²).

CLG Network Definition: A Bayesian network is called a CLG network if every discrete node has only discrete parents, and every continuous node has a CLG CPT.

Inference in CLGs. Can we use the same algorithm? Yes, but the algorithm is unbounded if we are not careful. Reason: marginalizing discrete variables out of an arbitrary function in a CLG is not bounded. If we marginalize out y and k from f(x,y,i,k), the result is a mixture of 4 Gaussians instead of 2 (x and y are continuous variables; i and k are discrete binary variables).

Solution: Approximate the mixture of Gaussians by a single gaussian
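
This collapse is typically done by moment matching: the single Gaussian keeps the mixture's overall mean and covariance. A small sketch, with made-up weights and components:

```python
import numpy as np

def collapse_mixture(weights, means, covs):
    """Moment-match a Gaussian mixture with one Gaussian (a 'weak' approximation):
       mu    = sum_i w_i mu_i
       sigma = sum_i w_i (sigma_i + (mu_i - mu)(mu_i - mu)^T)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    means = [np.atleast_1d(m).astype(float) for m in means]
    covs = [np.atleast_2d(c).astype(float) for c in covs]
    mu = sum(wi * mi for wi, mi in zip(w, means))
    sigma = sum(wi * (ci + np.outer(mi - mu, mi - mu))
                for wi, ci, mi in zip(w, covs, means))
    return mu, sigma

# A 4-component 1-D mixture (as in the f(x,y,i,k) example) collapsed to one Gaussian.
print(collapse_mixture(weights=[0.1, 0.4, 0.3, 0.2],
                       means=[[-1.0], [0.0], [1.0], [2.0]],
                       covs=[[[1.0]], [[0.5]], [[0.5]], [[1.0]]]))
```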

Multiplication and Marginalization. Multiplication: convert all functions to canonical form if necessary, extend all functions to the same variables, then (g1,h1,K1)*(g2,h2,K2) = (g1+g2, h1+h2, K1+K2). Marginalization: a strong marginal when marginalizing continuous variables; a weak marginal when marginalizing discrete variables.

Problem while using this marginalization in bucket elimination: it requires computing Σ and μ, which is not possible due to the non-existence of the inverse. Solution: use an ordering such that you never have to marginalize discrete variables out of a function that has both discrete and continuous Gaussian variables. Special case: compute the marginal at a discrete node. Homework: derive a bucket elimination algorithm for computing the marginal of a continuous variable.

Special case: a marginal on a discrete variable in a CLG is to be computed. Bucket elimination is applied as before (multiplication and marginalization operators) to the functions P(a), P(c|a), P(b|a,e), P(d|b,a), P(d|b,c) over buckets B, C, D, E, A, where B, C and D are continuous variables and A and E are discrete; W*=4, the "induced width" (max clique size).

Complexity of the special case. Discrete width (wd): maximum number of discrete variables in a clique. Continuous width (wc): maximum number of continuous variables in a clique. Time: O(exp(wd) + wc^3). Space: O(exp(wd) + wc^3).

Algorithm for the general case: computing the belief at a continuous node of a CLG. 1. Convert all functions to canonical form. 2. Create a special tree-decomposition. 3. Assign functions to appropriate cliques (same as assigning functions to buckets). 4. Select a strong root. 5. Perform message passing.

Creating a special tree-decomposition: moralize the Bayesian network, then select an ordering in which all continuous variables are eliminated before the discrete variables (this may increase the induced width).
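
A minimal sketch of such an ordering: within the moral graph, greedily eliminate (by min-degree) all continuous variables first, then the discrete ones. The adjacency below is my guess at the moralized w, x, y, z example used on the next slides.

```python
def strong_elimination_order(adjacency, continuous, discrete):
    """Greedy min-degree order that eliminates every continuous variable
    before any discrete one. `adjacency` maps a variable to its neighbors."""
    adj = {v: set(ns) for v, ns in adjacency.items()}
    order = []
    for group in (set(continuous), set(discrete)):
        while group:
            v = min(group, key=lambda u: len(adj[u]))  # min-degree heuristic
            order.append(v)
            group.discard(v)
            neighbors = adj.pop(v)
            for a in neighbors:                        # fill-in edges among v's neighbors
                for b in neighbors:
                    if a != b and a in adj and b in adj:
                        adj[a].add(b)
            for u in adj:
                adj[u].discard(v)
    return order

# Assumed moral graph for the example (w, x discrete; y, z continuous).
adjacency = {'w': {'x', 'y', 'z'}, 'x': {'w', 'y'},
             'y': {'w', 'x', 'z'}, 'z': {'w', 'y'}}
print(strong_elimination_order(adjacency, continuous=['y', 'z'], discrete=['w', 'x']))
```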

Elimination order (example over the variables w, y, x, z, where W and X are discrete and Y and Z are continuous; moralization adds an extra edge, shown in the figure). Strong elimination order: first eliminate continuous variables; eliminate a discrete variable only when no continuous variables remain.

Elimination order (1): step 1 eliminates the first continuous variable.

Elimination order (2): step 2 eliminates the second continuous variable.

Elimination order (3): step 3 eliminates the discrete variable x.

Elimination order (4): step 4 eliminates the remaining discrete variable w. The resulting cliques are {w,y,z} (clique 1) and {w,x,y} (clique 2), connected by the separator {w,y}.

Bucket tree or junction tree (1): clique 1 = {w,y,z}, clique 2 = {w,x,y} (the root), separator = {w,y}.
Assigning Functions to cliques Select a function and place it in an arbitrary clique that mentions all variables in the function.
Strong Root: we define a strong root as any node R in the bucket tree that satisfies the following property: for any pair (V,W) of neighbors on the tree with W closer to R than V, we have (V \ W) ⊆ Γ or (V ∩ W) ⊆ Δ, where Γ is the set of continuous variables and Δ the set of discrete variables.
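
Assuming the Lauritzen–Jensen form of the condition stated above, here is a small sketch that checks whether a given clique is a strong root; the two-clique tree is the running w, x, y, z example.

```python
def is_strong_root(tree, root, discrete, continuous):
    """For every edge (V, W) with W closer to `root` than V, require that either
    the variables in V but not in W are all continuous, or the variables shared
    by V and W are all discrete. `tree` maps each clique to its neighbor cliques."""
    discrete, continuous = set(discrete), set(continuous)
    visited, stack = {root}, [root]
    while stack:                       # walk the tree outward from the root
        w = stack.pop()
        for v in tree[w]:
            if v in visited:
                continue
            visited.add(v)
            stack.append(v)
            if not ((v - w) <= continuous or (v & w) <= discrete):
                return False
    return True

c1, c2 = frozenset({'w', 'y', 'z'}), frozenset({'w', 'x', 'y'})
tree = {c1: [c2], c2: [c1]}
print(is_strong_root(tree, c2, discrete=['w', 'x'], continuous=['y', 'z']))  # True
print(is_strong_root(tree, c1, discrete=['w', 'x'], continuous=['y', 'z']))  # False
```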

Example: a junction tree with its strong root highlighted.
Message passing at a typical node: node "a" contains the functions assigned to it by the tree-decomposition scheme, denoted p_j(a), and exchanges messages with its neighbors (b, x1 and x2 in the figure).

Message Passing: a two-pass algorithm (bucket-tree propagation): Collect messages toward the root, then Distribute messages from the root. (Figure from P. Green.)
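
A structural sketch of the two-pass schedule only (the actual potential updates, i.e. multiplication, division and marginalization, are not shown); the clique names come from the running example.

```python
def two_pass_schedule(tree, root):
    """Return (collect, distribute): collect sends messages leaf-to-root,
    distribute sends them root-to-leaf. `tree` maps a clique to its neighbors."""
    discovered, visited, stack = [], {root}, [root]
    while stack:                                   # walk the tree from the root
        u = stack.pop()
        for v in tree[u]:
            if v not in visited:
                visited.add(v)
                discovered.append((v, u))          # v was reached from u
                stack.append(v)
    collect = [(child, parent) for child, parent in reversed(discovered)]
    distribute = [(parent, child) for child, parent in discovered]
    return collect, distribute

tree = {'wyz': ['wxy'], 'wxy': ['wyz']}
collect, distribute = two_pass_schedule(tree, root='wxy')
print(collect)      # [('wyz', 'wxy')]
print(distribute)   # [('wxy', 'wyz')]
```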

Let's look at the messages. Collect Evidence: each message sent toward the strong root integrates out continuous variables (∫C, ∫L, ∫Min, ∫Mout, ∫D in the figure).

Distribute Evidence: messages sent away from the strong root may sum out discrete variables (∑W, ∑B, ∑F in the figure) as well as integrate out continuous ones (∫E).

Lauritzen's theorem: if message passing is performed such that collect evidence uses only strong marginals, while distribute evidence may use weak marginals, the junction-tree algorithm is exact in the sense that the first moments (means) and second moments (variances) computed are the true moments.

Complexity: polynomial in the number of continuous variables in a clique (n^3); exponential in the number of discrete variables in a clique. Possible options for approximation: ignore the strong-root assumption and use approximations such as MBTE, IJGP or sampling, or respect the strong-root assumption and use approximations such as MBTE, IJGP or sampling; if done in one pass of MBTE, inaccuracies are due only to discrete variables.

Initialization (1): the example network over w, y, x, z, where the discrete variables w and x have dimension 2. Priors: P(w=0) = P(w=1) = 0.5; P(x=0) = 0.4, P(x=1) = 0.6.

Initialization (2): clique 1 = {w,y,z}, clique 2 = {w,x,y} (the root), separator = {w,y}. All functions are converted to canonical form: the discrete priors become w=0: g=log(0.5); w=1: g=log(0.5); x=0: g=log(0.4); x=1: g=log(0.6), each with empty h and K, while the linear Gaussian CPDs become canonical forms (g, h, K) indexed by the values of their discrete parents (e.g. K = [0.1 0; 0 0.1] for x=0).

Initialization (3): the clique potentials after multiplying in the assigned functions: clique 1 {w,y,z} holds one canonical form per value of w, clique 2 {w,x,y} (the root) holds one per configuration of (w,x) (e.g. K = [0.1 0; 0 0.1] for wx=00 and wx=10), and the separator {w,y} is empty.

Message Passing: collect evidence from clique 1 {w,y,z} into the root clique 2 {w,x,y}, then distribute evidence back; the separator {w,y} is initially empty.

Collect evidence (1): clique 1 marginalizes its potential down to the separator variables (w,y) and sends the resulting message toward the root.

Collect evidence (2): the continuous variable z is integrated out of clique 1 in canonical form, producing a separator message (g, h, K) for each value of w; the resulting h and K entries are on the order of 1.0e-16, i.e. numerically zero.

Collect evidence (3): the separator message is multiplied into the root clique by adding the canonical parameters, for each configuration of (w,x).

Distribute evidence (1): the message from the root back to clique 1 is the new separator potential divided by the old one (subtracting canonical parameters), computed for each value of w.

Distribute evidence (2): the updated separator potential over {w,y}, one canonical form per value of w.

Distribute evidence (3): the root clique marginalizes out the discrete variable x (a weak marginal), producing for each value of w a moment-form Gaussian (logp, mu, Sigma) over the remaining variables.

Distribute evidence (4): the weak marginal is converted back to canonical form and multiplied into clique 1.

Distribute evidence (5): clique 1 now holds its updated potential, one canonical form per value of w.

After Message Passing: clique 1 holds p(w,y,z), clique 2 (the root) holds p(w,x,y), and the separator holds p(w,y): the local marginal distributions.