
1 Exact Inference Eran Segal Weizmann Institute

2 Course Outline
Week 1: Introduction, Bayesian network representation (reading: 1-3)
Week 2: Bayesian network representation cont. (1-3)
Week 3: Local probability models (4)
Week 4: Undirected graphical models (5)
Week 5: Exact inference (7,8)
Week 6: Exact inference cont. (9)
Week 7: Approximate inference (10)
Week 8: Approximate inference cont. (11)
Week 9: Learning: parameters (13,14)
Week 10: Learning: parameters cont. (14)
Week 11: Learning: structure (15)
Week 12: Partially observed data (16)
Week 13: Learning undirected graphical models (17)
Week 14: Template models (18)
Week 15: Dynamic Bayesian networks (18)

3 Inference
- Markov networks and Bayesian networks represent a joint probability distribution
- The network contains the information needed to answer any query about the distribution
- Inference is the process of answering such queries
- The direction of edges between variables does not restrict the queries that can be asked
- Inference combines evidence from all parts of the network

4 Likelihood Queries
- Compute the probability (= likelihood) of the evidence
- Evidence: a subset of variables E and an assignment e
- Task: compute P(E=e)
- Computation: sum out all other variables, P(E=e) = Σ_y P(Y=y, E=e) where Y = U - E

5 Conditional Probability Queries
- Evidence: a subset of variables E and an assignment e
- Query: a subset of variables Y
- Task: compute P(Y | E=e)
- Applications: medical and fault diagnosis, genetic inheritance
- Computation: P(Y | E=e) = P(Y, E=e) / P(E=e), summing out the remaining variables

6 Maximum A Posteriori Assignment
- Maximum A Posteriori Assignment (MAP)
- Evidence: a subset of variables E and an assignment e
- Query: a subset of variables Y
- Task: compute MAP(Y | E=e) = argmax_y P(Y=y | E=e)
- Note 1: there may be more than one possible solution
- Note 2: equivalent to computing argmax_y P(Y=y, E=e), since P(Y=y | E=e) = P(Y=y, E=e) / P(E=e) and P(E=e) does not depend on y
- Computation: maximize over Y after summing out the remaining variables

7 Most Probable Assignment: MPE
- Most Probable Explanation (MPE)
- Evidence: a subset of variables E and an assignment e
- Query: all other variables Y (Y = U - E)
- Task: compute MPE(Y | E=e) = argmax_y P(Y=y | E=e)
- Note: there may be more than one possible solution
- Applications: decoding messages (find the most likely transmitted bits), diagnosis (find a single most likely consistent hypothesis)

8 Most Probable Assignment: MPE
- Note: we are searching for the most likely joint assignment to all variables
- This may differ from the most likely assignment to each RV separately
- Example: network A -> B with
  P(A): P(a0) = 0.4, P(a1) = 0.6
  P(B | A): P(b0 | a0) = 0.1, P(b1 | a0) = 0.9, P(b0 | a1) = 0.5, P(b1 | a1) = 0.5
- Joint: P(a0, b0) = 0.04, P(a0, b1) = 0.36, P(a1, b0) = 0.3, P(a1, b1) = 0.3
- P(a1) > P(a0), so MAP(A) = a1, yet MPE(A,B) = {a0, b1}
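The gap between MAP(A) and MPE(A,B) can be checked by brute force; the following Python sketch simply multiplies out the two CPDs above (the dict layout and names are illustrative, not from the slides):

```python
# Minimal sketch of the slide's example: MAP over A alone differs from the MPE over (A,B).
p_a = {"a0": 0.4, "a1": 0.6}                       # P(A)
p_b_given_a = {"a0": {"b0": 0.1, "b1": 0.9},       # P(B | A)
               "a1": {"b0": 0.5, "b1": 0.5}}

joint = {(a, b): p_a[a] * p_b_given_a[a][b]
         for a in p_a for b in ("b0", "b1")}
print(max(p_a, key=p_a.get))          # a1 -> MAP(A)
print(max(joint, key=joint.get))      # ('a0', 'b1') -> MPE(A,B), probability 0.36
```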

9 Exact Inference in Graphical Models
- Graphical models can be used to answer conditional probability queries, MAP queries, and MPE queries
- Naive approach: generate the joint distribution and, depending on the query, compute the sum/max -> exponential blowup
- Instead, exploit independencies for efficient inference

10 Complexity of Bayesnet Inference
- Assume the encoding specifies the DAG structure
- Assume CPDs are represented as tables
- Decision problem: given a network G, a variable X and a value x ∈ Val(X), decide whether P_G(X=x) > 0

11 Complexity of Bayesnet Inference
- Theorem: the decision problem is NP-complete
- Proof: the decision problem is in NP: for an assignment e to all network variables, check whether X=x in e and P(e) > 0
- Reduction from 3-SAT:
  Binary-valued variables Q1,...,Qn
  Clauses C1,...,Ck where Ci = L_i,1 ∨ L_i,2 ∨ L_i,3
  Each literal L_i,j (i=1,...,k, j=1,2,3) is some Qm or ¬Qm
  φ = C1 ∧ ... ∧ Ck
  Decision problem: is there an assignment to Q1,...,Qn satisfying φ?
- Construct a network such that P(X=1) > 0 iff φ is satisfiable

12 Complexity of Bayesnet Inference
- Network: variables Q1,...,Qn, clause nodes C1,...,Ck, a chain of AND nodes A1,...,A_{k-2}, and an output node X
- P(Qi=1) = 0.5
- P(Ci=1 | Pa(Ci)) = 1 iff the assignment to Pa(Ci) satisfies clause Ci (deterministic CPD)
- The CPDs of A1,...,A_{k-2} and X are deterministic ANDs of their parents
- P(X=1 | q1,...,qn) = 1 iff q1,...,qn satisfies φ
- P(X=1) > 0 iff there is a satisfying assignment
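As a sanity check on what the construction computes (not on its efficiency): with P(Qi=1)=0.5 and deterministic clause and AND nodes, P(X=1) equals the fraction of satisfying assignments, so P(X=1) > 0 iff φ is satisfiable. The toy formula and its encoding in the Python sketch below are hypothetical illustrations:

```python
from itertools import product

# Hypothetical toy formula: phi = (Q1 v Q2 v ~Q3) ^ (~Q1 v Q3 v Q2).
# Each clause is a list of (variable index, sign) literals, mirroring the slide's construction.
clauses = [[(0, True), (1, True), (2, False)], [(0, False), (2, True), (1, True)]]
n = 3

def p_x1():
    # P(Qi=1)=0.5, each Ci is a deterministic OR of its literals, X is a deterministic AND
    # of all clauses, so P(X=1) = (# satisfying assignments) / 2^n.
    sat = sum(all(any(q[i] == s for i, s in c) for c in clauses)
              for q in product([False, True], repeat=n))
    return sat / 2 ** n

print(p_x1() > 0)   # True iff phi is satisfiable
```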

13 Complexity of Bayesnet Inference
- Easy to check:
  Polynomial number of variables
  Each CPD can be described by a small table (at most 8 parameters)
  P(X=1) > 0 if and only if there exists a satisfying assignment to Q1,...,Qn
- Conclusion: a polynomial reduction from 3-SAT
- Implications:
  We cannot hope for a general efficient procedure that works for all networks
  We can find provably efficient procedures for particular families of networks
  Exploit network structure and independencies; dynamic programming

14 Approximate Inference
- Rather than computing the exact answer, compute an approximation to it
- Approximation metrics for computing P(y|e):
  An estimate p has absolute error ε if |P(y|e) - p| ≤ ε
  An estimate p has relative error ε if p / (1+ε) ≤ P(y|e) ≤ p(1+ε)
- Absolute error is not very useful for probability distributions, since probabilities are often small

15 Approximate Inference Complexity
- Theorem: the following is NP-hard
  Given a network G over n variables, a variable X and a value x ∈ Val(X), find a number p that has relative error ε(n) for the query P_G(X=x)
- Proof: based on the hardness result for exact inference
  We showed that deciding P_G(X=x) > 0 is hard
  An algorithm that returns an estimate p with any relative error must return p > 0 iff P_G(X=x) > 0
  So approximate inference with relative error is as hard as the exact decision problem, hence NP-hard

16 Approximate Inference Complexity
- Theorem: the following is NP-hard for ε < 0.5
  Given a network G over n variables, a variable X, a value x ∈ Val(X), and an observation e ∈ Val(E) for variables E, find a number p that has absolute error ε for P_G(X=x | E=e)
- Proof: consider the same network construction as above
- Strategy of the proof: show that given such an approximation, we can determine satisfiability in polynomial time

17 Approximate Inference Complexity
- Proof cont.: construction
  Use the approximate algorithm to compute the query P(Q1 | X=1)
  Assign Q1 the value q that has the higher posterior probability
  Generate a new network without Q1 and with modified CPDs
  Repeat this process for all Qi
- Claim: the assignment generated by this process satisfies φ iff φ has a satisfying assignment
- Proving the claim:
  Easy case: if φ has no satisfying assignment, then obviously the resulting assignment does not satisfy φ
  Harder case: if φ has a satisfying assignment, we show that it has a satisfying assignment with Q1 = q

18 Approximate Inference Complexity
- Proof cont.: proving the claim
  Easy case: if φ has no satisfying assignment, then obviously the resulting assignment does not satisfy φ
  Harder case: if φ has a satisfying assignment, we show that it has a satisfying assignment with Q1 = q
  If φ is satisfiable both with q and with ¬q, then we are done
  If φ is not satisfiable with Q1 = q, then P(Q1=q | X=1) = 0 and P(Q1=¬q | X=1) = 1; since our approximation has absolute error < 0.5, we necessarily choose ¬q, which has a satisfying assignment
  By induction on all the Q variables, the assignment we find must satisfy φ
- The construction process is polynomial

19 Inference Complexity Summary
- NP-hard:
  Exact inference
  Approximate inference with relative error
  Approximate inference with absolute error < 0.5 (given evidence)
- Hopeless? No: we will see many network structures that have provably efficient algorithms, and cases where approximate inference works efficiently with high accuracy

20 Exact Inference: Variable Elimination
- Inference in a simple chain X1 -> X2 -> X3
- Computing P(X2): P(X2) = Σ_{x1} P(x1) P(X2 | x1)
- All the numbers needed for this computation are in the CPDs of the original Bayesian network
- O(|X1| |X2|) operations

21 Exact Inference: Variable Elimination
- Inference in a simple chain X1 -> X2 -> X3
- Computing P(X2) as above, then computing P(X3): P(X3) = Σ_{x2} P(x2) P(X3 | x2)
- P(X3 | X2) is a given CPD, and P(X2) was computed above
- O(|X1| |X2| + |X2| |X3|) operations

22 Exact Inference: Variable Elimination
- Inference in a general chain X1 -> X2 -> ... -> Xn
- Computing P(Xn): compute each P(X_{i+1}) from P(X_i)
- k^2 operations for each step (assuming |X_i| = k), so O(n k^2) operations for the whole inference
- Compare to the k^n operations required to sum over all entries of the joint distribution over X1,...,Xn
- Inference in a chain can be done in linear time!
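A minimal numeric sketch of this forward pass, using NumPy with randomly generated toy CPDs (the sizes and names are assumptions for illustration only):

```python
import numpy as np

# Sketch of forward chain inference: P(X_{i+1}) = sum_{x_i} P(x_i) P(X_{i+1} | x_i).
k, n = 3, 5
rng = np.random.default_rng(0)
prior = rng.dirichlet(np.ones(k))                                  # P(X1)
cpds = [rng.dirichlet(np.ones(k), size=k) for _ in range(n - 1)]   # each row: P(X_{i+1} | X_i=row)

marginal = prior
for cpd in cpds:            # O(n k^2) in total
    marginal = marginal @ cpd   # sum out X_i
print(marginal)             # P(X_n)
```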

23 Exact Inference: Variable Elimination
- Chain X1 -> X2 -> X3 -> X4: pushing summations inside the product = dynamic programming

24 Inference With a Loop
- Computing P(X4) in a network over X1, X2, X3, X4 that contains a loop

25 Efficient Inference in Bayesnets
- Properties that allow us to avoid the exponential blowup of the joint distribution:
  The Bayesian network structure means some subexpressions depend on only a small number of variables
  Computing these subexpressions once and caching the results avoids generating them exponentially many times

26 Variable Elimination
- The inference algorithm is defined in terms of factors
- A factor is a function from value assignments of a set of random variables D to the positive reals ℝ+
- The set of variables D is the scope of the factor
- Factors generalize the notion of CPDs
- Thus, the algorithm we describe applies to both Bayesian networks and Markov networks

27 Variable Elimination: Factors
- Let X, Y, Z be three disjoint sets of RVs, and let φ1(X,Y) and φ2(Y,Z) be two factors
- We define the factor product φ1 × φ2 to be the factor ψ: Val(X,Y,Z) -> ℝ given by ψ(X,Y,Z) = φ1(X,Y) · φ2(Y,Z)

28 Variable Elimination: Factors
- Let X be a set of RVs, Y ∉ X a RV, and φ(X,Y) a factor
- We define the factor marginalization of Y to be the factor ψ: Val(X) -> ℝ given by ψ(X) = Σ_Y φ(X,Y)
- Also called summing out Y
- In a Bayesian network, summing out all variables gives 1
- In a Markov network, summing out all variables gives the partition function
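A minimal dict-based sketch of these two factor operations (binary domains are assumed for brevity; this is illustrative, not an optimized implementation):

```python
from itertools import product

# A factor is a pair (scope, table): scope is a tuple of variable names, table maps
# full assignments (tuples of values in scope order) to non-negative reals.
DOMAIN = (0, 1)

def factor_product(scope1, f1, scope2, f2):
    """psi(X,Y,Z) = f1(X,Y) * f2(Y,Z), enumerating the union scope."""
    scope = scope1 + tuple(v for v in scope2 if v not in scope1)
    out = {}
    for assignment in product(DOMAIN, repeat=len(scope)):
        a = dict(zip(scope, assignment))
        out[assignment] = (f1[tuple(a[v] for v in scope1)] *
                           f2[tuple(a[v] for v in scope2)])
    return scope, out

def marginalize(scope, f, var):
    """Sum `var` out of the factor (scope, f)."""
    new_scope = tuple(v for v in scope if v != var)
    out = {}
    for assignment, val in f.items():
        key = tuple(x for v, x in zip(scope, assignment) if v != var)
        out[key] = out.get(key, 0.0) + val
    return new_scope, out
```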

29 Variable Elimination: Factors
- Factor products are commutative: φ1 × φ2 = φ2 × φ1
- Summations commute: Σ_X Σ_Y φ(X,Y) = Σ_Y Σ_X φ(X,Y)
- Products are associative: (φ1 × φ2) × φ3 = φ1 × (φ2 × φ3)
- If X ∉ Scope[φ1], then Σ_X (φ1 × φ2) = φ1 × Σ_X φ2 (we used this in the chain elimination above)

30 Inference in a Chain by Factors
- Chain X1 -> X2 -> X3 -> X4, with one factor per CPD
- The scopes of the factors for X3 and X4 do not contain X1, and the scope of the factor for X4 does not contain X2, so those factors can be pulled out of the corresponding summations

31 Sum-Product Inference
- Let Y be the query RVs and Z be all other RVs
- The general inference task is P(Y) = Σ_Z Π_φ φ
- Effective computation: since the scope of each factor is limited, "push in" some of the summations and perform them over the product of only a subset of the factors

32 Sum-Product Variable Elimination
- Algorithm: sum out the variables one at a time
- When summing out a variable, multiply all the factors that mention the variable, generating a product factor
- Then sum the variable out of the combined factor, generating a new factor that does not mention it
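Putting the factor operations from slide 28 together, a sketch of the sum-product elimination loop (it reuses the factor_product and marginalize helpers defined above; error handling is omitted, and the slide-8 factors at the end are only an example):

```python
def sum_product_ve(factors, elimination_order):
    """Sum-product VE over a list of (scope, table) factors."""
    factors = list(factors)
    for var in elimination_order:
        involved = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        if not involved:
            continue
        scope, table = involved[0]
        for s2, t2 in involved[1:]:                 # multiply all factors mentioning var
            scope, table = factor_product(scope, table, s2, t2)
        rest.append(marginalize(scope, table, var)) # sum var out of the product
        factors = rest
    scope, table = factors[0]                       # multiply the remaining factors
    for s2, t2 in factors[1:]:
        scope, table = factor_product(scope, table, s2, t2)
    return scope, table

# Example: P(B) for the slide-8 network A -> B, eliminating A.
sA, fA = ("A",), {(0,): 0.4, (1,): 0.6}                                       # P(A)
sBA, fBA = ("B", "A"), {(0, 0): 0.1, (1, 0): 0.9, (0, 1): 0.5, (1, 1): 0.5}   # P(B|A)
print(sum_product_ve([(sA, fA), (sBA, fBA)], ["A"]))  # (('B',), {(0,): 0.34, (1,): 0.66})
```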

33 Sum-Product Variable Elimination
- Theorem: let X be a set of RVs and F a set of factors such that Scope[φ] ⊆ X for each φ ∈ F; let Y ⊆ X be the query RVs and Z = X - Y
- For any ordering ≺ over Z, the algorithm above returns a factor φ*(Y) such that φ*(Y) = Σ_Z Π_{φ∈F} φ
- Instantiation for a Bayesian network query P_G(Y): F consists of all CPDs in G, each φ_Xi = P(Xi | Pa(Xi)), and we apply variable elimination to U - Y

34 A More Complex Network
- Network over C, D, I, G, S, L, J, H
- Goal: P(J); eliminate C, D, I, H, G, S, L in that order

35 A More Complex Network
- Goal: P(J); eliminate C, D, I, H, G, S, L
- First step: multiply the factors mentioning C and sum C out

36 A More Complex Network
- Goal: P(J); remaining to eliminate: D, I, H, G, S, L
- Next: multiply the factors mentioning D and sum D out

37 A More Complex Network
- Goal: P(J); remaining to eliminate: I, H, G, S, L
- Next: sum out I

38 A More Complex Network
- Goal: P(J); remaining to eliminate: H, G, S, L
- Next: sum out H

39 A More Complex Network
- Goal: P(J); remaining to eliminate: G, S, L
- Next: sum out G

40 A More Complex Network
- Goal: P(J); remaining to eliminate: S, L
- Next: sum out S

41 A More Complex Network
- Goal: P(J); remaining to eliminate: L
- Next: sum out L, leaving a factor over J alone

42 A More Complex Network
- Goal: P(J); a different ordering: eliminate G, I, S, L, H, C, D
- Note: with this ordering an intermediate factor is large: f1(I, D, L, J, H)

43 Inference With Evidence
- Let Y be the query RVs
- Let E be the evidence RVs and e their assignment
- Let Z be all other RVs (Z = U - Y - E)
- The general inference task is P(Y | E=e) = P(Y, e) / P(e), where P(Y, e) = Σ_Z Π_φ φ[E=e] (each factor restricted to the evidence)
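Evidence can be folded in by restricting each factor before elimination; a small sketch in the same dict-based representation used above (names are assumptions for illustration):

```python
def reduce_factor(scope, table, evidence):
    """Restrict a factor to the rows consistent with evidence (a dict var -> value)
    and drop the evidence variables from its scope."""
    keep = [i for i, v in enumerate(scope) if v not in evidence]
    new_scope = tuple(scope[i] for i in keep)
    out = {}
    for assignment, val in table.items():
        if all(assignment[i] == evidence[v]
               for i, v in enumerate(scope) if v in evidence):
            out[tuple(assignment[i] for i in keep)] = val
    return new_scope, out
```

Running sum_product_ve on the reduced factors and renormalizing the resulting table then gives P(Y | E=e).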

44 Inference With Evidence
- Goal: P(J | H=h, I=i)
- Eliminate C, D, G, S, L after restricting the factors to H=h and I=i, producing a factor f(J, H=h, I=i) that is then normalized

45 Complexity of Variable Elimination
- Variable elimination consists of generating the factors to be summed out, and summing out
- Generating the factor ψ_i = φ1 × ... × φ_{k_i}: let X_i be its scope; each entry requires k_i multiplications, so generating ψ_i is O(k_i |Val(X_i)|)
- Summing out: at most |Val(X_i)| addition operations
- Per factor: O(k N), where N = max_i |Val(X_i)| and k = max_i k_i

46 Complexity of Variable Elimination
- Start with n factors (n = number of variables); exactly one new factor is generated at each iteration, so there are at most 2n factors overall
- Generating factors: at most Σ_i |Val(X_i)| k_i ≤ N Σ_i k_i ≤ N · 2n (since each factor is multiplied in exactly once and there are at most 2n factors)
- Summing out: Σ_i |Val(X_i)| ≤ N · n (since we perform n summing-out steps)
- Total work is linear in N and n
- The exponential blowup hides in N: if factor i involves m variables with v values each, |Val(X_i)| = v^m

47 VE as Graph Transformation
- At each elimination step, draw a graph with an undirected edge X—Y whenever variables X and Y appear in the same factor
- Note: this is the Markov network of the distribution over the variables that have not yet been eliminated

48 VE as Graph Transformation
- Goal: P(J); eliminate C, D, I, H, G, S, L
- Start from the Markov network over C, D, I, G, S, L, J, H

49 VE as Graph Transformation
- Goal: P(J); eliminate C, D, I, H, G, S, L
- Eliminate C: connect C's neighbors, then remove C

50 VE as Graph Transformation
- Goal: P(J); remaining to eliminate: D, I, H, G, S, L
- Eliminate D

51 VE as Graph Transformation
- Goal: P(J); remaining to eliminate: I, H, G, S, L
- Eliminate I

52 VE as Graph Transformation
- Goal: P(J); remaining to eliminate: H, G, S, L
- Eliminate H

53 VE as Graph Transformation
- Goal: P(J); remaining to eliminate: G, S, L
- Eliminate G

54 VE as Graph Transformation
- Goal: P(J); remaining to eliminate: S, L
- Eliminate S

55 VE as Graph Transformation
- Goal: P(J); remaining to eliminate: L
- Eliminate L

56 The Induced Graph
- The induced graph I_{F,≺} over factors F and ordering ≺ is undirected: X_i and X_j are connected if they appear together in some factor generated while running the VE algorithm with ≺ as the ordering
- The figure compares the original graph over C, D, I, G, S, L, J, H with its induced graph

57 The Induced Graph
- The induced graph I_{F,≺} over factors F and ordering ≺ is undirected: X_i and X_j are connected if they appear together in some factor generated while running the VE algorithm with ≺ as the ordering
- The width of an induced graph is the number of nodes in its largest clique minus 1
- The minimal induced width of a graph K is min_≺ width(I_{K,≺})
- The minimal induced width provides a lower bound on the best performance achievable by applying VE to a model that factorizes over K
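The width induced by a particular ordering can be computed by simulating the elimination on the graph; a sketch using networkx (the library choice and the example graph are illustrative assumptions):

```python
import networkx as nx

def induced_width(graph, order):
    """Simulate elimination along `order` on an undirected networkx graph and return the
    induced width: the size of the largest clique formed, minus 1. A sketch, not optimized."""
    g = graph.copy()
    width = 0
    for v in order:
        nbrs = list(g.neighbors(v))
        width = max(width, len(nbrs))      # v plus its neighbors form a clique
        for i, a in enumerate(nbrs):       # add fill edges among the neighbors
            for b in nbrs[i + 1:]:
                g.add_edge(a, b)
        g.remove_node(v)
    return width

print(induced_width(nx.path_graph(4), [0, 1, 2, 3]))   # 1: chains (and trees) have induced width 1
```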

58 The Induced Graph
- Finding the optimal ordering is NP-hard
- Theorem: for a graph H, determining whether any elimination ordering achieves induced width ≤ K is NP-complete
- Note: this NP-hardness result is distinct from the NP-hardness of inference; even given the minimal induced width, inference may still be exponential
- Hopeless? No: heuristic techniques can find good elimination orderings

59 Finding Elimination Orderings
- Reduce to finding a triangulation with small cliques
- This is valid because of two results:
  Theorem: every induced graph is chordal
  Proof: assume by contradiction that the induced graph contains a chordless cycle X1—X2—...—Xk—X1 with k ≥ 4. Let Xi be the first node of the cycle to be eliminated. When Xi is eliminated, the edges Xi—X_{i+1} and X_{i-1}—Xi already exist and no further edges incident to Xi are added later, so the fill edge X_{i-1}—X_{i+1} is added at that point, contradicting the assumption that the cycle is chordless
  Theorem: every chordal graph has an elimination ordering that introduces no new fill edges
- Use graph-theoretic algorithms for triangulation
- Greedy search using a heuristic cost function: at each step, eliminate the node with the smallest cost
- Possible costs: number of neighbors in the current graph, number of neighbors of neighbors, number of fill edges added (see the sketch below)
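A sketch of the greedy min-fill heuristic mentioned in the last bullet (again using networkx; it is a heuristic with no optimality guarantee, and the example graph is arbitrary):

```python
import networkx as nx

def min_fill_order(graph):
    """Greedy elimination ordering: repeatedly pick the node whose elimination adds the
    fewest fill edges."""
    g = graph.copy()
    order = []

    def fill_cost(v):
        nbrs = list(g.neighbors(v))
        return sum(1 for i, a in enumerate(nbrs) for b in nbrs[i + 1:]
                   if not g.has_edge(a, b))

    while g.number_of_nodes() > 0:
        v = min(g.nodes, key=fill_cost)
        nbrs = list(g.neighbors(v))
        for i, a in enumerate(nbrs):        # add the fill edges v's elimination creates
            for b in nbrs[i + 1:]:
                g.add_edge(a, b)
        g.remove_node(v)
        order.append(v)
    return order

print(min_fill_order(nx.cycle_graph(4)))   # e.g. [0, 1, 2, 3]
```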

60 Elimination On Trees
- Tree Bayesian network: each variable has at most one parent, so all factors involve at most two variables
- Elimination: repeatedly eliminate leaf variables; this maintains the tree structure
- Induced width = 1

61 Elimination on PolyTrees
- PolyTree Bayesian network: at most one path between any two variables
- Theorem: inference in a polytree is linear in the size of the network representation

62 Inference By Conditioning
- General idea: enumerate the possible values of a variable, apply variable elimination in a simplified network for each value, and aggregate the results
- Example: in the network over C, D, I, G, S, L, J we have Ind(G; S | I); observing I lets us transform the CPDs of G and S to eliminate I as a parent

63 Inference By Conditioning
- Compute P_G(J) by summing over the values of I: run inference in the simplified network for each value i and weight by P_G(I=i)
- How do we compute P_G(I)? Inference in G (simple here), or restrict the factors in the undirected representation

64 Inference By Conditioning
- In the undirected view: create a factor for each CPD and restrict the factors to I=i
- The partition function of the restricted network is P(I=i), which can itself be computed by inference

65 Cutset Conditioning
- Select a subset of nodes X ⊆ U
- Define the conditioned Bayesian network G_{X=x}: it has the same variables and structure as G, except that all outgoing edges of nodes in X are deleted, and the CPDs of the nodes that lost a parent are restricted to the assignment X=x
- X is a cutset in G if G_{X=x} is a polytree
- Compute the original query by aggregating over the cutset: P(Y) = Σ_{x ∈ Val(X)} P(Y, X=x), where each term is computed by inference in G_{X=x} (see the sketch below)
- Cost is exponential in the size of the cutset
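A schematic of the aggregation step in Python; here `polytree_inference` is a placeholder for whatever polytree VE routine the caller uses on the conditioned network, so this is a sketch of the bookkeeping only:

```python
from itertools import product

def cutset_conditioning(query_var, cutset, domains, polytree_inference):
    """Enumerate cutset assignments x, run the (assumed) polytree inference routine to get
    the unnormalized joint P(query_var, X=x) in G_{X=x}, and sum the results over x."""
    total = None
    for values in product(*(domains[v] for v in cutset)):
        assignment = dict(zip(cutset, values))
        partial = polytree_inference(query_var, assignment)   # dict y -> P(y, X=x)
        total = partial if total is None else {y: total[y] + partial[y] for y in partial}
    z = sum(total.values())
    return {y: p / z for y, p in total.items()}               # normalize to get P(Y)
```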

66 Cutset Conditioning Examples
- Original network over C, D, I, G, S, L, J, H
- {I} is not a cutset: conditioning on I does not turn the network into a polytree
- {G} is a cutset: conditioning on G breaks all the loops

67 Inference with Structured CPDs
- Idea: structured CPDs carry additional structure that can be exploited for more efficient inference

68 Independence of Causal Influence
- Causes X1,...,Xn, effect Y
- General case: Y has a complex dependency on X1,...,Xn
- Common case: each Xi influences Y separately, and the influences of X1,...,Xn are combined into an overall influence on Y

69 Example 1: Noisy OR
- Two independent causes X1, X2; Y=y1 cannot happen unless at least one of X1, X2 occurs
- The failure probabilities multiply: P(Y=y0 | X1, X2) = P(Y=y0 | X1, X2=x2^0) · P(Y=y0 | X1=x1^0, X2)
- CPD P(Y | X1, X2):
  x1^0, x2^0: P(y0) = 1, P(y1) = 0
  x1^0, x2^1: P(y0) = 0.2, P(y1) = 0.8
  x1^1, x2^0: P(y0) = 0.1, P(y1) = 0.9
  x1^1, x2^1: P(y0) = 0.02, P(y1) = 0.98

70 Noisy OR: Elaborate Representation
- Introduce noisy copies X'1, X'2 of the causes; Y is a deterministic OR of X'1, X'2
- Deterministic OR P(Y | X'1, X'2):
  x'1^0, x'2^0: P(y0) = 1, P(y1) = 0
  x'1^0, x'2^1: P(y0) = 0, P(y1) = 1
  x'1^1, x'2^0: P(y0) = 0, P(y1) = 1
  x'1^1, x'2^1: P(y0) = 0, P(y1) = 1
- Noisy CPD 1, P(X'1 | X1): x1^0 -> (1, 0); x1^1 -> (0.1, 0.9); noise parameter λ_X1 = 0.9
- Noisy CPD 2, P(X'2 | X2): x2^0 -> (1, 0); x2^1 -> (0.2, 0.8); noise parameter λ_X2 = 0.8
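The whole table on slide 69 is determined by these two noise parameters; a small sketch that reconstructs it (the optional leak term is an extension not used on the slide):

```python
def noisy_or_cpd(noise, leak=0.0):
    """Return P(Y=1 | x) for a noisy OR with per-cause noise parameters `noise`:
    the failure probabilities of the active causes (and the leak) multiply."""
    def p_y1(x):  # x is a tuple of 0/1 cause values
        fail = 1.0 - leak
        for xi, lam in zip(x, noise):
            if xi:
                fail *= (1.0 - lam)
        return 1.0 - fail
    return p_y1

p = noisy_or_cpd([0.9, 0.8])
print(p((0, 0)), p((0, 1)), p((1, 0)), p((1, 1)))   # 0.0 0.8 0.9 0.98
```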

71 Noisy OR Decomposition
- Network: Y with parents X1, X2, X3, X4; goal: compute P(Y)
- Naive approach (binary variables):
  4 multiplications for P(X1) × P(X2)
  8 multiplications for P(X1,X2) × P(X3)
  16 multiplications for P(X1,X2,X3) × P(X4)
  32 multiplications for P(X1,X2,X3,X4) × P(Y | X1,X2,X3,X4)
  30 additions to extract P(Y) from P(Y,X1,X2,X3,X4)
- Total: 60 multiplications, 30 additions

72 Noisy OR Decomposition
- Goal: compute P(Y)
- Decompose the noisy OR: introduce noisy copies Z1,...,Z4 of the causes and intermediate OR variables O1, O2, so that Y is computed through a tree (or cascade) of binary ORs instead of a single CPD over all four parents

73 Noisy OR Decomposition
- Goal: compute P(Y) in the tree decomposition with intermediate variables O1, O2
- 8 multiplications for P(X1) × P(O | X1, X2), 4 additions to sum out X1
- 4 multiplications for f(O, X2) × P(X2), 2 additions to sum out X2
- Similar cost for eliminating X3, X4 and then for O1 and O2
- Total: 3 × (8 + 4) = 36 multiplications and 3 × (4 + 2) = 18 additions

74 Noisy OR Decomposition
- Goal: compute P(Y) in the cascade decomposition with intermediate variables O1, O2, O3
- 4 multiplications and 2 additions to eliminate each Xi
- 8 multiplications and 4 additions to eliminate each Oi
- Total: 4×4 + 3×8 = 40 multiplications and 4×2 + 3×4 = 20 additions

75 General Formulation
- Let Y be a random variable with parents X1,...,Xn
- The CPD P(Y | X1,...,Xn) exhibits independence of causal influence (ICI) if it can be described with intermediate variables Z1,...,Zn (one per parent) and Z, where the CPD P(Z | Z1,...,Zn) is deterministic
- Noisy OR: each Zi has a noise model, Z is an OR of the Zi, and Y is the identity CPD of Z
- Logistic: Zi = w_i · 1(Xi=1), Z = Σ Zi, and Y is a logistic (sigmoid) function of Z

76 General Decomposition
- Independence of causal influence: a variable Y with parents X1,...,Xn
- Decompose Y by introducing n-1 intermediate variables O1,...,O_{n-1}
- Y and each Oi have exactly two parents from Z1,...,Zn, O1,...,O_{n-1}
- The CPD of Y and of each Oi is deterministic in its two parents
- Each Zi and each Oi is a parent of at most one variable among O1,...,O_{n-1} and Y

77 Context Specific Independence
- Idea: exploit the structure in a tree CPD or rule CPD
- Approach 1: decompose the CPD in a modified network structure
- Approach 2: modify the variable elimination algorithm to perform its operations on structured factors

78 Context Specific Independence
- Example: Y with parents A, X1, X2, X3, X4, where a tree CPD over the parents gives Y different dependencies in different contexts of A
- Decomposition: introduce variables Y_A1 and Y_A2, each with only the parents relevant in its context; A "selects" Y_A1 or Y_A2

79 General Decomposition
- Let Y be a variable, A one parent of Y, and X the remaining parents
- For each a ∈ Val(A) define a new variable Y_a
- The parents of Y_a are those variables X ∈ X such that the edge from X to Y is not spurious in the context A=a
- The CPD of Y_a is P(Y_a | Pa(Y_a)) = P(Y | a, Pa(Y_a))
- Y is a deterministic multiplexer CPD with A as the selector

80 Tree CPD Decomposition
- Tree CPD for Y over parents A, B, C, D, with splits on A, then B, then C, and leaves that depend on D
- Add A as the selector: introduce Y_a0 and Y_a1

81 Tree CPD Decomposition
- Add A as the selector, then B as a selector within the A=a1 branch: introduce Y_a1b0 and Y_a1b1

82 Tree CPD Decomposition
- Add A as the selector, then B, then C within the A=a1, B=b0 branch: introduce Y_a1b0c0 and Y_a1b0c1

83 MPE and MAP Queries
- Conditional probability queries: evidence E=e, query a subset of variables Y, compute P(Y | E=e) — sum-product
- Most Probable Explanation (MPE): evidence E=e, query all other variables Y = U - E, compute MPE(Y | E=e) = argmax_y P(Y=y | E=e); there may be more than one solution — max-product
- Maximum A Posteriori Assignment (MAP): evidence E=e, query a subset of variables Y, compute MAP(Y | E=e) = argmax_y P(Y=y | E=e) — max-sum-product
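For MPE, the elimination loop stays the same but summation is replaced by maximization (max-product). A sketch of the modified marginalization over the dict-based factors used earlier; the traceback needed to recover the actual argmax assignment is omitted:

```python
def max_marginalize(scope, table, var):
    """Max-product analogue of marginalization: keep the best value of `var` for each
    assignment to the remaining variables. Swapping this for `marginalize` in the
    sum_product_ve loop above gives the (unnormalized) MPE value."""
    new_scope = tuple(v for v in scope if v != var)
    out = {}
    for assignment, val in table.items():
        key = tuple(x for v, x in zip(scope, assignment) if v != var)
        out[key] = max(out.get(key, 0.0), val)
    return new_scope, out
```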

