Bayesian Networks. Slides from multiple sources: Weng-Keen Wong, School of Electrical Engineering and Computer Science, Oregon State University.

Example: Train Strike

Network: Train Strike -> Martin Late, Train Strike -> Norman Late

P(Train Strike):
  T  0.1
  F  0.9

P(Norman Late | Train Strike):
                   Train Strike = T   Train Strike = F
  Norman Late = T        0.8                0.1
  Norman Late = F        0.2                0.9

P(Martin Late | Train Strike):
                   Train Strike = T   Train Strike = F
  Martin Late = T        0.6                0.5
  Martin Late = F        0.4                0.5

Questions:
  P(“Martin Late”, “Norman Late”, “Train Strike”) = ?   (joint distribution)
  P(“Martin Late”) = ?                                  (marginal distribution)
  P(“Martin Late” | “Norman Late”) = ?                  (conditional distribution)

Example: Train Strike (joint distribution)

Question: P(“Martin Late”, “Norman Late”, “Train Strike”) = ?

With A = Martin Late, B = Norman Late, C = Train Strike, each entry is P(A | C) P(B | C) P(C), e.g., P(A=T, B=T, C=T) = 0.6 x 0.8 x 0.1 = 0.048.

  A  B  C  Probability
  T  T  T  0.048
  F  T  T  0.032
  T  F  T  0.012
  F  F  T  0.008
  T  T  F  0.045
  F  T  F  0.045
  T  F  F  0.405
  F  F  F  0.405

Example: Train Strike (marginal distribution over two variables)

Question: P(“Martin Late”, “Norman Late”) = ?

With A = Martin Late, B = Norman Late, C = Train Strike, sum the joint over C, e.g., P(A=T, B=T) = 0.048 + 0.045 = 0.093.

  A  B  Probability
  T  T  0.093
  F  T  0.077
  T  F  0.417
  F  F  0.413

Example: Train Strike (marginal distribution over one variable)

Question: P(“Martin Late”) = ?

With A = Martin Late and B = Norman Late, sum P(A, B) over B, e.g., P(A=T) = 0.093 + 0.417 = 0.51.

  A  Probability
  T  0.51
  F  0.49

Example: Train Strike (conditional distribution)

Question: P(“Martin Late” | “Norman Late”) = ?

With A = Martin Late and B = Norman Late, first obtain the marginal of B by summing P(A, B) over A:

  B  Probability
  T  0.17
  F  0.83

Then divide, e.g., P(A=T | B=T) = P(A=T, B=T) / P(B=T) = 0.093 / 0.17 ≈ 0.547.
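The three questions in the Train Strike example can be checked numerically. The following is a minimal Python sketch, assuming the CPT values shown on the slides; the names `P_STRIKE`, `P_MARTIN`, `P_NORMAN`, `bern`, and `joint` are illustrative and do not come from the original deck.

```python
from itertools import product

# CPT values copied from the slides; all names here are illustrative.
P_STRIKE = {True: 0.1, False: 0.9}   # P(Train Strike)
P_MARTIN = {True: 0.6, False: 0.5}   # P(Martin Late = T | Train Strike)
P_NORMAN = {True: 0.8, False: 0.1}   # P(Norman Late = T | Train Strike)

TF = (True, False)


def bern(p_true, value):
    """Probability of a Boolean value, given P(value = True)."""
    return p_true if value else 1.0 - p_true


def joint(martin, norman, strike):
    """P(Martin Late, Norman Late, Train Strike) via the network factorization."""
    return (bern(P_MARTIN[strike], martin)
            * bern(P_NORMAN[strike], norman)
            * P_STRIKE[strike])


# Marginals: sum the joint over the variables that are not of interest.
p_martin_norman = {(m, n): sum(joint(m, n, s) for s in TF)
                   for m, n in product(TF, TF)}
p_martin = sum(joint(True, n, s) for n, s in product(TF, TF))
p_norman = sum(joint(m, True, s) for m, s in product(TF, TF))

# Conditional: P(Martin Late | Norman Late) = P(both late) / P(Norman Late).
p_martin_given_norman = p_martin_norman[(True, True)] / p_norman

print(round(joint(True, True, True), 3))
print(round(p_martin, 2), round(p_norman, 2))
print(round(p_martin_given_norman, 3))
```

The printed values should agree with the tables above: 0.048 for the joint entry, 0.51 and 0.17 for the two marginals, and about 0.547 for the conditional.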

The Joint Probability Distribution

Joint probabilities can be between any number of variables, e.g., P(A = true, B = true, C = true).
For each combination of values of the variables, we need to say how probable that combination is.
The probabilities of these combinations need to sum to 1.

  A      B      C      P(A,B,C)
  false  false  false  0.1
  false  false  true   0.2
  false  true   false  0.05
  false  true   true   0.05
  true   false  false  0.3
  true   false  true   0.1
  true   true   false  0.05
  true   true   true   0.15
                       (sums to 1)

The Joint Probability Distribution

Once you have the joint probability distribution, you can calculate any probability involving A, B, and C.
You may need to use marginalization and Bayes rule.

  A      B      C      P(A,B,C)
  false  false  false  0.1
  false  false  true   0.2
  false  true   false  0.05
  false  true   true   0.05
  true   false  false  0.3
  true   false  true   0.1
  true   true   false  0.05
  true   true   true   0.15

Examples of things you can compute:
  P(A=true) = sum of P(A,B,C) in rows with A=true
  P(A=true, B=true | C=true) = P(A=true, B=true, C=true) / P(C=true)
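The two example queries can be read directly off the joint table. Below is a small Python sketch of that lookup, assuming the eight table entries as given on the slide; the helper name `prob` and the dictionary `JOINT` are illustrative.

```python
# Joint table P(A, B, C) from the slide, keyed by (A, B, C).
JOINT = {
    (False, False, False): 0.10,
    (False, False, True):  0.20,
    (False, True,  False): 0.05,
    (False, True,  True):  0.05,
    (True,  False, False): 0.30,
    (True,  False, True):  0.10,
    (True,  True,  False): 0.05,
    (True,  True,  True):  0.15,
}


def prob(predicate):
    """Sum the joint over all rows (a, b, c) that satisfy the predicate."""
    return sum(p for (a, b, c), p in JOINT.items() if predicate(a, b, c))


# Marginalization: P(A=true) is the sum of the rows with A=true.
p_a = prob(lambda a, b, c: a)

# Conditioning: P(A=true, B=true | C=true) = P(A, B, C all true) / P(C=true).
p_c = prob(lambda a, b, c: c)
p_ab_given_c = prob(lambda a, b, c: a and b and c) / p_c

print(round(p_a, 2), round(p_c, 2), round(p_ab_given_c, 2))
```

With these table values, P(A=true) comes out to 0.6, P(C=true) to 0.5, and P(A=true, B=true | C=true) to 0.3.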

Computing with Probabilities: Law of Total Probability

Law of Total Probability (aka “summing out” or marginalization):
  P(a) = Σ_b P(a, b) = Σ_b P(a | b) P(b), where B is any random variable

Why is this useful? Given a joint distribution (e.g., P(a, b, c, d)) we can obtain any “marginal” probability (e.g., P(b)) by summing out the other variables, e.g.,
  P(b) = Σ_a Σ_c Σ_d P(a, b, c, d)

Less obvious: we can also compute any conditional probability of interest given a joint distribution, e.g.,
  P(c | b) = Σ_a Σ_d P(a, c, d | b) = (1 / P(b)) Σ_a Σ_d P(a, c, d, b)
where 1 / P(b) is just a normalization constant.

Thus, the joint distribution contains the information we need to compute any probability of interest.

Computing with Probabilities: The Chain Rule or Factoring

We can always write
  P(a, b, c, …, z) = P(a | b, c, …, z) P(b, c, …, z)   (by the definition of conditional probability)

Repeatedly applying this idea, we can write
  P(a, b, c, …, z) = P(a | b, c, …, z) P(b | c, …, z) P(c | …, z) … P(z)

This factorization holds for any ordering of the variables.
This is the chain rule for probabilities.

The Problem with the Joint Distribution

Lots of entries in the table to fill up! For k Boolean random variables, you need a table of size 2^k (the table for A, B, C already has 2^3 = 8 rows).
How do we use fewer numbers? We need the concept of independence.

A Bayesian Network

A Bayesian network is made up of:
1. A Directed Acyclic Graph (DAG), e.g., A -> B, A -> C. Any DAG will work.
2. A set of tables for each node in the graph

  A      P(A)
  false  0.4/1 = 2/5
  true   0.6/1 = 3/5

  A      B      P(B|A)
  false  false  3/4
  false  true   1/4
  true   false  2/3
  true   true   1/3

  A      C      P(C|A)
  false  false  3/8
  false  true   5/8
  true   false  7/12
  true   true   5/12

A Bayesian Network

A Bayesian network is made up of:
1. A Directed Acyclic Graph: A -> B, B -> C, B -> D
2. A set of tables for each node in the graph

  A      P(A)
  false  0.6
  true   0.4

  A      B      P(B|A)
  false  false  0.01
  false  true   0.99
  true   false  0.7
  true   true   0.3

  B      C      P(C|B)
  false  false  0.4
  false  true   0.6
  true   false  0.9
  true   true   0.1

  B      D      P(D|B)
  false  false  0.02
  false  true   0.98
  true   false  0.05
  true   true   0.95

A Directed Acyclic Graph

Graph: A -> B, B -> C, B -> D

Each node in the graph is a random variable.
A node X is a parent of another node Y if there is an arrow from node X to node Y, e.g., A is a parent of B.
Informally, an arrow from node X to node Y means X has a direct influence on Y.

A Set of Tables for Each Node

Each node X_i has a conditional probability distribution P(X_i | Parents(X_i)) that quantifies the effect of the parents on the node.
The parameters are the probabilities in these conditional probability tables (CPTs).
For the example graph A -> B, B -> C, B -> D, the tables are P(A), P(B|A), P(C|B), and P(D|B).

A Set of Tables for Each Node

Conditional Probability Distribution for C given B:

  B      C      P(C|B)
  false  false  0.4
  false  true   0.6
  true   false  0.9
  true   true   0.1

If you have a Boolean variable with k Boolean parents, this table has 2^(k+1) probabilities (but only 2^k need to be stored).
For a given combination of values of the parents (B in this example), the entries for P(C=true | B) and P(C=false | B) must add up to 1, e.g., P(C=true | B=false) + P(C=false | B=false) = 1.

Using a Bayesian Network Example

Using the network in the example (A -> B, B -> C, B -> D), suppose you want to calculate:
  P(A = true, B = true, C = true, D = true)
    = P(A = true) * P(B = true | A = true) * P(C = true | B = true) * P(D = true | B = true)
    = (0.4)*(0.3)*(0.1)*(0.95)
    = 0.0114

Using a Bayesian Network Example (continued)

In the calculation
  P(A = true, B = true, C = true, D = true)
    = P(A = true) * P(B = true | A = true) * P(C = true | B = true) * P(D = true | B = true)
    = (0.4)*(0.3)*(0.1)*(0.95)
the form of the product (which conditional appears for each node) is from the graph structure, and the numbers are from the conditional probability tables.
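The factorized calculation above can be sketched in a few lines of Python. This is only an illustration under the CPT values from the slides; the names `P_A`, `P_B`, `P_C`, `P_D`, `bern`, and `joint` are assumptions made for the example.

```python
from itertools import product

# CPTs from the slides, stored as P(child = true | parent value).
P_A = 0.4                          # P(A = true)
P_B = {False: 0.99, True: 0.3}     # P(B = true | A)
P_C = {False: 0.6, True: 0.1}      # P(C = true | B)
P_D = {False: 0.98, True: 0.95}    # P(D = true | B)


def bern(p_true, value):
    """Probability of a Boolean value, given P(value = True)."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d):
    """P(A, B, C, D) = P(A) P(B | A) P(C | B) P(D | B), as dictated by the graph."""
    return (bern(P_A, a)
            * bern(P_B[a], b)
            * bern(P_C[b], c)
            * bern(P_D[b], d))


p_all_true = joint(True, True, True, True)

# Sanity check: the 16 joint entries must sum to 1.
total = sum(joint(*values) for values in product((True, False), repeat=4))

print(round(p_all_true, 4), round(total, 6))
```

Storing only P(child = true | parents) mirrors the remark that only half of each table needs to be kept, since the false entries follow by subtraction from 1.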

Inference

Using a Bayesian network to compute probabilities is called inference.
In general, inference involves queries of the form P(X | E), where
  X = the query variable(s)
  E = the evidence variable(s)

The Bad News

Exact inference is feasible in small to medium-sized networks.
Exact inference in large networks can take a very long time (in the general case it is NP-hard).
We resort to approximate inference techniques, which are much faster and usually give good approximations.

One last unresolved issue…

We still haven’t said where we get the Bayesian network from. There are two options:
–Get an expert to design it
–Learn it from data

Probabilities

We will write P(A = true) to mean the probability that A = true.
What is probability? It is the relative frequency with which an outcome would be obtained if the process were repeated a large number of times under similar conditions.*

[Figure: two areas labeled P(A = false) and P(A = true). The sum of the red and blue areas is 1.]

* Ahem… there’s also the Bayesian definition, which says probability is your degree of belief in an outcome.

Conditional Probability

P(A = true | B = true) = out of all the outcomes in which B is true, the fraction that also have A equal to true.
Read this as: “Probability of A conditioned on B” or “Probability of A given B”.

H = “Have a headache”
F = “Coming down with Flu”

P(H = true) = 1/10
P(F = true) = 1/40
P(H = true | F = true) = 1/2

“Headaches are rare and flu is rarer, but if you’re coming down with flu there’s a 50-50 chance you’ll have a headache.”

[Figure: overlapping regions labeled P(F = true) and P(H = true).]

Weng-Keen Wong, Oregon State University ©2005

The Joint Probability Distribution

We will write P(A = true, B = true) to mean “the probability of A = true and B = true”.
Notice that: P(H=true | F=true) = P(H=true, F=true) / P(F=true)
In general, P(X|Y) = P(X,Y) / P(Y)

[Figure: overlapping regions labeled P(F = true) and P(H = true).]
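As a worked illustration of P(X|Y) = P(X,Y) / P(Y), the headache and flu numbers can be pushed through the rule in both directions. The sketch below uses exact fractions; the variable names are illustrative, and the reverse conditional P(F = true | H = true) is not stated on the slides but follows from the same rule.

```python
from fractions import Fraction

# Numbers given on the slides.
p_h = Fraction(1, 10)          # P(H = true)
p_f = Fraction(1, 40)          # P(F = true)
p_h_given_f = Fraction(1, 2)   # P(H = true | F = true)

# Rearranging P(H | F) = P(H, F) / P(F) gives the joint probability.
p_h_and_f = p_h_given_f * p_f

# Applying the same rule with the roles swapped gives P(F | H).
p_f_given_h = p_h_and_f / p_h

print(p_h_and_f, p_f_given_h)
```

So the joint probability of headache and flu is 1/80, and a headache raises the probability of flu from 1/40 to 1/8.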

Example

I’m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn’t call. Sometimes it’s set off by a minor earthquake. Is there a burglar?

Variables: Burglar, Earthquake, Alarm, JohnCalls, MaryCalls

Network topology reflects “causal” knowledge:
–A burglar can set the alarm off
–An earthquake can set the alarm off
–The alarm can cause Mary to call
–The alarm can cause John to call

What is P(B | ~M, J)? (for example)
We can use the full joint distribution to answer this question.
–Requires 2^5 = 32 probabilities
Can we use prior domain knowledge to come up with a Bayesian network that requires fewer probabilities?

Constructing a Bayesian Network: Step 1

Order the variables in terms of causality (may be a partial order), e.g., {E, B} -> {A} -> {J, M}

  P(J, M, A, E, B) = P(J, M | A, E, B) P(A | E, B) P(E, B)
                   ≈ P(J, M | A) P(A | E, B) P(E) P(B)
                   ≈ P(J | A) P(M | A) P(A | E, B) P(E) P(B)

These conditional independence (CI) assumptions are reflected in the graph structure of the Bayesian network.

A Simple Belief Network

Directed acyclic graph (DAG):
  Burglary -> Alarm, Earthquake -> Alarm       (causes)
  Alarm -> JohnCalls, Alarm -> MaryCalls       (effects)

Nodes are random variables.
Intuitive meaning of an arrow from x to y: “x has direct influence on y”.

Constructing this Bayesian Network: Step 2

P(J, M, A, E, B) = P(J | A) P(M | A) P(A | E, B) P(E) P(B)

There are 3 conditional probability tables (CPTs) to be determined: P(J | A), P(M | A), P(A | E, B)
–Requiring 2 + 2 + 4 = 8 probabilities
And 2 marginal probabilities P(E), P(B) -> 2 more probabilities

Where do these probabilities come from?
–Expert knowledge
–From data (relative frequency estimates)
–Or a combination of both

Assigning Probabilities to Roots

Graph: Burglary -> Alarm, Earthquake -> Alarm, Alarm -> JohnCalls, Alarm -> MaryCalls

  P(B) = 0.001
  P(E) = 0.002

Conditional Probability Tables

  P(B) = 0.001
  P(E) = 0.002

  B  E  P(A|B,E)
  T  T  0.95
  T  F  0.94
  F  T  0.29
  F  F  0.001

Size of the CPT for a node with k parents: ?

Conditional Probability Tables (continued)

  A  P(J|A)
  T  0.90
  F  0.05

  A  P(M|A)
  T  0.70
  F  0.01

What the BN Means

With the CPTs P(B), P(E), P(A | B, E), P(J | A), and P(M | A) attached to the graph, the network defines the joint distribution:

  P(x_1, x_2, …, x_n) = Π_{i=1,…,n} P(x_i | Parents(X_i))

Calculation of Joint Probability

  P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
    = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
    = 0.9 x 0.7 x 0.001 x 0.999 x 0.998
    ≈ 0.000628

What The BN Encodes

Each of the beliefs JohnCalls and MaryCalls is independent of Burglary and Earthquake given Alarm or ¬Alarm.
For example, John does not observe any burglaries directly.

The beliefs JohnCalls and MaryCalls are independent given Alarm or ¬Alarm.

What The BN Encodes (continued)

The beliefs JohnCalls and MaryCalls are independent given Alarm or ¬Alarm.
For instance, the reasons why John and Mary may not call if there is an alarm are unrelated.
Note that these reasons could be other beliefs in the network. The probabilities summarize these non-explicit beliefs.

Conditional Independence

Variables A and B are conditionally independent given C if any of the following hold:
  P(A, B | C) = P(A | C) P(B | C)
  P(A | B, C) = P(A | C)
  P(B | A, C) = P(B | C)

Knowing C tells me everything about B. I don’t gain anything by knowing A (either because A doesn’t influence B or because knowing C provides all the information knowing A would give).
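The first condition can be tested numerically on a small network. The sketch below is an illustration only: it reuses the A -> B, B -> C, B -> D example network with the CPT values from the earlier slides and checks that C and D are conditionally independent given B, while they are not independent when B is unobserved. The helper names `prob` and `c_d_independent_given_b` are assumptions made for this example.

```python
from itertools import product

# CPTs of the example network A -> B, B -> C, B -> D (values from the slides).
P_A = 0.4
P_B = {False: 0.99, True: 0.3}     # P(B = true | A)
P_C = {False: 0.6, True: 0.1}      # P(C = true | B)
P_D = {False: 0.98, True: 0.95}    # P(D = true | B)

VARS = ("A", "B", "C", "D")
TF = (True, False)


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d):
    return bern(P_A, a) * bern(P_B[a], b) * bern(P_C[b], c) * bern(P_D[b], d)


def prob(**fixed):
    """P(fixed assignment), summing the joint over all unfixed variables."""
    total = 0.0
    for values in product(TF, repeat=len(VARS)):
        world = dict(zip(VARS, values))
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(*values)
    return total


def c_d_independent_given_b(tol=1e-12):
    """Check P(C, D | B) = P(C | B) P(D | B) for every value of B, C, D."""
    for b, c, d in product(TF, repeat=3):
        p_b = prob(B=b)
        lhs = prob(B=b, C=c, D=d) / p_b
        rhs = (prob(B=b, C=c) / p_b) * (prob(B=b, D=d) / p_b)
        if abs(lhs - rhs) > tol:
            return False
    return True


# Without conditioning on B, C and D are dependent (they share the parent B).
marginal_gap = abs(prob(C=True, D=True) - prob(C=True) * prob(D=True))

print(c_d_independent_given_b(), marginal_gap > 1e-3)
```

Both checks should print True: the independence holds once B is known, and disappears when B is summed out.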

Bayesian Networks

A Bayesian network specifies a joint distribution in a structured form.
Represent dependence/independence via a directed graph
–Nodes = random variables
–Edges = direct dependence
Structure of the graph <=> Conditional independence relations
Requires the graph be acyclic (no directed cycles)
2 components to a Bayesian network
–The graph structure (conditional independence assumptions)
–The numerical probabilities (for each variable given its parents)

In general,
  p(X_1, X_2, …, X_N) = Π_i p(X_i | parents(X_i))
where the left-hand side is the full joint distribution and the right-hand side is the graph-structured approximation.

Number of Probabilities in Bayesian Networks

Consider n binary variables.
Unconstrained joint distribution requires O(2^n) probabilities.
If we have a Bayesian network, with a maximum of k parents for any node, then we need O(n 2^k) probabilities.

Example
–Full unconstrained joint distribution, n = 30: need about 10^9 probabilities for the full joint distribution
–Bayesian network, n = 30, k = 4: need 480 probabilities
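The counts in this example are easy to reproduce. The following Python sketch assumes, as on the earlier CPT slide, that a Boolean node with k Boolean parents stores one number per parent configuration; the function names are illustrative.

```python
def full_joint_size(n):
    """Number of entries in the joint table over n binary variables."""
    return 2 ** n


def bn_size_bound(n, k):
    """Upper bound n * 2^k when every node has at most k parents."""
    return n * 2 ** k


def bn_size(parent_counts):
    """Exact count: one stored probability per parent configuration of each node."""
    return sum(2 ** k for k in parent_counts)


print(full_joint_size(30))           # about 10^9
print(bn_size_bound(30, 4))
# Burglary network: B and E have no parents, A has two, J and M have one each.
print(bn_size([0, 0, 2, 1, 1]))
```

For the burglary network this gives 10 stored probabilities, matching the 8 + 2 counted when the network was constructed, against 32 entries for the full joint table.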

Example (done the simple, marginalization way)

So, what is P(B | M, J)? E.g., say, P(b | m, ¬j), i.e., P(B=true | M=true ∧ J=false)

  P(b | m, ¬j) = P(b, m, ¬j) / P(m, ¬j)                                  ; by definition
  P(b, m, ¬j) = Σ_{A∈{a,¬a}} Σ_{E∈{e,¬e}} P(¬j, m, A, E, b)              ; marginal
  P(J, M, A, E, B) ≈ P(J | A) P(M | A) P(A | E, B) P(E) P(B)             ; conditional indep.
  P(¬j, m, A, E, b) ≈ P(¬j | A) P(m | A) P(A | E, b) P(E) P(b)

Say, work the case A=a ∧ E=¬e:
  P(¬j, m, a, ¬e, b) ≈ P(¬j | a) P(m | a) P(a | ¬e, b) P(¬e) P(b)
                     ≈ 0.10 x 0.70 x 0.94 x 0.998 x 0.001

Similar for the cases of a ∧ e, ¬a ∧ e, ¬a ∧ ¬e.
Similar for P(m, ¬j).
Then just divide to get P(b | m, ¬j).
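The marginalization recipe above can be carried through to a number. This is a minimal enumeration sketch in Python, assuming the CPT values listed on the earlier slides; the names `joint` and `p_b_m_notj` are illustrative and the code makes no attempt at efficiency.

```python
from itertools import product

# CPTs of the burglary network (values from the slides).
P_B = 0.001                                    # P(B = true)
P_E = 0.002                                    # P(E = true)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A = true | B, E)
P_J = {True: 0.90, False: 0.05}                # P(J = true | A)
P_M = {True: 0.70, False: 0.01}                # P(M = true | A)

TF = (True, False)


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(j, m, a, e, b):
    """P(J, M, A, E, B) = P(J | A) P(M | A) P(A | E, B) P(E) P(B)."""
    return (bern(P_J[a], j) * bern(P_M[a], m) * bern(P_A[(b, e)], a)
            * bern(P_E, e) * bern(P_B, b))


def p_b_m_notj(b):
    """P(B = b, M = true, J = false): sum out the hidden variables A and E."""
    return sum(joint(False, True, a, e, b) for a, e in product(TF, TF))


# The single case worked on the slide: A = a, E = not e, B = b.
one_term = joint(False, True, True, False, True)

numerator = p_b_m_notj(True)                   # P(b, m, not j)
evidence = numerator + p_b_m_notj(False)       # P(m, not j)
posterior = numerator / evidence               # P(b | m, not j)

print(one_term, numerator, evidence, posterior)
```

With these CPTs the posterior comes out a little under 0.007: Mary calling while John stays silent raises the probability of a burglary above the 0.001 prior, but it remains small.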

Example: Tree-Structured Bayesian Network

Tree: D -> B, D -> E, B -> A, B -> C, E -> F, E -> G

p(a, b, c, d, e, f, g) is modeled as p(a|b) p(c|b) p(f|e) p(g|e) p(b|d) p(e|d) p(d)

Example

Tree: D -> B, D -> E, B -> A, B -> C, E -> F, E -> G, with c and g observed.
Say we want to compute p(a | c, g)

Example

Direct calculation: p(a | c, g) = Σ_{b,d,e,f} p(a, b, d, e, f | c, g)
Complexity of the sum is O(m^4)

Example

Reordering: Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) Σ_f p(e,f|g)

Example

Reordering: Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) Σ_f p(e,f|g)
The innermost sum is Σ_f p(e,f|g) = p(e|g)

Example

Reordering: Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) p(e|g)
The innermost sum is Σ_e p(d|e) p(e|g) = p(d|g)

Example

Reordering: Σ_b p(a|b) Σ_d p(b|d,c) p(d|g)
The innermost sum is Σ_d p(b|d,c) p(d|g) = p(b|c,g)

Example

Reordering: Σ_b p(a|b) p(b|c,g) = p(a|c,g)
Complexity is O(m), compared to O(m^4)
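The benefit of pushing sums inward can be demonstrated on the tree itself. The Python sketch below is an illustration under loud assumptions: the slides give no numbers for this network, so every CPT value here is hypothetical, all variables are taken to be Boolean, and the elimination works with unnormalized products of the model's own factors p(a|b), p(c|b), p(f|e), p(g|e), p(b|d), p(e|d), p(d) followed by one normalization, rather than the conditional form written on the slides. It compares a brute-force sum over b, d, e, f with the reordered computation.

```python
from itertools import product

TF = (True, False)

# Hypothetical CPTs (not from the slides), stored as P(child = true | parent).
P_D = 0.3
P_B = {True: 0.8, False: 0.2}     # P(B = true | D)
P_E = {True: 0.6, False: 0.1}     # P(E = true | D)
P_A = {True: 0.7, False: 0.4}     # P(A = true | B)
P_C = {True: 0.9, False: 0.5}     # P(C = true | B)
P_F = {True: 0.3, False: 0.6}     # P(F = true | E)
P_G = {True: 0.75, False: 0.25}   # P(G = true | E)


def bern(p_true, value):
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d, e, f, g):
    """p(a|b) p(c|b) p(f|e) p(g|e) p(b|d) p(e|d) p(d), as on the tree slide."""
    return (bern(P_A[b], a) * bern(P_C[b], c) * bern(P_F[e], f)
            * bern(P_G[e], g) * bern(P_B[d], b) * bern(P_E[d], e)
            * bern(P_D, d))


def brute_force(a, c, g):
    """p(a | c, g) by summing the full joint over b, d, e, f (and a for the norm)."""
    num = sum(joint(a, b, c, d, e, f, g) for b, d, e, f in product(TF, repeat=4))
    den = sum(joint(a2, b, c, d, e, f, g)
              for a2, b, d, e, f in product(TF, repeat=5))
    return num / den


def eliminated(c, g):
    """p(A | c, g) with each sum pushed inside the product as far as it will go."""
    unnormalized = {}
    for a in TF:
        total = 0.0
        for d in TF:
            # Sum over f of p(f|e) is 1, so F drops out; then sum out e.
            msg_e = sum(bern(P_E[d], e) * bern(P_G[e], g) for e in TF)
            # Sum out b, collecting every factor that mentions b.
            msg_b = sum(bern(P_B[d], b) * bern(P_A[b], a) * bern(P_C[b], c)
                        for b in TF)
            total += bern(P_D, d) * msg_b * msg_e
        unnormalized[a] = total
    z = sum(unnormalized.values())
    return {a: value / z for a, value in unnormalized.items()}


print(brute_force(True, True, True), eliminated(True, True)[True])
```

Both routes should print the same probability. The brute-force version touches every joint assignment of the hidden variables, while the reordered version only ever loops over one or two variables at a time.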

General Strategy for inference

Want to compute P(q | e)
Step 1: P(q | e) = P(q, e) / P(e) = α P(q, e), where α = 1 / P(e), since P(e) is constant wrt Q
Step 2: P(q, e) = Σ_{a..z} P(q, e, a, b, …, z), by the law of total probability
Step 3: Σ_{a..z} P(q, e, a, b, …, z) = Σ_{a..z} Π_i P(variable i | parents i) (using Bayesian network factoring)
Step 4: Distribute summations across product terms for efficient computation

Summary

Bayesian networks represent a joint distribution using a graph.
The graph encodes a set of conditional independence assumptions.
Answering queries (or inference or reasoning) in a Bayesian network amounts to efficient computation of appropriate conditional probabilities.
Probabilistic inference is intractable in the general case
–But can be carried out in linear time for certain classes of Bayesian networks
