Bayesian Networks. Slides from multiple sources: Weng-Keen Wong, School of Electrical Engineering and Computer Science, Oregon State University.


Example
Network: Train Strike is the parent of both Martin Late and Norman Late.
P(Train Strike): T 0.1, F 0.9
P(Norman Late = T | Train Strike): T 0.8, F 0.1
P(Martin Late = T | Train Strike): T 0.6, F 0.5
Questions:
P("Martin Late" | "Norman Late") = ?   (conditional distribution)
P("Martin Late") = ?                   (marginal distribution)
P("Martin Late", "Norman Late", "Train Strike") = ?   (joint distribution)

Example (joint distribution)
Same network, with A = Martin Late, B = Norman Late, C = Train Strike (C is the parent of A and B).
Question: P("Martin Late", "Norman Late", "Train Strike") = ?
A B C   Probability
T T T   0.048
F T T   0.032
T F T   0.012
F F T   0.008
T T F   0.045
F T F   0.045
T F F   0.405
F F F   0.405
e.g., P(A=T, B=T, C=T) = P(A=T | C=T) P(B=T | C=T) P(C=T) = 0.6 × 0.8 × 0.1 = 0.048

Example (marginal distribution)
Same network and joint table as above.
Question: P("Martin Late", "Norman Late") = ?   Sum the joint table over C (Train Strike):
A B   Probability
T T   0.093
F T   0.077
T F   0.417
F F   0.413
e.g., P(A=T, B=T) = 0.048 + 0.045 = 0.093

Example (marginal distribution)
Question: P("Martin Late") = ?   Sum further over B (Norman Late):
A   Probability
T   0.51
F   0.49

Example (conditional distribution)
Question: P("Martin Late" | "Norman Late") = ?
Marginal for B (Norman Late): T 0.17, F 0.83
P(A = T | B = T) = P(A = T, B = T) / P(B = T) = 0.093 / 0.17 ≈ 0.55
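To tie the three computations together, here is a minimal Python sketch. The dictionary layout and variable names are my own, and the CPT values are the ones read off from the joint table above; it builds the joint from the node tables, then marginalizes and conditions:

```python
# Minimal sketch of the Train Strike example (names and layout are my own).
# S = Train Strike, M = Martin Late, N = Norman Late.
from itertools import product

P_S = {True: 0.1, False: 0.9}
P_M_given_S = {True: {True: 0.6, False: 0.4}, False: {True: 0.5, False: 0.5}}
P_N_given_S = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}

# Joint distribution P(M, N, S) = P(M | S) P(N | S) P(S)
joint = {(m, n, s): P_M_given_S[s][m] * P_N_given_S[s][n] * P_S[s]
         for m, n, s in product([True, False], repeat=3)}

# Marginal P(M, N): sum out S
p_mn = {(m, n): sum(joint[(m, n, s)] for s in [True, False])
        for m, n in product([True, False], repeat=2)}

# Marginal P(M = true) and conditional P(M = true | N = true)
p_m_true = sum(p for (m, n), p in p_mn.items() if m)   # 0.51
p_n_true = sum(p for (m, n), p in p_mn.items() if n)   # 0.17
p_m_given_n = p_mn[(True, True)] / p_n_true            # ≈ 0.55

print(p_m_true, p_n_true, p_m_given_n)
```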

The Joint Probability Distribution
Joint probabilities can be between any number of variables, e.g. P(A = true, B = true, C = true).
For each combination of variables, we need to say how probable that combination is, and the probabilities of these combinations need to sum to 1.
A     B     C     P(A,B,C)
false false false 0.10
false false true  0.20
false true  false 0.05
false true  true  0.05
true  false false 0.30
true  false true  0.10
true  true  false 0.05
true  true  true  0.15
(Sums to 1)

The Joint Probability Distribution
Once you have the joint probability distribution, you can calculate any probability involving A, B, and C. You may need to use marginalization and Bayes' rule.
Examples of things you can compute:
P(A = true) = sum of P(A,B,C) over the rows with A = true
P(A = true, B = true | C = true) = P(A = true, B = true, C = true) / P(C = true)
(Same 8-row table as above.)
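A small sketch of both computations in Python, using the table above (the helper names are mine):

```python
# Sketch: answering queries directly from the joint table on the slide.
joint = {
    (False, False, False): 0.10, (False, False, True): 0.20,
    (False, True,  False): 0.05, (False, True,  True): 0.05,
    (True,  False, False): 0.30, (True,  False, True): 0.10,
    (True,  True,  False): 0.05, (True,  True,  True): 0.15,
}

# P(A = true): sum the rows with A = true
p_a = sum(p for (a, b, c), p in joint.items() if a)            # 0.6

# P(A = true, B = true | C = true) = P(A=true, B=true, C=true) / P(C=true)
p_c = sum(p for (a, b, c), p in joint.items() if c)            # 0.5
p_ab_given_c = joint[(True, True, True)] / p_c                 # 0.3

print(p_a, p_c, p_ab_given_c)
```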

Computing with Probabilities: Law of Total Probability
Law of Total Probability (aka "summing out" or marginalization):
P(a) = Σ_b P(a, b) = Σ_b P(a | b) P(b), where B is any random variable.
Why is this useful? Given a joint distribution (e.g., P(a,b,c,d)) we can obtain any "marginal" probability (e.g., P(b)) by summing out the other variables:
P(b) = Σ_a Σ_c Σ_d P(a, b, c, d)
Less obvious: we can also compute any conditional probability of interest given a joint distribution, e.g.,
P(c | b) = Σ_a Σ_d P(a, c, d | b) = (1 / P(b)) Σ_a Σ_d P(a, c, d, b),
where 1 / P(b) is just a normalization constant.
Thus, the joint distribution contains the information we need to compute any probability of interest.
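An illustrative sketch of summing out and normalization on a four-variable joint (the joint entries here are randomly generated purely for illustration; they are not from the slides):

```python
# Sketch of "summing out" and normalization on a small joint P(A, B, C, D).
from itertools import product
import random

random.seed(0)
vals = [random.random() for _ in range(16)]
total = sum(vals)
joint = {combo: v / total                      # a proper joint: entries sum to 1
         for combo, v in zip(product([True, False], repeat=4), vals)}

# Marginal: P(b) = sum over a, c, d of P(a, b, c, d)
p_b = sum(p for (a, b, c, d), p in joint.items() if b)

# Conditional: P(c | b) = (1 / P(b)) * sum over a, d of P(a, b, c, d)
p_c_given_b = sum(p for (a, b, c, d), p in joint.items() if b and c) / p_b

print(p_b, p_c_given_b)
```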

Computing with Probabilities: The Chain Rule or Factoring
We can always write P(a, b, c, …, z) = P(a | b, c, …, z) P(b, c, …, z)   (by the definition of conditional probability).
Repeatedly applying this idea, we can write P(a, b, c, …, z) = P(a | b, c, …, z) P(b | c, …, z) P(c | …, z) … P(z).
This factorization holds for any ordering of the variables. This is the chain rule for probabilities.

The Problem with the Joint Distribution
Lots of entries in the table to fill up! For k Boolean random variables, you need a table of size 2^k.
How do we use fewer numbers? We need the concept of independence.
(Same 8-row table as above for A, B, C.)

A Bayesian Network
A Bayesian network is made up of:
1. A Directed Acyclic Graph (DAG) over the variables, e.g. one over A, B, C (any DAG will work)
2. A set of tables, one for each node in the graph, e.g.
P(A):      false 0.4 (= 2/5), true 0.6 (= 3/5)
P(B | A):  A=false: B=false 3/4, B=true 1/4;  A=true: B=false 2/3, B=true 1/3
P(C | B):  B=false: C=false 3/8, C=true 5/8;  B=true: C=false 7/12, C=true 5/12

A Bayesian Network
A Bayesian network is made up of:
1. A Directed Acyclic Graph: A → B, B → C, B → D
2. A set of tables for each node in the graph:
P(A):      false 0.6, true 0.4
P(B | A):  A=false: B=false 0.01, B=true 0.99;  A=true: B=false 0.7, B=true 0.3
P(C | B):  B=false: C=false 0.4,  C=true 0.6;   B=true: C=false 0.9, C=true 0.1
P(D | B):  B=false: D=false 0.02, D=true 0.98;  B=true: D=false 0.05, D=true 0.95
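One minimal way to hold such a network in code is a dictionary of parent lists plus a dictionary of CPTs. This is only a sketch; the layout and the helper name `prob` are my own choices, with the numbers taken from the tables above:

```python
# Sketch: the A -> B -> {C, D} network above as plain Python data.
# Each CPT maps a tuple of parent values to P(node = true | parents).
parents = {"A": (), "B": ("A",), "C": ("B",), "D": ("B",)}

cpt = {
    "A": {(): 0.4},                       # P(A = true)
    "B": {(False,): 0.99, (True,): 0.3},  # P(B = true | A)
    "C": {(False,): 0.6,  (True,): 0.1},  # P(C = true | B)
    "D": {(False,): 0.98, (True,): 0.95}, # P(D = true | B)
}

def prob(node, value, assignment):
    """P(node = value | parents), reading the parents' values from `assignment`."""
    key = tuple(assignment[p] for p in parents[node])
    p_true = cpt[node][key]
    return p_true if value else 1.0 - p_true
```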

A Directed Acyclic Graph
A → B, B → C, B → D
Each node in the graph is a random variable.
A node X is a parent of another node Y if there is an arrow from node X to node Y, e.g. A is a parent of B.
Informally, an arrow from node X to node Y means X has a direct influence on Y.

A Set of Tables for Each Node
Each node X_i has a conditional probability distribution P(X_i | Parents(X_i)) that quantifies the effect of the parents on the node.
The parameters are the probabilities in these conditional probability tables (CPTs).
(Tables as above: P(A); P(B | A); P(C | B); P(D | B).)

A Set of Tables for Each Node
Conditional probability distribution for C given B:
P(C | B):  B=false: C=false 0.4, C=true 0.6;  B=true: C=false 0.9, C=true 0.1
If you have a Boolean variable with k Boolean parents, this table has 2^(k+1) probabilities (but only 2^k need to be stored).
For a given combination of values of the parents (B in this example), the entries for P(C=true | B) and P(C=false | B) must add up to 1, e.g. P(C=true | B=false) + P(C=false | B=false) = 1.

Using a Bayesian Network: Example
Using the network in the example (A → B, B → C, B → D), suppose you want to calculate:
P(A = true, B = true, C = true, D = true)
= P(A = true) × P(B = true | A = true) × P(C = true | B = true) × P(D = true | B = true)
= 0.4 × 0.3 × 0.1 × 0.95 ≈ 0.0114

Using a Bayesian Network: Example
P(A = true, B = true, C = true, D = true)
= P(A = true) × P(B = true | A = true) × P(C = true | B = true) × P(D = true | B = true)   (this factorization comes from the graph structure)
= 0.4 × 0.3 × 0.1 × 0.95 ≈ 0.0114   (these numbers come from the conditional probability tables)
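Continuing the sketch above (it reuses the `prob` helper defined there), the same product can be computed mechanically by multiplying one CPT entry per node:

```python
def joint_prob(assignment):
    """P(assignment) = product over nodes of P(node | its parents)."""
    result = 1.0
    for node, value in assignment.items():
        result *= prob(node, value, assignment)
    return result

print(joint_prob({"A": True, "B": True, "C": True, "D": True}))
# 0.4 * 0.3 * 0.1 * 0.95 ≈ 0.0114
```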

Inference
Using a Bayesian network to compute probabilities is called inference.
In general, inference involves queries of the form P(X | E), where X is the query variable(s) and E is the evidence variable(s).

The Bad News
Exact inference is feasible in small to medium-sized networks.
Exact inference in large networks takes a very long time (it is NP-hard in general).
We resort to approximate inference techniques, which are much faster and give pretty good results.
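One of the simplest approximate techniques is sampling. Below is a rough rejection-sampling sketch for the toy A → B → {C, D} network from the earlier code (it reuses `prob`); this is my own illustration, not a method the slides prescribe:

```python
import random

def sample_network():
    """Draw one complete assignment, sampling each node given its already-sampled parents."""
    sample = {}
    for node in ["A", "B", "C", "D"]:          # a topological order of the graph
        sample[node] = random.random() < prob(node, True, sample)
    return sample

def rejection_sample(query, evidence, n=100_000):
    """Estimate P(query = true | evidence) by keeping only samples that match the evidence."""
    kept = accepted = 0
    for _ in range(n):
        s = sample_network()
        if all(s[var] == val for var, val in evidence.items()):
            kept += 1
            accepted += s[query]
    return accepted / kept if kept else float("nan")

print(rejection_sample("A", {"D": True}))   # rough estimate of P(A = true | D = true)
```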

One Last Unresolved Issue…
We still haven't said where we get the Bayesian network from. There are two options:
–Get an expert to design it
–Learn it from data

Probabilities
We will write P(A = true) to mean the probability that A = true.
What is probability? It is the relative frequency with which an outcome would be obtained if the process were repeated a large number of times under similar conditions.*
(Figure: two regions whose areas are P(A = true) and P(A = false); the two areas sum to 1.)
* Ahem… there's also the Bayesian definition, which says probability is your degree of belief in an outcome.

Conditional Probability
P(A = true | B = true): out of all the outcomes in which B is true, how many also have A equal to true?
Read this as "probability of A conditioned on B" or "probability of A given B".
Example: H = "Have a headache", F = "Coming down with flu".
P(H = true) = 1/10, P(F = true) = 1/40, P(H = true | F = true) = 1/2.
"Headaches are rare and flu is rarer, but if you're coming down with flu there's a chance you'll have a headache."

The Joint Probability Distribution
We will write P(A = true, B = true) to mean "the probability of A = true and B = true".
Notice that P(H = true | F = true) = P(H = true, F = true) / P(F = true).
In general, P(X | Y) = P(X, Y) / P(Y).

Example
I'm at work; neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by a minor earthquake. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls.
Network topology reflects "causal" knowledge:
–A burglar can set the alarm off
–An earthquake can set the alarm off
–The alarm can cause Mary to call
–The alarm can cause John to call
What is P(B | ~M, J)? (for example)
We can use the full joint distribution to answer this question, but it requires 2^5 = 32 probabilities. Can we use prior domain knowledge to come up with a Bayesian network that requires fewer probabilities?

Constructing a Bayesian Network: Step 1
Order the variables in terms of causality (may be a partial order), e.g., {E, B} → {A} → {J, M}.
P(J, M, A, E, B) = P(J, M | A, E, B) P(A | E, B) P(E, B)
                 ≈ P(J, M | A) P(A | E, B) P(E) P(B)
                 ≈ P(J | A) P(M | A) P(A | E, B) P(E) P(B)
These conditional independence (CI) assumptions are reflected in the graph structure of the Bayesian network.

A Simple Belief Network
Burglary, Earthquake → Alarm → JohnCalls, MaryCalls   (causes at the top, effects at the bottom)
This is a directed acyclic graph (DAG): nodes are random variables, and the intuitive meaning of an arrow from x to y is "x has direct influence on y".

Constructing this Bayesian Network: Step 2
P(J, M, A, E, B) = P(J | A) P(M | A) P(A | E, B) P(E) P(B)
There are 3 conditional probability tables (CPTs) to be determined: P(J | A), P(M | A), P(A | E, B)
–Requiring 2 + 2 + 4 = 8 probabilities
And 2 marginal probabilities, P(E) and P(B) → 2 more probabilities
Where do these probabilities come from?
–Expert knowledge
–From data (relative frequency estimates)
–Or a combination of both

Assigning Probabilities to Roots
Burglary → Alarm ← Earthquake;  Alarm → JohnCalls, Alarm → MaryCalls
P(B) = 0.001    P(E) = 0.002

Conditional Probability Tables
P(A | B, E):  B=T, E=T: 0.95;  B=T, E=F: 0.94;  B=F, E=T: 0.29;  B=F, E=F: 0.001
P(B) = 0.001    P(E) = 0.002
Size of the CPT for a node with k parents: ?

Conditional Probability Tables
P(A | B, E):  B=T, E=T: 0.95;  B=T, E=F: 0.94;  B=F, E=T: 0.29;  B=F, E=F: 0.001
P(J | A):  A=T: 0.90;  A=F: 0.05
P(M | A):  A=T: 0.70;  A=F: 0.01
P(B) = 0.001    P(E) = 0.002

What the BN Means
P(x_1, x_2, …, x_n) = Π_{i=1,…,n} P(x_i | Parents(X_i))
(With the CPTs above, the network compactly encodes the full joint distribution over Burglary, Earthquake, Alarm, JohnCalls, MaryCalls.)

Calculation of Joint Probability
P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
= P(J | A) P(M | A) P(A | ¬B, ¬E) P(¬B) P(¬E)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00063
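The same product written out as code, using the CPT entries above (assumed here to be the standard Russell & Norvig alarm-network values):

```python
# P(J and M and A and not-B and not-E) for the alarm network, as a plain product of CPT entries.
p_j_given_a   = 0.90        # P(JohnCalls = true | Alarm = true)
p_m_given_a   = 0.70        # P(MaryCalls = true | Alarm = true)
p_a_given_nbe = 0.001       # P(Alarm = true | Burglary = false, Earthquake = false)
p_not_b       = 1 - 0.001   # P(Burglary = false)
p_not_e       = 1 - 0.002   # P(Earthquake = false)

print(p_j_given_a * p_m_given_a * p_a_given_nbe * p_not_b * p_not_e)   # ≈ 0.00063
```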

What The BN Encodes
Each of the beliefs JohnCalls and MaryCalls is independent of Burglary and Earthquake given Alarm or ¬Alarm.
The beliefs JohnCalls and MaryCalls are independent given Alarm or ¬Alarm.
For example, John does not observe any burglaries directly.

What The BN Encodes
Each of the beliefs JohnCalls and MaryCalls is independent of Burglary and Earthquake given Alarm or ¬Alarm.
The beliefs JohnCalls and MaryCalls are independent given Alarm or ¬Alarm.
For instance, the reasons why John and Mary may not call if there is an alarm are unrelated.
Note that these reasons could be other beliefs in the network; the probabilities summarize these non-explicit beliefs.

Conditional Independence
Variables A and B are conditionally independent given C if any of the following hold:
–P(A, B | C) = P(A | C) P(B | C)
–P(A | B, C) = P(A | C)
–P(B | A, C) = P(B | C)
Knowing C tells me everything about B; I don't gain anything by knowing A (either because A doesn't influence B or because knowing C provides all the information knowing A would give).
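Using the Train Strike example from earlier, the first definition can be checked numerically from the joint table (a sketch; the variable names are mine):

```python
# Check P(Martin, Norman | Strike) = P(Martin | Strike) * P(Norman | Strike)
# using the joint table from the Train Strike slides.
joint = {  # (Martin Late, Norman Late, Train Strike) -> probability
    (True, True, True): 0.048,   (False, True, True): 0.032,
    (True, False, True): 0.012,  (False, False, True): 0.008,
    (True, True, False): 0.045,  (False, True, False): 0.045,
    (True, False, False): 0.405, (False, False, False): 0.405,
}
p_s    = sum(p for (m, n, s), p in joint.items() if s)              # P(S = true) = 0.1
p_mn_s = joint[(True, True, True)] / p_s                            # P(M, N | S) = 0.48
p_m_s  = sum(p for (m, n, s), p in joint.items() if m and s) / p_s  # P(M | S) = 0.6
p_n_s  = sum(p for (m, n, s), p in joint.items() if n and s) / p_s  # P(N | S) = 0.8

assert abs(p_mn_s - p_m_s * p_n_s) < 1e-9   # 0.48 == 0.6 * 0.8
```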

Bayesian Networks
A Bayesian network specifies a joint distribution in a structured form.
It represents dependence/independence via a directed graph:
–Nodes = random variables
–Edges = direct dependence
The structure of the graph gives the conditional independence relations; the graph is required to be acyclic (no directed cycles).
2 components to a Bayesian network:
–The graph structure (conditional independence assumptions)
–The numerical probabilities (for each variable given its parents)
In general, p(X_1, X_2, …, X_N) = Π_i p(X_i | parents(X_i)): the graph-structured approximation to the full joint distribution.

Number of Probabilities in Bayesian Networks
Consider n binary variables.
An unconstrained joint distribution requires O(2^n) probabilities.
If we have a Bayesian network with a maximum of k parents for any node, then we need only O(n 2^k) probabilities.
Example:
–Full unconstrained joint distribution, n = 30: need about 10^9 probabilities
–Bayesian network, n = 30, k = 4: need 30 × 2^4 = 480 probabilities
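A quick check of the arithmetic (counting, as the slides do, one stored probability per CPT row):

```python
n, k = 30, 4
full_joint = 2 ** n           # unconstrained joint: about 1e9 entries
bayes_net  = n * 2 ** k       # at most k parents per node: 480 probabilities
print(full_joint, bayes_net)  # 1073741824 480
```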

Example (done the simple, marginalization way)
So, what is P(B | M, J)? E.g., say, P(b | m, ¬j), i.e., P(B=true | M=true ∧ J=false).
P(b | m, ¬j) = P(b, m, ¬j) / P(m, ¬j)   (by definition)
P(b, m, ¬j) = Σ_{A∈{a,¬a}} Σ_{E∈{e,¬e}} P(¬j, m, A, E, b)   (marginalization)
P(J, M, A, E, B) = P(J | A) P(M | A) P(A | E, B) P(E) P(B)   (conditional independence)
P(¬j, m, A, E, b) = P(¬j | A) P(m | A) P(A | E, b) P(E) P(b)
Say, work the case A=a ∧ E=¬e:
P(¬j, m, a, ¬e, b) = P(¬j | a) P(m | a) P(a | ¬e, b) P(¬e) P(b) = 0.10 × 0.70 × 0.94 × 0.998 × 0.001
Similarly for the cases a∧e, ¬a∧e, ¬a∧¬e. Similarly for P(m, ¬j). Then just divide to get P(b | m, ¬j).
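The same enumeration in a few lines of Python, using the standard alarm-network CPTs from the tables above (the helper names are mine; this is exact inference by brute-force summation):

```python
from itertools import product

# Standard alarm-network CPTs (as in the tables above).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(J = true | A)
P_M = {True: 0.70, False: 0.01}   # P(M = true | A)

def pr(p, value):
    """P(X = value) given P(X = true) = p."""
    return p if value else 1 - p

def joint(j, m, a, e, b):
    return pr(P_J[a], j) * pr(P_M[a], m) * pr(P_A[(b, e)], a) * pr(P_E, e) * pr(P_B, b)

# P(b | m, not-j) = P(b, m, not-j) / P(m, not-j), summing out A and E.
num = sum(joint(False, True, a, e, True) for a, e in product([True, False], repeat=2))
den = sum(joint(False, True, a, e, b)    for a, e, b in product([True, False], repeat=3))
print(num / den)   # ≈ 0.0069
```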

Example: Tree-Structured Bayesian Network
Nodes: D at the root; B and E are children of D; A and C are children of B; F and G are children of E.
p(a, b, c, d, e, f, g) is modeled as p(a|b) p(c|b) p(f|e) p(g|e) p(b|d) p(e|d) p(d)

Example
Say we want to compute p(a | c, g) in the tree-structured network above (lower-case c and g denote the observed values of C and G).

Example
Direct calculation: p(a | c, g) = Σ_{b,d,e,f} p(a, b, d, e, f | c, g)
If each variable takes m values, the complexity of this sum is O(m^4).

Example
Reordering: Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) Σ_f p(e,f | g)

Example
Reordering: Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) [Σ_f p(e,f | g)]; the innermost sum over f yields a factor p(e|g).

Example
Reordering: Σ_b p(a|b) Σ_d p(b|d,c) [Σ_e p(d|e) p(e|g)]; the sum over e yields a factor p(d|g).

Example
Reordering: Σ_b p(a|b) [Σ_d p(b|d,c) p(d|g)]; the sum over d yields a factor p(b|c,g).

Example
Reordering: [Σ_b p(a|b) p(b|c,g)] = p(a|c,g).
Complexity is O(m), compared to O(m^4) for the direct calculation.
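A sketch of the same idea in code: build random CPTs for the tree, then compare the naive four-fold sum with a reordered computation that pushes each sum inward and caches the intermediate factors. Everything here (names, random tables, the exact factor forms, which follow the CPTs rather than the slide's Bayes-flipped terms) is my own illustration of the reordering, not code from the slides:

```python
import random
from itertools import product

random.seed(1)
m = 4
vals = range(m)

def random_cpt(n_parent_states):
    """One categorical distribution over m values per parent configuration."""
    table = {}
    for key in product(vals, repeat=n_parent_states):
        row = [random.random() for _ in vals]
        s = sum(row)
        table[key] = [x / s for x in row]
    return table

p_d = random_cpt(0)[()]                      # p(d)
p_b_d, p_e_d = random_cpt(1), random_cpt(1)  # p(b|d), p(e|d)
p_a_b, p_c_b = random_cpt(1), random_cpt(1)  # p(a|b), p(c|b)
p_f_e, p_g_e = random_cpt(1), random_cpt(1)  # p(f|e), p(g|e)

def joint(a, b, c, d, e, f, g):
    return (p_a_b[(b,)][a] * p_c_b[(b,)][c] * p_f_e[(e,)][f] * p_g_e[(e,)][g]
            * p_b_d[(d,)][b] * p_e_d[(d,)][e] * p_d[d])

c_obs, g_obs, a_query = 0, 0, 0

# Naive: sum over b, d, e, f for one value of a -- O(m^4) terms.
naive = sum(joint(a_query, b, c_obs, d, e, f, g_obs)
            for b, d, e, f in product(vals, repeat=4))

# Reordered: push each sum inward; each intermediate factor costs O(m) per entry.
phi_e = {e: sum(p_f_e[(e,)][f] for f in vals) * p_g_e[(e,)][g_obs] for e in vals}
phi_d = {d: sum(p_e_d[(d,)][e] * phi_e[e] for e in vals) for d in vals}
phi_b = {b: sum(p_b_d[(d,)][b] * p_d[d] * phi_d[d] for d in vals) for b in vals}
eliminated = sum(p_a_b[(b,)][a_query] * p_c_b[(b,)][c_obs] * phi_b[b] for b in vals)

assert abs(naive - eliminated) < 1e-12   # same unnormalized value, far fewer operations
```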

General Strategy for Inference
Want to compute P(q | e).
Step 1: P(q | e) = P(q, e) / P(e) = α P(q, e), since P(e) is constant with respect to Q.
Step 2: P(q, e) = Σ_{a…z} P(q, e, a, b, …, z), by the law of total probability.
Step 3: Σ_{a…z} P(q, e, a, b, …, z) = Σ_{a…z} Π_i P(variable i | parents of variable i)   (using Bayesian network factoring)
Step 4: Distribute the summations across the product terms for efficient computation.

Summary
Bayesian networks represent a joint distribution using a graph.
The graph encodes a set of conditional independence assumptions.
Answering queries (or inference, or reasoning) in a Bayesian network amounts to efficient computation of appropriate conditional probabilities.
Probabilistic inference is intractable in the general case,
–but it can be carried out in linear time for certain classes of Bayesian networks (e.g., polytrees).