Directed Graphical Probabilistic Models. William W. Cohen, Machine Learning 10-601, Feb 2008.

Presentation transcript:

Slide 1 Directed Graphical Probabilistic Models William W. Cohen Machine Learning Feb 2008

Slide 2 Some practical problems I have 3 standard d20 dice, 1 loaded die. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = d20 picked is fair and B = roll 19 or 20 with that die. What is P(B)? P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2. What is P(A=fair | B=criticalHit)? (“Loaded” means P(19 or 20) = 0.5.)
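A quick worked answer to the second question, using only the numbers above and Bayes rule: P(A=fair | B=critical) = P(B=critical | A=fair) P(A=fair) / P(B) = (0.1 * 0.75) / 0.2 = 0.375.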

Slide 3 Definition of Conditional Probability P(A|B) = P(A ^ B) / P(B). Corollaries: P(B|A) = P(A|B) P(B) / P(A) (Bayes rule); P(A ^ B) = P(A|B) P(B) (chain rule).

Slide 4 The (highly practical) Monty Hall problem You’re in a game show. Behind one door is a prize. Behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens one door, revealing… a goat! You now can either: stick with your guess, always change doors, or flip a coin and pick a new door randomly according to the coin.

Slide 5 Some practical problems I have 1 standard d6 die, 2 loaded d6 dice. Loaded high: P(X=6)=0.50. Loaded low: P(X=1)=0.50. Experiment: pick two d6 uniformly at random (A) and roll them. What is more likely – rolling a seven or rolling doubles? Three combinations: HL, HF, FL. P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL) = P(D|A=HL)*P(A=HL) + P(D|A=HF)*P(A=HF) + P(D|A=FL)*P(A=FL)

Slide 6 A brute-force solution A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk. With this you can compute any P(A) where A is any boolean combination of the primitive events (Xi=xi), e.g. P(doubles), P(seven or eleven), P(total is higher than 5), …
A | Roll 1 | Roll 2 | P (Comment column marks rows that are doubles or sum to seven)
FL | 1 | 1 | 1/3 * 1/6 * 1/2 (doubles)
FL | 1 | 2 | 1/3 * 1/6 * 1/10
… | … | … | …
FL | 6 | 6 | …
HL | 1 | 1 | …
HL | 1 | 2 | …
… | … | … | …
HF | 1 | 1 | …
… | … | … | …
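To make the brute-force idea concrete, here is a minimal Python sketch that enumerates the same joint table. It assumes, as the 1/6 * 1/10 entries above suggest, that the five unnamed faces of each loaded die have probability 1/10 each; the names fair, lo, hi and the pair list are just illustrative.

    from fractions import Fraction
    from itertools import product

    # face distributions: fair d6; loaded low (P(1)=1/2); loaded high (P(6)=1/2);
    # the remaining faces of a loaded die are assumed to have probability 1/10 each
    fair = {f: Fraction(1, 6) for f in range(1, 7)}
    lo = {f: Fraction(1, 2) if f == 1 else Fraction(1, 10) for f in range(1, 7)}
    hi = {f: Fraction(1, 2) if f == 6 else Fraction(1, 10) for f in range(1, 7)}

    pairs = [(hi, lo), (hi, fair), (fair, lo)]     # the three combinations HL, HF, FL

    p_doubles = p_seven = Fraction(0)
    for d1, d2 in pairs:                           # each pair picked with probability 1/3
        for r1, r2 in product(range(1, 7), repeat=2):
            p = Fraction(1, 3) * d1[r1] * d2[r2]   # one row of the joint table
            if r1 == r2:
                p_doubles += p
            if r1 + r2 == 7:
                p_seven += p

    print("P(doubles) =", p_doubles, " P(seven) =", p_seven)

With these assumed face probabilities, summing the rows shows that a seven comes out more likely than doubles.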

Slide 7 The Joint Distribution Recipe for making a joint distribution of M variables: Example: Boolean variables A, B, C

Slide 8 The Joint Distribution Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). Example: Boolean variables A, B, C (truth table with columns A, B, C).

Slide 9 The Joint Distribution Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). 2. For each combination of values, say how probable it is. Example: Boolean variables A, B, C (truth table with columns A, B, C, Prob).

Slide 10 The Joint Distribution Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows). 2. For each combination of values, say how probable it is. 3. If you subscribe to the axioms of probability, those numbers must sum to 1. Example: Boolean variables A, B, C (truth table with columns A, B, C, Prob).
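As a concrete version of the three-step recipe, here is a small Python sketch for the A, B, C example. The eight probabilities are made-up numbers chosen only so that they sum to 1.

    from itertools import product

    probs = [0.30, 0.05, 0.10, 0.05, 0.05, 0.10, 0.25, 0.10]   # hypothetical values

    joint = {}
    # step 1: list all 2^3 combinations of values; step 2: assign each a probability
    for row, p in zip(product([False, True], repeat=3), probs):
        joint[row] = p

    # step 3: the numbers must sum to 1
    assert abs(sum(joint.values()) - 1.0) < 1e-9

    for (a, b, c), p in sorted(joint.items()):
        print(a, b, c, p)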

Slide 11 Using the Joint Once you have the JD you can ask for the probability of any logical expression involving your attributes.

Slide 12 Using the Joint P(Poor ^ Male) = sum of the joint-table entries in which Poor and Male both hold.

Slide 13 Using the Joint P(Poor) = sum of the joint-table entries in which Poor holds.

Slide 14 Inference with the Joint P(E1 | E2) = P(E1 ^ E2) / P(E2) = (sum of joint entries matching E1 and E2) / (sum of joint entries matching E2).

Slide 15 Inference with the Joint P(Male | Poor) = P(Poor ^ Male) / P(Poor) = 0.612
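The same two-step computation (sum the matching rows, then divide) is easy to write down in general. A minimal Python sketch, using a tiny made-up joint rather than the table on the slide:

    def p(joint, event):
        """Sum of the joint-table rows on which `event` holds."""
        return sum(pr for row, pr in joint.items() if event(row))

    def conditional(joint, e1, e2):
        """P(E1 | E2) = P(E1 ^ E2) / P(E2), computed directly from the joint."""
        return p(joint, lambda r: e1(r) and e2(r)) / p(joint, e2)

    # hypothetical joint over (gender, wealth); the numbers are illustrative only
    joint = {("male", "poor"): 0.46, ("male", "rich"): 0.12,
             ("female", "poor"): 0.30, ("female", "rich"): 0.12}

    print(conditional(joint,
                      lambda r: r[0] == "male",    # E1: Male
                      lambda r: r[1] == "poor"))   # E2: Poor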

Slide 16 Inference is a big deal I’ve got this evidence. What’s the chance that this conclusion is true? I’ve got a sore neck: how likely am I to have meningitis? I see my lights are out and it’s 9pm. What’s the chance my spouse is already asleep? … Lots and lots of important problems can be solved algorithmically using joint inference. How do you get from the statement of the problem to the joint probability table? (What is a “statement of the problem”?) The joint probability table is usually too big to represent explicitly, so how do you represent it compactly?

Slide 17 Problem 1: A description of the experiment I have 3 standard d20 dice, 1 loaded die. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = d20 picked is fair and B = roll 19 or 20 with that die. What is P(B)? Network: A → B, with P(A=fair)=0.75, P(A=loaded)=0.25, P(B=critical | A=fair)=0.1, P(B=noncritical | A=fair)=0.9, P(B=critical | A=loaded)=0.5, P(B=noncritical | A=loaded)=0.5.
A | P(A): Fair 0.75, Loaded 0.25
A | B | P(B|A): Fair/Critical 0.1, Fair/Noncritical 0.9, Loaded/Critical 0.5, Loaded/Noncritical 0.5

Slide 18 Problem 1: A description of the experiment Network: A → B.
A | P(A): Fair 0.75, Loaded 0.25
A | B | P(B|A): Fair/Critical 0.1, Fair/Noncritical 0.9, Loaded/Critical 0.5, Loaded/Noncritical 0.5
This is an “influence diagram” (informal: what influences what), a Bayes net / belief network / … (formal: more later!), a description of the experiment (i.e., how data could be generated), and everything you need to reconstruct the joint distribution. Chain rule: P(A,B) = P(B|A) P(A). problem with probs : Bayes net :: word problem : algebra
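Because the diagram plus its CPTs determine the joint via the chain rule, the questions from slide 2 can be answered mechanically. A minimal Python sketch of that reconstruction:

    # CPTs from the diagram above
    p_a = {"fair": 0.75, "loaded": 0.25}
    p_b_given_a = {("fair", "critical"): 0.1, ("fair", "noncritical"): 0.9,
                   ("loaded", "critical"): 0.5, ("loaded", "noncritical"): 0.5}

    # chain rule: P(A, B) = P(B | A) P(A)
    joint = {(a, b): p_b_given_a[(a, b)] * p_a[a]
             for a in p_a for b in ("critical", "noncritical")}

    p_critical = sum(p for (a, b), p in joint.items() if b == "critical")
    print(p_critical)                                # P(B=critical) = 0.2
    print(joint[("fair", "critical")] / p_critical)  # P(A=fair | B=critical) = 0.375 (up to float round-off)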

Slide 19 Problem 2: a description I have 1 standard d6 die, 2 loaded d6 dice. Loaded high: P(X=6)=0.50. Loaded low: P(X=1)=0.50. Experiment: pick two d6 without replacement (D1, D2) and roll them. What is more likely – rolling a seven or rolling doubles? Network: D1 → D2; D1, D2 → R.
D1 | P(D1): Fair 0.33, LoadedLo 0.33, LoadedHi 0.33
D1 | D2 | P(D2|D1): Fair/LoadedLo 0.5, Fair/LoadedHi 0.5, LoadedLo/Fair 0.5, LoadedLo/LoadedHi 0.5, LoadedHi/Fair 0.5, LoadedHi/LoadedLo 0.5
D1 | D2 | R | P(R|D1,D2): Fair/Lo/(1,1) 1/6 * 1/2, Fair/Lo/(1,2) 1/6 * 1/10, …

Slide 20 Problem 2: a description Network: D1 → D2; D1, D2 → R.
D1 | P(D1): Fair 0.33, LoadedLo 0.33, LoadedHi 0.33
D1 | D2 | P(D2|D1): Fair/LoadedLo 0.5, Fair/LoadedHi 0.5, LoadedLo/Fair 0.5, LoadedLo/LoadedHi 0.5, LoadedHi/Fair 0.5, LoadedHi/LoadedLo 0.5
D1 | D2 | R | P(R|D1,D2): Fair/Lo/(1,1) 1/6 * 1/2, Fair/Lo/(1,2) 1/6 * 1/10, …
Chain rule: P(A ^ B) = P(A|B) P(B), so P(R,D1,D2) = P(R|D1,D2) P(D1,D2) = P(R|D1,D2) P(D2|D1) P(D1)

Slide 21 Problem 2: another description Variables: D1, D2, R1, R2, Is7, with R1 = 1,2,…,6; R2 = 1,2,…,6; Is7 = yes, no.

Slide 22 The (highly practical) Monty Hall problem You’re in a game show. Behind one door is a prize. Behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens one door, revealing… a goat! You now can either stick with your guess or change doors. Variables: A = first guess, B = the money, C = the goat (the door Monty opens), D = stick or swap?, E = second guess.
A | P(A); B | P(B); D | P(D): Stick 0.5, Swap 0.5
A | B | C | P(C|A,B): …

Slide 23 The (highly practical) Monty Hall problem A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess.
A | P(A); B | P(B)
A | B | C | P(C|A,B): …
A | C | D | P(E|A,C,D): …

Slide 24 The (highly practical) Monty Hall problem A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess.
A | P(A); B | P(B)
A | B | C | P(C|A,B): …
A | C | D | P(E|A,C,D): …
…again by the chain rule: P(A,B,C,D,E) = P(E|A,C,D) * P(D) * P(C|A,B) * P(B) * P(A). We could construct the joint and compute P(E=B | D=swap).
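Constructing that joint from the five CPTs is mechanical; here is a minimal Python sketch. The slide leaves the P(C|A,B) and P(E|A,C,D) entries as “…”, so the sketch fills them in with the natural reading: Monty opens a goat door other than your pick, uniformly at random, and swapping takes the remaining closed door.

    from fractions import Fraction
    from itertools import product

    doors = [1, 2, 3]
    third, half = Fraction(1, 3), Fraction(1, 2)

    def p_c(c, a, b):
        """P(C=c | A=a, B=b): Monty opens a door hiding a goat, not your pick,
        chosen uniformly among the legal options."""
        options = [d for d in doors if d != a and d != b]
        return Fraction(1, len(options)) if c in options else Fraction(0)

    def p_e(e, a, c, d):
        """P(E=e | A=a, C=c, D=d): stick keeps A; swap takes the remaining door."""
        if d == "stick":
            return Fraction(int(e == a))
        other = next(x for x in doors if x != a and x != c)
        return Fraction(int(e == other))

    # chain rule: P(A,B,C,D,E) = P(E|A,C,D) P(D) P(C|A,B) P(B) P(A)
    joint = {(a, b, c, d, e): third * third * p_c(c, a, b) * half * p_e(e, a, c, d)
             for a, b, c, d, e in product(doors, doors, doors, ["stick", "swap"], doors)}

    def cond(win, given):
        den = sum(p for k, p in joint.items() if given(k))
        num = sum(p for k, p in joint.items() if win(k) and given(k))
        return num / den

    win = lambda k: k[4] == k[1]                   # second guess hits the money
    print(cond(win, lambda k: k[3] == "swap"))     # 2/3
    print(cond(win, lambda k: k[3] == "stick"))    # 1/3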

Slide 25 The (highly practical) Monty Hall problem A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess.
A | P(A); B | P(B)
A | B | C | P(C|A,B): …
A | C | D | P(E|A,C,D): …
The joint table has…? 3*3*3*2*3 = 162 rows. The conditional probability tables (CPTs) shown have…? 3 + 3 + 3*3*3 + 2*3*3 = 51 rows < 162 rows. Big questions: why are the CPTs smaller? how much smaller are the CPTs than the joint? can we compute the answers to queries like P(E=B | D=d) without building the joint probability tables, just using the CPTs?

Slide 26 The (highly practical) Monty Hall problem A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess. Why is the CPT representation smaller? Follow the money! (B) E is conditionally independent of B given A, D, C.

Slide 27 Conditional Independence formalized Definition: R and L are conditionally independent given M if, for all x, y, z in {T, F}: P(R=x | M=y ^ L=z) = P(R=x | M=y). More generally: Let S1, S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets, P(S1’s assignments | S2’s assignments ^ S3’s assignments) = P(S1’s assignments | S3’s assignments).
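As a sanity check, the definition can be tested directly against a joint table. A small Python sketch (the CPT numbers at the bottom are made up, chosen so that R and L really are independent given M):

    from itertools import product

    def cond_indep(joint, tol=1e-9):
        """Is R independent of L given M, for a joint over (R, M, L) triples?
        Checks P(R=x | M=y ^ L=z) == P(R=x | M=y) for all x, y, z."""
        def p(pred):
            return sum(pr for triple, pr in joint.items() if pred(triple))
        for x, y, z in product([True, False], repeat=3):
            if p(lambda t: t[1] == y and t[2] == z) == 0 or p(lambda t: t[1] == y) == 0:
                continue   # conditioning event has probability zero; skip it
            lhs = p(lambda t: t == (x, y, z)) / p(lambda t: t[1] == y and t[2] == z)
            rhs = p(lambda t: t[0] == x and t[1] == y) / p(lambda t: t[1] == y)
            if abs(lhs - rhs) > tol:
                return False
        return True

    # build a joint as P(M) P(R|M) P(L|M), which forces the independence to hold
    p_m, p_r_given_m, p_l_given_m = 0.5, {True: 0.9, False: 0.2}, {True: 0.3, False: 0.7}
    joint = {(r, m, l): (p_m if m else 1 - p_m)
                        * (p_r_given_m[m] if r else 1 - p_r_given_m[m])
                        * (p_l_given_m[m] if l else 1 - p_l_given_m[m])
             for r, m, l in product([True, False], repeat=3)}
    print(cond_indep(joint))   # True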

Slide 28 The (highly practical) Monty Hall problem A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess. What are the conditional independencies? I<…, …, …>? …

Slide 29 Bayes Nets Formalized A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where: V is a set of vertices; E is a set of directed edges joining vertices (no loops of any length are allowed). Each vertex in V contains the following information: the name of a random variable, and a probability distribution table indicating how the probability of this variable’s values depends on all possible combinations of parental values.

Slide 30 Building a Bayes Net 1. Choose a set of relevant variables. 2. Choose an ordering for them. 3. Assume they’re called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc). 4. For i = 1 to m: 1. Add the Xi node to the network. 2. Set Parents(Xi) to be a minimal subset of {X1 … Xi-1} such that we have conditional independence of Xi and all other members of {X1 … Xi-1} given Parents(Xi). 3. Define the probability table of P(Xi = k | Assignments of Parents(Xi)).

Slide 31 The general case P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn) = P(Xn=xn ^ Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) = P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) = P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X1=x1) = … = the product over i = 1..n of P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1), and with the Bayes-net independence assumptions each factor reduces to P(Xi=xi | Assignments of Parents(Xi)). So any entry in the joint pdf table can be computed. And so any conditional probability can be computed.

Slide 32 Building a Bayes Net: Example 1 1. Choose a set of relevant variables. 2. Choose an ordering for them. 3. Assume they’re called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc). 4. For i = 1 to m: Add the Xi node to the network. Set Parents(Xi) to be a minimal subset of {X1 … Xi-1} such that we have conditional independence of Xi and all other members of {X1 … Xi-1} given Parents(Xi). Define the probability table of P(Xi = k | values of Parents(Xi)). Variables: D1 – first die; D2 – second; R – roll; Is7. Order: D1, D2, R, Is7. Network: D1 → D2; D1, D2 → R; R → Is7.
D1 | P(D1): Fair 0.33, Lo 0.33, Hi 0.33
D1 | D2 | P(D2|D1): Fair/Lo 0.5, Fair/Hi 0.5, Lo/Fair 0.5, Lo/Hi 0.5, Hi/Fair 0.5, Hi/Lo 0.5
D1 | D2 | R | P(R|D1,D2): Fair/Lo/(1,1) 1/6 * 1/2, Fair/Lo/(1,2) 1/6 * 1/10, …
R | Is7 | P(Is7|R): (1,1)/No 1.0, (1,2)/No 1.0, …, (1,6)/Yes 1.0, (2,1)/No …, …
P(D1,D2,R,Is7) = P(Is7|R) * P(R|D1,D2) * P(D2|D1) * P(D1)

Slide 33 Another construction Variables: D1, D2, R1, R2, Is7, with R1 = 1,2,…,6; R2 = 1,2,…,6; Is7 = yes, no. Pick the order D1, D2, R1, R2, Is7. Network will be: … What if I pick other orders? What about R2, R1, D2, D1, Is7? What about the first network A → B? Suppose I picked the order B, A?

Slide 34 Another simple example Bigram language model: Pick word w0 from a multinomial 1, P(w0). For t = 1 to 100, pick word wt from multinomial 2, P(wt|wt-1). Network: w0 → w1 → w2 → … → w100.
w0 | P(w0): aardvark …, …, the 0.179, …, zymurgy 0.097
wt-1 | wt | P(wt|wt-1): aardvark/aardvark …, aardvark/tastes …, …, aardvark/zymurgy …

Slide 35 A simple example Bigram language model: Pick word w0 from a multinomial 1, P(w0). For t = 1 to 100, pick word wt from multinomial 2, P(wt|wt-1). Network: w0 → w1 → w2 → … → w100.
w0 | P(w0): aardvark …, …, the 0.179, …, zymurgy 0.097
wt-1 | wt | P(wt|wt-1): aardvark/aardvark …, aardvark/tastes …, …, aardvark/zymurgy …
Size of CPTs: |V| + |V|^2. Size of joint table: |V|^101. (|V| = 20 for amino acids/proteins.)
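A generative model like this is easy to sample from once the two CPTs are given. A minimal Python sketch with a toy vocabulary and made-up probabilities (a real model would estimate P(w0) and P(wt|wt-1) from a corpus):

    import random

    vocab = ["the", "aardvark", "tastes", "good", "."]
    p_w0 = {"the": 0.4, "aardvark": 0.2, "tastes": 0.1, "good": 0.1, ".": 0.2}
    # toy transition table: uniform over the vocabulary for every previous word
    p_wt = {prev: {w: 1.0 / len(vocab) for w in vocab} for prev in vocab}

    def sample(dist):
        """Draw one value from a {value: probability} multinomial."""
        r, acc = random.random(), 0.0
        for value, prob in dist.items():
            acc += prob
            if r < acc:
                return value
        return value   # guard against floating-point round-off

    words = [sample(p_w0)]                      # pick w0 from multinomial 1
    for t in range(1, 101):                     # for t = 1 to 100
        words.append(sample(p_wt[words[-1]]))   # pick wt from P(wt | wt-1)
    print(" ".join(words))

Note that the two tables hold only |V| + |V|^2 numbers, while the joint over all 101 words would need |V|^101 rows, which is the point the slide is making.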

Slide 36 A simple example Trigram language model: Pick word w0 from a multinomial M1, P(w0). Pick word w1 from a multinomial M2, P(w1|w0). For t = 2 to 100, pick word wt from multinomial M3, P(wt|wt-1, wt-2). Network: w0, w1, …, w100, with each wt (t ≥ 2) having parents wt-1 and wt-2. Size of CPTs: |V| + |V|^2 + |V|^3. Size of joint table: |V|^101. (|V| = 20 for amino acids/proteins.)

Slide 37 What Independencies does a Bayes Net Model? In order for a Bayesian network to model a probability distribution, the following must be true by definition: Each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents. This implies P(X1=x1, …, Xn=xn) = the product over i of P(Xi=xi | Assignments of Parents(Xi)). But what else does it imply?

Slide 38 What Independencies does a Bayes Net Model? Example: Z → Y → X. Given Y, does learning the value of Z tell us nothing new about X? I.e., is P(X|Y, Z) equal to P(X|Y)? Yes. Since we know the value of all of X’s parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z. Also, since independence is symmetric, P(Z|Y, X) = P(Z|Y).

Slide 39 Quick proof that independence is symmetric Assume: P(X|Y, Z) = P(X|Y). Then: P(Z|X, Y) = P(X, Y, Z) / P(X, Y) (Bayes’s Rule) = P(X|Y, Z) P(Z|Y) P(Y) / P(X, Y) (Chain Rule) = P(X|Y) P(Z|Y) P(Y) / P(X, Y) (By Assumption) = P(X|Y) P(Z|Y) P(Y) / (P(X|Y) P(Y)) = P(Z|Y) (Bayes’s Rule).

Slide 40 What Independencies does a Bayes Net Model? Example: Y → X, Y → Z. Let I<X, {Y}, Z> represent X and Z being conditionally independent given Y. I<X, {Y}, Z>? Yes, just as in the previous example: all of X’s parents are given, and Z is not a descendant.

Slide 41 What Independencies does a Bayes Net Model? Nodes: Z, V, U, X. I<X, {U}, Z>? No. I<X, {U, V}, Z>? Yes. Maybe I<X, S, Z> iff S acts as a cutset between X and Z in an undirected version of the graph…?

Slide 42 Things get a little more confusing Example: X → Y ← Z. X has no parents, so we know all its parents’ values trivially. Z is not a descendant of X. So, I<X, {}, Z>, even though there’s an undirected path from X to Z through an unknown variable Y. What if we do know the value of Y, though? Or one of its descendants?

Slide 43 The “Burglar Alarm” example Network: Burglar → Alarm ← Earthquake; Alarm → Phone Call. Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn’t care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home’s burglar alarm is ringing. Uh oh!

Slide 44 Things get a lot more confusing Network: Burglar → Alarm ← Earthquake; Alarm → Phone Call. But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all. Earthquake “explains away” the hypothetical burglar. But then it must not be the case that I<Burglar, {Phone Call}, Earthquake>, even though I<Burglar, {}, Earthquake>!
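The effect can be checked numerically by putting any reasonable CPTs on this network and comparing the two posteriors. A Python sketch with hypothetical numbers (the slide gives none):

    from itertools import product

    p_b, p_e = 0.01, 0.02                       # P(Burglar), P(Earthquake): hypothetical
    p_alarm = {(True, True): 0.95, (True, False): 0.94,
               (False, True): 0.29, (False, False): 0.001}   # P(Alarm | Burglar, Earthquake)
    p_call = {True: 0.90, False: 0.05}          # P(Call | Alarm)

    def pr(x, p_true):                          # P(X = x) when P(X = True) = p_true
        return p_true if x else 1 - p_true

    joint = {(b, e, a, c): pr(b, p_b) * pr(e, p_e)
                           * pr(a, p_alarm[(b, e)]) * pr(c, p_call[a])
             for b, e, a, c in product([True, False], repeat=4)}

    def cond(num_pred, given_pred):
        den = sum(p for k, p in joint.items() if given_pred(k))
        return sum(p for k, p in joint.items() if num_pred(k) and given_pred(k)) / den

    print(cond(lambda k: k[0], lambda k: k[3]))            # P(Burglar | Call)
    print(cond(lambda k: k[0], lambda k: k[3] and k[1]))   # P(Burglar | Call, Earthquake)

With these made-up numbers the second posterior comes out much smaller than the first: learning about the earthquake explains away the alarm.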

Slide 45 d-separation to the rescue Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation. Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is “blocked”, where a path is “blocked” iff one or more of the following conditions is true: ... I.e., X and Z are dependent iff there exists an unblocked path.

Slide 46 A path is “blocked” when... There exists a variable Y on the path such that it is in the evidence set E and the arcs putting Y in the path are “tail-to-tail”. Or, there exists a variable Y on the path such that it is in the evidence set E and the arcs putting Y in the path are “tail-to-head”. Or, ... Unknown “common causes” of X and Z impose dependency; unknown “causal chains” connecting X and Z impose dependency.

Slide 47 A path is “blocked” when… (the funky case) … Or, there exists a variable Y on the path such that it is NOT in the evidence set E, neither are any of its descendants, and the arcs putting Y on the path are “head-to-head”. Known “common symptoms” of X and Z impose dependencies… X may “explain away” Z.

Slide 48 d-separation to the rescue, cont’d Theorem [Verma & Pearl, 1998]: If a set of evidence variables E d-separates X and Z in a Bayesian network’s graph, then I<X, E, Z>. d-separation can be computed in linear time using a depth-first-search-like algorithm. Great! We now have a fast algorithm for automatically inferring whether learning the value of one variable might give us any additional hints about some other variable, given what we already know. “Might”: variables may actually be independent when they’re not d-separated, depending on the actual probabilities involved.
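For small graphs the definition can be implemented directly by enumerating undirected paths and testing the blocking rules, which is enough to reproduce the burglar-alarm examples above (a real implementation would use the linear-time, Bayes-ball-style search the slide mentions). A Python sketch:

    def d_separated(edges, x, z, evidence):
        """edges: set of directed (parent, child) pairs; evidence: observed nodes."""
        evidence = set(evidence)
        nodes = {n for edge in edges for n in edge}
        children = {n: {c for p, c in edges if p == n} for n in nodes}
        parents = {n: {p for p, c in edges if c == n} for n in nodes}

        def descendants(n):
            seen, stack = set(), [n]
            while stack:
                for c in children[stack.pop()]:
                    if c not in seen:
                        seen.add(c)
                        stack.append(c)
            return seen

        def paths(cur, visited):               # all simple undirected paths cur -> z
            if cur == z:
                yield visited
                return
            for nxt in children[cur] | parents[cur]:
                if nxt not in visited:
                    yield from paths(nxt, visited + [nxt])

        def blocked(path):
            for a, y, b in zip(path, path[1:], path[2:]):
                if (a, y) in edges and (b, y) in edges:        # a -> y <- b: head-to-head
                    if y not in evidence and not (descendants(y) & evidence):
                        return True                            # blocked by the funky case
                elif y in evidence:                            # chain or common cause
                    return True
            return False

        return all(blocked(p) for p in paths(x, [x]))

    alarm_net = {("Burglar", "Alarm"), ("Earthquake", "Alarm"), ("Alarm", "PhoneCall")}
    print(d_separated(alarm_net, "Burglar", "Earthquake", set()))           # True
    print(d_separated(alarm_net, "Burglar", "Earthquake", {"PhoneCall"}))   # False (explaining away)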