Slide 1: Directed Graphical Probabilistic Models. William W. Cohen, Machine Learning 10-601, Feb 2008.

Slide 2: Some practical problems. I have 3 standard d20 dice and 1 loaded d20 die, where "loaded" means P(19 or 20) = 0.5. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = "the d20 picked is fair" and B = "roll a 19 or 20 with that die". What is P(B)? P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2. What is P(A=fair | B=criticalHit)?
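
A minimal sketch of the slide's arithmetic in Python (the variable names and the layout are just for illustration): it computes P(B) by total probability and then P(A=fair | B) by Bayes rule.

```python
# Prior over which die was picked, and likelihood of a "critical hit" (19 or 20).
p_fair = 0.75
p_loaded = 0.25
p_crit_given_fair = 2 / 20      # 0.1
p_crit_given_loaded = 0.5

# Total probability: P(B) = P(B|A) P(A) + P(B|~A) P(~A)
p_crit = p_crit_given_fair * p_fair + p_crit_given_loaded * p_loaded
print(p_crit)                    # 0.2

# Bayes rule: P(A=fair | B) = P(B|A=fair) P(A=fair) / P(B)
p_fair_given_crit = p_crit_given_fair * p_fair / p_crit
print(p_fair_given_crit)         # 0.375
```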

Slide 3: Definition of Conditional Probability. P(A|B) = P(A ^ B) / P(B). Corollaries: P(A ^ B) = P(A|B) P(B) (chain rule); P(B|A) = P(A|B) P(B) / P(A) (Bayes rule).

Slide 4: The (highly practical) Monty Hall problem. You're in a game show. Behind one door is a prize; behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens another door, say #3, revealing… a goat! You now can either: stick with your guess; always change doors; or flip a coin and pick a new door randomly according to the coin.

Slide 5: Some practical problems. I have 1 standard d6 die and 2 loaded d6 dice. Loaded high: P(X=6)=0.50; loaded low: P(X=1)=0.50. Experiment: pick two of the d6 uniformly at random (call the pair A) and roll them. Which is more likely: rolling a seven or rolling doubles? Three possible pairs: HL, HF, FL. P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL) = P(D|A=HL) P(A=HL) + P(D|A=HF) P(A=HF) + P(D|A=FL) P(A=FL).

Slide 6: A brute-force solution. Build the joint probability table over (A, Roll 1, Roll 2), e.g.:

A    Roll 1  Roll 2  P                  Comment
FL   1       1       1/3 * 1/6 * 1/2    doubles
FL   1       2       1/3 * 1/6 * 1/10
FL   1       …       …
FL   1       6       …                  seven
FL   2       1       …
…    …       …       …
FL   6       6       …                  doubles
HL   1       1       …
HL   1       2       …
…    …       …       …
HF   1       1       …
…    …       …       …

A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk. With this you can compute any P(A) where A is any boolean combination of the primitive events (Xi=xi), e.g. P(doubles), P(seven or eleven), P(total is higher than 5), …
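
A brute-force sketch of this table in Python (assuming, as the slide's entries suggest, that each loaded die puts probability 0.5 on its favored face and 0.1 on each of the other five): it enumerates every (pair, roll 1, roll 2) combination, builds the joint, and sums rows to compare P(doubles) with P(seven).

```python
from itertools import product

# Per-die face distributions (assumption: the remaining 0.5 of mass is spread
# uniformly over the five non-favored faces).
FAIR = {f: 1/6 for f in range(1, 7)}
HIGH = {f: (0.5 if f == 6 else 0.1) for f in range(1, 7)}
LOW  = {f: (0.5 if f == 1 else 0.1) for f in range(1, 7)}

# The experiment picks one of three die pairs uniformly at random.
pairs = {"HL": (HIGH, LOW), "HF": (HIGH, FAIR), "FL": (FAIR, LOW)}

joint = {}   # (pair, roll1, roll2) -> probability
for name, (d1, d2) in pairs.items():
    for r1, r2 in product(range(1, 7), repeat=2):
        joint[(name, r1, r2)] = (1/3) * d1[r1] * d2[r2]

p_doubles = sum(p for (_, r1, r2), p in joint.items() if r1 == r2)
p_seven   = sum(p for (_, r1, r2), p in joint.items() if r1 + r2 == 7)
print(f"P(doubles) = {p_doubles:.4f}, P(seven) = {p_seven:.4f}")
```

Under these assumed face probabilities the enumeration makes seven the more likely of the two, since the 6 on one loaded die lines up with the 1 on the other.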

Slide 7: The Joint Distribution. Recipe for making a joint distribution of M variables. Example: Boolean variables A, B, C.

Slide 8: The Joint Distribution. Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
Example: Boolean variables A, B, C:

A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

Slide 9: The Joint Distribution. Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (2^M rows for M Boolean variables).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C:

A B C  Prob
0 0 0  0.30
0 0 1  0.05
0 1 0  0.10
0 1 1  0.05
1 0 0  0.05
1 0 1  0.10
1 1 0  0.25
1 1 1  0.10

Slide 10: The Joint Distribution. Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (2^M rows for M Boolean variables).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
Example: Boolean variables A, B, C:

A B C  Prob
0 0 0  0.30
0 0 1  0.05
0 1 0  0.10
0 1 1  0.05
1 0 0  0.05
1 0 1  0.10
1 1 0  0.25
1 1 1  0.10

[Figure: the same probabilities shown as regions of a diagram over the events A, B, C.]
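
A small sketch of this recipe in Python (the dictionary encoding is my own; the numbers are the ones from the slide): it stores the truth-table rows, checks the sum-to-1 requirement from step 3, and reads off a marginal.

```python
# Joint distribution over Boolean A, B, C, keyed by (a, b, c).
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# Step 3 of the recipe: the entries must sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9

# Any marginal is just a sum of rows, e.g. P(A=1).
p_a = sum(p for (a, b, c), p in joint.items() if a == 1)
print(p_a)   # 0.5
```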

Slide 11: Using the Joint. Once you have the joint distribution, you can ask for the probability of any logical expression involving your attributes.

Slide 12: Using the Joint. P(Poor ^ Male) = 0.4654

Slide 13: Using the Joint. P(Poor) = 0.7604

Slide 14: Inference with the Joint. P(E1 | E2) = P(E1 ^ E2) / P(E2): sum the probabilities of the rows matching both E1 and E2, and divide by the sum of the probabilities of the rows matching E2.

Slide 15: Inference with the Joint. P(Male | Poor) = P(Poor ^ Male) / P(Poor) = 0.4654 / 0.7604 = 0.612
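
A sketch of this kind of query in Python, reusing the A, B, C joint from slide 10 rather than the census table behind slides 12-15 (which is not reproduced in the transcript): conditioning is just two row-sums and a division.

```python
def conditional(joint, query, given):
    """P(query | given), where query and given are predicates over a row key."""
    p_given = sum(p for row, p in joint.items() if given(row))
    p_both  = sum(p for row, p in joint.items() if given(row) and query(row))
    return p_both / p_given

# Joint over Boolean (A, B, C) from slide 10.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# e.g. P(A=1 | B=1) = P(A=1 ^ B=1) / P(B=1) = 0.35 / 0.50 = 0.70
print(conditional(joint, query=lambda r: r[0] == 1, given=lambda r: r[1] == 1))
```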

Slide 16: Inference is a big deal. I've got this evidence: what's the chance that this conclusion is true? I've got a sore neck: how likely am I to have meningitis? I see my lights are out and it's 9pm: what's the chance my spouse is already asleep? … Lots and lots of important problems can be solved algorithmically using joint inference. How do you get from the statement of the problem to the joint probability table? (What is a "statement of the problem"?) The joint probability table is usually too big to represent explicitly, so how do you represent it compactly?

Slide 17: Problem 1: a description of the experiment. I have 3 standard d20 dice and 1 loaded die. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = "the d20 picked is fair" and B = "roll a 19 or 20 with that die". What is P(B)? Graph: A → B, with P(A=fair)=0.75, P(A=loaded)=0.25, P(B=critical | A=fair)=0.1, P(B=noncritical | A=fair)=0.9, P(B=critical | A=loaded)=0.5, P(B=noncritical | A=loaded)=0.5.

A       P(A)
Fair    0.75
Loaded  0.25

A       B            P(B|A)
Fair    Critical     0.1
Fair    Noncritical  0.9
Loaded  Critical     0.5
Loaded  Noncritical  0.5

Slide 18: Problem 1: a description of the experiment. The graph A → B with the CPTs from slide 17 is:
an "influence diagram" (informal: what influences what);
a Bayes net / belief network / … (formal: more later!);
a description of the experiment (i.e., how data could be generated);
everything you need to reconstruct the joint distribution, via the chain rule: P(A,B) = P(B|A) P(A).
Analogy: problem with probabilities : Bayes net :: word problem : algebra.

Slide 19: Problem 2: a description. I have 1 standard d6 die and 2 loaded d6 dice. Loaded high: P(X=6)=0.50; loaded low: P(X=1)=0.50. Experiment: pick two d6 without replacement (D1, D2) and roll them. Which is more likely: rolling a seven or rolling doubles? Graph: D1 → D2, and (D1, D2) → R.

D1        P(D1)
Fair      0.33
LoadedLo  0.33
LoadedHi  0.33

D1        D2        P(D2|D1)
Fair      LoadedLo  0.5
Fair      LoadedHi  0.5
LoadedLo  Fair      0.5
LoadedLo  LoadedHi  0.5
LoadedHi  Fair      0.5
LoadedHi  LoadedLo  0.5

D1    D2   R    P(R|D1,D2)
Fair  Lo   1,1  1/6 * 1/2
Fair  Lo   1,2  1/6 * 1/10
…     …    …    …

Slide 20: Problem 2: a description. Same graph and CPTs as slide 19. By the chain rule, P(A ^ B) = P(A|B) P(B), so:
P(R, D1, D2) = P(R | D1, D2) P(D1, D2)
P(R, D1, D2) = P(R | D1, D2) P(D2 | D1) P(D1)
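
A sketch of that factorization in Python (the dictionaries mirror the CPTs on slides 19-20; the 1/10 entries assume the loaded dice spread their remaining mass uniformly over the non-favored faces): any joint entry P(R, D1, D2) is a product of three CPT lookups.

```python
# CPTs from the slides: P(D1), P(D2|D1), and per-die face probabilities for P(R|D1,D2).
p_d1 = {"Fair": 1/3, "Lo": 1/3, "Hi": 1/3}
p_d2_given_d1 = {
    "Fair": {"Lo": 0.5, "Hi": 0.5},
    "Lo":   {"Fair": 0.5, "Hi": 0.5},
    "Hi":   {"Fair": 0.5, "Lo": 0.5},
}
faces = {
    "Fair": {f: 1/6 for f in range(1, 7)},
    "Lo":   {f: (0.5 if f == 1 else 0.1) for f in range(1, 7)},
    "Hi":   {f: (0.5 if f == 6 else 0.1) for f in range(1, 7)},
}

def p_joint(r, d1, d2):
    """P(R=r, D1=d1, D2=d2) = P(R=r | D1,D2) * P(D2|D1) * P(D1)."""
    r1, r2 = r
    p_r = faces[d1][r1] * faces[d2][r2]   # the two rolls are independent given the dice
    return p_r * p_d2_given_d1[d1][d2] * p_d1[d1]

# (1/6 * 1/2 for the rolls) * (1/2 for P(D2|D1)) * (1/3 for P(D1)) ≈ 0.0139
print(p_joint((1, 1), "Fair", "Lo"))
```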

Slide 21: Problem 2: another description. Variables: D1, D2, R1, R2, Is7, with R1 = 1,2,…,6, R2 = 1,2,…,6, Is7 = yes,no.

Slide 22: The (highly practical) Monty Hall problem. You're in a game show. Behind one door is a prize; behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens one door, revealing… a goat! You now can either stick with your guess or change doors. Variables: A = first guess, B = the money, C = the goat (the door Monty opens), D = stick or swap?, E = second guess.

A  P(A)
1  0.33
2  0.33
3  0.33

B  P(B)
1  0.33
2  0.33
3  0.33

D      P(D)
Stick  0.5
Swap   0.5

A  B  C  P(C|A,B)
1  1  2  0.5
1  1  3  0.5
1  2  3  1.0
1  3  2  1.0
…  …  …  …

Slide 23: The (highly practical) Monty Hall problem. Same network (A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess) and the CPTs from slide 22, plus a table for the second guess:

A  C  D  E  P(E|A,C,D)
…  …  …  …  …

Slide 24: The (highly practical) Monty Hall problem. Same network and CPTs. Again by the chain rule:
P(A,B,C,D,E) = P(E|A,C,D) * P(D) * P(C|A,B) * P(B) * P(A)
We could construct the joint and compute P(E=B | D=swap).
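
A sketch of that computation in Python (my own encoding of the CPTs the slides describe: Monty opens a goat door other than the first guess, and a swap moves to the remaining closed door): it builds the joint by the chain rule above and compares P(E=B | D=swap) with P(E=B | D=stick).

```python
from itertools import product

DOORS = (1, 2, 3)

def p_c_given_ab(c, a, b):
    """Monty opens door c: never the guess a, never the money b, uniformly otherwise."""
    allowed = [d for d in DOORS if d != a and d != b]
    return 1 / len(allowed) if c in allowed else 0.0

def second_guess(a, c, d):
    """Deterministic E: keep a if sticking, otherwise a door that is neither a nor c."""
    return a if d == "stick" else next(x for x in DOORS if x not in (a, c))

# Joint over (A, B, C, D, E) via the chain rule on slide 24.
joint = {}
for a, b, c, d in product(DOORS, DOORS, DOORS, ("stick", "swap")):
    e = second_guess(a, c, d)
    joint[(a, b, c, d, e)] = (1/3) * (1/3) * p_c_given_ab(c, a, b) * 0.5

for choice in ("stick", "swap"):
    p_d = sum(p for (a, b, c, d, e), p in joint.items() if d == choice)
    p_win = sum(p for (a, b, c, d, e), p in joint.items() if d == choice and e == b)
    print(choice, p_win / p_d)
```

Under this encoding the query comes out the standard way: swapping wins with probability 2/3, sticking with 1/3.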

Slide 25: The (highly practical) Monty Hall problem. The joint table has…? 3*3*3*2*3 = 162 rows. The conditional probability tables (CPTs) shown have…? 3 + 3 + 2 + 3*3*3 + 3*3*2*3 = 89 rows < 162 rows. Big questions: Why are the CPTs smaller? How much smaller are the CPTs than the joint? Can we compute the answers to queries like P(E=B|d) without building the joint probability table, just using the CPTs?

Slide 26: The (highly practical) Monty Hall problem. Why is the CPT representation smaller? Follow the money (B)! E is conditionally independent of B given A, C, D.

Slide 27: Conditional Independence formalized. Definition: R and L are conditionally independent given M if for all x, y, z in {T, F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally: let S1, S2, and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,
P(S1's assignments | S2's assignments ^ S3's assignments) = P(S1's assignments | S3's assignments)
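
A small sketch in Python of what this definition means operationally (the helper names and the tolerance are my own, and it assumes every conditioning event has positive probability): given a joint table, it checks P(R | M, L) = P(R | M) for every assignment.

```python
from itertools import product

def cond_prob(joint, event, given):
    p_given = sum(p for row, p in joint.items() if given(row))
    return sum(p for row, p in joint.items() if given(row) and event(row)) / p_given

def cond_indep(joint, i, j, k, values=(0, 1), tol=1e-9):
    """True iff variable i and variable j are conditionally independent given variable k,
    where rows of `joint` are tuples indexed by position."""
    for x, y, z in product(values, repeat=3):
        lhs = cond_prob(joint, lambda r: r[i] == x, lambda r: r[k] == z and r[j] == y)
        rhs = cond_prob(joint, lambda r: r[i] == x, lambda r: r[k] == z)
        if abs(lhs - rhs) > tol:
            return False
    return True

# Example: a joint over (R, M, L) in which R and L are each noisy copies of M.
joint = {}
for r, m, l in product((0, 1), repeat=3):
    p_m = 0.6 if m == 1 else 0.4
    p_r = 0.9 if r == m else 0.1
    p_l = 0.7 if l == m else 0.3
    joint[(r, m, l)] = p_m * p_r * p_l

print(cond_indep(joint, i=0, j=2, k=1))   # True: R and L are independent given M by construction
```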

Slide 28: The (highly practical) Monty Hall problem. A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess. What are the conditional independencies? I<…>? …

Slide 29: Bayes Nets formalized. A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where: V is a set of vertices; E is a set of directed edges joining vertices; and no directed cycles of any length are allowed. Each vertex in V contains the following information: the name of a random variable, and a probability distribution table indicating how the probability of this variable's values depends on all possible combinations of parental values.

Slide 30: Building a Bayes Net.
1. Choose a set of relevant variables.
2. Choose an ordering for them.
3. Assume they're called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc.).
4. For i = 1 to m:
   a. Add the Xi node to the network.
   b. Set Parents(Xi) to be a minimal subset of {X1 … Xi-1} such that Xi is conditionally independent of all other members of {X1 … Xi-1} given Parents(Xi).
   c. Define the probability table of P(Xi=k | assignments of Parents(Xi)).

Slide 31: The general case.
P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
= P(Xn=xn ^ Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X1=x1)
= …
= the product over i of P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1), which in a Bayes net reduces to the product over i of P(Xi=xi | assignments of Parents(Xi)).
So any entry in the joint pdf table can be computed, and so any conditional probability can be computed.
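
A sketch in Python of that computation for a generic Bayes net (the parents/CPT encoding is my own): one joint entry is just a product of CPT lookups, illustrated on the A → B network from slide 17.

```python
def joint_entry(assignment, parents, cpts):
    """P(X1=x1 ^ ... ^ Xn=xn) = product over i of P(Xi=xi | values of Parents(Xi))."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpts[var][parent_values][value]
    return prob

# The d20 network from slide 17: A -> B.
parents = {"A": (), "B": ("A",)}
cpts = {
    "A": {(): {"fair": 0.75, "loaded": 0.25}},
    "B": {("fair",):   {"critical": 0.1, "noncritical": 0.9},
          ("loaded",): {"critical": 0.5, "noncritical": 0.5}},
}

print(joint_entry({"A": "loaded", "B": "critical"}, parents, cpts))   # 0.25 * 0.5 = 0.125
```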

Slide 32: Building a Bayes Net: Example 1. Apply the recipe from slide 30. Variables: D1 = first die, D2 = second die, R = roll, Is7. Ordering: D1, D2, R, Is7. Network: D1 → D2, (D1, D2) → R, R → Is7.

D1    P(D1)
Fair  0.33
Lo    0.33
Hi    0.33

D1    D2    P(D2|D1)
Fair  Lo    0.5
Fair  Hi    0.5
Lo    Fair  0.5
Lo    Hi    0.5
Hi    Fair  0.5
Hi    Lo    0.5

D1    D2   R    P(R|D1,D2)
Fair  Lo   1,1  1/6 * 1/2
Fair  Lo   1,2  1/6 * 1/10
…     …    …    …

R    Is7  P(Is7|R)
1,1  No   1.0
1,2  No   1.0
…    …    …
1,6  Yes  1.0
2,1  No   1.0
…    …    …

P(D1,D2,R,Is7) = P(Is7|R) * P(R|D1,D2) * P(D2|D1) * P(D1)

Slide 33: Another construction. Variables D1, D2, R1, R2, Is7, with R1 = 1,2,…,6, R2 = 1,2,…,6, Is7 = yes,no. Pick the order D1, D2, R1, R2, Is7 and build the network. What if I pick other orders? What about R2, R1, D2, D1, Is7? What about the first network, A → B: suppose I had picked the order B, A?

Slide 34: Another simple example. Bigram language model: pick word w0 from a multinomial M1, P(w0); for t = 1 to 100, pick word wt from a multinomial M2, P(wt|wt-1). Network: w0 → w1 → w2 → … → w100.

w0        P(w0)
aardvark  0.0002
…         …
the       0.179
…         …
zymurgy   0.097

wt-1      wt        P(wt|wt-1)
aardvark  aardvark  0.00000001
aardvark  …         …
aardvark  tastes    0.1275
…         …         …
aardvark  zymurgy   0.0000076

Slide 35: A simple example. The same bigram model and tables as slide 34. Size of CPTs: |V| + |V|^2. Size of joint table: |V|^101. (|V| = 20 for amino acids / protein sequences.)
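
A toy sketch of the bigram model in Python (the three-word vocabulary and its probabilities are made up for illustration): the probability of a whole sequence is P(w0) times a product of P(wt|wt-1) lookups, so only |V| + |V|^2 numbers are ever stored.

```python
# Toy vocabulary and CPTs: a unigram start distribution plus a bigram transition table.
start = {"the": 0.6, "aardvark": 0.3, "tastes": 0.1}
bigram = {
    "the":      {"the": 0.05, "aardvark": 0.80, "tastes": 0.15},
    "aardvark": {"the": 0.10, "aardvark": 0.05, "tastes": 0.85},
    "tastes":   {"the": 0.70, "aardvark": 0.25, "tastes": 0.05},
}

def sequence_prob(words):
    """P(w0, ..., wT) = P(w0) * prod_t P(wt | wt-1): the chain-structured Bayes net."""
    p = start[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= bigram[prev][cur]
    return p

print(sequence_prob(["the", "aardvark", "tastes"]))   # 0.6 * 0.8 * 0.85 = 0.408
```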

Slide 36: A simple example. Trigram language model: pick word w0 from a multinomial M1, P(w0); pick word w1 from a multinomial M2, P(w1|w0); for t = 2 to 100, pick word wt from a multinomial M3, P(wt|wt-1, wt-2). Network: w0 → w1 → w2 → … → w100, with an extra edge from each wt-2 to wt. Size of CPTs: |V| + |V|^2 + |V|^3. Size of joint table: |V|^101. (|V| = 20 for amino acids / protein sequences.)

Slide 37: What Independencies does a Bayes Net Model? In order for a Bayesian network to model a probability distribution, the following must be true by definition: each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents. This implies the factorization P(X1, …, Xn) = the product over i of P(Xi | Parents(Xi)). But what else does it imply?

Slide 38: What Independencies does a Bayes Net Model? Example: the chain Z → Y → X. Given Y, does learning the value of Z tell us nothing new about X? I.e., is P(X|Y, Z) equal to P(X|Y)? Yes: since we know the value of all of X's parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z given Y. Also, since conditional independence is symmetric, P(Z|Y, X) = P(Z|Y).

Slide 39: Quick proof that conditional independence is symmetric. Assume: P(X|Y, Z) = P(X|Y). Then:
P(Z|X, Y) = P(X, Z|Y) / P(X|Y)              (Bayes's rule / definition of conditional probability)
          = P(X|Y, Z) P(Z|Y) / P(X|Y)       (chain rule)
          = P(X|Y) P(Z|Y) / P(X|Y)          (by assumption)
          = P(Z|Y)

Slide 40: What Independencies does a Bayes Net Model? Let I<X, {Y}, Z> represent "X and Z are conditionally independent given Y". For the common-cause graph X ← Y → Z: does I<X, {Y}, Z> hold? Yes, just as in the previous example: all of X's parents are given, and Z is not a descendant of X.

Slide 41: What Independencies does a Bayes Net Model? (Graph over X, Z, U, V.) I<…>? No. I<…>? Yes. Maybe I<X, S, Z> iff S acts as a cutset between X and Z in an undirected version of the graph…?

Slide 42: Things get a little more confusing. Consider X → Y ← Z. X has no parents, so we know all its parents' values trivially, and Z is not a descendant of X. So I<X, {}, Z> holds, even though there's an undirected path from X to Z through an unknown variable Y. What if we do know the value of Y, though? Or one of its descendants?

Slide 43: The "Burglar Alarm" example. Graph: Burglar → Alarm ← Earthquake, Alarm → PhoneCall. Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn't care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!

Slide 44: Things get a lot more confusing. But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all. Earthquake "explains away" the hypothetical burglar. But then it must not be the case that I<Burglar, {PhoneCall}, Earthquake>, even though I<Burglar, {}, Earthquake> holds!

Slide 45: d-separation to the rescue. Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation. Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is "blocked", where a path is "blocked" iff one or more of the following conditions is true… (i.e., X and Z are dependent iff there exists an unblocked path).

Slide 46: A path is "blocked" when…
There exists a variable V on the path such that V is in the evidence set E and the arcs putting V on the path are "tail-to-tail" (V is a common parent of its two neighbors on the path). Unknown "common causes" of X and Z impose dependency; observing V blocks the path.
Or, there exists a variable V on the path such that V is in the evidence set E and the arcs putting V on the path are "tail-to-head" (V sits in the middle of a chain). Unknown "causal chains" connecting X and Z impose dependency; observing V blocks the path.
Or, …

Slide 47: A path is "blocked" when… (the funky case)
… Or, there exists a variable Y on the path such that Y is NOT in the evidence set E, neither are any of its descendants, and the arcs putting Y on the path are "head-to-head" (a collider). Known "common symptoms" of X and Z impose dependencies: X may "explain away" Z.
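
A sketch in Python of the full definition from slides 45-47 (the graph encoding and function names are my own, and it enumerates undirected paths rather than using the faster linear-time algorithm mentioned on the next slide): a path is blocked by an observed chain or fork node, or by an unobserved collider with no observed descendant, and X and Z are d-separated iff every path is blocked. The demo graph is the burglar-alarm network from slides 43-44.

```python
def descendants(graph, node):
    """All nodes reachable from `node` along directed edges (graph: parent -> children)."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def undirected_paths(graph, x, z):
    """All simple paths from x to z, ignoring edge direction."""
    nbrs = {}
    for parent, children in graph.items():
        for c in children:
            nbrs.setdefault(parent, set()).add(c)
            nbrs.setdefault(c, set()).add(parent)
    paths, stack = [], [[x]]
    while stack:
        path = stack.pop()
        for n in nbrs.get(path[-1], ()):
            if n == z:
                paths.append(path + [n])
            elif n not in path:
                stack.append(path + [n])
    return paths

def d_separated(graph, x, z, evidence):
    evidence = set(evidence)
    for path in undirected_paths(graph, x, z):
        blocked = False
        for a, v, b in zip(path, path[1:], path[2:]):   # each interior node with its neighbors
            collider = v in graph.get(a, []) and v in graph.get(b, [])   # a -> v <- b
            if collider:
                # Head-to-head: blocks unless v or one of its descendants is observed.
                if v not in evidence and not (descendants(graph, v) & evidence):
                    blocked = True
            elif v in evidence:
                # Chain or fork (tail-to-head or tail-to-tail): blocks if v is observed.
                blocked = True
        if not blocked:
            return False
    return True

# Burglar-alarm network: Burglar -> Alarm <- Earthquake, Alarm -> Call.
g = {"Burglar": ["Alarm"], "Earthquake": ["Alarm"], "Alarm": ["Call"]}
print(d_separated(g, "Burglar", "Earthquake", evidence=[]))        # True: marginally independent
print(d_separated(g, "Burglar", "Earthquake", evidence=["Call"]))  # False: explaining away
```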

Slide 48: d-separation to the rescue, cont'd. Theorem [Verma & Pearl, 1988]: if a set of evidence variables E d-separates X and Z in a Bayesian network's graph, then I<X, E, Z>. d-separation can be computed in linear time using a depth-first-search-like algorithm. Great! We now have a fast algorithm for automatically inferring whether learning the value of one variable might give us any additional hints about some other variable, given what we already know. "Might": variables may actually be independent even when they're not d-separated, depending on the actual probabilities involved.

