Slide 1: Directed Graphical Probabilistic Models. William W. Cohen, Machine Learning 10-601, Feb 2008.

Slide 2: Some practical problems. I have 3 standard d20 dice and 1 loaded d20 die, where "loaded" means P(19 or 20) = 0.5. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = "the d20 picked is fair" and B = "roll a 19 or 20 with that die". What is P(B)? P(B) = P(B|A) P(A) + P(B|~A) P(~A) = 0.1*0.75 + 0.5*0.25 = 0.2. What is P(A=fair | B=criticalHit)?
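
A minimal sketch of the slide's arithmetic in Python (the variable names and the layout are just for illustration): it computes P(B) by total probability and then P(A=fair | B) by Bayes rule.

```python
# Prior over which die was picked, and likelihood of a "critical hit" (19 or 20).
p_fair = 0.75
p_loaded = 0.25
p_crit_given_fair = 2 / 20      # 0.1
p_crit_given_loaded = 0.5

# Total probability: P(B) = P(B|A) P(A) + P(B|~A) P(~A)
p_crit = p_crit_given_fair * p_fair + p_crit_given_loaded * p_loaded
print(p_crit)                    # 0.2

# Bayes rule: P(A=fair | B) = P(B|A=fair) P(A=fair) / P(B)
p_fair_given_crit = p_crit_given_fair * p_fair / p_crit
print(p_fair_given_crit)         # 0.375
```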

Slide 3: Definition of Conditional Probability. P(A|B) = P(A ^ B) / P(B). Corollaries: P(A ^ B) = P(A|B) P(B) (chain rule); P(B|A) = P(A|B) P(B) / P(A) (Bayes rule).

Slide 4: The (highly practical) Monty Hall problem. You're in a game show. Behind one door is a prize; behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens another door, say #3, revealing… a goat! You now can either: stick with your guess; always change doors; or flip a coin and pick a new door randomly according to the coin.

Slide 5: Some practical problems. I have 1 standard d6 die and 2 loaded d6 dice. Loaded high: P(X=6)=0.50; loaded low: P(X=1)=0.50. Experiment: pick two of the d6 uniformly at random (call the pair A) and roll them. Which is more likely: rolling a seven or rolling doubles? Three possible pairs: HL, HF, FL. P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL) = P(D|A=HL) P(A=HL) + P(D|A=HF) P(A=HF) + P(D|A=FL) P(A=FL).

Slide 6: A brute-force solution. Build the joint probability table over (A, Roll 1, Roll 2), e.g.:

A    Roll 1  Roll 2  P                  Comment
FL   1       1       1/3 * 1/6 * 1/2    doubles
FL   1       2       1/3 * 1/6 * 1/10
FL   1       …       …
FL   1       6       …                  seven
FL   2       1       …
…    …       …       …
FL   6       6       …                  doubles
HL   1       1       …
HL   1       2       …
…    …       …       …
HF   1       1       …
…    …       …       …

A joint probability table shows P(X1=x1 and … and Xk=xk) for every possible combination of values x1, x2, …, xk. With this you can compute any P(A) where A is any boolean combination of the primitive events (Xi=xi), e.g. P(doubles), P(seven or eleven), P(total is higher than 5), …
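
A brute-force sketch of this table in Python (assuming, as the slide's entries suggest, that each loaded die puts probability 0.5 on its favored face and 0.1 on each of the other five): it enumerates every (pair, roll 1, roll 2) combination, builds the joint, and sums rows to compare P(doubles) with P(seven).

```python
from itertools import product

# Per-die face distributions (assumption: the remaining 0.5 of mass is spread
# uniformly over the five non-favored faces).
FAIR = {f: 1/6 for f in range(1, 7)}
HIGH = {f: (0.5 if f == 6 else 0.1) for f in range(1, 7)}
LOW  = {f: (0.5 if f == 1 else 0.1) for f in range(1, 7)}

# The experiment picks one of three die pairs uniformly at random.
pairs = {"HL": (HIGH, LOW), "HF": (HIGH, FAIR), "FL": (FAIR, LOW)}

joint = {}   # (pair, roll1, roll2) -> probability
for name, (d1, d2) in pairs.items():
    for r1, r2 in product(range(1, 7), repeat=2):
        joint[(name, r1, r2)] = (1/3) * d1[r1] * d2[r2]

p_doubles = sum(p for (_, r1, r2), p in joint.items() if r1 == r2)
p_seven   = sum(p for (_, r1, r2), p in joint.items() if r1 + r2 == 7)
print(f"P(doubles) = {p_doubles:.4f}, P(seven) = {p_seven:.4f}")
```

Under these assumed face probabilities the enumeration makes seven the more likely of the two, since the 6 on one loaded die lines up with the 1 on the other.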

Slide 7: The Joint Distribution. Recipe for making a joint distribution of M variables. Example: Boolean variables A, B, C.

Slide 8: The Joint Distribution. Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
Example: Boolean variables A, B, C:

A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

Slide 9: The Joint Distribution. Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (2^M rows for M Boolean variables).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C:

A B C  Prob
0 0 0  0.30
0 0 1  0.05
0 1 0  0.10
0 1 1  0.05
1 0 0  0.05
1 0 1  0.10
1 1 0  0.25
1 1 1  0.10

Slide 10: The Joint Distribution. Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (2^M rows for M Boolean variables).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
Example: Boolean variables A, B, C:

A B C  Prob
0 0 0  0.30
0 0 1  0.05
0 1 0  0.10
0 1 1  0.05
1 0 0  0.05
1 0 1  0.10
1 1 0  0.25
1 1 1  0.10

[Figure: the same probabilities shown as regions of a diagram over the events A, B, C.]
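
A small sketch of this recipe in Python (the dictionary encoding is my own; the numbers are the ones from the slide): it stores the truth-table rows, checks the sum-to-1 requirement from step 3, and reads off a marginal.

```python
# Joint distribution over Boolean A, B, C, keyed by (a, b, c).
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# Step 3 of the recipe: the entries must sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9

# Any marginal is just a sum of rows, e.g. P(A=1).
p_a = sum(p for (a, b, c), p in joint.items() if a == 1)
print(p_a)   # 0.5
```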

Slide 11: Using the Joint. Once you have the joint distribution, you can ask for the probability of any logical expression involving your attributes.

Slide 12: Using the Joint. P(Poor ^ Male) = 0.4654

Slide 13: Using the Joint. P(Poor) = 0.7604

Slide 14: Inference with the Joint. P(E1 | E2) = P(E1 ^ E2) / P(E2): sum the probabilities of the rows matching both E1 and E2, and divide by the sum of the probabilities of the rows matching E2.

Slide 15: Inference with the Joint. P(Male | Poor) = P(Poor ^ Male) / P(Poor) = 0.4654 / 0.7604 = 0.612
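
A sketch of this kind of query in Python, reusing the A, B, C joint from slide 10 rather than the census table behind slides 12-15 (which is not reproduced in the transcript): conditioning is just two row-sums and a division.

```python
def conditional(joint, query, given):
    """P(query | given), where query and given are predicates over a row key."""
    p_given = sum(p for row, p in joint.items() if given(row))
    p_both  = sum(p for row, p in joint.items() if given(row) and query(row))
    return p_both / p_given

# Joint over Boolean (A, B, C) from slide 10.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# e.g. P(A=1 | B=1) = P(A=1 ^ B=1) / P(B=1) = 0.35 / 0.50 = 0.70
print(conditional(joint, query=lambda r: r[0] == 1, given=lambda r: r[1] == 1))
```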

Slide 16: Inference is a big deal. I've got this evidence: what's the chance that this conclusion is true? I've got a sore neck: how likely am I to have meningitis? I see my lights are out and it's 9pm: what's the chance my spouse is already asleep? … Lots and lots of important problems can be solved algorithmically using joint inference. How do you get from the statement of the problem to the joint probability table? (What is a "statement of the problem"?) The joint probability table is usually too big to represent explicitly, so how do you represent it compactly?

Slide 17: Problem 1: a description of the experiment. I have 3 standard d20 dice and 1 loaded die. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = "the d20 picked is fair" and B = "roll a 19 or 20 with that die". What is P(B)? Graph: A → B, with P(A=fair)=0.75, P(A=loaded)=0.25, P(B=critical | A=fair)=0.1, P(B=noncritical | A=fair)=0.9, P(B=critical | A=loaded)=0.5, P(B=noncritical | A=loaded)=0.5.

A       P(A)
Fair    0.75
Loaded  0.25

A       B            P(B|A)
Fair    Critical     0.1
Fair    Noncritical  0.9
Loaded  Critical     0.5
Loaded  Noncritical  0.5

Slide 18: Problem 1: a description of the experiment. The graph A → B with the CPTs from slide 17 is:
an "influence diagram" (informal: what influences what);
a Bayes net / belief network / … (formal: more later!);
a description of the experiment (i.e., how data could be generated);
everything you need to reconstruct the joint distribution, via the chain rule: P(A,B) = P(B|A) P(A).
Analogy: problem with probabilities : Bayes net :: word problem : algebra.

Slide 19: Problem 2: a description. I have 1 standard d6 die and 2 loaded d6 dice. Loaded high: P(X=6)=0.50; loaded low: P(X=1)=0.50. Experiment: pick two d6 without replacement (D1, D2) and roll them. Which is more likely: rolling a seven or rolling doubles? Graph: D1 → D2, and (D1, D2) → R.

D1        P(D1)
Fair      0.33
LoadedLo  0.33
LoadedHi  0.33

D1        D2        P(D2|D1)
Fair      LoadedLo  0.5
Fair      LoadedHi  0.5
LoadedLo  Fair      0.5
LoadedLo  LoadedHi  0.5
LoadedHi  Fair      0.5
LoadedHi  LoadedLo  0.5

D1    D2   R    P(R|D1,D2)
Fair  Lo   1,1  1/6 * 1/2
Fair  Lo   1,2  1/6 * 1/10
…     …    …    …

Slide 20: Problem 2: a description. Same graph and CPTs as slide 19. By the chain rule, P(A ^ B) = P(A|B) P(B), so:
P(R, D1, D2) = P(R | D1, D2) P(D1, D2)
P(R, D1, D2) = P(R | D1, D2) P(D2 | D1) P(D1)
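
A sketch of that factorization in Python (the dictionaries mirror the CPTs on slides 19-20; the 1/10 entries assume the loaded dice spread their remaining mass uniformly over the non-favored faces): any joint entry P(R, D1, D2) is a product of three CPT lookups.

```python
# CPTs from the slides: P(D1), P(D2|D1), and per-die face probabilities for P(R|D1,D2).
p_d1 = {"Fair": 1/3, "Lo": 1/3, "Hi": 1/3}
p_d2_given_d1 = {
    "Fair": {"Lo": 0.5, "Hi": 0.5},
    "Lo":   {"Fair": 0.5, "Hi": 0.5},
    "Hi":   {"Fair": 0.5, "Lo": 0.5},
}
faces = {
    "Fair": {f: 1/6 for f in range(1, 7)},
    "Lo":   {f: (0.5 if f == 1 else 0.1) for f in range(1, 7)},
    "Hi":   {f: (0.5 if f == 6 else 0.1) for f in range(1, 7)},
}

def p_joint(r, d1, d2):
    """P(R=r, D1=d1, D2=d2) = P(R=r | D1,D2) * P(D2|D1) * P(D1)."""
    r1, r2 = r
    p_r = faces[d1][r1] * faces[d2][r2]   # the two rolls are independent given the dice
    return p_r * p_d2_given_d1[d1][d2] * p_d1[d1]

# (1/6 * 1/2 for the rolls) * (1/2 for P(D2|D1)) * (1/3 for P(D1)) ≈ 0.0139
print(p_joint((1, 1), "Fair", "Lo"))
```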

Slide 21: Problem 2: another description. Variables: D1, D2, R1, R2, Is7, with R1 = 1,2,…,6, R2 = 1,2,…,6, Is7 = yes,no.

Slide 22: The (highly practical) Monty Hall problem. You're in a game show. Behind one door is a prize; behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens one door, revealing… a goat! You now can either stick with your guess or change doors. Variables: A = first guess, B = the money, C = the goat (the door Monty opens), D = stick or swap?, E = second guess.

A  P(A)
1  0.33
2  0.33
3  0.33

B  P(B)
1  0.33
2  0.33
3  0.33

D      P(D)
Stick  0.5
Swap   0.5

A  B  C  P(C|A,B)
1  1  2  0.5
1  1  3  0.5
1  2  3  1.0
1  3  2  1.0
…  …  …  …

Slide 23: The (highly practical) Monty Hall problem. Same network (A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess) and the CPTs from slide 22, plus a table for the second guess:

A  C  D  E  P(E|A,C,D)
…  …  …  …  …

Slide 24: The (highly practical) Monty Hall problem. Same network and CPTs. Again by the chain rule:
P(A,B,C,D,E) = P(E|A,C,D) * P(D) * P(C|A,B) * P(B) * P(A)
We could construct the joint and compute P(E=B | D=swap).
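
A sketch of that computation in Python (my own encoding of the CPTs the slides describe: Monty opens a goat door other than the first guess, and a swap moves to the remaining closed door): it builds the joint by the chain rule above and compares P(E=B | D=swap) with P(E=B | D=stick).

```python
from itertools import product

DOORS = (1, 2, 3)

def p_c_given_ab(c, a, b):
    """Monty opens door c: never the guess a, never the money b, uniformly otherwise."""
    allowed = [d for d in DOORS if d != a and d != b]
    return 1 / len(allowed) if c in allowed else 0.0

def second_guess(a, c, d):
    """Deterministic E: keep a if sticking, otherwise a door that is neither a nor c."""
    return a if d == "stick" else next(x for x in DOORS if x not in (a, c))

# Joint over (A, B, C, D, E) via the chain rule on slide 24.
joint = {}
for a, b, c, d in product(DOORS, DOORS, DOORS, ("stick", "swap")):
    e = second_guess(a, c, d)
    joint[(a, b, c, d, e)] = (1/3) * (1/3) * p_c_given_ab(c, a, b) * 0.5

for choice in ("stick", "swap"):
    p_d = sum(p for (a, b, c, d, e), p in joint.items() if d == choice)
    p_win = sum(p for (a, b, c, d, e), p in joint.items() if d == choice and e == b)
    print(choice, p_win / p_d)
```

Under this encoding the query comes out the standard way: swapping wins with probability 2/3, sticking with 1/3.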

Slide 25: The (highly practical) Monty Hall problem. The joint table has…? 3*3*3*2*3 = 162 rows. The conditional probability tables (CPTs) shown have…? 3 + 3 + 2 + 3*3*3 + 3*3*2*3 = 89 rows < 162 rows. Big questions: Why are the CPTs smaller? How much smaller are the CPTs than the joint? Can we compute the answers to queries like P(E=B|d) without building the joint probability table, just using the CPTs?

Slide 26: The (highly practical) Monty Hall problem. Why is the CPT representation smaller? Follow the money (B)! E is conditionally independent of B given A, C, D.

Slide 27: Conditional Independence formalized. Definition: R and L are conditionally independent given M if for all x, y, z in {T, F}:
P(R=x | M=y ^ L=z) = P(R=x | M=y)
More generally: let S1, S2, and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,
P(S1's assignments | S2's assignments ^ S3's assignments) = P(S1's assignments | S3's assignments)
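
A small sketch in Python of what this definition means operationally (the helper names and the tolerance are my own, and it assumes every conditioning event has positive probability): given a joint table, it checks P(R | M, L) = P(R | M) for every assignment.

```python
from itertools import product

def cond_prob(joint, event, given):
    p_given = sum(p for row, p in joint.items() if given(row))
    return sum(p for row, p in joint.items() if given(row) and event(row)) / p_given

def cond_indep(joint, i, j, k, values=(0, 1), tol=1e-9):
    """True iff variable i and variable j are conditionally independent given variable k,
    where rows of `joint` are tuples indexed by position."""
    for x, y, z in product(values, repeat=3):
        lhs = cond_prob(joint, lambda r: r[i] == x, lambda r: r[k] == z and r[j] == y)
        rhs = cond_prob(joint, lambda r: r[i] == x, lambda r: r[k] == z)
        if abs(lhs - rhs) > tol:
            return False
    return True

# Example: a joint over (R, M, L) in which R and L are each noisy copies of M.
joint = {}
for r, m, l in product((0, 1), repeat=3):
    p_m = 0.6 if m == 1 else 0.4
    p_r = 0.9 if r == m else 0.1
    p_l = 0.7 if l == m else 0.3
    joint[(r, m, l)] = p_m * p_r * p_l

print(cond_indep(joint, i=0, j=2, k=1))   # True: R and L are independent given M by construction
```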

Slide 28: The (highly practical) Monty Hall problem. A = first guess, B = the money, C = the goat, D = stick or swap?, E = second guess. What are the conditional independencies? I<…>? …

Slide 29: Bayes Nets formalized. A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where: V is a set of vertices; E is a set of directed edges joining vertices; and no directed cycles of any length are allowed. Each vertex in V contains the following information: the name of a random variable, and a probability distribution table indicating how the probability of this variable's values depends on all possible combinations of parental values.

Slide 30: Building a Bayes Net.
1. Choose a set of relevant variables.
2. Choose an ordering for them.
3. Assume they're called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc.).
4. For i = 1 to m:
   a. Add the Xi node to the network.
   b. Set Parents(Xi) to be a minimal subset of {X1 … Xi-1} such that Xi is conditionally independent of all other members of {X1 … Xi-1} given Parents(Xi).
   c. Define the probability table of P(Xi=k | assignments of Parents(Xi)).

Slide 31: The general case.
P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
= P(Xn=xn ^ Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
= P(Xn=xn | Xn-1=xn-1 ^ … ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X1=x1) * P(Xn-2=xn-2 ^ … ^ X1=x1)
= …
= the product over i of P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1), which in a Bayes net reduces to the product over i of P(Xi=xi | assignments of Parents(Xi)).
So any entry in the joint pdf table can be computed, and so any conditional probability can be computed.
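
A sketch in Python of that computation for a generic Bayes net (the parents/CPT encoding is my own): one joint entry is just a product of CPT lookups, illustrated on the A → B network from slide 17.

```python
def joint_entry(assignment, parents, cpts):
    """P(X1=x1 ^ ... ^ Xn=xn) = product over i of P(Xi=xi | values of Parents(Xi))."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpts[var][parent_values][value]
    return prob

# The d20 network from slide 17: A -> B.
parents = {"A": (), "B": ("A",)}
cpts = {
    "A": {(): {"fair": 0.75, "loaded": 0.25}},
    "B": {("fair",):   {"critical": 0.1, "noncritical": 0.9},
          ("loaded",): {"critical": 0.5, "noncritical": 0.5}},
}

print(joint_entry({"A": "loaded", "B": "critical"}, parents, cpts))   # 0.25 * 0.5 = 0.125
```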

Slide 32: Building a Bayes Net: Example 1. Apply the recipe from slide 30. Variables: D1 = first die, D2 = second die, R = roll, Is7. Ordering: D1, D2, R, Is7. Network: D1 → D2, (D1, D2) → R, R → Is7.

D1    P(D1)
Fair  0.33
Lo    0.33
Hi    0.33

D1    D2    P(D2|D1)
Fair  Lo    0.5
Fair  Hi    0.5
Lo    Fair  0.5
Lo    Hi    0.5
Hi    Fair  0.5
Hi    Lo    0.5

D1    D2   R    P(R|D1,D2)
Fair  Lo   1,1  1/6 * 1/2
Fair  Lo   1,2  1/6 * 1/10
…     …    …    …

R    Is7  P(Is7|R)
1,1  No   1.0
1,2  No   1.0
…    …    …
1,6  Yes  1.0
2,1  No   1.0
…    …    …

P(D1,D2,R,Is7) = P(Is7|R) * P(R|D1,D2) * P(D2|D1) * P(D1)

Slide 33: Another construction. Variables D1, D2, R1, R2, Is7, with R1 = 1,2,…,6, R2 = 1,2,…,6, Is7 = yes,no. Pick the order D1, D2, R1, R2, Is7 and build the network. What if I pick other orders? What about R2, R1, D2, D1, Is7? What about the first network, A → B: suppose I had picked the order B, A?

Slide 34: Another simple example. Bigram language model: pick word w0 from a multinomial M1, P(w0); for t = 1 to 100, pick word wt from a multinomial M2, P(wt|wt-1). Network: w0 → w1 → w2 → … → w100.

w0        P(w0)
aardvark  0.0002
…         …
the       0.179
…         …
zymurgy   0.097

wt-1      wt        P(wt|wt-1)
aardvark  aardvark  0.00000001
aardvark  …         …
aardvark  tastes    0.1275
…         …         …
aardvark  zymurgy   0.0000076

Slide 35: A simple example. The same bigram model and tables as slide 34. Size of CPTs: |V| + |V|^2. Size of joint table: |V|^101. (|V| = 20 for amino acids / protein sequences.)
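
A toy sketch of the bigram model in Python (the three-word vocabulary and its probabilities are made up for illustration): the probability of a whole sequence is P(w0) times a product of P(wt|wt-1) lookups, so only |V| + |V|^2 numbers are ever stored.

```python
# Toy vocabulary and CPTs: a unigram start distribution plus a bigram transition table.
start = {"the": 0.6, "aardvark": 0.3, "tastes": 0.1}
bigram = {
    "the":      {"the": 0.05, "aardvark": 0.80, "tastes": 0.15},
    "aardvark": {"the": 0.10, "aardvark": 0.05, "tastes": 0.85},
    "tastes":   {"the": 0.70, "aardvark": 0.25, "tastes": 0.05},
}

def sequence_prob(words):
    """P(w0, ..., wT) = P(w0) * prod_t P(wt | wt-1): the chain-structured Bayes net."""
    p = start[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= bigram[prev][cur]
    return p

print(sequence_prob(["the", "aardvark", "tastes"]))   # 0.6 * 0.8 * 0.85 = 0.408
```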

Slide 36: A simple example. Trigram language model: pick word w0 from a multinomial M1, P(w0); pick word w1 from a multinomial M2, P(w1|w0); for t = 2 to 100, pick word wt from a multinomial M3, P(wt|wt-1, wt-2). Network: w0 → w1 → w2 → … → w100, with an extra edge from each wt-2 to wt. Size of CPTs: |V| + |V|^2 + |V|^3. Size of joint table: |V|^101. (|V| = 20 for amino acids / protein sequences.)

Slide 37: What Independencies does a Bayes Net Model? In order for a Bayesian network to model a probability distribution, the following must be true by definition: each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents. This implies the factorization P(X1, …, Xn) = the product over i of P(Xi | Parents(Xi)). But what else does it imply?

Slide 38: What Independencies does a Bayes Net Model? Example: the chain Z → Y → X. Given Y, does learning the value of Z tell us nothing new about X? I.e., is P(X|Y, Z) equal to P(X|Y)? Yes: since we know the value of all of X's parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z given Y. Also, since conditional independence is symmetric, P(Z|Y, X) = P(Z|Y).

Slide 39: Quick proof that conditional independence is symmetric. Assume: P(X|Y, Z) = P(X|Y). Then:
P(Z|X, Y) = P(X, Z|Y) / P(X|Y)              (Bayes's rule / definition of conditional probability)
          = P(X|Y, Z) P(Z|Y) / P(X|Y)       (chain rule)
          = P(X|Y) P(Z|Y) / P(X|Y)          (by assumption)
          = P(Z|Y)

Slide 40: What Independencies does a Bayes Net Model? Let I<X, {Y}, Z> represent "X and Z are conditionally independent given Y". For the common-cause graph X ← Y → Z: does I<X, {Y}, Z> hold? Yes, just as in the previous example: all of X's parents are given, and Z is not a descendant of X.

Slide 41: What Independencies does a Bayes Net Model? (Graph over X, Z, U, V.) I<…>? No. I<…>? Yes. Maybe I<X, S, Z> iff S acts as a cutset between X and Z in an undirected version of the graph…?

Slide 42: Things get a little more confusing. Consider X → Y ← Z. X has no parents, so we know all its parents' values trivially, and Z is not a descendant of X. So I<X, {}, Z> holds, even though there's an undirected path from X to Z through an unknown variable Y. What if we do know the value of Y, though? Or one of its descendants?

Slide 43: The "Burglar Alarm" example. Graph: Burglar → Alarm ← Earthquake, Alarm → PhoneCall. Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn't care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!

Slide 44: Things get a lot more confusing. But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all. Earthquake "explains away" the hypothetical burglar. But then it must not be the case that I<Burglar, {PhoneCall}, Earthquake>, even though I<Burglar, {}, Earthquake> holds!

Slide 45: d-separation to the rescue. Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation. Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is "blocked", where a path is "blocked" iff one or more of the following conditions is true… (i.e., X and Z are dependent iff there exists an unblocked path).

Slide 46: A path is "blocked" when…
There exists a variable V on the path such that V is in the evidence set E and the arcs putting V on the path are "tail-to-tail" (V is a common parent of its two neighbors on the path). Unknown "common causes" of X and Z impose dependency; observing V blocks the path.
Or, there exists a variable V on the path such that V is in the evidence set E and the arcs putting V on the path are "tail-to-head" (V sits in the middle of a chain). Unknown "causal chains" connecting X and Z impose dependency; observing V blocks the path.
Or, …

Slide 47: A path is "blocked" when… (the funky case)
… Or, there exists a variable Y on the path such that Y is NOT in the evidence set E, neither are any of its descendants, and the arcs putting Y on the path are "head-to-head" (a collider). Known "common symptoms" of X and Z impose dependencies: X may "explain away" Z.
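
A sketch in Python of the full definition from slides 45-47 (the graph encoding and function names are my own, and it enumerates undirected paths rather than using the faster linear-time algorithm mentioned on the next slide): a path is blocked by an observed chain or fork node, or by an unobserved collider with no observed descendant, and X and Z are d-separated iff every path is blocked. The demo graph is the burglar-alarm network from slides 43-44.

```python
def descendants(graph, node):
    """All nodes reachable from `node` along directed edges (graph: parent -> children)."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def undirected_paths(graph, x, z):
    """All simple paths from x to z, ignoring edge direction."""
    nbrs = {}
    for parent, children in graph.items():
        for c in children:
            nbrs.setdefault(parent, set()).add(c)
            nbrs.setdefault(c, set()).add(parent)
    paths, stack = [], [[x]]
    while stack:
        path = stack.pop()
        for n in nbrs.get(path[-1], ()):
            if n == z:
                paths.append(path + [n])
            elif n not in path:
                stack.append(path + [n])
    return paths

def d_separated(graph, x, z, evidence):
    evidence = set(evidence)
    for path in undirected_paths(graph, x, z):
        blocked = False
        for a, v, b in zip(path, path[1:], path[2:]):   # each interior node with its neighbors
            collider = v in graph.get(a, []) and v in graph.get(b, [])   # a -> v <- b
            if collider:
                # Head-to-head: blocks unless v or one of its descendants is observed.
                if v not in evidence and not (descendants(graph, v) & evidence):
                    blocked = True
            elif v in evidence:
                # Chain or fork (tail-to-head or tail-to-tail): blocks if v is observed.
                blocked = True
        if not blocked:
            return False
    return True

# Burglar-alarm network: Burglar -> Alarm <- Earthquake, Alarm -> Call.
g = {"Burglar": ["Alarm"], "Earthquake": ["Alarm"], "Alarm": ["Call"]}
print(d_separated(g, "Burglar", "Earthquake", evidence=[]))        # True: marginally independent
print(d_separated(g, "Burglar", "Earthquake", evidence=["Call"]))  # False: explaining away
```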

Slide 48: d-separation to the rescue, cont'd. Theorem [Verma & Pearl, 1988]: if a set of evidence variables E d-separates X and Z in a Bayesian network's graph, then I<X, E, Z>. d-separation can be computed in linear time using a depth-first-search-like algorithm. Great! We now have a fast algorithm for automatically inferring whether learning the value of one variable might give us any additional hints about some other variable, given what we already know. "Might": variables may actually be independent even when they're not d-separated, depending on the actual probabilities involved.

