
1 10/15: Homework 3 due 10/17. Mid-term on 10/24. Project 2 accepted until 10/26 (4 days automatic extension).

2 Problem: we need too many numbers, and the needed numbers are harder to assess. You can avoid assessing P(E=e) if you assess P(Y|E=e) for all values of Y, since those must add up to 1.

3

4 Relative ease/utility of assessing various types of probabilities
The joint distribution requires us to assess probabilities of the type P(x1,~x2,x3,…,~xn). This means we have to look at all entities in the world and see what fraction of them have x1,~x2,x3,…,~xn true. A difficult experiment to set up.
Conditional probabilities of the type P(A|B) are relatively much easier to assess
–You just need to look at the set of entities having B true, and see what fraction of them also have A true
Among the conditional probabilities, causal probabilities of the form P(effect|cause) are easier to assess than diagnostic probabilities of the form P(cause|effect)
–Causal probabilities tend to be more stable than diagnostic probabilities
–For example, a dentistry textbook can publish P(TA|Cavity) and hope that it will hold in a variety of places. In contrast, P(Cavity|TA) may depend on other fortuitous factors: in areas where people eat a lot of ice cream, many toothaches may be prevalent, and few of them may actually be due to cavities.
Doc, Doc, I have flu. Can you tell if I have a runny nose?

5 Let A be Anthrax and Rn be Runny Nose.
P(A|Rn) = P(Rn|A) P(A) / P(Rn)
Get by with easier-to-assess numbers.
Generalized Bayes rule: P(A|B,e) = P(B|A,e) P(A|e) / P(B|e)
Think of this as analogous to inference rules (like modus ponens).
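As a quick illustration, here is the Bayes rule computation above in a few lines of Python; the specific numbers are invented for the example, not taken from the slides:

```python
# Minimal sketch of the Bayes rule computation; numbers are made up for illustration.
p_a = 0.0001          # prior P(A): anthrax is rare
p_rn_given_a = 0.9    # causal probability P(Rn|A)
p_rn = 0.05           # P(Rn): runny noses are common

p_a_given_rn = p_rn_given_a * p_a / p_rn
print(p_a_given_rn)   # 0.0018 -- still very unlikely despite the symptom
```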

6 Can we avoid assessing P(S)?
P(M|S) = P(S|M) P(M) / P(S)
P(~M|S) = P(S|~M) P(~M) / P(S)
Adding the two: 1 = 1/P(S) [ P(S|M) P(M) + P(S|~M) P(~M) ]
So, if we assess P(S|~M), then we don't need to assess P(S). "Normalization"
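A small sketch of this normalization trick for the meningitis example; the numbers here are hypothetical, not from the slides:

```python
# Hypothetical values for P(S|M), P(M), P(S|~M); chosen only for illustration.
p_s_given_m,  p_m  = 0.8, 0.0001
p_s_given_nm, p_nm = 0.05, 1 - p_m

# Unnormalized posteriors: P(M|S) and P(~M|S) up to the common factor 1/P(S)
alpha_m  = p_s_given_m  * p_m
alpha_nm = p_s_given_nm * p_nm

# Normalizing recovers P(S) implicitly: it is just alpha_m + alpha_nm
p_m_given_s = alpha_m / (alpha_m + alpha_nm)
print(p_m_given_s)   # ~0.0016, and we never assessed P(S) directly
```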

7 Is P(S|~M) any easier to assess than P(S)?
P(S|M) is clearly easy to assess (just look at the fraction of meningitis patients that have a stiff neck).
P(S) seems hard to assess: you need to ask random people whether they have a stiff neck or not.
P(S|~M) seems just as hard to assess…
–And in general there is no good argument that it is always easier to assess than P(S)
In fact they are related in a quite straightforward way
–P(S) = P(S|M)*P(M) + P(S|~M)*P(~M)
»(To see this, note that P(S) = P(S&M) + P(S&~M) and then use the product rule)
The real reason we assess P(S|~M) is that often we need the posterior distribution rather than just a single probability
–For boolean variables, you can get the distribution given one value
–But for multi-valued variables, we need to assess P(D=di|S) for all values di of the variable D. To do this, we need P(S|D=di)-type probabilities anyway…

8 What happens if there are multiple symptoms…?
The patient walked in and complained of a toothache. You assess P(Cavity|Toothache).
Now you probe the patient's mouth with that steel thingie, and it catches… How do we update our belief in Cavity?
P(Cavity|TA,Catch) = P(TA,Catch|Cavity) * P(Cavity) / P(TA,Catch) = α P(TA,Catch|Cavity) * P(Cavity)
Need to know P(TA,Catch|Cavity)! With n evidence variables, we will need 2^n probabilities!
Conditional independence to the rescue: suppose P(TA,Catch|Cavity) = P(TA|Cavity)*P(Catch|Cavity)
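Here is a small sketch of the update under that conditional independence assumption; the CPT numbers are invented for illustration:

```python
# Hypothetical CPT entries; not from the slides.
p_cavity = 0.2
p_ta_given_cavity,    p_ta_given_no    = 0.6, 0.1
p_catch_given_cavity, p_catch_given_no = 0.9, 0.2

# Assuming TA and Catch are conditionally independent given Cavity,
# P(TA,Catch|Cavity) factors into P(TA|Cavity)*P(Catch|Cavity).
num_cavity = p_ta_given_cavity * p_catch_given_cavity * p_cavity
num_no     = p_ta_given_no     * p_catch_given_no     * (1 - p_cavity)

p_cavity_given_both = num_cavity / (num_cavity + num_no)
print(p_cavity_given_both)   # ~0.87, up from the 0.2 prior
```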

9

10 Written as A||B

11

12

13 Generalized Bayes rule: P(A|B,e) = P(B|A,e) P(A|e) / P(B|e)

14 Directly using the Joint Distribution: takes O(2^n) time for most natural queries of type P(D|Evidence); needs O(2^n) probabilities as input; the probabilities are of type P(w_k), where w_k is a world.
Directly using Bayes rule: can take much less than O(2^n) time for most natural queries of type P(D|Evidence); still needs O(2^n) probabilities as input; the probabilities are of type P(X1..Xn|Y).
Using Bayes rule with Bayes nets: can take much less than O(2^n) time for most natural queries of type P(D|Evidence); can get by with anywhere between O(n) and O(2^n) probabilities depending on the conditional independences that hold; the probabilities are of type P(X1..Xn|Y).

15

16

17

18 Generalized Bayes rule: P(A|B,e) = P(B|A,e) P(A|e) / P(B|e)

19 Conditional Independence Assertions
We write X || Y | Z to say that the set of variables X is conditionally independent of the set of variables Y given evidence on the set of variables Z (where X, Y and Z are subsets of the set of all random variables in the domain model).
We saw that Bayes rule computations can exploit conditional independence assertions. Specifically, X || Y | Z implies:
P(X & Y | Z) = P(X|Z) * P(Y|Z)
P(X | Y, Z) = P(X|Z)
P(Y | X, Z) = P(Y|Z)
Idea: Why not write down all conditional independence assertions that hold in a domain?

20 Cond. Indep. Assertions (Contd)
Idea: Why not write down all conditional independence assertions (CIA) (X || Y | Z) that hold in a domain?
Problem: There can be exponentially many conditional independence assertions that hold in a domain (recall that X, Y and Z are all subsets of the domain variables).
Brilliant Idea: Maybe we should implicitly specify the CIA by writing down the "dependencies" between variables using a graphical model
–A Bayes Network is a way of doing just this. A Bayes net is a directed acyclic graph whose nodes are random variables, and the immediate dependencies between variables are represented by directed arcs
–The topology of a Bayes network shows the inter-variable dependencies. Given the topology, there is a way of checking whether any given conditional independence assertion holds in the network (the Bayes Ball algorithm and the D-Sep idea)

21

22 CIA implicit in Bayes Nets
So, what conditional independence assumptions are implicit in Bayes nets?
–Local Markov Assumption: A node N is independent of its non-descendants (including ancestors) given its immediate parents. (So if P are the immediate parents of N, and A is a subset of the ancestors and other non-descendants, then {N} || A | P.)
–(Equivalently) A node N is independent of all other nodes given its Markov blanket (its parents, its children, and its children's parents), as sketched below.
–Given this assumption, many other conditional independencies follow. For a full answer, we need to appeal to the D-Sep condition and/or Bayes Ball reachability
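A tiny sketch of computing a node's Markov blanket from the parent lists of a Bayes net; the dictionary encoding of the network here is just an illustrative choice, not anything from the slides:

```python
def markov_blanket(node, parents):
    """parents maps each node to the list of its immediate parents."""
    children = [n for n, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:                      # children's other parents
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket

# Pearl's alarm network: B, E -> A -> J, M
alarm_net = {'B': [], 'E': [], 'A': ['B', 'E'], 'J': ['A'], 'M': ['A']}
print(markov_blanket('A', alarm_net))   # {'B', 'E', 'J', 'M'}
print(markov_blanket('B', alarm_net))   # {'A', 'E'}  (E is a co-parent via A)
```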

23 Topological Semantics
Independence from non-descendants holds given just the parents.
Independence from every other node holds given the Markov blanket.
These two conditions are equivalent. Many other conditional independence assertions follow from these.

24

25 P(~B,~E,A,J,M) = P(~B) P(~E|~B) P(A|~E,~B) P(J|A,~E,~B) P(M|A,~B,~E,J)   (Chain Rule)
= P(~B) P(~E) P(A|~E,~B) P(J|A) P(M|A)   (Local Semantics: a node is independent of its non-descendants given its parents)
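A sketch of this factored computation in Python. The slide's own CPT figures are not in the transcript, so the numbers below are the usual Russell & Norvig alarm-network values, assumed rather than taken from the slide:

```python
# Standard AIMA alarm-network CPTs (assumed; the slide's own tables aren't in the transcript).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # keyed by (B, E)
P_J = {True: 0.90, False: 0.05}                       # keyed by A
P_M = {True: 0.70, False: 0.01}                       # keyed by A

# P(~B, ~E, A, J, M) using the factored (local-semantics) form
p = (1 - P_B) * (1 - P_E) * P_A[(False, False)] * P_J[True] * P_M[True]
print(p)   # ~0.00063
```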

26 D-sep (direction dependent separation)
X || Y | E if every undirected path from X to Y is blocked by E.
–A path is blocked if there is a node Z on the path such that:
1. Z is in E and Z has one arrow coming in and another going out
2. Z is in E and Z has both arrows going out
3. Neither Z nor any of its descendants are in E, and both path arrows lead into Z
Examples (alarm network): B||M|A holds; (J,M)||E|A holds; B||E holds; but B||E|A and B||E|M do not.
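A minimal path-enumeration sketch of the d-separation test above. It simply enumerates undirected paths and applies the three blocking rules, which is fine for small networks like the alarm example (real implementations use the more efficient Bayes Ball / reachability algorithm):

```python
def d_separated(x, y, evidence, parents):
    """True if x and y are d-separated given the evidence set.
    parents maps each node to the list of its immediate parents."""
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)
    adj = {n: set(parents[n]) | set(children[n]) for n in parents}

    def descendants(node):
        seen, stack = set(), [node]
        while stack:
            for c in children[stack.pop()]:
                if c not in seen:
                    seen.add(c)
                    stack.append(c)
        return seen

    def paths(current, visited):
        if current == y:
            yield visited
            return
        for nxt in adj[current]:
            if nxt not in visited:
                yield from paths(nxt, visited + [nxt])

    def blocked(path):
        for a, z, b in zip(path, path[1:], path[2:]):
            collider = a in parents[z] and b in parents[z]
            if collider:
                if z not in evidence and not (descendants(z) & set(evidence)):
                    return True          # rule 3: collider with no evidence at or below it
            elif z in evidence:
                return True              # rules 1 and 2: observed chain or fork node
        return False

    return all(blocked(p) for p in paths(x, [x]))

alarm = {'B': [], 'E': [], 'A': ['B', 'E'], 'J': ['A'], 'M': ['A']}
print(d_separated('B', 'M', {'A'}, alarm))   # True:  B || M | A
print(d_separated('B', 'E', set(),  alarm))  # True:  B || E
print(d_separated('B', 'E', {'A'},  alarm))  # False: explaining away
print(d_separated('B', 'E', {'M'},  alarm))  # False: M is a descendant of the collider A
```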

27 Bayes Nets are not sufficient to model all sets of CIAs
We said that a Bayes net implicitly represents a bunch of CIAs.
Qn. If I tell you exactly which CIAs hold in a domain, can you give me a Bayes net that models exactly those and only those CIAs?
–Unfortunately, NO. (See the example to the right: give a Bayes net on X, Y, Z, W such that X||Y|(Z,W) and Z||W|(X,Y), and no other conditional independencies hold. Impossible!)
–This is why there is another type of graphical model called "undirected graphical models". In an undirected graphical model, also called a Markov random field, nodes correspond to random variables, and the immediate dependencies between variables are represented by undirected edges.
–The CIAs modeled by an undirected graphical model are different
»X || Y | Z in an undirected graph if every path from a node in X to a node in Y must pass through a node in Z (so if we remove the nodes in Z, then X and Y become disconnected)
–Undirected models are good for representing "soft constraints" between random variables (e.g., the correlation between different pixels in an image), while directed models are good for representing causal influences between variables
[Figure: the four-node example on X, Y, W, Z]

28 10/17

29 P(~B,~E,A,J,M) = P(~B) P(~E|~B) P(A|~E,~B) P(J|A,~E,~B) P(M|A,~B,~E,J)   (Chain Rule)
= P(~B) P(~E) P(A|~E,~B) P(J|A) P(M|A)   (Local Semantics: a node is independent of its non-descendants given its parents)

30 D-sep (direction dependent separation)
X || Y | E if every undirected path from X to Y is blocked by E.
–A path is blocked if there is a node Z on the path such that:
1. Z is in E and Z has one arrow coming in and another going out
2. Z is in E and Z has both arrows going out
3. Neither Z nor any of its descendants are in E, and both path arrows lead into Z
Examples (alarm network): B||M|A holds; (J,M)||E|A holds; B||E holds; but B||E|A and B||E|M do not.

31 Causal Chains, Common Causes, Common Effects: another way of understanding D-Sep
Causal chain: X causes Y through Z; the path is blocked if Z is given.
Common cause: X and Y are both caused by Z; the path is blocked if Z is given.
Common effect: X and Y both cause Z; the path is blocked only if neither Z nor its descendants are given.

32 Causality
Logic (both deterministic and probabilistic) doesn't by default encapsulate causality
–A logical dependency A => B is directionless (it is basically equivalent to ~A V B)
–A probabilistic dependency A || B | C is also directionless. It only states how A and B are correlated
–Causality, on the other hand, has direction
–If the logical/probabilistic model ignores this, it can reach logically valid and yet silly conclusions
E.g., If the grass is wet, then it rained. If you break this bottle, then the grass gets wet. Ergo: If you break this bottle, then it will rain.
E.g., The probability of there being two bombs on a plane is exceedingly low. So, to increase the safety of the flight, I will take a bomb along with me.
E.g. (?), Children whose homes have a lot of books do well in school. So go ahead and put a whole bunch of books in each home. (The state of Illinois actually had a program to send books every month to all kids.)
–Learning "causal models" from data is thus harder than learning "correlation models"
–Which is why it is so very great when causal models are available for the asking, through domain experts. See the IJCAI 99 Research Excellence lecture by Judea Pearl.

33 Blog Questions
You have been given the topology of a Bayes network, but haven't yet gotten the conditional probability tables (to be concrete, you may think of the Pearl alarm/earthquake scenario Bayes net). Your friend shows up and says he has the joint distribution all ready for you. You don't quite trust your friend and think he is making these numbers up. Is there any way you can prove that your friend's joint distribution is not correct?
Answer:
–Check whether the joint distribution given by your friend satisfies all the conditional independence assumptions.
–For example, in the Pearl network, compute P(J|A,M,B) and P(J|A). These two numbers should come out the same! (A sketch of this check follows below.)
Notice that your friend could pass all the conditional independence assertions and still be cheating regarding the probabilities
–For example, he filled up the CPTs of the network with made-up numbers (e.g., P(B)=0.9, P(E)=0.7, etc.) and computed the joint probability by multiplying the CPTs. This will satisfy all the conditional independence assertions..!
–The main point to understand here is that the network topology does put restrictions on the joint distribution.
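A minimal sketch of that check, assuming the friend hands us the joint as a table mapping full assignments to probabilities; the table layout and variable order are illustrative, and the friend's actual numbers are of course not in the slides:

```python
def conditional(joint, order, target, given):
    """P(target is True | given) computed from a full joint table.
    joint maps tuples of booleans (in the variable order `order`) to probabilities."""
    num = den = 0.0
    for assignment, p in joint.items():
        a = dict(zip(order, assignment))
        if all(a[v] == val for v, val in given.items()):
            den += p
            if a[target]:
                num += p
    return num / den

# Suppose `joint` is the friend's table over (B, E, A, J, M).
# If it really respects the alarm-network topology, these two must agree:
#   conditional(joint, 'BEAJM', 'J', {'A': True, 'M': True, 'B': True})
#   conditional(joint, 'BEAJM', 'J', {'A': True})
# A mismatch proves the joint violates J || (B, M) | A.
```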

34 Blog Questions (2)
Continuing bad friends: in the question above, suppose a second friend comes along and says that he can give you the conditional probabilities that you want to complete the specification of your Bayes net. You ask him for a CPT entry, and pat comes a response: some number between 0 and 1. This friend is well meaning, but you are worried that the numbers he is giving may lead to some sort of inconsistent joint probability distribution. Is your worry justified (i.e., can your friend give you numbers that lead to an inconsistency)?
(To understand "inconsistency", consider someone who insists on giving you P(A), P(B), P(A&B) as well as P(AVB), and the numbers wind up not satisfying P(AVB) = P(A) + P(B) - P(A&B); or alternately, they insist on giving you P(A|B), P(B|A), P(A) and P(B), and the four numbers don't satisfy Bayes rule.)
Answer: No. As long as we only ask the friend to fill up the CPTs in the Bayes network, there is no way the numbers won't make up a consistent joint probability distribution
–This should be seen as a feature..
Personal Probabilities
–John may be an optimist and believe that P(burglary)=0.01, and Tom may be a pessimist and believe that P(burglary)=0.99
–Bayesians consider both John and Tom to be fine (they don't insist on an objective frequentist interpretation for probabilities)
–However, Bayesians do think that John and Tom should act consistently with their own beliefs. For example, it makes no sense for John to go about installing tons of burglar alarms given his belief, just as it makes no sense for Tom to put all his valuables on his lawn.

35 Blog Questions (3)
Your friend heard your claims that Bayes nets can represent any possible set of conditional independence assertions exactly. He comes to you and says he has four random variables, X, Y, W and Z, and only TWO conditional independence assertions:
X .ind. Y | {W,Z}
W .ind. Z | {X,Y}
He dares you to give him a Bayes network topology on these four nodes that represents exactly these and only these conditional independencies. Can you? (Note that you only need to look at 4-vertex directed graphs.)
Answer: No, this is not possible. Here are two "wrong" answers:
–Consider a disconnected graph where X, Y, W, Z are all unconnected. In this graph, the two CIAs hold. However, unfortunately, so do many other CIAs.
–Consider a graph where W and Z are both immediate parents of X and Y. In this case, clearly, X .ind. Y | {W,Z}. However, W and Z are definitely dependent given X and Y (explaining away).
Undirected models can capture these CIAs exactly. Consider a graph where X is connected to W and Z, and Y is connected to W and Z (sort of a diamond).
–In undirected models, CIA is defined in terms of graph separability.
–Since X and Y separate W and Z (i.e., every path between W and Z must pass through X or Y), W .ind. Z | {X,Y}. Similarly for the other CIA. (A small separation check is sketched below.)
Undirected graphs will be unable to model some scenarios that directed ones can, so you need both… There are also distributions that neither undirected nor directed models can perfectly map (see the picture on the slide).
Given a graphical model G and a distribution D:
–G is an I-map of D if every CIA reflected in G actually holds in D ["soundness"]
–G is a D-map of D if every CIA of D is reflected in G ["completeness"]
–G is a perfect map of D if it is both an I-map and a D-map
If we can't have a perfect map, we can consider giving up either the I-map or the D-map property
–Giving up I-map-ness leads to loss of accuracy (since your model assumes CIAs that don't hold in the distribution). It can however increase efficiency (e.g., naive Bayes models).
–Giving up D-map-ness leads to loss of efficiency but preserves accuracy (if you think more things are connected than really are, you will assess more probabilities, and some of them wind up being redundant anyway because of the CIAs that hold in the distribution).
[Figure: Venn diagram of all distributions, those with a perfect Bayes-net map and those with a perfect Markov-net map; an MN that a BN can't represent, and a BN that an MN can't represent]
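A quick sketch of the undirected separation test on that diamond: remove the conditioning nodes and check whether the two query nodes are still connected. The adjacency encoding is just an illustrative choice:

```python
def separated(x, y, z, edges):
    """True if every path from x to y in the undirected graph passes through z,
    i.e. x and y are disconnected once the nodes in z are removed."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    frontier, seen = [x], {x}
    while frontier:
        node = frontier.pop()
        for nxt in adj.get(node, ()):
            if nxt == y:
                return False          # found a path that avoids z
            if nxt not in seen and nxt not in z:
                seen.add(nxt)
                frontier.append(nxt)
    return True

diamond = [('X', 'W'), ('X', 'Z'), ('Y', 'W'), ('Y', 'Z')]
print(separated('W', 'Z', {'X', 'Y'}, diamond))   # True:  W .ind. Z | {X,Y}
print(separated('X', 'Y', {'W', 'Z'}, diamond))   # True:  X .ind. Y | {W,Z}
print(separated('W', 'Z', {'X'},      diamond))   # False: the path W-Y-Z is still open
```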

36

37 [Figure: Burglary/Earthquake/Alarm network built in a non-causal order; annotation asks whether P(A|J,M) = P(A)]
How many probabilities are needed? 13 for the new ordering; 10 for the old. Is this the worst? Introduce variables in the causal order.

38 Digression: Is assessing/learning numbers the only hard model-learning problem?
We are making it sound as if assessing the probabilities is a big deal. In doing so, we are taking into account model acquisition/learning costs.
How come we didn't care about these issues in logical reasoning? Is it because acquiring logical knowledge is easy?
Actually, if we are writing programs for worlds that we (the humans) already live in, it is easy for us (humans) to add the logical knowledge into the program. It is a pain to give the probabilities.
On the other hand, if the agent is fully autonomous and is bootstrapping itself, then learning logical knowledge is actually harder than learning probabilities.
–For example, we will see that given the Bayes network topology ("logic"), learning its CPTs is much easier than learning both the topology and the CPTs.

39 Making the network sparse by introducing intermediate variables
Consider a network of boolean variables where n parent nodes are connected to m children nodes (with each parent influencing each child).
–You will need n + m*2^n conditional probabilities.
Suppose you realize that what is really influencing the child nodes is some single aggregate function of the parents' values (e.g., the sum of the parents).
–We can introduce a single intermediate node called "sum" which has links from all the n parent nodes and separately influences each of the m child nodes. You will wind up needing only n + 2^n + 2m conditional probabilities to specify this new network! (See the small count comparison below.)
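A tiny sanity check of those counts; the example sizes are arbitrary:

```python
def params_direct(n, m):
    # n boolean parents each need 1 prior; each of the m children has n parents,
    # so its CPT has 2**n rows
    return n + m * 2**n

def params_with_aggregate(n, m):
    # n priors, one intermediate node with a 2**n-row CPT,
    # and each child now conditions only on the (boolean) intermediate node
    return n + 2**n + 2 * m

print(params_direct(5, 10), params_with_aggregate(5, 10))   # 325 vs 57
```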

40

41

42

43 We only consider the failure-to-cause probabilities of the causes that hold: the product ∏_{i=j+1..k} r_i, where r_i is labeled on the slide as the probability that X holds even though the i-th parent doesn't.
How about a Noisy AND? (Hint: A&B <=> ~(~A V ~B).)
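The slide's exact parameterization isn't fully recoverable from the transcript, so here is a sketch of the standard noisy-OR formulation, where each cause has an independent probability of failing to produce the effect:

```python
def noisy_or(q, parent_values):
    """Standard noisy-OR CPT entry (assumed formulation, not copied from the slide).
    q[i] is the probability that cause i alone fails to produce the effect.
    parent_values[i] is True if cause i is present."""
    prob_all_fail = 1.0
    for qi, present in zip(q, parent_values):
        if present:
            prob_all_fail *= qi
    return 1.0 - prob_all_fail   # P(effect | these causes)

# Three causes with failure probabilities 0.6, 0.2, 0.1; only the first two present:
print(noisy_or([0.6, 0.2, 0.1], [True, True, False]))   # 1 - 0.6*0.2 = 0.88
```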

44

45 Constructing Belief Networks: Summary
Decide on what sorts of queries you are interested in answering
–This in turn dictates what factors to model in the network
Decide on a vocabulary of the variables and their domains for the problem
–Introduce "hidden" variables into the network as needed to make the network "sparse"
Decide on an order of introduction of variables into the network
–Introducing variables in the causal direction leads to fewer connections (sparse structure) AND easier-to-assess probabilities
Try to use canonical distributions to specify the CPTs
–Noisy-OR
–Parameterized discrete/continuous distributions such as Poisson, Normal (Gaussian), etc.

46 Case Study: Pathfinder System
Domain: lymph node diseases
–Deals with 60 diseases and 100 disease findings
Versions:
–Pathfinder I: a rule-based system with logical reasoning
–Pathfinder II: tried a variety of approaches for uncertainty; simple Bayes reasoning outperformed the others
–Pathfinder III: simple Bayes reasoning, but with reassessed probabilities
–Pathfinder IV: a Bayesian network was used to handle a variety of conditional dependencies
Deciding the vocabulary: 8 hours. Devising the topology of the network: 35 hours. Assessing the (14,000) probabilities: 40 hours
–Physician experts liked assessing causal probabilities
Evaluation: 53 "referral" cases
–Pathfinder III: 7.9/10
–Pathfinder IV: 8.9/10 [Saves one additional life in every 1000 cases!]
–A more recent comparison shows that Pathfinder now outperforms the experts who helped design it!!

47

48

49

50

51

52

53 What conditional independencies exactly are implicit in a Bayes net?

54 D-sep (direction dependent separation) (for those who don't like Bayes Balls)
X || Y | E if every undirected path from X to Y is blocked by E.
–A path is blocked if there is a node Z on the path such that:
1. Z is in E and Z has one arrow coming in and another going out
2. Z is in E and Z has both arrows going out
3. Neither Z nor any of its descendants are in E, and both path arrows lead into Z

55 Bayes Ball Alg:
–Shade the evidence nodes
–Put one ball each at the X nodes
–See if any of them can find their way to any of the Y nodes

56

57 Bayes Ball Alg:
–Shade the evidence nodes
–Put one ball each at the X nodes
–See if any of them can find their way to any of the Y nodes

58

59

60

61

62

63

64 Blog Comments.. arcarc said... I'd just like to say that this project is awesome. And not solely because it doesn't involve LISP, either! “Can’t leave well enough alone” rejoinder

65 Directly using the Joint Distribution: takes O(2^n) time for most natural queries of type P(D|Evidence); needs O(2^n) probabilities as input; the probabilities are of type P(w_k), where w_k is a world.
Directly using Bayes rule: can take much less than O(2^n) time for most natural queries of type P(D|Evidence); still needs O(2^n) probabilities as input; the probabilities are of type P(X1..Xn|Y).
Using Bayes rule with Bayes nets: can take much less than O(2^n) time for most natural queries of type P(D|Evidence); can get by with anywhere between O(n) and O(2^n) probabilities depending on the conditional independences that hold; the probabilities are of type P(X1..Xn|Y).

