
Slide 1: CS B553: Algorithms for Optimization and Learning. Bayesian Networks

Slide 2: Agenda
- Bayesian networks
- Chain rule for Bayes nets
- Naïve Bayes models
- Independence declarations
- D-separation
- Probabilistic inference queries

Slide 3: Purposes of Bayesian Networks
- Efficient and intuitive modeling of complex causal interactions
- Compact representation of joint distributions: O(n) rather than O(2^n)
- Algorithms for efficient inference with given evidence (more on this next time)

Slide 4: Independence of Random Variables
Two random variables A and B are independent if P(A,B) = P(A) P(B), hence P(A|B) = P(A): knowing B doesn't give you any information about A. [This equality has to hold for all combinations of values that A and B can take on, i.e., all events A=a and B=b are independent.]

Slide 5: Significance of Independence
If A and B are independent, then P(A,B) = P(A) P(B), so the joint distribution over A and B can be defined as a product of the distribution of A and the distribution of B. We can therefore store two much smaller probability tables rather than one large table over all combinations of A and B.
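A tiny Python sketch of this saving (the variables and numbers are hypothetical, purely for illustration): store the two marginals and recompute any joint entry on demand instead of storing the joint table.

# Hypothetical independent variables A and B: store two small tables
# and recompute joint entries as products, instead of storing the joint.
P_A = {"sunny": 0.7, "rainy": 0.3}   # distribution of A
P_B = {"win": 0.1, "lose": 0.9}      # distribution of B

def joint_entry(a, b):
    """P(A=a, B=b), valid only under the independence assumption."""
    return P_A[a] * P_B[b]

print(joint_entry("sunny", "win"))   # 0.7 * 0.1 = 0.07
# 2 + 2 stored numbers instead of 4; with n independent binary
# variables, 2n numbers instead of 2^n.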

Slide 6: Conditional Independence
Two random variables A and B are conditionally independent given C if P(A,B|C) = P(A|C) P(B|C), hence P(A|B,C) = P(A|C): once you know C, learning B doesn't give you any information about A. [Again, this has to hold for all combinations of values that A, B, C can take on.]

Slide 7: Significance of Conditional Independence
Consider Grade(CS101), Intelligence, and SAT. Ostensibly, the grade in a course doesn't have a direct relationship with SAT scores, but good students are more likely to get good SAT scores, so the two are not independent. It is reasonable to believe, however, that Grade(CS101) and SAT are conditionally independent given Intelligence.

Slide 8: Bayesian Network
Explicitly represent independence among propositions. Network: Grade ← Intelligence → SAT. Notice that Intelligence is the "cause" of both Grade and SAT, and the causality is represented explicitly.

P(I,G,S) = P(G,S|I) P(I) = P(G|I) P(S|I) P(I)

P(I=x):
  high  0.3
  low   0.7

P(G=x|I):
        I=low  I=high
  'a'   0.20   0.74
  'b'   0.34   0.17
  'c'   0.46   0.09

P(S=x|I):
         I=low  I=high
  low    0.95   0.20
  high   0.05   0.80

7 independent parameters instead of 11 (1 for I, 2×2 for G|I, 1×2 for S|I; the full joint over 2×3×2 outcomes needs 11).

Slide 9: Definition: Bayesian Network
- Set of random variables X = {X1,…,Xn} with domains Val(X1),…,Val(Xn)
- Each node X has a set of parents Pa_X; the graph must be a DAG
- Each node also maintains a conditional probability distribution (often, a table) P(X | Pa_X): 2^k entries for a binary variable with k binary parents
- Overall: O(n·2^k) storage for binary variables
- Encodes the joint probability over X1,…,Xn
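As a concrete illustration, here is one possible Python encoding of slide 8's student network, with each node storing its parent list and CPT. This is a sketch of the definition above, not any particular library's API.

# The student network of slide 8 as plain dicts: node -> (parents, CPT).
# Each CPT maps a tuple of parent values to a distribution over the node.
bn = {
    "I": {"parents": [],    "cpt": {(): {"high": 0.3, "low": 0.7}}},
    "G": {"parents": ["I"], "cpt": {("low",):  {"a": 0.20, "b": 0.34, "c": 0.46},
                                    ("high",): {"a": 0.74, "b": 0.17, "c": 0.09}}},
    "S": {"parents": ["I"], "cpt": {("low",):  {"low": 0.95, "high": 0.05},
                                    ("high",): {"low": 0.20, "high": 0.80}}},
}

def p_given_parents(bn, var, value, assignment):
    """Look up P(var = value | parents) under a full assignment dict."""
    key = tuple(assignment[p] for p in bn[var]["parents"])
    return bn[var]["cpt"][key][value]

print(p_given_parents(bn, "G", "a", {"I": "high"}))   # 0.74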

Slide 10: Calculation of Joint Probability
Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls.

P(b) = 0.001    P(e) = 0.002

P(a|B,E):
  B  E  P(a|…)
  T  T  0.95
  T  F  0.94
  F  T  0.29
  F  F  0.001

P(j|A):           P(m|A):
  A=T  0.90         A=T  0.70
  A=F  0.05         A=F  0.01

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = ??

Slide 11: (Same network as slide 10.)
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j ∧ m | a, ¬b, ¬e) × P(a ∧ ¬b ∧ ¬e)
= P(j | a, ¬b, ¬e) × P(m | a, ¬b, ¬e) × P(a ∧ ¬b ∧ ¬e)   (J and M are independent given A)
P(j | a, ¬b, ¬e) = P(j | a)   (J and B, and J and E, are independent given A)
P(m | a, ¬b, ¬e) = P(m | a)
P(a ∧ ¬b ∧ ¬e) = P(a | ¬b, ¬e) × P(¬b | ¬e) × P(¬e) = P(a | ¬b, ¬e) × P(¬b) × P(¬e)   (B and E are independent)
Therefore P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)

Slide 12: Calculation of Joint Probability
(Same network and CPTs as slide 10.)
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00063

Slide 13: Calculation of Joint Probability
(Same network and CPTs as slide 10.)
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e) = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00063
In general: P(x1 ∧ x2 ∧ … ∧ xn) = ∏i=1,…,n P(xi | pa_Xi), the full joint distribution.
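A short Python sanity check of this computation (a sketch; the CPT dictionaries transcribe the tables of slide 10): the joint probability of a full assignment is the product of one CPT entry per variable.

# Alarm network CPTs from slide 10.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(j | A)
P_M = {True: 0.70, False: 0.01}                      # P(m | A)

def joint(b, e, a, j, m):
    """Chain rule: product of one CPT entry per variable."""
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return pb * pe * pa * pj * pm

print(joint(b=False, e=False, a=True, j=True, m=True))   # ~0.000628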

Slide 14: Chain Rule for Bayes Nets
The joint distribution is the product of all CPTs:
P(X1, X2, …, Xn) = ∏i=1,…,n P(Xi | Pa_Xi)

Slide 15: Example: Naïve Bayes Models
Structure: Cause → Effect1, Cause → Effect2, …, Cause → Effectn.
P(Cause, Effect1, …, Effectn) = P(Cause) ∏i P(Effecti | Cause)

Slide 16: Advantages of Bayes Nets (and Other Graphical Models)
- More manageable number of parameters to set and store
- Incremental modeling
- Explicit encoding of independence assumptions
- Efficient inference techniques

Slide 17: Arcs Do Not Necessarily Encode Causality
The chains A → B → C and C → B → A are two BNs with the same expressive power; a third structure over A, B, C has greater power (exercise: which one?).

Slide 18: Reading Off Independence Relationships
Consider the chain A → B → C. Given B, does the value of A affect the probability of C, i.e., is P(C|B,A) ≠ P(C|B)? No: C's parent (B) is given, so C is independent of its non-descendants (A). Independence is symmetric: C ⊥ A | B ⇒ A ⊥ C | B.

Slide 19: Basic Rule
A node is independent of its non-descendants given its parents (and given nothing else).

Slide 20: What Does the BN Encode?
- Burglary ⊥ Earthquake
- JohnCalls ⊥ MaryCalls | Alarm
- JohnCalls ⊥ Burglary | Alarm
- JohnCalls ⊥ Earthquake | Alarm
- MaryCalls ⊥ Burglary | Alarm
- MaryCalls ⊥ Earthquake | Alarm
A node is independent of its non-descendants, given its parents.

Slide 21: Reading Off Independence Relationships
How about Burglary ⊥ Earthquake | Alarm? No! Why?

Slide 22: Reading Off Independence Relationships
How about Burglary ⊥ Earthquake | Alarm? No! Why?
P(b ∧ e | a) = P(a|b,e) P(b ∧ e)/P(a) ≈ 0.00075
P(b|a) P(e|a) ≈ 0.086
The two differ, so Burglary and Earthquake are dependent given Alarm.
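These numbers can be reproduced by brute-force enumeration; a sketch reusing the joint() function from the slide 13 example:

from itertools import product

def prob(pred):
    """Sum the joint over all assignments satisfying a predicate."""
    return sum(joint(b, e, a, j, m)
               for b, e, a, j, m in product([True, False], repeat=5)
               if pred(b, e, a, j, m))

p_a    = prob(lambda b, e, a, j, m: a)
p_be_a = prob(lambda b, e, a, j, m: b and e and a) / p_a   # P(b,e|a)
p_b_a  = prob(lambda b, e, a, j, m: b and a) / p_a         # P(b|a)
p_e_a  = prob(lambda b, e, a, j, m: e and a) / p_a         # P(e|a)
print(p_be_a)           # ~0.00075
print(p_b_a * p_e_a)    # ~0.086, not equal: dependent given a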

Slide 23: Reading Off Independence Relationships
How about Burglary ⊥ Earthquake | JohnCalls? No! Why? Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent.

Slide 24: Independence Relationships
In a polytree, there exists a unique undirected path between A and B. For each intermediate node E on the path (see the sketch after this list):
- Evidence on a directed chain X → E → Y or X ← E ← Y makes X and Y independent.
- Evidence on a fork X ← E → Y makes its descendants X and Y independent.
- For a "v" node X → E ← Y, or X → W ← Y with a directed path W → … → E: evidence on E, or below the v, makes X and Y dependent (otherwise they are independent).
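A Python sketch of this test under the polytree assumption (the helper names and encodings here are hypothetical, not slide notation): classify each interior node of the unique path as a chain/fork or a "v" node and apply the three rules.

def path_active(path, edges, evidence, descendants):
    """True if the path carries dependence given the evidence.
    path: node list from X to Y; edges: set of directed (u, v) pairs;
    evidence: set of observed nodes; descendants[n]: set of n's descendants."""
    for i in range(1, len(path) - 1):
        prev, node, nxt = path[i - 1], path[i], path[i + 1]
        is_v = (prev, node) in edges and (nxt, node) in edges
        if is_v:
            # A "v" node passes dependence only if it or a descendant is observed.
            if node not in evidence and not (descendants[node] & evidence):
                return False
        elif node in evidence:
            return False          # chain or fork blocked by evidence
    return True

# The alarm network: B -> A <- E, A -> J, A -> M.
edges = {("B", "A"), ("E", "A"), ("A", "J"), ("A", "M")}
desc = {"B": {"A", "J", "M"}, "E": {"A", "J", "M"},
        "A": {"J", "M"}, "J": set(), "M": set()}
print(path_active(["B", "A", "E"], edges, set(), desc))    # False: B ⊥ E
print(path_active(["B", "A", "E"], edges, {"J"}, desc))    # True: dependent given J
print(path_active(["J", "A", "B"], edges, {"A"}, desc))    # False: J ⊥ B | A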

Slide 25: General Case
Formal property in the general case: d-separation: the above properties hold for all (acyclic) paths between A and B. D-separation captures exactly the independences implied by the graph: we can't read off any more independence relationships from the graph than those encoded in d-separation. The CPTs may indeed encode additional independences.

Slide 26: Probability Queries
Given: some probabilistic model over variables X.
Find: the distribution over Y ⊆ X given evidence E = e for some subset E ⊆ X \ Y:
P(Y | E = e)
This is the inference problem.

Slide 27: Answering Inference Problems with the Joint Distribution
Easiest case: Y = X \ E. Then
P(Y | E = e) = P(Y, e)/P(e)
where the denominator makes the probabilities sum to 1; determine P(e) by marginalizing: P(e) = Σy P(Y = y, e).
Otherwise, let W = X \ (E ∪ Y). Then
P(Y | E = e) = Σw P(Y, W = w, e)/P(e)
P(e) = Σy Σw P(Y = y, W = w, e)
Inference with the joint distribution costs O(2^|X \ E|) for binary variables.
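A minimal enumeration sketch for this, again reusing joint() from the slide 13 example: to answer P(Burglary | j, m) on the alarm network, sum out the hidden variables E and A, then normalize.

from itertools import product

def query_burglary(j_obs, m_obs):
    """P(B | J=j_obs, M=m_obs) by summing out the hidden E and A."""
    unnorm = {}
    for b in (True, False):
        unnorm[b] = sum(joint(b, e, a, j_obs, m_obs)
                        for e, a in product([True, False], repeat=2))
    z = sum(unnorm.values())         # P(evidence): makes the result sum to 1
    return {b: p / z for b, p in unnorm.items()}

print(query_burglary(True, True))   # ~{True: 0.284, False: 0.716}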

Slide 28: Naïve Bayes Classifier
Structure: Class → Feature1, Feature2, …, Featuren.
P(Class, Feature1, …, Featuren) = P(Class) ∏i P(Featurei | Class)
P(C | F1, …, Fn) = P(C, F1, …, Fn)/P(F1, …, Fn) = (1/Z) P(C) ∏i P(Fi | C)
Given the features, what is the class? Examples: Spam / Not Spam; English / French / Latin; the features are word occurrences.

Slide 29: Naïve Bayes Classifier
P(Class, Feature1, …, Featuren) = P(Class) ∏i P(Featurei | Class)
Given only some features, what is the distribution over the class? Marginalize out the unobserved features fk+1, …, fn:
P(C | F1, …, Fk) = (1/Z) P(C, F1, …, Fk)
= (1/Z) Σfk+1,…,fn P(C, F1, …, Fk, fk+1, …, fn)
= (1/Z) P(C) Σfk+1,…,fn ∏i=1…k P(Fi|C) ∏j=k+1…n P(fj|C)
= (1/Z) P(C) ∏i=1…k P(Fi|C) ∏j=k+1…n Σfj P(fj|C)
= (1/Z) P(C) ∏i=1…k P(Fi|C)
since each Σfj P(fj | C) = 1.
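A toy Python sketch of this classifier (the priors and word probabilities are made-up illustration values): only the observed words contribute factors, exactly as in the derivation above.

import math

prior = {"spam": 0.4, "ham": 0.6}
p_word = {  # P(word occurs | class), hypothetical values
    "spam": {"offer": 0.30, "meeting": 0.02, "free": 0.25},
    "ham":  {"offer": 0.03, "meeting": 0.20, "free": 0.02},
}

def posterior(observed_words):
    """P(Class | observed words) = (1/Z) P(C) * prod_i P(F_i | C)."""
    score = {}
    for c in prior:
        s = math.log(prior[c])            # log-space to avoid underflow
        for w in observed_words:
            s += math.log(p_word[c][w])
        score[c] = math.exp(s)
    z = sum(score.values())               # the 1/Z normalization
    return {c: v / z for c, v in score.items()}

print(posterior(["offer", "free"]))       # heavily favors "spam"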

Slide 30: For General Queries
For BNs and queries in general, it's not that simple; more in later lectures.
Next class: skim 5.1-3, begin reading 9.1-4.

