
Slide 1: CS B553: Algorithms for Optimization and Learning. Bayesian Networks

Slide 2: Agenda
- Bayesian networks
- Chain rule for Bayes nets
- Naïve Bayes models
- Independence declarations
- D-separation
- Probabilistic inference queries

Slide 3: Purposes of Bayesian Networks
- Efficient and intuitive modeling of complex causal interactions
- Compact representation of joint distributions: O(n) rather than O(2^n)
- Algorithms for efficient inference with given evidence (more on this next time)

Slide 4: Independence of Random Variables
Two random variables A and B are independent if P(A,B) = P(A) P(B), hence P(A|B) = P(A): knowing B doesn't give you any information about A. [This equality has to hold for all combinations of values that A and B can take on, i.e., all events A=a and B=b are independent.]

Slide 5: Significance of Independence
If A and B are independent, then P(A,B) = P(A) P(B), so the joint distribution over A and B can be defined as a product of the distribution of A and the distribution of B. We can therefore store two much smaller probability tables rather than one large table over all combinations of A and B.
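A tiny Python sketch of this saving (the variables and numbers are hypothetical, purely for illustration): store the two marginals and recompute any joint entry on demand instead of storing the joint table.

# Hypothetical independent variables A and B: store two small tables
# and recompute joint entries as products, instead of storing the joint.
P_A = {"sunny": 0.7, "rainy": 0.3}   # distribution of A
P_B = {"win": 0.1, "lose": 0.9}      # distribution of B

def joint_entry(a, b):
    """P(A=a, B=b), valid only under the independence assumption."""
    return P_A[a] * P_B[b]

print(joint_entry("sunny", "win"))   # 0.7 * 0.1 = 0.07
# 2 + 2 stored numbers instead of 4; with n independent binary
# variables, 2n numbers instead of 2^n.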

Slide 6: Conditional Independence
Two random variables A and B are conditionally independent given C if P(A,B|C) = P(A|C) P(B|C), hence P(A|B,C) = P(A|C): once you know C, learning B doesn't give you any information about A. [Again, this has to hold for all combinations of values that A, B, C can take on.]

Slide 7: Significance of Conditional Independence
Consider Grade(CS101), Intelligence, and SAT. Ostensibly, the grade in a course doesn't have a direct relationship with SAT scores, but good students are more likely to get good SAT scores, so the two are not independent. It is reasonable to believe, however, that Grade(CS101) and SAT are conditionally independent given Intelligence.

Slide 8: Bayesian Network
Explicitly represent independence among propositions. Network: Grade ← Intelligence → SAT. Notice that Intelligence is the "cause" of both Grade and SAT, and the causality is represented explicitly.

P(I,G,S) = P(G,S|I) P(I) = P(G|I) P(S|I) P(I)

P(I=x):
  high  0.3
  low   0.7

P(G=x|I):
        I=low  I=high
  'a'   0.20   0.74
  'b'   0.34   0.17
  'c'   0.46   0.09

P(S=x|I):
         I=low  I=high
  low    0.95   0.20
  high   0.05   0.80

7 independent parameters instead of 11 (1 for I, 2×2 for G|I, 1×2 for S|I; the full joint over 2×3×2 outcomes needs 11).

Slide 9: Definition: Bayesian Network
- Set of random variables X = {X1,…,Xn} with domains Val(X1),…,Val(Xn)
- Each node X has a set of parents Pa_X; the graph must be a DAG
- Each node also maintains a conditional probability distribution (often, a table) P(X | Pa_X): 2^k entries for a binary variable with k binary parents
- Overall: O(n·2^k) storage for binary variables
- Encodes the joint probability over X1,…,Xn
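As a concrete illustration, here is one possible Python encoding of slide 8's student network, with each node storing its parent list and CPT. This is a sketch of the definition above, not any particular library's API.

# The student network of slide 8 as plain dicts: node -> (parents, CPT).
# Each CPT maps a tuple of parent values to a distribution over the node.
bn = {
    "I": {"parents": [],    "cpt": {(): {"high": 0.3, "low": 0.7}}},
    "G": {"parents": ["I"], "cpt": {("low",):  {"a": 0.20, "b": 0.34, "c": 0.46},
                                    ("high",): {"a": 0.74, "b": 0.17, "c": 0.09}}},
    "S": {"parents": ["I"], "cpt": {("low",):  {"low": 0.95, "high": 0.05},
                                    ("high",): {"low": 0.20, "high": 0.80}}},
}

def p_given_parents(bn, var, value, assignment):
    """Look up P(var = value | parents) under a full assignment dict."""
    key = tuple(assignment[p] for p in bn[var]["parents"])
    return bn[var]["cpt"][key][value]

print(p_given_parents(bn, "G", "a", {"I": "high"}))   # 0.74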

Slide 10: Calculation of Joint Probability
Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls.

P(b) = 0.001    P(e) = 0.002

P(a|B,E):
  B  E  P(a|…)
  T  T  0.95
  T  F  0.94
  F  T  0.29
  F  F  0.001

P(j|A):           P(m|A):
  A=T  0.90         A=T  0.70
  A=F  0.05         A=F  0.01

P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = ??

Slide 11: (Same network as slide 10.)
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j ∧ m | a, ¬b, ¬e) × P(a ∧ ¬b ∧ ¬e)
= P(j | a, ¬b, ¬e) × P(m | a, ¬b, ¬e) × P(a ∧ ¬b ∧ ¬e)   (J and M are independent given A)
P(j | a, ¬b, ¬e) = P(j | a)   (J and B, and J and E, are independent given A)
P(m | a, ¬b, ¬e) = P(m | a)
P(a ∧ ¬b ∧ ¬e) = P(a | ¬b, ¬e) × P(¬b | ¬e) × P(¬e) = P(a | ¬b, ¬e) × P(¬b) × P(¬e)   (B and E are independent)
Therefore P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)

Slide 12: Calculation of Joint Probability
(Same network and CPTs as slide 10.)
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00063

Slide 13: Calculation of Joint Probability
(Same network and CPTs as slide 10.)
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e) = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00063
In general: P(x1 ∧ x2 ∧ … ∧ xn) = ∏i=1,…,n P(xi | pa_Xi), the full joint distribution.
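A short Python sanity check of this computation (a sketch; the CPT dictionaries transcribe the tables of slide 10): the joint probability of a full assignment is the product of one CPT entry per variable.

# Alarm network CPTs from slide 10.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(j | A)
P_M = {True: 0.70, False: 0.01}                      # P(m | A)

def joint(b, e, a, j, m):
    """Chain rule: product of one CPT entry per variable."""
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return pb * pe * pa * pj * pm

print(joint(b=False, e=False, a=True, j=True, m=True))   # ~0.000628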

Slide 14: Chain Rule for Bayes Nets
The joint distribution is the product of all CPTs:
P(X1, X2, …, Xn) = ∏i=1,…,n P(Xi | Pa_Xi)

Slide 15: Example: Naïve Bayes Models
Structure: Cause → Effect1, Cause → Effect2, …, Cause → Effectn.
P(Cause, Effect1, …, Effectn) = P(Cause) ∏i P(Effecti | Cause)

Slide 16: Advantages of Bayes Nets (and Other Graphical Models)
- More manageable number of parameters to set and store
- Incremental modeling
- Explicit encoding of independence assumptions
- Efficient inference techniques

Slide 17: Arcs Do Not Necessarily Encode Causality
The chains A → B → C and C → B → A are two BNs with the same expressive power; a third structure over A, B, C has greater power (exercise: which one?).

Slide 18: Reading Off Independence Relationships
Consider the chain A → B → C. Given B, does the value of A affect the probability of C, i.e., is P(C|B,A) ≠ P(C|B)? No: C's parent (B) is given, so C is independent of its non-descendants (A). Independence is symmetric: C ⊥ A | B ⇒ A ⊥ C | B.

Slide 19: Basic Rule
A node is independent of its non-descendants given its parents (and given nothing else).

Slide 20: What Does the BN Encode?
- Burglary ⊥ Earthquake
- JohnCalls ⊥ MaryCalls | Alarm
- JohnCalls ⊥ Burglary | Alarm
- JohnCalls ⊥ Earthquake | Alarm
- MaryCalls ⊥ Burglary | Alarm
- MaryCalls ⊥ Earthquake | Alarm
A node is independent of its non-descendants, given its parents.

Slide 21: Reading Off Independence Relationships
How about Burglary ⊥ Earthquake | Alarm? No! Why?

Slide 22: Reading Off Independence Relationships
How about Burglary ⊥ Earthquake | Alarm? No! Why?
P(b ∧ e | a) = P(a|b,e) P(b ∧ e)/P(a) ≈ 0.00075
P(b|a) P(e|a) ≈ 0.086
The two differ, so Burglary and Earthquake are dependent given Alarm.
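These numbers can be reproduced by brute-force enumeration; a sketch reusing the joint() function from the slide 13 example:

from itertools import product

def prob(pred):
    """Sum the joint over all assignments satisfying a predicate."""
    return sum(joint(b, e, a, j, m)
               for b, e, a, j, m in product([True, False], repeat=5)
               if pred(b, e, a, j, m))

p_a    = prob(lambda b, e, a, j, m: a)
p_be_a = prob(lambda b, e, a, j, m: b and e and a) / p_a   # P(b,e|a)
p_b_a  = prob(lambda b, e, a, j, m: b and a) / p_a         # P(b|a)
p_e_a  = prob(lambda b, e, a, j, m: e and a) / p_a         # P(e|a)
print(p_be_a)           # ~0.00075
print(p_b_a * p_e_a)    # ~0.086, not equal: dependent given a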

Slide 23: Reading Off Independence Relationships
How about Burglary ⊥ Earthquake | JohnCalls? No! Why? Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent.

Slide 24: Independence Relationships
In a polytree, there exists a unique undirected path between A and B. For each intermediate node E on the path (see the sketch after this list):
- Evidence on a directed chain X → E → Y or X ← E ← Y makes X and Y independent.
- Evidence on a fork X ← E → Y makes its descendants X and Y independent.
- For a "v" node X → E ← Y, or X → W ← Y with a directed path W → … → E: evidence on E, or below the v, makes X and Y dependent (otherwise they are independent).
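A Python sketch of this test under the polytree assumption (the helper names and encodings here are hypothetical, not slide notation): classify each interior node of the unique path as a chain/fork or a "v" node and apply the three rules.

def path_active(path, edges, evidence, descendants):
    """True if the path carries dependence given the evidence.
    path: node list from X to Y; edges: set of directed (u, v) pairs;
    evidence: set of observed nodes; descendants[n]: set of n's descendants."""
    for i in range(1, len(path) - 1):
        prev, node, nxt = path[i - 1], path[i], path[i + 1]
        is_v = (prev, node) in edges and (nxt, node) in edges
        if is_v:
            # A "v" node passes dependence only if it or a descendant is observed.
            if node not in evidence and not (descendants[node] & evidence):
                return False
        elif node in evidence:
            return False          # chain or fork blocked by evidence
    return True

# The alarm network: B -> A <- E, A -> J, A -> M.
edges = {("B", "A"), ("E", "A"), ("A", "J"), ("A", "M")}
desc = {"B": {"A", "J", "M"}, "E": {"A", "J", "M"},
        "A": {"J", "M"}, "J": set(), "M": set()}
print(path_active(["B", "A", "E"], edges, set(), desc))    # False: B ⊥ E
print(path_active(["B", "A", "E"], edges, {"J"}, desc))    # True: dependent given J
print(path_active(["J", "A", "B"], edges, {"A"}, desc))    # False: J ⊥ B | A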

Slide 25: General Case
Formal property in the general case: d-separation: the above properties hold for all (acyclic) paths between A and B. D-separation captures exactly the independences implied by the graph: we can't read off any more independence relationships from the graph than those encoded in d-separation. The CPTs may indeed encode additional independences.

Slide 26: Probability Queries
Given: some probabilistic model over variables X.
Find: the distribution over Y ⊆ X given evidence E = e for some subset E ⊆ X \ Y:
P(Y | E = e)
This is the inference problem.

Slide 27: Answering Inference Problems with the Joint Distribution
Easiest case: Y = X \ E. Then
P(Y | E = e) = P(Y, e)/P(e)
where the denominator makes the probabilities sum to 1; determine P(e) by marginalizing: P(e) = Σy P(Y = y, e).
Otherwise, let W = X \ (E ∪ Y). Then
P(Y | E = e) = Σw P(Y, W = w, e)/P(e)
P(e) = Σy Σw P(Y = y, W = w, e)
Inference with the joint distribution costs O(2^|X \ E|) for binary variables.
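A minimal enumeration sketch for this, again reusing joint() from the slide 13 example: to answer P(Burglary | j, m) on the alarm network, sum out the hidden variables E and A, then normalize.

from itertools import product

def query_burglary(j_obs, m_obs):
    """P(B | J=j_obs, M=m_obs) by summing out the hidden E and A."""
    unnorm = {}
    for b in (True, False):
        unnorm[b] = sum(joint(b, e, a, j_obs, m_obs)
                        for e, a in product([True, False], repeat=2))
    z = sum(unnorm.values())         # P(evidence): makes the result sum to 1
    return {b: p / z for b, p in unnorm.items()}

print(query_burglary(True, True))   # ~{True: 0.284, False: 0.716}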

Slide 28: Naïve Bayes Classifier
Structure: Class → Feature1, Feature2, …, Featuren.
P(Class, Feature1, …, Featuren) = P(Class) ∏i P(Featurei | Class)
P(C | F1, …, Fn) = P(C, F1, …, Fn)/P(F1, …, Fn) = (1/Z) P(C) ∏i P(Fi | C)
Given the features, what is the class? Examples: Spam / Not Spam; English / French / Latin; the features are word occurrences.

Slide 29: Naïve Bayes Classifier
P(Class, Feature1, …, Featuren) = P(Class) ∏i P(Featurei | Class)
Given only some features, what is the distribution over the class? Marginalize out the unobserved features fk+1, …, fn:
P(C | F1, …, Fk) = (1/Z) P(C, F1, …, Fk)
= (1/Z) Σfk+1,…,fn P(C, F1, …, Fk, fk+1, …, fn)
= (1/Z) P(C) Σfk+1,…,fn ∏i=1…k P(Fi|C) ∏j=k+1…n P(fj|C)
= (1/Z) P(C) ∏i=1…k P(Fi|C) ∏j=k+1…n Σfj P(fj|C)
= (1/Z) P(C) ∏i=1…k P(Fi|C)
since each Σfj P(fj | C) = 1.
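A toy Python sketch of this classifier (the priors and word probabilities are made-up illustration values): only the observed words contribute factors, exactly as in the derivation above.

import math

prior = {"spam": 0.4, "ham": 0.6}
p_word = {  # P(word occurs | class), hypothetical values
    "spam": {"offer": 0.30, "meeting": 0.02, "free": 0.25},
    "ham":  {"offer": 0.03, "meeting": 0.20, "free": 0.02},
}

def posterior(observed_words):
    """P(Class | observed words) = (1/Z) P(C) * prod_i P(F_i | C)."""
    score = {}
    for c in prior:
        s = math.log(prior[c])            # log-space to avoid underflow
        for w in observed_words:
            s += math.log(p_word[c][w])
        score[c] = math.exp(s)
    z = sum(score.values())               # the 1/Z normalization
    return {c: v / z for c, v in score.items()}

print(posterior(["offer", "free"]))       # heavily favors "spam"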

Slide 30: For General Queries
For BNs and queries in general, it's not that simple; more in later lectures.
Next class: skim 5.1-3, begin reading 9.1-4.

