
1 CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING — Bayesian Networks

2 AGENDA
Probabilistic inference queries
Top-down inference
Variable elimination

3 PROBABILITY QUERIES
Given: some probabilistic model over variables X.
Find: the distribution over a subset Y ⊆ X given evidence E = e for some subset E ⊆ X \ Y, i.e., P(Y | E = e).
This is the inference problem.

4 ANSWERING INFERENCE PROBLEMS WITH THE JOINT DISTRIBUTION
Easiest case: Y = X \ E.
P(Y | E = e) = P(Y, e) / P(e); the denominator makes the probabilities sum to 1.
Determine P(e) by marginalizing: P(e) = Σ_y P(Y = y, e).
Otherwise, let W = X \ (E ∪ Y). Then
P(Y | E = e) = Σ_w P(Y, W = w, e) / P(e), with P(e) = Σ_y Σ_w P(Y = y, W = w, e).
Inference with the joint distribution costs O(2^|X \ E|) for binary variables.
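As a sanity check on these formulas, here is a minimal Python sketch (not from the slides) of brute-force inference from an explicit joint table; it uses the burglary-alarm CPTs introduced on the following slides, and the `names` tuple and `query` helper are illustrative choices, not anything the slides define.

```python
from itertools import product

# Joint distribution P(B, E, A) stored explicitly, built from the CPTs of
# the burglary-alarm network below. With |X| binary variables the table
# has 2^|X| entries, which is exactly why this approach is intractable.
names = ("B", "E", "A")
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
joint = {}
for b, e, a in product((True, False), repeat=3):
    joint[(b, e, a)] = ((0.001 if b else 0.999)
                        * (0.002 if e else 0.998)
                        * (P_A[(b, e)] if a else 1 - P_A[(b, e)]))

def query(y_vars, evidence):
    """P(Y | E = e): sum the joint over W = X \\ (E u Y), divide by P(e)."""
    dist = {}
    for x, p in joint.items():
        assign = dict(zip(names, x))
        if all(assign[v] == val for v, val in evidence.items()):
            key = tuple(assign[v] for v in y_vars)
            dist[key] = dist.get(key, 0.0) + p    # marginalize out W
    z = sum(dist.values())                        # z = P(e)
    return {k: v / z for k, v in dist.items()}

print(query(("A",), {"E": True}))   # P(A | E = T); P(A=T | E=T) ~ 0.2907
```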

5 ANSWERING INFERENCE PROBLEMS WITH THE JOINT DISTRIBUTION
Another common case: Y = {Q} (a single query variable).
Can we do better than brute-force marginalization of the joint distribution?

6 TOP-DOWN INFERENCE
The burglary-alarm network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls.
P(B) = 0.001    P(E) = 0.002

 B  E | P(A|B,E)      A | P(J|A)      A | P(M|A)
 T  T |  0.95         T |  0.90       T |  0.70
 T  F |  0.94         F |  0.05       F |  0.01
 F  T |  0.29
 F  F |  0.001

Suppose we want to compute P(Alarm).

7 TOP-DOWN INFERENCE (same network and CPTs as slide 6)
Suppose we want to compute P(Alarm).
1. P(Alarm) = Σ_{b,e} P(A, b, e)
2. P(Alarm) = Σ_{b,e} P(A|b,e) P(b) P(e)

8 TOP-DOWN INFERENCE (same network and CPTs as slide 6)
Suppose we want to compute P(Alarm).
1. P(Alarm) = Σ_{b,e} P(A, b, e)
2. P(Alarm) = Σ_{b,e} P(A|b,e) P(b) P(e)
3. P(Alarm) = P(A|B,E)P(B)P(E) + P(A|B,¬E)P(B)P(¬E) + P(A|¬B,E)P(¬B)P(E) + P(A|¬B,¬E)P(¬B)P(¬E)

9 TOP-DOWN INFERENCE (same network and CPTs as slide 6)
Suppose we want to compute P(Alarm).
1. P(A) = Σ_{b,e} P(A, b, e)
2. P(A) = Σ_{b,e} P(A|b,e) P(b) P(e)
3. P(A) = P(A|B,E)P(B)P(E) + P(A|B,¬E)P(B)P(¬E) + P(A|¬B,E)P(¬B)P(E) + P(A|¬B,¬E)P(¬B)P(¬E)
4. P(A) = 0.95·0.001·0.002 + 0.94·0.001·0.998 + 0.29·0.999·0.002 + 0.001·0.999·0.998 ≈ 0.00252
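The same arithmetic in a few lines of Python, as a sketch for checking the numbers (not part of the slides):

```python
# Step 4: P(A) = sum over b, e of P(A|b,e) P(b) P(e), using the CPTs above.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}

p_alarm = sum(P_A[(b, e)]
              * (P_B if b else 1 - P_B)
              * (P_E if e else 1 - P_E)
              for b in (True, False) for e in (True, False))
print(p_alarm)   # ~ 0.002516, i.e. the 0.00252 on the slide
```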

10 TOP-DOWN INFERENCE (same network and CPTs as slide 6)
Now, suppose we want to compute P(MaryCalls).

11 TOP-DOWN INFERENCE (same network and CPTs as slide 6)
Now, suppose we want to compute P(MaryCalls).
1. P(M) = P(M|A) P(A) + P(M|¬A) P(¬A)

12 TOP-DOWN INFERENCE (same network and CPTs as slide 6)
Now, suppose we want to compute P(MaryCalls).
1. P(M) = P(M|A) P(A) + P(M|¬A) P(¬A)
2. P(M) = 0.70·0.00252 + 0.01·(1 − 0.00252) ≈ 0.0117
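Continuing the sketch above, the point of top-down inference is that the cached value `p_alarm` from the previous computation is reused rather than recomputed:

```python
# Step 2: P(M) = P(M|A) P(A) + P(M|not A) P(not A), reusing cached p_alarm.
p_mary = 0.70 * p_alarm + 0.01 * (1 - p_alarm)
print(p_mary)   # ~ 0.0117
```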

13 TOP-DOWN INFERENCE WITH EVIDENCE (same network and CPTs as slide 6)
Suppose we want to compute P(Alarm | Earthquake).

14 TOP-DOWN INFERENCE (same network and CPTs as slide 6)
Suppose we want to compute P(A|e).
1. P(A|e) = Σ_b P(A, b | e)
2. P(A|e) = Σ_b P(A|b,e) P(b)

15 TOP-DOWN INFERENCE (same network and CPTs as slide 6)
Suppose we want to compute P(A|e).
1. P(A|e) = Σ_b P(A, b | e)
2. P(A|e) = Σ_b P(A|b,e) P(b)
3. P(A|e) = 0.95·0.001 + 0.29·0.999 = 0.29066

16 TOP-DOWN INFERENCE
Only works if the graph of ancestors is a polytree, with evidence given on ancestor(s) of Q.
Efficient: O(d) time, where d is the number of ancestors of the variable, assuming |Pa_X| is bounded by a constant.
Evidence on an ancestor cuts off the influence of the portion of the graph above the evidence node.

17 NAÏVE BAYES CLASSIFIER
P(Class, Feature_1, …, Feature_n) = P(Class) Π_i P(Feature_i | Class)
Network: Class → Feature_1, Class → Feature_2, …, Class → Feature_n.
P(C | F_1, …, F_n) = P(C, F_1, …, F_n) / P(F_1, …, F_n) = (1/Z) P(C) Π_i P(F_i | C)
Given features, what class? E.g., Spam / Not Spam, or English / French / Latin, from word occurrences.

18 NORMALIZATION FACTORS
P(C | F_1, …, F_n) = P(C, F_1, …, F_n) / P(F_1, …, F_n) = (1/Z) P(C) Π_i P(F_i | C)
The 1/Z term is a normalization factor so that P(C | F_1, …, F_n) sums to 1:
Z = Σ_c P(C = c) Π_i P(F_i | C = c)
It is different for each value of F_1, …, F_n, and is often left implicit.
Usual implementation: first compute the unnormalized distribution P(C) Π_i P(F_i = f_i | C) for all values of C, then perform a normalization step in O(|Val(C)|) time, as sketched below.
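A minimal sketch of this usual implementation. The `prior` and `likelihood` tables are illustrative stand-ins for learned CPTs, not anything the slides define:

```python
def classify(features, prior, likelihood):
    """Return P(C | f_1, ..., f_n) by normalizing unnormalized scores."""
    # Unnormalized score: P(c) * product over i of P(F_i = f_i | C = c).
    unnorm = {c: prior[c] for c in prior}
    for c in unnorm:
        for f in features:
            unnorm[c] *= likelihood[c][f]
    z = sum(unnorm.values())                       # the factor Z
    return {c: p / z for c, p in unnorm.items()}   # O(|Val(C)|) final pass

# Illustrative two-class example: spam filtering on word occurrences.
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {"spam": {"viagra": 0.05, "meeting": 0.01},
              "ham": {"viagra": 0.0001, "meeting": 0.05}}
print(classify(["viagra"], prior, likelihood))
```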

19 NOTE: NUMERICAL ISSUES IN IMPLEMENTATION
Suppose P(f_i | c) is very small for all i, e.g., the probability that a given uncommon word f_i appears in a document.
The product P(C) Π_i P(f_i | C) with large n will be exceedingly small and might underflow.
A more numerically stable solution:
Compute log P(C) + Σ_i log P(f_i | C) for all values of C.
Compute b = max_c [log P(c) + Σ_i log P(f_i | c)].
Then P(C | f_1, …, f_n) = exp(log P(C) + Σ_i log P(f_i | C) − b) / Z′, with Z′ a normalization factor.
A common trick when dealing with products of many small numbers.
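The same classifier in the log domain, as a sketch of the trick described above (same illustrative `prior`/`likelihood` tables as before):

```python
import math

def classify_log(features, prior, likelihood):
    """Numerically stable P(C | f_1, ..., f_n) via log-probabilities."""
    # s_c = log P(c) + sum over i of log P(f_i | c)
    scores = {c: math.log(prior[c])
                 + sum(math.log(likelihood[c][f]) for f in features)
              for c in prior}
    b = max(scores.values())            # b = max_c s_c
    # exp(s_c - b) is at most 1 and equals 1 for the best class,
    # so the unnormalized values never all underflow to zero.
    unnorm = {c: math.exp(s - b) for c, s in scores.items()}
    z = sum(unnorm.values())            # Z'
    return {c: p / z for c, p in unnorm.items()}
```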

20 NAÏVE BAYES CLASSIFIER
P(Class, Feature_1, …, Feature_n) = P(Class) Π_i P(Feature_i | Class)
Given some features, what is the distribution over the class? If only F_1, …, F_k are observed, the unobserved features marginalize away:
P(C | F_1, …, F_k) = (1/Z) P(C, F_1, …, F_k)
  = (1/Z) Σ_{f_{k+1} … f_n} P(C, F_1, …, F_k, f_{k+1}, …, f_n)
  = (1/Z) P(C) Σ_{f_{k+1} … f_n} Π_{i=1…k} P(F_i | C) Π_{j=k+1…n} P(f_j | C)
  = (1/Z) P(C) Π_{i=1…k} P(F_i | C) Π_{j=k+1…n} Σ_{f_j} P(f_j | C)
  = (1/Z) P(C) Π_{i=1…k} P(F_i | C)
since each Σ_{f_j} P(f_j | C) = 1.

21 FOR GENERAL BAYES NETS
Exact inference: variable elimination. Efficient for polytrees and certain "simple" graphs, but NP-hard in general.
Approximate inference: Monte Carlo sampling techniques; belief propagation (exact in polytrees).

22 SUM-PRODUCT FORMULATION (same network and CPTs as slide 6)
Suppose we want to compute P(A).
P(A) = Σ_{b,e} P(A|b,e) P(b) P(e)

23 SUM-PRODUCT FORMULATION
Collapse Burglary, Earthquake, and Alarm into a single factor ψ(A,B,E) = P(A|B,E) P(B) P(E); the network becomes a mega-node A,B,E with children JohnCalls and MaryCalls.

 A  B  E | ψ(A,B,E)
 T  T  T | 1.9e-6
 T  T  F | 0.000938
 T  F  T | 0.000579
 T  F  F | 0.000997
 F  T  T | 1e-7
 F  T  F | 5.99e-5
 F  F  T | 0.00141858
 F  F  F | 0.996

Suppose we want to compute P(A).
P(A) = Σ_{b,e} ψ(A, b, e) — the table entries are the products.

24 SUM-PRODUCT FORMULATION (same factor ψ(A,B,E) as slide 23)
Suppose we want to compute P(A).
P(A) = Σ_{b,e} ψ(A, b, e) — the table entries are the products; the marginalization is the sum.
Summing the four A = T rows gives P(A = T); summing the four A = F rows gives P(A = F).

25 PROBABILITY QUERIES
Computing P(Y, E = e) in a BN is a sum-product operation:
P(Y, E = e) = Σ_w P(Y, W = w, e) = Σ_w ψ(Y ∪ E, W = w)
with ψ(X) = Π_{X ∈ X} P(X | Pa_X).
Idea of variable elimination: rearrange the order of the sums and products to obtain a recursive set of smaller sum-products.

26 VARIABLE ELIMINATION
Consider the linear network X1 → X2 → X3.
P(X) = P(X1) P(X2|X1) P(X3|X2)
P(X3) = Σ_{x1} Σ_{x2} P(x1) P(x2|x1) P(X3|x2)

27 VARIABLE ELIMINATION
Consider the linear network X1 → X2 → X3.
P(X) = P(X1) P(X2|X1) P(X3|X2)
P(X3) = Σ_{x1} Σ_{x2} P(x1) P(x2|x1) P(X3|x2)
      = Σ_{x2} P(X3|x2) Σ_{x1} P(x1) P(x2|x1)
Rearrange the equation…

28 VARIABLE ELIMINATION
Consider the linear network X1 → X2 → X3.
P(X) = P(X1) P(X2|X1) P(X3|X2)
P(X3) = Σ_{x1} Σ_{x2} P(x1) P(x2|x1) P(X3|x2)
      = Σ_{x2} P(X3|x2) Σ_{x1} P(x1) P(x2|x1)
      = Σ_{x2} P(X3|x2) τ(x2)
The factor τ(x2) is computed over each value of X2. Cache τ(x2) and use it for both values of X3!

29 VARIABLE ELIMINATION
Consider the linear network X1 → X2 → X3.
P(X) = P(X1) P(X2|X1) P(X3|X2)
P(X3) = Σ_{x1} Σ_{x2} P(x1) P(x2|x1) P(X3|x2)
      = Σ_{x2} P(X3|x2) Σ_{x1} P(x1) P(x2|x1)
      = Σ_{x2} P(X3|x2) τ(x2), with τ computed for each value of X2.
How many multiplications and additions are saved? Multiplications: 2·4·2 = 16 brute force vs. 4 + 4 = 8 with elimination. Additions: 2·3 = 6 vs. 2 + 2 = 4.
This can lead to huge gains in larger networks. A sketch of this example in code follows.
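A Python sketch of the chain example (the CPT numbers are made up for illustration; only the structure matters):

```python
# Chain X1 -> X2 -> X3, binary variables with values 0 and 1.
p_x1 = [0.6, 0.4]                        # P(X1)
p_x2 = [[0.7, 0.3], [0.2, 0.8]]          # P(X2|X1): p_x2[x1][x2]
p_x3 = [[0.9, 0.1], [0.5, 0.5]]          # P(X3|X2): p_x3[x2][x3]

# Eliminate X1 once: tau(x2) = sum over x1 of P(x1) P(x2|x1).
tau = [sum(p_x1[x1] * p_x2[x1][x2] for x1 in (0, 1)) for x2 in (0, 1)]

# Reuse the cached tau for both values of X3:
# P(X3) = sum over x2 of P(X3|x2) tau(x2).
p_x3_marg = [sum(p_x3[x2][x3] * tau[x2] for x2 in (0, 1)) for x3 in (0, 1)]
print(p_x3_marg)    # a valid distribution: the entries sum to 1
```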

30 VE IN ALARM EXAMPLE
P(E|j,m) = P(E,j,m) / P(j,m)
P(E,j,m) = Σ_a Σ_b P(E) P(b) P(a|E,b) P(j|a) P(m|a)

31 VE IN ALARM EXAMPLE
P(E|j,m) = P(E,j,m) / P(j,m)
P(E,j,m) = Σ_a Σ_b P(E) P(b) P(a|E,b) P(j|a) P(m|a)
         = P(E) Σ_b P(b) Σ_a P(a|E,b) P(j|a) P(m|a)

32 VE IN ALARM EXAMPLE
P(E|j,m) = P(E,j,m) / P(j,m)
P(E,j,m) = Σ_a Σ_b P(E) P(b) P(a|E,b) P(j|a) P(m|a)
         = P(E) Σ_b P(b) Σ_a P(a|E,b) P(j|a) P(m|a)
         = P(E) Σ_b P(b) τ(j,m,E,b)
The factor τ is computed over all values of E, b. Note: τ(j,m,E,b) = P(j,m|E,b).

33 VE IN ALARM EXAMPLE
P(E|j,m) = P(E,j,m) / P(j,m)
P(E,j,m) = Σ_a Σ_b P(E) P(b) P(a|E,b) P(j|a) P(m|a)
         = P(E) Σ_b P(b) Σ_a P(a|E,b) P(j|a) P(m|a)
         = P(E) Σ_b P(b) τ(j,m,E,b)
         = P(E) τ(j,m,E), computed for all values of E.
Note: τ(j,m,E) = P(j,m|E).
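The whole derivation, run as code (a sketch using the alarm CPTs from slide 6; `tau1` and `tau2` correspond to τ(j,m,E,b) and τ(j,m,E) above):

```python
vals = (True, False)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # keyed by (b, e)

# tau1(E, b) = sum over a of P(a|E,b) P(j|a) P(m|a), with j = m = True.
tau1 = {(e, b): sum((P_A[(b, e)] if a else 1 - P_A[(b, e)])
                    * (0.90 if a else 0.05)     # P(j=T | a)
                    * (0.70 if a else 0.01)     # P(m=T | a)
                    for a in vals)
        for e in vals for b in vals}

# tau2(E) = sum over b of P(b) tau1(E, b); then P(E, j, m) = P(E) tau2(E).
tau2 = {e: sum((0.001 if b else 0.999) * tau1[(e, b)] for b in vals)
        for e in vals}
p_ejm = {e: (0.002 if e else 0.998) * tau2[e] for e in vals}

z = sum(p_ejm.values())                         # z = P(j, m)
print({e: p / z for e, p in p_ejm.items()})     # P(E=T | j, m) ~ 0.176
```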

34 WHAT ORDER TO PERFORM VE?
For tree-like BNs (polytrees), order so that parents come before children.
The size of each intermediate probability table is then 2^k, where k is the number of parents of a node.
If the number of parents of each node is bounded, then VE takes linear time!
In other networks, intermediate factors may become large.

35 NON-POLYTREE NETWORKS
Network: A → B, A → C, B → D, C → D.
P(D) = Σ_a Σ_b Σ_c P(a) P(b|a) P(c|a) P(D|b,c)
     = Σ_b Σ_c P(D|b,c) Σ_a P(a) P(b|a) P(c|a)
No more simplifications are possible: eliminating A leaves a factor τ(b,c) that couples B and C.

36 DO TAU-FACTORS CORRESPOND TO CONDITIONAL DISTRIBUTIONS?
Sometimes, but not necessarily (consider the same network A → B, A → C, B → D, C → D).

37 IMPLEMENTATION NOTES
How should multidimensional factors be implemented?
How can sum-product be implemented efficiently?
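One common answer, offered as a sketch rather than the course's prescribed design: store a factor as a NumPy array plus an ordered tuple of variable names, so that the product aligns shared variables by broadcasting and marginalization is an axis sum.

```python
import numpy as np

class Factor:
    def __init__(self, variables, table):
        self.variables = tuple(variables)   # e.g. ("A", "B", "E")
        self.table = np.asarray(table, dtype=float)  # one axis per variable

    def product(self, other):
        """Pointwise product, aligning shared variables by name."""
        joint = tuple(dict.fromkeys(self.variables + other.variables))
        def expand(f):
            # Permute f's axes into the joint order, then insert size-1
            # axes for variables f does not mention (for broadcasting).
            present = [v for v in joint if v in f.variables]
            t = f.table.transpose([f.variables.index(v) for v in present])
            shape = [t.shape[present.index(v)] if v in present else 1
                     for v in joint]
            return t.reshape(shape)
        return Factor(joint, expand(self) * expand(other))

    def marginalize(self, var):
        """Sum out one variable: the 'sum' in sum-product."""
        axis = self.variables.index(var)
        return Factor(self.variables[:axis] + self.variables[axis + 1:],
                      self.table.sum(axis=axis))

# P(A) for the alarm network: multiply the three factors, sum out B and E.
pB = Factor(("B",), [0.001, 0.999])           # index 0 = True, 1 = False
pE = Factor(("E",), [0.002, 0.998])
pA = Factor(("A", "B", "E"),                  # P(A|B,E), axes (A, B, E)
            [[[0.95, 0.94], [0.29, 0.001]],
             [[0.05, 0.06], [0.71, 0.999]]])
f = pA.product(pB).product(pE)
print(f.marginalize("B").marginalize("E").table)   # ~ [0.00252, 0.99748]
```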

