CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Bayesian Networks
AGENDA
Probabilistic inference queries
Top-down inference
Variable elimination
PROBABILITY QUERIES
Given: a probabilistic model over variables X
Find: the distribution over Y ⊆ X given evidence E = e for some subset E ⊆ X \ Y
P(Y | E = e): the inference problem
ANSWERING INFERENCE PROBLEMS WITH THE JOINT DISTRIBUTION
Easiest case: Y = X \ E
P(Y | E = e) = P(Y, e) / P(e)
The denominator makes the probabilities sum to 1
Determine P(e) by marginalizing: P(e) = Σ_y P(Y = y, e)
Otherwise, let W = X \ (E ∪ Y):
P(Y | E = e) = Σ_w P(Y, W = w, e) / P(e)
P(e) = Σ_y Σ_w P(Y = y, W = w, e)
Inference with the joint distribution is O(2^|X \ E|) for binary variables
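As a concrete sketch of this brute-force procedure, the snippet below answers P(Y | E = e) by summing the joint over all hidden variables W and normalizing by P(e). The joint distribution over three binary variables and all names are made up for illustration; nothing here comes from the slides themselves.

```python
from itertools import product

# A made-up joint distribution over three binary variables X = (X0, X1, X2),
# stored as a dict from full assignments to probabilities (sums to 1).
joint = {
    assignment: p
    for assignment, p in zip(
        product([True, False], repeat=3),
        [0.12, 0.18, 0.08, 0.12, 0.15, 0.05, 0.21, 0.09],
    )
}

def query(joint, query_idx, evidence):
    """P(Y | E = e) by marginalizing the joint over the hidden variables W.

    evidence: dict {variable index: value}. Cost is O(2^|X \\ E|),
    since we touch every joint entry consistent with the evidence.
    """
    dist = {}
    for assignment, p in joint.items():
        if any(assignment[i] != v for i, v in evidence.items()):
            continue                      # inconsistent with E = e
        y = assignment[query_idx]
        dist[y] = dist.get(y, 0.0) + p    # sum out W
    z = sum(dist.values())                # this is exactly P(e)
    return {y: p / z for y, p in dist.items()}
```

For example, `query(joint, 0, {1: True})` computes P(X0 | X1 = True) without ever building an intermediate table.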
ANSWERING INFERENCE PROBLEMS WITH THE JOINT DISTRIBUTION
Another common case: Y = {Q} (a single query variable)
Can we do better than brute-force marginalization of the joint distribution?
TOP-DOWN INFERENCE
[Alarm network: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls, with CPTs P(B), P(E), P(A|B,E), P(J|A), P(M|A)]
Suppose we want to compute P(Alarm)
TOP-DOWN INFERENCE
Suppose we want to compute P(Alarm)
1. P(Alarm) = Σ_b,e P(A, b, e)
2. P(Alarm) = Σ_b,e P(A | b, e) P(b) P(e)
TOP-DOWN INFERENCE
Suppose we want to compute P(Alarm)
1. P(Alarm) = Σ_b,e P(A, b, e)
2. P(Alarm) = Σ_b,e P(A | b, e) P(b) P(e)
3. P(Alarm) = P(A|B,E)P(B)P(E) + P(A|B,¬E)P(B)P(¬E) + P(A|¬B,E)P(¬B)P(E) + P(A|¬B,¬E)P(¬B)P(¬E)
TOP-DOWN INFERENCE
Suppose we want to compute P(Alarm)
1. P(A) = Σ_b,e P(A, b, e)
2. P(A) = Σ_b,e P(A | b, e) P(b) P(e)
3. P(A) = P(A|B,E)P(B)P(E) + P(A|B,¬E)P(B)P(¬E) + P(A|¬B,E)P(¬B)P(E) + P(A|¬B,¬E)P(¬B)P(¬E)
4. P(A) = 0.95·0.001·0.002 + 0.94·0.001·0.998 + 0.29·0.999·0.002 + 0.001·0.999·0.998 ≈ 0.00252
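The derivation above can be checked numerically. The sketch below assumes the usual CPT values of the burglary-alarm example (P(B) = 0.001, P(E) = 0.002, and the four entries of the P(A | b, e) table); only some of these numbers survive on the slide, so the rest are taken from the standard textbook version of this network.

```python
# CPT values from the standard burglary-alarm example (assumed here).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(A=T | b, e), keyed (b, e)
       (False, True): 0.29, (False, False): 0.001}

# Step 2 of the derivation: P(A) = sum over b, e of P(A|b,e) P(b) P(e)
p_alarm = sum(
    P_A[(b, e)] * (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    for b in (True, False) for e in (True, False)
)
print(round(p_alarm, 5))  # ≈ 0.00252
```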
TOP-DOWN INFERENCE
Now, suppose we want to compute P(MaryCalls)
TOP-DOWN INFERENCE
Now, suppose we want to compute P(MaryCalls)
1. P(M) = P(M|A)P(A) + P(M|¬A)P(¬A)
TOP-DOWN INFERENCE
Now, suppose we want to compute P(MaryCalls)
1. P(M) = P(M|A)P(A) + P(M|¬A)P(¬A)
2. P(M) = 0.70·0.00252 + 0.01·(1 − 0.00252) ≈ 0.0117
TOP-DOWN INFERENCE WITH EVIDENCE
Suppose we want to compute P(Alarm | Earthquake)
TOP-DOWN INFERENCE
Suppose we want to compute P(A|e)
1. P(A|e) = Σ_b P(A, b | e)
2. P(A|e) = Σ_b P(A | b, e) P(b)
TOP-DOWN INFERENCE
Suppose we want to compute P(A|e)
1. P(A|e) = Σ_b P(A, b | e)
2. P(A|e) = Σ_b P(A | b, e) P(b)
3. P(A|e) = 0.95·0.001 + 0.29·0.999 ≈ 0.2907
TOP-DOWN INFERENCE
Only works if the graph of ancestors is a polytree, with evidence given on ancestor(s) of Q
Efficient: O(d) time, where d is the number of ancestors of the variable (|Pa_X| assumed bounded by a constant)
Evidence on an ancestor cuts off the influence of the portion of the graph above the evidence node
NAÏVE BAYES CLASSIFIER
P(Class, Feature_1, …, Feature_n) = P(Class) Π_i P(Feature_i | Class)
[Network: Class → Feature_1, Feature_2, …, Feature_n]
P(C | F_1, …, F_n) = P(C, F_1, …, F_n) / P(F_1, …, F_n) = 1/Z P(C) Π_i P(F_i | C)
Given features, what class? Spam / Not Spam; English / French / Latin; …
Features: word occurrences
NORMALIZATION FACTORS
P(C | F_1, …, F_n) = P(C, F_1, …, F_n) / P(F_1, …, F_n) = 1/Z P(C) Π_i P(F_i | C)
The 1/Z term is a normalization factor so that P(C | F_1, …, F_n) sums to 1
Z = Σ_c P(C = c) Π_i P(F_i | C = c)
Different for each value of F_1, …, F_n; often left implicit
Usual implementation: first compute the unnormalized distribution P(C) Π_i P(F_i = f_i | C) for all values of C, then perform a normalization step in O(|Val(C)|) time
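One possible implementation of the "compute unnormalized, then normalize" recipe above, for binary features; the data-structure layout and all names are illustrative choices, not prescribed by the slides.

```python
def naive_bayes_posterior(p_class, p_feat, observed):
    """P(C | observed features) via unnormalized scores plus normalization.

    p_class:  dict {c: P(c)}                       (class prior)
    p_feat:   dict {(i, c): P(F_i = True | c)}     (binary feature CPTs)
    observed: dict {i: bool}                       (evidence F_i = f_i)
    """
    unnorm = {}
    for c, pc in p_class.items():
        p = pc
        for i, f in observed.items():
            p *= p_feat[(i, c)] if f else 1 - p_feat[(i, c)]
        unnorm[c] = p                   # P(c) * prod_i P(F_i = f_i | c)
    z = sum(unnorm.values())            # the normalization constant Z
    return {c: p / z for c, p in unnorm.items()}
```

The final loop over classes is the O(|Val(C)|) normalization step the slide mentions.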
NOTE: NUMERICAL ISSUES IN IMPLEMENTATION
Suppose P(f_i | c) is very small for all i, e.g., the probability that a given uncommon word f_i appears in a document
The product P(C) Π_i P(F_i | C) with large n will be exceedingly small and may underflow
More numerically stable solution:
Compute log P(C) + Σ_i log P(f_i | C) for all values of C
Compute b = max_c [log P(c) + Σ_i log P(f_i | c)]
P(C | f_1, …, f_n) = exp(log P(C) + Σ_i log P(f_i | C) − b) / Z′, with Z′ a normalization factor
A common trick when dealing with products of many small numbers
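The log-space trick above can be sketched as follows; the input format (per-class log prior and a list of per-feature log likelihoods) is an assumed convention for illustration.

```python
import math

def log_space_posterior(log_prior, log_lik):
    """Posterior from log P(c) + sum_i log P(f_i | c), avoiding underflow."""
    scores = {c: lp + sum(log_lik[c]) for c, lp in log_prior.items()}
    b = max(scores.values())                      # b = max_c score(c)
    exp_scores = {c: math.exp(s - b) for c, s in scores.items()}
    z = sum(exp_scores.values())                  # Z' from the slide
    return {c: e / z for c, e in exp_scores.items()}

# 1000 tiny likelihood terms per class would underflow a direct product
# (1e-4 ** 1000 == 0.0 in floating point), but the log-space version is fine:
post = log_space_posterior(
    {'a': math.log(0.5), 'b': math.log(0.5)},
    {'a': [math.log(1e-4)] * 1000, 'b': [math.log(2e-4)] * 1000},
)
```

Subtracting b before exponentiating guarantees the largest exponent is exactly 0, so at least one term of Z′ is representable.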
NAÏVE BAYES CLASSIFIER
Given some features, what is the distribution over the class?
P(Class, Feature_1, …, Feature_n) = P(Class) Π_i P(Feature_i | Class)
With only F_1, …, F_k observed, the unobserved features marginalize out:
P(C | F_1, …, F_k) = 1/Z P(C, F_1, …, F_k)
= 1/Z Σ_{f_k+1 … f_n} P(C, F_1, …, F_k, f_k+1, …, f_n)
= 1/Z P(C) Σ_{f_k+1 … f_n} Π_{i=1…k} P(F_i | C) Π_{j=k+1…n} P(f_j | C)
= 1/Z P(C) Π_{i=1…k} P(F_i | C) Π_{j=k+1…n} Σ_{f_j} P(f_j | C)
= 1/Z P(C) Π_{i=1…k} P(F_i | C)   (each Σ_{f_j} P(f_j | C) = 1)
FOR GENERAL BAYES NETS
Exact inference: variable elimination
Efficient for polytrees and certain "simple" graphs; NP-hard in general
Approximate inference
Monte Carlo sampling techniques
Belief propagation (exact in polytrees)
SUM-PRODUCT FORMULATION
Suppose we want to compute P(A)
P(A) = Σ_b,e P(A | b, e) P(b) P(e)
SUM-PRODUCT FORMULATION
[Reduced network: a single factor φ(A, B, E) (entries such as 1.9e-6) replaces the Burglary, Earthquake, Alarm subgraph; JohnCalls, MaryCalls unchanged]
Suppose we want to compute P(A)
P(A) = Σ_b,e φ(A, b, e), where φ(A, b, e) = P(A | b, e) P(b) P(e)  (product)
SUM-PRODUCT FORMULATION
Suppose we want to compute P(A)
P(A) = Σ_b,e φ(A, b, e): first form the product φ(A, b, e) = P(A | b, e) P(b) P(e) (product), then sum out b and e (sum), yielding P(A=T) and P(A=F)
PROBABILITY QUERIES
Computing P(Y, E) in a BN is a sum-product operation:
P(Y, e) = Σ_w P(Y, W = w, e) = Σ_w Π_X φ_X(Y, E = e, W = w)
with φ_X = P(X | Pa_X)
Idea of variable elimination: rearrange the order of the sums and products into a recursive set of smaller sum-products
VARIABLE ELIMINATION
Consider the linear network X_1 → X_2 → X_3
P(X) = P(X_1) P(X_2 | X_1) P(X_3 | X_2)
P(X_3) = Σ_x1 Σ_x2 P(x_1) P(x_2 | x_1) P(X_3 | x_2)
VARIABLE ELIMINATION
Consider the linear network X_1 → X_2 → X_3
P(X) = P(X_1) P(X_2 | X_1) P(X_3 | X_2)
P(X_3) = Σ_x1 Σ_x2 P(x_1) P(x_2 | x_1) P(X_3 | x_2)
= Σ_x2 P(X_3 | x_2) Σ_x1 P(x_1) P(x_2 | x_1)   (rearrange the equation)
VARIABLE ELIMINATION
Consider the linear network X_1 → X_2 → X_3
P(X) = P(X_1) P(X_2 | X_1) P(X_3 | X_2)
P(X_3) = Σ_x1 Σ_x2 P(x_1) P(x_2 | x_1) P(X_3 | x_2)
= Σ_x2 P(X_3 | x_2) Σ_x1 P(x_1) P(x_2 | x_1)
= Σ_x2 P(X_3 | x_2) τ(x_2)   (a factor over each value of X_2)
Cache τ(x_2), and use it for both values of X_3!
VARIABLE ELIMINATION
Consider the linear network X_1 → X_2 → X_3
P(X) = P(X_1) P(X_2 | X_1) P(X_3 | X_2)
P(X_3) = Σ_x1 Σ_x2 P(x_1) P(x_2 | x_1) P(X_3 | x_2)
= Σ_x2 P(X_3 | x_2) Σ_x1 P(x_1) P(x_2 | x_1)
= Σ_x2 P(X_3 | x_2) τ(x_2)   (computed for each value of X_2)
How many * and + are saved? *: 2·4·2 = 16 vs. 4 + 4 = 8; +: 2·3 = 6 vs. 2 + 2 = 4
Can lead to huge gains in larger networks
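The chain elimination above can be sketched directly in code. The CPT numbers are made up for illustration; the structure (compute τ once, reuse it for both values of X_3) is exactly the point of the slide.

```python
# Chain X1 -> X2 -> X3, binary variables; CPT values are illustrative.
P_X1 = {True: 0.3, False: 0.7}
P_X2 = {(True, True): 0.8, (True, False): 0.2,   # P(x2 | x1), keyed (x2, x1)
        (False, True): 0.2, (False, False): 0.8}
P_X3 = {(True, True): 0.6, (True, False): 0.1,   # P(x3 | x2), keyed (x3, x2)
        (False, True): 0.4, (False, False): 0.9}

# Eliminate X1: tau(x2) = sum over x1 of P(x1) P(x2 | x1)
tau = {x2: sum(P_X1[x1] * P_X2[(x2, x1)] for x1 in (True, False))
       for x2 in (True, False)}

# Then P(X3) = sum over x2 of P(X3 | x2) tau(x2), reusing the cached tau
P_X3_marg = {x3: sum(P_X3[(x3, x2)] * tau[x2] for x2 in (True, False))
             for x3 in (True, False)}
```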
VE IN ALARM EXAMPLE
P(E | j, m) = P(E, j, m) / P(j, m)
P(E, j, m) = Σ_a Σ_b P(E) P(b) P(a | E, b) P(j | a) P(m | a)
VE IN ALARM EXAMPLE
P(E | j, m) = P(E, j, m) / P(j, m)
P(E, j, m) = Σ_a Σ_b P(E) P(b) P(a | E, b) P(j | a) P(m | a)
= P(E) Σ_b P(b) Σ_a P(a | E, b) P(j | a) P(m | a)
VE IN ALARM EXAMPLE
P(E | j, m) = P(E, j, m) / P(j, m)
P(E, j, m) = Σ_a Σ_b P(E) P(b) P(a | E, b) P(j | a) P(m | a)
= P(E) Σ_b P(b) Σ_a P(a | E, b) P(j | a) P(m | a)
= P(E) Σ_b P(b) τ(j, m, E, b)   (a factor over all values of E, b)
Note: τ(j, m, E, b) = P(j, m | E, b)
VE IN ALARM EXAMPLE
P(E | j, m) = P(E, j, m) / P(j, m)
P(E, j, m) = Σ_a Σ_b P(E) P(b) P(a | E, b) P(j | a) P(m | a)
= P(E) Σ_b P(b) Σ_a P(a | E, b) P(j | a) P(m | a)
= P(E) Σ_b P(b) τ(j, m, E, b)
= P(E) τ(j, m, E)   (computed for all values of E)
Note: τ(j, m, E) = P(j, m | E)
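The full elimination can be sketched numerically. As before, the CPT values are assumed from the standard burglary-alarm example (P(B) = 0.001, P(E) = 0.002, P(J|A) = 0.90 / 0.05, P(M|A) = 0.70 / 0.01), since the slides' tables did not survive extraction.

```python
# Assumed standard CPTs for the burglary-alarm network.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(A=T | b, e), keyed (b, e)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                    # P(j=T | a)
P_M = {True: 0.70, False: 0.01}                    # P(m=T | a)

def prob(table, val, cond):
    """P(V = val | cond) from a table storing P(V = True | cond)."""
    p = table[cond]
    return p if val else 1 - p

# Eliminate A: tau1(E, b) = sum_a P(a | E, b) P(j | a) P(m | a), with j = m = True
tau1 = {(e, b): sum(prob(P_A, a, (b, e)) * P_J[a] * P_M[a]
                    for a in (True, False))
        for e in (True, False) for b in (True, False)}
# Eliminate B: tau2(E) = sum_b P(b) tau1(E, b)
tau2 = {e: sum((P_B if b else 1 - P_B) * tau1[(e, b)] for b in (True, False))
        for e in (True, False)}
# P(E, j, m) = P(E) tau2(E); normalize by P(j, m) to get P(E | j, m)
unnorm = {e: (P_E if e else 1 - P_E) * tau2[e] for e in (True, False)}
z = sum(unnorm.values())
posterior = {e: p / z for e, p in unnorm.items()}
```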
WHAT ORDER TO PERFORM VE?
For tree-like BNs (polytrees), order so that parents come before children
Each intermediate probability table has 2^k entries, where k is the number of parents of a node
If the number of parents of each node is bounded, then VE runs in linear time!
In other networks, intermediate factors may become large
NON-POLYTREE NETWORKS
[Network: A → B, A → C, {B, C} → D]
P(D) = Σ_a Σ_b Σ_c P(a) P(b|a) P(c|a) P(D|b,c)
= Σ_b Σ_c P(D|b,c) Σ_a P(a) P(b|a) P(c|a)
No more simplifications…
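To see why no further simplification is possible, here is a small numeric sketch (all CPT values made up): eliminating A produces a factor τ(b, c) over two variables at once, unlike the single-variable τ(x_2) in the chain example.

```python
# Network A -> B, A -> C, {B, C} -> D; binary variables, illustrative CPTs.
P_A = {True: 0.4, False: 0.6}
P_B = {True: 0.7, False: 0.2}            # P(B=T | a)
P_C = {True: 0.1, False: 0.5}            # P(C=T | a)
P_D = {(True, True): 0.9, (True, False): 0.6,     # P(D=T | b, c)
       (False, True): 0.3, (False, False): 0.05}

def p(table, val, cond):
    """P(V = val | cond) from a table storing P(V = True | cond)."""
    q = table[cond]
    return q if val else 1 - q

# Eliminating A couples B and C: tau(b, c) = sum_a P(a) P(b|a) P(c|a)
# -- a factor over TWO variables, which blocks further factorization.
tau = {(b, c): sum(P_A[a] * p(P_B, b, a) * p(P_C, c, a)
                   for a in (True, False))
       for b in (True, False) for c in (True, False)}
P_D_marg = sum(P_D[(b, c)] * tau[(b, c)]
               for b in (True, False) for c in (True, False))
```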
DO TAU-FACTORS CORRESPOND TO CONDITIONAL DISTRIBUTIONS?
Sometimes, but not necessarily
[Network diagram over A, B, C, D]
IMPLEMENTATION NOTES
How to implement multidimensional factors?
How to efficiently implement sum-product?
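One simple answer to both questions, as a sketch: represent a factor as a list of variable names plus a dict from assignment tuples to values, and implement sum-product as pointwise multiplication followed by summing out. The class and function names are illustrative choices, not from the slides.

```python
from itertools import product

class Factor:
    """A multidimensional factor: a variable scope plus a value table."""
    def __init__(self, variables, table):
        self.vars = variables          # e.g. ['A', 'B']
        self.table = table             # {(a_val, b_val): number}

def multiply(f, g, domains):
    """Pointwise product of two factors over the union of their scopes."""
    vs = f.vars + [v for v in g.vars if v not in f.vars]
    table = {}
    for assign in product(*(domains[v] for v in vs)):
        env = dict(zip(vs, assign))
        table[assign] = (f.table[tuple(env[v] for v in f.vars)]
                         * g.table[tuple(env[v] for v in g.vars)])
    return Factor(vs, table)

def sum_out(f, var):
    """Marginalize var out of factor f (the 'sum' in sum-product)."""
    i = f.vars.index(var)
    vs = f.vars[:i] + f.vars[i + 1:]
    table = {}
    for assign, val in f.table.items():
        key = assign[:i] + assign[i + 1:]
        table[key] = table.get(key, 0.0) + val
    return Factor(vs, table)
```

Variable elimination is then repeated `multiply` of all factors mentioning a variable, followed by one `sum_out`; a real implementation would store the tables as dense arrays instead of dicts.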