CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Bayesian Networks
AGENDA
Probabilistic inference queries
Top-down inference
Variable elimination
PROBABILITY QUERIES
Given: a probabilistic model over variables X
Find: the distribution over Y ⊆ X given evidence E = e for some subset E ⊆ X \ Y
P(Y | E = e): the inference problem
ANSWERING INFERENCE PROBLEMS WITH THE JOINT DISTRIBUTION
Easiest case: Y = X \ E
P(Y | E = e) = P(Y, e) / P(e)
The denominator makes the probabilities sum to 1
Determine P(e) by marginalizing: P(e) = Σ_y P(Y = y, e)
Otherwise, let W = X \ (E ∪ Y):
P(Y | E = e) = Σ_w P(Y, W = w, e) / P(e)
P(e) = Σ_y Σ_w P(Y = y, W = w, e)
Inference with the joint distribution is O(2^|X \ E|) for binary variables
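As a concrete sketch of this brute-force procedure, the snippet below answers P(Y | E = e) by summing the joint over all hidden variables W and normalizing by P(e). The joint distribution over three binary variables and all names are made up for illustration; nothing here comes from the slides themselves.

```python
from itertools import product

# A made-up joint distribution over three binary variables X = (X0, X1, X2),
# stored as a dict from full assignments to probabilities (sums to 1).
joint = {
    assignment: p
    for assignment, p in zip(
        product([True, False], repeat=3),
        [0.12, 0.18, 0.08, 0.12, 0.15, 0.05, 0.21, 0.09],
    )
}

def query(joint, query_idx, evidence):
    """P(Y | E = e) by marginalizing the joint over the hidden variables W.

    evidence: dict {variable index: value}. Cost is O(2^|X \\ E|),
    since we touch every joint entry consistent with the evidence.
    """
    dist = {}
    for assignment, p in joint.items():
        if any(assignment[i] != v for i, v in evidence.items()):
            continue                      # inconsistent with E = e
        y = assignment[query_idx]
        dist[y] = dist.get(y, 0.0) + p    # sum out W
    z = sum(dist.values())                # this is exactly P(e)
    return {y: p / z for y, p in dist.items()}
```

For example, `query(joint, 0, {1: True})` computes P(X0 | X1 = True) without ever building an intermediate table.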
ANSWERING INFERENCE PROBLEMS WITH THE JOINT DISTRIBUTION
Another common case: Y = {Q} (a single query variable)
Can we do better than brute-force marginalization of the joint distribution?
TOP-DOWN INFERENCE
[Alarm network: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls, with CPTs P(B), P(E), P(A|B,E), P(J|A), P(M|A)]
Suppose we want to compute P(Alarm)
TOP-DOWN INFERENCE
Suppose we want to compute P(Alarm)
1. P(Alarm) = Σ_b,e P(A, b, e)
2. P(Alarm) = Σ_b,e P(A | b, e) P(b) P(e)
TOP-DOWN INFERENCE
Suppose we want to compute P(Alarm)
1. P(Alarm) = Σ_b,e P(A, b, e)
2. P(Alarm) = Σ_b,e P(A | b, e) P(b) P(e)
3. P(Alarm) = P(A|B,E)P(B)P(E) + P(A|B,¬E)P(B)P(¬E) + P(A|¬B,E)P(¬B)P(E) + P(A|¬B,¬E)P(¬B)P(¬E)
TOP-DOWN INFERENCE
Suppose we want to compute P(Alarm)
1. P(A) = Σ_b,e P(A, b, e)
2. P(A) = Σ_b,e P(A | b, e) P(b) P(e)
3. P(A) = P(A|B,E)P(B)P(E) + P(A|B,¬E)P(B)P(¬E) + P(A|¬B,E)P(¬B)P(E) + P(A|¬B,¬E)P(¬B)P(¬E)
4. P(A) = 0.95·0.001·0.002 + 0.94·0.001·0.998 + 0.29·0.999·0.002 + 0.001·0.999·0.998 ≈ 0.00252
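The derivation above can be checked numerically. The sketch below assumes the usual CPT values of the burglary-alarm example (P(B) = 0.001, P(E) = 0.002, and the four entries of the P(A | b, e) table); only some of these numbers survive on the slide, so the rest are taken from the standard textbook version of this network.

```python
# CPT values from the standard burglary-alarm example (assumed here).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(A=T | b, e), keyed (b, e)
       (False, True): 0.29, (False, False): 0.001}

# Step 2 of the derivation: P(A) = sum over b, e of P(A|b,e) P(b) P(e)
p_alarm = sum(
    P_A[(b, e)] * (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    for b in (True, False) for e in (True, False)
)
print(round(p_alarm, 5))  # ≈ 0.00252
```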
TOP-DOWN INFERENCE
Now, suppose we want to compute P(MaryCalls)
TOP-DOWN INFERENCE
Now, suppose we want to compute P(MaryCalls)
1. P(M) = P(M|A)P(A) + P(M|¬A)P(¬A)
TOP-DOWN INFERENCE
Now, suppose we want to compute P(MaryCalls)
1. P(M) = P(M|A)P(A) + P(M|¬A)P(¬A)
2. P(M) = 0.70·0.00252 + 0.01·(1 − 0.00252) ≈ 0.0117
TOP-DOWN INFERENCE WITH EVIDENCE
Suppose we want to compute P(Alarm | Earthquake)
TOP-DOWN INFERENCE
Suppose we want to compute P(A|e)
1. P(A|e) = Σ_b P(A, b | e)
2. P(A|e) = Σ_b P(A | b, e) P(b)
TOP-DOWN INFERENCE
Suppose we want to compute P(A|e)
1. P(A|e) = Σ_b P(A, b | e)
2. P(A|e) = Σ_b P(A | b, e) P(b)
3. P(A|e) = 0.95·0.001 + 0.29·0.999 ≈ 0.2907
TOP-DOWN INFERENCE
Only works if the graph of ancestors is a polytree, with evidence given on ancestor(s) of Q
Efficient: O(d) time, where d is the number of ancestors of the variable (|Pa_X| assumed bounded by a constant)
Evidence on an ancestor cuts off the influence of the portion of the graph above the evidence node
NAÏVE BAYES CLASSIFIER
P(Class, Feature_1, …, Feature_n) = P(Class) Π_i P(Feature_i | Class)
[Network: Class → Feature_1, Feature_2, …, Feature_n]
P(C | F_1, …, F_n) = P(C, F_1, …, F_n) / P(F_1, …, F_n) = 1/Z P(C) Π_i P(F_i | C)
Given features, what class? Spam / Not Spam; English / French / Latin; …
Features: word occurrences
NORMALIZATION FACTORS
P(C | F_1, …, F_n) = P(C, F_1, …, F_n) / P(F_1, …, F_n) = 1/Z P(C) Π_i P(F_i | C)
The 1/Z term is a normalization factor so that P(C | F_1, …, F_n) sums to 1
Z = Σ_c P(C = c) Π_i P(F_i | C = c)
Different for each value of F_1, …, F_n; often left implicit
Usual implementation: first compute the unnormalized distribution P(C) Π_i P(F_i = f_i | C) for all values of C, then perform a normalization step in O(|Val(C)|) time
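One possible implementation of the "compute unnormalized, then normalize" recipe above, for binary features; the data-structure layout and all names are illustrative choices, not prescribed by the slides.

```python
def naive_bayes_posterior(p_class, p_feat, observed):
    """P(C | observed features) via unnormalized scores plus normalization.

    p_class:  dict {c: P(c)}                       (class prior)
    p_feat:   dict {(i, c): P(F_i = True | c)}     (binary feature CPTs)
    observed: dict {i: bool}                       (evidence F_i = f_i)
    """
    unnorm = {}
    for c, pc in p_class.items():
        p = pc
        for i, f in observed.items():
            p *= p_feat[(i, c)] if f else 1 - p_feat[(i, c)]
        unnorm[c] = p                   # P(c) * prod_i P(F_i = f_i | c)
    z = sum(unnorm.values())            # the normalization constant Z
    return {c: p / z for c, p in unnorm.items()}
```

The final loop over classes is the O(|Val(C)|) normalization step the slide mentions.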
NOTE: NUMERICAL ISSUES IN IMPLEMENTATION
Suppose P(f_i | c) is very small for all i, e.g., the probability that a given uncommon word f_i appears in a document
The product P(C) Π_i P(F_i | C) with large n will be exceedingly small and may underflow
More numerically stable solution:
Compute log P(C) + Σ_i log P(f_i | C) for all values of C
Compute b = max_c [log P(c) + Σ_i log P(f_i | c)]
P(C | f_1, …, f_n) = exp(log P(C) + Σ_i log P(f_i | C) − b) / Z′, with Z′ a normalization factor
A common trick when dealing with products of many small numbers
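The log-space trick above can be sketched as follows; the input format (per-class log prior and a list of per-feature log likelihoods) is an assumed convention for illustration.

```python
import math

def log_space_posterior(log_prior, log_lik):
    """Posterior from log P(c) + sum_i log P(f_i | c), avoiding underflow."""
    scores = {c: lp + sum(log_lik[c]) for c, lp in log_prior.items()}
    b = max(scores.values())                      # b = max_c score(c)
    exp_scores = {c: math.exp(s - b) for c, s in scores.items()}
    z = sum(exp_scores.values())                  # Z' from the slide
    return {c: e / z for c, e in exp_scores.items()}

# 1000 tiny likelihood terms per class would underflow a direct product
# (1e-4 ** 1000 == 0.0 in floating point), but the log-space version is fine:
post = log_space_posterior(
    {'a': math.log(0.5), 'b': math.log(0.5)},
    {'a': [math.log(1e-4)] * 1000, 'b': [math.log(2e-4)] * 1000},
)
```

Subtracting b before exponentiating guarantees the largest exponent is exactly 0, so at least one term of Z′ is representable.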
NAÏVE BAYES CLASSIFIER
Given some features, what is the distribution over the class?
P(Class, Feature_1, …, Feature_n) = P(Class) Π_i P(Feature_i | Class)
With only F_1, …, F_k observed, the unobserved features marginalize out:
P(C | F_1, …, F_k) = 1/Z P(C, F_1, …, F_k)
= 1/Z Σ_{f_k+1 … f_n} P(C, F_1, …, F_k, f_k+1, …, f_n)
= 1/Z P(C) Σ_{f_k+1 … f_n} Π_{i=1…k} P(F_i | C) Π_{j=k+1…n} P(f_j | C)
= 1/Z P(C) Π_{i=1…k} P(F_i | C) Π_{j=k+1…n} Σ_{f_j} P(f_j | C)
= 1/Z P(C) Π_{i=1…k} P(F_i | C)   (each Σ_{f_j} P(f_j | C) = 1)
FOR GENERAL BAYES NETS
Exact inference: variable elimination
Efficient for polytrees and certain "simple" graphs; NP-hard in general
Approximate inference
Monte Carlo sampling techniques
Belief propagation (exact in polytrees)
SUM-PRODUCT FORMULATION
Suppose we want to compute P(A)
P(A) = Σ_b,e P(A | b, e) P(b) P(e)
SUM-PRODUCT FORMULATION
[Reduced network: a single factor φ(A, B, E) (entries such as 1.9e-6) replaces the Burglary, Earthquake, Alarm subgraph; JohnCalls, MaryCalls unchanged]
Suppose we want to compute P(A)
P(A) = Σ_b,e φ(A, b, e), where φ(A, b, e) = P(A | b, e) P(b) P(e)  (product)
SUM-PRODUCT FORMULATION
Suppose we want to compute P(A)
P(A) = Σ_b,e φ(A, b, e): first form the product φ(A, b, e) = P(A | b, e) P(b) P(e) (product), then sum out b and e (sum), yielding P(A=T) and P(A=F)
PROBABILITY QUERIES
Computing P(Y, E) in a BN is a sum-product operation:
P(Y, e) = Σ_w P(Y, W = w, e) = Σ_w Π_X φ_X(Y, E = e, W = w)
with φ_X = P(X | Pa_X)
Idea of variable elimination: rearrange the order of the sums and products into a recursive set of smaller sum-products
VARIABLE ELIMINATION
Consider the linear network X_1 → X_2 → X_3
P(X) = P(X_1) P(X_2 | X_1) P(X_3 | X_2)
P(X_3) = Σ_x1 Σ_x2 P(x_1) P(x_2 | x_1) P(X_3 | x_2)
VARIABLE ELIMINATION
Consider the linear network X_1 → X_2 → X_3
P(X) = P(X_1) P(X_2 | X_1) P(X_3 | X_2)
P(X_3) = Σ_x1 Σ_x2 P(x_1) P(x_2 | x_1) P(X_3 | x_2)
= Σ_x2 P(X_3 | x_2) Σ_x1 P(x_1) P(x_2 | x_1)   (rearrange the equation)
VARIABLE ELIMINATION
Consider the linear network X_1 → X_2 → X_3
P(X) = P(X_1) P(X_2 | X_1) P(X_3 | X_2)
P(X_3) = Σ_x1 Σ_x2 P(x_1) P(x_2 | x_1) P(X_3 | x_2)
= Σ_x2 P(X_3 | x_2) Σ_x1 P(x_1) P(x_2 | x_1)
= Σ_x2 P(X_3 | x_2) τ(x_2)   (a factor over each value of X_2)
Cache τ(x_2), and use it for both values of X_3!
VARIABLE ELIMINATION
Consider the linear network X_1 → X_2 → X_3
P(X) = P(X_1) P(X_2 | X_1) P(X_3 | X_2)
P(X_3) = Σ_x1 Σ_x2 P(x_1) P(x_2 | x_1) P(X_3 | x_2)
= Σ_x2 P(X_3 | x_2) Σ_x1 P(x_1) P(x_2 | x_1)
= Σ_x2 P(X_3 | x_2) τ(x_2)   (computed for each value of X_2)
How many * and + are saved? *: 2·4·2 = 16 vs. 4 + 4 = 8; +: 2·3 = 6 vs. 2 + 2 = 4
Can lead to huge gains in larger networks
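The chain elimination above can be sketched directly in code. The CPT numbers are made up for illustration; the structure (compute τ once, reuse it for both values of X_3) is exactly the point of the slide.

```python
# Chain X1 -> X2 -> X3, binary variables; CPT values are illustrative.
P_X1 = {True: 0.3, False: 0.7}
P_X2 = {(True, True): 0.8, (True, False): 0.2,   # P(x2 | x1), keyed (x2, x1)
        (False, True): 0.2, (False, False): 0.8}
P_X3 = {(True, True): 0.6, (True, False): 0.1,   # P(x3 | x2), keyed (x3, x2)
        (False, True): 0.4, (False, False): 0.9}

# Eliminate X1: tau(x2) = sum over x1 of P(x1) P(x2 | x1)
tau = {x2: sum(P_X1[x1] * P_X2[(x2, x1)] for x1 in (True, False))
       for x2 in (True, False)}

# Then P(X3) = sum over x2 of P(X3 | x2) tau(x2), reusing the cached tau
P_X3_marg = {x3: sum(P_X3[(x3, x2)] * tau[x2] for x2 in (True, False))
             for x3 in (True, False)}
```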
VE IN ALARM EXAMPLE
P(E | j, m) = P(E, j, m) / P(j, m)
P(E, j, m) = Σ_a Σ_b P(E) P(b) P(a | E, b) P(j | a) P(m | a)
VE IN ALARM EXAMPLE
P(E | j, m) = P(E, j, m) / P(j, m)
P(E, j, m) = Σ_a Σ_b P(E) P(b) P(a | E, b) P(j | a) P(m | a)
= P(E) Σ_b P(b) Σ_a P(a | E, b) P(j | a) P(m | a)
VE IN ALARM EXAMPLE
P(E | j, m) = P(E, j, m) / P(j, m)
P(E, j, m) = Σ_a Σ_b P(E) P(b) P(a | E, b) P(j | a) P(m | a)
= P(E) Σ_b P(b) Σ_a P(a | E, b) P(j | a) P(m | a)
= P(E) Σ_b P(b) τ(j, m, E, b)   (a factor over all values of E, b)
Note: τ(j, m, E, b) = P(j, m | E, b)
VE IN ALARM EXAMPLE
P(E | j, m) = P(E, j, m) / P(j, m)
P(E, j, m) = Σ_a Σ_b P(E) P(b) P(a | E, b) P(j | a) P(m | a)
= P(E) Σ_b P(b) Σ_a P(a | E, b) P(j | a) P(m | a)
= P(E) Σ_b P(b) τ(j, m, E, b)
= P(E) τ(j, m, E)   (computed for all values of E)
Note: τ(j, m, E) = P(j, m | E)
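The full elimination can be sketched numerically. As before, the CPT values are assumed from the standard burglary-alarm example (P(B) = 0.001, P(E) = 0.002, P(J|A) = 0.90 / 0.05, P(M|A) = 0.70 / 0.01), since the slides' tables did not survive extraction.

```python
# Assumed standard CPTs for the burglary-alarm network.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(A=T | b, e), keyed (b, e)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                    # P(j=T | a)
P_M = {True: 0.70, False: 0.01}                    # P(m=T | a)

def prob(table, val, cond):
    """P(V = val | cond) from a table storing P(V = True | cond)."""
    p = table[cond]
    return p if val else 1 - p

# Eliminate A: tau1(E, b) = sum_a P(a | E, b) P(j | a) P(m | a), with j = m = True
tau1 = {(e, b): sum(prob(P_A, a, (b, e)) * P_J[a] * P_M[a]
                    for a in (True, False))
        for e in (True, False) for b in (True, False)}
# Eliminate B: tau2(E) = sum_b P(b) tau1(E, b)
tau2 = {e: sum((P_B if b else 1 - P_B) * tau1[(e, b)] for b in (True, False))
        for e in (True, False)}
# P(E, j, m) = P(E) tau2(E); normalize by P(j, m) to get P(E | j, m)
unnorm = {e: (P_E if e else 1 - P_E) * tau2[e] for e in (True, False)}
z = sum(unnorm.values())
posterior = {e: p / z for e, p in unnorm.items()}
```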
WHAT ORDER TO PERFORM VE?
For tree-like BNs (polytrees), order so that parents come before children
Each intermediate probability table has 2^k entries, where k is the number of parents of a node
If the number of parents of each node is bounded, then VE runs in linear time!
In other networks, intermediate factors may become large
NON-POLYTREE NETWORKS
[Network: A → B, A → C, {B, C} → D]
P(D) = Σ_a Σ_b Σ_c P(a) P(b|a) P(c|a) P(D|b,c)
= Σ_b Σ_c P(D|b,c) Σ_a P(a) P(b|a) P(c|a)
No more simplifications…
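To see why no further simplification is possible, here is a small numeric sketch (all CPT values made up): eliminating A produces a factor τ(b, c) over two variables at once, unlike the single-variable τ(x_2) in the chain example.

```python
# Network A -> B, A -> C, {B, C} -> D; binary variables, illustrative CPTs.
P_A = {True: 0.4, False: 0.6}
P_B = {True: 0.7, False: 0.2}            # P(B=T | a)
P_C = {True: 0.1, False: 0.5}            # P(C=T | a)
P_D = {(True, True): 0.9, (True, False): 0.6,     # P(D=T | b, c)
       (False, True): 0.3, (False, False): 0.05}

def p(table, val, cond):
    """P(V = val | cond) from a table storing P(V = True | cond)."""
    q = table[cond]
    return q if val else 1 - q

# Eliminating A couples B and C: tau(b, c) = sum_a P(a) P(b|a) P(c|a)
# -- a factor over TWO variables, which blocks further factorization.
tau = {(b, c): sum(P_A[a] * p(P_B, b, a) * p(P_C, c, a)
                   for a in (True, False))
       for b in (True, False) for c in (True, False)}
P_D_marg = sum(P_D[(b, c)] * tau[(b, c)]
               for b in (True, False) for c in (True, False))
```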
DO TAU-FACTORS CORRESPOND TO CONDITIONAL DISTRIBUTIONS?
Sometimes, but not necessarily
[Network diagram over A, B, C, D]
IMPLEMENTATION NOTES
How to implement multidimensional factors?
How to efficiently implement sum-product?
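One simple answer to both questions, as a sketch: represent a factor as a list of variable names plus a dict from assignment tuples to values, and implement sum-product as pointwise multiplication followed by summing out. The class and function names are illustrative choices, not from the slides.

```python
from itertools import product

class Factor:
    """A multidimensional factor: a variable scope plus a value table."""
    def __init__(self, variables, table):
        self.vars = variables          # e.g. ['A', 'B']
        self.table = table             # {(a_val, b_val): number}

def multiply(f, g, domains):
    """Pointwise product of two factors over the union of their scopes."""
    vs = f.vars + [v for v in g.vars if v not in f.vars]
    table = {}
    for assign in product(*(domains[v] for v in vs)):
        env = dict(zip(vs, assign))
        table[assign] = (f.table[tuple(env[v] for v in f.vars)]
                         * g.table[tuple(env[v] for v in g.vars)])
    return Factor(vs, table)

def sum_out(f, var):
    """Marginalize var out of factor f (the 'sum' in sum-product)."""
    i = f.vars.index(var)
    vs = f.vars[:i] + f.vars[i + 1:]
    table = {}
    for assign, val in f.table.items():
        key = assign[:i] + assign[i + 1:]
        table[key] = table.get(key, 0.0) + val
    return Factor(vs, table)
```

Variable elimination is then repeated `multiply` of all factors mentioning a variable, followed by one `sum_out`; a real implementation would store the tables as dense arrays instead of dicts.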