
1 Bayesian Networks Tamara Berg CS 560 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Kilian Weinberger, Deva Ramanan

2 Announcements HW3 released online this afternoon, due Oct 29, 11:59pm. Demo by Mark Zhao!

3 Bayesian decision making Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E Inference problem: given some evidence E = e, what is P(X | e)? Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x_1, e_1), …, (x_n, e_n)}

4 Inference Graphs can have observed (shaded) and unobserved nodes. If nodes are always unobserved they are called hidden or latent variables. Probabilistic inference is the problem of computing a conditional probability distribution over the values of some of the nodes (the “hidden” or “unobserved” nodes), given the values of other nodes (the “evidence” or “observed” nodes).

5 Probabilistic inference A general scenario: –Query variables: X –Evidence (observed) variables: E = e –Unobserved variables: Y If we know the full joint distribution P(X, E, Y), how can we perform inference about X?

6 Inference Inference: calculating some useful quantity from a joint probability distribution Examples: –Posterior probability: P(Q | E_1 = e_1, …, E_k = e_k) –Most likely explanation: argmax_q P(Q = q | E_1 = e_1, …) [Figure: burglary network with nodes B, E, A, J, M]

7 Inference – computing conditional probabilities Marginalization: P(x) = Σ_y P(x, y) Conditional probabilities: P(x | e) = P(x, e) / P(e)

8 Inference by Enumeration Given unlimited time, inference in BNs is easy Recipe: –State the marginal probabilities you need –Figure out ALL the atomic probabilities you need –Calculate and combine them [Figure: burglary network with nodes B, E, A, J, M]

9 Example: Enumeration In this simple method, we only need the BN to synthesize the joint entries
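To make the recipe on slide 8 concrete, here is a minimal Python sketch of inference by enumeration over a full joint table (my own illustrative code, not from the course; the Burglary/Alarm variable names and the toy probabilities are made up for illustration).

# A minimal sketch of inference by enumeration (illustrative; the toy joint
# distribution over Burglary and Alarm uses made-up numbers).
variables = ("B", "A")
joint = {
    (True,  True):  0.0019,
    (True,  False): 0.0001,
    (False, True):  0.0499,
    (False, False): 0.9481,   # the four entries sum to 1
}

def query(joint, variables, query_var, evidence):
    """Return P(query_var | evidence) by summing consistent joint entries."""
    dist = {}
    for value in (True, False):
        total = 0.0
        for assignment, p in joint.items():
            row = dict(zip(variables, assignment))
            if row[query_var] == value and all(row[k] == v for k, v in evidence.items()):
                total += p
        dist[value] = total
    z = sum(dist.values())                      # P(evidence), the normalizer
    return {v: t / z for v, t in dist.items()}

print(query(joint, variables, "B", {"A": True}))   # P(B | +a)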

10 Probabilistic inference A general scenario: –Query variables: X –Evidence (observed) variables: E = e –Unobserved variables: Y If we know the full joint distribution P(X, E, Y), how can we perform inference about X? Problems –Full joint distributions are too large –Marginalizing out Y may involve too many summation terms

11 Inference by Enumeration? 11

12 Variable Elimination Why is inference by enumeration on a Bayes Net inefficient? –You join up the whole joint distribution before you sum out the hidden variables –You end up repeating a lot of work! Idea: interleave joining and marginalizing! –Called “Variable Elimination” –Choosing the order to eliminate variables that minimizes work is NP-hard, but *anything* sensible is much faster than inference by enumeration 12

13 General Variable Elimination Query: Start with initial factors: –Local CPTs (but instantiated by evidence) While there are still hidden variables (not Q or evidence): –Pick a hidden variable H –Join all factors mentioning H –Eliminate (sum out) H Join all remaining factors and normalize 13
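The loop above can be written compactly. The following Python sketch is my own illustration of the algorithm (the factor representation and function names are assumptions, not course code); evidence is handled, as in the recipe, by restricting each CPT's table to the observed values before calling eliminate.

# A compact sketch of variable elimination. A factor is a pair (vars, table)
# where table maps a tuple of True/False-style 0/1 values for vars to a number.
from itertools import product

def join(f1, f2):
    (v1, t1), (v2, t2) = f1, f2
    vs = tuple(dict.fromkeys(v1 + v2))                 # union of variables, order-preserving
    table = {}
    for vals in product((0, 1), repeat=len(vs)):
        a = dict(zip(vs, vals))
        table[vals] = t1[tuple(a[v] for v in v1)] * t2[tuple(a[v] for v in v2)]
    return vs, table

def sum_out(var, factor):
    vs, t = factor
    keep = tuple(v for v in vs if v != var)
    out = {}
    for vals, p in t.items():
        key = tuple(val for v, val in zip(vs, vals) if v != var)
        out[key] = out.get(key, 0.0) + p
    return keep, out

def eliminate(factors, hidden_vars):
    for h in hidden_vars:                              # eliminate hidden variables in some order
        touching = [f for f in factors if h in f[0]]
        rest = [f for f in factors if h not in f[0]]
        joined = touching[0]
        for f in touching[1:]:
            joined = join(joined, f)
        factors = rest + [sum_out(h, joined)]          # join all factors mentioning h, then sum h out
    result = factors[0]
    for f in factors[1:]:
        result = join(result, f)                       # join remaining factors
    vs, t = result
    z = sum(t.values())
    return vs, {k: v / z for k, v in t.items()}        # normalize

The elimination order is passed in explicitly; as the previous slide notes, choosing the best order is NP-hard, but any sensible order already beats enumeration.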

14 Example: Variable elimination Query: What is the probability that a student attends class, given that they pass the exam? [based on slides taken from UMBC CMSC 671, 2005]
Network: attend and study are parents of prepared; prepared, attend, and fair are parents of pass. Priors: P(at)=.8, P(st)=.6, P(fa)=.9
CPT P(pr | at, st): (at, st) = TT: 0.9, TF: 0.5, FT: 0.7, FF: 0.1
CPT P(pa | pr, at, fa): (pr, at, fa) = TTT: 0.9, TTF: 0.1, TFT: 0.7, TFF: 0.1, FTT: 0.7, FTF: 0.1, FFT: 0.2, FFF: 0.1

15 Join study factors Network as above, with P(at)=.8, P(st)=.6, P(fa)=.9. Joining P(pr | at, st) with P(st) gives the joint factor P(pr, st | at); summing out st gives the marginal P(pr | at):
(pr, st, at) = TTT: 0.9 × 0.6 = 0.54 and TFT: 0.5 × 0.4 = 0.20 → P(pr=T | at=T) = 0.74
(pr, st, at) = TTF: 0.7 × 0.6 = 0.42 and TFF: 0.1 × 0.4 = 0.04 → P(pr=T | at=F) = 0.46
(pr, st, at) = FTT: 0.1 × 0.6 = 0.06 and FFT: 0.5 × 0.4 = 0.20 → P(pr=F | at=T) = 0.26
(pr, st, at) = FTF: 0.3 × 0.6 = 0.18 and FFF: 0.9 × 0.4 = 0.36 → P(pr=F | at=F) = 0.54
CPT P(pa | pr, at, fa) is unchanged.

16 Marginalize out study The pair (prepared, study) is now a single factor; the same table as on slide 15, with headers P(pr, st | at) and P(pr | at), gives the marginals P(pr=T | at=T) = 0.74, P(pr=T | at=F) = 0.46, P(pr=F | at=T) = 0.26, P(pr=F | at=F) = 0.54. CPT P(pa | pr, at, fa) is unchanged.

17 Remove “study” Remaining network: attend → prepared, with attend, prepared, fair as parents of pass; P(at)=.8, P(fa)=.9.
New factor P(pr | at): (pr, at) = TT: 0.74, TF: 0.46, FT: 0.26, FF: 0.54
CPT P(pa | pr, at, fa): (pr, at, fa) = TTT: 0.9, TTF: 0.1, TFT: 0.7, TFF: 0.1, FTT: 0.7, FTF: 0.1, FFT: 0.2, FFF: 0.1

18 Join factors “fair” Joining P(pa | pr, at, fa) with P(fa)=.9 gives P(pa, fa | at, pre); summing out fair gives P(pa | at, pre). P(pr | at) is unchanged. For pa = t:
(pre, at, fa) = TTT: 0.9 × 0.9 = 0.81 and TTF: 0.1 × 0.1 = 0.01 → P(pa=t | at=T, pre=T) = 0.82
(pre, at, fa) = TFT: 0.7 × 0.9 = 0.63 and TFF: 0.1 × 0.1 = 0.01 → P(pa=t | at=F, pre=T) = 0.64
(pre, at, fa) = FTT: 0.7 × 0.9 = 0.63 and FTF: 0.1 × 0.1 = 0.01 → P(pa=t | at=T, pre=F) = 0.64
(pre, at, fa) = FFT: 0.2 × 0.9 = 0.18 and FFF: 0.1 × 0.1 = 0.01 → P(pa=t | at=F, pre=F) = 0.19

19 Marginalize out “fair” The pair (pass, fair) collapses to pass; as in the table on slide 18, the marginals are P(pa=t | at=T, pre=T) = 0.82, P(pa=t | at=F, pre=T) = 0.64, P(pa=t | at=T, pre=F) = 0.64, P(pa=t | at=F, pre=F) = 0.19. P(pr | at) is unchanged.

20 Marginalize out “fair” Remaining network: attend → prepared → pass, with attend also a parent of pass; P(at)=.8.
P(pr | at): (pr, at) = TT: 0.74, TF: 0.46, FT: 0.26, FF: 0.54
P(pa | at, pre): (pa, pre, at) = tTT: 0.82, tTF: 0.64, tFT: 0.64, tFF: 0.19

21 Join factors “prepared” Joining P(pa | at, pr) with P(pr | at) gives P(pa, pr | at); summing out prepared gives P(pa | at). For pa = t:
(pre, at) = TT: 0.82 × 0.74 = 0.6068 and FT: 0.64 × 0.26 = 0.1664 → P(pa=t | at=T) = 0.7732
(pre, at) = TF: 0.64 × 0.46 = 0.2944 and FF: 0.19 × 0.54 = 0.1026 → P(pa=t | at=F) = 0.397

22 Join factors “prepared” The pair (pass, prepared) collapses to a single factor; as above, P(pa=t | at=T) = 0.7732 and P(pa=t | at=F) = 0.397.

23 Join factors “prepared” Remaining network: attend → pass, P(at)=.8. P(pa | at): (pa, at) = tT: 0.7732, tF: 0.397

24 Join factors Joining P(pa | at) with P(at) gives P(pa, at), which is normalized to answer the query P(at | pa):
(pa, at) = TT: 0.7732 × 0.8 = 0.61856 → P(at=T | pa=T) ≈ 0.89
(pa, at) = TF: 0.397 × 0.2 = 0.0794 → P(at=F | pa=T) ≈ 0.11

25 Join factors The final factor is over (attend, pass): P(at=T | pa=T) ≈ 0.89, P(at=F | pa=T) ≈ 0.11. So a student who passes the exam attended class with probability about 0.89.

26 Bayesian network inference: Big picture Exact inference is intractable –There exist techniques to speed up computations, but worst-case complexity is still exponential except in some classes of networks Approximate inference –Sampling, variational methods, message passing / belief propagation…

27 Approximate Inference Sampling 27

28 Approximate Inference 28

29 Sampling – the basics... Scrooge McDuck gives you an ancient coin. He wants to know: what is P(H)? You have no homework, and nothing good is on television, so you toss it 1 million times. You obtain 700,000 heads and 300,000 tails. What is P(H)?

30 Sampling – the basics... P(H)=0.7 Why? 30

31 Monte Carlo Method 31 Who is more likely to win? Green or Purple? What is the probability that green wins, P(G)? Two ways to solve this: 1.Compute the exact probability. 2.Play 100,000 games and see how many times green wins.

32 Approximate Inference Sampling is a hot topic in machine learning, and it’s really simple Basic idea: –Draw N samples from a distribution –Compute an approximate posterior probability –Show this converges to the true probability P Why sample? –Learning: get samples from a distribution you don’t know –Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination)

33 Forward Sampling Network: Cloudy → Sprinkler, Cloudy → Rain, (Sprinkler, Rain) → WetGrass
P(C): +c 0.5, -c 0.5
P(S | C): +c: +s 0.1, -s 0.9; -c: +s 0.5, -s 0.5
P(R | C): +c: +r 0.8, -r 0.2; -c: +r 0.2, -r 0.8
P(W | S, R): +s,+r: +w 0.99, -w 0.01; +s,-r: +w 0.90, -w 0.10; -s,+r: +w 0.90, -w 0.10; -s,-r: +w 0.01, -w 0.99
Samples: +c, -s, +r, +w -c, +s, -r, +w …

34 Forward Sampling This process generates samples with probability S_PS(x_1, …, x_n) = ∏_i P(x_i | Parents(X_i)) = P(x_1, …, x_n), i.e. the BN’s joint probability. If we sample n events, then in the limit we would converge to the true joint probabilities. Therefore, the sampling procedure is consistent.
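As a concrete illustration of slides 33–34, here is a small Python sketch of forward sampling with the slide-33 CPTs (my own code; the function name and sample count are arbitrary choices).

import random

# Forward sampling for the Cloudy/Sprinkler/Rain/WetGrass network (slide-33 CPTs).
def sample_sprinkler():
    c = random.random() < 0.5                       # P(+c) = 0.5
    s = random.random() < (0.1 if c else 0.5)       # P(+s | c)
    r = random.random() < (0.8 if c else 0.2)       # P(+r | c)
    if s and r:       p_w = 0.99
    elif s and not r: p_w = 0.90
    elif r:           p_w = 0.90
    else:             p_w = 0.01
    w = random.random() < p_w                       # P(+w | s, r)
    return c, s, r, w

samples = [sample_sprinkler() for _ in range(100000)]
# Estimate P(+w) by counting, as on the next slide.
print(sum(w for _, _, _, w in samples) / len(samples))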

35 Example We’ll get a bunch of samples from the BN: +c, -s, +r, +w +c, +s, +r, +w -c, +s, +r, -w +c, -s, +r, +w -c, -s, -r, +w If we want to know P(W) –We have counts <+w: 4, -w: 1> –Normalize to get P(W) ≈ <+w: 0.8, -w: 0.2> –This will get closer to the true distribution with more samples –Can estimate anything else, too –What about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)? –Fast: can use fewer samples if less time (what’s the drawback?)

36 Rejection Sampling Let’s say we want P(C) –No point keeping all samples around –Just tally counts of C as we go Let’s say we want P(C | +s) –Same thing: tally C outcomes, but ignore (reject) samples which don’t have S=+s –This is called rejection sampling –It is also consistent for conditional probabilities (i.e., correct in the limit) Samples: +c, -s, +r, +w +c, +s, +r, +w -c, +s, +r, -w +c, -s, +r, +w -c, -s, -r, +w
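Continuing the sketch above, rejection sampling for P(C | +s) only needs a filter on the generated samples (again illustrative code; it reuses sample_sprinkler from the forward-sampling sketch).

# Rejection sampling for P(C | +s): tally C only over samples where S = +s,
# discard the rest (reuses sample_sprinkler defined above).
def rejection_sample_C_given_s(n=100000):
    kept, c_true = 0, 0
    for _ in range(n):
        c, s, r, w = sample_sprinkler()
        if not s:
            continue                  # reject samples inconsistent with the evidence
        kept += 1
        c_true += c
    return c_true / kept if kept else None   # estimate of P(+c | +s)

print(rejection_sample_C_given_s())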

37 Sampling Example There are 2 cups. –The first contains 1 penny and 1 quarter –The second contains 2 quarters Say I pick a cup uniformly at random, then pick a coin randomly from that cup. It's a quarter (yes!). What is the probability that the other coin in that cup is also a quarter?
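The slide leaves the answer for discussion; one way to check an answer is to simulate the setup directly, in the spirit of the sampling methods above (an illustrative sketch, not from the slides).

import random

# Simulate: pick a cup uniformly, pick a coin uniformly from it, and condition
# on the drawn coin being a quarter.
def simulate(n=100000):
    hits = matches = 0
    for _ in range(n):
        cup = random.choice([["penny", "quarter"], ["quarter", "quarter"]])
        drawn = random.randrange(2)
        if cup[drawn] == "quarter":            # keep only draws consistent with the evidence
            hits += 1
            matches += cup[1 - drawn] == "quarter"
    return matches / hits

print(simulate())   # converges to 2/3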

38 Likelihood Weighting Problem with rejection sampling: –If evidence is unlikely, you reject a lot of samples –You don’t exploit your evidence as you sample –Consider P(B | +a) Idea: fix evidence variables and sample the rest Problem: sample distribution not consistent! Solution: weight by probability of evidence given parents [Figure: Burglary → Alarm network with example samples such as (-b, -a), (+b, +a), (-b, +a)]

39 Likelihood Weighting Sampling distribution if z sampled and e fixed evidence: S_WE(z, e) = ∏_i P(z_i | Parents(Z_i)) Now, samples have weights: w(z, e) = ∏_j P(e_j | Parents(E_j)) Together, the weighted sampling distribution is consistent: S_WE(z, e) · w(z, e) = P(z, e)

40 Likelihood Weighting CPTs P(C), P(S | C), P(R | C), P(W | S, R) as in the forward-sampling example (slide 33). Samples: +c, +s, +r, +w …

41 Likelihood Weighting Example Inference: –Sum over weights that match query value –Divide by total sample weight What is P(C | +w, +r)?
Samples (Cloudy, Rainy, Sprinkler, WetGrass → Weight): (0, 1, 1, 1) → 0.495; (0, 0, 1, 1) → 0.45; (0, 0, 1, 1) → 0.45; (0, 0, 1, 1) → 0.45; (1, 0, 1, 1) → 0.09
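A Python sketch of likelihood weighting for the sprinkler network and a query like the P(C | +r, +w) posed above (my own illustrative code using the slide-33 CPTs; function names and the sample count are made up).

import random

# Likelihood weighting: fix the evidence variables, sample the rest in
# topological order, and weight each sample by P(evidence | its parents).
def weighted_sample(evidence):
    w = 1.0
    if "C" in evidence:
        c = evidence["C"]; w *= 0.5                 # P(+c) = P(-c) = 0.5
    else:
        c = random.random() < 0.5
    p_s = 0.1 if c else 0.5
    if "S" in evidence:
        s = evidence["S"]; w *= p_s if s else 1 - p_s
    else:
        s = random.random() < p_s
    p_r = 0.8 if c else 0.2
    if "R" in evidence:
        r = evidence["R"]; w *= p_r if r else 1 - p_r
    else:
        r = random.random() < p_r
    p_w = 0.99 if (s and r) else 0.90 if (s or r) else 0.01
    if "W" in evidence:
        wg = evidence["W"]; w *= p_w if wg else 1 - p_w
    else:
        wg = random.random() < p_w
    return (c, s, r, wg), w

def lw_query_C(evidence, n=100000):
    num = den = 0.0
    for _ in range(n):
        (c, _, _, _), w = weighted_sample(evidence)
        den += w                      # total sample weight
        if c:
            num += w                  # weights that match the query value +c
    return num / den

print(lw_query_C({"R": True, "W": True}))   # estimate of P(+c | +r, +w)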

42 Likelihood Weighting Likelihood weighting is good –We have taken evidence into account as we generate the sample –E.g. here, W’s value will get picked based on the evidence values of S, R –More of our samples will reflect the state of the world suggested by the evidence Likelihood weighting doesn’t solve all our problems –Evidence influences the choice of downstream variables, but not upstream ones (C isn’t more likely to get a value matching the evidence) We would like to consider evidence when we sample every variable

43 Markov Chain Monte Carlo* Idea: instead of sampling from scratch, create samples that are each like the last one. Procedure: resample one variable at a time, conditioned on all the rest, but keep evidence fixed. E.g., for P(b | c): [Figure: a chain of samples with evidence c fixed, e.g. (+a, +c, +b) → (+a, +c, -b) → (-a, +c, -b) → …] Properties: Now samples are not independent (in fact they’re nearly identical), but sample averages are still consistent estimators! What’s the point: both upstream and downstream variables condition on evidence.

44 Gibbs Sampling 1. Set all evidence E to e 2. Do forward sampling to obtain x_1, ..., x_n 3. Repeat: 1. Pick any variable X_i uniformly at random. 2. Resample x_i' from p(X_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) 3. Set all other x_j' = x_j 4. The new sample is x_1', ..., x_n'
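A compact Python sketch of this procedure for the query P(C | +r, +w) in the sprinkler network (illustrative code, not course code; the conditional p(X_i | everything else) is computed by evaluating the joint at both values of X_i and normalizing, which is simple though not the most efficient choice).

import random

# Joint probability of one full assignment under the slide-33 CPTs.
def joint(c, s, r, w):
    p_s = 0.1 if c else 0.5
    p_r = 0.8 if c else 0.2
    p_w = 0.99 if (s and r) else 0.90 if (s or r) else 0.01
    return 0.5 * (p_s if s else 1 - p_s) * (p_r if r else 1 - p_r) * (p_w if w else 1 - p_w)

def gibbs_C_given_rw(n=100000, burn_in=1000):
    state = {"C": True, "S": True, "R": True, "W": True}   # evidence R, W fixed to True
    hidden = ["C", "S"]
    count_c = kept = 0
    for t in range(n):
        var = random.choice(hidden)                        # pick a non-evidence variable
        probs = []
        for val in (True, False):
            s2 = dict(state); s2[var] = val
            probs.append(joint(s2["C"], s2["S"], s2["R"], s2["W"]))
        p_true = probs[0] / (probs[0] + probs[1])
        state[var] = random.random() < p_true              # resample var, keep the rest
        if t >= burn_in:
            kept += 1
            count_c += state["C"]
    return count_c / kept

print(gibbs_C_given_rw())   # estimate of P(+c | +r, +w)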

45 Markov Blanket Markov blanket of X: 1. All parents of X 2. All children of X 3. All parents of children of X (except X itself) X is conditionally independent of all other variables in the BN, given all variables in the Markov blanket (besides X).
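A small sketch of how the three rules translate to code, for a DAG stored as a child → parents dict (the representation and variable names are my own illustration).

def markov_blanket(node, parents):
    children = [c for c, ps in parents.items() if node in ps]
    blanket = set(parents.get(node, []))                  # 1. parents of node
    blanket.update(children)                              # 2. children of node
    for c in children:                                    # 3. other parents of the children
        blanket.update(p for p in parents[c] if p != node)
    return blanket

sprinkler_net = {"C": [], "S": ["C"], "R": ["C"], "W": ["S", "R"]}
print(markov_blanket("S", sprinkler_net))   # {'C', 'R', 'W'} (in some order)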

46 Inference Algorithms Exact algorithms –Elimination algorithm –Sum-product algorithm –Junction tree algorithm Sampling algorithms –Importance sampling –Markov chain Monte Carlo Variational algorithms –Mean field methods –Sum-product algorithm and variations –Semidefinite relaxations

47 Summary Sampling can be really effective –The dominant approach to inference in BNs Approaches: –Forward (/Prior) Sampling –Rejection Sampling –Likelihood Weighted Sampling

48 Bayesian decision making Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E Inference problem: given some evidence E = e, what is P(X | e)? Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x_1, e_1), …, (x_n, e_n)}

49 Learning in Bayes Nets Task 1: Given the network structure and given data (observed settings for the variables) learn the CPTs for the Bayes Net. Task 2: Given only the data (and possibly a prior over Bayes Nets), learn the entire Bayes Net (both Net structure and CPTs).

50 Parameter learning Suppose we know the network structure (but not the parameters), and have a training set of complete observations
Training set (C, S, R, W): 1: T, F, T, T; 2: F, T, F, T; 3: T, F, F, F; 4: T, T, T, T; 5: F, T, F, T; 6: T, F, T, F; …
The network’s CPT entries (shown as “?” on the slide) are the unknowns to be estimated.

51 Parameter learning Suppose we know the network structure (but not the parameters), and have a training set of complete observations –P(X | Parents(X)) is given by the observed frequencies of the different values of X for each combination of parent values
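A sketch of this frequency-counting estimate on the slide-50 training set (illustrative code; the parent sets below assume the sprinkler-style structure C → S, C → R, (S, R) → W, which the transcript does not spell out).

from collections import Counter

data = [  # (C, S, R, W), the six rows from slide 50
    (True,  False, True,  True),
    (False, True,  False, True),
    (True,  False, False, False),
    (True,  True,  True,  True),
    (False, True,  False, True),
    (True,  False, True,  False),
]

def cpt(child_idx, parent_idxs):
    """Estimate P(child = True | parent values) by observed frequencies."""
    hits, totals = Counter(), Counter()
    for row in data:
        key = tuple(row[i] for i in parent_idxs)
        totals[key] += 1
        hits[key] += row[child_idx]
    return {key: hits[key] / totals[key] for key in totals}

print(cpt(1, (0,)))        # P(S | C): e.g. P(+s | +c) = 1/4, P(+s | -c) = 1
print(cpt(3, (1, 2)))      # P(W | S, R)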

52 Parameter learning Incomplete observations Expectation maximization (EM) algorithm for dealing with missing data
Training set (C, S, R, W), with C unobserved: 1: ?, F, T, T; 2: ?, T, F, T; 3: ?, F, F, F; 4: ?, T, T, T; 5: ?, T, F, T; 6: ?, F, T, F; …
The network’s CPT entries (shown as “?” on the slide) are still the unknowns.

53 Learning from incomplete data Observe real-world data –Some variables are unobserved –Trick: Fill them in EM algorithm –E = Expectation –M = Maximization (Training set as on slide 52, with C unobserved.)

54 EM-Algorithm Initialize CPTs (e.g. randomly, uniformly) Repeat until convergence –E-Step: Compute P(X = x | e_i) for all samples i and values x of the unobserved variables, using the current CPTs (inference) –M-Step: Update CPT tables using the resulting expected counts
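A compact sketch of these two steps on the slide-52 data, where C is the only unobserved variable (illustrative code; updating only P(C), P(S | C), P(R | C) and the initial values are my own assumptions — P(W | S, R) has no hidden variables in its family here and could be estimated directly).

data = [  # (S, R, W) with C missing, from slide 52
    (False, True,  True),
    (True,  False, True),
    (False, False, False),
    (True,  True,  True),
    (True,  False, True),
    (False, True,  False),
]

# Initialize CPTs (arbitrary non-uniform guesses)
p_c = 0.6
p_s = {True: 0.3, False: 0.6}   # P(+s | C)
p_r = {True: 0.7, False: 0.3}   # P(+r | C)

for _ in range(50):
    # E-step: responsibility gamma_i = P(C = True | s_i, r_i); W drops out because
    # P(W | S, R) does not depend on C once S, R are observed.
    gammas = []
    for s, r, w in data:
        like_t = p_c * (p_s[True] if s else 1 - p_s[True]) * (p_r[True] if r else 1 - p_r[True])
        like_f = (1 - p_c) * (p_s[False] if s else 1 - p_s[False]) * (p_r[False] if r else 1 - p_r[False])
        gammas.append(like_t / (like_t + like_f))
    # M-step: re-estimate CPTs from the expected counts
    p_c = sum(gammas) / len(data)
    for c_val, weight in ((True, lambda g: g), (False, lambda g: 1 - g)):
        tot = sum(weight(g) for g in gammas)
        p_s[c_val] = sum(weight(g) for g, (s, _, _) in zip(gammas, data) if s) / tot
        p_r[c_val] = sum(weight(g) for g, (_, r, _) in zip(gammas, data) if r) / tot

print(p_c, p_s, p_r)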

55 Parameter learning What if the network structure is unknown? –Structure learning algorithms exist, but they are pretty complicated… (Training set as on slide 50; now even the arrows among C, S, R, W are unknown.)

56 Summary: Bayesian networks Joint Distribution Independence Inference Parameters Learning

57 Naïve Bayes

58 Bayesian inference, Naïve Bayes model http://xkcd.com/1236/

59 Bayes Rule The product rule gives us two ways to factor a joint probability: P(A, B) = P(A | B) P(B) = P(B | A) P(A) Therefore, P(A | B) = P(B | A) P(A) / P(B) Why is this useful? –Key tool for probabilistic inference: can get diagnostic probability from causal probability E.g., P(Cavity | Toothache) from P(Toothache | Cavity) Rev. Thomas Bayes (1702-1761)

60 Bayes Rule example Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?

61 Bayes Rule example Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?
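The transcript drops the worked answer shown on the slide; recomputing it from the numbers given: P(rain | forecast) = P(forecast | rain) P(rain) / [P(forecast | rain) P(rain) + P(forecast | ¬rain) P(¬rain)] = (0.9 × 0.014) / (0.9 × 0.014 + 0.1 × 0.986) = 0.0126 / 0.1112 ≈ 0.11, so despite the forecast there is only about an 11% chance of rain.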

62 Bayes rule: Another example 1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
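The answer is not in the transcript; applying Bayes’ rule to the numbers given: P(cancer | positive) = (0.8 × 0.01) / (0.8 × 0.01 + 0.096 × 0.99) = 0.008 / 0.10304 ≈ 0.078, i.e. roughly 7.8%.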

63 Probabilistic inference Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e –Examples: X = {spam, not spam}, e = email message X = {zebra, giraffe, hippo}, e = image features

64 Bayesian decision theory Let x be the value predicted by the agent and x* be the true value of X. The agent has a loss function, which is 0 if x = x* and 1 otherwise Expected loss: Σ_{x*} L(x, x*) P(x* | e) = 1 − P(x | e) What is the estimate of X that minimizes the expected loss? –The one that has the greatest posterior probability P(x | e) –This is called the Maximum a Posteriori (MAP) decision

65 MAP decision Value x of X that has the highest posterior probability given the evidence E = e: x_MAP = argmax_x P(x | e). Since P(x | e) ∝ P(e | x) P(x) (posterior ∝ likelihood × prior), this equals argmax_x P(e | x) P(x). Maximum likelihood (ML) decision: x_ML = argmax_x P(e | x)

66 Naïve Bayes model Suppose we have many different types of observations (symptoms, features) E_1, …, E_n that we want to use to obtain evidence about an underlying hypothesis X MAP decision involves estimating P(X | E_1, …, E_n) ∝ P(X) P(E_1, …, E_n | X) –If each feature E_i can take on k values, how many entries are in the (conditional) joint probability table P(E_1, …, E_n | X = x)?

67 Naïve Bayes model Suppose we have many different types of observations (symptoms, features) E_1, …, E_n that we want to use to obtain evidence about an underlying hypothesis X MAP decision involves estimating P(X | E_1, …, E_n) ∝ P(X) P(E_1, …, E_n | X) We can make the simplifying assumption that the different features are conditionally independent given the hypothesis: P(E_1, …, E_n | X) = ∏_i P(E_i | X) –If each feature can take on k values, what is the complexity of storing the resulting distributions?

68 Exchangeability De Finetti’s theorem of exchangeability (bag of words theorem): the joint probability distribution underlying the data is invariant to permutation.

69 Naïve Bayes model Posterior: P(X | E_1, …, E_n) ∝ P(X) ∏_i P(E_i | X) (posterior ∝ prior × likelihood) MAP decision: x* = argmax_x P(X = x) ∏_i P(E_i = e_i | X = x)

70 A Simple Example – Naïve Bayes C – Class, F – Features We only specify (parameters): –the prior over class labels, P(C) –how each feature depends on the class, P(F | C)

71 Slide from Dan Klein

72 Spam filter MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message)

73 Spam filter MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message) We have P(spam | message) ∝ P(message | spam)P(spam) and P(¬spam | message) ∝ P(message | ¬spam)P(¬spam) To enable classification, we need to be able to estimate the likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam)

74 Naïve Bayes Representation Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam) Likelihood: bag of words representation –The message is a sequence of words (w_1, …, w_n) –The order of the words in the message is not important –Each word is conditionally independent of the others given message class (spam or not spam)

75 Naïve Bayes Representation Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam) Likelihood: bag of words representation –The message is a sequence of words (w_1, …, w_n) –The order of the words in the message is not important –Each word is conditionally independent of the others given message class (spam or not spam)

76 Bag of words illustration US Presidential Speeches Tag Cloud http://chir.ag/projects/preztags/

77 Bag of words illustration US Presidential Speeches Tag Cloud http://chir.ag/projects/preztags/

78 Bag of words illustration US Presidential Speeches Tag Cloud http://chir.ag/projects/preztags/

79 Naïve Bayes Representation Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam) Likelihood: bag of words representation –The message is a sequence of words (w_1, …, w_n) –The order of the words in the message is not important –Each word is conditionally independent of the others given message class (spam or not spam) –Thus, the problem is reduced to estimating marginal likelihoods of individual words P(w_i | spam) and P(w_i | ¬spam)

80 Summary: Decision rule General MAP rule for Naïve Bayes: x* = argmax_x P(x) ∏_i P(e_i | x) (posterior ∝ prior × likelihood) Thus, the filter should classify the message as spam if P(spam) ∏_i P(w_i | spam) > P(¬spam) ∏_i P(w_i | ¬spam)
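A Python sketch of this decision rule, computed in log space to avoid underflow (illustrative code; the word probabilities and priors below are placeholders — in practice they come from the training counts described on the next slides).

import math

def classify(words, prior_spam, prior_ham, p_word_spam, p_word_ham, eps=1e-6):
    log_spam = math.log(prior_spam)
    log_ham = math.log(prior_ham)
    for w in words:
        log_spam += math.log(p_word_spam.get(w, eps))   # unseen words get a tiny probability
        log_ham += math.log(p_word_ham.get(w, eps))
    return "spam" if log_spam > log_ham else "ham"

p_word_spam = {"free": 0.05, "money": 0.04, "meeting": 0.001}   # made-up numbers
p_word_ham  = {"free": 0.002, "money": 0.003, "meeting": 0.03}
print(classify("free money now".split(), 0.5, 0.5, p_word_spam, p_word_ham))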

81 Slide from Dan Klein

82

83 Percentage of documents in training set labeled as spam/ham Slide from Dan Klein

84 In the documents labeled as spam, occurrence percentage of each word (e.g. # times “the” occurred/# total words). Slide from Dan Klein

85 In the documents labeled as ham, occurrence percentage of each word (e.g. # times “the” occurred/# total words). Slide from Dan Klein

