Advanced Artificial Intelligence Lecture 5: Probabilistic Inference.

Probability
"Probability theory is nothing but common sense reduced to calculation." - Pierre Laplace, 1819
"The true logic for this world is the calculus of probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind." - James Maxwell, 1850

Probabilistic Inference
- Joel Spolsky: A very senior developer who moved to Google told me that Google works and thinks at a higher level of abstraction... "Google uses Bayesian filtering the way [previous employer] uses the if statement," he said.

Google Whiteboard (image)

Example: Alarm Network
- Variables: Burglary (B), Earthquake (E), Alarm (A), John calls (J), Mary calls (M)
- Structure: B → A ← E, A → J, A → M
- P(B): P(+b) = 0.001, P(-b) = 0.999
- P(E): P(+e) = 0.002, P(-e) = 0.998
- P(A | B, E):
  P(+a | +b, +e) = 0.95,  P(-a | +b, +e) = 0.05
  P(+a | +b, -e) = 0.94,  P(-a | +b, -e) = 0.06
  P(+a | -b, +e) = 0.29,  P(-a | -b, +e) = 0.71
  P(+a | -b, -e) = 0.001, P(-a | -b, -e) = 0.999
- P(J | A): P(+j | +a) = 0.9, P(-j | +a) = 0.1, P(+j | -a) = 0.05, P(-j | -a) = 0.95
- P(M | A): P(+m | +a) = 0.7, P(-m | +a) = 0.3, P(+m | -a) = 0.01, P(-m | -a) = 0.99
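
To make the numbers concrete, here is a minimal sketch (not the lecture's own code) of how these CPTs could be stored and multiplied in Python; True/False stand for +x/-x, and names like P_B and p_a are illustrative:

```python
P_B = {True: 0.001, False: 0.999}          # P(B)
P_E = {True: 0.002, False: 0.998}          # P(E)
P_A_true = {                               # P(+a | B, E); P(-a | b, e) = 1 - this
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
P_J_true = {True: 0.9, False: 0.05}        # P(+j | A)
P_M_true = {True: 0.7, False: 0.01}        # P(+m | A)

def p_a(a, b, e): return P_A_true[(b, e)] if a else 1.0 - P_A_true[(b, e)]
def p_j(j, a):    return P_J_true[a] if j else 1.0 - P_J_true[a]
def p_m(m, a):    return P_M_true[a] if m else 1.0 - P_M_true[a]

# Probability of one full assignment, e.g. P(+b, -e, +a, +j, -m):
print(P_B[True] * P_E[False] * p_a(True, True, False) * p_j(True, True) * p_m(False, True))
```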

Probabilistic Inference
- Probabilistic inference: calculating some useful quantity from a joint probability distribution
- Example: a posterior probability P(Q | E_1 = e_1, ..., E_k = e_k)
- In general, partition the variables into Query (Q or X), Evidence (E), and Hidden (H or Y) variables

Inference by Enumeration  Given unlimited time, inference in BNs is easy  Recipe:  State the unconditional probabilities you need  Enumerate all the atomic probabilities you need  Calculate sum of products  Example: 7 BE A JM

Inference by Enumeration
P(+b, +j, +m) = ∑_e ∑_a P(+b, +j, +m, e, a)
             = ∑_e ∑_a P(+b) P(e) P(a | +b, e) P(+j | a) P(+m | a)

Inference by Enumeration  An optimization: pull terms out of summations 9 = P(+b) ∑ e P(e) ∑ a P(a|+b,e) P(+j|a) P(+m|a) or = P(+b) ∑ a P(+j|a) P(+m|a) ∑ e P(e) P(a|+b,e) P(+b, +j, +m) = ∑ e ∑ a P(+b, +j, +m, e, a) = ∑ e ∑ a P(+b) P(e) P(a|+b,e) P(+j|a) P(+m|a) BE A JM

Inference by Enumeration
- Not just 4 rows: with more hidden variables, the sum ranges over exponentially many rows of the joint. Problem?

How can we make inference tractable?

Causation and Correlation
[Diagram: the alarm network drawn in its causal order (B, E at the top, then A, then J, M), alongside the same joint distribution drawn with a reversed ordering (J, M at the top, then A, then B, E).]

Causation and Correlation
[Diagram: the same joint distribution drawn with another non-causal ordering (M, J, E, B, A); the arrows no longer reflect causation.]

Variable Elimination  Why is inference by enumeration so slow?  You join up the whole joint distribution before you sum out (marginalize) the hidden variables ( ∑ e ∑ a P(+b) P(e) P(a|+b,e) P(+j|a) P(+m|a) )  You end up repeating a lot of work!  Idea: interleave joining and marginalizing!  Called “Variable Elimination”  Still NP-hard, but usually much faster than inference by enumeration  Requires an algebra for combining “factors” (multi-dimensional arrays) 14

Variable Elimination Factors  Joint distribution: P(X,Y)  Entries P(x,y) for all x, y  Sums to 1  Selected joint: P(x,Y)  A slice of the joint distribution  Entries P(x,y) for fixed x, all y  Sums to P(x) 15 TWP hotsun0.4 hotrain0.1 coldsun0.2 coldrain0.3 TWP coldsun0.2 coldrain0.3

Variable Elimination Factors  Family of conditionals: P(X |Y)  Multiple conditional values  Entries P(x | y) for all x, y  Sums to |Y| (e.g. 2 for Boolean Y)  Single conditional: P(Y | x)  Entries P(y | x) for fixed x, all y  Sums to 1 16 TWP hotsun0.8 hotrain0.2 coldsun0.4 coldrain0.6 TWP coldsun0.4 coldrain0.6

Variable Elimination Factors  Specified family: P(y | X)  Entries P(y | x) for fixed y, but for all x  Sums to … unknown  In general, when we write P(Y 1 … Y N | X 1 … X M )  It is a “factor,” a multi-dimensional array  Its values are all P(y 1 … y N | x 1 … x M )  Any assigned X or Y is a dimension missing (selected) from the array 17 TWP hotrain0.2 coldrain0.6

Example: Traffic Domain
- Random variables: R (raining), T (traffic), L (late for class); structure R → T → L
- P(R): P(+r) = 0.1, P(-r) = 0.9
- P(T | R): P(+t | +r) = 0.8, P(-t | +r) = 0.2, P(+t | -r) = 0.1, P(-t | -r) = 0.9
- P(L | T): P(+l | +t) = 0.3, P(-l | +t) = 0.7, P(+l | -t) = 0.1, P(-l | -t) = 0.9
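
One possible way to write these three factors down in code, used by the sketches on the next few slides (the (variables, table) encoding is an assumption for illustration, not the lecture's own representation):

```python
# A factor is (variables, table): `table` maps a tuple of values, in the same
# order as `variables`, to a number.
f_R = (('R',), {('+r',): 0.1, ('-r',): 0.9})                  # P(R)
f_T = (('R', 'T'), {('+r', '+t'): 0.8, ('+r', '-t'): 0.2,
                    ('-r', '+t'): 0.1, ('-r', '-t'): 0.9})    # P(T | R)
f_L = (('T', 'L'), {('+t', '+l'): 0.3, ('+t', '-l'): 0.7,
                    ('-t', '+l'): 0.1, ('-t', '-l'): 0.9})    # P(L | T)
DOMAIN = {'R': ['+r', '-r'], 'T': ['+t', '-t'], 'L': ['+l', '-l']}
```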

Variable Elimination Outline
- Track multi-dimensional arrays called factors
- Initial factors are the local CPTs (one per node): P(R), P(T | R), P(L | T) (the tables from the previous slide)
- Any known values are selected
  - E.g. if we know L = +l, the initial factors are P(R), P(T | R), and the selected factor P(+l | T): P(+l | +t) = 0.3, P(+l | -t) = 0.1
- VE: alternately join factors and eliminate variables

Operation 1: Join Factors
- Combining factors: just like a database join
  - Get all factors that mention the joining variable
  - Build a new factor over the union of the variables involved
- Example: join on R, P(R) × P(T | R) → P(R, T)
  - P(+r, +t) = 0.1 · 0.8 = 0.08, P(+r, -t) = 0.1 · 0.2 = 0.02, P(-r, +t) = 0.9 · 0.1 = 0.09, P(-r, -t) = 0.9 · 0.9 = 0.81
- Computation for each entry: pointwise products
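
A small sketch of the join as a pointwise product, reusing the factor encoding and DOMAIN table from the traffic-domain sketch above:

```python
from itertools import product

def join(f1, f2):
    """Pointwise product of two factors over the union of their variables."""
    (vars1, t1), (vars2, t2) = f1, f2
    out_vars = vars1 + tuple(v for v in vars2 if v not in vars1)
    table = {}
    for vals in product(*(DOMAIN[v] for v in out_vars)):
        assign = dict(zip(out_vars, vals))
        p1 = t1[tuple(assign[v] for v in vars1)]
        p2 = t2[tuple(assign[v] for v in vars2)]
        table[vals] = p1 * p2
    return (out_vars, table)

# Join on R: P(R) x P(T | R) -> P(R, T); matches the 0.08 / 0.02 / 0.09 / 0.81 table.
print(join(f_R, f_T)[1])
```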

Operation 2: Eliminate
- Second basic operation: marginalization
  - Take a factor and sum out a variable
  - Shrinks a factor to a smaller one
  - A projection operation
- Example: sum R out of P(R, T)
  - P(+t) = 0.08 + 0.09 = 0.17, P(-t) = 0.02 + 0.81 = 0.83
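
And the matching sum-out operation, again in the same illustrative encoding:

```python
def eliminate(factor, var):
    """Sum a variable out of a factor (marginalization / projection)."""
    vars_, table = factor
    idx = vars_.index(var)
    out_vars = vars_[:idx] + vars_[idx + 1:]
    out = {}
    for vals, p in table.items():
        key = vals[:idx] + vals[idx + 1:]
        out[key] = out.get(key, 0.0) + p
    return (out_vars, out)

# Sum R out of P(R, T): gives P(T) = (+t: 0.17, -t: 0.83).
print(eliminate(join(f_R, f_T), 'R')[1])
```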

Example: Compute P(L)
- Start with the factors P(R), P(T | R), P(L | T)
- Join on R: P(R) × P(T | R) → P(R, T) = (+r+t: 0.08, +r-t: 0.02, -r+t: 0.09, -r-t: 0.81); P(L | T) unchanged
- Sum out R: P(R, T) → P(T) = (+t: 0.17, -t: 0.83); remaining factors are P(T), P(L | T)

Example: Compute P(L)
- Join on T: P(T) × P(L | T) → P(T, L) = (+t+l: 0.051, +t-l: 0.119, -t+l: 0.083, -t-l: 0.747)
- Sum out T: P(T, L) → P(L) = (+l: 0.134, -l: 0.866)
- Early marginalization is variable elimination
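
Chaining the two helper sketches above reproduces this slide's numbers:

```python
# Join on R, sum out R, join on T, sum out T  ->  P(L).
f_T_only = eliminate(join(f_R, f_T), 'R')       # P(T) = (+t: 0.17, -t: 0.83)
f_L_only = eliminate(join(f_T_only, f_L), 'T')  # P(L)
print(f_L_only[1])                              # approximately {('+l',): 0.134, ('-l',): 0.866}
```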

Evidence
- If there is evidence, start with factors that select that evidence
  - With no evidence, the initial factors are the full tables P(R), P(T | R), P(L | T)
  - Computing P(L | +r), the initial factors become:
    P(+r) = 0.1
    P(T | +r): P(+t | +r) = 0.8, P(-t | +r) = 0.2
    P(L | T): unchanged
- We eliminate all variables other than query + evidence

Evidence II
- The result will be a selected joint of query and evidence
  - E.g. for P(L | +r), we end up with P(+r, L): P(+r, +l) = 0.026, P(+r, -l) = 0.074
- To get our answer, just normalize this: P(L | +r) = (+l: 0.26, -l: 0.74)
- That's it!
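
The final normalization step is just dividing by the total mass; a one-liner in the same sketch:

```python
def normalize(table):
    """Scale a factor's table so its entries sum to 1."""
    z = sum(table.values())
    return {k: v / z for k, v in table.items()}

# P(+r, L) = (0.026, 0.074)  ->  P(L | +r) = (0.26, 0.74)
print(normalize({('+r', '+l'): 0.026, ('+r', '-l'): 0.074}))
```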

General Variable Elimination
- Query: P(Q | E_1 = e_1, …, E_k = e_k)
- Start with initial factors:
  - Local CPTs (but instantiated by evidence)
- While there are still hidden variables (not Q or evidence):
  - Pick a hidden variable H
  - Join all factors mentioning H
  - Eliminate (sum out) H
- Join all remaining factors and normalize
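
Putting the pieces together, here is a compact, hedged sketch of the whole VE loop, reusing join, eliminate, normalize, and DOMAIN from the earlier sketches. Evidence is handled here by temporarily restricting the evidence variables' domains, which is just one convenient way to "instantiate by evidence":

```python
def variable_elimination(factors, evidence, hidden):
    """Return the normalized factor over the remaining (query) variables."""
    saved = {v: DOMAIN[v] for v in evidence}
    for var, val in evidence.items():          # instantiate evidence
        DOMAIN[var] = [val]
    try:
        for h in hidden:                       # pick a hidden variable H
            touching = [f for f in factors if h in f[0]]
            rest = [f for f in factors if h not in f[0]]
            joined = touching[0]
            for f in touching[1:]:             # join all factors mentioning H
                joined = join(joined, f)
            factors = rest + [eliminate(joined, h)]   # eliminate (sum out) H
        result = factors[0]
        for f in factors[1:]:                  # join all remaining factors
            result = join(result, f)
        return normalize(result[1])            # and normalize
    finally:
        DOMAIN.update(saved)                   # restore the full domains

# Traffic-domain query P(L | +r): prints {('+r', '+l'): 0.26, ('+r', '-l'): 0.74}.
print(variable_elimination([f_R, f_T, f_L], evidence={'R': '+r'}, hidden=['T']))
```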

Example: P(B | +j, +m)
- Choose A: join the factors mentioning A (P(A | B, E), P(+j | A), P(+m | A)), then sum out A (∑_a)

Example (continued)
- Choose E: join the remaining factors mentioning E, then sum out E (∑_e)
- Finish with B: join the remaining factors and normalize to get P(B | +j, +m)

Approximate Inference
- Sampling / simulating / observing
- Sampling is a hot topic in machine learning, and it is really simple
- Basic idea:
  - Draw N samples from a sampling distribution S
  - Compute an approximate posterior probability
  - Show this converges to the true probability P
- Why sample?
  - Learning: get samples from a distribution you don't know
  - Inference: getting a sample is faster than computing the exact answer (e.g. with variable elimination)

Prior Sampling
- Network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler, Rain → WetGrass
- P(C): P(+c) = 0.5, P(-c) = 0.5
- P(S | C): P(+s | +c) = 0.1, P(-s | +c) = 0.9, P(+s | -c) = 0.5, P(-s | -c) = 0.5
- P(R | C): P(+r | +c) = 0.8, P(-r | +c) = 0.2, P(+r | -c) = 0.2, P(-r | -c) = 0.8
- P(W | S, R): P(+w | +s, +r) = 0.99, P(+w | +s, -r) = 0.90, P(+w | -s, +r) = 0.90, P(+w | -s, -r) = 0.01 (the -w entries are the complements)
- Samples: +c, -s, +r, +w; -c, +s, -r, +w; …
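
A small sketch of prior sampling for this network (function names are illustrative; the CPT numbers are the ones on the slide):

```python
import random

def bernoulli(p):
    """Return True with probability p."""
    return random.random() < p

def prior_sample():
    """Sample (C, S, R, W) in topological order, each from P(X | parents)."""
    c = bernoulli(0.5)                         # P(+c) = 0.5
    s = bernoulli(0.1 if c else 0.5)           # P(+s | c)
    r = bernoulli(0.8 if c else 0.2)           # P(+r | c)
    p_w = {(True, True): 0.99, (True, False): 0.90,
           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    w = bernoulli(p_w)                         # P(+w | s, r)
    return c, s, r, w

samples = [prior_sample() for _ in range(10000)]
print(sum(w for _, _, _, w in samples) / len(samples))   # estimate of P(+w)
```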

Prior Sampling  This process generates samples with probability: …i.e. the BN’s joint probability  Let the number of samples of an event be  Then  I.e., the sampling procedure is consistent 31

Example
- We'll get a bunch of samples from the BN:
  +c, -s, +r, +w
  +c, +s, +r, +w
  -c, +s, +r, -w
  +c, -s, +r, +w
  -c, -s, -r, +w
- If we want to know P(W):
  - We have counts <+w: 4, -w: 1>
  - Normalize to get P(W) = <+w: 0.8, -w: 0.2>
  - This will get closer to the true distribution with more samples
  - Can estimate anything else, too
  - Fast: can use fewer samples if less time

Rejection Sampling
- Let's say we want P(C)
  - No point keeping all samples around
  - Just tally counts of C as we go
- Let's say we want P(C | +s)
  - Same thing: tally C outcomes, but ignore (reject) samples which don't have S = +s
  - This is called rejection sampling
  - It is also consistent for conditional probabilities (i.e., correct in the limit)
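
Rejection sampling for P(C | +s) is then a small filter on top of the prior_sample sketch above:

```python
def estimate_C_given_s(n=100_000):
    """Estimate P(+c | +s): keep only samples with S = +s, tally C among them."""
    kept = [c for c, s, r, w in (prior_sample() for _ in range(n)) if s]
    return sum(kept) / len(kept)

print(estimate_C_given_s())   # exact answer is 0.05 / 0.30 = 1/6 ≈ 0.167
```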

Sampling Example
- There are 2 cups:
  - First: 1 penny and 1 quarter
  - Second: 2 quarters
- Say I pick a cup uniformly at random, then pick a coin randomly from that cup. It's a quarter. What is the probability that the other coin in that cup is also a quarter?
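
For reference, Bayes' rule gives P(two-quarter cup | drew a quarter) = (1/2 · 1) / (1/2 · 1/2 + 1/2 · 1) = 2/3, so the other coin is a quarter with probability 2/3. A quick simulation (illustrative code, not part of the lecture) agrees:

```python
import random

def draw():
    """Pick a cup uniformly, then a coin uniformly from that cup."""
    cup = random.choice([['penny', 'quarter'], ['quarter', 'quarter']])
    i = random.randrange(2)
    return cup[i], cup[1 - i]          # (drawn coin, the other coin in the cup)

draws = [draw() for _ in range(100_000)]
others = [other for drawn, other in draws if drawn == 'quarter']
print(sum(o == 'quarter' for o in others) / len(others))   # ≈ 0.667
```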

Likelihood Weighting  Problem with rejection sampling:  If evidence is unlikely, you reject a lot of samples  You don’t exploit your evidence as you sample  Consider P(B|+a)  Idea: fix evidence variables and sample the rest  Problem: sample distribution not consistent!  Solution: weight by probability of evidence given parents BurglaryAlarm BurglaryAlarm 35 -b, -a +b, +a -b +a -b, +a +b, +a

Likelihood Weighting  P ( R|+s,+w) 36 +c0.5 -c0.5 +c +s0.1 -s0.9 -c +s0.5 -s0.5 +c +r0.8 -r0.2 -c +r0.2 -r0.8 +s +r +w0.99 -w0.01 -r +w0.90 -w0.10 -s +r +w0.90 -w0.10 -r +w0.01 -w0.99 Samples: +c, +s, +r, +w … Cloudy Sprinkler Rain WetGrass Cloudy Sprinkler Rain WetGrass 0.099

Likelihood Weighting  Sampling distribution if z sampled and e fixed evidence  Now, samples have weights  Together, weighted sampling distribution is consistent Cloudy R C S W 37

Likelihood Weighting  Likelihood weighting is good  We have taken evidence into account as we generate the sample  E.g. here, W’s value will get picked based on the evidence values of S, R  More of our samples will reflect the state of the world suggested by the evidence  Likelihood weighting doesn’t solve all our problems (P(C|+s,+r))  Evidence influences the choice of downstream variables, but not upstream ones (C isn’t more likely to get a value matching the evidence)  We would like to consider evidence when we sample every variable 38 Cloudy Rain C S R W

Markov Chain Monte Carlo
- Idea: instead of sampling from scratch, create samples that are each like the last one
- Procedure: resample one variable at a time, conditioned on all the rest, but keep the evidence fixed. E.g., for P(B | +c), successive states might look like +b, +a, +c → -b, +a, +c → -b, -a, +c → …
- Properties: now samples are not independent (in fact they're nearly identical), but sample averages are still consistent estimators!
- What's the point: both upstream and downstream variables condition on evidence
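
A minimal Gibbs-sampling sketch (one common MCMC scheme) for the sprinkler network with evidence S = +s, W = +w: repeatedly resample C and R from their conditionals given everything else, keeping the evidence fixed. The code and names are illustrative, not the lecture's implementation.

```python
import random

P_S = {True: 0.1, False: 0.5}    # P(+s | C)
P_R = {True: 0.8, False: 0.2}    # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}     # P(+w | S, R)

def gibbs_P_c_given_sw(n_steps=100_000):
    """Estimate P(+c | +s, +w) by Gibbs sampling over the non-evidence vars C, R."""
    c, r = True, True            # arbitrary initial state; S = +s, W = +w stay fixed
    count_c = 0
    for _ in range(n_steps):
        # Resample C given r and evidence +s:  P(c | r, +s) proportional to P(c) P(+s | c) P(r | c)
        p_t = 0.5 * P_S[True] * (P_R[True] if r else 1 - P_R[True])
        p_f = 0.5 * P_S[False] * (P_R[False] if r else 1 - P_R[False])
        c = random.random() < p_t / (p_t + p_f)
        # Resample R given c and evidence +w:  P(r | c, +w) proportional to P(r | c) P(+w | +s, r)
        p_t = P_R[c] * P_W[(True, True)]
        p_f = (1 - P_R[c]) * P_W[(True, False)]
        r = random.random() < p_t / (p_t + p_f)
        count_c += c
    return count_c / n_steps

print(gibbs_P_c_given_sw())
```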

World's most famous probability problem?

Monty Hall Problem
- Three doors; the contestant chooses one
- The game show host reveals one of the two remaining doors, knowing it does not have the prize
- Should the contestant accept the offer to switch doors?
- P(+prize | ¬switch) = ?  P(+prize | +switch) = ?
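
A quick Monte Carlo check of the two quantities asked for above (illustrative code): staying wins about 1/3 of the time, switching about 2/3.

```python
import random

def win_rate(switch, n=100_000):
    wins = 0
    for _ in range(n):
        prize, choice = random.randrange(3), random.randrange(3)
        # Host opens a door that is neither the contestant's pick nor the prize.
        opened = random.choice([d for d in range(3) if d not in (choice, prize)])
        if switch:
            choice = next(d for d in range(3) if d not in (choice, opened))
        wins += (choice == prize)
    return wins / n

print(win_rate(switch=False), win_rate(switch=True))   # ≈ 0.333, ≈ 0.667
```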

Monty Hall on the Monty Hall Problem

Monty Hall on the Monty Hall Problem (continued)