CS 188: Artificial Intelligence Spring 2007


CS 188: Artificial Intelligence Spring 2007 Lecture 12: Graphical Models 2/22/2007 Srini Narayanan – ICSI and UC Berkeley

Probabilistic Models. A probabilistic model is a joint distribution over a set of variables. Given a joint distribution, we can reason about unobserved variables given observations (evidence). General form of a query: P(X_q | x_e1, ..., x_ek), where X_q is the stuff you care about and x_e1, ..., x_ek is the stuff you already know. This kind of posterior distribution is also called the belief function of an agent which uses this model.

Independence. Two variables X and Y are independent if P(x, y) = P(x) P(y) for all values x, y. This says that their joint distribution factors into a product of two simpler distributions. Independence is a modeling assumption. Empirical joint distributions are at best “close” to independent. What could we assume for {Weather, Traffic, Cavity, Toothache}? How many parameters are in the joint model? How many parameters are in the independent model? Independence is like something from CSPs: what?

Example: Independence. N fair, independent coin flips, each with its own table: P(X1): H 0.5, T 0.5; P(X2): H 0.5, T 0.5; ...; P(Xn): H 0.5, T 0.5.

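Example: Independence? Most joint distributions are not independent, and most are poorly modeled as independent. Marginals: P(T): warm 0.5, cold 0.5. P(S): sun 0.6, rain 0.4. Joint P(T, S): warm, sun 0.4; warm, rain 0.1; cold, sun 0.2; cold, rain 0.3. Product of the marginals P(T) P(S): warm, sun 0.3; warm, rain 0.2; cold, sun 0.3; cold, rain 0.2.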
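The comparison in this slide can be checked mechanically. The sketch below is a minimal illustration, not part of the lecture: the helper names `marginals` and `is_independent` are made up here, and the two tables are the joint values from the slide (0.4, 0.1, 0.2, 0.3) and the product-of-marginals values (0.3, 0.2, 0.3, 0.2).

```python
from itertools import product
from math import isclose


def marginals(joint):
    """Sum a two-variable joint table {(t, s): p} down to P(T) and P(S)."""
    p_t, p_s = {}, {}
    for (t, s), p in joint.items():
        p_t[t] = p_t.get(t, 0.0) + p
        p_s[s] = p_s.get(s, 0.0) + p
    return p_t, p_s


def is_independent(joint, tol=1e-9):
    """True if every joint entry equals the product of its two marginals."""
    p_t, p_s = marginals(joint)
    return all(
        isclose(joint[(t, s)], p_t[t] * p_s[s], abs_tol=tol)
        for t, s in product(p_t, p_s)
    )


# First table from the slide: temperature and weather are correlated.
joint_a = {("warm", "sun"): 0.4, ("warm", "rain"): 0.1,
           ("cold", "sun"): 0.2, ("cold", "rain"): 0.3}
# Second table from the slide: same marginals, but it factors exactly.
joint_b = {("warm", "sun"): 0.3, ("warm", "rain"): 0.2,
           ("cold", "sun"): 0.3, ("cold", "rain"): 0.2}

print(is_independent(joint_a))  # False
print(is_independent(joint_b))  # True
```

Both tables share the marginals P(warm) = 0.5 and P(sun) = 0.6, so the marginals alone cannot tell them apart; only the joint entries reveal the dependence.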

Conditional Independence. P(Toothache, Cavity, Catch)? If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache: P(catch | toothache, cavity) = P(catch | cavity). The same independence holds if I don’t have a cavity: P(catch | toothache, ¬cavity) = P(catch | ¬cavity). Catch is conditionally independent of Toothache given Cavity: P(Catch | Toothache, Cavity) = P(Catch | Cavity). Equivalent statements: P(Toothache | Catch, Cavity) = P(Toothache | Cavity), and P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity).
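Why the last "equivalent statement" follows from the definition can be written out in two steps. This is a short derivation sketch using only the product rule conditioned on Cavity and the independence assumption stated in the slide:

```latex
\begin{align*}
P(\text{Toothache}, \text{Catch} \mid \text{Cavity})
  &= P(\text{Catch} \mid \text{Toothache}, \text{Cavity})\,
     P(\text{Toothache} \mid \text{Cavity})
     && \text{(product rule)} \\
  &= P(\text{Catch} \mid \text{Cavity})\,
     P(\text{Toothache} \mid \text{Cavity})
     && \text{(conditional independence)}
\end{align*}
```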

Conditional Independence. Unconditional (absolute) independence is very rare (why?). Conditional independence is our most basic and robust form of knowledge about uncertain environments. What about this domain: Traffic, Umbrella, Raining? What about fire, smoke, alarm?

The Chain Rule II. We can always factor any joint distribution as an incremental product of conditional distributions: P(x1, x2, ..., xn) = P(x1) P(x2 | x1) ... P(xn | x1, ..., xn-1). Why? This actually claims nothing… What are the sizes of the tables we supply?

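The Chain Rule III. Trivial decomposition: P(Traffic, Rain, Umbrella) = P(Rain) P(Traffic | Rain) P(Umbrella | Rain, Traffic). With conditional independence: P(Traffic, Rain, Umbrella) = P(Rain) P(Traffic | Rain) P(Umbrella | Rain). Conditional independence is our most basic and robust form of knowledge about uncertain environments. Graphical models help us manage independence.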

The Chain Rule IV. Write out the full joint distribution using the chain rule: P(Toothache, Catch, Cavity) = P(Toothache | Catch, Cavity) P(Catch, Cavity) = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity) = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity). Graphical model notation: each variable is a node, and the parents of a node are the other variables which the decomposed joint conditions on. Here Cav, with table P(Cavity), is the parent of T, with table P(Toothache | Cavity), and of Cat, with table P(Catch | Cavity). MUCH more on this to come!

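Combining Evidence. P(cavity | toothache, catch) = α P(toothache, catch | cavity) P(cavity) = α P(toothache | cavity) P(catch | cavity) P(cavity), where α is the normalization constant. This is an example of a naive Bayes model: a single cause C with effects E1, E2, ..., En that are conditionally independent given C.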
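A small numeric sketch of this computation may help. The function name `naive_bayes_posterior` and all the probabilities below are illustrative assumptions, not values from the lecture; the point is only how the normalization constant is obtained by summing the unnormalized scores over both values of the cause.

```python
def naive_bayes_posterior(prior, likelihoods, evidence):
    """Return P(C | evidence) for a naive Bayes model.

    prior:       {c: P(c)}
    likelihoods: {evidence_name: {c: P(evidence_name is true | c)}}
    evidence:    names of the effects observed to be true
    """
    unnormalized = {}
    for c, p_c in prior.items():
        score = p_c
        for e in evidence:
            score *= likelihoods[e][c]  # effects are independent given c
        unnormalized[c] = score
    alpha = 1.0 / sum(unnormalized.values())  # normalization constant
    return {c: alpha * s for c, s in unnormalized.items()}


# Hypothetical numbers, chosen only for illustration.
prior = {"cavity": 0.2, "no_cavity": 0.8}
likelihoods = {
    "toothache": {"cavity": 0.6, "no_cavity": 0.1},
    "catch": {"cavity": 0.9, "no_cavity": 0.2},
}

posterior = naive_bayes_posterior(prior, likelihoods, ["toothache", "catch"])
print(posterior)
```

With these assumed numbers the unnormalized scores are 0.108 for cavity and 0.016 for no cavity, so the posterior puts most of its mass on cavity.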

Graphical Models. Models are descriptions of how (a portion of) the world works. Models are always simplifications: they may not account for every variable, and they may not account for all interactions between variables. What do we do with probabilistic models? We (or our agents) need to reason about unknown variables, given evidence. Example: explanation (diagnostic reasoning). Example: prediction (causal reasoning). Example: value of information.

Bayes’ Nets: Big Picture. Two problems with using full joint distributions for probabilistic models: unless there are only a few variables, the joint is WAY too big to represent explicitly, and it is hard to estimate anything empirically about more than a few variables at a time. Bayes’ nets (more properly called graphical models) are a technique for describing complex joint distributions (models) using a bunch of simple, local distributions. We describe how variables locally interact. Local interactions chain together to give global, indirect interactions.

Graphical Model Notation. Nodes: variables (with domains), which can be assigned (observed) or unassigned (unobserved). Arcs: interactions, similar to CSP constraints, which indicate “direct influence” between variables. For now: imagine that arrows mean causation.

Example: Coin Flips. N independent coin flips X1, X2, ..., Xn. No interactions between variables: absolute independence.

Example: Traffic. Variables: R: It rains; T: There is traffic. Model 1: independence. Model 2: rain causes traffic (R → T). Why is an agent using model 2 better?

Example: Traffic II. Let’s build a causal graphical model. Variables: T: Traffic; R: It rains; L: Low pressure; D: Roof drips; B: Ballgame; C: Cavity.

Example: Alarm Network. Variables: B: Burglary; A: Alarm goes off; M: Mary calls; J: John calls; E: Earthquake!

Bayes’ Net Semantics. Let’s formalize the semantics of a Bayes’ net: a set of nodes, one per variable X; a directed, acyclic graph; and a conditional distribution for each node, that is, a distribution over X for each combination of its parents’ values A1, ..., An. This is stored as a CPT (conditional probability table) and is a description of a noisy “causal” process. A Bayes’ net = topology (graph) + local conditional probabilities.

Probabilities in BNs. Bayes’ nets implicitly encode joint distributions as a product of local conditional distributions. To see what probability a BN gives to a full assignment, multiply all the relevant conditionals together: P(x1, x2, ..., xn) = P(x1 | parents(X1)) P(x2 | parents(X2)) ... P(xn | parents(Xn)). Example: P(cavity, catch, ¬toothache) = P(cavity) P(catch | cavity) P(¬toothache | cavity). This lets us reconstruct any entry of the full joint. Not every BN can represent every full joint: the topology enforces certain conditional independencies.

Example: Coin Flips. Nodes X1, X2, ..., Xn with no arcs, each with the table P(Xi): h 0.5, t 0.5. Only distributions whose variables are absolutely independent can be represented by a Bayes’ net with no arcs.

Example: Traffic. Net: R → T. P(R): r 1/4, ¬r 3/4. P(T | R): given r: t 3/4, ¬t 1/4; given ¬r: t 1/2, ¬t 1/2.

Example: Alarm Network

Example: Naïve Bayes. Imagine we have one cause y and several effects x1, ..., xn: P(y, x1, ..., xn) = P(y) P(x1 | y) ... P(xn | y). This is a naïve Bayes model. We’ll use these for classification later.

Example: Traffic II. Variables: T: Traffic; R: It rains; L: Low pressure; D: Roof drips; B: Ballgame. (Figure: network over the nodes L, R, B, D, T.)

Size of a Bayes’ Net. How big is a joint distribution over N Boolean variables? 2^N entries. How big is an N-node net if nodes have k parents? O(N · 2^(k+1)) entries. Both give you the power to calculate P(X1, X2, ..., Xn). BNs: huge space savings! It is also easier to elicit local CPTs, and it also turns out to be faster to answer queries (next class).
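To make the space savings concrete: a full joint over N Boolean variables has 2^N entries, while a net in which every node has at most k parents needs at most N · 2^(k+1) CPT entries (2^k parent combinations times 2 values per node). The sketch below just evaluates both counts; the function names and the sample values of N and k are chosen here for illustration.

```python
def joint_table_size(n):
    """Entries in a full joint table over n Boolean variables."""
    return 2 ** n


def bayes_net_size(n, k):
    """Upper bound on CPT entries: n nodes, each with at most k Boolean parents."""
    return n * 2 ** (k + 1)


for n, k in [(10, 2), (30, 3)]:
    print(f"N={n}, k={k}: joint={joint_table_size(n)}, net<={bayes_net_size(n, k)}")
# N=10, k=2: joint=1024, net<=80
# N=30, k=3: joint=1073741824, net<=480
```

The joint grows exponentially in N, while the net grows only linearly in N for a fixed k.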

Building the (Entire) Joint. We can take a Bayes’ net and build the full joint distribution it encodes. Typically, there’s no reason to build ALL of it, but it’s important to know you could! To emphasize: every BN over a domain implicitly represents some joint distribution over that domain.

Example: Traffic. Basic traffic net R → T; let’s multiply out the joint. P(R): r 1/4, ¬r 3/4. P(T | R): given r: t 3/4, ¬t 1/4; given ¬r: t 1/2, ¬t 1/2. Joint P(R, T): r, t 3/16; r, ¬t 1/16; ¬r, t 6/16; ¬r, ¬t 6/16.
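Multiplying out the joint for the traffic net can be sketched in a few lines. The numbers are the ones on the slide, P(r) = 1/4, P(t | r) = 3/4 and P(t | ¬r) = 1/2; the dictionary layout and the key names such as `not_r` are choices made here for illustration. Exact fractions are used so the results match the slide without rounding.

```python
from fractions import Fraction as F

# Prior over R and conditional table for T given R, as on the slide.
p_r = {"r": F(1, 4), "not_r": F(3, 4)}
p_t_given_r = {
    "r": {"t": F(3, 4), "not_t": F(1, 4)},
    "not_r": {"t": F(1, 2), "not_t": F(1, 2)},
}

# Each joint entry is the product of the relevant local conditionals.
joint = {
    (r, t): p_r[r] * p_t_given_r[r][t]
    for r in p_r
    for t in p_t_given_r[r]
}

for assignment, p in joint.items():
    print(assignment, p)  # Fraction prints in lowest terms, so 6/16 shows as 3/8
```

The four entries sum to 1, as any joint distribution must.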

Example: Reverse Traffic. Reverse causality? Net: T → R. P(T): t 9/16, ¬t 7/16. P(R | T): given t: r 1/3, ¬r 2/3; given ¬t: r 1/7, ¬r 6/7. Joint P(R, T): r, t 3/16; r, ¬t 1/16; ¬r, t 6/16; ¬r, ¬t 6/16.
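The reversed net's tables can be derived from the same joint by marginalizing and conditioning. This is a minimal sketch starting from the joint entries on the slide (3/16, 1/16, 6/16, 6/16); the variable and key names are illustrative choices.

```python
from fractions import Fraction as F

# Joint P(R, T) from the traffic example.
joint = {
    ("r", "t"): F(3, 16), ("r", "not_t"): F(1, 16),
    ("not_r", "t"): F(6, 16), ("not_r", "not_t"): F(6, 16),
}

# Marginalize out R to get the new root table P(T).
p_t = {}
for (r, t), p in joint.items():
    p_t[t] = p_t.get(t, F(0)) + p

# Condition on T to get the new child table P(R | T).
p_r_given_t = {t: {} for t in p_t}
for (r, t), p in joint.items():
    p_r_given_t[t][r] = p / p_t[t]

# The reversed net T -> R encodes exactly the same joint.
rebuilt = {(r, t): p_t[t] * p_r_given_t[t][r] for (r, t) in joint}

print(p_t["t"], p_t["not_t"])           # 9/16 7/16
print(p_r_given_t["t"]["r"])            # 1/3
print(p_r_given_t["not_t"]["r"])        # 1/7
print(rebuilt == joint)                 # True
```

Both arrow directions represent the same distribution here, which is why arrows alone cannot be read as causation.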

Causality? When Bayes’ nets reflect the true causal patterns, they are often simpler (nodes have fewer parents), often easier to think about, and often easier to elicit from experts. BNs need not actually be causal. Sometimes no causal net exists over the domain (especially if variables are missing). E.g. consider the variables Traffic and Drips: we end up with arrows that reflect correlation, not causation. What do the arrows really mean? Topology may happen to encode causal structure. Topology really encodes conditional independencies.

Creating Bayes’ Nets. So far, we talked about how any fixed Bayes’ net encodes a joint distribution. Next: how to represent a fixed distribution as a Bayes’ net. Key ingredient: conditional independence. The exercise we did in “causal” assembly of BNs was a kind of intuitive use of conditional independence. Now we have to formalize the process. After that: how to answer queries (inference).

Estimation. How do we estimate the distribution of a random variable X? Maximum likelihood: collect observations from the world and, for each value x, look at the empirical rate of that value: P_ML(x) = count(x) / (total number of samples). This estimate is the one which maximizes the likelihood of the data. Elicitation: ask a human! Harder than it sounds. E.g. what’s P(raining | cold)? Usually we need domain experts and sophisticated ways of eliciting probabilities (e.g. betting games).
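The maximum likelihood estimate described here is just a normalized count. A minimal sketch, assuming the observations arrive as a plain list; the function name `mle` and the sample data are hypothetical:

```python
from collections import Counter


def mle(observations):
    """Maximum likelihood estimate: the empirical rate of each observed value."""
    counts = Counter(observations)
    total = len(observations)
    return {x: c / total for x, c in counts.items()}


# Hypothetical observations, for illustration only.
samples = ["sun", "sun", "rain", "sun", "rain"]
print(mle(samples))  # {'sun': 0.6, 'rain': 0.4}
```

Values that never appear in the observations receive no entry at all, which amounts to an estimate of zero.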

Estimation. Problems with maximum likelihood estimates: if I flip a coin once, and it’s heads, what’s the estimate for P(heads)? What if I flip it 50 times with 27 heads? What if I flip 10M times with 8M heads? Basic idea: we have some prior expectation about parameters (here, the probability of heads). Given little evidence, we should skew towards our prior. Given a lot of evidence, we should listen to the data. How can we accomplish this? Stay tuned!
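The slide does not say which technique the course goes on to use. As one possible illustration only, the sketch below adds imaginary "prior" flips (pseudo-counts) to the observed counts, which is an assumption made here and not the lecture's stated method. The function name and the default of one imaginary head and one imaginary tail are hypothetical.

```python
def estimate_heads(heads, flips, prior_heads=1.0, prior_tails=1.0):
    """Estimate P(heads) after adding imaginary prior flips to the observed counts.

    prior_heads / prior_tails are assumed pseudo-counts encoding a belief
    that the coin is roughly fair.
    """
    return (heads + prior_heads) / (flips + prior_heads + prior_tails)


# The three scenarios from the slide: (heads, flips).
for heads, flips in [(1, 1), (27, 50), (8_000_000, 10_000_000)]:
    ml = heads / flips                   # plain maximum likelihood estimate
    smoothed = estimate_heads(heads, flips)
    print(f"{flips} flips: ML={ml:.4f}, with prior={smoothed:.4f}")
```

With one flip the plain estimate is 1.0 while the smoothed one stays near the prior; with 10M flips the two are nearly identical, matching the behaviour the slide asks for.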