Notes on Graphical Models Padhraic Smyth Department of Computer Science University of California, Irvine.

Probabilistic Model and Real World Data
P(Data | Parameters): generative model, probability
P(Parameters | Data): inference, statistics

Part 1: Review of Probability

Notation and Definitions
X is a random variable
– Lower-case x is some possible value for X
– "X = x" is a logical proposition: that X takes value x
– There is uncertainty about the value of X, e.g., X is the Dow Jones index at 5pm tomorrow
p(X = x) is the probability that the proposition X = x is true
– often shortened to p(x)
If the set of possible x's is finite, we have a probability distribution, and Σ_x p(x) = 1
If the set of possible x's is infinite, p(x) is a density function, and p(x) integrates to 1 over the range of X

Example
Let X be the Dow Jones Index (DJI) at 5pm Monday August 22nd (tomorrow)
X can take real values from 0 to some large number
p(x) is a density representing our uncertainty about X
– This density could be constructed from historical data
– After 5pm, p(x) = 1 for some value of x (no uncertainty), once we hear from Wall Street what x is

Probability as Degree of Belief
Different agents can have different p(x)'s
– Your p(x) and the p(x) of a Wall Street expert might be quite different
– OR: if we were on vacation and did not have access to stock market information, we would still be uncertain about p(x) after 5pm
So we should really think of p(x) as p(x | B_I)
– where B_I is the background information available to agent I
– (we will drop the explicit conditioning on B_I in the notation)
Thus, p(x) represents the degree of belief that agent I has in proposition x, conditioned on the available background information

Comments on Degree of Belief
Different agents can have different probability models
– There is not necessarily a "correct" p(x)
– Why? Because p(x) is a model built on whatever assumptions or background information we use
– Naturally leads to the notion of updating: p(x | B_I) -> p(x | B_I, C_I)
This is the subjective Bayesian interpretation of probability
– Generalizes other interpretations (such as frequentist)
– Can be used in cases where frequentist reasoning is not applicable
– We will use "degree of belief" as our interpretation of p(x) in this tutorial
Note!
– Degree of belief is just our semantic interpretation of p(x)
– The mathematics of probability (e.g., Bayes rule) remain the same regardless of our semantic interpretation

Multiple Variables
p(x, y, z)
– Probability that X=x AND Y=y AND Z=z
– Possible values: cross-product of X × Y × Z
– e.g., if X, Y, Z each take 10 possible values, then x, y, z can take 10^3 possible values, and p(x,y,z) is a 3-dimensional array/table defining 10^3 probabilities
Note the exponential increase as we add more variables
– e.g., if X, Y, Z are all real-valued, then x, y, z live in a 3-dimensional vector space, and p(x,y,z) is a positive function defined over this space that integrates to 1

Conditional Probability
p(x | y, z)
– Probability of x given that Y=y and Z=z
– Could be hypothetical, e.g., "if Y=y and if Z=z", or observational, e.g., we observed values y and z
– Can also have p(x, y | z), etc.
– "All probabilities are conditional probabilities"
Computing conditional probabilities is the basis of many prediction and learning problems, e.g.,
– p(DJI tomorrow | DJI index last week)
– expected value of [DJI tomorrow | DJI index last week]
– most likely value of parameter θ given observed data

Computing Conditional Probabilities
Variables A, B, C, D
– All distributions of interest relating A, B, C, D can be computed from the full joint distribution p(a,b,c,d)
Examples, using the Law of Total Probability
– p(a) = Σ_{b,c,d} p(a, b, c, d)
– p(c,d) = Σ_{a,b} p(a, b, c, d)
– p(a,c | d) = Σ_{b} p(a, b, c | d), where p(a, b, c | d) = p(a,b,c,d) / p(d)
These are standard probability manipulations; however, we will see how to use them to make inferences about parameters and unobserved variables, given data
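To make these manipulations concrete, here is a minimal sketch in Python/NumPy (not from the original slides) that stores a small joint distribution p(a,b,c,d) as a 4-dimensional table and computes the quantities above by summing over axes; the variable sizes and numbers are invented.

```python
import numpy as np

# Hypothetical joint distribution over A, B, C, D, each with 3 values.
# Fill with random positive numbers and normalize so the table sums to 1.
rng = np.random.default_rng(0)
joint = rng.random((3, 3, 3, 3))
joint /= joint.sum()                      # p(a, b, c, d)

# Law of Total Probability: marginalize by summing out variables.
p_a = joint.sum(axis=(1, 2, 3))           # p(a)   = sum_{b,c,d} p(a,b,c,d)
p_cd = joint.sum(axis=(0, 1))             # p(c,d) = sum_{a,b}   p(a,b,c,d)

# Conditioning: p(a, b, c | d) = p(a, b, c, d) / p(d)
p_d = joint.sum(axis=(0, 1, 2))           # p(d)
p_abc_given_d = joint / p_d               # broadcasting divides each d-slice by p(d)
p_ac_given_d = p_abc_given_d.sum(axis=1)  # p(a, c | d) = sum_b p(a, b, c | d)

# Each conditional distribution should sum to 1 over (a, c) for every d.
assert np.allclose(p_ac_given_d.sum(axis=(0, 1)), 1.0)
```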

Two Practical Problems
(Assume for simplicity that each variable takes K values)
Problem 1: Computational Complexity
– Conditional probability computations scale as O(K^N), where N is the number of variables being summed over
Problem 2: Model Specification
– To specify a joint distribution we need a table of O(K^N) numbers
– Where do these numbers come from?

Two Key Ideas
Problem 1: Computational Complexity
– Idea: graphical models. Structured probability models lead to tractable inference
Problem 2: Model Specification
– Idea: probabilistic learning. General principles for learning from data

Part 2: Graphical Models

"…probability theory is more fundamentally concerned with the structure of reasoning and causation than with numbers."
– Glenn Shafer and Judea Pearl, Introduction to Readings in Uncertain Reasoning, Morgan Kaufmann, 1990

Conditional Independence
A is conditionally independent of B given C iff p(a | b, c) = p(a | c)
(this also implies that B is conditionally independent of A given C)
In words: B provides no information about A if the value of C is known
Example:
– a = "reading ability"
– b = "height"
– c = "age"
Note that conditional independence does not imply marginal independence
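As a numerical illustration (not from the slides), the following sketch builds a joint distribution in which A and B are conditionally independent given C and checks both claims; all sizes and numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model in which A and B are conditionally independent given C:
# p(a, b, c) = p(c) p(a | c) p(b | c).  Sizes and numbers are made up.
p_c = np.array([0.3, 0.7])                                                 # p(c)
p_a_given_c = rng.random((4, 2)); p_a_given_c /= p_a_given_c.sum(axis=0)   # p(a|c)
p_b_given_c = rng.random((3, 2)); p_b_given_c /= p_b_given_c.sum(axis=0)   # p(b|c)

# Joint table indexed [a, b, c]
joint = p_a_given_c[:, None, :] * p_b_given_c[None, :, :] * p_c[None, None, :]

# Check conditional independence: p(a | b, c) == p(a | c) for all a, b, c
p_bc = joint.sum(axis=0)                           # p(b, c)
p_a_given_bc = joint / p_bc                        # p(a | b, c)
assert np.allclose(p_a_given_bc, p_a_given_c[:, None, :])

# But marginal independence generally fails: p(a, b) != p(a) p(b)
p_ab = joint.sum(axis=2)
p_a, p_b = joint.sum(axis=(1, 2)), joint.sum(axis=(0, 2))
print(np.allclose(p_ab, np.outer(p_a, p_b)))       # typically False
```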

Graphical Models
Represent dependency structure with a directed graph
– Node = random variable
– Edges encode dependencies; absence of an edge -> conditional independence
– Directed and undirected versions
Why is this useful?
– A language for communication
– A language for computation
Origins:
– Wright, 1920's
– Independently developed by Spiegelhalter and Lauritzen in statistics and Pearl in computer science in the late 1980's

Examples of 3-way Graphical Models
(three nodes A, B, C with no edges)
Marginal independence: p(A,B,C) = p(A) p(B) p(C)

Examples of 3-way Graphical Models
(edges A -> B and A -> C)
Conditionally independent effects: p(A,B,C) = p(B|A) p(C|A) p(A)
B and C are conditionally independent given A
e.g., A is a disease, and we model B and C as conditionally independent symptoms given A

Examples of 3-way Graphical Models
(edges A -> C and B -> C)
Independent causes: p(A,B,C) = p(C|A,B) p(A) p(B)

Examples of 3-way Graphical Models
(edges A -> B and B -> C)
Markov dependence: p(A,B,C) = p(C|B) p(B|A) p(A)

Real-World Example: Monitoring Intensive-Care Patients
37 variables, 509 parameters, instead of 2^37
(ALARM network; figure courtesy of Kevin Murphy / Nir Friedman)

Directed Graphical Models
(graph with edges A -> C and B -> C)
p(A,B,C) = p(C|A,B) p(A) p(B)

Directed Graphical Models
(graph with edges A -> C and B -> C)
p(A,B,C) = p(C|A,B) p(A) p(B)
In general, p(X_1, X_2, ..., X_N) = ∏_i p(X_i | parents(X_i))

Directed Graphical Models
(graph with edges A -> C and B -> C)
p(A,B,C) = p(C|A,B) p(A) p(B)
In general, p(X_1, X_2, ..., X_N) = ∏_i p(X_i | parents(X_i))
The probability model has a simple factored form
Directed edges => direct dependence
Absence of an edge => conditional independence
Also known as belief networks, Bayesian networks, causal networks
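A minimal sketch (not from the slides) of how this factored form is used in practice: each node stores a conditional probability table given its parents, and any joint probability is a product of table lookups. The tables below are invented.

```python
import numpy as np

# Evaluate p(A, B, C) = p(C | A, B) p(A) p(B) from its conditional tables.
# All numbers are invented; A, B, C are binary.
p_A = np.array([0.6, 0.4])                  # p(A)
p_B = np.array([0.7, 0.3])                  # p(B)
p_C_given_AB = np.array([                   # p(C | A, B), indexed [a, b, c]
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.3, 0.7], [0.2, 0.8]],
])

def joint(a: int, b: int, c: int) -> float:
    """p(A=a, B=b, C=c) as a product over 'variable given its parents'."""
    return p_A[a] * p_B[b] * p_C_given_AB[a, b, c]

# The factored joint is a proper distribution: it sums to 1 over all states.
total = sum(joint(a, b, c) for a in range(2) for b in range(2) for c in range(2))
assert np.isclose(total, 1.0)
print(joint(1, 0, 1))   # probability of one particular configuration
```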

Reminders from Probability….
Law of Total Probability: P(a) = Σ_b P(a, b) = Σ_b P(a | b) P(b)
– Conditional version: P(a|c) = Σ_b P(a, b | c) = Σ_b P(a | b, c) P(b|c)
Factorization or Chain Rule
– P(a, b, c, d) = P(a | b, c, d) P(b | c, d) P(c | d) P(d), or
  = P(b | a, c, d) P(c | a, d) P(d | a) P(a), or
  = …..

Graphical Models for Computation
(ALARM network figure repeated)
Say we want to compute P(BP | Press)
Law of total probability: we must sum over all other variables, which is exponential in the number of variables
Factorization: the joint distribution factors into smaller tables
We can now sum over the smaller tables, which can reduce complexity dramatically

Example
(a tree-structured graph on nodes A–G, with edges D -> B, D -> E, B -> A, B -> C, E -> F, E -> G)

(same graph as above)
p(A, B, C, D, E, F, G) = ∏ p(variable | parents)
= p(A|B) p(C|B) p(B|D) p(F|E) p(G|E) p(E|D) p(D)

Example
(same graph, with C = c and G = g observed; lowercase letters denote observed values)
Say we want to compute p(a | c, g)

Example
Direct calculation: p(a | c, g) = Σ_{b,d,e,f} p(a, b, d, e, f | c, g)
Complexity of the sum is O(K^4)

Example
Reordering (using the factorization): Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) Σ_f p(e,f | g)

Example
Reordering: Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) Σ_f p(e,f | g), where the innermost sum gives Σ_f p(e,f | g) = p(e|g)

Example
Reordering: Σ_b p(a|b) Σ_d p(b|d,c) Σ_e p(d|e) p(e|g), where Σ_e p(d|e) p(e|g) = p(d|g)

Example
Reordering: Σ_b p(a|b) Σ_d p(b|d,c) p(d|g), where Σ_d p(b|d,c) p(d|g) = p(b|c,g)

Example
Reordering: Σ_b p(a|b) p(b|c,g) = p(a|c,g)
Complexity is O(K), compared to O(K^4)
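The following sketch (not from the slides) reproduces this calculation numerically for the example graph, comparing brute-force summation over the full joint with the reordered, factored computation; the conditional tables and observed values are invented.

```python
import numpy as np

# joint p(a,b,c,d,e,f,g) = p(a|b) p(c|b) p(b|d) p(f|e) p(g|e) p(e|d) p(d),
# and we want p(a | c, g) for observed values c and g, with K values per variable.
K = 3
rng = np.random.default_rng(2)
def cpt(*shape):
    """Random conditional table, normalized over its first axis (the child variable)."""
    t = rng.random(shape)
    return t / t.sum(axis=0)

pA_B, pC_B, pB_D = cpt(K, K), cpt(K, K), cpt(K, K)   # p(a|b), p(c|b), p(b|d)
pF_E, pG_E, pE_D = cpt(K, K), cpt(K, K), cpt(K, K)   # p(f|e), p(g|e), p(e|d)
pD = cpt(K)                                          # p(d)
c_obs, g_obs = 0, 2                                  # observed values of C and G

# Brute force: sum the full joint over b, d, e, f  (O(K^4) terms for each a)
p_acg = np.zeros(K)
for a in range(K):
    for b in range(K):
        for d in range(K):
            for e in range(K):
                for f in range(K):
                    p_acg[a] += (pA_B[a, b] * pC_B[c_obs, b] * pB_D[b, d] *
                                 pF_E[f, e] * pG_E[g_obs, e] * pE_D[e, d] * pD[d])
brute = p_acg / p_acg.sum()                          # normalize to get p(a | c, g)

# Elimination: push each sum inward, creating small intermediate tables
m_f = pF_E.sum(axis=0)                               # sum_f p(f|e), a vector over e
m_e = pE_D.T @ (pG_E[g_obs] * m_f)                   # m_e[d] = sum_e p(e|d) p(g|e) m_f[e]
m_d = pB_D @ (pD * m_e)                              # m_d[b] = sum_d p(b|d) p(d) m_e[d]
m_b = pA_B @ (pC_B[c_obs] * m_d)                     # m_b[a] = sum_b p(a|b) p(c|b) m_d[b]
fast = m_b / m_b.sum()

assert np.allclose(brute, fast)
```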

Graphs with "loops"
The message passing algorithm does not work when there are multiple paths between two nodes

Graphs with "loops"
General approach: "cluster" variables together to convert the graph to a tree

Reduce to a Tree
(cluster B and E into a single node, giving a tree over D, {B, E}, A, C, F, G)

Probability Calculations on Graphs
General algorithms exist, beyond trees
– Complexity is typically O(m^(number of parents)), where m = arity of each node
– If single parents (e.g., a tree) -> O(m)
– The sparser the graph, the lower the complexity
The technique can be "automated"
– i.e., a fully general algorithm for arbitrary graphs
– For continuous variables: replace sums with integrals
– For identification of most likely values: replace sums with the max operator
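As a rough illustration of the last point (not from the slides), the same elimination pattern computes the probability of the most likely configuration if sums are replaced by max; the sketch below uses an invented three-node chain A -> B -> C.

```python
import numpy as np

# "Replace sum with max": find the probability of the most likely configuration
# of the chain A -> B -> C by pushing max inward, exactly as the sums were above.
rng = np.random.default_rng(3)
K = 4
pA = rng.random(K); pA /= pA.sum()
pB_A = rng.random((K, K)); pB_A /= pB_A.sum(axis=0)   # p(b|a), indexed [b, a]
pC_B = rng.random((K, K)); pC_B /= pC_B.sum(axis=0)   # p(c|b), indexed [c, b]

# Brute force over all K^3 configurations
brute = max(pA[a] * pB_A[b, a] * pC_B[c, b]
            for a in range(K) for b in range(K) for c in range(K))

# Max-product elimination: max_{a,b,c} p(a) p(b|a) p(c|b)
m_c = pC_B.max(axis=0)                        # m_c[b] = max_c p(c|b)
m_b = (pB_A * m_c[:, None]).max(axis=0)       # m_b[a] = max_b p(b|a) m_c[b]
fast = (pA * m_b).max()

assert np.isclose(brute, fast)
```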

Part 3: Learning with Graphical Models
Further Reading:
– M. Jordan, Graphical models, Statistical Science: Special Issue on Bayesian Statistics, vol. 19, no. 1, Feb. 2004
– A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis (2nd ed.), Chapman and Hall, 2004

Probabilistic Model and Real World Data
P(Data | Parameters): generative model, probability
P(Parameters | Data): inference, statistics

The Likelihood Function
Likelihood = p(data | parameters) = p(D | θ) = L(θ)
The likelihood tells us how likely the observed data are, conditioned on a particular setting of the parameters
Details
– Constants that do not involve θ can be dropped in defining L(θ)
– Often easier to work with log L(θ)

Comments on the Likelihood Function
Constructing a likelihood function L(θ) is the first step in probabilistic modeling
The likelihood function implicitly assumes an underlying probabilistic model M with parameters θ
L(θ) connects the model to the observed data
Graphical models provide a useful language for constructing likelihoods

Binomial Likelihood
Binomial model
– n memoryless trials, 2 outcomes
– probability θ of success at each trial
Observed data
– r successes in n trials
– Defines a likelihood: L(θ) = p(D | θ) = p(successes) p(non-successes) = θ^r (1-θ)^(n-r)
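A small sketch (not from the slides) that evaluates this likelihood on a grid of θ values and confirms that it peaks at the maximum likelihood estimate r/n; the counts are invented.

```python
import numpy as np

# Binomial (log-)likelihood and its maximum: r successes in n trials.
n, r = 20, 13

def log_likelihood(theta: np.ndarray) -> np.ndarray:
    """log L(theta) = r log(theta) + (n - r) log(1 - theta), dropping constants."""
    return r * np.log(theta) + (n - r) * np.log(1.0 - theta)

theta_grid = np.linspace(0.001, 0.999, 999)
ll = log_likelihood(theta_grid)

# The maximum likelihood estimate is r / n, which matches the grid maximum.
theta_ml = theta_grid[np.argmax(ll)]
print(theta_ml, r / n)   # both close to 0.65
```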

Binomial Likelihood Examples

Multinomial Likelihood
Multinomial model
– n memoryless trials, K outcomes
– Probability vector θ = (θ_1, ..., θ_K) for the outcomes at each trial
Observed data
– n_j outcomes of type j in n trials
– Defines a likelihood: L(θ) = p(D | θ) = ∏_j θ_j^{n_j}
– Maximum likelihood estimates: θ̂_j = n_j / n
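For illustration (not from the slides), a short sketch that draws invented multinomial data, forms the counts n_j, and computes the maximum likelihood estimates n_j / n.

```python
import numpy as np

# Invented data: n draws from a K-outcome distribution.
rng = np.random.default_rng(4)
K, n = 5, 200
true_theta = np.array([0.1, 0.3, 0.2, 0.25, 0.15])
data = rng.choice(K, size=n, p=true_theta)     # observed outcomes w_1, ..., w_n

counts = np.bincount(data, minlength=K)        # n_j for each outcome j
theta_ml = counts / n                          # ML estimate: n_j / n

print("counts:", counts)
print("ML estimates:", theta_ml)               # close to true_theta for large n
```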

Graphical Model for Multinomial
(graph: a parameter node θ with a directed edge to each observed data node w_1, w_2, ..., w_n)
θ = [ p(w_1), p(w_2), ..., p(w_k) ]
Parameters θ; observed data w_1, ..., w_n

"Plate" Notation
(graph: parameter node θ pointing to node w_i inside a plate labeled i = 1:n)
Data = D = {w_1, ..., w_n}; model parameters θ
A plate (rectangle) indicates replicated nodes in a graphical model
Variables within a plate are conditionally independent given their parent

Learning in Graphical Models
(same plate model: θ -> w_i, i = 1:n)
Data = D = {w_1, ..., w_n}; model parameters θ
We can view learning in a graphical model as computing the most likely value of the parameter node given the data nodes

Maximum Likelihood (ML) Principle (R. Fisher, ~1922)
(plate model: θ -> w_i, i = 1:n; data = {w_1, ..., w_n}, model parameters θ)
L(θ) = p(Data | θ) = ∏_i p(w_i | θ)
Maximum Likelihood: θ_ML = arg max { Likelihood(θ) }
Select the parameters that make the observed data most likely

The Bayesian Approach to Learning
(plate model with an additional node α for the prior hyperparameters pointing to θ)
Prior(θ) = p(θ | α)
Fully Bayesian: p(θ | Data) = p(Data | θ) p(θ) / p(Data)
Maximum A Posteriori: θ_MAP = arg max { Likelihood(θ) × Prior(θ) }

Learning a Multinomial
Likelihood: same as before
Prior: p(θ) = Dirichlet(α_1, ..., α_K), proportional to ∏_j θ_j^{α_j - 1}
– Has mean E[θ_j] = α_j / Σ_k α_k
– α_j acts as a prior weight for θ_j
– Can set all α_j = α for a "uniform" prior

Dirichlet Shapes
(figure: example Dirichlet density shapes for different α settings)

Bayesian Learning
P(θ | D, α) is proportional to p(data | θ) p(θ | α)
= Dirichlet(n_1 + α_1, ..., n_K + α_K)
Posterior mean estimate: E[θ_j | D, α] = (n_j + α_j) / (n + Σ_k α_k)
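A sketch of this update (not from the slides), with invented counts and prior, comparing the ML estimate, the MAP estimate, and the posterior mean:

```python
import numpy as np

# Dirichlet-multinomial updating with invented counts and prior.
K = 5
alpha = np.full(K, 2.0)                        # "uniform" Dirichlet prior, alpha_j = 2
counts = np.array([12, 47, 31, 8, 2])          # n_j, observed counts for each outcome
n = counts.sum()

theta_ml = counts / n                                      # maximum likelihood
theta_map = (counts + alpha - 1) / (n + alpha.sum() - K)   # posterior mode (MAP)
theta_mean = (counts + alpha) / (n + alpha.sum())          # posterior mean

# The posterior itself is Dirichlet(counts + alpha); we can draw samples from it.
posterior_samples = np.random.default_rng(5).dirichlet(counts + alpha, size=1000)
print(theta_ml, theta_map, theta_mean, posterior_samples.mean(axis=0), sep="\n")
```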

Summary of Bayesian Learning
We can use graphical models to describe relationships between parameters and data
P(data | parameters) = likelihood function
P(parameters) = prior
– In applications such as text mining, the prior can be "uninformative", i.e., flat
– The prior can also be optimized for prediction (e.g., on validation data)
We can compute P(parameters | data, prior) or a "point estimate" (e.g., the posterior mode or mean)
Computation of posterior estimates can be computationally intractable; Monte Carlo techniques are often used