
Learning P-maps; Param. Learning
Graphical Models – 10708
Carlos Guestrin, Carnegie Mellon University
September 24th, 2008
Readings: K&F: 3.3, 3.4, 16.1, 16.2, 16.3

Perfect maps (P-maps)
I-maps are not unique and often not simple enough; define the "simplest" G that is an I-map for P:
- A BN structure G is a perfect map for a distribution P if I(P) = I(G)
Our goal:
- Find a perfect map!
- Must address equivalent BNs

Nonexistence of P-maps 1: XOR
(this is a hint for the homework)

Obtaining a P-map
Given the independence assertions that are true for P, assume that there exists a perfect map G*:
- Want to find G*
Many structures may encode the same independencies as G*; when are we done?
- Find all equivalent structures simultaneously!

I-Equivalence
Two graphs G1 and G2 are I-equivalent if I(G1) = I(G2)
Equivalence classes of BN structures:
- a mutually exclusive and exhaustive partition of graphs
How do we characterize these equivalence classes?

Skeleton of a BN
The skeleton of a BN structure G is an undirected graph over the same variables that has an edge X–Y for every X→Y or Y→X in G
(Little) Lemma: Two I-equivalent BN structures must have the same skeleton
[figure: example BN over nodes A–K]

What about V-structures?
V-structures are a key property of a BN structure
Theorem: If G1 and G2 have the same skeleton and the same V-structures, then G1 and G2 are I-equivalent
[figure: example BN over nodes A–K]

Same V-structures not necessary
Theorem: If G1 and G2 have the same skeleton and the same V-structures, then G1 and G2 are I-equivalent
Though sufficient, having the same V-structures is not necessary for I-equivalence

Immoralities & I-Equivalence
The key concept is not V-structures, but "immoralities" (unmarried parents):
- X → Z ← Y, with no arrow between X and Y
- Important pattern: X and Y independent given their parents, but not given Z
- (If an edge exists between X and Y, the V-structure is covered)
Theorem: G1 and G2 have the same skeleton and the same immoralities if and only if G1 and G2 are I-equivalent
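As an illustration of the theorem above, here is a minimal Python sketch (not from the slides) that tests I-equivalence of two DAGs by comparing skeletons and immoralities. The DAG representation (a dict mapping each node to its set of parents) and the helper names are my own assumptions for the example.

```python
from itertools import combinations

def skeleton(dag):
    """Undirected edge set {x, y} for every x -> y in the DAG.
    `dag` maps each node to the set of its parents."""
    return {frozenset({child, parent})
            for child, parents in dag.items() for parent in parents}

def immoralities(dag):
    """Triples (x, z, y) with x -> z <- y and no edge between x and y."""
    skel = skeleton(dag)
    result = set()
    for z, parents in dag.items():
        for x, y in combinations(sorted(parents), 2):
            if frozenset({x, y}) not in skel:
                result.add((x, z, y))
    return result

def i_equivalent(g1, g2):
    """Same skeleton and same immoralities  <=>  I-equivalent."""
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

# X -> Z <- Y is an immorality, so replacing it with a chain changes I(G):
g_a = {"X": set(), "Y": set(), "Z": {"X", "Y"}}   # X -> Z <- Y
g_b = {"X": set(), "Y": {"Z"}, "Z": {"X"}}        # X -> Z -> Y
print(i_equivalent(g_a, g_b))  # False
# Chains and forks over the same skeleton are I-equivalent:
g_c = {"X": set(), "Z": {"X"}, "Y": {"Z"}}        # X -> Z -> Y
g_d = {"Z": set(), "X": {"Z"}, "Y": {"Z"}}        # X <- Z -> Y
print(i_equivalent(g_c, g_d))  # True
```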

Obtaining a P-map
Given the independence assertions that are true for P:
- Obtain the skeleton
- Obtain the immoralities
From the skeleton and immoralities, obtain every (and any) BN structure from the equivalence class

Identifying the skeleton 1
When is there an edge between X and Y?
When is there no edge between X and Y?

Identifying the skeleton 2
Assume d is the max number of parents (d could be n). For each Xi and Xj:
- Eij ← true
- For each U ⊆ X – {Xi, Xj} with |U| ≤ d: if (Xi ⊥ Xj | U), then Eij ← false
- If Eij is true, add edge Xi – Xj to the skeleton
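A minimal Python sketch of this loop, following the bound |U| ≤ d stated on the slide and assuming a conditional-independence oracle `is_indep(xi, xj, U)` is available (e.g., backed by statistical tests on data); the function and variable names are mine, not from the slides.

```python
from itertools import combinations

def identify_skeleton(variables, is_indep, d):
    """Return the undirected edges {Xi, Xj} for which no conditioning set U
    of size <= d makes Xi and Xj conditionally independent.
    `is_indep(xi, xj, U)` is an assumed CI oracle over the distribution P."""
    skeleton = set()
    for xi, xj in combinations(variables, 2):
        others = [v for v in variables if v not in (xi, xj)]
        edge = True                          # E_ij <- true
        for size in range(d + 1):
            for U in combinations(others, size):
                if is_indep(xi, xj, set(U)):
                    edge = False             # E_ij <- false
                    break
            if not edge:
                break
        if edge:
            skeleton.add(frozenset({xi, xj}))
    return skeleton
```

With perfect independence information for a distribution that has a P-map, this recovers the skeleton of G*.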

Identifying immoralities
Consider X – Z – Y in the skeleton: when should it be an immorality?
Must be X → Z ← Y (immorality):
- when X and Y are never independent given U, if Z ∈ U
Must not be X → Z ← Y (not an immorality):
- when there exists U with Z ∈ U such that X and Y are independent given U
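Continuing the sketch above (same assumed CI oracle and naming), potential immoralities can be oriented by checking whether any separating set for X and Y contains Z:

```python
from itertools import combinations

def identify_immoralities(variables, skeleton, is_indep, d):
    """For every X - Z - Y in the skeleton with no X - Y edge, mark X -> Z <- Y
    if no conditioning set U containing Z makes X and Y independent."""
    immoralities = set()
    for z in variables:
        neighbors = [v for v in variables if frozenset({v, z}) in skeleton]
        for x, y in combinations(sorted(neighbors), 2):
            if frozenset({x, y}) in skeleton:
                continue  # X and Y adjacent: not a candidate immorality
            others = [v for v in variables if v not in (x, y)]
            separated_with_z = any(
                is_indep(x, y, set(U))
                for size in range(d + 1)
                for U in combinations(others, size)
                if z in U
            )
            if not separated_with_z:
                immoralities.add((x, z, y))
    return immoralities
```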

From immoralities and skeleton to BN structures
Represent the BN equivalence class as a partially directed acyclic graph (PDAG)
Immoralities force the direction of some other BN edges
The full (polynomial-time) procedure is described in the reading

What you need to know
Minimal I-map:
- every P has one, but usually many
Perfect map:
- better choice for BN structure
- not every P has one
- can find one (if it exists) by considering I-equivalence
- two structures are I-equivalent if they have the same skeleton and the same immoralities

Announcements
Recitation tomorrow – don't miss it!
No class on Monday

Review
Bayesian networks:
- compact representation for probability distributions
- exponential reduction in the number of parameters
- exploit independencies
Next – learn BNs:
- parameters
- structure
[figure: example BN – Flu, Allergy, Sinus, Headache, Nose]

Thumbtack – Binomial Distribution
P(Heads) = θ, P(Tails) = 1 – θ
Flips are i.i.d.:
- independent events
- identically distributed according to the Binomial distribution
Sequence D of m_H Heads and m_T Tails

Maximum Likelihood Estimation
Data: observed set D of m_H Heads and m_T Tails
Hypothesis: Binomial distribution
Learning θ is an optimization problem:
- what's the objective function?
MLE: choose θ that maximizes the probability of the observed data (see below)
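The formula on this slide did not transcribe; a standard statement, consistent with the counts m_H and m_T used here, is:

```latex
P(D \mid \theta) = \theta^{m_H}(1-\theta)^{m_T},
\qquad
\hat{\theta} = \arg\max_{\theta \in [0,1]} P(D \mid \theta)
             = \arg\max_{\theta}\; m_H \ln\theta + m_T \ln(1-\theta).
```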

Your first learning algorithm
Set the derivative to zero (see the derivation below):
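The derivation shown on the slide did not transcribe; the standard calculation, using the log-likelihood above, is:

```latex
\frac{d}{d\theta}\bigl[m_H \ln\theta + m_T \ln(1-\theta)\bigr]
  = \frac{m_H}{\theta} - \frac{m_T}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat{\theta}_{MLE} = \frac{m_H}{m_H + m_T}.
```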

Learning Bayes nets
Four settings: {known structure, unknown structure} × {fully observable data, missing data}
Data x^(1), …, x^(m) → structure, parameters (CPTs – P(Xi | Pa_Xi))

Learning the CPTs
Data: x^(1), …, x^(m)
For each discrete variable Xi:

Learning the CPTs
Data: x^(1), …, x^(m)
For each discrete variable Xi, estimate the CPT from counts – why is this the MLE? (see the sketch below)
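The estimator on these two slides did not transcribe; the standard count-based MLE for a tabular CPT, which the next slides justify, looks like the following sketch (the data format and function name are my own assumptions):

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """MLE of P(child | parents) from fully observed data.
    `data` is a list of dicts mapping variable name -> value.
    Returns {parent_assignment: {child_value: probability}}."""
    joint = Counter()   # counts of (parent assignment, child value)
    parent = Counter()  # counts of the parent assignment alone
    for x in data:
        u = tuple(x[p] for p in parents)
        joint[(u, x[child])] += 1
        parent[u] += 1
    return {
        u: {xc: joint[(u, xc)] / parent[u]
            for (uu, xc) in joint if uu == u}
        for u in parent
    }

# Example with variables from the review slide's Flu/Allergy/Sinus network:
data = [{"Flu": f, "Allergy": a, "Sinus": s}
        for f, a, s in [(1, 0, 1), (1, 1, 1), (0, 0, 0), (0, 1, 1), (0, 0, 0)]]
print(mle_cpt(data, "Sinus", ["Flu", "Allergy"]))
```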

Maximum likelihood estimation (MLE) of BN parameters – example
Given the structure, the log-likelihood of the data:
[figure: example BN – Flu, Allergy, Sinus, Nose]

Maximum likelihood estimation (MLE) of BN parameters – general case
Data: x^(1), …, x^(m)
Restriction: x^(j)[Pa_Xi] → the assignment to Pa_Xi in x^(j)
Given the structure, the log-likelihood of the data: (see below)
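The decomposition on the slide did not transcribe; the standard form, using the notation just defined, is:

```latex
\log P(D \mid \theta, G)
  = \sum_{j=1}^{m} \sum_{i=1}^{n} \log P\bigl(x_i^{(j)} \,\big|\, x^{(j)}[\mathrm{Pa}_{X_i}]\bigr)
  = \sum_{i=1}^{n} \underbrace{\sum_{j=1}^{m} \log P\bigl(x_i^{(j)} \,\big|\, x^{(j)}[\mathrm{Pa}_{X_i}]\bigr)}_{\text{one term per CPT}}.
```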

Taking derivatives of the MLE of BN parameters – general case

General MLE for a CPT
Take a CPT: P(X | U)
Log-likelihood term for this CPT, parameter θ_{X=x|U=u}: (see below)
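The result on this slide did not transcribe; writing Count(·) for empirical counts in D and maximizing subject to the normalization constraint on each conditional distribution, the standard closed form is:

```latex
\sum_{j=1}^{m} \log P\bigl(x^{(j)}[X] \,\big|\, x^{(j)}[U]\bigr)
  = \sum_{x,u} \mathrm{Count}(X{=}x, U{=}u)\,\log \theta_{X=x|U=u},
\qquad
\hat{\theta}^{MLE}_{X=x|U=u} = \frac{\mathrm{Count}(X{=}x, U{=}u)}{\mathrm{Count}(U{=}u)}.
```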

Can we really trust the MLE?
Which is better?
- 3 heads, 2 tails
- 30 heads, 20 tails
- 3×10^23 heads, 2×10^23 tails
Many possible answers; we need distributions over possible parameters

Bayesian Learning
Use Bayes' rule: P(θ | D) = P(D | θ) P(θ) / P(D)
Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)

Bayesian Learning for the Thumbtack
The likelihood function is simply Binomial: P(D | θ) = θ^m_H (1 – θ)^m_T
What about the prior?
- represent expert knowledge
- simple posterior form
Conjugate priors:
- closed-form representation of the posterior (more details soon)
- for the Binomial, the conjugate prior is the Beta distribution

Beta prior distribution – P(θ)
Likelihood function:
Posterior:

Posterior distribution
Prior: Beta
Data: m_H heads and m_T tails
Posterior distribution: (see below)
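The densities on the last two slides did not transcribe; the standard Beta–Binomial forms, writing the prior hyperparameters as β_H, β_T (an assumption about this deck's notation), are:

```latex
P(\theta) = \mathrm{Beta}(\beta_H, \beta_T) \propto \theta^{\beta_H - 1}(1-\theta)^{\beta_T - 1},
\qquad
P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)
  \propto \theta^{\,m_H + \beta_H - 1}(1-\theta)^{\,m_T + \beta_T - 1}
  \;\Longrightarrow\;
P(\theta \mid D) = \mathrm{Beta}(m_H + \beta_H,\; m_T + \beta_T).
```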

Conjugate prior
Given a likelihood function P(D | θ), a (parametric) prior of the form P(θ | α) is conjugate to the likelihood function if the posterior is of the same parametric family and can be written as:
- P(θ | α′), for some new set of parameters α′
Prior: Beta
Data: m_H heads and m_T tails (binomial likelihood)
Posterior distribution: Beta, as above

Using the Bayesian posterior
Posterior distribution: P(θ | D)
Bayesian inference:
- no longer a single parameter: predictions average over θ
- the integral is often hard to compute

Bayesian prediction of a new coin flip
Prior: Beta
Observed m_H heads, m_T tails – what is the probability that the (m+1)-th flip is heads? (see below)
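The answer on the slide did not transcribe; with the Beta posterior above, the standard predictive probability is:

```latex
P(x^{(m+1)} = H \mid D)
  = \int_0^1 \theta\, P(\theta \mid D)\, d\theta
  = \frac{m_H + \beta_H}{m_H + \beta_H + m_T + \beta_T}.
```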

Asymptotic behavior and equivalent sample size
A Beta prior is equivalent to extra thumbtack flips:
- as m → ∞, the prior is "forgotten"
- but for small sample sizes, the prior is important!
Equivalent sample size:
- prior parameterized by β_H, β_T, or by m′ (equivalent sample size) and the prior mean
- fix m′, change the prior mean
- fix the prior mean, change m′

Bayesian learning corresponds to smoothing
m = 0 ⇒ prior parameter
m → ∞ ⇒ MLE

Bayesian learning for the multinomial
What if you have a k-sided coin?
Likelihood function if multinomial:
- the conjugate prior for the multinomial is the Dirichlet
- observe m data points, m_i from assignment i; posterior and prediction: (see below)
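The formulas on this slide did not transcribe; the standard Dirichlet–multinomial forms, writing the Dirichlet hyperparameters as α_1, …, α_k (an assumed notation), are:

```latex
P(D \mid \theta) = \prod_{i=1}^{k} \theta_i^{\,m_i},
\qquad
P(\theta) = \mathrm{Dirichlet}(\alpha_1,\dots,\alpha_k) \propto \prod_{i=1}^{k} \theta_i^{\,\alpha_i - 1},
\qquad
P(\theta \mid D) = \mathrm{Dirichlet}(\alpha_1 + m_1,\dots,\alpha_k + m_k),
\qquad
P(x^{(m+1)} = i \mid D) = \frac{m_i + \alpha_i}{m + \sum_j \alpha_j}.
```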

Bayesian learning for a two-node BN
Parameters θ_X, θ_Y|X
Priors:
- P(θ_X):
- P(θ_Y|X):

Very important assumption on the prior: global parameter independence
Global parameter independence:
- the prior over parameters is the product of priors over the CPTs

Global parameter independence, d-separation, and local prediction
[figure: meta BN over Flu, Allergy, Sinus, Headache, Nose and their CPT parameters]
Independencies in the meta BN:
Proposition: for fully observable data D, if the prior satisfies global parameter independence, then the posterior over parameters also factorizes as a product over the CPTs

Within a CPT
Meta BN including CPT parameters:
- are θ_Y|X=t and θ_Y|X=f d-separated given D?
- are θ_Y|X=t and θ_Y|X=f independent given D?
- context-specific independence!
Posterior decomposes:

Priors for BN CPTs (more when we talk about structure learning)
Consider each CPT: P(X | U = u)
Conjugate prior:
- Dirichlet(α_X=1|U=u, …, α_X=k|U=u)
More intuitive:
- a "prior data set" D′ with equivalent sample size m′
- "prior counts" and prediction: (see the sketch below)
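A minimal Python sketch of the prior-counts view, mirroring the MLE sketch above (the naming and data format are my own assumptions): each Dirichlet hyperparameter acts as a pseudocount added to the observed counts, and prediction uses the smoothed ratio.

```python
from collections import Counter

def bayesian_cpt(data, child, parents, child_values, prior_count):
    """Posterior-predictive estimate of P(child | parents) with a symmetric
    Dirichlet prior: each (value, parent assignment) gets `prior_count` pseudocounts.
    Returns {parent_assignment: {child_value: probability}}."""
    joint = Counter()
    parent = Counter()
    for x in data:
        u = tuple(x[p] for p in parents)
        joint[(u, x[child])] += 1
        parent[u] += 1
    k = len(child_values)
    return {
        u: {xc: (joint[(u, xc)] + prior_count) / (parent[u] + k * prior_count)
            for xc in child_values}
        for u in parent
    }

# With prior_count = 1 this is Laplace smoothing; as the data grows,
# the estimate approaches the MLE (the prior is "forgotten").
data = [{"Flu": f, "Sinus": s} for f, s in [(1, 1), (1, 1), (0, 0)]]
print(bayesian_cpt(data, "Sinus", ["Flu"], child_values=[0, 1], prior_count=1))
```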

An example

What you need to know about parameter learning
MLE:
- the score decomposes according to CPTs
- optimize each CPT separately
Bayesian parameter learning:
- motivation for the Bayesian approach
- Bayesian prediction
- conjugate priors, equivalent sample size
- Bayesian learning ⇒ smoothing
Bayesian learning for BN parameters:
- global parameter independence
- decomposition of prediction according to CPTs
- decomposition within a CPT