Unifying Variational and GBP / Learning Parameters of MNs / EM for BNs

Unifying Variational and GBP / Learning Parameters of MNs / EM for BNs
Graphical Models – 10708, Carlos Guestrin, Carnegie Mellon University, November 15th, 2006
Readings: K&F: 11.3, 11.5; Yedidia et al. paper from the class website; Jordan, Chapter 9; K&F: 16.2

Revisiting Mean-Field
Choice of Q: fully factorized, $Q(\mathbf{X}) = \prod_i Q_i(X_i)$
Optimization problem: maximize the energy functional $F[\tilde P, Q] = E_Q[\log \tilde P(\mathbf{X})] + H_Q(\mathbf{X})$ over this family (equivalently, minimize $KL(Q \| P)$)

Interpretation of the energy functional
Energy functional: $F[\tilde P, Q] = E_Q[\log \tilde P(\mathbf{X})] + H_Q(\mathbf{X})$
Exact if P = Q: then $F[\tilde P, P] = \log Z$, the log-partition function
View the problem as an approximation of the entropy term $H_Q(\mathbf{X})$

Entropy of a tree distribution
[Figure: the student network — Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy]
Entropy term: $H_Q(\mathbf{X})$
Joint distribution of a tree: $P(\mathbf{X}) = \prod_{(i,j)\in T} P(x_i, x_j) \,/\, \prod_i P(x_i)^{d_i - 1}$
Decomposing the entropy term: $H(P) = \sum_{(i,j)\in T} H(X_i, X_j) - \sum_i (d_i - 1)\, H(X_i)$, where $d_i$ is the number of neighbors of $X_i$
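The decomposition above can be evaluated from marginals alone. Below is a minimal sketch (the function names and data structures are my own, not course code) that computes the tree entropy from node and edge marginals; applied to the edge set of a loopy graph, the same formula gives the Bethe entropy approximation on the next slide.

```python
import numpy as np

def entropy(p):
    """Entropy of a (possibly multi-dimensional) marginal table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def tree_entropy(node_marginals, edge_marginals):
    """node_marginals: {i: 1-d array}; edge_marginals: {(i, j): 2-d array}."""
    degree = {i: 0 for i in node_marginals}
    H = 0.0
    for (i, j), p_ij in edge_marginals.items():
        degree[i] += 1
        degree[j] += 1
        H += entropy(p_ij)                    # + H(X_i, X_j) for each edge
    for i, p_i in node_marginals.items():
        H -= (degree[i] - 1) * entropy(p_i)   # - (d_i - 1) H(X_i) for each node
    return H
```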

Loopy BP & Bethe approximation
[Figure: the student network as a Markov network]
Energy functional: $F[\tilde P, Q] = E_Q[\log \tilde P(\mathbf{X})] + H_Q(\mathbf{X})$
Bethe approximation of the free energy: use the tree-entropy decomposition even on a loopy graph, $H_{\text{Bethe}}(Q) = \sum_{(i,j)\in E} H(X_i, X_j) - \sum_i (d_i - 1)\, H(X_i)$
Theorem: if loopy BP converges, the resulting edge and node beliefs are a stationary point (usually a local maximum) of the Bethe free energy!

GBP & Kikuchi approximation
[Figure: generalized cluster graph over the student network, with clusters CD, DIG, GSI, GJSL, HGJ]
Exact free energy: junction tree
Bethe free energy: tree-style entropy on the loopy graph
Kikuchi approximation: a generalized cluster graph gives a spectrum from Bethe to the exact entropy, with entropy terms weighted by counting numbers (see Yedidia et al.)
Theorem: if GBP converges, the resulting cluster beliefs over the $C_i$ are a stationary point (usually a local maximum) of the Kikuchi free energy!

What you need to know about GBP
Spectrum between loopy BP and junction trees: more computation, but typically better answers
If the cluster graph satisfies the running intersection property (RIP), the equations are very simple
In the general setting the equations are slightly trickier, but not hard
Relates to variational methods: corresponds to local optima of an approximate version of the energy functional

Announcements
Tomorrow's recitation: Khalid on learning Markov networks

Learning parameters of a BN
[Figure: the student network]
Log likelihood decomposes: $\ell(\theta : D) = \sum_j \sum_i \log P(x_i(j) \mid \mathbf{pa}_{X_i}(j))$
Learn each CPT independently
Use counts: $\hat\theta_{x_i \mid \mathbf{pa}_{X_i}} = \mathrm{Count}(x_i, \mathbf{pa}_{X_i}) \,/\, \mathrm{Count}(\mathbf{pa}_{X_i})$
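A hedged sketch of "learn each CPT independently, use counts" (the function and argument names are assumptions for illustration, not the course's code):

```python
import numpy as np

def mle_cpt(data, child, parents, card):
    """MLE of P(child | parents) from fully observed integer-coded data.

    data: (m, n) array of assignments; child/parents: column indices;
    card: list of variable cardinalities."""
    shape = [card[p] for p in parents] + [card[child]]
    counts = np.zeros(shape)
    for row in data:
        counts[tuple(row[p] for p in parents) + (row[child],)] += 1.0
    totals = counts.sum(axis=-1, keepdims=True)      # Count(pa)
    return counts / np.maximum(totals, 1.0)          # Count(x, pa) / Count(pa)
```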

Log likelihood for an MN
[Figure: the student network as a Markov network]
Log likelihood of the data: $\ell(\theta : D) = \sum_j \sum_c \log \phi_c(\mathbf{x}_c(j)) - m \log Z(\theta)$

Log likelihood doesn't decompose for MNs
[Figure: the student network as a Markov network]
A concave problem: can find the global optimum!!
But the term $\log Z$ doesn't decompose!!
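To make the coupling concrete, here is a tiny brute-force sketch (toy code, exponential in the number of variables, for illustration only): the sum of clique terms decomposes, but $\log Z$ sums over all joint assignments and ties every potential together.

```python
import numpy as np
from itertools import product

def mn_log_likelihood(data, cliques, potentials, cards):
    """data: iterable of full assignments (tuples of ints);
    cliques: list of variable-index tuples; potentials: list of arrays,
    one table per clique; cards: cardinality of each variable."""
    def unnorm(x):
        # unnormalized probability: product of clique potentials
        return np.prod([potentials[i][tuple(x[v] for v in c)]
                        for i, c in enumerate(cliques)])
    # log Z by explicit enumeration -- the part that requires inference
    log_Z = np.log(sum(unnorm(x) for x in product(*[range(k) for k in cards])))
    return sum(np.log(unnorm(x)) for x in data) - len(data) * log_Z
```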

Derivative of the log likelihood for MNs
[Figure: the student network as a Markov network]
Derivative with respect to a log-potential entry: $\partial \ell / \partial \log \phi_c(\mathbf{x}_c) = \mathrm{Count}_D(\mathbf{x}_c) - m\, P_\theta(\mathbf{x}_c)$
Setting the derivative to zero: at the optimum, the empirical clique marginals match the model clique marginals
Can optimize using gradient ascent
Let's look at a simpler solution

Iterative Proportional Fitting (IPF)
[Figure: the student network as a Markov network]
Setting the derivative to zero: $\hat P(\mathbf{x}_c) = P_\theta(\mathbf{x}_c)$ for every clique $c$
Fixed-point equation: $\phi_c^{(t+1)}(\mathbf{x}_c) = \phi_c^{(t)}(\mathbf{x}_c)\, \hat P(\mathbf{x}_c) / P^{(t)}(\mathbf{x}_c)$
Iterate and converge to the optimal parameters
Each iteration must compute the model marginals $P^{(t)}(\mathbf{x}_c)$, which requires inference
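A minimal sketch of one IPF sweep. The helper model_marginal is an assumption; in practice it would run inference (e.g., a junction tree) to obtain the current model marginal over each clique.

```python
import numpy as np

def ipf_sweep(potentials, empirical_marginals, model_marginal):
    """potentials / empirical_marginals: {clique: array};
    model_marginal(potentials, clique): current P_theta over that clique."""
    for c in potentials:
        p_model = model_marginal(potentials, c)            # requires inference
        ratio = empirical_marginals[c] / np.maximum(p_model, 1e-12)
        potentials[c] = potentials[c] * ratio              # fixed-point update
    return potentials
```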

Log-linear Markov network (most common representation)
A feature is some function $\phi[D]$ of a subset of variables $D$, e.g., an indicator function
A log-linear model over a Markov network H:
- a set of features $\phi_1[D_1], \dots, \phi_k[D_k]$, where each $D_i$ is a subset of a clique in H; two $\phi$'s can be over the same variables
- a set of weights $w_1, \dots, w_k$, usually learned from data
- $P(\mathbf{X}) = \frac{1}{Z} \exp\left( \sum_{i=1}^k w_i\, \phi_i[D_i] \right)$

Learning params for log-linear models 1 – Generalized Iterative Scaling (GIS)
IPF generalizes easily if: $\phi_i(\mathbf{x}) \ge 0$ and $\sum_i \phi_i(\mathbf{x}) = 1$
Update rule: $w_i^{(t+1)} = w_i^{(t)} + \log \left( \hat E[\phi_i] \,/\, E_{\theta^{(t)}}[\phi_i] \right)$
Must compute the model expectations $E_{\theta^{(t)}}[\phi_i]$, which requires inference
If the conditions are violated, the equations are not so simple… c.f. Improved Iterative Scaling [Berger '97]
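A hedged sketch of one GIS update under the conditions above ($\phi_i(\mathbf{x}) \ge 0$, $\sum_i \phi_i(\mathbf{x}) = 1$). The helper expected_features is an assumption; it would run inference in the current model to get $E_{\theta^{(t)}}[\phi_i]$.

```python
import numpy as np

def gis_step(w, empirical_feat, expected_features):
    """w: current weights; empirical_feat[i]: average of phi_i over the data;
    expected_features(w): model expectations E_theta[phi_i] (via inference)."""
    model_feat = expected_features(w)
    return w + np.log(np.maximum(empirical_feat, 1e-12) /
                      np.maximum(model_feat, 1e-12))
```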

Learning params for log-linear models 2 – Gradient Ascent
Log-likelihood of the data: $\ell(\mathbf{w} : D) = \sum_j \sum_i w_i\, \phi_i(\mathbf{x}(j)) - m \log Z(\mathbf{w})$
Compute the derivative, $\partial \ell / \partial w_i = \sum_j \phi_i(\mathbf{x}(j)) - m\, E_{\mathbf{w}}[\phi_i]$, and optimize, usually with conjugate gradient ascent
You will do an example in your homework!
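A sketch of the gradient and a plain gradient-ascent loop (the slide suggests conjugate gradient in practice; fixed-step ascent is shown only for brevity, and the helper expected_features is again an assumed inference routine):

```python
import numpy as np

def log_likelihood_grad(w, feature_counts, m, expected_features):
    """d l / d w_i = sum_j phi_i(x_j) - m * E_theta[phi_i]."""
    return feature_counts - m * expected_features(w)

def gradient_ascent(w, feature_counts, m, expected_features,
                    step=0.1, iters=100):
    for _ in range(iters):
        w = w + step * log_likelihood_grad(w, feature_counts, m,
                                           expected_features)
    return w
```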

What you need to know about learning MN parameters
BN parameter learning is easy; MN parameter learning doesn't decompose!
Learning requires inference!
Apply gradient ascent or IPF iterations to obtain the optimal parameters
Applies to both tabular representations and log-linear models

Thus far, fully supervised learning
We have assumed fully supervised learning: every variable is observed in every data point
Many real problems have missing data: some variables are hidden or unobserved in some (or all) data points

The general learning problem with missing data
Marginal likelihood, where x is observed and z is missing: $\ell(\theta : D) = \sum_j \log P(\mathbf{x}_j \mid \theta) = \sum_j \log \sum_{\mathbf{z}} P(\mathbf{x}_j, \mathbf{z} \mid \theta)$

E-step
x is observed, z is missing
Compute the probability of the missing data given the current choice of $\theta^{(t)}$: $Q^{(t+1)}(\mathbf{z} \mid \mathbf{x}_j) = P(\mathbf{z} \mid \mathbf{x}_j, \theta^{(t)})$ for each $\mathbf{x}_j$
e.g., the probability computed during the classification step; corresponds to the "classification step" in K-means
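A concrete sketch of an E-step for a simple mixture model (unit-variance Gaussian components are an assumption chosen for brevity): compute the responsibilities $Q(z \mid \mathbf{x}_j) \propto P(z)\, P(\mathbf{x}_j \mid z, \theta)$ for every data point, the "soft" analogue of K-means' classification step.

```python
import numpy as np

def e_step(X, means, priors):
    """X: (m, d) data; means: (k, d) component means; priors: (k,).
    Returns responsibilities Q of shape (m, k)."""
    # log P(x_j | z) for unit-variance Gaussians (up to an additive constant)
    log_lik = -0.5 * ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    log_q = np.log(priors)[None, :] + log_lik
    log_q -= log_q.max(axis=1, keepdims=True)        # numerical stability
    q = np.exp(log_q)
    return q / q.sum(axis=1, keepdims=True)          # normalize per data point
```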

Jensen's inequality
Theorem: $\log \sum_z P(z)\, f(z) \;\ge\; \sum_z P(z) \log f(z)$

Applying Jensen's inequality
Use: $\log \sum_z P(z)\, f(z) \;\ge\; \sum_z P(z) \log f(z)$ (a sketch of the application follows below)
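A sketch of how the inequality is applied here (the standard EM step, with $Q(\mathbf{z} \mid \mathbf{x}_j)$ playing the role of $P(z)$ and the likelihood ratio playing the role of $f(z)$):

```latex
\log P(\mathbf{x}_j \mid \theta)
  = \log \sum_{\mathbf{z}} Q(\mathbf{z} \mid \mathbf{x}_j)\,
        \frac{P(\mathbf{x}_j, \mathbf{z} \mid \theta)}{Q(\mathbf{z} \mid \mathbf{x}_j)}
  \;\ge\; \sum_{\mathbf{z}} Q(\mathbf{z} \mid \mathbf{x}_j)\,
        \log \frac{P(\mathbf{x}_j, \mathbf{z} \mid \theta)}{Q(\mathbf{z} \mid \mathbf{x}_j)}
```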

The M-step maximizes a lower bound on weighted data
Lower bound from Jensen's (summed over data points, as above)
Corresponds to a weighted dataset:
<x1, z=1> with weight $Q^{(t+1)}(z{=}1 \mid \mathbf{x}_1)$
<x1, z=2> with weight $Q^{(t+1)}(z{=}2 \mid \mathbf{x}_1)$
<x1, z=3> with weight $Q^{(t+1)}(z{=}3 \mid \mathbf{x}_1)$
<x2, z=1> with weight $Q^{(t+1)}(z{=}1 \mid \mathbf{x}_2)$
<x2, z=2> with weight $Q^{(t+1)}(z{=}2 \mid \mathbf{x}_2)$
<x2, z=3> with weight $Q^{(t+1)}(z{=}3 \mid \mathbf{x}_2)$
…

The M-step
Maximization step: $\theta^{(t+1)} = \arg\max_\theta \sum_j \sum_{\mathbf{z}} Q^{(t+1)}(\mathbf{z} \mid \mathbf{x}_j) \log P(\mathbf{x}_j, \mathbf{z} \mid \theta)$
Use expected counts instead of counts: if learning requires $\mathrm{Count}(\mathbf{x}, \mathbf{z})$, use $E_{Q^{(t+1)}}[\mathrm{Count}(\mathbf{x}, \mathbf{z})]$
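Continuing the mixture-model sketch from the E-step above (same assumed setup), the M-step simply re-estimates the parameters from the weighted dataset, i.e., from expected counts:

```python
import numpy as np

def m_step(X, Q):
    """X: (m, d) data; Q: (m, k) responsibilities from the E-step."""
    expected_counts = Q.sum(axis=0)                    # E_Q[Count(z)]
    priors = expected_counts / X.shape[0]
    means = (Q.T @ X) / expected_counts[:, None]       # weighted sample means
    return priors, means
```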

Convergence of EM
Define the potential function $F(\theta, Q)$ (one standard form is reconstructed below)
EM corresponds to coordinate ascent on F
Thus, it maximizes a lower bound on the marginal log likelihood
As seen in the Machine Learning class last semester
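One standard form of the potential function (the slide's own notation may differ slightly); the E-step maximizes F over Q with $\theta$ fixed, and the M-step maximizes F over $\theta$ with Q fixed:

```latex
F(\theta, Q)
  = \sum_j \sum_{\mathbf{z}} Q(\mathbf{z} \mid \mathbf{x}_j)
      \log \frac{P(\mathbf{x}_j, \mathbf{z} \mid \theta)}{Q(\mathbf{z} \mid \mathbf{x}_j)}
  = \ell(\theta : D)
    - \sum_j \mathrm{KL}\!\left( Q(\cdot \mid \mathbf{x}_j) \,\big\|\, P(\cdot \mid \mathbf{x}_j, \theta) \right)
```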

Data likelihood for BNs
[Figure: BN with nodes Flu, Allergy, Sinus, Headache, Nose]
Given the structure, the log likelihood of fully observed data: $\ell(\theta : D) = \sum_j \sum_i \log P(x_i(j) \mid \mathbf{pa}_{X_i}(j))$

Marginal likelihood
[Figure: BN with nodes Flu, Allergy, Sinus, Headache, Nose]
What if S (Sinus) is hidden?

Log likelihood for BNs with hidden data
[Figure: BN with nodes Flu, Allergy, Sinus, Headache, Nose]
Marginal likelihood, where O is observed and H is hidden: $\ell(\theta : D) = \sum_j \log \sum_{\mathbf{h}} P(\mathbf{o}_j, \mathbf{h} \mid \theta)$

E-step for BNs
[Figure: BN with nodes Flu, Allergy, Sinus, Headache, Nose]
The E-step computes the probability of the hidden vars h given o: $Q^{(t+1)}(\mathbf{h} \mid \mathbf{o}_j) = P(\mathbf{h} \mid \mathbf{o}_j, \theta^{(t)})$
Corresponds to inference in the BN

The M-step for BNs
[Figure: BN with nodes Flu, Allergy, Sinus, Headache, Nose]
Maximization step: use expected counts instead of counts
If learning requires $\mathrm{Count}(\mathbf{h}, \mathbf{o})$, use $E_{Q^{(t+1)}}[\mathrm{Count}(\mathbf{h}, \mathbf{o})]$

M-step for each CPT
[Figure: BN with nodes Flu, Allergy, Sinus, Headache, Nose]
The M-step decomposes per CPT
Standard MLE: $\hat\theta_{x \mid \mathbf{pa}_X} = \mathrm{Count}(x, \mathbf{pa}_X) \,/\, \mathrm{Count}(\mathbf{pa}_X)$
The M-step uses expected counts instead: $\hat\theta_{x \mid \mathbf{pa}_X} = \mathrm{ExCount}(x, \mathbf{pa}_X) \,/\, \mathrm{ExCount}(\mathbf{pa}_X)$
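A minimal sketch of the per-CPT M-step: identical to the fully observed MLE code earlier, except that the count table now holds expected (soft) counts accumulated in the E-step. The array layout is an assumption.

```python
import numpy as np

def m_step_cpt(expected_counts):
    """expected_counts: E_Q[Count(X, Pa_X)]; leading axes index Pa_X,
    the last axis indexes X."""
    totals = expected_counts.sum(axis=-1, keepdims=True)   # E_Q[Count(Pa_X)]
    return expected_counts / np.maximum(totals, 1e-12)
```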

Computing expected counts
[Figure: BN with nodes Flu, Allergy, Sinus, Headache, Nose]
The M-step requires expected counts: for a set of vars A, we must compute $\mathrm{ExCount}(A = \mathbf{a})$
Some of A in example j will be observed: denote these by $A_O = \mathbf{a}_O(j)$
Some of A will be hidden: denote these by $A_H$
Use inference (the E-step computes the expected counts): $\mathrm{ExCount}^{(t+1)}(A_O = \mathbf{a}_O(j), A_H = \mathbf{a}_H) \leftarrow P(A_H = \mathbf{a}_H \mid A_O = \mathbf{a}_O(j), \theta^{(t)})$, accumulated over the examples j
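A hedged sketch of accumulating the expected counts over the dataset. The helper posterior is an assumption standing in for BN inference over the hidden variables; the table layout (observed axes first, hidden axes last) is likewise only for illustration.

```python
import numpy as np
from itertools import product

def expected_counts(examples, hidden_cards, posterior, shape):
    """examples: list of observed assignments a_O(j), each a tuple of ints;
    hidden_cards: cardinalities of the hidden vars A_H;
    posterior(obs): array over joint assignments to A_H given A_O = obs."""
    counts = np.zeros(shape)
    for obs in examples:
        p_h = posterior(obs)                              # E-step inference
        for a_h in product(*[range(c) for c in hidden_cards]):
            counts[obs + a_h] += p_h[a_h]                 # soft (expected) count
    return counts
```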

Data need not be hidden in the same way
[Figure: BN with nodes Flu, Allergy, Sinus, Headache, Nose]
When data is fully observed, a data point is a complete assignment to all the variables
When data is partially observed, the unobserved variables can be different for different data points (e.g., different variables missing in different records)
Same framework, just change the definition of expected counts: $\mathrm{ExCount}^{(t+1)}(A_O = \mathbf{a}_O(j), A_H = \mathbf{a}_H) \leftarrow P(A_H = \mathbf{a}_H \mid A_O = \mathbf{a}_O(j), \theta^{(t)})$

What you need to know: EM for Bayes nets
E-step: inference computes the expected counts; we only need expected counts over $X_i$ and $\mathbf{Pa}_{X_i}$
M-step: the expected counts are used to estimate the parameters
Hidden variables can change per data point
Use labeled and unlabeled data → some data points are complete, some include hidden variables