Learning with Missing Data


Learning with Missing Data
Eran Segal, Weizmann Institute

Incomplete Data
Two forms of incompleteness: hidden variables and missing values.
Challenges:
- Foundational – is the learning task well defined?
- Computational – how can we learn with missing data?

Treating Missing Data
How should we treat missing data?
- Case I: A coin is tossed on a table; occasionally it drops and measurements are not taken.
  Sample sequence: H,T,?,?,T,?,H
  Here we can treat the missing data by ignoring it.
- Case II: A coin is tossed, but only heads are reported.
  Sample sequence: H,?,?,?,H,?,H
  Here we should treat the missing data by filling it in with Tails.
We need to consider the mechanism by which the data went missing.

Modeling the Data Missing Mechanism
- X = {X1,...,Xn} are the random variables of interest
- OX = {OX1,...,OXn} are observability variables, which are always observed
- Y = {Y1,...,Yn} are new random variables with Val(Yi) = Val(Xi) ∪ {?}
- Yi is a deterministic function of Xi and OXi:
  Yi = Xi when OXi = observed, and Yi = ? when OXi = hidden
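
A minimal sketch of this construction (value names and the helper function are illustrative, not from the slides), generating Y from X and OX for the two coin scenarios above:

```python
import random

def observe(x, o_x):
    # Y is a deterministic function of X and O_X:
    # the true value when observed, '?' when hidden.
    return x if o_x == "observed" else "?"

random.seed(0)
xs = [random.choice(["H", "T"]) for _ in range(7)]

# Case I (random missing values): O_X is independent of X.
o_case1 = [random.choice(["observed", "hidden"]) for _ in xs]
y_case1 = [observe(x, o) for x, o in zip(xs, o_case1)]

# Case II (deliberate missing values): O_X depends on X -- only heads are reported.
o_case2 = ["observed" if x == "H" else "hidden" for x in xs]
y_case2 = [observe(x, o) for x, o in zip(xs, o_case2)]

print(xs, y_case1, y_case2)
```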

Modeling Missing Data Mechanism
[Figure: two networks over X, OX, and Y, each annotated with its CPD parameters. Case I (random missing values): X → Y ← OX, with OX independent of X. Case II (deliberate missing values): the same structure with an additional edge X → OX, so whether X is observed depends on its value.]

Treating Missing Data
When can we ignore the missing data mechanism and focus only on the likelihood?
- If for every Xi, Ind(Xi ; OXi)
- Missing at Random (MAR) is sufficient: the probability that the value of Xi is missing is independent of its actual value, given the other observed values
- In both cases, the likelihood decomposes
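
A short derivation of why MAR lets us ignore the mechanism; here ψ denotes the parameters of the observability model (a symbol not used in the slides), o the observability pattern, and x_obs, x_hid the observed and unobserved entries of an instance:

\begin{align*}
P(y \mid \theta, \psi)
  &= \sum_{x_{hid}} P(x_{obs}, x_{hid} \mid \theta)\, P(o \mid x_{obs}, x_{hid}, \psi) \\
  &= P(o \mid x_{obs}, \psi) \sum_{x_{hid}} P(x_{obs}, x_{hid} \mid \theta) && \text{(MAR)} \\
  &= P(o \mid x_{obs}, \psi)\, P(x_{obs} \mid \theta)
\end{align*}

The log-likelihood therefore splits into a θ-term and a ψ-term, so maximizing over θ can ignore the missingness mechanism.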

Hidden (Latent) Variables
- Attempt to learn a model with hidden variables
- In this case, MAR always holds (the variable is always missing)
- Why should we care about unobserved variables? They can yield much more compact models.
[Figure: two networks over X1, X2, X3 and Y1, Y2, Y3. The network with a hidden variable H between the Xi and the Yj has 17 parameters; the network without H, connecting the Xi directly to the Yj, has 59 parameters.]

Hidden (Latent) Variables
- Hidden variables also appear in clustering
- Naïve Bayes model: the class (cluster) variable is hidden; the observed attributes are independent given the class
[Figure: a hidden Cluster variable with observed children X1, X2, ..., Xn, which may themselves have missing values.]

Likelihood for Complete Data
Example: network X → Y with parameters θx for P(X) and θy|x for P(Y|X), and a fully observed data table over X and Y.
The likelihood decomposes by variables and within CPDs:
L(θ : D) = Πm P(x[m] : θX) · Πm P(y[m] | x[m] : θY|X)
[Figure: the example data table and the CPTs P(X) (entries θx0, θx1) and P(Y|X) (entries θy0|x0, θy1|x0, θy0|x1, θy1|x1).]

Likelihood for Incomplete Data
Same network X → Y, but now some values in the data table are missing (marked '?').
- The likelihood does not decompose by variables
- The likelihood does not decompose within CPDs
- Computing the likelihood per instance requires inference! For example, if X is missing in instance m: P(y[m] : θ) = Σx P(x : θX) P(y[m] | x : θY|X)
[Figure: the data table with missing entries and the CPTs P(X) and P(Y|X).]
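
A small sketch for the two-node network X → Y above (parameter values are illustrative), contrasting the per-instance likelihood of a complete instance with that of an instance in which X is missing, which requires summing X out:

```python
# Parameters of the network X -> Y (illustrative values).
theta_X = {"x0": 0.6, "x1": 0.4}                      # P(X)
theta_Y_given_X = {"x0": {"y0": 0.7, "y1": 0.3},      # P(Y | X)
                   "x1": {"y0": 0.2, "y1": 0.8}}

def instance_likelihood(x, y):
    if x is not None:
        # Complete instance: product of local CPD entries.
        return theta_X[x] * theta_Y_given_X[x][y]
    # X missing: inference -- marginalize X out.
    return sum(theta_X[v] * theta_Y_given_X[v][y] for v in theta_X)

print(instance_likelihood("x0", "y1"))   # complete instance
print(instance_likelihood(None, "y1"))   # incomplete: P(y1) = sum_x P(x) P(y1 | x)
```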

Bayesian Estimation
[Figure: the Bayesian network used for parameter estimation, with parameter variables θX, θY|X=0, θY|X=1 as parents of the instance variables X[1], X[2], ..., X[M] and Y[1], Y[2], ..., Y[M].]
With incomplete data, the posteriors over the parameters are not independent.

Identifiability
- The likelihood can have multiple global maxima
- Example: we can rename the values of a hidden variable H; if H has two values, the likelihood has two global maxima
- With many hidden variables, there can be an exponential number of global maxima
- Multiple local and global maxima can also occur with missing data (not only hidden variables)
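
A quick numerical check of the renaming symmetry for a model H → Y with H hidden (the parameter values are arbitrary): swapping the labels of H, together with the corresponding rows of P(Y|H), leaves the marginal over the observed Y, and hence the likelihood of any data over Y, unchanged.

```python
def p_y(theta_H, theta_Y_given_H):
    # Marginal P(Y) with H hidden: sum_h P(h) P(y | h).
    return {y: sum(theta_H[h] * theta_Y_given_H[h][y] for h in theta_H)
            for y in ["y0", "y1"]}

theta_H = {"h0": 0.3, "h1": 0.7}
theta_Y_given_H = {"h0": {"y0": 0.9, "y1": 0.1},
                   "h1": {"y0": 0.4, "y1": 0.6}}

# Rename the values of H (swap h0 and h1 everywhere).
swapped_H = {"h0": theta_H["h1"], "h1": theta_H["h0"]}
swapped_Y = {"h0": theta_Y_given_H["h1"], "h1": theta_Y_given_H["h0"]}

print(p_y(theta_H, theta_Y_given_H))   # identical marginals over Y,
print(p_y(swapped_H, swapped_Y))       # hence identical likelihood on data over Y
```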

MLE from Incomplete Data
A nonlinear optimization problem.
Gradient Ascent:
- Follow the gradient of the likelihood w.r.t. the parameters
- Add line search and conjugate gradient methods to get fast convergence
[Figure: the likelihood surface L(D|θ) as a function of θ.]

MLE from Incomplete Data
A nonlinear optimization problem.
Expectation Maximization (EM):
- Use the "current point" to construct an alternative function (which is "nice")
- Guarantee: the maximum of the new function has a better score than the current point
[Figure: the likelihood surface L(D|θ) as a function of θ.]

MLE from Incomplete Data
A nonlinear optimization problem.
Both Gradient Ascent and EM:
- Find local maxima
- Require multiple restarts to approximate the global maximum
- Require costly computations (inference) in each iteration
[Figure: the likelihood surface L(D|θ) as a function of θ.]

Gradient Ascent
Theorem:
∂ log P(D | θ) / ∂θx|u = Σm P(x, u | o[m], θ) / θx|u
Proof sketch: differentiate the likelihood of each instance o[m]; the table entry θx|u multiplies exactly the joint assignments consistent with X=x, U=u, so the per-instance derivative of the log-likelihood is P(x, u | o[m], θ) / θx|u.
How do we compute P(x, u | o[m], θ)?

Gradient Ascent
- Requires computing P(xi, pai | o[m], θ) for all i, m
- Can be done with the clique-tree algorithm, since Xi and Pai are in the same clique
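
A brute-force sketch of this computation for the two-node network X → Y (enumeration stands in for the clique-tree algorithm; the network and its parameter values are illustrative), accumulating the gradient entries Σm P(x, u | o[m], θ) / θx|u for the CPD of Y:

```python
from itertools import product

# Tiny network X -> Y; enumeration stands in for clique-tree inference.
vals = {"X": ["x0", "x1"], "Y": ["y0", "y1"]}
theta = {"X": {"x0": 0.6, "x1": 0.4},
         "Y|X": {("y0", "x0"): 0.7, ("y1", "x0"): 0.3,
                 ("y0", "x1"): 0.2, ("y1", "x1"): 0.8}}

def posterior(evidence):
    # P(X, Y | o[m], theta): enumerate completions consistent with the evidence.
    weights = {(x, y): theta["X"][x] * theta["Y|X"][(y, x)]
               for x, y in product(vals["X"], vals["Y"])
               if evidence.get("X") in (None, x) and evidence.get("Y") in (None, y)}
    z = sum(weights.values())
    return {xy: w / z for xy, w in weights.items()}

def grad_log_likelihood_Y_given_X(data):
    # d log P(D|theta) / d theta_{y|x} = sum_m P(y, x | o[m], theta) / theta_{y|x}
    grad = {key: 0.0 for key in theta["Y|X"]}
    for o_m in data:
        for (x, y), p in posterior(o_m).items():
            grad[(y, x)] += p / theta["Y|X"][(y, x)]
    return grad

data = [{"X": "x0", "Y": "y1"}, {"X": None, "Y": "y0"}]  # X is missing in the second instance
print(grad_log_likelihood_Y_given_X(data))
```

As the summary slide notes, these are gradients with respect to unconstrained table entries; each row of θY|X still has to be kept on the probability simplex.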

Gradient Ascent Summary
Pros:
- Flexible; can be extended to non-table CPDs
Cons:
- Need to project the gradient onto the space of legal parameters
- For reasonable convergence, need to combine with advanced methods (conjugate gradient, line search)
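
One common way to sidestep the explicit projection (an alternative not discussed in the slides) is to reparameterize each CPD row with a softmax, so unconstrained gradient steps always yield legal parameters:

```python
import math

def softmax(lambdas):
    # theta_k = exp(lambda_k) / sum_j exp(lambda_j): any real-valued lambdas
    # map to a valid probability distribution over the row's values.
    m = max(lambdas)
    exps = [math.exp(l - m) for l in lambdas]
    z = sum(exps)
    return [e / z for e in exps]

print(softmax([0.0, 1.0, -2.0]))  # e.g. one row of P(Y|X=x) over three values
```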

Expectation Maximization (EM)
A tailored algorithm for optimizing likelihood functions.
Intuition:
- Parameter estimation is easy given complete data
- Computing the probability of the missing data is "easy" (= inference) given the parameters
Strategy:
- Pick a starting point for the parameters
- "Complete" the data using the current parameters
- Estimate parameters relative to the data completion
- Iterate
The procedure is guaranteed to improve at each iteration.

Expectation Maximization (EM)
- Initialize the parameters to θ0
- Expectation (E-step): for each data case o[m] and each family X,U compute P(X,U | o[m], θi); compute the expected sufficient statistics M[x,u] = Σm P(x,u | o[m], θi) for each x,u
- Maximization (M-step): treat the expected sufficient statistics as observed and set the parameters θi+1 to the MLE with respect to the ESS

Expectation Maximization (EM)
[Figure: the EM loop. Start with an initial network over X, Y and training data with missing entries; the E-step (inference) produces expected counts N(X) and N(X,Y); the M-step (reparameterization) produces an updated network; iterate.]
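
A self-contained sketch of this loop for the two-node network X → Y with missing entries (variable names, parameter values, and the brute-force enumeration used for the E-step inference are all illustrative):

```python
from itertools import product

vals = {"X": ["x0", "x1"], "Y": ["y0", "y1"]}

def posterior(evidence, theta):
    # E-step inference: P(X, Y | o[m], theta) over completions consistent with o[m].
    weights = {(x, y): theta["X"][x] * theta["Y|X"][(y, x)]
               for x, y in product(vals["X"], vals["Y"])
               if evidence.get("X") in (None, x) and evidence.get("Y") in (None, y)}
    z = sum(weights.values())
    return {xy: w / z for xy, w in weights.items()}

def em(data, theta, iterations=20):
    for _ in range(iterations):
        # E-step: expected sufficient statistics N(X) and N(X, Y).
        n_x = {x: 0.0 for x in vals["X"]}
        n_xy = {(y, x): 0.0 for x in vals["X"] for y in vals["Y"]}
        for o_m in data:
            for (x, y), p in posterior(o_m, theta).items():
                n_x[x] += p
                n_xy[(y, x)] += p
        # M-step: treat the expected counts as observed and take the MLE
        # (no smoothing here; a real implementation would add priors / pseudo-counts).
        theta = {"X": {x: n_x[x] / len(data) for x in vals["X"]},
                 "Y|X": {(y, x): n_xy[(y, x)] / n_x[x]
                         for x in vals["X"] for y in vals["Y"]}}
    return theta

data = [{"X": "x0", "Y": "y1"}, {"X": None, "Y": "y0"}, {"X": "x1", "Y": None}]
theta0 = {"X": {"x0": 0.5, "x1": 0.5},
          "Y|X": {(y, x): 0.5 for x in vals["X"] for y in vals["Y"]}}
print(em(data, theta0))
```

Each pass through the loop performs one E-step and one M-step and returns a reparameterized network, mirroring the cycle in the figure.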

Expectation Maximization (EM)
Formal guarantees:
- L(D : θi+1) ≥ L(D : θi): each iteration improves the likelihood
- If θi+1 = θi, then θi is a stationary point of L(D : θ); usually, this means a local maximum
Main cost:
- Computation of the expected counts in the E-step
- Requires inference for each instance in the training set
- Exactly the same as in gradient ascent!

EM – Practical Considerations
Initial parameters:
- Highly sensitive to the starting parameters
- Choose randomly, or choose by guessing from another source
Stopping criteria:
- Small change in the data likelihood
- Small change in the parameters
Avoiding bad local maxima:
- Multiple restarts
- Early pruning of unpromising starting points
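
A generic driver (not from the slides) combining two of these recommendations: multiple random restarts and stopping on a small change in the data likelihood. Here em_step, log_likelihood, and random_init are assumed callables, e.g. thin wrappers around the EM sketch above:

```python
import math
import random

def run_em_with_restarts(em_step, log_likelihood, random_init,
                         restarts=10, max_iters=100, tol=1e-4, seed=0):
    # Multiple random restarts; keep the parameters with the best likelihood.
    rng = random.Random(seed)
    best_theta, best_ll = None, -math.inf
    for _ in range(restarts):
        theta = random_init(rng)
        prev_ll = -math.inf
        for _ in range(max_iters):
            theta = em_step(theta)
            ll = log_likelihood(theta)
            if ll - prev_ll < tol:   # stopping criterion: small change in data likelihood
                break
            prev_ll = ll
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta, best_ll
```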

EM in Practice – Alarm Network
- Data sampled from the true network
- 20% of the data randomly deleted
[Figure: the structure of the Alarm network (37 nodes, including PCWP, CO, HRBP, ..., CVP, BP).]

EM in Practice – Alarm Network
[Figure: plot of training error and test error.]

Partial Data: Parameter Estimation
- Non-linear optimization problem
- Methods for learning: EM and Gradient Ascent; both exploit inference for learning
Challenges:
- Exploration of a complex likelihood/posterior
  - More missing data → many more local maxima
  - Cannot represent the posterior → must resort to approximations
- Inference
  - Main computational bottleneck for learning
  - Learning large networks → exact inference is infeasible → resort to approximate inference

Structure Learning w. Missing Data
Distinguish two learning problems:
- Learning structure for a given set of random variables
- Introducing new hidden variables
  - How do we recognize the need for a new variable?
  - Where do we introduce a newly added hidden variable within G?
  - Open ended and less understood…

Structure Learning w. Missing Data
Theoretically, there is no problem:
- Define a score, and search for the structure that maximizes it
- The likelihood term will require gradient ascent or EM
Practically infeasible:
- Typically we have O(n²) candidates at each search step
- Evaluating each candidate requires EM
- Each EM run requires inference for every data instance
- Total running time per search step: O(n² × M × #EM iterations × cost of BN inference)

Typical Search
[Figure: one search step over networks on A, B, C, D. Each candidate move (e.g., Add B→D, Reverse C→B, Delete B→C) produces a new structure, and evaluating every candidate requires a run of EM.]

Structural EM
Basic idea: use expected sufficient statistics to learn structure, not just parameters.
- Use the current network to complete the data using EM
- Treat the completed data as "real" to score candidates
- Pick the candidate network with the best score
- Use the previous completed counts to evaluate networks in the next step
- After several steps, compute a new data completion from the current network

Structural EM
Conceptually:
- The algorithm maintains an actual distribution Q over completed datasets, as well as the current structure G and parameters θG
- At each step we do one of the following:
  - Use <G, θG> to compute a new completion Q and redefine θG as the MLE relative to Q
  - Evaluate candidate successors G' relative to Q and pick the best
In practice:
- Maintain Q implicitly as a model <G, θG>
- Use the model to compute sufficient statistics MQ[x,u] when these are needed to evaluate new structures
- Use the sufficient statistics to compute MLE estimates of candidate structures
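
A high-level sketch of this loop (all helper functions are assumed callables, not part of the slides; a full implementation would compute only the sufficient statistics MQ[x,u] actually needed by the candidate families):

```python
def structural_em(data, initial_graph, initial_params,
                  run_em, expected_sufficient_statistics,
                  candidate_successors, score_with_ess,
                  outer_iterations=10):
    # Sketch of the Structural EM loop; all helpers are assumed callables.
    graph, params = initial_graph, initial_params
    for _ in range(outer_iterations):
        # Complete the data (implicitly) with the current model <G, theta_G>:
        # run parametric EM, then compute expected sufficient statistics M_Q[x, u].
        params = run_em(graph, params, data)
        ess = expected_sufficient_statistics(graph, params, data)
        # Score candidate structures against the completed counts
        # (the score decomposes, as with complete data) and keep the best.
        best = max(candidate_successors(graph),
                   key=lambda g: score_with_ess(g, ess))
        if score_with_ess(best, ess) <= score_with_ess(graph, ess):
            break   # no candidate improves on the current structure
        graph = best
    return graph, params
```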

Structural EM Benefits
- Many fewer EM runs
- The score relative to the completed data is decomposable!
  - Utilizes the same benefits as structure learning with complete data
  - Each candidate network requires few recomputations
  - Here the savings are large, since each sufficient statistics computation requires inference
- As in EM, we optimize a simpler score
  - Can show improvement and convergence: an SEM step that improves the score in the completed-data (D+) space also improves the real score