CS B553: Algorithms for Optimization and Learning. Parameter Learning with Hidden Variables & Expectation Maximization.

CS B553: Algorithms for Optimization and Learning
Parameter Learning with Hidden Variables & Expectation Maximization

Agenda
- Learning probability distributions from data, in the setting of known structure but missing data
- The expectation-maximization (EM) algorithm

Basic Problem
Given a dataset D = {x[1], …, x[M]} and a Bayesian model over observed variables X and hidden (latent) variables Z, fit the distribution P(X, Z) to the data.
Interpretation: each example x[m] is an incomplete view of the "underlying" sample (x[m], z[m]).
(Diagram: two-node network Z → X.)

Applications
- Clustering in data mining
- Dimensionality reduction
- Latent psychological traits (e.g., intelligence, personality)
- Document classification
- Human activity recognition

Hidden Variables Can Yield More Parsimonious Models
Hidden variables => conditional independences.
(Diagram: naive-Bayes structure Z → X1, X2, X3, X4 vs. a fully connected network over X1, …, X4.)
Without Z, the observables become fully dependent.

Hidden Variables Can Yield More Parsimonious Models
Hidden variables => conditional independences.
(Diagram: naive-Bayes structure Z → X1, X2, X3, X4 vs. a fully connected network over X1, …, X4.)
Without Z, the observables become fully dependent: with Z, 1 + 4·2 = 9 parameters; without Z, 2^4 − 1 = 15 parameters.

Generating Model
(Diagram: unrolled network z[1] → x[1], …, z[M] → x[M], all sharing the parameters θ_z and θ_x|z.)
These CPTs are identical across examples, given by the shared parameters θ_z and θ_x|z.

Example: Discrete Variables
(Same unrolled network, with parameters θ_z and θ_x|z.)
Categorical distributions given by the parameters:
P(Z[i] | θ_z) = Categorical(θ_z)
P(X[i] | z[i], θ_x|z) = Categorical(θ_x|z[i])
(In other words, z[i] multiplexes between categorical distributions over X[i].)
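As a concrete illustration of this generating model, here is a minimal NumPy sketch that samples a dataset from categorical θ_z and θ_x|z; the parameter values, sizes, and variable names are my own illustrative assumptions, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values (not from the lecture): 2 hidden types, 3 observed symbols.
theta_z = np.array([0.75, 0.25])                 # Categorical(theta_z) over Z
theta_x_given_z = np.array([[0.1, 0.6, 0.3],     # Categorical over X when z = 0
                            [0.7, 0.2, 0.1]])    # Categorical over X when z = 1

M = 1000
z = rng.choice(len(theta_z), size=M, p=theta_z)                   # hidden z[m]
x = np.array([rng.choice(3, p=theta_x_given_z[zm]) for zm in z])  # observed x[m]

# The learner sees only D = {x[1], ..., x[M]}; the z[m] are discarded (hidden).
```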

Maximum Likelihood Estimation
Approach: find values of θ = (θ_z, θ_x|z) and D_Z = (z[1], …, z[M]) that maximize the likelihood of the data:
L(θ, D_Z; D) = P(D | θ, D_Z)
Find arg max L(θ, D_Z; D) over θ and D_Z.

Marginal Likelihood Estimation
Approach: find values of θ = (θ_z, θ_x|z) that maximize the likelihood of the data without assuming values of D_Z = (z[1], …, z[M]):
L(θ; D) = Σ_{D_Z} P(D, D_Z | θ)
Find arg max L(θ; D) over θ.
(A partially Bayesian approach: the hidden assignments are marginalized out.)

Computational Challenges
P(D | θ, D_Z) and P(D, D_Z | θ) are easy to evaluate, but…
Maximum likelihood, arg max L(θ, D_Z; D): optimizing over M assignments to Z (|Val(Z)|^M possible joint assignments) as well as over the continuous parameters.
Maximum marginal likelihood, arg max L(θ; D): optimizing only over the continuous parameters (locally), but the objective requires summing over the assignments to the M hidden variables Z[m].

Expectation Maximization for ML
Idea: use a coordinate ascent approach:
arg max_{θ, D_Z} L(θ, D_Z; D) = arg max_θ max_{D_Z} L(θ, D_Z; D)
Step 1: finding D_Z* = arg max_{D_Z} L(θ, D_Z; D) is easy given a fixed θ (each z[m] is set to its most likely completion under the current parameters).
Step 2: set Q(θ) = L(θ, D_Z*; D); finding θ* = arg max_θ Q(θ) is easy given that D_Z is fixed (fully observed ML parameter estimation).
Repeat steps 1 and 2 until convergence.
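A minimal sketch of this hard-EM coordinate ascent for a mixture of categorical distributions (one hidden Z, one observed X); the function name, random initialization, and smoothing constants are my own choices, not the lecture's code.

```python
import numpy as np

def hard_em(x, K=2, iters=50, seed=0):
    """Hard (coordinate-ascent) EM for a mixture of categorical distributions.
    x: integer array of observed symbols in {0, ..., V-1}."""
    rng = np.random.default_rng(seed)
    V = int(x.max()) + 1
    theta_z = np.full(K, 1.0 / K)                # current P(Z = k)
    theta_x = rng.dirichlet(np.ones(V), size=K)  # current P(X = v | Z = k)
    for _ in range(iters):
        # Step 1: best completion D_Z* given fixed parameters (decomposes per example).
        log_joint = np.log(theta_z + 1e-12)[None, :] + np.log(theta_x[:, x] + 1e-12).T  # [M, K]
        z = log_joint.argmax(axis=1)
        # Step 2: fully observed ML parameter estimation given the completion.
        theta_z = np.bincount(z, minlength=K) / len(x)
        counts = np.vstack([np.bincount(x[z == k], minlength=V) for k in range(K)])
        theta_x = (counts + 1e-12) / (counts.sum(axis=1, keepdims=True) + V * 1e-12)
    return theta_z, theta_x, z
```

Run on data sampled from the earlier generating-model snippet, this typically recovers the parameters up to a relabeling of the hidden values.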

Example: Correlated Variables
(Diagram, left: plate notation, a plate of size M containing z → x1 and z → x2 with parameters θ_z, θ_x1|z, θ_x2|z; right: the unrolled network z[1] → x1[1], x2[1], …, z[M] → x1[M], x2[M].)

Example: Correlated Variables
(Plate notation: z → x1, x2 inside a plate of size M, with parameters θ_z, θ_x1|z, θ_x2|z.)
Suppose there are 2 types:
1. Type 1: X1 != X2, chosen at random
2. Type 2: X1, X2 = 1, 1 with 90% chance, 0, 0 otherwise
Type 1 is drawn 75% of the time.
Dataset (counts of observed (X1, X2)):
(1,1): 222   (1,0): 382   (0,1): 364   (0,0): 32

Example: Correlated Variables
(Plate notation as before.)
Dataset (counts of observed (X1, X2)):
(1,1): 222   (1,0): 382   (0,1): 364   (0,0): 32
Parameter estimates:
θ_z = 0.5
θ_x1|z=1 = 0.4, θ_x1|z=2 = 0.3
θ_x2|z=1 = 0.7, θ_x2|z=2 = 0.6

Example: Correlated Variables
(Plate notation as before.)
Dataset (counts of observed (X1, X2)):
(1,1): 222   (1,0): 382   (0,1): 364   (0,0): 32
Parameter estimates:
θ_z = 0.5
θ_x1|z=1 = 0.4, θ_x1|z=2 = 0.3
θ_x2|z=1 = 0.7, θ_x2|z=2 = 0.6
Estimated Z's:
(1,1): type 1   (1,0): type 1   (0,1): type 2   (0,0): type 2

Example: Correlated Variables
(Plate notation as before.)
Dataset (counts of observed (X1, X2)):
(1,1): 222   (1,0): 382   (0,1): 364   (0,0): 32
Parameter estimates:
θ_z = 0.604
θ_x1|z=1 = 1, θ_x1|z=2 = 0
θ_x2|z=1 = 0.368, θ_x2|z=2 = 0.919
Estimated Z's:
(1,1): type 1   (1,0): type 1   (0,1): type 2   (0,0): type 2

Example: Correlated Variables
(Plate notation as before.)
Dataset (counts of observed (X1, X2)):
(1,1): 222   (1,0): 382   (0,1): 364   (0,0): 32
Parameter estimates:
θ_z = 0.604
θ_x1|z=1 = 1, θ_x1|z=2 = 0
θ_x2|z=1 = 0.368, θ_x2|z=2 = 0.919
Estimated Z's:
(1,1): type 1   (1,0): type 1   (0,1): type 2   (0,0): type 2
Converged (true ML estimate).
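The converged estimates follow directly from the counts once the hard assignments are fixed ((1,1) and (1,0) to type 1, (0,1) and (0,0) to type 2, as on the slide). The following small self-contained check, which is my own addition and not part of the original slides, recomputes them:

```python
# Counts of observed (x1, x2) from the slide's dataset.
counts = {(1, 1): 222, (1, 0): 382, (0, 1): 364, (0, 0): 32}
# Hard assignments from the slide: (1,1), (1,0) -> type 1; (0,1), (0,0) -> type 2.
z_of = {(1, 1): 1, (1, 0): 1, (0, 1): 2, (0, 0): 2}

def total(pred):
    return sum(c for xy, c in counts.items() if pred(xy))

M = sum(counts.values())                 # 1000 examples
M_z1 = total(lambda xy: z_of[xy] == 1)   # 604
M_z2 = M - M_z1                          # 396

theta_z = M_z1 / M                                                    # 0.604
theta_x1_z1 = total(lambda xy: z_of[xy] == 1 and xy[0] == 1) / M_z1   # 1.0
theta_x1_z2 = total(lambda xy: z_of[xy] == 2 and xy[0] == 1) / M_z2   # 0.0
theta_x2_z1 = total(lambda xy: z_of[xy] == 1 and xy[1] == 1) / M_z1   # 222/604 ≈ 0.368
theta_x2_z2 = total(lambda xy: z_of[xy] == 2 and xy[1] == 1) / M_z2   # 364/396 ≈ 0.919
```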

Example: Correlated Variables
(Plate notation: a plate of size M containing z → x1, x2, x3, x4, with parameters θ_Z, θ_X1|Z, θ_X2|Z, θ_X3|Z, θ_X4|Z.)
Random initial guess:
θ_Z = 0.44
θ_X1|Z=1 = 0.97, θ_X2|Z=1 = 0.21, θ_X3|Z=1 = 0.87, θ_X4|Z=1 = 0.57
θ_X1|Z=2 = 0.07, θ_X2|Z=2 = 0.97, θ_X3|Z=2 = 0.71, θ_X4|Z=2 = 0.03
(Log-likelihood value and the dataset table of counts over (x1, x2) × (x3, x4) not preserved in the transcript.)

Example: E Step
(Plate notation and dataset table as before.)
Random initial guess:
θ_Z = 0.44
θ_X1|Z=1 = 0.97, θ_X2|Z=1 = 0.21, θ_X3|Z=1 = 0.87, θ_X4|Z=1 = 0.57
θ_X1|Z=2 = 0.07, θ_X2|Z=2 = 0.97, θ_X3|Z=2 = 0.71, θ_X4|Z=2 = 0.03
Log likelihood: -4401
(Table of Z assignments per (x1, x2) × (x3, x4) cell not preserved in the transcript.)

Example: M Step
(Plate notation and dataset table as before.)
Current estimates:
θ_Z = 0.43
θ_X1|Z=1 = 0.67, θ_X2|Z=1 = 0.27, θ_X3|Z=1 = 0.37, θ_X4|Z=1 = 0.83
θ_X1|Z=2 = 0.31, θ_X2|Z=2 = 0.68, θ_X3|Z=2 = 0.31, θ_X4|Z=2 = 0.21
(Log-likelihood value and the table of Z assignments only partially preserved in the transcript.)

Example: E Step
(Plate notation and dataset table as before.)
Current estimates:
θ_Z = 0.43
θ_X1|Z=1 = 0.67, θ_X2|Z=1 = 0.27, θ_X3|Z=1 = 0.37, θ_X4|Z=1 = 0.83
θ_X1|Z=2 = 0.31, θ_X2|Z=2 = 0.68, θ_X3|Z=2 = 0.31, θ_X4|Z=2 = 0.21
(Log-likelihood value and the table of Z assignments not preserved in the transcript.)

Example: E Step
(Plate notation and dataset table as before.)
Current estimates:
θ_Z = 0.40
θ_X1|Z=1 = 0.56, θ_X2|Z=1 = 0.31, θ_X3|Z=1 = 0.40, θ_X4|Z=1 = 0.92
θ_X1|Z=2 = 0.45, θ_X2|Z=2 = 0.66, θ_X3|Z=2 = 0.26, θ_X4|Z=2 = 0.04
(Log-likelihood value and the table of Z assignments only partially preserved in the transcript.)

Example: Last E-M Step
(Plate notation and dataset table as before.)
Current estimates:
θ_Z = 0.43
θ_X1|Z=1 = 0.51, θ_X2|Z=1 = 0.36, θ_X3|Z=1 = 0.35, θ_X4|Z=1 = 1
θ_X1|Z=2 = 0.53, θ_X2|Z=2 = 0.57, θ_X3|Z=2 = 0.33, θ_X4|Z=2 = 0
(Log-likelihood value and the table of Z assignments not preserved in the transcript.)

Problem: Many Local Minima
Flipping Z assignments causes large shifts in likelihood, leading to a poorly behaved energy landscape!
Solution: EM using the marginal likelihood formulation ("soft" EM). This is the typical form of the EM algorithm.

Expectation Maximization for MML
arg max_θ L(θ; D) = arg max_θ E_{D_Z | D, θ}[L(θ; D_Z, D)]
Do arg max_θ E_{D_Z | D, θ}[log L(θ; D_Z, D)] instead (justified later).
Step 1: given the current fixed θ^t, find P(D_Z | θ^t, D), i.e., compute a distribution over each Z[i].
Step 2: use these probabilities in the expectation E_{D_Z | D, θ^t}[log L(θ, D_Z; D)] = Q(θ | θ^t). Now find max_θ Q(θ | θ^t): fully observed, weighted ML parameter estimation.
Repeat steps 1 (expectation) and 2 (maximization) until convergence.
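A minimal soft-EM sketch for a model of this kind (binary Z, several conditionally independent binary observables); the initialization, clipping constants, and function name are illustrative assumptions rather than the lecture's code.

```python
import numpy as np

def soft_em(X, iters=100, seed=0):
    """Soft EM for a two-component mixture of independent Bernoulli observables.
    X: binary array of shape [M, n], one row per example x[m]."""
    rng = np.random.default_rng(seed)
    M, n = X.shape
    theta_z = 0.5                                    # P(z = 1)
    theta_x = rng.uniform(0.25, 0.75, size=(2, n))   # theta_x[k, i] = P(x_i = 1 | z = k + 1)
    for _ in range(iters):
        # E step: responsibilities w[m, k] = P(z[m] = k + 1 | x[m], theta^t).
        log_lik = X @ np.log(theta_x).T + (1 - X) @ np.log(1 - theta_x).T   # [M, 2]
        log_joint = log_lik + np.log([theta_z, 1.0 - theta_z])
        w = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M step: weighted ("expected count") ML estimates.
        Mz = w.sum(axis=0)                           # expected counts M[z = k + 1]
        theta_z = float(np.clip(Mz[0] / M, 1e-6, 1 - 1e-6))
        theta_x = (w.T @ X) / Mz[:, None]            # M[x_i = 1, z] / M[z]
        theta_x = np.clip(theta_x, 1e-6, 1 - 1e-6)   # guard against log(0) next iteration
    return theta_z, theta_x, w
```

The E step here corresponds to computing the weights w_{m,z} described on the next slide, and the M step to the weighted closed-form updates derived on the slides that follow.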

E Step in Detail
Ultimately, we want to maximize Q(θ | θ^t) = E_{D_Z | D, θ^t}[log L(θ; D_Z, D)] over θ:
Q(θ | θ^t) = Σ_m Σ_{z[m]} P(z[m] | x[m], θ^t) log P(x[m], z[m] | θ)
The E step computes the terms w_{m,z}(θ^t) = P(Z[m] = z | D, θ^t) over all examples m and all z ∈ Val[Z].

M Step in Detail
arg max_θ Q(θ | θ^t) = arg max_θ Σ_m Σ_z w_{m,z}(θ^t) log P(x[m], z[m] = z | θ)
= arg max_θ Π_m Π_z P(x[m], z[m] = z | θ)^{w_{m,z}(θ^t)}
This is weighted ML: each completion z[m] = z is treated as if it were observed w_{m,z}(θ^t) times.
Most closed-form ML expressions (Bernoulli, categorical, Gaussian) adapt easily to the weighted case.

Example: Bernoulli Parameter for Z
θ_Z* = arg max_{θ_Z} Σ_m Σ_z w_{m,z} log P(x[m], z[m] = z | θ)
= arg max_{θ_Z} Σ_m Σ_z w_{m,z} log( I[z = 1] θ_Z + I[z = 0](1 − θ_Z) )   (dropping terms that do not depend on θ_Z)
= arg max_{θ_Z} [ log(θ_Z) Σ_m w_{m,z=1} + log(1 − θ_Z) Σ_m w_{m,z=0} ]
=> θ_Z* = (Σ_m w_{m,z=1}) / Σ_m (w_{m,z=1} + w_{m,z=0})
"Expected counts": M_{θ^t}[z] = Σ_m w_{m,z}(θ^t)
Express θ_Z* = M_{θ^t}[z=1] / M (since the expected counts sum to M).

Example: Bernoulli Parameters for X_i | Z
θ_{Xi|z=k}* = arg max_{θ_{Xi|z=k}} Σ_m w_{m,z=k} log P(x[m], z[m] = k | θ_{Xi|z=k})
= arg max_{θ_{Xi|z=k}} Σ_m Σ_z w_{m,z} log( I[x_i[m] = 1, z = k] θ_{Xi|z=k} + I[x_i[m] = 0, z = k](1 − θ_{Xi|z=k}) )
= … (similar derivation)
θ_{Xi|z=k}* = M_{θ^t}[x_i = 1, z = k] / M_{θ^t}[z = k]

EM on Prior Example (100 iterations)
(Plate notation and dataset table as before.)
Final estimates:
θ_Z = 0.49
θ_X1|Z=1 = 0.64, θ_X2|Z=1 = 0.88, θ_X3|Z=1 = 0.41, θ_X4|Z=1 = 0.46
θ_X1|Z=2 = 0.38, θ_X2|Z=2 = 0.00, θ_X3|Z=2 = 0.27, θ_X4|Z=2 = 0.68
(Log-likelihood value and the table of P(Z = 2 | x) per (x1, x2) × (x3, x4) cell only partially preserved in the transcript.)

Convergence
- In general, there is no way to tell a priori how fast EM will converge.
- Soft EM is usually slower than hard EM.
- It still runs into local minima, but has more opportunities to coordinate parameter adjustments.

Why Does It Work?
Why are we optimizing Q(θ | θ^t) = Σ_m Σ_{z[m]} P(z[m] | x[m], θ^t) log P(x[m], z[m] | θ) rather than the true marginalized likelihood L(θ; D) = Π_m Σ_{z[m]} P(x[m], z[m] | θ)?

Why Does It Work?
Why are we optimizing Q(θ | θ^t) = Σ_m Σ_{z[m]} P(z[m] | x[m], θ^t) log P(x[m], z[m] | θ) rather than the true marginalized likelihood L(θ; D) = Π_m Σ_{z[m]} P(x[m], z[m] | θ)?
Can prove that:
- The log likelihood increases at every step.
- A stationary point of E_{D_Z | D, θ}[L(θ; D_Z, D)] is a stationary point of log L(θ; D).
(See K&F.)
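One standard way to see the first claim (a sketch following the usual textbook argument, not spelled out on the slide): since P(z | x[m], θ^t) = P(x[m], z | θ^t) / P(x[m] | θ^t), for any θ and θ^t,

```latex
\begin{aligned}
\log L(\theta; D) - \log L(\theta^t; D)
  &= \sum_m \log \sum_{z} P(z \mid x[m], \theta^t)\,
        \frac{P(x[m], z \mid \theta)}{P(x[m], z \mid \theta^t)} \\
  &\ge \sum_m \sum_{z} P(z \mid x[m], \theta^t)\,
        \log \frac{P(x[m], z \mid \theta)}{P(x[m], z \mid \theta^t)}
   = Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t),
\end{aligned}
```

by Jensen's inequality. So any θ with Q(θ | θ^t) ≥ Q(θ^t | θ^t), in particular the M-step maximizer, cannot decrease the log likelihood.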

Gaussian Clustering Using EM
- One of the first uses of EM; a widely used approach.
- Finding good starting points: the k-means algorithm (hard assignment).
- Handling degeneracies: regularization.
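As a concrete illustration (assuming scikit-learn is available; this is not the lecture's code), Gaussian-mixture EM with a k-means starting point and covariance regularization against degeneracies might look like the following; the synthetic data is hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D data from two hypothetical clusters.
X = np.vstack([rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
               rng.normal(loc=[3.0, 3.0], scale=1.0, size=(300, 2))])

gmm = GaussianMixture(
    n_components=2,
    init_params="kmeans",   # k-means supplies the starting point for EM
    reg_covar=1e-6,         # regularization guards against degenerate (collapsing) covariances
    n_init=5,               # multiple restarts mitigate local optima
    random_state=0,
)
gmm.fit(X)                               # runs (soft) EM internally
hard_labels = gmm.predict(X)             # hard cluster assignments
responsibilities = gmm.predict_proba(X)  # soft assignments (E-step responsibilities)
```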

Recap
- Learning with hidden variables (typically categorical)
- Expectation-maximization (EM) algorithm