Probabilistic Models with Latent Variables


1 Probabilistic Models with Latent Variables
Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya

2 Density Estimation Problem
- Learning from unlabeled data $x_1, x_2, \dots, x_N$
- Unsupervised learning, density estimation
- Empirical distribution typically has multiple modes

3 Density Estimation Problem
[Figures omitted]

4 Density Estimation Problem
- Convex combination of unimodal pdfs gives a multimodal pdf: $f(x) = \sum_k w_k f_k(x)$, where $\sum_k w_k = 1$ and $w_k \ge 0$
- Physical interpretation: sub-populations

5 Latent Variables
- Introduce a new variable $Z_i$ for each $X_i$
- Latent / hidden: not observed in the data
- Probabilistic interpretation:
  - Mixing weights: $w_k \equiv p(z_i = k)$
  - Mixture densities: $f_k(x) \equiv p(x \mid z_i = k)$

6 Generative Mixture Model
- For $i = 1, \dots, N$:
  - $Z_i \sim \text{iid } \mathrm{Mult}$
  - $X_i \sim \text{iid } p(x \mid z_i)$
- $P(x_i, z_i) = p(z_i)\, p(x_i \mid z_i)$
- $P(x_i) = \sum_k p(x_i, z_i = k)$ recovers the mixture distribution
- Plate notation: $Z_i \rightarrow X_i$, replicated $N$ times
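Below is a minimal NumPy sketch of this generative (ancestral sampling) process. It assumes Gaussian component densities with a shared covariance, anticipating the GMM example on the later slides; the function name and parameter values are illustrative only.

```python
import numpy as np

def sample_mixture(w, mus, Sigma, N, seed=0):
    """Ancestral sampling from the generative mixture model:
    z_i ~ Mult(w), then x_i ~ p(x | z_i)  (here: Gaussian components)."""
    rng = np.random.default_rng(seed)
    K = len(w)
    z = rng.choice(K, size=N, p=w)                                # latent labels
    x = np.array([rng.multivariate_normal(mus[k], Sigma) for k in z])
    return x, z

# Example: three sub-populations in 2D with a shared covariance
w = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 3.0]])
x, z = sample_mixture(w, mus, np.eye(2), N=500)
```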

7 Tasks in a Mixture Model
- Inference: $P(z \mid x) = \prod_i P(z_i \mid x_i)$
- Parameter estimation: find parameters that, e.g., maximize the likelihood
  - Does not decouple according to classes
  - Non-convex, many local optima

8 Example: Gaussian Mixture Model
- For $i = 1, \dots, N$:
  - $Z_i \sim \text{iid } \mathrm{Mult}(\pi)$
  - $X_i \mid Z_i = k \sim \mathcal{N}(x \mid \mu_k, \Sigma)$
- Inference: $P(z_i = k \mid x_i; \mu, \Sigma)$ has the form of a soft-max function
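A minimal sketch of this inference step, assuming Gaussian components with a shared covariance; SciPy's multivariate_normal and logsumexp are used so the soft-max is computed stably in log space.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(x, pi, mus, Sigma):
    """Posterior p(z_i = k | x_i): a soft-max over log pi_k + log N(x_i | mu_k, Sigma)."""
    log_joint = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(x, mus[k], Sigma)
         for k in range(len(pi))], axis=1)                        # shape (N, K)
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
```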

9 Example: Gaussian Mixture Model
- Log-likelihood: $l(\theta) = \sum_i \log p(x_i) = \sum_i \log \sum_k p(z_i = k)\, p(x_i \mid z_i = k)$
- Which training instance comes from which component?
- No closed-form solution for maximizing $l(\theta)$
  - Possibility 1: gradient descent, etc.
  - Possibility 2: Expectation Maximization
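For reference, a small sketch that evaluates $l(\theta)$ with the log-sum-exp trick, under the same assumptions as above (Gaussian components, shared covariance):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_loglikelihood(x, pi, mus, Sigma):
    """l(theta) = sum_i log sum_k pi_k N(x_i | mu_k, Sigma)."""
    log_joint = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(x, mus[k], Sigma)
         for k in range(len(pi))], axis=1)
    return logsumexp(log_joint, axis=1).sum()
```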

10 Expectation Maximization Algorithm
- Observation: if the values of $Z_i$ were known, maximization would be easy
- Key idea: iterative updates
  - Given parameter estimates, "infer" all $Z_i$ variables
  - Given inferred $Z_i$ variables, maximize w.r.t. the parameters
- Questions:
  - Does this converge?
  - What does this maximize?

11 Expectation Maximization Algorithm
- Complete log-likelihood: $l_c(\theta) = \sum_i \log p(x_i, z_i) = \sum_i \log p(z_i)\, p(x_i \mid z_i)$
- Problem: $z_i$ not known
- Possible solution: replace it with its conditional expectation
- Expected complete log-likelihood: $Q(\theta, \theta^{old}) = E\big[\sum_i \log p(x_i, z_i)\big]$, where the expectation is w.r.t. $p(z \mid x, \theta^{old})$ and $\theta^{old}$ are the current parameters

12 Expectation Maximization Algorithm
𝑄 πœƒ, πœƒ π‘œπ‘™π‘‘ =𝐸[ 𝑖 log 𝑝 π‘₯ 𝑖 , 𝑧 𝑖 ] = 𝑖 π‘˜ 𝐸 𝐼 𝑧 𝑖 =π‘˜ log [ πœ‹ π‘˜ 𝑝( π‘₯ 𝑖 | πœƒ π‘˜ )] = 𝑖 π‘˜ 𝑝( 𝑧 𝑖 =π‘˜| π‘₯ 𝑖 , πœƒ π‘œπ‘™π‘‘ ) log [ πœ‹ π‘˜ 𝑝( π‘₯ 𝑖 | πœƒ π‘˜ )] = 𝑖 π‘˜ 𝛾 π‘–π‘˜ log πœ‹ π‘˜ + 𝑖 π‘˜ 𝛾 π‘–π‘˜ log 𝑝( π‘₯ 𝑖 | πœƒ π‘˜ ) Where 𝛾 π‘–π‘˜ =𝑝( 𝑧 𝑖 =π‘˜| π‘₯ 𝑖 , πœƒ π‘œπ‘™π‘‘ ) Compare with likelihood for generative classifier Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya

13 Expectation Maximization Algorithm
- Expectation step: update $\gamma_{ik}$ based on the current parameters
  $$\gamma_{ik} = \frac{\pi_k\, p(x_i \mid \theta^{old}_k)}{\sum_{k'} \pi_{k'}\, p(x_i \mid \theta^{old}_{k'})}$$
- Maximization step: maximize $Q(\theta, \theta^{old})$ w.r.t. the parameters
- Overall algorithm:
  - Initialize all latent variables
  - Iterate until convergence: M step, then E step

14 Example: EM for GMM
- E step remains the same for all mixture models
- M step:
  - $\pi_k = \frac{\sum_i \gamma_{ik}}{N} = \frac{\gamma_k}{N}$
  - $\mu_k = \frac{\sum_i \gamma_{ik}\, x_i}{\gamma_k}$
  - $\Sigma = \, ?$
- Compare with the generative classifier
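Putting the E and M steps together, here is a compact EM sketch for a GMM with a shared covariance matrix. The $\Sigma$ update shown is the standard shared-covariance estimate and is an added assumption, since the slide leaves it as an exercise; the initialization and the small ridge term are illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_gmm(x, K, n_iter=100, seed=0):
    """EM for a GMM with shared covariance (a sketch, not a tuned implementation)."""
    rng = np.random.default_rng(seed)
    N, D = x.shape
    pi = np.full(K, 1.0 / K)
    mus = x[rng.choice(N, K, replace=False)]              # init means at random data points
    Sigma = np.cov(x.T) + 1e-6 * np.eye(D)
    for _ in range(n_iter):
        # E step: gamma_ik proportional to pi_k N(x_i | mu_k, Sigma)
        log_joint = np.stack(
            [np.log(pi[k]) + multivariate_normal.logpdf(x, mus[k], Sigma)
             for k in range(K)], axis=1)
        gamma = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
        # M step: closed-form maximizers of the expected complete log-likelihood
        Nk = gamma.sum(axis=0)                            # gamma_k
        pi = Nk / N
        mus = (gamma.T @ x) / Nk[:, None]
        diff = x[:, None, :] - mus[None, :, :]            # shape (N, K, D)
        Sigma = np.einsum('nk,nkd,nke->de', gamma, diff, diff) / N + 1e-6 * np.eye(D)
    return pi, mus, Sigma, gamma
```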

15 Analysis of EM Algorithm
- The expected complete log-likelihood is a lower bound on the log-likelihood
- EM iteratively maximizes this lower bound
- Converges to a local maximum of the log-likelihood

16 Bayesian / MAP Estimation
- EM can overfit
- Possible to perform MAP estimation instead of MLE in the M step
- EM is partially Bayesian:
  - Posterior distribution over the latent variables
  - Point estimate of the parameters
- The fully Bayesian approach is called Variational Bayes

17 (Lloyd's) K-Means Algorithm
- Hard EM for the Gaussian Mixture Model:
  - Point estimate of the parameters (as usual)
  - Point estimate of the latent variables
  - Spherical Gaussian mixture components
- $z_i^* = \arg\max_k p(z_i = k \mid x_i, \theta) = \arg\min_k \lVert x_i - \mu_k \rVert$, where $\mu_k = \frac{1}{N_k} \sum_{i:\, z_i = k} x_i$
- Most popular "hard" clustering algorithm
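A minimal NumPy sketch of Lloyd's algorithm, alternating hard assignments and mean updates; initializing the means at randomly chosen data points is an illustrative choice, not part of the slide.

```python
import numpy as np

def kmeans(x, K, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate z_i* = argmin_k ||x_i - mu_k|| and mean updates."""
    rng = np.random.default_rng(seed)
    mus = x[rng.choice(len(x), K, replace=False)]         # init: random data points
    for _ in range(n_iter):
        # Hard "E" step: assign each point to its nearest mean
        dists = np.linalg.norm(x[:, None, :] - mus[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # "M" step: mu_k = mean of the points assigned to cluster k
        new_mus = np.array([x[z == k].mean(axis=0) if np.any(z == k) else mus[k]
                            for k in range(K)])
        if np.allclose(new_mus, mus):                     # converged
            break
        mus = new_mus
    return mus, z
```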

18 K-Means Problem
- Given $\{x_i\}$, find $k$ "means" $(\mu_1^*, \dots, \mu_k^*)$ and data assignments $(z_1^*, \dots, z_N^*)$ such that
  $$(\mu^*, z^*) = \arg\min_{\mu, z} \sum_i \lVert x_i - \mu_{z_i} \rVert^2$$
- Note: $z_i$ is a $k$-dimensional binary (one-of-$k$) vector

19 Model selection: Choosing K for GMM
- Cross-validation: plot the likelihood on the training set and on a validation set for increasing values of $k$
  - Likelihood on the training set keeps improving
  - Likelihood on the validation set drops after the "optimal" $k$
- Does not work for k-means! Why?
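A sketch of this model-selection loop, reusing the hypothetical em_gmm and gmm_loglikelihood helpers from the earlier sketches (so it inherits their Gaussian, shared-covariance assumptions):

```python
def select_K(x_train, x_val, K_values=range(1, 11)):
    """Fit a GMM for each K on the training set and score held-out log-likelihood."""
    scores = []
    for K in K_values:
        pi, mus, Sigma, _ = em_gmm(x_train, K)            # from the EM sketch above
        scores.append(gmm_loglikelihood(x_val, pi, mus, Sigma))
    return list(K_values), scores                         # pick K where the score peaks
```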

20 Principal Component Analysis: Motivation
- Dimensionality reduction:
  - Reduces the number of parameters to estimate
  - Data often resides in a much lower dimension, e.g., on a line in 3D space
  - Provides "understanding"
- Mixture models are very restricted: the latent variable is confined to a small discrete set
- Can we "relax" the latent variable?

21 Classical PCA: Motivation
- Revisit k-means: $\min_{W,Z} J(W, Z) = \lVert X - W Z^T \rVert_F^2$
  - $W$: matrix containing the means
  - $Z$: matrix containing the cluster-membership vectors
- How can we relax $Z$ and $W$?

22 Classical PCA: Problem
min π‘Š,𝑍 𝐽 π‘Š,𝑍 = |π‘‹βˆ’π‘Š 𝑍 𝑇 | 2 𝐹 X : 𝐷×𝑁 Arbitrary Z of size 𝑁×𝐿, Orthonormal W of size 𝐷×𝐿 Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya

23 Classical PCA: Optimal Solution
- Empirical covariance matrix (of scaled and centered data): $\hat{\Sigma} = \frac{1}{N} \sum_i x_i x_i^T$
- $\hat{W} = V_L$, where $V_L$ contains the $L$ eigenvectors for the $L$ largest eigenvalues of $\hat{\Sigma}$
- $\hat{z}_i = \hat{W}^T x_i$
- Alternative solution via Singular Value Decomposition (SVD)
- $W$ contains the "principal components" that capture the largest variance in the data
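A minimal NumPy sketch of classical PCA via eigendecomposition of the empirical covariance; $X$ is $D \times N$ as on the previous slide, and the centering step is assumed. The SVD route would give the same result.

```python
import numpy as np

def pca(X, L):
    """Classical PCA: W = top-L eigenvectors of the empirical covariance, z_i = W^T x_i."""
    Xc = X - X.mean(axis=1, keepdims=True)                # center the data
    Sigma_hat = (Xc @ Xc.T) / Xc.shape[1]                 # (1/N) sum_i x_i x_i^T
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat)          # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :L]                           # L largest -> principal components
    Z = W.T @ Xc                                          # L x N latent coordinates
    return W, Z
```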

24 Probabilistic PCA
- Generative model:
  - $P(z_i) = \mathcal{N}(z_i \mid \mu_0, \Sigma_0)$
  - $P(x_i \mid z_i) = \mathcal{N}(x_i \mid W z_i + \mu, \Psi)$
- $\Psi$ forced to be diagonal
- Latent linear models: Factor Analysis
  - Special case: PCA, with $\Psi = \sigma^2 I$
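A small sketch of ancestral sampling from this latent linear model, with the common choices $\mu_0 = 0$, $\Sigma_0 = I$ and the PCA case $\Psi = \sigma^2 I$ assumed:

```python
import numpy as np

def sample_ppca(W, mu, sigma2, N, seed=0):
    """z_i ~ N(0, I), then x_i ~ N(W z_i + mu, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    D, L = W.shape
    Z = rng.standard_normal((N, L))                       # latent coordinates
    X = Z @ W.T + mu + np.sqrt(sigma2) * rng.standard_normal((N, D))
    return X, Z
```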

25 Visualization of Generative Process
From Bishop, PRML

26 Relationship with Gaussian Density
πΆπ‘œπ‘£[π‘₯] = π‘Š π‘Š 𝑇 +Ξ¨ Why does Ξ¨ need to be restricted? Intermediate low rank parameterization of Gaussian covariance matrix between full rank and diagonal Compare #parameters Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya

27 EM for PCA: Rod and Springs
From Bishop, PRML

28 Advantages of EM
- Simpler than gradient methods with constraints
- Handles missing data
- Easy path to handling more complex models
- Not always the fastest method

29 Summary of Latent Variable Models
- Learning from unlabeled data
- Latent variables:
  - Discrete: clustering / mixture models; GMM
  - Continuous: dimensionality reduction; PCA
- Summary / "understanding" of the data
- Expectation Maximization algorithm

