Probabilistic Models with Latent Variables


Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya

Density Estimation Problem
- Learning from unlabeled data $x_1, x_2, \ldots, x_N$
- Unsupervised learning: density estimation
- The empirical distribution typically has multiple modes

Density Estimation Problem (example figures)
- Example density plots from http://yulearning.blogspot.co.uk and http://courses.ee.sun.ac.za/Pattern_Recognition_813

Density Estimation Problem
- A convex combination of unimodal pdfs gives a multimodal pdf:
  $f(x) = \sum_k w_k f_k(x)$, where $w_k \ge 0$ and $\sum_k w_k = 1$
- Physical interpretation: sub-populations in the data

Latent Variables
- Introduce a new variable $Z_i$ for each $X_i$
- Latent / hidden: not observed in the data
- Probabilistic interpretation:
  - Mixing weights: $w_k \equiv p(z_i = k)$
  - Mixture densities: $f_k(x) \equiv p(x \mid z_i = k)$

Generative Mixture Model
- For $i = 1, \ldots, N$:
  - $Z_i \sim_{iid} \mathrm{Mult}(w)$
  - $X_i \sim_{iid} p(x \mid z_i)$
- Joint: $P(x_i, z_i) = p(z_i)\, p(x_i \mid z_i)$
- Marginalizing, $P(x_i) = \sum_k p(x_i, z_i = k)$, recovers the mixture distribution
- Plate notation: $Z_i \rightarrow X_i$, replicated $N$ times
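As an illustration (mine, not from the slides), here is a minimal NumPy sketch of this generative process, using hypothetical 1-D Gaussian components and mixing weights:

```python
# Minimal sketch of the generative mixture process: sample z_i from a categorical
# with weights w, then x_i from the chosen component (here 1-D Gaussians).
import numpy as np

rng = np.random.default_rng(0)

w = np.array([0.5, 0.3, 0.2])        # mixing weights, sum to 1 (illustrative)
mu = np.array([-4.0, 0.0, 5.0])      # illustrative component means
sigma = np.array([1.0, 0.5, 2.0])    # illustrative component standard deviations

N = 1000
z = rng.choice(len(w), size=N, p=w)  # Z_i ~ Mult(w)
x = rng.normal(mu[z], sigma[z])      # X_i ~ p(x | z_i)

# Marginally, x follows the mixture density f(x) = sum_k w_k f_k(x).
```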

Tasks in a Mixture Model
- Inference: $P(z \mid x) = \prod_i P(z_i \mid x_i)$
- Parameter estimation: find parameters that, e.g., maximize the likelihood
  - Does not decouple according to classes
  - Non-convex, many local optima

Example: Gaussian Mixture Model
- For $i = 1, \ldots, N$:
  - $Z_i \sim_{iid} \mathrm{Mult}(\pi)$
  - $X_i \mid Z_i = k \sim \mathcal{N}(x \mid \mu_k, \Sigma)$
- Inference: $P(z_i = k \mid x_i; \pi, \mu, \Sigma) = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma)}{\sum_{k'} \pi_{k'}\, \mathcal{N}(x_i \mid \mu_{k'}, \Sigma)}$, a soft-max over components
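A minimal sketch (my own, not course code) of this inference step, computing the responsibilities for a GMM with a shared covariance matrix:

```python
# Posterior inference in a GMM: gamma[i, k] = P(z_i = k | x_i) via the soft-max above.
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, cov):
    """X: (N, D) data; pi: (K,) mixing weights; mus: (K, D) means; cov: (D, D) shared covariance."""
    # Unnormalized posterior: pi_k * N(x_i | mu_k, cov)
    weighted = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=cov)
        for k in range(len(pi))
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)  # normalize over components
```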

Example: Gaussian Mixture Model (cont.)
- Log-likelihood: we do not know which training instance comes from which component
  $\ell(\theta) = \sum_i \log p(x_i) = \sum_i \log \sum_k p(z_i = k)\, p(x_i \mid z_i = k)$
- No closed-form solution for maximizing $\ell(\theta)$ (the log of a sum does not simplify)
- Possibility 1: gradient descent, etc.
- Possibility 2: Expectation Maximization
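A short sketch (not from the slides) of evaluating this log-likelihood, using log-sum-exp for numerical stability:

```python
# GMM log-likelihood: l(theta) = sum_i log sum_k pi_k N(x_i | mu_k, Sigma).
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mus, cov):
    # log_comp[i, k] = log pi_k + log N(x_i | mu_k, cov)
    log_comp = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=cov)
        for k in range(len(pi))
    ])
    return logsumexp(log_comp, axis=1).sum()
```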

Expectation Maximization Algorithm
- Observation: if the values of $Z_i$ were known, maximization would be easy
- Key idea: iterative updates
  - Given parameter estimates, "infer" all $Z_i$ variables
  - Given inferred $Z_i$ variables, maximize with respect to the parameters
- Questions: Does this converge? What does it maximize?

Expectation Maximization Algorithm (cont.)
- Complete log-likelihood:
  $\ell_c(\theta) = \sum_i \log p(x_i, z_i) = \sum_i \log \big[ p(z_i)\, p(x_i \mid z_i) \big]$
- Problem: the $z_i$ are not known
- Possible solution: replace with the conditional expectation
- Expected complete log-likelihood:
  $Q(\theta, \theta^{old}) = E\big[ \sum_i \log p(x_i, z_i) \big]$, where the expectation is with respect to $p(z \mid x, \theta^{old})$ and $\theta^{old}$ are the current parameters

Expectation Maximization Algorithm (cont.)
- Expanding the expected complete log-likelihood:
  $Q(\theta, \theta^{old}) = E\big[ \sum_i \log p(x_i, z_i) \big]$
  $= \sum_i \sum_k E\big[ I(z_i = k) \big] \log \big[ \pi_k\, p(x_i \mid \theta_k) \big]$
  $= \sum_i \sum_k p(z_i = k \mid x_i, \theta^{old}) \log \big[ \pi_k\, p(x_i \mid \theta_k) \big]$
  $= \sum_i \sum_k \gamma_{ik} \log \pi_k + \sum_i \sum_k \gamma_{ik} \log p(x_i \mid \theta_k)$
  where $\gamma_{ik} = p(z_i = k \mid x_i, \theta^{old})$ are the responsibilities
- Compare with the likelihood for a generative classifier (there, the $\gamma_{ik}$ are observed 0/1 labels)

Expectation Maximization Algorithm (cont.)
- Expectation step: update $\gamma_{ik}$ based on the current parameters
  $\gamma_{ik} = \frac{\pi_k\, p(x_i \mid \theta^{old}_k)}{\sum_{k'} \pi_{k'}\, p(x_i \mid \theta^{old}_{k'})}$
- Maximization step: maximize $Q(\theta, \theta^{old})$ with respect to the parameters
- Overall algorithm: initialize all latent variables, then alternate the M step and E step until convergence

Example: EM for GMM
- E step: the same as for any mixture model (compute the responsibilities $\gamma_{ik}$)
- M step:
  $\pi_k = \frac{\sum_i \gamma_{ik}}{N} = \frac{\gamma_k}{N}$
  $\mu_k = \frac{\sum_i \gamma_{ik} x_i}{\gamma_k}$
  $\Sigma = ?$
- Compare with the generative classifier
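To make the updates concrete, here is a minimal, self-contained EM sketch for a GMM with a shared covariance (as in the slides). It is my own illustration, not the course code; the covariance update fills in the slide's "$\Sigma = ?$" with the standard shared-covariance MLE.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                       # mixing weights
    mus = X[rng.choice(N, size=K, replace=False)]  # initialize means at random data points
    cov = np.cov(X.T) + 1e-6 * np.eye(D)           # shared covariance

    for _ in range(n_iters):
        # E step: gamma[i, k] proportional to pi_k * N(x_i | mu_k, cov)
        gamma = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=cov) for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: closed-form updates maximizing the expected complete log-likelihood
        Nk = gamma.sum(axis=0)                     # gamma_k = sum_i gamma_ik
        pi = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        diffs = X[:, None, :] - mus[None, :, :]    # (N, K, D)
        cov = np.einsum('nk,nkd,nke->de', gamma, diffs, diffs) / N
        cov += 1e-6 * np.eye(D)                    # small ridge for numerical stability
    return pi, mus, cov, gamma
```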

Analysis of EM Algorithm
- The expected complete log-likelihood is a lower bound on the log-likelihood
- EM iteratively maximizes this lower bound
- Converges to a local maximum of the log-likelihood
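For completeness (this step is not spelled out on the slide), the lower bound follows from Jensen's inequality: for any distribution $q(z_i)$ over the latent variables,

$$\log p(x_i \mid \theta) = \log \sum_{z_i} q(z_i)\, \frac{p(x_i, z_i \mid \theta)}{q(z_i)} \;\ge\; \sum_{z_i} q(z_i) \log \frac{p(x_i, z_i \mid \theta)}{q(z_i)}.$$

Choosing $q(z_i) = p(z_i \mid x_i, \theta^{old})$ makes the bound tight at $\theta = \theta^{old}$, so the following M step cannot decrease the log-likelihood.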

Bayesian / MAP Estimation
- EM with MLE can overfit
- It is possible to perform MAP estimation instead of MLE in the M step
- EM is partially Bayesian:
  - Posterior distribution over the latent variables
  - Point estimates of the parameters
- The fully Bayesian approach is called Variational Bayes
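As one illustrative example of a MAP M step (not given on the slide): with a symmetric $\mathrm{Dirichlet}(\alpha)$ prior on the mixing weights, the update for $\pi_k$ becomes

$$\pi_k = \frac{\gamma_k + \alpha - 1}{N + K(\alpha - 1)}, \qquad \gamma_k = \sum_i \gamma_{ik},$$

which keeps every $\pi_k$ away from zero when $\alpha > 1$ and reduces to the MLE update when $\alpha = 1$.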

(Lloyd's) K-Means Algorithm
- Hard EM for a Gaussian mixture model:
  - Point estimates of the parameters (as usual)
  - Point estimates of the latent variables
  - Spherical Gaussian mixture components
- Assignment step: $z_i^* = \arg\max_k p(z_i = k \mid x_i, \theta) = \arg\min_k \| x_i - \mu_k \|_2^2$
- Update step: $\mu_k = \frac{1}{N_k} \sum_{i: z_i = k} x_i$, where $N_k$ is the number of points assigned to cluster $k$
- The most popular "hard" clustering algorithm (see the sketch below)
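A minimal sketch (mine, not the course code) of Lloyd's algorithm as hard EM:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(N, size=K, replace=False)]        # initialize means at data points

    for _ in range(n_iters):
        # Hard E step: assign each point to the nearest mean (arg min ||x_i - mu_k||^2)
        dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        z = dists.argmin(axis=1)

        # Hard M step: each mean becomes the average of its assigned points
        new_mus = np.array([
            X[z == k].mean(axis=0) if np.any(z == k) else mus[k]       # keep empty clusters fixed
            for k in range(K)
        ])
        if np.allclose(new_mus, mus):
            break
        mus = new_mus
    return mus, z
```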

K-Means Problem
- Given $\{x_i\}$, find $k$ "means" $(\mu_1^*, \ldots, \mu_k^*)$ and data assignments $(z_1^*, \ldots, z_N^*)$ such that
  $(\mu^*, z^*) = \arg\min_{\mu, z} \sum_i \| x_i - \mu_{z_i} \|_2^2$
- Note: $z_i$ can equivalently be written as a $k$-dimensional one-hot (binary) vector

Model Selection: Choosing K for GMM
- Cross-validation: plot the likelihood on a training set and a validation set for increasing values of $k$
  - Likelihood on the training set keeps improving
  - Likelihood on the validation set drops after the "optimal" $k$
- Does not work for k-means! Why?
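An illustrative sketch of this procedure (not from the slides), using scikit-learn's GaussianMixture for brevity; X_train and X_val are assumed to be NumPy arrays from a train/validation split:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def choose_k(X_train, X_val, k_values=range(1, 11)):
    scores = {}
    for k in k_values:
        gm = GaussianMixture(n_components=k, random_state=0).fit(X_train)
        scores[k] = gm.score(X_val)   # mean held-out log-likelihood per sample
    return max(scores, key=scores.get), scores
```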

Principal Component Analysis: Motivation
- Dimensionality reduction:
  - Reduces the number of parameters to estimate
  - Data often resides in a much lower dimension, e.g., on a line in 3-D space
  - Provides "understanding" of the data
- Mixture models are very restricted: the latent variable is restricted to a small discrete set
- Can we "relax" the latent variable?

Classical PCA: Motivation
- Revisit k-means as a matrix factorization:
  $\min_{W, Z} J(W, Z) = \| X - W Z^T \|_F^2$
  - $W$: matrix containing the cluster means
  - $Z$: matrix containing the cluster membership vectors
- How can we relax $Z$ and $W$?

Classical PCA: Problem
- $\min_{W, Z} J(W, Z) = \| X - W Z^T \|_F^2$
  - $X$: $D \times N$ data matrix
  - $Z$: arbitrary real-valued matrix of size $N \times L$
  - $W$: orthonormal matrix of size $D \times L$

Classical PCA: Optimal Solution
- Empirical covariance matrix (of scaled and centered data): $\hat{\Sigma} = \frac{1}{N} \sum_i x_i x_i^T$
- $\hat{W} = V_L$, where $V_L$ contains the $L$ eigenvectors for the $L$ largest eigenvalues of $\hat{\Sigma}$
- $\hat{z}_i = \hat{W}^T x_i$
- Alternative solution via the Singular Value Decomposition (SVD)
- $W$ contains the "principal components" that capture the largest variance in the data
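A minimal sketch (mine) of the SVD route, following the slide's notation but with data points stored as rows:

```python
import numpy as np

def pca(X_rows, L):
    """X_rows: (N, D) data matrix, one point per row; returns (W, Z, mean)."""
    mean = X_rows.mean(axis=0)
    Xc = X_rows - mean                      # center the data
    # SVD of the centered data: Xc = U S Vt; rows of Vt are eigenvectors of the covariance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:L].T                            # (D, L) orthonormal principal directions
    Z = Xc @ W                              # (N, L) codes, z_i = W^T (x_i - mean)
    return W, Z, mean

# Reconstruction: Z @ W.T + mean approximates X_rows using L components.
```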

Probabilistic PCA
- Generative model (a latent linear model):
  $P(z_i) = \mathcal{N}(z_i \mid \mu_0, \Sigma_0)$
  $P(x_i \mid z_i) = \mathcal{N}(x_i \mid W z_i + \mu, \Psi)$
- $\Psi$ is forced to be diagonal
- With diagonal $\Psi$ this is Factor Analysis; the special case $\Psi = \sigma^2 I$ gives PCA
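A small sketch (not from the slides) of this generative process for the PCA special case, with a standard-normal latent and illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

D, L, N = 5, 2, 1000                    # observed dim, latent dim, number of samples
W = rng.normal(size=(D, L))             # hypothetical factor loading matrix
mu = rng.normal(size=D)                 # hypothetical mean offset
sigma = 0.1                             # Psi = sigma^2 I (the PCA special case)

Z = rng.normal(size=(N, L))             # z_i ~ N(0, I)
X = Z @ W.T + mu + sigma * rng.normal(size=(N, D))   # x_i ~ N(W z_i + mu, sigma^2 I)

# Marginally, Cov[x] = W W^T + sigma^2 I (compare np.cov(X.T) as a rough check),
# as discussed on the next slide.
```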

Visualization of Generative Process
- Figure from Bishop, PRML

Relationship with Gaussian Density
- Marginally, $\mathrm{Cov}[x] = W W^T + \Psi$
- Why does $\Psi$ need to be restricted?
- This is an intermediate, low-rank parameterization of a Gaussian covariance matrix, between full rank and diagonal
- Compare the number of parameters: $D(D+1)/2$ for a full covariance, $D$ for a diagonal one, and $DL + D$ for $W W^T + \Psi$ with diagonal $\Psi$

EM for PCA: Rod and Springs
- Figure from Bishop, PRML

Advantages of EM
- Simpler than gradient methods with constraints
- Handles missing data
- Provides an easy path to handling more complex models
- Not always the fastest method

Summary of Latent Variable Models
- Learning from unlabeled data
- Latent variables:
  - Discrete: clustering / mixture models; GMM
  - Continuous: dimensionality reduction; PCA
- Provide a summary / "understanding" of the data
- Expectation Maximization algorithm