SMEM Algorithm for Mixture Models


SMEM Algorithm for Mixture Models
N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton
Neural Computation, Vol. 12, No. 9, pp. 2109-2128, September 2000
Presented by Cho, Dong-Yeon

Abstract
- SMEM algorithm: a split-and-merge expectation-maximization algorithm to overcome the local-maxima problem in parameter estimation of finite mixture models.
- Performs simultaneous split-and-merge operations using a new criterion for efficiently selecting the split-and-merge candidates.
- Applied to Gaussian mixtures and mixtures of factor analyzers, on synthetic and real data, for image compression and pattern recognition problems.

Introduction
- Mixture density models
  - Normal mixtures
  - More sophisticated mixture density models: mixtures of latent variable models such as probabilistic PCA and factor analysis
  - Parameters can be estimated using the EM algorithm within the maximum likelihood framework.
  - Local maxima problem
- Deterministic annealing EM (DAEM) algorithm
  - Uses a modified posterior probability parameterized by a temperature.
  - Not very efficient at avoiding local maxima for mixture models.

Idea of Performing Split-and-Merge Operations
- A discrete move that simultaneously merges two components in an overpopulated region and splits a component in an underpopulated region.
- Applications
  - Clustering and vector quantization
  - Bayesian normal mixture analysis: split-and-merge operations combined with an MCMC method
- The proposed method is limited to mixture models.

EM Algorithm
- Data
  - Complete data Z = (X, Y)
    - X: observed data (incomplete data)
    - Y: unobserved data
- Joint probability density p(X, Y; Θ)
  - Θ: parameters of the density to be estimated
- MLE of Θ
  - Maximization of the incomplete-data log-likelihood
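The maximum-likelihood objective on this slide did not survive transcription; a standard reconstruction in the notation above (with the integral replaced by a sum when Y is discrete) is:

```latex
\mathcal{L}(\Theta; X) \;=\; \log p(X; \Theta)
  \;=\; \log \int p(X, Y; \Theta)\, dY ,
\qquad
\hat{\Theta}_{\mathrm{ML}} \;=\; \arg\max_{\Theta}\, \mathcal{L}(\Theta; X).
```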

Characteristics of the EM Algorithm
- Iteratively maximizes the expectation of the complete-data log-likelihood function.
  - E-step: compute Q(Θ | Θ^(t)), the expected complete-data log-likelihood given the current estimate Θ^(t).
  - M-step: find the Θ maximizing Q(Θ | Θ^(t)).
- The convergence of the EM steps is theoretically guaranteed.
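As a concrete illustration of the E- and M-steps above, here is a minimal NumPy sketch of one EM iteration for a Gaussian mixture; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, alphas, means, covs):
    """One EM iteration for a Gaussian mixture (illustrative sketch)."""
    N, d = X.shape
    M = len(alphas)

    # E-step: posterior responsibilities P(m | x_n; Theta^(t))
    resp = np.zeros((N, M))
    for m in range(M):
        resp[:, m] = alphas[m] * multivariate_normal.pdf(X, means[m], covs[m])
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: maximize Q(Theta | Theta^(t)) in closed form
    Nm = resp.sum(axis=0)                        # effective counts per component
    new_alphas = Nm / N
    new_means = (resp.T @ X) / Nm[:, None]
    new_covs = []
    for m in range(M):
        diff = X - new_means[m]
        cov = (resp[:, m, None] * diff).T @ diff / Nm[m]
        new_covs.append(cov + 1e-6 * np.eye(d))  # small regularizer for stability
    return new_alphas, new_means, np.array(new_covs)
```

Iterating `em_step` until the incomplete-data log-likelihood stops increasing yields the converged estimate Θ* from which SMEM starts.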

Split-and-Merge EM Algorithm
- Split-and-Merge Operations
  - The pdf of a mixture of M density models:
    - α_m: mixing proportion of the mth model (α_m ≥ 0)
    - p_m(x; θ_m): d-dimensional density model corresponding to the mth model
    - Θ = {(α_m, θ_m), m = 1, …, M}
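The mixture density itself, reconstructed in the notation just defined (the symbols were lost in transcription):

```latex
p(x; \Theta) \;=\; \sum_{m=1}^{M} \alpha_m\, p_m(x; \theta_m),
\qquad \alpha_m \ge 0, \quad \sum_{m=1}^{M} \alpha_m = 1 .
```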

After the EM algorithm has converged
- Merge models i and j to produce a model i'.
- Split model k into two models j' and k'.
- Initialization
  - Initial parameters for the merged model i' are set from those of models i and j; initial parameters for models j' and k' are set from those of model k (a sketch of one such initialization follows).
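A sketch of one plausible initialization in the two-dimensional Gaussian setting of the next slide: the merged component takes a mixing-proportion-weighted combination of the two originals, and the split pair starts from the original component with slightly perturbed means and isotropic covariances of matching volume. The exact formulas on the original slide may differ; the helper names are mine.

```python
import numpy as np

def init_merge(alpha_i, mu_i, cov_i, alpha_j, mu_j, cov_j):
    """Initial parameters for the merged component i' (weighted combination)."""
    alpha = alpha_i + alpha_j
    mu = (alpha_i * mu_i + alpha_j * mu_j) / alpha
    cov = (alpha_i * cov_i + alpha_j * cov_j) / alpha
    return alpha, mu, cov

def init_split(alpha_k, mu_k, cov_k, eps=0.01, rng=None):
    """Initial parameters for the two components j', k' obtained by splitting k."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(mu_k)
    alpha = alpha_k / 2.0
    # Isotropic covariance with roughly the same volume as the original component
    scale = np.linalg.det(cov_k) ** (1.0 / d)
    cov = scale * np.eye(d)
    # Means: the original mean perturbed in opposite directions
    delta = eps * rng.standard_normal(d)
    return (alpha, mu_k + delta, cov), (alpha, mu_k - delta, cov.copy())
```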

Figure: an example of initialization in a two-dimensional Gaussian case.

Partial EM Steps
- A modified posterior probability is used so that we can reestimate the parameters of models i', j', and k' consistently without affecting the other models.
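The modified posterior on this slide was also lost in transcription. A reconstruction consistent with the statement that the other models are unaffected is to renormalize the responsibilities of i', j', and k' so that they share exactly the posterior mass previously held by i, j, and k:

```latex
P(m \mid x;\, \Theta) \;=\;
\frac{\alpha_m\, p_m(x; \theta_m)}
     {\sum_{m' \in \{i', j', k'\}} \alpha_{m'}\, p_{m'}(x; \theta_{m'})}
\;\cdot\; \sum_{l \in \{i, j, k\}} P(l \mid x;\, \Theta^{*}),
\qquad m \in \{i', j', k'\}.
```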

SMEM Algorithm

- Cmax = M(M−1)(M−2)/2 is the total number of split-and-merge candidates.
- It has been found experimentally that Cmax ≈ 5 may be enough, because the split-and-merge criteria work well.
- The SMEM algorithm monotonically increases the Q function value. If the Q function value does not increase for all c = 1, …, Cmax, then the algorithm stops.
- The total number of mixture components is unchanged.
- The global convergence properties of the EM algorithm are maintained.
- Intuitively, a simultaneous split-and-merge can be viewed as a way of tunneling through low-likelihood barriers, thereby eliminating many poor local optima.
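Putting the pieces together, a structural sketch of the SMEM loop described above; `full_em`, `partial_em`, `q_value`, `ranked_candidates`, and `init_split_merge` are hypothetical helpers standing in for the steps on the preceding slides, not functions from the paper.

```python
def smem(X, theta, C_max=5):
    """Illustrative SMEM outer loop (all helper functions are assumed to exist)."""
    theta = full_em(X, theta)                 # usual EM until convergence
    Q_best = q_value(X, theta)
    while True:
        improved = False
        # ranked (i, j, k) triples, best candidates first
        for (i, j, k) in ranked_candidates(X, theta)[:C_max]:
            cand = init_split_merge(theta, i, j, k)        # merge i, j; split k
            cand = partial_em(X, cand, components=(i, j, k))
            cand = full_em(X, cand)                        # full EM on all components
            if q_value(X, cand) > Q_best:                  # accept only if Q improves
                theta, Q_best, improved = cand, q_value(X, cand), True
                break                                      # re-rank candidates and repeat
        if not improved:                                   # no candidate helped: stop
            return theta
```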

Split-and-Merge Criteria
- Merge criterion
  - When there are many data points, each of which has almost equal posterior probability for two components, these two components are candidates to be merged.
  - Two components with a large merge criterion value are good candidates for a merge.
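A small sketch of this criterion as the inner product of the two components' posterior vectors over the data, which realizes the "almost equal posterior probability" intuition above; `resp` is the converged posterior matrix as in the EM sketch earlier.

```python
import numpy as np

def merge_criterion(resp):
    """J_merge(i, j) = <P_i, P_j>, where P_m = (P(m | x_1), ..., P(m | x_N)).

    resp: (N, M) posterior matrix from a converged EM run.
    Returns an (M, M) matrix; large off-diagonal entries mark good merge pairs.
    """
    return resp.T @ resp
```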

- Split criterion
  - Local Kullback divergence: the distance between the local data density around the kth model and the density of the kth model specified by the current parameter estimate.
  - When the weights are equal, that is, P(k | x; Θ*) = 1/M, the split criterion can be viewed as a likelihood ratio test.
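A sketch of estimating the local Kullback divergence over the data: the local density around component k is approximated by the normalized responsibilities and compared with the component's own density at the same points. The smoothing constant `eps` is mine, added for numerical safety.

```python
import numpy as np

def split_criterion(resp_k, pdf_k, eps=1e-12):
    """Local KL divergence between the empirical density around component k
    and the component's own density p_k(x; theta_k*).

    resp_k: (N,) posteriors P(k | x_n; Theta*) from converged EM.
    pdf_k:  (N,) values of p_k(x_n; theta_k*) at the same data points.
    """
    f_k = resp_k / resp_k.sum()       # empirical local weights around component k
    return float(np.sum(f_k * (np.log(f_k + eps) - np.log(pdf_k + eps))))
```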

Sorting Candidates
- The merge candidates are sorted by the merge criterion.
- For each sorted merge candidate {i, j}_c, the split candidates, excluding i and j, are sorted as {k}_c.
- Combining these results and renumbering them yields the candidates {i, j, k}_c, c = 1, …, M(M−1)(M−2)/2.
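A short sketch of this ordering, taking the outputs of the two criteria sketched above and enumerating all M(M−1)(M−2)/2 triples in the order described:

```python
import itertools

def sorted_candidates(J_merge, J_split):
    """Enumerate split-and-merge triples {i, j, k}_c, best candidates first.

    J_merge: (M, M) merge-criterion matrix; J_split: length-M split-criterion values.
    """
    M = len(J_split)
    pairs = sorted(itertools.combinations(range(M), 2),
                   key=lambda ij: J_merge[ij[0], ij[1]], reverse=True)
    candidates = []
    for i, j in pairs:
        splits = sorted((m for m in range(M) if m not in (i, j)),
                        key=lambda m: J_split[m], reverse=True)
        candidates.extend((i, j, k) for k in splits)
    return candidates     # length M(M-1)(M-2)/2
```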

Application to Density Estimation by Mixture of Gaussians
- Synthetic data
  - Mixture of Gaussians: mean vector and covariance matrix
  - The split-and-merge operations not only appropriately assign the number of Gaussians in a local data space, but can also improve the Gaussian parameters themselves.


Real data
- Facial images processed into 20-dimensional feature vectors
- Data size: 10³ for training, 10³ for test
- 10 different initializations using the k-means clustering algorithm
- M = 5 and a diagonal covariance matrix for each Gaussian
- Table: log-likelihood / sample size

Trajectories of log-likelihood
- The successive split-and-merge operations improved the log-likelihood for both the training and test data.

The Number of EM Steps
- The count includes not only partial and full EM steps for accepted operations, but also EM steps for rejected ones.
- SMEM was about 8.7 times slower than the original EM algorithm.
- The average rank of the accepted split-and-merge candidates was 1.8 (std = 0.9), which indicates that the proposed split-and-merge criteria worked very well.

Application to Dimensionality Reduction Using Mixture of Factor Analyzers
- Factor analyzers
  - In a single factor analyzer (FA), an observed p-dimensional variable x is generated as a linear transformation of a lower q-dimensional latent variable z ~ N(0, I), plus additive Gaussian noise v ~ N(0, Ψ), where Ψ is a diagonal matrix.
  - Factor loading matrix: W ∈ R^{p×q}
  - Mean vector: μ
  - The pdf of the observed data under an FA model is given below.
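The FA density, reconstructed from the generative model just described (this is the standard factor-analysis marginal):

```latex
x = W z + \mu + v, \quad z \sim \mathcal{N}(0, I_q), \quad v \sim \mathcal{N}(0, \Psi)
\;\;\Longrightarrow\;\;
p(x) = \mathcal{N}\!\left(x \mid \mu,\; W W^{\top} + \Psi\right).
```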

Mixture of Factor Analyzers
- A mixture of M FAs
  - The MFA model can perform clustering and dimensionality reduction simultaneously.
- Complete-data log-likelihood
- The SMEM algorithm is straightforwardly applicable to the parameter estimation of the MFA model.
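The corresponding mixture-of-factor-analyzers density, written here with a shared noise covariance Ψ as in the usual MFA formulation; the complete-data log-likelihood additionally involves the latent factors and component indicators.

```latex
p(x; \Theta) \;=\; \sum_{m=1}^{M} \alpha_m\,
\mathcal{N}\!\left(x \mid \mu_m,\; W_m W_m^{\top} + \Psi\right).
```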

Demonstration

Practical Applications
- Image compression
  - An MFA model can be used for block transform image coding.

© 2001 SNU CSE Biointelligence Lab 15.8103 10.1103 7.3103 © 2001 SNU CSE Biointelligence Lab

Application to Pattern Recognition
- Once an MFA model is fitted to each class, we can compute the posterior probability of each class for a data point.
- Optimal class i* for x:
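The decision rule itself was lost in transcription; with an MFA density p(x; Θ_i) fitted to each class ω_i, the standard Bayes rule the slide refers to is:

```latex
i^{*} \;=\; \arg\max_{i}\; P(\omega_i \mid x)
      \;=\; \arg\max_{i}\; P(\omega_i)\, p(x;\, \Theta_i).
```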

Digit recognition task (10 classes)
- 16-dimensional data (Glucksman's features)
- Data size: 200 per class for training and 200 per class for test
- 3NN: 88.3%
- SS (CLAFIC)

Conclusion
- Simultaneous split-and-merge operations
  - A way of tunneling through low-likelihood barriers, thereby eliminating many non-global optima.
  - The SMEM algorithm outperforms the standard EM algorithm and can therefore be very useful in practice.
  - Applicable to a wide variety of mixture models.
- Future work
  - By introducing probability measures over models, the split-and-merge operations could also be used to determine the appropriate number of components within a Bayesian framework.