Ch 12. Continuous Latent Variables 12.2 ~ 12.4


Ch 12. Continuous Latent Variables 12.2 ~ 12.4. Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Summarized by Rhee, Je-Keun. Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

Contents: 12.2 Probabilistic PCA; 12.3 Kernel PCA; 12.4 Nonlinear Latent Variable Models.

Probabilistic PCA. PCA can be expressed naturally as the maximum likelihood solution to a particular form of linear-Gaussian latent variable model. This probabilistic reformulation brings many advantages: the use of EM for parameter estimation, principled extensions to mixtures of PCA models, Bayesian formulations that allow the number of principal components to be determined automatically from the data, and so on.

Probabilistic PCA is a simple example of the linear-Gaussian framework. It can be formulated by first introducing an explicit latent variable z corresponding to the principal-component subspace: a prior distribution over z, and the conditional distribution of the observed variable x conditioned on the value of the latent variable z (both restated below).
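
For reference, the corresponding distributions from PRML Section 12.2 (shown as images on the original slide) are

p(z) = \mathcal{N}(z \mid 0, I)
p(x \mid z) = \mathcal{N}(x \mid W z + \mu, \sigma^2 I)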

The D-dimensional observed variable x is defined by a linear transformation of the M-dimensional latent variable z plus additive Gaussian 'noise' (see below). This framework is based on a mapping from latent space to data space, in contrast to the more conventional view of PCA.
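
Equivalently, in generative form (PRML equation 12.33),

x = W z + \mu + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I)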

Determining the values of the parameters W, μ and σ^2 using maximum likelihood: the marginal distribution p(x) is again Gaussian, with mean and covariance given below.
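
The marginal distribution, restated from PRML,

p(x) = \mathcal{N}(x \mid \mu, C), \qquad C = W W^T + \sigma^2 I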

When we evaluate the predictive distribution, we require C^-1, which involves the inversion of a D x D matrix. Because we invert the M x M matrix M rather than inverting C directly, the cost of evaluating C^-1 is reduced from O(D^3) to O(M^3). As well as the predictive distribution p(x), we will also require the posterior distribution p(z|x); the relevant expressions are restated below.
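
The relevant expressions, restated from PRML,

M = W^T W + \sigma^2 I
C^{-1} = \sigma^{-2} I - \sigma^{-2} W M^{-1} W^T
p(z \mid x) = \mathcal{N}\left(z \mid M^{-1} W^T (x - \mu), \; \sigma^2 M^{-1}\right)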

Maximum likelihood PCA: the log-likelihood function (given below). The log-likelihood is a quadratic function of μ, so its maximum with respect to μ is the sample mean; this solution represents the unique maximum, as can be confirmed by computing second derivatives.
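
The log-likelihood and its maximizing mean, restated from PRML,

\ln p(X \mid \mu, W, \sigma^2) = -\frac{N D}{2}\ln(2\pi) - \frac{N}{2}\ln|C| - \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^T C^{-1} (x_n - \mu)
\mu_{\mathrm{ML}} = \bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n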

Maximization with respect to W and σ^2 is more complex but nonetheless has an exact closed-form solution. All of the stationary points of the log likelihood function can be written in the form given below, where U_M is a D x M matrix whose columns are given by any subset (of size M) of the eigenvectors of the data covariance matrix S, L_M is an M x M diagonal matrix whose elements are the corresponding eigenvalues λ_i, and R is an arbitrary M x M orthogonal matrix.
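
The stationary points and the noise variance, restated from PRML; the global maximum is obtained when U_M holds the M eigenvectors with the largest eigenvalues,

W_{\mathrm{ML}} = U_M (L_M - \sigma^2 I)^{1/2} R
\sigma^2_{\mathrm{ML}} = \frac{1}{D - M}\sum_{i=M+1}^{D} \lambda_i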

Conventional PCA is generally formulated as a projection of points from the D-dimensional data space onto an M-dimensional linear subspace. Probabilistic PCA, however, is most naturally expressed as a mapping from the latent space into the data space. We can reverse this mapping using Bayes' theorem.

EM algorithms for PCA: using the EM algorithm to find maximum likelihood estimates of the model parameters. This may seem rather pointless because we have already obtained an exact closed-form solution for the maximum likelihood parameter values. However, in spaces of high dimensionality there may be computational advantages in using an iterative EM procedure rather than working directly with the sample covariance matrix. This EM procedure can also be extended to the factor analysis model, for which there is no closed-form solution, and it allows missing data to be handled in a principled way.

E step and M step for probabilistic PCA; the update equations are restated below.
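
The EM updates for probabilistic PCA, restated from PRML (equations 12.54 to 12.57),

E step:
\mathbb{E}[z_n] = M^{-1} W^T (x_n - \bar{x})
\mathbb{E}[z_n z_n^T] = \sigma^2 M^{-1} + \mathbb{E}[z_n]\mathbb{E}[z_n]^T

M step:
W_{\mathrm{new}} = \Big[\sum_n (x_n - \bar{x})\mathbb{E}[z_n]^T\Big]\Big[\sum_n \mathbb{E}[z_n z_n^T]\Big]^{-1}
\sigma^2_{\mathrm{new}} = \frac{1}{N D}\sum_n \Big\{ \|x_n - \bar{x}\|^2 - 2\,\mathbb{E}[z_n]^T W_{\mathrm{new}}^T (x_n - \bar{x}) + \mathrm{Tr}\big(\mathbb{E}[z_n z_n^T] W_{\mathrm{new}}^T W_{\mathrm{new}}\big) \Big\}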

Benefits of the EM algorithm for PCA: computational efficiency for large-scale applications, and dealing with missing data.

Benefits of the EM algorithm for PCA: another elegant feature of the EM approach is that we can take the limit σ^2 → 0, corresponding to standard PCA, and still obtain a valid EM-like algorithm (E step and M step; a sketch follows).
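
A minimal NumPy sketch of this EM scheme in the σ^2 → 0 limit (PRML equations 12.58 and 12.59); the function name em_pca and its defaults are illustrative, not from the slides.

import numpy as np

def em_pca(X, M, n_iter=100, seed=0):
    """EM iterations for conventional PCA (the sigma^2 -> 0 limit of probabilistic PCA).
    X is an (N, D) data matrix and M the dimensionality of the principal subspace.
    Returns a D x M matrix W whose columns span the principal subspace
    (not necessarily orthonormal or ordered by eigenvalue)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                    # centred data; row n is (x_n - x_bar)
    W = rng.standard_normal((X.shape[1], M))   # random D x M initialisation
    for _ in range(n_iter):
        # E step: Omega = (W^T W)^{-1} W^T Xc^T, the M x N matrix of projections
        Omega = np.linalg.solve(W.T @ W, W.T @ Xc.T)
        # M step: W_new = Xc^T Omega^T (Omega Omega^T)^{-1}
        W = Xc.T @ Omega.T @ np.linalg.inv(Omega @ Omega.T)
    return W

# Example: recover a 2-dimensional principal subspace from 5-dimensional data
# W = em_pca(np.random.default_rng(1).standard_normal((200, 5)), M=2)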

Model selection. In practice, we must choose a value M for the dimensionality of the principal subspace according to the application. Since the probabilistic PCA model has a well-defined likelihood function, we could employ cross-validation to determine the dimensionality by selecting the value of M that gives the largest log likelihood on a validation data set. However, this approach can become computationally costly. Given that we have a probabilistic formulation of PCA, it seems natural to seek a Bayesian approach to model selection.

Bayesian PCA. Specifically, we define an independent Gaussian prior over each column of W; the columns represent the vectors defining the principal subspace. Each Gaussian has an independent variance governed by a precision hyperparameter α_i. The effective dimensionality of the principal subspace is determined by the number of finite α_i values, and the corresponding vectors w_i can be thought of as 'relevant' for modeling the data distribution.
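
The column-wise prior over W and the precision re-estimation, restated from PRML Section 12.2.3,

p(W \mid \alpha) = \prod_{i=1}^{M} \left(\frac{\alpha_i}{2\pi}\right)^{D/2} \exp\left\{-\frac{1}{2}\alpha_i w_i^T w_i\right\}
\alpha_i^{\mathrm{new}} = \frac{D}{w_i^T w_i}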

Bayesian PCA. E step: equations (12.54) and (12.55), as in probabilistic PCA. M step: the W update is modified by the prior (restated below).
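
The modified M-step update for W in Bayesian PCA, restated from PRML, with A = diag(α_i),

W_{\mathrm{new}} = \Big[\sum_n (x_n - \bar{x})\mathbb{E}[z_n]^T\Big]\Big[\sum_n \mathbb{E}[z_n z_n^T] + \sigma^2 A\Big]^{-1}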

Factor analysis. Factor analysis is a linear-Gaussian latent variable model that is closely related to probabilistic PCA. Its definition differs from that of probabilistic PCA only in that the conditional distribution of the observed variable x given the latent variable z is taken to have a diagonal covariance Ψ (a D x D diagonal matrix) rather than an isotropic one; the model is restated below.
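
The factor analysis model, restated from PRML Section 12.2.4,

p(x \mid z) = \mathcal{N}(x \mid W z + \mu, \Psi)
p(x) = \mathcal{N}(x \mid \mu, C), \qquad C = W W^T + \Psi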

Links between FA and PCA. At the stationary points of the likelihood function, for a factor analysis model with Ψ = σ^2 I, the columns of W are scaled eigenvectors of the sample covariance matrix, and σ^2 is the average of the discarded eigenvalues. The maximum of the log likelihood function occurs when the eigenvectors comprising W are chosen to be the principal eigenvectors. Unlike probabilistic PCA, there is no longer a closed-form maximum likelihood solution for W, which must therefore be found iteratively.

Factor analysis: E step and M step; the update equations are restated below.
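
The EM updates for factor analysis, restated from PRML, with G = (I + W^T \Psi^{-1} W)^{-1},

E step:
\mathbb{E}[z_n] = G W^T \Psi^{-1} (x_n - \bar{x})
\mathbb{E}[z_n z_n^T] = G + \mathbb{E}[z_n]\mathbb{E}[z_n]^T

M step:
W_{\mathrm{new}} = \Big[\sum_n (x_n - \bar{x})\mathbb{E}[z_n]^T\Big]\Big[\sum_n \mathbb{E}[z_n z_n^T]\Big]^{-1}
\Psi_{\mathrm{new}} = \mathrm{diag}\Big\{ S - W_{\mathrm{new}} \frac{1}{N}\sum_n \mathbb{E}[z_n](x_n - \bar{x})^T \Big\}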

Kernel PCA: applying the technique of kernel substitution to principal component analysis to obtain a nonlinear generalization.

Conventional PCA vs. kernel PCA. Conventional PCA diagonalizes the data covariance matrix (assume that the mean of the vectors x_n is zero); kernel PCA instead applies a nonlinear transformation φ(x) into an M-dimensional feature space and performs PCA there (assume for the moment that the projected data set also has zero mean). The two eigenvector problems are restated below.
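
The two eigenvector problems, restated from PRML Section 12.3,

Conventional PCA (zero-mean data):
S u_i = \lambda_i u_i, \qquad S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T

Kernel PCA in feature space:
C v_i = \lambda_i v_i, \qquad C = \frac{1}{N}\sum_{n=1}^{N} \phi(x_n)\phi(x_n)^T, \qquad v_i = \sum_{n=1}^{N} a_{in}\phi(x_n)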

Solving the eigenvalue problem: expressing it in terms of the kernel function k(x_n, x_m) in matrix notation gives the eigenvector equation below.
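
With the Gram matrix K, where K_{nm} = k(x_n, x_m) = \phi(x_n)^T \phi(x_m), the eigenvector equation restated from PRML is

K a_i = \lambda_i N a_i

Strictly, the substitution first yields K^2 a_i = \lambda_i N K a_i; the solutions that matter for the projections can be obtained from the simpler equation above.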

The normalization condition for the coefficients a_i is obtained by requiring that the eigenvectors in feature space be normalized. Having solved the eigenvector problem, the resulting principal component projections can then also be cast in terms of the kernel function, so that the projection of a point x onto eigenvector i is given by the expression below.
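
Restated from PRML,

\lambda_i N\, a_i^T a_i = 1
y_i(x) = \phi(x)^T v_i = \sum_{n=1}^{N} a_{in}\, k(x, x_n)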

In the original D-dimensional x space there are D orthogonal eigenvectors, and hence at most D linear principal components. In the M-dimensional feature space we can find a number of nonlinear principal components that can exceed D. However, the number of nonzero eigenvalues cannot exceed the number N of data points, because (even if M > N) the covariance matrix in feature space has rank at most equal to N.

In the general case, the projected data set in feature space does not have zero mean, so the Gram matrix must be centred before solving the eigenvalue problem; 1_N denotes the N x N matrix in which every element takes the value 1/N. A sketch of the full procedure follows.
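
A minimal NumPy sketch of kernel PCA including this centring step; the helper names kernel_pca and rbf are illustrative, not from the slides.

import numpy as np

def kernel_pca(X, kernel, n_components):
    """Kernel PCA on the training set X with kernel function k(a, b).
    Returns the projections of the training points onto the first
    n_components nonlinear principal components."""
    N = X.shape[0]
    # Gram matrix K_{nm} = k(x_n, x_m)
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    # Centre in feature space: K~ = K - 1_N K - K 1_N + 1_N K 1_N,
    # where 1_N is the N x N matrix in which every element is 1/N
    one_N = np.full((N, N), 1.0 / N)
    K_tilde = K - one_N @ K - K @ one_N + one_N @ K @ one_N
    # Solve K~ a_i = lambda_i N a_i (symmetric eigenproblem, ascending order)
    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]      # descending order
    a = eigvecs[:, :n_components]
    lam_N = np.maximum(eigvals[:n_components], 1e-12)       # equals lambda_i * N
    # Normalisation lambda_i N a_i^T a_i = 1 (eigh returns unit-norm columns)
    a = a / np.sqrt(lam_N)
    # Projections of the training points: y_i(x_n) = sum_m a_{im} k~(x_n, x_m)
    return K_tilde @ a

# Example usage with a Gaussian kernel (bandwidth chosen arbitrarily)
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)
# Y = kernel_pca(np.random.default_rng(0).standard_normal((100, 3)), rbf, 2)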

Nonlinear Latent Variable Models. Some generalizations of the continuous latent variable framework to models that are either nonlinear or non-Gaussian, or both: independent component analysis (ICA), autoassociative neural networks, and others.

Independent component analysis. The observed variables are related linearly to the latent variables, but the latent distribution is non-Gaussian. Example: blind source separation.

Autoassociative neural networks. Neural networks have been applied to unsupervised learning, where they have been used for dimensionality reduction. This is achieved by using a network having the same number of outputs as inputs, and optimizing the weights so as to minimize some measure of the reconstruction error between inputs and outputs with respect to a set of training data. The vectors of weights which lead into the hidden units form a basis set which spans the principal subspace. These vectors need not be orthogonal or normalized.

Addition of an extra hidden layer.

Modeling nonlinear manifolds. One way to model the nonlinear structure is through a combination of linear models, so that we make a piecewise-linear approximation to the manifold. An alternative to considering a mixture of linear models is to consider a single nonlinear model. Conventional PCA finds a linear subspace that passes close to the data in a least-squares sense. This concept can be extended to one-dimensional nonlinear surfaces in the form of principal curves. Principal curves can be generalized to multidimensional manifolds called principal surfaces.

Multidimensional scaling (MDS), locally linear embedding (LLE), and isometric feature mapping (Isomap).

Models having continuous latent variables together with discrete observed variables: latent trait models and the density network.

Generative topographic mapping (GTM). It uses a latent distribution that is defined by a finite regular grid of delta functions over the (typically two-dimensional) latent space.

Self-organizing map (SOM). The GTM can be seen as a probabilistic version of an earlier model, the SOM. The SOM represents a two-dimensional nonlinear manifold as a regular array of discrete points.