Dimension reduction (1)


Dimension reduction (1): Overview, PCA, Factor Analysis, EDR space, SIR. References: Applied Multivariate Analysis; http://www.stat.ucla.edu/~kcli/sir-PHD.pdf

Overview The purposes of dimension reduction: data simplification; data visualization; noise reduction (if we can assume only the dominating dimensions are signals); variable selection for prediction.

Overview An analogy. When an outcome variable y exists (learning the association rule), data separation is done by classification and regression, and dimension reduction by SIR, class-preserving projection, and partial least squares. When there is no outcome variable (learning intrinsic structure), data separation is done by clustering, and dimension reduction by PCA, MDS, factor analysis, ICA, NCA, ...

PCA Explain the variance-covariance structure among a set of random variables by a few linear combinations of the variables; does not require normality!

PCA

PCA

Reminder of some results for random vectors

Reminder of some results for random vectors Proof of the first (and second) point of the previous slide.

PCA The eigenvalues are the variance components: the kth PC has variance λk. Proportion of total variance explained by the kth PC: λk / (λ1 + λ2 + ... + λp).
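As a concrete sketch (illustrative, not from the slides), PCA can be computed by eigendecomposition of the sample covariance matrix; the toy data and variable names below are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # toy data, n x p

# Eigendecomposition of the sample covariance matrix
S = np.cov(X, rowvar=False)
lam, vecs = np.linalg.eigh(S)          # eigh returns ascending order
lam, vecs = lam[::-1], vecs[:, ::-1]   # sort descending

# Eigenvalues are the variance components of the PCs
prop_explained = lam / lam.sum()
print(prop_explained, np.cumsum(prop_explained))

# Scores: project the centered data onto the eigenvectors
scores = (X - X.mean(axis=0)) @ vecs
print(np.allclose(scores.var(axis=0, ddof=1), lam))  # PC variances match the eigenvalues
```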

PCA

PCA The geometrical interpretation of PCA:

PCA PCA using the correlation matrix instead of the covariance matrix? This is equivalent to first standardizing all the X variables.

PCA Using the correlation matrix avoids one X variable dominating due to scaling (unit changes), for example measuring in inches instead of feet. Example:
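A minimal numpy check of the equivalence claimed above (variable names and toy scales are illustrative): PCA on the correlation matrix gives the same eigenvalues as PCA on the covariance matrix of the standardized variables.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) * np.array([1.0, 12.0, 0.01])  # wildly different scales

# PCA on the correlation matrix ...
lam_corr, _ = np.linalg.eigh(np.corrcoef(X, rowvar=False))

# ... equals PCA on the covariance matrix of the standardized variables
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
lam_std, _ = np.linalg.eigh(np.cov(Z, rowvar=False))

print(np.allclose(np.sort(lam_corr), np.sort(lam_std)))  # True
```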

PCA Selecting the number of components? Base the choice on the eigenvalues (% variation explained). Assumption: the small amount of variation explained by the low-ranked PCs is noise.

Factor Analysis If we take the first several PCs that explain most of the variation in the data, we have one form of factor model: X − μ = L F + ε. L: loading matrix. F: unobserved random vector (latent variables, the factors). ε: unobserved random vector (noise).
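A minimal sketch of this PCA-based ("principal component") factor extraction, assuming m factors are retained: loadings are the leading eigenvectors scaled by the square roots of their eigenvalues. The toy data and the choice m = 2 are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
X[:, :3] += rng.normal(size=(300, 1))   # induce a common factor among the first 3 variables

S = np.cov(X, rowvar=False)
lam, vecs = np.linalg.eigh(S)
lam, vecs = lam[::-1], vecs[:, ::-1]    # descending eigenvalues

m = 2                                   # number of factors retained (assumed)
L = vecs[:, :m] * np.sqrt(lam[:m])      # loading matrix (p x m)
Psi = np.diag(np.diag(S - L @ L.T))     # specific variances: diagonal of the residual

print(np.round(L, 2))
print(np.round(np.diag(Psi), 2))
```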

Factor Analysis The orthogonal factor model assumes no correlation between the factor RVs: E(F) = 0, Cov(F) = I, E(ε) = 0, and Cov(ε) = Ψ is a diagonal matrix. It implies Cov(X) = L L' + Ψ.

Factor Analysis

Factor Analysis Rotations in the m-dimensional subspace defined by the factors make the solution non-unique. PCA provides one unique solution, as the vectors are selected sequentially. Maximum likelihood estimation is another solution.

Factor Analysis As we said, rotations within the m-dimensional subspace don't change the overall amount of variation explained. Rotate to make the results more interpretable:

Factor Analysis Varimax criterion: find the rotation T such that V is maximized, where V is proportional to the sum of the variances of the squared loadings. Maximizing V makes the squared loadings as spread out as possible --- some become very small, and some very large.
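A commonly used numerical recipe for varimax rotation, via repeated SVD steps, as a sketch (the exact criterion shown on the slide was an image; the function below implements Kaiser's varimax in a standard way, with illustrative names).

```python
import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-6):
    """Rotate a loading matrix L (p x m) to approximately maximize the varimax criterion."""
    p, m = L.shape
    R = np.eye(m)
    var_old = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag((Lr ** 2).sum(axis=0)))
        )
        R = u @ vt                      # best orthogonal rotation for the current step
        var_new = s.sum()
        if var_new < var_old * (1 + tol):
            break                       # criterion stopped improving
        var_old = var_new
    return L @ R, R

# usage (assuming L comes from the factor extraction step above):
# L_rotated, R = varimax(L)
```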

Factor Analysis Orthogonal simple factor rotation: Rotate the orthogonal factors around the origin until the system is maximally aligned with the separate clusters of variables. Oblique Simple Structure Rotation: Allow the factors to become correlated. Each factor is rotated individually to fit a cluster.

MDS Multidimensional scaling is a dimension reduction procedure that maps the distances between observations into a lower-dimensional space. Minimize an objective (stress) function that compares D, the distances in the original space, with d, the distances in the reduced-dimension space. A numerical method is used for the minimization.
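A sketch of the numerical minimization, assuming the simplest stress form (sum of squared differences between D and d); the exact objective on the slide was an image, and the data, target dimension k = 2, and optimizer choice are assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))            # original high-dimensional data
D = pdist(X)                            # pairwise distances in the original space
n, k = X.shape[0], 2                    # target dimension

def stress(z_flat):
    d = pdist(z_flat.reshape(n, k))     # distances in the reduced-dimension space
    return np.sum((D - d) ** 2)

z0 = rng.normal(size=n * k)             # random starting configuration
res = minimize(stress, z0, method="L-BFGS-B")
Z = res.x.reshape(n, k)                 # 2-D configuration approximating the distances
```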

EDR space Now we start talking about regression. The data are {xi, yi}. Is dimension reduction on the X matrix alone helpful here? Possibly, if the dimension reduction preserves the essential structure of Y|X, but there is no guarantee that an unsupervised reduction will do so. Effective Dimension Reduction: reduce the dimension of X without losing information that is essential for predicting Y.

EDR space The model: Y = g(β1'X, β2'X, ..., βK'X, ε), i.e. Y is predicted through a set of linear combinations of X. If g() is known, this is not very different from a generalized linear model. For dimension reduction purposes, is there a scheme that works for almost any g(), without knowledge of its actual form?

EDR space The general model encompasses many models as special cases:

EDR space Under this general model, the space B generated by β1, β2, ..., βK is called the e.d.r. space. Reducing to this subspace causes no loss of information for predicting Y. As in factor analysis, the subspace B is identifiable, but the individual vectors aren't. Any non-zero vector in the e.d.r. space is called an e.d.r. direction.

EDR space This equation assumes almost the weakest form, to reflect the hope that a low-dimensional projection of a high-dimensional regressor variable contains most of the information that can be gathered from a sample of modest size. It doesn't impose any structure on how the projected regressor variables affect the output variable. Most regression models assume K=1, plus additional structure on g().

EDR space The philosophical point of Sliced Inverse Regression: estimating the projection directions can be a more important statistical issue than estimating the structure of g() itself. After finding a good e.d.r. space, we can project the data onto this smaller space. Then we are in a better position to identify what should be pursued further: model building, response surface estimation, cluster analysis, heteroscedasticity analysis, variable selection, ...

SIR Sliced Inverse Regression. In regular regression, our interest is the conditional density h(Y|X); most important are E(Y|x) and var(Y|x). SIR treats Y as the independent variable and X as the dependent variable. Given Y=y, what values will X take? This takes us from a p-dimensional problem (subject to the curse of dimensionality) back to p one-dimensional curve-fitting problems: E(xi|y), i=1,..., p.

SIR

SIR

SIR Let Σ̂η be the covariance matrix of the slice means of x, weighted by the slice sizes, and Σ̂x the sample covariance matrix of the xi's. Find the SIR directions by conducting the eigenvalue decomposition of Σ̂η with respect to Σ̂x: Σ̂η b = λ Σ̂x b.
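A minimal sketch of this algorithm: slice on y, form the weighted covariance of the slice means, and solve the generalized eigenproblem against the sample covariance of x. The single-index data-generating model, the choice of 10 equal-size slices, and all variable names are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(4)
n, p, H = 500, 6, 10                      # sample size, dimension, number of slices
beta = np.array([1.0, -1.0, 0.5, 0, 0, 0])
X = rng.normal(size=(n, p))
y = (X @ beta) ** 3 + rng.normal(size=n)  # y depends on X only through beta'X

Sigma_x = np.cov(X, rowvar=False)         # sample covariance of the x_i's
xbar = X.mean(axis=0)

# Slice on y and compute the slice means of x, weighted by the slice sizes
order = np.argsort(y)
Sigma_eta = np.zeros((p, p))
for idx in np.array_split(order, H):
    m = X[idx].mean(axis=0) - xbar
    Sigma_eta += (len(idx) / n) * np.outer(m, m)

# Eigenvalue decomposition of Sigma_eta with respect to Sigma_x
vals, vecs = eigh(Sigma_eta, Sigma_x)     # ascending order
sir_dir = vecs[:, -1]                     # leading SIR direction
print(np.round(sir_dir / np.linalg.norm(sir_dir), 2))  # roughly proportional to beta, up to sign
```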

SIR An example response surface found by SIR.

SIR and LDA Reminder: Fisher's linear discriminant analysis seeks a projection direction that maximizes class separation. When the underlying distributions are Gaussian (with a common covariance), it agrees with the Bayes decision rule. It seeks to maximize the ratio of the between-group variance a'ΣB a to the within-group variance a'ΣW a.

SIR and LDA The solution is the first eigenvector in the eigenvalue decomposition of ΣB with respect to ΣW. If we let the slices be the classes (treating the class label as y), LDA agrees with SIR up to a scaling.
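A sketch of the two-class Fisher direction as the leading vector of the generalized eigenproblem ΣB a = λ ΣW a; the simulated classes and variable names are assumptions for the example.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(5)
X0 = rng.normal(loc=[0, 0, 0], size=(100, 3))
X1 = rng.normal(loc=[2, 1, 0], size=(100, 3))
X, y = np.vstack([X0, X1]), np.r_[np.zeros(100), np.ones(100)]

mu = X.mean(axis=0)
Sw = np.zeros((3, 3))
Sb = np.zeros((3, 3))
for c in (0, 1):
    Xc = X[y == c]
    Sw += (len(Xc) - 1) * np.cov(Xc, rowvar=False)   # within-group scatter
    d = Xc.mean(axis=0) - mu
    Sb += len(Xc) * np.outer(d, d)                   # between-group scatter

vals, vecs = eigh(Sb, Sw)      # generalized eigenproblem Sb a = lambda Sw a
w = vecs[:, -1]                # Fisher discriminant direction (largest eigenvalue)
print(np.round(w / np.linalg.norm(w), 2))
```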

Multi-class LDA Structure-preserving dimension reduction in classification. Within-class scatter: Sw = Σi Σ{a in class i} (a − ci)(a − ci)'. Between-class scatter: Sb = Σi ni (ci − c)(ci − c)'. Mixture scatter: Sm = Σa (a − c)(a − c)' = Sw + Sb. a: observations, ci: class centers, c: overall center. Kim et al. Pattern Recognition 2007, 40:2939

Multi-class LDA Maximize the between-class scatter relative to the within-class scatter in the projected space; the solution comes from the eigenvalues/eigenvectors of Sw⁻¹ Sb. When we have N << p, Sw is singular and cannot be inverted directly. Kim et al. Pattern Recognition 2007, 40:2939
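A sketch of multi-class LDA in the spirit of this slide: take the leading eigenvectors of Sw⁻¹ Sb, substituting the pseudo-inverse when Sw is singular (e.g. N << p). The pseudo-inverse is a generic workaround for illustration, not necessarily the construction used by Kim et al.; function and variable names are illustrative.

```python
import numpy as np

def multiclass_lda(X, y, n_components):
    """Project X onto the leading directions of pinv(Sw) @ Sb."""
    classes = np.unique(y)
    p = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        cc = Xc.mean(axis=0)
        Sw += (Xc - cc).T @ (Xc - cc)                # within-class scatter
        Sb += len(Xc) * np.outer(cc - mu, cc - mu)   # between-class scatter
    # Use the pseudo-inverse so the code also runs when Sw is singular (N << p)
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]
    G = vecs[:, order[:n_components]].real           # projection matrix (p x n_components)
    return X @ G, G

# usage (illustrative): Z, G = multiclass_lda(X, y, n_components=2)
```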

Multi-class LDA Kim et al. Pattern Recognition 2007, 40:2939