Principal Components Analysis. Principal Components Analysis (PCA): a multivariate technique with the central aim of reducing the dimensionality of a multivariate data set.

Presentation transcript:

Principal Components Analysis

Principal Components Analysis (PCA) A multivariate technique with the central aim of reducing the dimensionality of a multivariate data set while accounting for as much as possible of the variation present in the original data. The basic goal of PCA is to describe the variation in a set of correlated variables, X^T = (X_1, ..., X_q), in terms of a new set of uncorrelated variables, Y^T = (Y_1, ..., Y_q), each of which is a linear combination of the X variables. The principal components Y_1, ..., Y_q are ordered so that each successive component accounts for a decreasing amount of the variation in the original data.

Principal Components Analysis (PCA) Principal components analysis is most commonly used for constructing an informative graphical representation of the data. Principal components may be useful when: there are too many explanatory variables relative to the number of observations, or the explanatory variables are highly correlated.

Principal Components Analysis (PCA) The first principal component is the linear combination of the variables X_1, X_2, ..., X_q: Y_1 = a_11 X_1 + a_12 X_2 + ... + a_1q X_q. Y_1 accounts for as much as possible of the variation in the original data among all such linear combinations, subject to the constraint a_11^2 + a_12^2 + ... + a_1q^2 = 1.

Principal Components Analysis (PCA) The second principal component, Y_2 = a_21 X_1 + a_22 X_2 + ... + a_2q X_q, accounts for as much as possible of the remaining variation, with the constraint a_21^2 + a_22^2 + ... + a_2q^2 = 1 and the requirement that Y_2 and Y_1 are uncorrelated.

Principal Components Analysis (PCA) The third principal component, Y_3, is defined in the same way: it accounts for as much as possible of the variation not captured by the first two components and is uncorrelated with Y_1 and Y_2. If there are q variables, there are q principal components.

Principal Components Analysis (PCA) Each observation is considered a coordinate in N-dimensional data space, where N is the number of variables and each axis of data space is one variable. Data: height and first-leaf length of Dactylorhiza orchids. [Scatter plot of first-leaf length against height, with dashed lines marking the mean length and the mean height.] Step 1: A new set of axes is created, whose origin (0,0) is located at the mean of the dataset. Step 2: The new axes are rotated around their origin until the first axis gives a least-squares best fit to the data (residuals are fitted orthogonally).
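As an illustration of these two steps (a sketch only: the orchid measurements below are simulated, since the slide's data are not reproduced here), the base-R function prcomp() centres the data at the variable means and returns the rotation that aligns the first axis with the least-squares best-fit direction:
> set.seed(1)
> height = rnorm(30, mean=20, sd=4)            # simulated plant heights
> leaf = 0.6*height + rnorm(30, sd=1)          # simulated first-leaf lengths, correlated with height
> orchid_pca = prcomp(cbind(height, leaf))     # Step 1: centre at the mean; Step 2: rotate to best fit
> orchid_pca$center                            # origin of the new axes (the variable means)
> orchid_pca$rotation                          # the rotation defining the new axes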

Principal Components Analysis (PCA) PCA gives three useful sets of information about the dataset: the projection of the data onto the new coordinate axes (i.e. a new set of variables encapsulating the overall information content); the rotations needed to generate each new axis (i.e. the relative importance of each old variable to each new axis); and the actual information content of each new axis.

Mechanics of PCA Normalising the data Most multivariate datasets consist of very different kinds of variables (e.g. plant percentage cover ranges from 0% to 100%, animal population counts may exceed 10,000, chemical concentrations may take any positive value). How can such disparate types of data be compared? Approach: calculate the mean (µ) and standard deviation (s) of each variable (X_i) separately, then convert each observation into a corresponding Z score: Z = (X_i - µ) / s. The Z score is dimensionless; each column of the data has been converted into a new variable which preserves the shape of the original data but has µ = 0 and s = 1. The process of converting to Z scores is known as normalisation.
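As a small illustration (the data frame dat below is made up and is not part of the lecture data), R's built-in scale() function performs exactly this conversion to Z scores:
> dat = data.frame(cover=c(10,40,85), count=c(1200,15000,800), conc=c(0.3,1.7,0.9))   # variables on very different scales
> Z = scale(dat)            # subtract each column mean and divide by each column standard deviation
> round(colMeans(Z), 10)    # every column mean is now 0
> apply(Z, 2, sd)           # every column standard deviation is now 1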

Mechanics of PCA Normalising the data [Table: the example variables X, Y and Z before and after normalisation, with the mean (µ) and standard deviation (s) shown for each column; after normalisation every column has µ = 0 and s = 1.]

Mechanics of PCA The extraction of principal components The cloud of N-dimensional data points needs to be rotated to generate a set of N principal axes. The ordination is achieved by finding a set of numbers (loadings) which rotates the data to give the best fit. How do we find the best possible values for the loadings? Answer: by finding the eigenvectors and eigenvalues of the Pearson correlation matrix (the matrix of all possible Pearson correlation coefficients between the variables under examination). The covariance matrix can be used instead of the correlation matrix when all the original variables are on the same scale or when the data have been normalised. [The slide shows the correlation matrix of the example variables X, Y and Z.]
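The last point is easy to check in R (a sketch, assuming X holds a numeric data matrix): the covariance matrix of the normalised data is exactly the correlation matrix of the raw data, so the two choices coincide once the data have been converted to Z scores.
> all.equal(cov(scale(X)), cor(X))   # TRUE: after normalisation, covariance- and correlation-based PCA are identical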

Mechanics of PCA Eigenvalues and eigenvectors When a square (N x N) matrix is multiplied by an (N x 1) vector, the result is a new (N x 1) vector. The operation can be repeated on the new vector, generating another (N x 1) vector, and so on. After a number of repeats (iterations) the pattern of numbers settles down to a constant shape, although the actual values change by a constant factor at each multiplication. The rate of growth (or shrinkage) per multiplication is known as the dominant eigenvalue, and the pattern the numbers form is the dominant (or principal) eigenvector. In matrix notation: A v = λ v, where A is the (N x N) matrix, v is the (N x 1) eigenvector and λ is the eigenvalue.

Mechanics of PCA Eigenvalues and eigenvectors [Worked example from the slide: a starting vector is multiplied repeatedly by the matrix. After the first and second iterations the pattern of numbers is still changing, but after many iterations the entries (by then of order 10^7) keep the same relative pattern, which is the dominant eigenvector; each further multiplication scales them by the dominant eigenvalue, 2.48.] Once this equilibrium is reached, each generation of numbers increases by a factor of 2.48.
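A minimal power-iteration sketch in R shows the same behaviour (the matrix A here is an arbitrary symmetric example, not the matrix on the slide): repeated multiplication converges to the dominant eigenvector, and the growth factor per step converges to the dominant eigenvalue.
> A = matrix(c(2,1,0, 1,2,1, 0,1,2), nrow=3)    # arbitrary symmetric 3 x 3 matrix
> v = c(1,1,1)                                  # arbitrary starting vector
> for (i in 1:50) { w = A %*% v; lambda = sqrt(sum(w^2)); v = w/lambda }   # rescale at each iteration
> round(v, 3)          # the pattern preserved by multiplication: the dominant eigenvector
> round(lambda, 3)     # the growth factor per multiplication: the dominant eigenvalue
> eigen(A)$values[1]   # agrees with R's eigen() (the eigenvector may differ in sign)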

Mechanics of PCA PCA takes a set of R observations on N variables as a set of R points in an N-dimensional space. A new set of N principal axes is derived, each one defined by rotating the dataset by a certain angle with respect to the old axes. The first axis in the new space (the first principal axis of the data) encapsulates the maximum possible information content, the second axis contains the second greatest information content, and so on. Eigenvectors: the relative patterns of numbers which are preserved under matrix multiplication. Eigenvalues: give a precise indication of the relative importance of each ordination axis, with the largest eigenvalue being associated with the first principal axis, the second largest with the second principal axis, etc.

Mechanics of PCA For example, a matrix with 20 species would generate 20 eigenvectors, but only the first three or four would be of any importance for interpreting the data. The relationship between eigenvalues and variance in PCA is %variance_m = 100 * λ_m / N, where %variance_m is the percent variance explained by the mth ordination axis, λ_m is the mth eigenvalue and N is the number of variables (for a correlation-based PCA the eigenvalues sum to N). There is no formal test of significance available to decide whether any given ordination axis is meaningful, nor is there any test to decide whether or not individual variables contribute significantly to an ordination axis.
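For instance, since the eigenvalues of a correlation-based PCA sum to the number of variables, the percent variance explained by each axis can be computed directly (a sketch, assuming X holds a numeric multivariate data matrix):
> ev = eigen(cor(X))$values          # eigenvalues of the correlation matrix
> round(100 * ev / length(ev), 1)    # percent variance explained by each ordination axis
> round(100 * ev / sum(ev), 1)       # the equivalent general form: eigenvalue / total variance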

Mechanics of PCA Axis scores The Nth axis of the ordination diagram is derived by multiplying the matrix of normalised data by the Nth eigenvector: normalised data x first eigenvector = first axis scores, and normalised data x second eigenvector = second axis scores.
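In R this is a single matrix multiplication (a sketch, assuming X holds the raw data matrix; the object names are illustrative):
> Z = scale(X)                       # normalised data (Z scores)
> E = eigen(cor(X))$vectors          # columns are the eigenvectors
> scores = Z %*% E                   # column 1 = first axis scores, column 2 = second axis scores, ...
> round(cor(scores), 2)              # the axis scores are uncorrelated with one another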

PCA Example Excavations of prehistoric sites in northeast Thailand have produced a series of canid (dog) bones covering a period from about 3500 BC to the present. In order to clarify the ancestry of the prehistoric dogs, mandible measurements were made on the available specimens. These were then compared with similar measurements on the golden jackal, the Chinese wolf, the Indian wolf, the dingo, the cuon, and the modern dog from Thailand. How are these groups related, and how is the prehistoric group related to the others? R data "Phistdog". Variables:
Mbreadth - breadth of mandible
Mheight - height of mandible below the 1st molar
mlength - length of the 1st molar
mbreadth - breadth of the 1st molar
mdist - length from the 1st to the 3rd molars inclusive
pmdist - length from the 1st to the 4th premolars inclusive

PCA Example
> Phistdog = read.csv("E:/Multivariate_analysis/Data/Prehist_dog.csv", header=T, row.names=1)   # read the "Phistdog" data and use the first column as the row names
> round(sapply(Phistdog, var), 2)
 Mbreath Mheight mlength mbreadth   mdist  pmdist
Calculate the variance of each variable in the Phistdog data set. The round command is used to limit the output to 2 decimal places for reasons of space. The measurements are on a similar scale and the variances are not very different, so we can use either the correlation or the covariance matrix.

PCA Example Calculate the correlation matrix of the data.
> round(cor(Phistdog), 2)
         Mbreath Mheight mlength mbreadth mdist pmdist
Mbreath
Mheight
mlength
mbreadth
mdist
pmdist

PCA Example Calculate the covariance matrix of the data.
> round(cov(Phistdog), 2)
         Mbreath Mheight mlength mbreadth mdist pmdist
Mbreath
Mheight
mlength
mbreadth
mdist
pmdist

PCA Example Calculate the eigenvectors and eigenvalues of the correlation matrix:
> eigen(cor(Phistdog))
$values
[1]
$vectors
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]

PCA Example Calculate the eigenvectors and eigenvalues of the covariance matrix:
> eigen(cov(Phistdog))
$values
[1]
$vectors
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]

PCA Example Extract the principal components from the correlation matrix:
> Phistdog_Cor = princomp(Phistdog, cor=TRUE)
> summary(Phistdog_Cor, loadings=TRUE)
Importance of components:
                       Comp.1 Comp.2 Comp.3
Standard deviation
Proportion of Variance
Cumulative Proportion
Loadings:
         Comp.1 Comp.2 Comp.3
Mbreath
Mheight
mlength
mbreadth
mdist
pmdist
The first principal component accounts for 90% of the variance. All other components account for less than 10% of the variance each.
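The proportions of variance printed by summary() can also be recovered by hand from the component standard deviations stored in the princomp object created above (a small check, not part of the original slide):
> round(Phistdog_Cor$sdev^2 / sum(Phistdog_Cor$sdev^2), 3)           # proportion of variance per component
> round(cumsum(Phistdog_Cor$sdev^2) / sum(Phistdog_Cor$sdev^2), 3)   # cumulative proportion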

PCA Example Extract the principal components from the covariance matrix:
> Phistdog_Cov = princomp(Phistdog)
> summary(Phistdog_Cov, loadings=TRUE)
Importance of components:
                       Comp.1 Comp.2 Comp.3
Standard deviation
Proportion of Variance
Cumulative Proportion
Loadings:
         Comp.1 Comp.2 Comp.3
Mbreath
Mheight
mlength
mbreadth
mdist
pmdist
The loadings obtained from the covariance matrix differ from those obtained from the correlation matrix; the proportions of variance are similar.

PCA Example Plot variances of the principal components: > screeplot(Phistdog_Cor,main="Phistdog",cex.names=0.75)

PCA Example Equations for the first two principal components from the correlation matrix, and for the first two principal components from the covariance matrix, are written out on the slide using the loadings shown above. All variables have negative loadings on the first principal axis, and the loadings on the second principal axis are mostly positive.
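The coefficients in these equations are the loadings stored in the princomp objects created above, so they can be printed directly (a small sketch using those objects):
> round(Phistdog_Cor$loadings[, 1:2], 2)   # coefficients of the first two PCs from the correlation matrix
> round(Phistdog_Cov$loadings[, 1:2], 2)   # coefficients of the first two PCs from the covariance matrix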

PCA Example Calculate the axis scores for the principal components from the correlation matrix:
> round(Phistdog_Cor$scores, 2)
            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Modern
G.jackal
C.wolf
I.wolf
Cuon
Dingo
Prehistoric

PCA Example Calculate the axis scores for the principal components from the covariance matrix:
> round(Phistdog_Cov$scores, 2)
            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Modern
G.jackal
C.wolf
I.wolf
Cuon
Dingo
Prehistoric

PCA Example Plot the first principal component against the second principal component, first for the scores obtained from the correlation matrix and then for those obtained from the covariance matrix:
> plot(Phistdog_Cor$scores[,2] ~ Phistdog_Cor$scores[,1], xlab="PC1", ylab="PC2", pch=15, xlim=c(-4.5,3.5), ylim=c(-0.75,1.5))
> text(Phistdog_Cor$scores[,1], Phistdog_Cor$scores[,2], labels=row.names(Phistdog), cex=0.7, pos=rep(1,7))
> abline(h=0)
> abline(v=0)
> plot(Phistdog_Cov$scores[,2] ~ Phistdog_Cov$scores[,1], xlab="PC1", ylab="PC2", pch=15, xlim=c(-14.5,11), ylim=c(-3.5,4.5))
> text(Phistdog_Cov$scores[,1], Phistdog_Cov$scores[,2], labels=row.names(Phistdog), cex=0.7, pos=rep(1,7))
> abline(v=0)
> abline(h=0)

PCA Example [Two ordination diagrams: the PCA diagram based on the covariance matrix and the PCA diagram based on the correlation matrix.]

PCA Example Even though the scores given by the covariance and correlation matrices are different, the information provided by the two diagrams is the same. The Modern dog has the closest mandible measurements to the Prehistoric dog, which shows that the two groups are related. The Cuon and Dingo groups are the next closest to the Prehistoric dog. The I. wolf, C. wolf and G. jackal are not closely related to the Prehistoric dog or to any other group.