Presentation transcript: "Principal Component Analysis"

1 Principal Component Analysis: What is it?
Principal Component Analysis (PCA) is a standard tool in multivariate analysis for examining multidimensional data, revealing patterns between objects that would not be apparent in a univariate analysis.
What is the goal of PCA? PCA reduces a correlated dataset (values of variables {x1, x2, ..., xp}) to a dataset containing fewer new variables by axis rotation. The new variables are linear combinations of the original ones and are uncorrelated. The PCs are the new variables (or axes) that summarize several of the original variables. If nonzero correlation exists among the variables of the dataset, then a more compact description of the data is possible, which amounts to finding the dominant modes in the data.

2 How does it work?
PCA can explain most of the variability of the original dataset in a few new variables (if the data are well correlated). Correlation introduces redundancy: if two variables are perfectly correlated, then one of them is redundant, because if we know x, we know y. PCA exploits this redundancy in multivariate data to pick out patterns and relationships among the variables and to reduce the dimensionality of the dataset without significant loss of information.

3 What do we use it for?
Types of data we can use PCA on: basically anything! Usually we have multiple variables (which may be different locations of the same variable, or different variables) and samples or replicates of these variables (e.g. samples taken at different times, or data relating to different subjects or locations). PCA is very useful in the geosciences, where data are generally well correlated (across variables, across space).
What can it be used for?
- Exploratory data analysis
- Detection of outliers
- Identification of clusters (grouping, regionalization)
- Reduction of variables (data pre-processing, multivariate modeling)
- Data compression (lossy!)
- Analysis of variability in space and time
- New interpretation of the data (in terms of the main components of variability)
- Forecasting (finding relationships between variables)

4 Some Examples from the Literature
Geosciences in general:
- Rock type identification
- Remote sensing retrievals
- Classification of land use
Hydrology, water quality and ecology:
- Regionalization (e.g. drought and flood regimes)
- Analysis of water quality and its relationships with hydrology
- Relationships of species richness to morphological and hydrological parameters (lake size, land use, hydraulic residence time)
- Relationships between hydrology/soils and vegetation patterns
- Contamination source identification
Atmospheric science and climate analysis:
- Weather typing and classification
- Identification of major modes of variability
- Teleconnection patterns
- The hockey-stick plot of the global temperature record (an example of perceived misuse of PCA; actually NOT!)
Others:
- Bioinformatics
- Gene expression analysis
- Image processing and pattern recognition
- Data compression

5 Example: Image Processing

6 Example: Analysis of Variability in Climate Data
First principal component of October-December 1950-1999 sea-surface temperatures in the tropical Pacific, showing El Nino episodes. Example taken from Climate Prediction Tool documentation, Simon Mason, IRI.

7 Example: Classification of Chemical Species
Estimation of a reduced set of independent chemical groups with similar physical-chemical behavior, used to reduce the dimension of environmental data, to map contaminant similarity groups, and for source identification. Dimension reduction here refers to finding a small number of statistically independent, physically plausible chemical groups that explain a significant fraction of the overall variance in the data set.
Figure: clustering of the chemical species data in PCA space: (a) scatterplot of all data; (b) primary and transition groups; (c) primary groups only.

8 Example: Effects of Flooding and Drought on Water Quality in Gulf Coastal Plain Streams in Georgia
Principal component analysis of water quality parameters for three creeks in Georgia. The top panel highlights differences between streams; the bottom panel is the same PCA highlighting differences between hydrologic seasons.

9 Example: Identifying Modes of Climate Variability
The North Atlantic Oscillation (NAO) is a major mode of climate variability in the Northern Hemisphere. The NAO is defined as either:
- the 1st PC (leading mode of variability) of sea-level pressure (SLP) in the North Atlantic, or
- the difference in pressure between the north (Iceland) and the south (Azores) of the North Atlantic.
The advantage of PCA is that it takes into account the changing pattern of SLP across the Atlantic and ignores the day-to-day variability of the weather.
Positive phase of the NAO: bad weather in the Mediterranean; good weather in northern Europe and the eastern USA. Negative phase of the NAO: good weather in the Mediterranean; bad weather in northern Europe and the eastern USA.

10 Example: Global Soil Moisture Variability
How do we analyze a global dataset of soil moisture with 50 years of data at 15,000 grid points? How does it vary in time and space? What are the main modes of variability? Can we relate these to physical phenomena?

11 Understanding PCA: a simple example
A cluster of data in 3-D. These could be three variables (X, Y, Z) or three locations (X, Y, Z). Often the variables are correlated and show more variance in certain directions.

12 Understanding PCA: an example
The first principal component (PC1) is the direction along which there is the largest variation. This is equivalent to an axis rotation, expressing the data in a new coordinate system.

13 Understanding PCA: an example
The second PC (PC2) is the direction, uncorrelated with the first component, along which the data show the largest remaining variation (looking down the barrel of PC1).

14 Understanding PCA: an example
The result: PC1 and PC2 are uncorrelated, i.e. orthogonal.

15 Understanding PCA: a REAL example
Given a set of exam scores for a group of 14 people, how can we best summarize the scores? One objective of summarizing the scores would be to distinguish the good students from the bad students. Principal components are ideal for obtaining such summaries, but they can also provide other informative summaries. Example taken from Climate Prediction Tool documentation, Simon Mason, IRI.

16 Input Data
Each subject is a variable (like one grid point); each person is a case, or sample (like one year).

17 The first PC
Loading weights (left), amplitudes (right). PC1 has positive loadings on all exams; this distinguishes good from bad students. Amplitudes (the projection of the data onto PC1) are shown at right; good students have positive scores.

18 The second PC
Loading weights (left), amplitudes (right). PC2 has oppositely signed loadings on the physical vs. the social sciences; this distinguishes physical scientists from social scientists. Again, amplitudes (the projection of the data onto PC2) are shown at right; physical scientists have positive scores.

19 How to do PCA?
Step 1: Organize the data (what are the variables, what are the samples?)
Step 2: Calculate the covariance matrix (how do the variables co-vary?)
Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix
Step 4: Calculate the PCs (project the data onto the eigenvectors)
Step 5: Choose a subset of PCs
Step 6: Interpretation, data reconstruction, data compression, plotting, etc.
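A minimal end-to-end sketch of these steps in NumPy (the data, sizes, and variable names here are illustrative assumptions, not taken from the slides):

```python
import numpy as np

# Illustrative data: N samples (rows) of M variables (columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(365, 2))      # e.g. daily temperature and precipitation
X[:, 1] += 0.8 * X[:, 0]           # make the variables correlated

# Step 2: covariance matrix of the mean-centered variables
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)       # M x M

# Step 3: eigenvectors/eigenvalues of the covariance matrix
evals, evecs = np.linalg.eigh(C)   # eigh, since C is symmetric
order = np.argsort(evals)[::-1]    # sort by descending eigenvalue
evals, evecs = evals[order], evecs[:, order]

# Step 4: project the data onto the eigenvectors to get the PCs
pcs = Xc @ evecs                   # N x M matrix of PC amplitudes

# Step 5: fraction of variance explained by each PC
print(evals / evals.sum())
```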

20 Step 1: Organize the Data
Perhaps you have observations of several variables at one instant in time, or many samples or realizations of these variables taken at different times. For example:
- daily temperature and precipitation at a location for one year (a 2 by 365 matrix: 2 variables measured 365 times)
- 12 chemical species measured in 10 streams (a 12 x 10 or 10 x 12 matrix)
- soil moisture at 25 stations for 31 days (a 25 x 31 matrix: 25 variables or locations measured 31 times)
In general, the data could be:
1) A space-time array: measurements of a single variable at M locations taken at N different times, where M and N are integers.
2) A parameter-time array: measurements of M variables (e.g. temperature, pressure, relative humidity, rainfall, ...) taken at one location at N times.
3) A parameter-space array: measurements of M variables taken at N different locations at a single time.
and so on...
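A sketch of these layouts in NumPy (all values are made up; note the convention below puts samples in rows and variables in columns, the transpose of the slide's 2 by 365 example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Parameter-time array: 2 variables (temperature, precipitation) at N=365 times
temp = rng.normal(15.0, 8.0, size=365)       # illustrative values only
precip = rng.gamma(2.0, 1.5, size=365)
X = np.column_stack([temp, precip])          # shape (365, 2): N samples x M variables

# Space-time array: soil moisture at M=25 stations for N=31 days
soil = rng.uniform(0.1, 0.4, size=(31, 25))  # shape (31, 25)
```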

21 Step 2: Covariance Matrix
M = number of variables, N = number of samples

        x1        x2        x3        ...  xM
  x1    cov(1,1)  cov(1,2)  cov(1,3)  ...  cov(1,M)
  x2    cov(2,1)  cov(2,2)  cov(2,3)  ...  cov(2,M)
  x3    cov(3,1)  cov(3,2)  cov(3,3)  ...  cov(3,M)
  ...
  xM    cov(M,1)  cov(M,2)  cov(M,3)  ...  cov(M,M)

22 Step 2: Correlation Matrix (a normalized covariance matrix)

        x1         x2         x3         ...  xM
  x1    corr(1,1)  corr(1,2)  corr(1,3)  ...  corr(1,M)
  x2    corr(2,1)  corr(2,2)  corr(2,3)  ...  corr(2,M)
  x3    corr(3,1)  corr(3,2)  corr(3,3)  ...  corr(3,M)
  ...
  xM    corr(M,1)  corr(M,2)  corr(M,3)  ...  corr(M,M)

Note that on the diagonal, corr(i,i) = variance / (sd * sd) = 1.
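A quick numerical check of the relationship between the two matrices (a NumPy sketch with fabricated data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                          # N=100 samples, M=3 variables
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.normal(size=100)  # add correlation

C = np.cov(X, rowvar=False)        # M x M covariance matrix
R = np.corrcoef(X, rowvar=False)   # M x M correlation matrix

# corr(i,j) = cov(i,j) / (sd_i * sd_j), so the diagonal is all ones
sd = np.sqrt(np.diag(C))
assert np.allclose(R, C / np.outer(sd, sd))
assert np.allclose(np.diag(R), 1.0)
```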

23 Step 3: How to find PCs of the Cov/Corr Matrix?
Principal components (PC1, PC2, ...) are derived via successive multiple regression:
- use the covariance/correlation matrix to look for associations between variables
- derive a linear equation that summarizes the variation in the data with respect to the multiple variables, i.e. a multiple regression
- repeat as necessary (up to PC N, where N is the number of variables)
A PC can be written in general as a linear combination (a multiple linear regression) of the original variables:
z = a1 X1 + a2 X2 + a3 X3 + ... + aM XM
where X1 ... XM are the original variables and a1 ... aM (the loadings) are coefficients that reflect how much each variable contributes to the component. A PC 'value' (the projection of the data into PC space) is possible for every sample in the dataset:
z(t) = a1 X1(t) + a2 X2(t) + a3 X3(t) + ... + aM XM(t)
where t is a given sample (e.g. a time step) and X1(t) ... XM(t) are the original values of the variables for sample t.

24 Step 3: Coefficients of the Linear Regression are Eigenvectors of the Cov Matrix
Eigenvector: a list showing how much each original variable contributes to the PC (i.e. the coefficients from the PC equation); there is one eigenvector for every PC.
Eigenvalue: a single number that quantifies the amount of the original variance explained by a component; there is one eigenvalue for every component/eigenvector.
PCA is the eigenvalue analysis of a covariance/correlation (dispersion) matrix.

25 Step 3: Eigen Analysis or Eigen Decomposition
Any symmetric matrix A can be decomposed in the following way through an eigen analysis or eigen decomposition:
A e_i = λ_i e_i
where λ_i is an eigenvalue (scalar) and e_i is an eigenvector. With E the matrix whose columns are the eigenvectors and L the diagonal matrix of eigenvalues, this can also be written as
A E = E L, or A = E L E^T
Each e_i has dimension (M x 1) and E has dimension M x M (M = the number of variables). We usually require the eigenvectors to have unit length; the product of an eigenvector with itself is then 1, and the eigenvectors are mutually orthogonal (orthonormal if unit length):
e_i^T e_j = 1 if i = j, and 0 otherwise.
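This decomposition can be verified numerically; a sketch using NumPy's symmetric eigensolver (the matrix here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 4))
A = B @ B.T                                # an arbitrary symmetric matrix

evals, E = np.linalg.eigh(A)               # eigenvalues and eigenvector matrix E
L = np.diag(evals)                         # diagonal matrix of eigenvalues

assert np.allclose(A @ E, E @ L)           # A e_i = lambda_i e_i, column by column
assert np.allclose(A, E @ L @ E.T)         # A = E L E^T
assert np.allclose(E.T @ E, np.eye(4))     # eigenvectors are orthonormal
```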

26 Step 4: Calculate PCs from Eigenvectors
Each of the M eigenvectors contains one element related to each of the M variables, x_k, and each of the PCs is computed from a particular set of observations of those variables.
Geometrically, the first eigenvector, e_1, points in the direction in which the data vectors jointly exhibit the most variability. The first eigenvector is associated with the largest eigenvalue, λ_1. The second eigenvector is associated with the second largest eigenvalue, λ_2, and is orthogonal to the first eigenvector. And so on...
The eigenvectors define a new coordinate system in which to view the data. The mth principal component, PC_m, is the projection of the data vector x onto the mth eigenvector e_m:
PC_m(t) = e_m^T x(t) = e_1,m x_1(t) + e_2,m x_2(t) + ... + e_M,m x_M(t)
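In code this projection is a single matrix product; a sketch with illustrative sizes (data centered first, as in the earlier steps):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(31, 25))              # e.g. 31 days x 25 stations
Xc = X - X.mean(axis=0)                    # center each variable

C = np.cov(Xc, rowvar=False)
evals, E = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]            # largest eigenvalue first
evals, E = evals[order], E[:, order]

pcs = Xc @ E                               # PC_m(t) = e_m^T x(t), all m and t at once
pc1 = pcs[:, 0]                            # amplitude time series of the first PC
```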

27 Step 4: Eigenvector and PC Coefficients
Eigenvector: a list showing how much each original variable contributes to the component (i.e. the coefficients from the component equation):
z_i = e_1,i X_1 + e_2,i X_2 + e_3,i X_3 + ... + e_M,i X_M
So E (variables in rows, eigenvectors in columns) is:

        e_1     e_2     e_3     ...  e_M
  x1    e_1,1   e_1,2   e_1,3   ...  e_1,M
  x2    e_2,1   e_2,2   e_2,3   ...  e_2,M
  x3    e_3,1   e_3,2   e_3,3   ...  e_3,M
  ...
  xM    e_M,1   e_M,2   e_M,3   ...  e_M,M

28 Step 5: PCs and Eigenvalues
The eigenvalues indicate how much variance is explained by each eigenvector: the variance of the mth PC is the mth eigenvalue, λ_m. If you arrange the eigenvector/eigenvalue pairs with the biggest eigenvalues first, then you may be able to explain a large amount of the variance in the original data set with relatively few coordinate directions. Each PC represents a share of the total variation in x that is proportional to its eigenvalue. If all PCs are used, the eigenvalues sum to the total variance of the original data:
λ_1 + λ_2 + ... + λ_M = total variance
Now we can reconstruct the dataset to some approximation using fewer PCs (the leading K < M), and thus obtain a compression of the data, or a dataset that contains only the important information:
x(t) ≈ PC_1(t) e_1 + PC_2(t) e_2 + ... + PC_K(t) e_K
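A sketch of truncation and reconstruction using these formulas (NumPy; the redundancy built into the fabricated data is what lets a small K capture most of the variance):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(200, 5))  # built-in redundancy
Xc = X - X.mean(axis=0)

C = np.cov(Xc, rowvar=False)
evals, E = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]
evals, E = evals[order], E[:, order]

# If all PCs are used, the eigenvalues sum to the total variance
assert np.allclose(evals.sum(), np.trace(C))

# Keep the first K PCs and reconstruct: x ~ sum_m PC_m e_m
K = 5
pcs = Xc @ E[:, :K]                        # N x K amplitudes
X_hat = pcs @ E[:, :K].T                   # approximate reconstruction
print(f"variance retained: {evals[:K].sum() / evals.sum():.1%}")
```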

29 2-D example of PCA

