Multivariate Statistical Methods

Multivariate Statistical Methods: Principal Components Analysis (PCA)
By Jen-pei Liu, PhD
Division of Biometry, Department of Agronomy, National Taiwan University, and Division of Biostatistics and Bioinformatics, National Health Research Institutes
2019/2/24. Copyright by Jen-pei Liu, PhD

Principal Components Analysis
  Introduction
  Procedures
  Properties
  Examples
  Summary

Introduction
  Described by K. Pearson (1901); computing methods by Hotelling (1933)
  Objective: to transform the original variables X1,…,Xp into index variables Z1,…,Zp
  Z1,…,Zp are linear combinations of X1,…,Xp
  Z1,…,Zp are uncorrelated and are ordered by importance
  Together they describe the variation in the data

Introduction
  Lack of correlation → the index variables measure different dimensions (domains) of the data
  Lack of correlation → only the variances of the index variables need to be considered, not their covariances
  Ordering → Var(Z1) ≥ Var(Z2) ≥ … ≥ Var(Zp)
  The Z index variables are called the principal components

Introduction
  Most of the variation in the full data set can be adequately described by the first few Z index variables
  The dimension is reduced from a two-digit number of variables to just 2 to 4 principal components
  This requires high correlations among the original variables

Introduction: Correlations of Female Sparrows
                                      X1     X2     X3     X4     X5
  Total length (X1)                  1.000
  Alar length (X2)                   0.735  1.000
  Length of beak and head (X3)       0.662  0.674  1.000
  Length of humerus (X4)             0.645  0.769  0.763  1.000
  Length of keel of sternum (X5)     0.605  0.529  0.626  0.607  1.000

Introduction: Coefficients for Components
  Component  Variance      X1      X2      X3      X4      X5
  1          3.616       0.452   0.462   0.451   0.471   0.398
  2          0.532      -0.051   0.300   0.325   0.185  -0.877
  3          0.386       0.691   0.341  -0.455  -0.411  -0.179
  4          0.302      -0.420   0.548  -0.606   0.388   0.069
  5          0.165       0.374  -0.530  -0.343   0.652  -0.192

Introduction
  Z1 = 0.452X1 + 0.462X2 + 0.451X3 + 0.471X4 + 0.398X5
  The variance of Z1 is 3.62, which accounts for 72.3% (3.62/5.00) of the total variation
  All coefficients of Z1 are smaller than 1, and the sum of their squares is equal to 1
  Z1 is in effect a weighted average (or sum) of X1, X2, X3, X4, and X5 with nearly equal weights
  Z1 can be interpreted as an index of the size of the sparrow
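
The figures quoted for Z1 can be rechecked with a few lines of Python. This is only an arithmetic sketch: the numbers are copied from the sparrow coefficient table, and the variable names are illustrative, not part of the slides.

```python
# Component variances and first-component coefficients from the sparrow table.
variances = [3.616, 0.532, 0.386, 0.302, 0.165]
z1_coeffs = [0.452, 0.462, 0.451, 0.471, 0.398]

# With standardized variables, the total variation equals the number of variables.
total_variation = len(variances)
share_z1 = variances[0] / total_variation

# The coefficients of a principal component are scaled to unit length,
# so their squares should sum to 1 (up to rounding in the table).
sum_of_squares = sum(a * a for a in z1_coeffs)

print(f"Z1 share of total variation: {share_z1:.1%}")
print(f"Sum of squared coefficients: {sum_of_squares:.3f}")
```

The sum of squares differs from 1 only in the third decimal place, which is consistent with the coefficients being rounded to three digits.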

Procedures: Data Structure
  Case   X1    X2    …   Xp
  1      x11   x12   …   x1p
  2      x21   x22   …   x2p
  ⋮
  n      xn1   xn2   …   xnp

Procedures: The First Component
  The first component is a linear combination of X1, X2, …, Xp:
    Z1 = a11X1 + a12X2 + … + a1pXp
  Var(Z1) is as large as possible subject to the condition that
    a11² + a12² + … + a1p² = 1

Procedures: The Second Component
  The second component is also a linear combination of X1, X2, …, Xp:
    Z2 = a21X1 + a22X2 + … + a2pXp
  Var(Z2) is as large as possible subject to the conditions that
    a21² + a22² + … + a2p² = 1
    Z1 and Z2 are not correlated
  Var(Z2) is therefore the second largest variance

Procedures: The Third Component
  The third component is also a linear combination of X1, X2, …, Xp:
    Z3 = a31X1 + a32X2 + … + a3pXp
  Var(Z3) is as large as possible subject to the conditions that
    a31² + a32² + … + a3p² = 1
    Z1, Z2, and Z3 are not correlated
  Var(Z3) is therefore the third largest variance

Procedures
  Continue until all p principal components are computed
  Covariance matrix of the p variables

Procedures
  Different variables might have different units and magnitudes
  PCA might be influenced by these units and magnitudes
  Standardize each variable to have zero mean and unit variance
  The covariance matrix of the standardized variables is the correlation matrix of the original variables
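
The last point can be checked numerically. The sketch below assumes NumPy and uses made-up data on three deliberately different scales; none of the names or numbers come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 cases, 3 variables with very different magnitudes.
scales = np.array([1.0, 10.0, 100.0])
offsets = np.array([5.0, 50.0, 500.0])
X = rng.normal(size=(50, 3)) * scales + offsets

# Standardize each column to zero mean and unit sample variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_of_standardized = np.cov(Z, rowvar=False)   # covariance of standardized data
correlation = np.corrcoef(X, rowvar=False)      # correlation of the raw data

print(np.allclose(cov_of_standardized, correlation))
```

Both matrices are computed with the same n - 1 divisor, which is why they agree up to floating-point error rather than only approximately.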

Procedures: Steps of PCA
  Standardize the variables X1, X2, …, Xp to have zero means and unit variances, unless the importance of the variables is reflected in their variances
  Calculate the covariance matrix (the correlation matrix if the variables are standardized)

Procedures: Steps of PCA
  Find the eigenvalues λ1, λ2, …, λp and their corresponding eigenvectors a1, a2, …, ap
  The coefficients of the ith principal component Zi are the elements of ai, and λi is the variance of Zi
  Discard any components that account for only a small proportion of the variation in the data
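
The steps listed on the two "Steps of PCA" slides can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `pca_from_correlation` and the simulated data are hypothetical, and `numpy.linalg.eigh` is used because the correlation matrix is symmetric.

```python
import numpy as np

def pca_from_correlation(X):
    """Return eigenvalues, eigenvectors (as columns) and scores for standardized X."""
    # Step 1: standardize to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance of the standardized data, i.e. the correlation matrix.
    R = np.cov(Z, rowvar=False)
    # Step 3: eigenvalues and eigenvectors; eigh returns them in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    order = np.argsort(eigenvalues)[::-1]          # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Column i of `scores` holds the values of the ith principal component.
    scores = Z @ eigenvectors
    return eigenvalues, eigenvectors, scores

rng = np.random.default_rng(1)
# Hypothetical data: 100 cases, 4 variables sharing one common factor.
common = rng.normal(size=(100, 1))
X = common + 0.5 * rng.normal(size=(100, 4))

eigenvalues, eigenvectors, scores = pca_from_correlation(X)
proportion = eigenvalues / eigenvalues.sum()
print("eigenvalues:", np.round(eigenvalues, 3))
print("proportion of variation:", np.round(proportion, 3))
```

The final step, discarding components with small eigenvalues, is left to the reader because the cut-off depends on the application. The sign of each eigenvector is arbitrary, so software may report a component with all signs flipped.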

Properties
  E(Z) = Aμ
  V(Z) = AΣA' = diag{λi, i = 1,…,p}
  Cov(Zi, Xj) = aij λi
  Corr(Zi, Xj) = aij √λi / √cjj, where cjj is the variance of Xj
  Corr(Zi, Xj) = aij √λi, if the correlation matrix is used
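
These properties can be verified on sample quantities. The sketch below assumes that the rows of A are the eigenvectors of the covariance matrix and that cjj is its jth diagonal element. The data and the mixing matrix are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 200 cases, 3 correlated variables with unequal variances.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mixing

S = np.cov(X, rowvar=False)                 # sample covariance matrix (Sigma)
values, vectors = np.linalg.eigh(S)
order = np.argsort(values)[::-1]
lam = values[order]                         # lambda_1 >= lambda_2 >= lambda_3
A = vectors[:, order].T                     # row i holds the coefficients a_i

Z = (X - X.mean(axis=0)) @ A.T              # principal component scores

# V(Z) = A Sigma A' should be diagonal with the eigenvalues on the diagonal.
var_z = A @ S @ A.T

# Corr(Z_i, X_j) = a_ij * sqrt(lambda_i) / sqrt(c_jj)
predicted = A * np.sqrt(lam)[:, None] / np.sqrt(np.diag(S))[None, :]
observed = np.corrcoef(Z, X, rowvar=False)[:3, 3:]

print(np.allclose(var_z, np.diag(lam)))
print(np.allclose(predicted, observed))
```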

Examples: Determining the number of principal components
  It depends on the needs of the practitioner
  The proportion of the total variation explained by the selected principal components should be high, e.g., at least 80%
  If the correlation matrix is used, select the principal components with variance greater than 1, because they account for more variation than any single original variable (whose variance is 1)
  Use a scree plot
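
The two numerical rules can be written as small helper functions. This is a sketch: the function names and the 0.80 default are illustrative choices, and the input is the list of component variances from the sparrow coefficient table in the Introduction.

```python
def components_for_share(variances, threshold=0.80):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    total = sum(variances)
    cumulative = 0.0
    for k, v in enumerate(variances, start=1):
        cumulative += v
        if cumulative / total >= threshold:
            return k
    return len(variances)

def components_above_one(variances):
    """Number of components with variance greater than 1 (correlation-matrix PCA)."""
    return sum(1 for v in variances if v > 1.0)

# Component variances from the sparrow example, largest first.
sparrow_variances = [3.616, 0.532, 0.386, 0.302, 0.165]

by_share = components_for_share(sparrow_variances)
by_variance = components_above_one(sparrow_variances)
print(by_share, by_variance)
```

For the sparrow data the two rules disagree (two components versus one), which illustrates why the choice ultimately depends on the needs of the practitioner.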

Examples: Evaluation of a Statistics Course
  16 students rated 11 items (variables)
  Evaluation scale: 1 (poor or not at all) to 5 (excellent, strongly, or difficult)
  The first two principal components explain 76.0% of the total variation, and the last four explain only 2.2%

Examples: Test scores of 10 students in 4 subjects
                 Student
  Subject          1   2   3   4   5   6   7   8   9  10
  Chinese (X1)    85  90  60  70  68  77  50  80  85  55
  English (X2)    76  95  45  65  56  80  30  70  75  60
  Math (X3)       60  80  38  60  70  65  40  60  65  40
  Social (X4)     85  72  80  76  70  68  80  66  84  50
  Source: Shen (1998)

Examples: Correlation Matrix
        X1       X2       X3       X4
  X1    1        0.8846   0.8375   0.2784
  X2             1        0.8059  -0.1101
  X3                      1        0.1118
  X4                               1

Examples: Eigenvalues and Eigenvectors
                                    Eigenvector
  Eigenvalue  Prop.   Cum. Prop.    X1       X2       X3       X4
  2.70159     0.6754  0.6754        0.5897   0.5682   0.5657   0.0969
  1.06380     0.2660  0.9414        0.1254  -0.2651  -0.0281   0.9556
  0.19870     0.0497  0.9910        0.3592   0.4378  -0.8227   0.0501
  0.03591     0.0090  1.0000       -0.7124   0.6444   0.0485   0.2737
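
The eigenvalues can be reproduced from the correlation matrix of the four subjects. The sketch below assumes NumPy. The matrix entries are the four-decimal values from the correlation slide, so agreement with the tabulated eigenvalues is only expected to about three decimals.

```python
import numpy as np

# Correlation matrix of Chinese, English, Math and Social (upper triangle mirrored).
R = np.array([
    [1.0,     0.8846,  0.8375,  0.2784],
    [0.8846,  1.0,     0.8059, -0.1101],
    [0.8375,  0.8059,  1.0,     0.1118],
    [0.2784, -0.1101,  0.1118,  1.0],
])

values, vectors = np.linalg.eigh(R)        # ascending order for symmetric input
order = np.argsort(values)[::-1]
eigenvalues = values[order]
eigenvectors = vectors[:, order]           # column i is the ith eigenvector

proportion = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(proportion)

for lam, prop, cum in zip(eigenvalues, proportion, cumulative):
    print(f"{lam:8.5f}  {prop:6.4f}  {cum:6.4f}")
```

Eigenvectors are determined only up to sign, so a column returned by the library may be the negative of the corresponding row in the table.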

Examples
  Because the first two principal components account for 94.14% of the total variation, we can use just these two principal components
  The first principal component can be interpreted as an index for the sum of Chinese, English, and math
  The second principal component can be thought of as social science

Examples
  The above results can also be obtained by inspecting the correlation matrix
  Correlations among Chinese, English, and math exceed 0.8
  Correlations of Chinese, English, and math with social science are below 0.3

Examples: Correlations between the first principal component and the original variables
  Corr(Z1, X1) = a11 √λ1 = 0.5897 × √2.70159 = 0.9692
  Corr(Z1, X2) = a12 √λ1 = 0.5682 × √2.70159 = 0.9339
  Corr(Z1, X3) = a13 √λ1 = 0.5657 × √2.70159 = 0.9298
  Corr(Z1, X4) = a14 √λ1 = 0.0969 × √2.70159 = 0.1592

Examples: Correlations between the second principal component and the original variables
  Corr(Z2, X1) = a21 √λ2 = 0.1254 × √1.0638 = 0.1294
  Corr(Z2, X2) = a22 √λ2 = -0.2651 × √1.0638 = -0.2734
  Corr(Z2, X3) = a23 √λ2 = -0.0281 × √1.0638 = -0.0290
  Corr(Z2, X4) = a24 √λ2 = 0.9556 × √1.0638 = 0.9856
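
The correlations for both components follow the same rule, coefficient times the square root of the eigenvalue, and can be recomputed as below. This is a sketch in plain Python: the dictionary layout is an illustrative choice, and the last digit may differ from the slides because the inputs are already rounded.

```python
import math

subjects = ["Chinese", "English", "Math", "Social"]

# Eigenvalues and coefficients of the first two components, as quoted on the slides.
eigenvalues = {1: 2.70159, 2: 1.0638}
coefficients = {
    1: [0.5897, 0.5682, 0.5657, 0.0969],
    2: [0.1254, -0.2651, -0.0281, 0.9556],
}

# Corr(Z_i, X_j) = a_ij * sqrt(lambda_i) when the correlation matrix is used.
correlations = {
    i: [a * math.sqrt(eigenvalues[i]) for a in coefficients[i]]
    for i in eigenvalues
}

for i, row in correlations.items():
    for subject, r in zip(subjects, row):
        print(f"Corr(Z{i}, {subject}) = {r:.4f}")
```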

Examples
  Student   1st Component   2nd Component
  1          0.91883         1.12685
  2          2.58868        -0.41488
  3         -1.85920         0.84509
  4          0.03527         0.23932
  5          0.01741        -0.21745
  6          0.92643        -0.65337
  7         -2.67248         0.96553
  8          0.52758        -0.65459
  9          1.32646         0.92471
  10        -1.80897        -0.16121

Examples: Female Sparrows
  The correlations of the five body measurements and the coefficients of the five components are those tabulated in the Introduction

Examples
  The first principal component:
    Z1 = 0.452X1 + 0.462X2 + 0.451X3 + 0.471X4 + 0.398X5
    An index of bird size
  The second principal component:
    Z2 = -0.051X1 + 0.300X2 + 0.325X3 + 0.185X4 - 0.877X5
    An index of bird shape

Examples
  The value of the first principal component for the first bird:
    Z1 = 0.452(-0.542) + 0.462(0.725) + 0.451(0.177) + 0.471(0.055) + 0.398(-0.33) = 0.064
  The value of the second principal component for the first bird:
    Z2 = -0.051(-0.542) + 0.300(0.725) + 0.325(0.177) + 0.185(0.055) + (-0.877)(-0.33) = 0.602
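
Each component value is simply the dot product of the coefficient vector with the bird's five values. A short sketch in plain Python, with all numbers copied from the slide and an illustrative helper name `component_value`:

```python
# Values of X1..X5 for the first bird, as used on the slide.
first_bird = [-0.542, 0.725, 0.177, 0.055, -0.33]

# Coefficients of the first two principal components.
z1_coeffs = [0.452, 0.462, 0.451, 0.471, 0.398]
z2_coeffs = [-0.051, 0.300, 0.325, 0.185, -0.877]

def component_value(coeffs, values):
    """Linear combination a1*x1 + ... + ap*xp."""
    return sum(a * x for a, x in zip(coeffs, values))

z1 = component_value(z1_coeffs, first_bird)
z2 = component_value(z2_coeffs, first_bird)
print(round(z1, 3), round(z2, 3))
```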

Examples
               Mean                     Standard Deviation
  Component    Survivor  Nonsurvivor    Survivor  Nonsurvivor
  1            -0.100     0.075         1.506     2.176
  2             0.004    -0.003         0.684     0.776
  3            -0.140     0.105         0.522     0.677
  4             0.073    -0.055         0.563     0.543
  5             0.023    -0.017         0.411     0.408

Examples: Employment in European Countries
           AGR     MIN     MAN      PS     CON     SER     FIN     SPC      TC
  AGR    1.000
  MIN    0.316   1.000
  MAN   -0.254  -0.672   1.000
  PS    -0.382  -0.387   0.388   1.000
  CON   -0.349  -0.129  -0.034   0.165   1.000
  SER   -0.605  -0.407  -0.033   0.155   0.473   1.000
  FIN   -0.176  -0.248  -0.274   0.094  -0.018   0.379   1.000
  SPC   -0.811  -0.316   0.050   0.238   0.072   0.388   0.166   1.000
  TC    -0.487   0.045   0.243   0.105  -0.055  -0.085  -0.391   0.475   1.000

Examples
  The 9 eigenvalues: 3.112 (34.6%), 1.809 (20.1%), 1.496 (16.6%), 1.063 (11.8%), 0.710 (7.9%), 0.311 (3.5%), 0.293 (3.3%), 0.204 (2.4%), and 0 (0.0%)
  The employment percentages of each country sum to 1 (that is, 100%)
  The columns of the correlation matrix are therefore linearly dependent
  As a result, the last eigenvalue is 0
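
The zero eigenvalue is a general consequence of the rows summing to a constant, which can be demonstrated on simulated data. The sketch below assumes NumPy and uses invented percentages for 30 hypothetical countries in 9 sectors. It does not use the actual employment data, which are not given in the slides.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical percentages: each row (country) sums to 100 across 9 sectors.
shares = rng.dirichlet(np.ones(9), size=30) * 100.0

R = np.corrcoef(shares, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]   # largest first

print(np.round(eigenvalues, 4))
print("smallest eigenvalue:", eigenvalues[-1])
```

Because the nine columns always add up to 100, one column is an exact linear function of the other eight, and the smallest eigenvalue is zero up to floating-point error.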

Examples
  Select the principal components with eigenvalues greater than 1 → the first 4 principal components, which explain about 83% of the total variation in the data
  The first two principal components alone account for only 55% of the total variation

Examples: The first principal component
  Z1 = 0.51(AGR) + 0.37(MIN) - 0.25(MAN) - 0.31(PS) - 0.22(CON) - 0.38(SER) - 0.13(FIN) - 0.42(SPS) - 0.21(TC)
  A contrast between AGR (agriculture, forestry, and fishing) and MIN (mining and quarrying) versus the other sectors

Examples: The second principal component
  Z2 = -0.02(AGR) + 0.00(MIN) + 0.43(MAN) + 0.11(PS) - 0.24(CON) - 0.41(SER) - 0.55(FIN) + 0.05(SPS) + 0.52(TC)
  A contrast between MAN (manufacturing) and TC (transport and communication) versus CON (construction), SER (service industry), and FIN (finance)

Summary
  Each principal component is a linear combination of the original variables
  PCA tries to reduce a large number of variables to a few index variables
  The index variables are uncorrelated and are ordered by the magnitude of their variation
  The method was illustrated with real examples