Stat240: Principal Component Analysis (PCA)
Open/closed book examination data
> scores = as.matrix(read.table(" hs.leeds.ac.uk/~charles/mva-data/openclosedbook.dat", header=TRUE))
> colnames(scores)
> pairs(scores)
[Figure: scatterplot matrix of MC, VC, LO, NO, SO]
Sample Variance-Covariance
> cov.scores = cov(scores)
> round(cov.scores, 2)
[Output: 5 x 5 covariance matrix with rows and columns MC, VC, LO, NO, SO]
> eigen.value = eigen(cov.scores)$values
> round(eigen.value, 2)
[Output: five eigenvalues, the variances of the PCs]
> eigen.vec = eigen(cov.scores)$vectors
> round(eigen.vec, 2)
[Output: 5 x 5 matrix of eigenvectors, the loadings of the PCs]
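Principal Components
PC1, PC2, PC3, PC4, PC5: each PC is a linear combination of the centered scores MC, VC, LO, NO, SO, with weights given by the corresponding column of eigen.vec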
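The construction of the PCs from the eigenvectors can be sketched as follows. The exam data file is not reproduced here, so the matrix X below is simulated stand-in data; the sample size, means and weights are assumptions chosen only for illustration.

```r
# Simulated stand-in for the exam-score matrix (hypothetical data, 100 x 5)
set.seed(1)
n <- 100
ability <- rnorm(n)
X <- sapply(c(MC = 0.8, VC = 0.7, LO = 0.9, NO = 0.8, SO = 0.7),
            function(w) 50 + 10 * (w * ability + rnorm(n, sd = 0.6)))

# Eigen decomposition of the sample covariance matrix
e <- eigen(cov(X))

# PC scores: centered data times the eigenvectors (loadings)
Xc <- scale(X, center = TRUE, scale = FALSE)
pc <- Xc %*% e$vectors
colnames(pc) <- paste0("PC", 1:5)

# The sample variances of the PC scores equal the eigenvalues
print(round(apply(pc, 2, var), 2))
print(round(e$values, 2))

# The PC scores are uncorrelated with each other
print(round(cor(pc), 2))
```

The signs of the eigenvectors are arbitrary, so a PC and its negative carry the same information.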
Scree plot
> plot(1:5, eigen.value, xlab="i", ylab="variance", main="scree plot", type="b")
> round(cumsum(eigen.value)/sum(eigen.value), 3)
[Output: cumulative proportions of the total variance explained by the first 1, ..., 5 PCs]
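One common way to read the cumulative proportions is to keep the smallest number of PCs that reaches a chosen share of the total variance. The sketch below uses hypothetical eigenvalues and an assumed 90% cutoff; neither comes from the exam data.

```r
# Hypothetical eigenvalues (PC variances), in decreasing order
eigen.value <- c(5, 2, 1.5, 1, 0.5)

# Proportion and cumulative proportion of total variance
prop <- eigen.value / sum(eigen.value)
cum.prop <- cumsum(prop)
print(round(cum.prop, 3))

# Smallest number of PCs whose cumulative proportion reaches the cutoff
threshold <- 0.9  # assumed cutoff, not a rule from the slides
k <- which(cum.prop >= threshold)[1]
print(k)
```

The cutoff is a judgment call; the elbow of the scree plot is an alternative guide.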
“princomp”
R has a function to conduct PCA
> help(princomp)
> obj = princomp(scores)
> plot(obj, type="lines")
> biplot(obj)
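When comparing princomp with the eigen() approach, note that princomp computes the covariance matrix with divisor n rather than n - 1, so its variances differ from those of eigen(cov(X)) by the factor (n - 1)/n. The sketch below checks this on simulated stand-in data (hypothetical, not the exam scores).

```r
# Simulated stand-in data (hypothetical, 100 x 5)
set.seed(1)
n <- 100
ability <- rnorm(n)
X <- sapply(c(MC = 0.8, VC = 0.7, LO = 0.9, NO = 0.8, SO = 0.7),
            function(w) 50 + 10 * (w * ability + rnorm(n, sd = 0.6)))

obj <- princomp(X)
e <- eigen(cov(X))

# princomp variances (sdev^2) versus eigenvalues rescaled by (n - 1)/n
print(round(unname(obj$sdev^2), 3))
print(round(e$values * (n - 1) / n, 3))

# Loadings agree with the eigenvectors up to the sign of each column
max.diff <- max(abs(abs(unclass(obj$loadings)) - abs(e$vectors)))
print(max.diff)
```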
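PCA in checking MVN assumption
If X is multivariate normal, every PC is normal, since each PC is a linear combination of X
Check by examining normality of the PCs, especially the first two PCs
– Histograms, q-q plots
– Bivariate plots
– Checking outliers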
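The checks above can be sketched numerically as follows, on simulated stand-in data (hypothetical). The q-q correlation, the Shapiro-Wilk test and the chi-square cutoff for Mahalanobis distances are one possible set of tools, not necessarily the ones used in the lecture.

```r
# Simulated stand-in data (hypothetical, 100 x 5)
set.seed(2)
n <- 100
ability <- rnorm(n)
X <- sapply(c(MC = 0.8, VC = 0.7, LO = 0.9, NO = 0.8, SO = 0.7),
            function(w) 50 + 10 * (w * ability + rnorm(n, sd = 0.6)))

# Scores on the first two PCs
pc <- princomp(X)$scores[, 1:2]

# Correlation between the points of a normal q-q plot (close to 1 under normality)
qq.cor <- apply(pc, 2, function(z) {
  q <- qqnorm(z, plot.it = FALSE)
  cor(q$x, q$y)
})
print(round(qq.cor, 3))

# Shapiro-Wilk p-values for each PC
sw.p <- apply(pc, 2, function(z) shapiro.test(z)$p.value)
print(round(sw.p, 3))

# Bivariate outliers: squared Mahalanobis distances against a chi-square(2) quantile
d2 <- mahalanobis(pc, colMeans(pc), cov(pc))
outliers <- which(d2 > qchisq(0.975, df = 2))
print(outliers)
```

For the graphical versions, hist(pc[, 1]), qqnorm(pc[, 1]) and plot(pc) give the histogram, q-q plot and bivariate plot.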
PCA in regression
Data: Y (n x 1), X (n x p)
PCA is useful when we want to regress Y on a large number of independent variables (X)
– Reduce dimension
– Handle collinearity
One would like to transform X to the principal components
How to choose principal components?
PCA in regression
A misconception: retain only the PCs with large variances
– PCs with large variances do tend to explain the dependent variable better
– But PCs with small variances might also have predictive value
– Should retain the PCs with the largest correlation with Y
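A minimal sketch of this point, using simulated data (hypothetical) in which Y depends on the direction of smallest variance in X. The data-generating setup is an assumption built to make the effect visible.

```r
# Simulated predictors with decreasing variances (hypothetical data)
set.seed(3)
n <- 200
X <- sapply(c(5, 4, 3, 2, 1), function(s) rnorm(n, sd = s))
colnames(X) <- paste0("X", 1:5)

# The response depends only on the low-variance variable X5
y <- 2 * X[, 5] + rnorm(n)

# PC scores and their correlations with y
pc <- princomp(X)$scores
r <- drop(cor(pc, y))
print(round(r, 2))

# Compare regressing on PC1 (largest variance) with the most correlated PC
best <- which.max(abs(r))
r2.pc1 <- summary(lm(y ~ pc[, 1]))$r.squared
r2.best <- summary(lm(y ~ pc[, best]))$r.squared
print(round(c(PC1 = r2.pc1, best = r2.best), 3))
```

Here the last PC carries almost all of the predictive value, so selecting PCs by variance alone would discard it.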
Factor Analysis (FA)
PCA vs FA
Both attempt to do data reduction
PCA leads to principal components
FA leads to factors
[Diagram: PCA panel with X1, X2, X3, X4 and PC1, ..., PC4; FA panel with X1, X2, X3, X4 and F1, F2, F3]
FA in R
The function is “factanal”
Example:
v1 <- c(1,1,1,1,1,1,1,1,1,1,3,3,3,3,3,4,5,6)
v2 <- c(1,2,1,1,1,1,2,1,2,1,3,4,3,3,3,4,6,5)
v3 <- c(3,3,3,3,3,1,1,1,1,1,1,1,1,1,1,5,4,6)
v4 <- c(3,3,4,3,3,1,1,2,1,1,1,1,2,1,1,5,6,4)
v5 <- c(1,1,1,1,1,3,3,3,3,3,1,1,1,1,1,6,4,5)
v6 <- c(1,1,1,2,1,3,3,3,4,3,1,1,1,2,1,6,5,4)
m1 <- cbind(v1,v2,v3,v4,v5,v6)
# fit a 2-factor model from the data matrix
obj = factanal(m1, factors=2)
# the same model fitted from the covariance matrix
obj = factanal(covmat=cov(m1), factors=2)
# plot the loadings of the two factors, labelled by variable
plot(obj$loadings, type="n")
text(obj$loadings, labels=c("v1", "v2", "v3", "v4", "v5", "v6"))
“factanal” fits the model by maximum likelihood (MLE)
The default rotation method used by “factanal” is varimax
Example: Examination Scores
p = 6: Gaelic, English, History, Arithmetic, Algebra, Geometry
n = 220 male students
R =
Factor Rotation
Motivation: rotate the loadings so that the factors are easier to interpret
Varimax criterion
– The rotation that maximizes the total variance, summed over factors, of the squares of the (scaled) loadings
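The criterion can be illustrated with the varimax() function in R's stats package. The loading matrix L below is hypothetical, and varimax_criterion is a helper written for this sketch (not a library function); it scales each row by the square root of its communality, as varimax() does by default.

```r
# Hypothetical unrotated loadings: 6 variables, 2 factors
L <- cbind(F1 = c(0.7, 0.7, 0.7, 0.7, 0.7, 0.7),
           F2 = c(0.5, 0.5, 0.4, -0.4, -0.5, -0.5))
rownames(L) <- paste0("v", 1:6)

# Helper for this sketch: total variance of squared scaled loadings
varimax_criterion <- function(L) {
  A <- L / sqrt(rowSums(L^2))  # scale each row by the root of its communality
  sum(apply(A^2, 2, function(v) mean(v^2) - mean(v)^2))
}

# Varimax rotation (an orthogonal rotation of the loadings)
rot <- varimax(L)
L.rot <- unclass(rot$loadings)
print(round(L.rot, 2))

# The criterion does not decrease under the rotation
crit.before <- varimax_criterion(L)
crit.after <- varimax_criterion(L.rot)
print(round(c(before = crit.before, after = crit.after), 3))

# Communalities (row sums of squared loadings) are unchanged
print(round(cbind(before = rowSums(L^2), after = rowSums(L.rot^2)), 3))
```

After rotation each variable tends to load mainly on one factor, which is what makes the factors easier to name.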