1 To centre or not to centre …or perhaps do it twice Ian Jolliffe Universities of Reading, Southampton, Aberdeen firstname.lastname@example.org
2 Outline of talk Introduction Covariance and correlation Principal component analysis (PCA - EOF analysis) Uncentred analyses Doubly-centred analyses Concluding remarks
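3 Covariance Given a data set $x_{ij}$, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, p$, consisting of $n$ observations on $p$ variables, the covariance between the $j$th and $k$th variables is, with obvious notation (though divisor $n-1$ instead of $n$ might be more appropriate here): $s_{jk} = \frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$, where $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$ is the mean of the $j$th variable.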
4 Covariance and correlation The correlation between variables $j$ and $k$ is $r_{jk} = s_{jk}/[s_{jj} s_{kk}]^{1/2}$. The covariance $s_{jk}$ is the $(j,k)$th element of the matrix $S_{CC} = X_{CC}^T X_{CC}/n$, where $X_{CC}$ is the matrix whose $(i,j)$th element is $x_{ij} - \bar{x}_j$, with $\bar{x}_j$ the mean of the $j$th variable (column).
5 Centering The notation $X_{CC}$ indicates that $X$ has been column-centred. There are several alternatives: –No centering (uncentred), giving $X_{UC}$ –Row centering, giving $X_{RC}$ –Double centering, giving $X_{DC}$
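The four centerings can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the talk: the function name `centre`, its `mode` labels and the random matrix standing in for a 16 × 12 data set are all assumptions made for the example.

```python
import numpy as np

def centre(X, mode="column"):
    """Return a centred copy of X (n observations x p variables).

    mode is a hypothetical label: "none", "column", "row" or "double".
    """
    X = np.asarray(X, dtype=float)
    col_means = X.mean(axis=0, keepdims=True)  # shape (1, p)
    row_means = X.mean(axis=1, keepdims=True)  # shape (n, 1)
    if mode == "none":
        return X.copy()                                # X_UC
    if mode == "column":
        return X - col_means                           # X_CC
    if mode == "row":
        return X - row_means                           # X_RC
    if mode == "double":
        return X - col_means - row_means + X.mean()    # X_DC
    raise ValueError(f"unknown centring mode: {mode!r}")

# Synthetic stand-in data (NOT the UK station data from the talk).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(16, 12))

X_cc = centre(X, "column")
X_rc = centre(X, "row")
X_dc = centre(X, "double")
```

After double centering both the row means and the column means are zero, which is what distinguishes it from applying either centering on its own.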
6 Other forms of covariance For each of the $X$ matrices, we can calculate a matrix of ‘modified covariances’, as $S = X^T X/n$. For example, an ‘uncentred covariance matrix’ can be defined to have elements $s_{jk} = \frac{1}{n}\sum_{i=1}^{n} x_{ij} x_{ik}$.
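As a rough sketch of the formula S = XᵀX/n, the snippet below computes an uncentred and a column-centred version for synthetic data. The helper name `modified_covariance` is hypothetical, and the data are random numbers rather than the talk's examples.

```python
import numpy as np

def modified_covariance(X_any):
    """S = X^T X / n for a matrix that has already been centred
    in whichever way is wanted (or deliberately left uncentred)."""
    X_any = np.asarray(X_any, dtype=float)
    return X_any.T @ X_any / X_any.shape[0]

# Synthetic stand-in data (assumption, not the talk's data).
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(16, 12))

S_uc = modified_covariance(X)                    # uncentred
S_cc = modified_covariance(X - X.mean(axis=0))   # column-centred
```

With column centering and divisor n, the result should agree with `np.cov(X, rowvar=False, bias=True)`, which is a convenient sanity check.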
7 Other forms of correlation ‘Correlations’ can be defined corresponding to each of the modified covariances. Hyvärinen et al. (2001, pp. 24-25) define correlation as an uncentred version, but covariance with column centering!
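One way to read ‘modified correlation’ is as the modified covariance rescaled to have a unit diagonal. That reading is an assumption made for this sketch (the slide does not spell out the formula), as are the function name and the synthetic data.

```python
import numpy as np

def modified_correlation(X_any):
    """Rescale S = X^T X / n so that its diagonal is 1.

    Assumption: 'correlation' means r_jk = s_jk / sqrt(s_jj * s_kk),
    with s computed from whatever centring X_any already has.
    """
    X_any = np.asarray(X_any, dtype=float)
    S = X_any.T @ X_any / X_any.shape[0]
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

# Synthetic stand-in data, far from the origin relative to its spread.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(16, 12))

R_cc = modified_correlation(X - X.mean(axis=0))  # ordinary correlation
R_uc = modified_correlation(X)                   # uncentred version
```

For data whose means are far from zero, every entry of the uncentred version is pulled towards 1, because the common offset dominates each cross-product.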
8 PCA (EOF analysis) – some definitions, terminology If $x$ is a vector of $p$ variables, then the principal components (PCs) are linear combinations $a_1^T x, a_2^T x, \ldots, a_p^T x$. In the $k$th PC, $a_k$, the vector of coefficients or loadings, is chosen so that the variance of $a_k^T x$ is maximised, subject to a normalisation constraint $a_k^T a_k = 1$, and subject to successive PCs being uncorrelated.
9 PCA – more definitions, terminology The optimisation problem which defines PCs turns out, like many in multivariate analysis, to be an eigenvalue problem. The variances of the PCs are eigenvalues of the covariance matrix of $x$, in descending order, and the vectors of loadings $a_k$ are the corresponding eigenvectors. If variables are replaced by standardised variables, obtained by dividing by respective standard deviations, then PCA finds eigenvalues and eigenvectors of the correlation matrix.
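The eigenvalue formulation can be sketched directly with `numpy.linalg.eigh`. This is an illustrative sketch on synthetic data; the helper name `pca_eig` is hypothetical, and divisor n is used to match the covariance definition given earlier in the talk.

```python
import numpy as np

def pca_eig(S):
    """Eigen-decompose a symmetric matrix S.

    Returns eigenvalues in descending order and the matching
    eigenvectors (loadings) as columns.
    """
    vals, vecs = np.linalg.eigh(S)       # eigh returns ascending order
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

# Synthetic correlated data (assumption, not the talk's data).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

Xc = X - X.mean(axis=0)                  # column centring
S = Xc.T @ Xc / Xc.shape[0]              # covariance, divisor n
variances, loadings = pca_eig(S)
scores = Xc @ loadings                   # PC scores a_k^T x for each row
```

The sample variances of the score columns equal the eigenvalues, and the scores are mutually uncorrelated. Passing the correlation matrix instead of `S` gives the standardised-variable version.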
10 Varieties of PCA As well as the covariance/correlation dichotomy, we can do corresponding analyses on the various modified versions. All have been used somewhere in the literature, but it is not always obvious how to interpret what is being done, and what the results mean.
11 Examples We illustrate the various analyses with two toy examples: –Monthly averages of maximum daily temperature for 16 UK stations (n=16; p=12) in 2002 –Monthly precipitation totals for 15 UK stations (n=15; p=12) in 2002. For the first of these, analyses were done using both Celsius and Fahrenheit.
13 Temperature data – column centred Correlation matrix analysis has first PC as a measure of overall temperature at each station; second PC measures seasonal cycle. Correlation matrix analysis is invariant to use of Celsius or Fahrenheit; so is covariance analysis because the transformation is the same for all variables. Covariance analysis is similar except that loadings on the first PC are slightly more variable, reflecting different variance values; similar amounts of variation are accounted for by PC1 (73%, 74% for correlation, covariance respectively).
15 Precipitation data – column centred Because precipitation is less spatially structured than temperature, correlations are more variable and not all are positive. Hence loadings on first PC are no longer nearly uniform. The correlation analysis has two main exceptions (July, December); the covariance analysis is much more variable, due to large differences in variances (Sep = 159, Feb = 4664).
18 Precipitation data – column centred II Also PC1 is less dominant than for temperature (65% covariance, 55% correlation). PC2 is dominated by months that are least correlated with the rest in both analyses, though details are different.
19 Temperature data – uncentred ‘covariance’ analysis We are now looking at directions with the maximum variation with respect to the origin, rather than with respect to the mean. Hence the mean itself often determines the form of the first (frequently very dominant) ‘component’. In this example, PC1 & PC2 have similar loadings to those in the column-centred analysis, but the first PC is a much more dominant source of variation, and a seasonal cycle is now apparent in PC1, reflecting the annual cycle in the means.
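Why the mean tends to determine the first uncentred ‘component’ can be seen from the identity S_UC = S_CC + x̄x̄ᵀ, which follows from expanding the cross-products around the column means. The sketch below checks it numerically on synthetic data with a made-up seasonal cycle; the numbers are assumptions chosen only to mimic means that are far from the origin.

```python
import numpy as np

# Synthetic 'temperatures': an invented seasonal cycle plus noise.
rng = np.random.default_rng(2)
n, p = 16, 12
seasonal_mean = 12.0 + 6.0 * np.cos(2 * np.pi * (np.arange(p) - 6) / p)
X = seasonal_mean + rng.normal(scale=3.0, size=(n, p))

xbar = X.mean(axis=0)
Xc = X - xbar
S_cc = Xc.T @ Xc / n          # column-centred covariance
S_uc = X.T @ X / n            # uncentred 'covariance'

# Uncentred = centred + outer product of the mean vector with itself.
identity_holds = np.allclose(S_uc, S_cc + np.outer(xbar, xbar))

# Leading eigenvector of S_uc versus the direction of the mean vector.
vals, vecs = np.linalg.eigh(S_uc)          # ascending eigenvalues
a1 = vecs[:, -1]
cosine = abs(a1 @ xbar) / np.linalg.norm(xbar)
```

When the squared length of the mean vector is large compared with the centred variances, the rank-one term swamps S_CC and the leading eigenvector lines up almost exactly with the mean, so any seasonal cycle in the means shows up in the PC1 loadings.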
21 Temperature data – uncentred ‘covariance’ analysis II Results are not invariant to choice of scale. Because values for Fahrenheit are further from the origin than Celsius, PC1 is even more dominant (99.95% of ‘variation’ for °F; 99.73% for °C; 74.0% for column-centred). Also loadings in PC1 are less variable for °F than for °C in uncentred analysis. It seems unwise to use uncentred analyses unless the origin is meaningful. Even then, it will be uninformative if all measurements are far from the origin.
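The scale dependence can be reproduced qualitatively with synthetic data. The sketch below is not a reanalysis of the talk's data: the temperatures are invented, and `pc1_share` is a hypothetical helper that returns the leading eigenvalue as a fraction of the trace.

```python
import numpy as np

def pc1_share(X_any):
    """Fraction of total 'variation' (trace of X^T X / n) carried by PC1."""
    S = X_any.T @ X_any / X_any.shape[0]
    vals = np.linalg.eigvalsh(S)           # ascending eigenvalues
    return vals[-1] / vals.sum()

# Invented Celsius temperatures: seasonal cycle plus noise.
rng = np.random.default_rng(3)
n, p = 16, 12
seasonal_mean = 12.0 + 6.0 * np.cos(2 * np.pi * (np.arange(p) - 6) / p)
X_c = seasonal_mean + rng.normal(scale=3.0, size=(n, p))
X_f = 1.8 * X_c + 32.0                      # the same data in Fahrenheit

share_uc_c = pc1_share(X_c)                          # uncentred, deg C
share_uc_f = pc1_share(X_f)                          # uncentred, deg F
share_cc_c = pc1_share(X_c - X_c.mean(axis=0))       # column-centred, deg C
share_cc_f = pc1_share(X_f - X_f.mean(axis=0))       # column-centred, deg F
```

Column centering removes the +32 offset and the factor 1.8 rescales every eigenvalue equally, so the centred shares agree. In the uncentred case the Fahrenheit data sit further from the origin, and the PC1 share moves closer to 1.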
23 Temperature data – uncentred ‘correlation’ analysis Not invariant to choice of scale, but PC1 is very close to an equally weighted combination of all variables in both cases. PC2 is also quite similar in both cases – seasonal cycle again. Larger numbers for °F, so more extreme behaviour (99.94% compared to 99.5% for PC1; greater uniformity of loadings in PC1).
24 Uncentred analyses and anomalies One case where uncentred analyses are appropriate is if we can assume that the population means of our variables are zero, although the sample means are not. This is the case when the data are anomalies.
25 Precipitation data – uncentred ‘covariance’ analysis PC1 again becomes more dominant than in the column-centred analysis (91.7% vs. 65.0%). All loadings on PC1 now have the same sign and are more similar in value; PC2 has little in common with PC2 for the column-centred analysis.
28 Precipitation data – uncentred ‘correlation’ analysis PC1 is very, very close to an equally-weighted combination of all months – it accounts for 91.1% of ‘variation’. PC2 contrasts the first 6 months with the last 6 months. Why? How can this be interpreted?
29 Temperature data – doubly-centred ‘covariance’ analysis This analysis is invariant to choice of °F or °C. PC1 and PC2 have similar loadings to PC2, PC3 in column-centred analysis. This is because the double centering induces a constraint $x_1 + x_2 + \ldots + x_p = 0$. This implies that the first PC in the column-centred analysis now has near-zero variance – other PCs move up one, and the last PC is now given by the relationship above.
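The constraint induced by double centering is easy to verify numerically: every row of the doubly-centred matrix sums to zero, so the vector of equal weights is an eigenvector of the ‘covariance’ matrix with eigenvalue zero. The sketch below uses synthetic data as a stand-in; it illustrates the constraint only, not the talk's loadings.

```python
import numpy as np

# Synthetic stand-in data (assumption, not the talk's data).
rng = np.random.default_rng(4)
n, p = 16, 12
X = rng.normal(loc=10.0, scale=2.0, size=(n, p))

# Double centring: remove column means and row means, add back grand mean.
X_dc = (X - X.mean(axis=0, keepdims=True)
          - X.mean(axis=1, keepdims=True)
          + X.mean())
row_sums = X_dc.sum(axis=1)               # x_1 + ... + x_p for each row

S_dc = X_dc.T @ X_dc / n
vals, vecs = np.linalg.eigh(S_dc)          # ascending eigenvalues

ones_dir = np.ones(p) / np.sqrt(p)         # equal-weights direction
smallest_val = vals[0]                     # variance of the last PC
alignment = abs(vecs[:, 0] @ ones_dir)     # |cosine| with last eigenvector
```

Here `smallest_val` is zero up to rounding and `alignment` is 1 up to rounding, which is the sense in which the last PC is ‘given by the relationship above’. If the first column-centred PC is close to equal weights, it is this direction that double centering removes.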
30 Temperature data – doubly-centred ‘correlation’ analysis Again there is invariance to choice of scale. PC1 accounts for less ‘variation’ than in ‘covariance’ analysis (63.4% vs. 77.8%), but structure of loadings in first two PCs is similar.
31 Precipitation data – doubly-centred ‘covariance’ analysis The double centering again induces a constraint $x_1 + x_2 + \ldots + x_p = 0$, given by the last PC. Because the first PC in the column-centred analysis is not particularly close to $x_1 + x_2 + \ldots + x_p$, PC2 & PC3 don’t look much like PC1 & PC2 of the column-centred analysis for these data. PC1 accounts for only 40.5% of the (doubly-centred) variation.
32 Precipitation data – doubly-centred ‘correlation’ analysis PC1 accounts for 33.6% of ‘variation’ and is similar to that for covariance (Jan, Feb vs. Jul, Aug, Sep, Dec – but how to interpret it?). PC2 is completely different in covariance and correlation analyses.
35 When and why use double centering If an analysis is likely to be dominated by an uninteresting ‘size’ PC – all loadings of the same sign and roughly equal magnitude (size/shape analysis, species abundance data) – then double-centering removes it. Can also be thought of as removing row and column effects from a data matrix and concentrating on interactions between rows and columns. Uncentred analysis accentuates size PCs rather than removing them.
36 Row-centred analysis If column-centred analysis is S-mode analysis, then row-centred analysis is T-mode. It is sometimes suggested that T-mode is related to S-mode by simply transposing the data matrix, but this is not the case in general – different centerings are involved. The relationship does hold if double-centering is used.
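The point about transposition can be checked on any matrix. In the sketch below (synthetic data, hypothetical helper names), column-centering the transposed matrix is compared with the transpose of the column-centred, row-centred and doubly-centred originals.

```python
import numpy as np

def column_centre(A):
    return A - A.mean(axis=0, keepdims=True)

def row_centre(A):
    return A - A.mean(axis=1, keepdims=True)

def double_centre(A):
    return (A - A.mean(axis=0, keepdims=True)
              - A.mean(axis=1, keepdims=True)
              + A.mean())

# Synthetic stand-in data (assumption, not the talk's data).
rng = np.random.default_rng(5)
X = rng.normal(loc=10.0, scale=2.0, size=(16, 12))

# Column-centring the transpose is NOT the transpose of column-centred X ...
matches_cc = np.allclose(column_centre(X.T), column_centre(X).T)
# ... it is the transpose of the row-centred X instead.
matches_rc = np.allclose(column_centre(X.T), row_centre(X).T)
# Double centring, by contrast, commutes with transposition.
double_commutes = np.allclose(double_centre(X.T), double_centre(X).T)
```

So an analysis of the transposed matrix with the usual column centering works with a differently centred matrix, whereas with double centering the two modes share the same matrix up to transposition.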
37 Related ideas Double-centering uses a similar idea to correspondence analysis, but is different in how row and column effects are removed. There are a number of varieties of correlation and anomaly correlation, corresponding to different choices for centering. Empirical orthogonal teleconnections (van den Dool et al. 2000) use uncentred covariances in a regression context. Takane and Shibayama (1991) decompose a data matrix into 4 terms. SVDs of sums of one or more of these terms give uncentred, column-centred, row-centred and doubly-centred PCAs.
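One simple reading of the four-term idea is the familiar split of a data matrix into grand mean, column effects, row effects and interaction. The sketch below assumes that reading; Takane and Shibayama's framework is more general, and the variable names and synthetic data here are illustrative assumptions rather than their notation.

```python
import numpy as np

# Synthetic positive 'precipitation-like' data (assumption).
rng = np.random.default_rng(6)
n, p = 15, 12
X = rng.gamma(shape=2.0, scale=40.0, size=(n, p))

grand = X.mean()
col_means = X.mean(axis=0, keepdims=True)   # shape (1, p)
row_means = X.mean(axis=1, keepdims=True)   # shape (n, 1)

# Four additive terms, each an n x p matrix.
T_grand = np.full_like(X, grand)                     # grand mean
T_col = np.broadcast_to(col_means - grand, X.shape)  # column effects
T_row = np.broadcast_to(row_means - grand, X.shape)  # row effects
T_int = X - col_means - row_means + grand            # interaction

# Sums of terms reproduce the four centrings.
X_uc = T_grand + T_col + T_row + T_int   # uncentred (all four terms)
X_cc = T_row + T_int                     # column-centred
X_rc = T_col + T_int                     # row-centred
X_dc = T_int                             # doubly-centred

# An SVD of any of these sums gives the corresponding PCA,
# e.g. right singular vectors of X_cc are the column-centred loadings.
U, sing, Vt = np.linalg.svd(X_cc, full_matrices=False)
```

The squared singular values divided by n equal the eigenvalues of the corresponding modified covariance matrix, which links the SVD view back to the eigenvalue formulation.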
38 Final remarks Standard EOF analysis is (relatively) easy to understand – variance maximisation. For other techniques it’s less clear what we are optimising, and how to interpret the results. There may be reasons for using no centering or double centering, but potential users need to understand and explain what they are doing.
39 References Hyvärinen, A., Karhunen, J. & Oja, E. (2001). Independent component analysis. Wiley. Takane, Y. & Shibayama, T. (1991). Principal component analysis with external information on both subjects and variables. Psychometrika, 56, 97-120. Van den Dool, H. M., Saha, S. & Johansson, Å. (2000). Empirical orthogonal teleconnections. J. Climate, 13, 1421-1435.