Université d’Ottawa / University of Ottawa 2001 Bio 8100s Applied Multivariate Biostatistics

L11.1 Lecture 11: Canonical correlation analysis (CANCOR)
- Rationale and use of canonical correlation analysis
- Canonical correlation versus multiple regression
- Estimating canonical variates and correlations
- Significance tests
- Interpretation of canonical variates
- Rotation of canonical variates
- Redundancy indices
- Example: Pgi frequencies in California Euphydryas editha colonies

L11.2 Canonical correlation
- A method of breaking down the associations between two sets of variables: a “predictor” (independent variable) set and a “dependent” variable set.

L11.3 Example: Pgi frequencies in California Euphydryas editha colonies in relation to environmental factors
[Table: variables listed under two column headings, "Dependent set" and "Predictor set"]

L11.4 Canonical correlation (CANCOR) versus multiple regression (MR)
- For MR, our interest is in predicting the effect of a particular independent variable on a particular dependent variable.
- For CANCOR, our interest is in determining the number and nature of independent relationships between the independent and dependent variable sets.
- This is accomplished through the use of pairs of linear combinations of variables (canonical variates), with each pair uncorrelated with every other pair.
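
One way to see how the two methods connect: when the dependent set contains a single variable (q = 1), there is only one canonical correlation, and its square equals the multiple R² of the regression of Y on the X set. The sketch below checks this numerically. The simulated data, the seed and all variable names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 3

# Simulated data (hypothetical): three predictors and one response
X = rng.normal(size=(N, p))
y = X @ np.array([0.5, -0.3, 0.0]) + rng.normal(size=N)

# Correlation matrix of (X, y), partitioned into blocks
R = np.corrcoef(np.column_stack([X, y]), rowvar=False)
R_xx = R[:p, :p]   # within-set (X) correlations
r_xy = R[:p, p]    # between-set correlations (a vector, since q = 1)

# With q = 1 the within-set (Y) matrix is [1], so the single eigenvalue
# (squared canonical correlation) reduces to r_xy' R_xx^{-1} r_xy
lam = r_xy @ np.linalg.solve(R_xx, r_xy)

# Multiple R^2 from ordinary least squares on the same data
design = np.column_stack([np.ones(N), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ beta
r_squared = 1.0 - resid.var() / y.var()

print(f"squared canonical correlation: {lam:.6f}")
print(f"multiple R^2:                  {r_squared:.6f}")
```

With q > 1 there is no single R², which is where the pairs of canonical variates come in.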

L11.5 The data
- The data consist of p independent variables $X_1, X_2, \dots, X_p$ (the independent variable set) and q dependent variables $Y_1, Y_2, \dots, Y_q$, measured on a sample of N objects, from which we can derive a $(p + q) \times (p + q)$ correlation matrix.
- This matrix is made up of three kinds of blocks: within-set (X) correlations ($p \times p$), within-set (Y) correlations ($q \times q$), and between-set (X,Y) correlations ($p \times q$).
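
As a minimal sketch of this partition, assuming the two variable sets are held in NumPy arrays with one row per object, the blocks can be sliced out of the full correlation matrix. The function name `partition_correlations` and the random data are hypothetical, chosen only for illustration.

```python
import numpy as np

def partition_correlations(X, Y):
    """Return the within-X, within-Y and between-set correlation blocks."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    p = X.shape[1]
    # Full (p + q) x (p + q) correlation matrix, X variables first
    R = np.corrcoef(np.hstack([X, Y]), rowvar=False)
    R_xx = R[:p, :p]   # within-set (X), p x p
    R_yy = R[p:, p:]   # within-set (Y), q x q
    R_xy = R[:p, p:]   # between-set (X, Y), p x q
    return R_xx, R_yy, R_xy

# Illustrative data: N = 50 objects, p = 3, q = 2
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
Y = rng.normal(size=(50, 2))
R_xx, R_yy, R_xy = partition_correlations(X, Y)
print(R_xx.shape, R_yy.shape, R_xy.shape)
```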

L11.6 What are canonical variates anyway?
- The coefficients of the canonical variates are eigenvectors of a matrix derived from the within-set and between-set correlation matrices. Each canonical variate is therefore a new axis (a linear combination of the original variables) in the space of either X or Y, and the canonical variates within a set are mutually uncorrelated.
[Figure: axes labelled X1, X2 and Y1, Y2]

L11.7 Estimating canonical variates
- The first pair of canonical variates is obtained by finding the coefficients of the linear functions
$$U_1 = a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p, \qquad V_1 = b_{11}Y_1 + b_{12}Y_2 + \dots + b_{1q}Y_q$$
that maximize the correlation between $U_1$ and $V_1$.

L11.8 Estimating canonical variates (cont’d)
- The second pair of canonical variates is obtained by finding the coefficients of the linear functions
$$U_2 = a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p, \qquad V_2 = b_{21}Y_1 + b_{22}Y_2 + \dots + b_{2q}Y_q$$
that maximize the correlation between $U_2$ and $V_2$, subject to the constraint that $U_2$ and $V_2$ are uncorrelated with $U_1$ and $V_1$:
$$r(U_1, U_2) = r(V_1, V_2) = r(U_1, V_2) = r(U_2, V_1) = 0$$

L11.9 Estimating canonical variates (cont’d)
- The third pair of canonical variates is obtained by finding the coefficients of the linear functions
$$U_3 = a_{31}X_1 + a_{32}X_2 + \dots + a_{3p}X_p, \qquad V_3 = b_{31}Y_1 + b_{32}Y_2 + \dots + b_{3q}Y_q$$
that maximize the correlation between $U_3$ and $V_3$, subject to the constraint that $U_3$ and $V_3$ are uncorrelated with $U_1$, $V_1$, $U_2$ and $V_2$.

L11.10 Calculating canonical variates and correlations
- From the within- and between-set correlation matrices, we solve the eigenvalue problem
$$\left(B^{-1} C_{XY}' A^{-1} C_{XY} - \lambda I\right) b = 0$$
which has r solutions. Here $A$ is the within-set (X) correlation matrix ($p \times p$), $B$ is the within-set (Y) correlation matrix ($q \times q$), and $C_{XY}$ is the between-set correlation matrix ($p \times q$).
- The eigenvalues $\lambda_j$ are the squares of the correlations between the canonical variates, i.e., of the canonical correlations:
$$r(U_j, V_j) = \sqrt{\lambda_j}$$

L11.11 Calculating canonical variates and correlations (cont’d)
- The coefficients for the Y canonical variates $V_1$, $V_2$, etc. are the elements of the corresponding eigenvectors $b_1$, $b_2$, etc. of $B^{-1} C_{XY}' A^{-1} C_{XY}$, where $B$ is the within-set (Y) correlation matrix, $A$ is the within-set (X) correlation matrix and $C_{XY}$ is the between-set correlation matrix.
- The coefficients of the ith canonical variate for the X variables are then given by the elements of:
$$a_i = A^{-1} C_{XY}\, b_i$$

L11.12 Calculating canonical variates and correlations (cont’d)
- The ith pair of canonical variates is then given by:
$$U_i = a_i' X, \qquad V_i = b_i' Y$$
- where X and Y are vectors of standardized (mean 0, standard deviation 1) values. In this manner, we can generate canonical variate scores for each observation in the data set.
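
The calculation can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the routine used to produce the lecture's output: the function name `cancor`, the block names `R_xx`, `R_yy`, `R_xy` (within-set X, within-set Y and between-set correlation matrices) and the simulated data are all hypothetical.

```python
import numpy as np

def cancor(X, Y):
    """Eigenvalues, coefficients and scores for canonical correlation."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    p, q = X.shape[1], Y.shape[1]

    # Standardize each variable to mean 0, standard deviation 1
    Zx = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Zy = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)

    R = np.corrcoef(np.hstack([Zx, Zy]), rowvar=False)
    R_xx, R_yy, R_xy = R[:p, :p], R[p:, p:], R[:p, p:]

    # q x q matrix R_yy^{-1} R_xy' R_xx^{-1} R_xy of the eigenvalue problem
    M = np.linalg.solve(R_yy, R_xy.T @ np.linalg.solve(R_xx, R_xy))
    eigvals, eigvecs = np.linalg.eig(M)

    # Keep the r = min(p, q) largest eigenvalues, in decreasing order
    order = np.argsort(eigvals.real)[::-1][:min(p, q)]
    lam = np.clip(eigvals.real[order], 0.0, 1.0)
    b = eigvecs.real[:, order]              # Y coefficients, one column per variate
    a = np.linalg.solve(R_xx, R_xy @ b)     # X coefficients: R_xx^{-1} R_xy b

    U, V = Zx @ a, Zy @ b                   # canonical variate scores
    return lam, a, b, U, V

# Illustrative data in which Y depends partly on the first two X variables
rng = np.random.default_rng(2)
N, p, q = 120, 3, 2
X = rng.normal(size=(N, p))
Y = 0.7 * X[:, :q] + rng.normal(size=(N, q))

lam, a, b, U, V = cancor(X, Y)
r_uv = [np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(len(lam))]
print("eigenvalues:           ", lam)
print("canonical correlations:", np.sqrt(lam))
print("r(U_i, V_i) from scores:", r_uv)
```

The last two printed lines should agree, which is a useful check that the coefficients and scores were matched to the right eigenvalues.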

L11.13 Standardizing canonical variates
- The variances of U and V are influenced by the scaling adopted for the eigenvectors a and b, but the canonical correlations r(U,V) are unaffected.
- To generate standardized canonical variates, calculate the standard deviation of $U_i$ ($V_i$) and divide the $a_{ij}$ ($b_{ij}$) values by the corresponding standard deviation.
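
A small numerical check of both points is sketched below. The coefficient vectors `a` and `b` are arbitrary, hypothetical values standing in for one unscaled pair of eigenvector-based coefficients, and the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 80
X = rng.normal(size=(N, 3))
Y = 0.6 * X[:, :2] + rng.normal(size=(N, 2))

# Hypothetical, arbitrarily scaled coefficients for one pair of variates
a = np.array([2.0, -1.0, 0.5])
b = np.array([0.3, 0.1])

U, V = X @ a, Y @ b
r_before = np.corrcoef(U, V)[0, 1]

# Divide each coefficient vector by the standard deviation of its variate
a_std = a / U.std(ddof=1)
b_std = b / V.std(ddof=1)
U_std, V_std = X @ a_std, Y @ b_std
r_after = np.corrcoef(U_std, V_std)[0, 1]

print("SD before:", U.std(ddof=1), V.std(ddof=1))
print("SD after: ", U_std.std(ddof=1), V_std.std(ddof=1))
print("correlation before and after:", r_before, r_after)
```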

L11.14 The end result
- A set of r = min(p, q) pairs of canonical variates, one member of each pair for the dependent variable set {V}, the other for the independent variable set {U}.
- A set of r canonical correlations C = r(U,V), each representing the correlation between the two members of a pair of canonical variates.
[Figure: panels with axes U1, V1 and U2, V2; captions "High first canonical correlation" and "Low second canonical correlation"]

L11.15 Significance testing
- Question: which canonical correlates are statistically “significant”?
- To test the significance of all r = min(p, q) canonical correlates based on p + q variables, calculate Bartlett’s V and compare it to the $\chi^2$ distribution with pq degrees of freedom:
$$V = -\left[N - 1 - \tfrac{1}{2}(p + q + 1)\right] \sum_{i=1}^{r} \ln(1 - \lambda_i)$$
where $\lambda_i$ is the eigenvalue associated with the ith canonical variate.

L11.16 Significance testing (cont’d)
- Each CV is tested in a hierarchical fashion, starting with a test of the significance of all CVs combined.
- If all CVs combined are not significant, then no CV is significant.
- If all CVs combined are significant, then remove the first CV, recalculate V (= $V_1$) and test.
- Continue until the residual $V_j$ is no longer significant at df = (p – j)(q – j).
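
The hierarchical procedure can be sketched as below, using Bartlett's approximation V_j = -[N - 1 - (p + q + 1)/2] Σ ln(1 - λ_i), summed over the eigenvalues that remain after removing the first j. To keep the sketch free of third-party dependencies, the chi-square tail probability is computed from its closed-form series; the helper names `chi2_sf` and `bartlett_sequence`, and the eigenvalues and sample size in the usage lines, are hypothetical and are not the values from the Pgi example.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail term plus a finite series in odd powers of sqrt(x)
    chi = math.sqrt(x)
    total = math.erfc(chi / math.sqrt(2.0))
    term = math.sqrt(2.0 / math.pi) * math.exp(-half) * chi
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total

def bartlett_sequence(eigenvalues, N, p, q):
    """Bartlett's V, df and p-value after removing the first j variates."""
    scale = N - 1 - (p + q + 1) / 2.0
    results = []
    for j in range(len(eigenvalues)):
        V = -scale * sum(math.log(1.0 - lam) for lam in eigenvalues[j:])
        df = (p - j) * (q - j)
        results.append((j, V, df, chi2_sf(V, df)))
    return results

# Hypothetical eigenvalues for p = q = 4 variables and N = 50 objects
for j, V, df, prob in bartlett_sequence([0.60, 0.10, 0.02, 0.01], N=50, p=4, q=4):
    print(f"correlations {j + 1} through 4: V = {V:.3f}, df = {df}, prob = {prob:.4f}")
```

In practice one would stop at the first row whose probability exceeds the chosen significance level and treat only the earlier canonical variates as significant.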

L11.17 Caveats/assumptions: tests of significance
- Tests of significance assume that the observations have a multivariate normal distribution.
- Tests of significance can be very misleading because, owing to sampling error, the jth canonical correlation in the population may not appear as the jth canonical correlation in the sample.
- So be careful, especially if the sample is small!

L11.18 Interpreting canonical variates I
- Procedure: examine the standardized coefficients of the canonical variates.
- Inference: variables with large (in absolute value) coefficients are most important.
[Table of standardized coefficients; values not preserved. Annotation: U1 is mainly a contrast between X3 and X4 on the one hand, and X2 on the other.]

L11.19 Interpreting canonical variates II
- Procedure: examine the correlations of the original variables with the canonical variates (canonical factor loadings).
- Inference: variables with large (in absolute value) correlations are most important to a particular canonical variate.
[Table: correlations of variables X1 to X4 with canonical variates U1 and U2; values not preserved. Annotation: X4 is not associated with U2.]
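
Loadings are simply correlations between the original variables and the canonical variate scores, so they can be computed directly once scores are available. The sketch below uses a hypothetical helper `loadings`, simulated data and made-up coefficients; it also illustrates that, when variables within a set are correlated, a variable with a zero coefficient can still have a large loading, so the two interpretation procedures need not agree.

```python
import numpy as np

def loadings(Z, scores):
    """Correlations of each column of Z (rows) with each column of scores."""
    Z = np.asarray(Z, dtype=float)
    scores = np.asarray(scores, dtype=float)
    k = Z.shape[1]
    full = np.corrcoef(np.hstack([Z, scores]), rowvar=False)
    return full[:k, k:]

rng = np.random.default_rng(5)
N = 200
x1 = rng.normal(size=N)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=N)   # strongly correlated with x1
x3 = rng.normal(size=N)
X = np.column_stack([x1, x2, x3])

# Hypothetical coefficients for one variate: x2 has a coefficient of zero
coef = np.array([[1.0], [0.0], [0.5]])
U = X @ coef

L = loadings(X, U)
for name, c, l in zip(["x1", "x2", "x3"], coef[:, 0], L[:, 0]):
    print(f"{name}: coefficient = {c:4.1f}, loading = {l:6.3f}")
```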

L11.20 Judging reliability of canonical correlates
- Make sure that the ratio of sample size to total number of variables, N/(p + q), is large (> 20 for the first correlate, > 40 for the first two correlates).
- If this condition does not hold, both the canonical variates and the canonical correlations are unreliable.
- In such cases, try to increase N/(p + q) by reducing the number of variables: (1) run a preliminary PCA and use the components as variables; (2) select a small subset of the X and Y variables; or (3) use another technique for variable set reduction (e.g., canonical ridge regression).

L11.21 Canonical variate rotation
- As with PCA, DF, etc., interpretation of canonical variates may be easier if they are rotated…
- …but remember that under rotation the maximization property may be lost, i.e., the correlation between the first canonical variates need not be the largest possible.
- Remember also that here pairs of variates are being rotated simultaneously, so that a rotation which assists in the interpretation of $U_1$ may make the interpretation of $V_1$ more difficult.
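
A short worked case shows why the maximization property can be lost. It assumes standardized canonical variates (unit variance) and a rigid rotation of the first two pairs through the same angle θ; this particular rotation is an illustrative choice, not a method named in the lecture. C1 ≥ C2 are the first two canonical correlations.

```latex
% Rotated first pair
U_1^{*} = \cos\theta\, U_1 + \sin\theta\, U_2, \qquad
V_1^{*} = \cos\theta\, V_1 + \sin\theta\, V_2

% Variates from different pairs are uncorrelated:
% r(U_1,U_2) = r(V_1,V_2) = r(U_1,V_2) = r(U_2,V_1) = 0,
% so U_1^{*} and V_1^{*} still have unit variance, and
r(U_1^{*}, V_1^{*})
  = \cos^2\theta\, r(U_1,V_1) + \sin^2\theta\, r(U_2,V_2)
  = \cos^2\theta\, C_1 + \sin^2\theta\, C_2
  \le C_1
```

Equality holds only when θ is a multiple of π or when C1 = C2, so any genuine rotation of this kind lowers the correlation of the first pair whenever the first two canonical correlations differ.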

L11.22 Stewart-Love redundancy index
- In MR, the multiple $R^2$ gives the proportion of variance in Y “accounted for” (“extracted”) by the set of independent variables {X}.
- But a squared canonical correlation $C_i^2$ is a measure of the variance shared between $U_i$ and $V_i$, not between the original variable sets.
- Therefore, we cannot assume that canonical variates “extract” much variance from their associated sets (e.g., $U_1$ may not extract much variance from {Y}).

L11.23 Canonical variate variance “extraction”
- The amount of variance in set {Y} extracted by canonical variate $U_i$ is the average of the squared canonical variate – y variable correlations:
$$\frac{1}{q} \sum_{j=1}^{q} r_{ij}^2$$
where $r_{ij}$ is the correlation between the ith canonical variate and $Y_j$.
- The redundancy in set {Y}, given canonical variates $U_1$ to $U_i$, is: [equation not preserved]
- The total (S-L) redundancy is: [equation not preserved]
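
As a hedged illustration, the sketch below follows the usual textbook formulation of the Stewart-Love index, which may differ in notation from the equations on the original slide: the redundancy for the ith pair is taken as the squared canonical correlation times the average squared correlation between the Y variables and their own variate V_i, and the total redundancy is the sum over pairs. The same quantity is obtained from the average squared correlation between the Y variables and U_i, which the code checks. The data and all names (`R_xx`, `R_yy`, `R_xy`, `corr_with`, and so on) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, q = 150, 3, 2
X = rng.normal(size=(N, p))
Y = 0.8 * X[:, :q] + rng.normal(size=(N, q))

# Standardize, then partition the correlation matrix
Zx = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Zy = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
R = np.corrcoef(np.hstack([Zx, Zy]), rowvar=False)
R_xx, R_yy, R_xy = R[:p, :p], R[p:, p:], R[:p, p:]

# Eigenvalues (squared canonical correlations) and coefficients
M = np.linalg.solve(R_yy, R_xy.T @ np.linalg.solve(R_xx, R_xy))
eigvals, eigvecs = np.linalg.eig(M)
order = np.argsort(eigvals.real)[::-1][:min(p, q)]
lam = eigvals.real[order]
b = eigvecs.real[:, order]
a = np.linalg.solve(R_xx, R_xy @ b)
U, V = Zx @ a, Zy @ b

def corr_with(Z, S):
    """Correlations between columns of Z (rows) and columns of S (columns)."""
    k = Z.shape[1]
    return np.corrcoef(np.hstack([Z, S]), rowvar=False)[:k, k:]

own = corr_with(Zy, V)     # Y variables with their own variates V_i
cross = corr_with(Zy, U)   # Y variables with the X-set variates U_i

extracted = (own ** 2).mean(axis=0)            # variance in {Y} extracted by V_i
redundancy = lam * extracted                   # redundancy per canonical pair
redundancy_alt = (cross ** 2).mean(axis=0)     # same quantity via U_i
total_redundancy = redundancy.sum()

print("variance extracted:", extracted)
print("redundancy:        ", redundancy)
print("total redundancy:  ", total_redundancy)
```

Note how a pair can have a sizeable canonical correlation yet a small redundancy if its variate extracts little of the variance in {Y}.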

L11.24 Redundancy indices: summary
- A high redundancy index indicates that the corresponding set of canonical variates “explains” a large proportion of the variance in set {Y}.
- In principle, redundancy can also be calculated for the canonical variates $V_i$, i.e., how much of the variance in set {X} is explained by $V_i$.

L11.25 Example: Pgi frequencies in California Euphydryas editha colonies in relation to environmental factors
[Table: variables listed under two column headings, "Dependent set" and "Independent set"]

L11.26 Within dependent variable set correlations
[Output table "Within basic set y correlations": columns PGI_80, PGI_100, PGI_116, PGI_4060; row labels truncated to PGI_; values not preserved]

L11.27 Within independent variable set correlations
[Output table: rows and columns ALTITUDE, ANN_PRECIP, MAXTEMP, MINTEMP; values not preserved]

L11.28 Between dependent and independent variable set correlations
[Output table: columns PGI_80, PGI_100, PGI_116, PGI_4060; rows ALTITUDE, ANN_PRECIP, MAXTEMP, MINTEMP; values not preserved]

L11.29 Partial regression coefficients, individual Ys on X
[Output table "Betas predicting basic y (col) from basic x (row) variables": columns PGI_80, PGI_100, PGI_116, PGI_4060; rows ALTITUDE, ANN_PRECIP, MAXTEMP, MINTEMP; values not preserved]

L11.30 Partial regression coefficient standard errors, individual Ys on X
[Output table "Standard errors of betas": columns PGI_80, PGI_100, PGI_116, PGI_4060; rows ALTITUDE, ANN_PRECIP, ANN_MAXTEMP, ANN_MINTEMP; values not preserved]

L11.31 Partial regression coefficient t-statistics, individual Ys on X
[Output table "T-statistics for betas": columns PGI_80, PGI_100, PGI_116, PGI_4060; rows ALTITUDE, ANN_PRECIP, ANN_MAXTEMP, ANN_MINTEMP; values not preserved]

L11.32 Canonical correlations, Bartlett’s V
Canonical correlations: [values not preserved]
Bartlett test of residual correlations:
- Correlations 1 through 4: chi-square statistic = [not preserved], df = 16, prob = [not preserved]
- Correlations 2 through 4: chi-square statistic = [not preserved], df = 9, prob = 0.902

L11.33 Canonical variates (V)
[Output table "Canonical coefficients for dependent (y) set": row labels truncated to PGI_; values not preserved]

L11.34 Canonical variates (U)
[Output table "Canonical coefficients for independent (x) set": rows ALTITUDE, ANN_PRECIP, ANN_MAXTEMP, ANN_MINTEMP; values not preserved]

L11.35 Canonical variate – y variable correlations and redundancies
[Output table "Canonical loadings (y variable by factor correlations)": row labels truncated to PGI_; values not preserved]
[Output table "Canonical redundancies for dependent set": values not preserved]