Université d’Ottawa / University of Ottawa 2001 Bio 8100s Applied Multivariate Biostatistics L11.1 Lecture 11: Canonical correlation analysis (CANCOR)


1 L11.1 Lecture 11: Canonical correlation analysis (CANCOR)
- Rationale and use of canonical correlation analysis
- Canonical correlation versus multiple regression
- Estimating canonical variates and correlations
- Significance tests
- Interpretation of canonical variates
- Rotation of canonical variates
- Redundancy indices
- Example: Pgi frequencies in California Euphydryas editha colonies

2 L11.2 Canonical correlation
- A method of breaking down the associations between two sets of variables: a “predictor” (independent variable) set and a “dependent” variable set.

3 L11.3 Example: Pgi frequencies in California Euphydryas editha colonies in relation to environmental factors.
[Data table not preserved in transcript; column groups: Dependent set, Predictor set]

4 L11.4 Canonical correlation (CANCOR) versus multiple regression (MR)
- For MR, our interest is in predicting the effect of a particular independent variable on a particular dependent variable.
- For CANCOR, our interest is in determining the number and nature of independent relationships between the independent and dependent variable sets.
- This is accomplished through the use of pairs of linear combinations of variables (canonical variates), each pair being uncorrelated with the others.

5 L11.5 The data
- The data consist of a set of p independent variables $X_1, X_2, \ldots, X_p$ (the independent variable set) and q dependent variables $Y_1, Y_2, \ldots, Y_q$, measured on a sample of N objects, from which we can derive a (p + q) × (p + q) correlation matrix.
[Figure: correlation matrix partitioned into within-set (X) correlations, within-set (Y) correlations, and between-set (X, Y) correlations]

6 L11.6 What are canonical variates anyway?
- Canonical variates are the eigenvectors of the corresponding correlation matrix, and thus represent orthogonal line segments that “span” the within-set variability in either X or Y.
[Figure: axes $X_1$, $X_2$ and $Y_1$, $Y_2$]

7 L11.7 Estimating canonical variates
- The first pair of canonical variates is obtained by finding the coefficients of the linear functions
  $U_1 = a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p$ and $V_1 = b_{11}Y_1 + b_{12}Y_2 + \cdots + b_{1q}Y_q$
  that maximize the correlation between $U_1$ and $V_1$.

8 L11.8 Estimating canonical variates (cont’d)
- The second pair of canonical variates is obtained by finding the coefficients of the linear functions
  $U_2 = a_{21}X_1 + a_{22}X_2 + \cdots + a_{2p}X_p$ and $V_2 = b_{21}Y_1 + b_{22}Y_2 + \cdots + b_{2q}Y_q$
  that maximize the correlation between $U_2$ and $V_2$, subject to the constraint that $U_2$ and $V_2$ are uncorrelated with $U_1$ and $V_1$:
  $r(U_1, U_2) = r(V_1, V_2) = r(U_1, V_2) = r(U_2, V_1) = 0$.

9 L11.9 Estimating canonical variates (cont’d)
- The third pair of canonical variates is obtained by finding the coefficients of the linear functions
  $U_3 = a_{31}X_1 + a_{32}X_2 + \cdots + a_{3p}X_p$ and $V_3 = b_{31}Y_1 + b_{32}Y_2 + \cdots + b_{3q}Y_q$
  that maximize the correlation between $U_3$ and $V_3$, subject to the constraint that $U_3$ and $V_3$ are uncorrelated with the first two pairs:
  $r(U_3, U_j) = r(V_3, V_j) = r(U_3, V_j) = r(V_3, U_j) = 0$ for $j = 1, 2$.

10 L11.10 Calculating canonical covariates and correlations
- From the within- and between-set correlation matrices, we solve the eigenvalue problem [equation not preserved in transcript], which has r solutions.
- The eigenvalues $\lambda_j$ are the squares of the correlations between the canonical variates, i.e., of the canonical correlations: $C_j = \sqrt{\lambda_j}$.

11 L11.11 Calculating canonical covariates and correlations (cont’d)
- The coefficients for the Y canonical variates $V_1$, $V_2$, etc., are the elements of the corresponding eigenvectors $\mathbf{b}_1$, $\mathbf{b}_2$, etc., of the eigenvalue problem, which involves the within-set (Y) correlation matrix B.
- The coefficients of the ith canonical variate for the X variables are then given by the elements of a vector $\mathbf{a}_i$ computed from $\mathbf{b}_i$ [equation not preserved in transcript].

12 L11.12 Calculating canonical covariates and correlations (cont’d)
- The ith pair of canonical variates is then given by $U_i = \mathbf{a}_i'\mathbf{X}$ and $V_i = \mathbf{b}_i'\mathbf{Y}$, where X and Y are vectors of standardized (mean 0, standard deviation 1) values. In this manner, we can generate canonical variate scores for each observation in the data set.
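The equations on slides L11.10 to L11.12 did not survive in this transcript. One common textbook formulation is sketched below as an assumption, not as the lecture's confirmed notation. It partitions the correlation matrix into A (within-set X), B (within-set Y, the name used on slide L11.11) and C (between-set); the symbols A and C are labels introduced here.

```latex
% Partitioned (p+q) x (p+q) correlation matrix (A, C are assumed labels)
R = \begin{pmatrix} A & C \\ C' & B \end{pmatrix},
\qquad A:\ p \times p,\quad B:\ q \times q,\quad C:\ p \times q

% Eigenvalue problem with r = \min(p, q) non-trivial solutions
\left( B^{-1} C' A^{-1} C - \lambda I \right) \mathbf{b} = \mathbf{0}

% X-set coefficients obtained from the Y-set eigenvector
\mathbf{a}_i = A^{-1} C \, \mathbf{b}_i

% Canonical variates and canonical correlations
U_i = \mathbf{a}_i' \mathbf{X}, \qquad
V_i = \mathbf{b}_i' \mathbf{Y}, \qquad
C_i = r(U_i, V_i) = \sqrt{\lambda_i}
```

Under this formulation, rescaling $\mathbf{b}_i$ rescales $\mathbf{a}_i$ by the same factor, which is why the canonical correlations do not depend on the scaling chosen for the eigenvectors.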

13 L11.13 Standardizing canonical covariates
- The variances of U and V are influenced by the scaling adopted for the eigenvectors a and b, but the canonical correlations r(U, V) are unaffected.
- To generate standardized canonical variates, calculate the standard deviation of $U_i$ ($V_i$) and divide the $a_{ij}$ ($b_{ij}$) values by the corresponding standard deviation.
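The calculation and standardization steps on slides L11.10 to L11.13 can be sketched with NumPy. This is a minimal illustration, not the lecture's own code. It assumes the eigenvalue problem takes the form $(B^{-1}C'A^{-1}C - \lambda I)\mathbf{b} = \mathbf{0}$ with $\mathbf{a}_i = A^{-1}C\mathbf{b}_i$; the function name `cancor` and the synthetic data are hypothetical.

```python
import numpy as np


def cancor(X, Y):
    """Canonical correlation analysis from within- and between-set correlations.

    X is an (N, p) array for the independent set, Y an (N, q) array for the
    dependent set. Returns the canonical correlations, the standardized
    coefficient matrices a (p x r) and b (q x r), and the scores U and V.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    p, q = X.shape[1], Y.shape[1]
    r = min(p, q)

    # Standardize every variable to mean 0 and standard deviation 1.
    Zx = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Zy = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)

    # Partition the (p + q) x (p + q) correlation matrix.
    R = np.corrcoef(np.hstack([Zx, Zy]), rowvar=False)
    A, B, C = R[:p, :p], R[p:, p:], R[:p, p:]

    # Assumed eigenvalue problem: (B^-1 C' A^-1 C - lambda I) b = 0.
    M = np.linalg.solve(B, C.T) @ np.linalg.solve(A, C)
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(eigvals.real)[::-1][:r]   # largest eigenvalues first
    lam = np.clip(eigvals.real[order], 0.0, 1.0)
    b = eigvecs.real[:, order]
    a = np.linalg.solve(A, C @ b)                # a_i = A^-1 C b_i

    # Rescale so that every canonical variate has unit standard deviation.
    a = a / (Zx @ a).std(axis=0, ddof=1)
    b = b / (Zy @ b).std(axis=0, ddof=1)
    return np.sqrt(lam), a, b, Zx @ a, Zy @ b


# Usage on synthetic data (not the lecture's data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X[:, :2] @ np.array([[1.0, 0.5], [0.0, 1.0]]) + rng.normal(size=(200, 2))
canon_r, a, b, U, V = cancor(X, Y)
print(canon_r)
```

The sketch does not handle a canonical correlation of exactly zero, where the corresponding U would have zero variance and the rescaling step would divide by zero.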

14 L11.14 The end result
- A set of r = min(p, q) pairs of canonical variates, one member of each pair for the dependent variable set {V}, the other for the independent variable set {U}.
- A set of r canonical correlations C = r(U, V), each representing the correlation between the two members of a pair of canonical variates.
[Figure: canonical variates $U_1$, $V_1$ and $U_2$, $V_2$; panels labelled “High first canonical correlation” and “Low second canonical correlation”]

15 L11.15 Significance testing
- Question: which canonical correlates are statistically “significant”?
- For testing the significance of all r = min(p, q) canonical correlates based on p + q variables, calculate Bartlett’s V (a function of the eigenvalues $\lambda_i$ associated with the canonical variates; equation not preserved in transcript) and compare it to a $\chi^2$ distribution with pq degrees of freedom.
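The formula for Bartlett's V is missing from this transcript. A commonly used form is given below as an assumption. With N = 16 (a value not stated in the transcript) it reproduces the chi-square statistics in the lecture's example output on slide L11.32 to within rounding.

```latex
% Bartlett's statistic for all r = \min(p, q) canonical correlations
V = -\left[ N - \tfrac{1}{2}(p + q + 3) \right]
    \sum_{i=1}^{r} \ln\left(1 - \lambda_i\right),
\qquad V \sim \chi^2_{pq} \ \text{under the null hypothesis}

% Residual statistic after removing the first j canonical variates
V_j = -\left[ N - \tfrac{1}{2}(p + q + 3) \right]
      \sum_{i=j+1}^{r} \ln\left(1 - \lambda_i\right),
\qquad \text{df} = (p - j)(q - j)
```

Here $\lambda_i$ is the eigenvalue (squared canonical correlation) associated with the ith canonical variate.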

16 L11.16 Significance testing (cont’d)
- Each CV is tested in a hierarchical fashion by first testing the significance of all CVs combined.
- If all CVs combined are not significant, then no CV is significant.
- If all CVs combined are significant, then remove the first CV, recalculate V (= $V_1$) and test.
- Continue until the residual $V_j$ is no longer significant at df = (p – j)(q – j).
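The hierarchical procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions: it uses the common form $V_j = -[N - (p + q + 3)/2]\sum_{i>j}\ln(1 - \lambda_i)$, the function name `bartlett_sequence` is hypothetical, and N = 16 is an assumed sample size that is not given in the transcript. The canonical correlations are those reported on slide L11.32.

```python
import math


def bartlett_sequence(canon_corrs, n, p, q):
    """Return a list of (j, V_j, df_j), where V_j tests correlations j+1..r."""
    factor = n - (p + q + 3) / 2
    results = []
    for j in range(len(canon_corrs)):
        # Sum over the canonical correlations that remain after removing j.
        log_sum = sum(math.log(1 - c ** 2) for c in canon_corrs[j:])
        results.append((j, -factor * log_sum, (p - j) * (q - j)))
    return results


# Canonical correlations from slide L11.32; N = 16 is an assumption.
canon_corrs = [0.862, 0.449, 0.386, 0.088]
results = bartlett_sequence(canon_corrs, n=16, p=4, q=4)
for j, v, df in results:
    print(f"correlations {j + 1} through 4: V = {v:.3f}, df = {df}")
```

Each V_j is then compared with a chi-square distribution on df_j degrees of freedom, for example with `scipy.stats.chi2.sf(v, df)` if SciPy is available. Testing stops at the first non-significant V_j.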

17 L11.17 Caveats/assumptions: tests of significance
- Tests of significance assume that the observations have a multivariate normal distribution.
- Tests of significance can be very misleading because the jth canonical correlation in the population may not appear as the jth canonical correlation in the sample, owing to sampling error.
- So be careful, especially if the sample is small!

18 L11.18 Interpreting canonical variates I
- Procedure: examine the standardized coefficients of the canonical variates.
- Inference: variables with large (in absolute value) coefficients are most important.
- Example annotation (coefficient table not preserved in transcript): $U_1$ is mainly a contrast between $X_3$ and $X_4$ on the one hand, and $X_2$ on the other.

19 L11.19 Interpreting canonical variates II
- Procedure: examine the correlations of the original variables with the canonical variates (canonical factor loadings).
- Inference: variables with large (in absolute value) correlations are most important to a particular canonical variate.
Canonical variate
Variable      U1      U2
X1         -0.92    0.33
X2         -0.77   -0.52
X3          0.90   -0.20
X4          0.92   -0.05
- Example annotation: $X_4$ is not associated with $U_2$.

20 L11.20 Judging reliability of canonical correlates
- Make sure that the ratio of sample size to total number of variables, N/(p + q), is large (> 20 for the first correlate, > 40 for the first two correlates).
- If this condition does not hold, both the canonical variates and the canonical correlations are unreliable.
- In such cases, try to increase N/(p + q) by reducing the number of variables: (1) do a preliminary PCA and use the components as variables; (2) select a small subset of the X and Y variables; (3) use another technique for variable set reduction (e.g., canonical ridge regression).
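The rule of thumb above is easy to encode. The sketch below is only an illustration: the function name `reliable_correlates` and the sample sizes in the usage lines are hypothetical, and the thresholds are the ones quoted on the slide.

```python
def reliable_correlates(n, p, q):
    """Number of leading canonical correlates the rule of thumb treats as reliable.

    Uses the slide's thresholds on N / (p + q): > 20 for the first correlate,
    > 40 for the first two. Returns 0, 1 or 2.
    """
    ratio = n / (p + q)
    if ratio > 40:
        return 2
    if ratio > 20:
        return 1
    return 0


# Hypothetical sample sizes with p = q = 4 variables per set.
for n in (16, 200, 400):
    print(n, n / 8, reliable_correlates(n, 4, 4))
```

With four variables in each set, the rule asks for more than 160 observations before even the first canonical correlate is treated as reliable.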

21 L11.21 Canonical variate rotation
- As with PCA, DF, etc., interpretation of canonical variates may be easier if they are rotated…
- … but remember that under rotation, the maximization property may be lost, i.e., the correlation between the first canonical variates need not be the largest possible.
- Remember also that here, pairs of variables are being rotated simultaneously, so that a rotation which assists in the interpretation of $U_1$ may make interpretation of $V_1$ more difficult.

22 L11.22 Stewart-Love redundancy index
- In MR, the multiple $R^2$ gives the proportion of variance in Y “accounted for” (“extracted”) by the set of independent variables {X}.
- But a squared canonical correlation $C_i^2$ is a measure of the variance shared between $U_i$ and $V_i$ only.
- Therefore, a high canonical correlation does not guarantee that the canonical variates “extract” much variance from their associated sets (e.g., $U_1$ may not extract much variance from {Y}).

23 L11.23 Canonical variate variance “extraction”
- The amount of variance in set {Y} extracted by canonical variate $U_i$ is the average of the squared canonical variate – y variable correlations [equation not preserved in transcript].
- The redundancy in set {Y}, given canonical variates $U_1$ to $U_i$, is [equation not preserved in transcript].
- The total (S-L) redundancy is [equation not preserved in transcript].
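The three equations on this slide are missing from the transcript. One standard way of writing the Stewart-Love quantities is sketched below as an assumption. The symbols $VE_i$ and $Rd$ are labels introduced here. The loadings are taken to be the correlations between each y variable and its own-set variate $V_i$, a reading under which the formulas reproduce the redundancies reported on slide L11.35.

```latex
% Variance in {Y} extracted by the ith canonical variate (mean squared loading)
VE_i = \frac{1}{q} \sum_{j=1}^{q} r^2\left(Y_j, V_i\right)

% Redundancy in {Y} given the first i canonical variates of the X set
Rd\left(Y \mid U_1, \ldots, U_i\right) = \sum_{k=1}^{i} C_k^2 \, VE_k

% Total Stewart-Love redundancy
Rd_{\text{total}} = \sum_{k=1}^{r} C_k^2 \, VE_k
```

Because $r(Y_j, U_i) = C_i \, r(Y_j, V_i)$, each term $C_i^2 \, VE_i$ equals the mean squared correlation between the y variables and $U_i$, so each term can be read as the variance in {Y} that $U_i$ accounts for.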

24 L11.24 Redundancy indices: summary
- A high redundancy index indicates that the corresponding set of canonical variates “explains” a large proportion of the variance in set {Y}.
- In principle, redundancy can also be calculated for the canonical variates $V_i$, i.e., how much variance in set {X} is explained by $V_i$.

25 L11.25 Example: Pgi frequencies in California Euphydryas editha colonies in relation to environmental factors.
[Data table not preserved in transcript; column groups: Dependent set, Independent set]

26 L11.26 Within dependent variable set correlations
Within basic set y correlations
             PGI_80  PGI_100  PGI_116  PGI_4060
PGI_80        1.000
PGI_100      -0.823    1.000
PGI_116      -0.127   -0.264    1.000
PGI_4060      0.638   -0.561   -0.584     1.000

27 L11.27 Within independent variable set correlations
             ALTITUDE  ANN_PRECIP  MAXTEMP  MINTEMP
ALTITUDE        1.000
ANN_PRECIP      0.567       1.000
MAXTEMP        -0.828      -0.479    1.000
MINTEMP        -0.936      -0.705    0.719    1.000

28 L11.28 Between dependent and independent variable set correlations
             PGI_80  PGI_100  PGI_116  PGI_4060
ALTITUDE     -0.573    0.727   -0.458    -0.201
ANN_PRECIP   -0.550    0.699   -0.138    -0.468
MAXTEMP       0.536   -0.717    0.438     0.224
MINTEMP       0.593   -0.759    0.412     0.246

29 L11.29 Partial regression coefficients, individual Ys on X
Betas predicting basic y (col) from basic x (row) variables
             PGI_80  PGI_100  PGI_116  PGI_4060
ALTITUDE     -0.089   -0.158   -0.042    -0.090
ANN_PRECIP   -0.290    0.321    0.280    -0.609
MAXTEMP       0.214   -0.424    0.264     0.105
MINTEMP       0.151   -0.376    0.381    -0.343

30 L11.30 Partial regression coefficient standard errors, individual Ys on X
Standard errors of betas
              PGI_80  PGI_100  PGI_116  PGI_4060
ALTITUDE       0.937    0.672    1.059     1.066
ANN_PRECIP     0.361    0.259    0.407     0.410
ANN_MAXTEMP    0.442    0.317    0.499     0.503
ANN_MINTEMP    0.879    0.631    0.993     1.001

31 L11.31 Partial regression coefficient t-statistics, individual Ys on X
T-statistics for betas
              PGI_80  PGI_100  PGI_116  PGI_4060
ALTITUDE      -0.095   -0.236   -0.039    -0.084
ANN_PRECIP    -0.804    1.240    0.688    -1.483
ANN_MAXTEMP    0.485   -1.339    0.529     0.209
ANN_MINTEMP    0.172   -0.596    0.383    -0.343
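Each t-statistic on this slide is the corresponding beta from slide L11.29 divided by its standard error from slide L11.30. The short check below recomputes them from the rounded values printed on those slides, so small differences in the third decimal are expected. The variable names are illustrative only.

```python
# Values copied from slides L11.29 (betas) and L11.30 (standard errors).
# Columns: PGI_80, PGI_100, PGI_116, PGI_4060.
betas = {
    "ALTITUDE":   [-0.089, -0.158, -0.042, -0.090],
    "ANN_PRECIP": [-0.290,  0.321,  0.280, -0.609],
    "MAXTEMP":    [ 0.214, -0.424,  0.264,  0.105],
    "MINTEMP":    [ 0.151, -0.376,  0.381, -0.343],
}
std_errors = {
    "ALTITUDE":   [0.937, 0.672, 1.059, 1.066],
    "ANN_PRECIP": [0.361, 0.259, 0.407, 0.410],
    "MAXTEMP":    [0.442, 0.317, 0.499, 0.503],
    "MINTEMP":    [0.879, 0.631, 0.993, 1.001],
}

# t = beta / standard error, element by element.
t_stats = {
    name: [b / se for b, se in zip(betas[name], std_errors[name])]
    for name in betas
}

for name, row in t_stats.items():
    print(name, [round(t, 3) for t in row])

largest = max(abs(t) for row in t_stats.values() for t in row)
print("largest |t| =", round(largest, 3))
```

No individual coefficient has |t| above about 1.5, even though the first canonical correlation is 0.862. This is the kind of contrast between the regression and canonical views that the lecture draws.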

32 L11.32 Canonical correlations, Bartlett’s V
Canonical correlations
      1      2      3      4
  0.862  0.449  0.386  0.088
Bartlett test of residual correlations
Correlations   Chi-square statistic   df    prob
1 through 4                  18.399   16   0.301
2 through 4                   4.140    9   0.902

33 L11.33 Canonical variates (V)
Canonical coefficients for dependent (y) set
                  1       2       3       4
PGI_80        0.422   2.241   1.320   1.412
PGI_100      -0.089   3.808   3.790   0.511
PGI_116       0.825   2.818   2.778  -0.631
PGI_4060      0.548   1.727   3.503  -0.659

34 L11.34 Canonical variates (U)
Canonical coefficients for independent (x) set
                  1       2       3       4
ALTITUDE     -0.124   2.394  -2.979   1.374
ANN_PRECIP   -0.293  -0.692  -1.350   0.242
ANN_MAXTEMP   0.468   0.468  -0.580   1.702
ANN_MINTEMP   0.260   1.362  -3.551  -0.083

35 L11.35 Canonical variate – y variable correlations and redundancies
Canonical loadings (y variable by factor correlations)
                  1       2       3       4
PGI_80        0.740  -0.151   0.081   0.651
PGI_100      -0.961   0.251   0.005  -0.116
PGI_116       0.475   0.520  -0.436  -0.560
PGI_4060      0.384  -0.627   0.595   0.324
Canonical redundancies for dependent set
                  1       2       3       4
              0.342   0.038   0.020   0.002
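The redundancies on this slide can be recovered from the loadings above and the canonical correlations on slide L11.32. The sketch assumes each redundancy is the mean squared loading for that variate multiplied by the squared canonical correlation, a form that matches the reported values to within rounding. The function name `redundancies` is hypothetical.

```python
# Loadings from slide L11.35 (rows: y variables, columns: canonical variates 1-4).
loadings = {
    "PGI_80":   [ 0.740, -0.151,  0.081,  0.651],
    "PGI_100":  [-0.961,  0.251,  0.005, -0.116],
    "PGI_116":  [ 0.475,  0.520, -0.436, -0.560],
    "PGI_4060": [ 0.384, -0.627,  0.595,  0.324],
}
# Canonical correlations from slide L11.32.
canon_corrs = [0.862, 0.449, 0.386, 0.088]


def redundancies(loadings, canon_corrs):
    """Redundancy per canonical variate: mean squared loading times C_i squared."""
    rows = list(loadings.values())
    q = len(rows)
    result = []
    for i, c in enumerate(canon_corrs):
        extracted = sum(row[i] ** 2 for row in rows) / q  # variance extracted
        result.append(extracted * c ** 2)
    return result


rd = redundancies(loadings, canon_corrs)
print([round(x, 3) for x in rd])
print("total redundancy:", round(sum(rd), 3))
```

On this reading, the first canonical variate accounts for about a third of the variance in the Pgi frequencies, and the remaining three add very little.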

