Université d’Ottawa / University of Ottawa 2001 Bio 8100s Applied Multivariate Biostatistics L11.1 Lecture 11: Canonical correlation analysis (CANCOR)

Presentation transcript:

L11.1 Lecture 11: Canonical correlation analysis (CANCOR)
- Rationale and use of canonical correlation analysis
- Canonical correlation versus multiple regression
- Estimating canonical variates and correlations
- Significance tests
- Interpretation of canonical variates
- Rotation of canonical variates
- Redundancy indices
- Example: Pgi frequencies in California Euphydryas editha colonies

L11.2 Canonical correlation
- A method of breaking down associations between two sets of variables, a “predictor” (independent variable) set and a “dependent” variable set.

L11.3 Example: Pgi frequencies in California Euphydryas editha colonies in relation to environmental factors
[Table with two column groups, "Dependent set" and "Predictor set"; contents not recovered.]

L11.4 Canonical correlation (CANCOR) versus multiple regression (MR)
- For MR, our interest is in predicting the effect of a particular independent variable on a particular dependent variable.
- For CANCOR, our interest is in determining the number and nature of independent relationships between the independent and dependent variable sets.
- This is accomplished through the use of pairs of linear combinations of variables (canonical variates), each pair being uncorrelated with the other pairs.

L11.5 The data
- Consists of a set of p independent variables X1, X2, …, Xp (the independent variable set) and q dependent variables Y1, Y2, …, Yq, measured on a sample of N objects, from which we can derive a (p + q) × (p + q) correlation matrix.
- This matrix is made up of three kinds of blocks: within-set (X) correlations, within-set (Y) correlations, and between-set (X,Y) correlations.
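The block structure of the (p + q) × (p + q) correlation matrix can be sketched in Python with NumPy. This is only an illustration: the function name `partition_correlation`, the block names A (within-set X) and C (between-set), and the random data are assumptions introduced here; B follows the lecture's later label for the within-set (Y) matrix.

```python
import numpy as np

def partition_correlation(X, Y):
    """Split the joint correlation matrix of X (N x p) and Y (N x q) into blocks.

    Returns A (p x p, within-set X), B (q x q, within-set Y)
    and C (p x q, between-set X,Y).
    """
    p = X.shape[1]
    # Columns are variables, rows are objects, hence rowvar=False.
    R = np.corrcoef(np.hstack([X, Y]), rowvar=False)
    A = R[:p, :p]
    B = R[p:, p:]
    C = R[:p, p:]
    return A, B, C

# Hypothetical data: N = 50 objects, p = 3 predictors, q = 2 dependent variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = rng.normal(size=(50, 2))
A, B, C = partition_correlation(X, Y)
print(A.shape, B.shape, C.shape)  # (3, 3) (2, 2) (3, 2)
```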

L11.6 What are canonical variates anyway?
- Canonical variates are the eigenvectors of the corresponding correlation matrix, and thus represent orthogonal line segments that “span” the within-set variability in either X or Y.
[Figure: axes labelled X1, X2 and Y1, Y2.]

L11.7 Estimating canonical variates
- The first pair of canonical variates is obtained by finding the coefficients of the linear functions U1 (of the X variables) and V1 (of the Y variables) that maximize the correlation between U1 and V1.
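The slide's equation did not survive in this transcript. A plausible form, assuming the coefficient notation a_ij and b_ij that the lecture uses later when standardizing the variates, is:

```latex
U_1 = a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p, \qquad
V_1 = b_{11}Y_1 + b_{12}Y_2 + \cdots + b_{1q}Y_q,
\qquad \text{with } a_{1j},\, b_{1j} \text{ chosen to maximize } r(U_1, V_1).
```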

L11.8 Estimating canonical variates (cont’d)
- The second pair of canonical variates is obtained by finding the coefficients of the linear functions that maximize the correlation between U2 and V2, subject to the constraint that U2 and V2 are uncorrelated with U1 and V1.

L11.9 Estimating canonical variates (cont’d)
- The third pair of canonical variates is obtained by finding the coefficients of the linear functions that maximize the correlation between U3 and V3, subject to the constraint that U3 and V3 are uncorrelated with U1, V1, U2 and V2.
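The constraint equations are missing from the transcript. In the usual formulation of canonical correlation analysis, which is assumed here, the constraint on the ith pair can be written for every earlier pair j as:

```latex
r(U_i, U_j) = r(V_i, V_j) = r(U_i, V_j) = r(V_i, U_j) = 0
\quad \text{for all } j < i .
```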

L11.10 Calculating canonical variates and correlations
- From the within- and between-set correlation matrices, we solve an eigenvalue problem, which has r solutions.
- The eigenvalues λj are the squares of the correlations between the paired canonical variates Uj and Vj, i.e., the squares of the canonical correlations.
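The eigenvalue equation itself was lost from the transcript. A common textbook form is (B⁻¹ C′ A⁻¹ C − λI) b = 0, and the sketch below assumes it. Here A and C are names introduced for the within-set (X) and between-set blocks, B is the within-set (Y) block, and `canonical_correlations` is a hypothetical helper, not code from the lecture.

```python
import numpy as np

def canonical_correlations(A, B, C):
    """Solve (B^-1 C' A^-1 C - lambda I) b = 0 (assumed form of the problem).

    A: p x p within-set (X) correlations
    B: q x q within-set (Y) correlations
    C: p x q between-set (X,Y) correlations
    Returns (canonical correlations, a, b); the columns of a and b hold the
    coefficient vectors for U_i and V_i, on an arbitrary scale.
    """
    r = min(A.shape[0], B.shape[0])
    # solve() avoids forming explicit inverses of A and B.
    M = np.linalg.solve(B, C.T) @ np.linalg.solve(A, C)  # q x q
    eigvals, eigvecs = np.linalg.eig(M)
    # M is not symmetric; its eigenvalues are real in theory, so keep real parts.
    order = np.argsort(eigvals.real)[::-1][:r]
    lam = np.clip(eigvals.real[order], 0.0, 1.0)
    b = eigvecs.real[:, order]
    a = np.linalg.solve(A, C @ b)  # a_i = A^-1 C b_i
    return np.sqrt(lam), a, b

# Toy example: uncorrelated variables within each set, so the canonical
# correlations are simply the between-set correlations (about 0.8 and 0.3).
A = np.eye(2)
B = np.eye(2)
C = np.diag([0.8, 0.3])
cc, a, b = canonical_correlations(A, B, C)
print(cc)
```

Sorting the eigenvalues in decreasing order matches the convention that the first canonical correlation is the largest.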

L11.11 Calculating canonical variates and correlations (cont’d)
- The coefficients for the Y canonical variates V1, V2, etc. are the elements of the corresponding eigenvectors b1, b2, etc. of the eigenvalue problem, in which B denotes the within-set (Y) correlation matrix.
- The coefficients of the ith canonical variate for the X variables are then given by the elements of the corresponding vector ai.
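The expression for the X coefficients is missing from the transcript. Under the common textbook notation assumed here (A for the within-set X correlation matrix and C for the p × q between-set matrix, neither of which is named in the surviving text), it is usually written as:

```latex
\mathbf{a}_i = \mathbf{A}^{-1}\,\mathbf{C}\,\mathbf{b}_i
```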

L11.12 Calculating canonical variates and correlations (cont’d)
- The ith pair of canonical variates is then given by Ui = ai′X and Vi = bi′Y, where X and Y are vectors of standardized (mean 0, standard deviation 1) values.
- In this manner, we can generate canonical variate scores for each observation in the data set.

L11.13 Standardizing canonical variates
- The variance of U and V will be influenced by the scaling adopted for the eigenvectors a and b, but the canonical correlations r(U,V) will be unaffected.
- To generate standardized canonical variates, calculate the standard deviation of Ui (Vi) and divide the aij (bij) values by the corresponding standard deviation.
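A minimal sketch of the scoring and standardization steps follows, assuming the textbook eigenproblem B⁻¹C′A⁻¹C b = λb and a = A⁻¹Cb, which may not match the lost slide equations exactly. The function name and the simulated data are hypothetical.

```python
import numpy as np

def standardized_canonical_variates(X, Y):
    """Return standardized scores and coefficients (U, V, a_std, b_std)."""
    # Standardize each variable to mean 0, standard deviation 1.
    Zx = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Zy = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    p, q = Zx.shape[1], Zy.shape[1]
    R = np.corrcoef(np.hstack([Zx, Zy]), rowvar=False)
    A, B, C = R[:p, :p], R[p:, p:], R[:p, p:]
    # Assumed eigenproblem: B^-1 C' A^-1 C b = lambda b.
    lam, b = np.linalg.eig(np.linalg.solve(B, C.T) @ np.linalg.solve(A, C))
    order = np.argsort(lam.real)[::-1][:min(p, q)]
    b = b.real[:, order]
    a = np.linalg.solve(A, C @ b)
    # Raw scores: their variances depend on the eigenvector scaling.
    U, V = Zx @ a, Zy @ b
    # Divide the coefficients by the standard deviation of each variate.
    a_std = a / U.std(axis=0, ddof=1)
    b_std = b / V.std(axis=0, ddof=1)
    return Zx @ a_std, Zy @ b_std, a_std, b_std

# Hypothetical data in which Y depends on the first two X variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
Y = X[:, :2] + 0.5 * rng.normal(size=(40, 2))
U_std, V_std, a_std, b_std = standardized_canonical_variates(X, Y)
print(U_std.std(axis=0, ddof=1))  # each standardized variate has unit SD
```

Rescaling the coefficients changes the variance of the scores but leaves r(U,V) unchanged, because a correlation is invariant to multiplying either variable by a positive constant.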

L11.14 The end result
- A set of r = min(p,q) pairs of canonical variates, one member of each pair for the dependent variable set {V}, the other for the independent variable set {U}.
- A set of r canonical correlations C = r(U,V), each representing the correlation between a pair of canonical variates.
[Figure: panels labelled "High first canonical correlation" and "Low second canonical correlation"; axis labels U1, V1, U2, V2.]

L11.15 Significance testing
- Question: which canonical correlates are statistically “significant”?
- For testing the significance of all r = min(p, q) canonical correlates based on p + q variables, calculate Bartlett’s V, a function of the eigenvalues λi associated with the canonical variates, and compare it to the χ² distribution with pq degrees of freedom.
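The formula for V is missing from the transcript. The usual form of Bartlett's statistic, assumed here, with N the sample size and λ_i the eigenvalue (squared canonical correlation) of the ith canonical variate, is:

```latex
V = -\left[\, N - \tfrac{1}{2}\,(p + q + 3) \,\right] \sum_{i=1}^{r} \ln\!\left(1 - \lambda_i\right),
\qquad \text{compared with } \chi^2 \text{ on } pq \text{ degrees of freedom.}
```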

L11.16 Significance testing (cont’d)
- Each CV is tested in a hierarchical fashion by first testing the significance of all CVs combined.
- If all CVs combined are not significant, then no CV is significant.
- If all CVs combined are significant, then remove the first CV, recalculate V (= V1) and test.
- Continue until the residual Vj is no longer significant at df = (p – j)(q – j).
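The hierarchical procedure can be sketched as follows, assuming the usual form of Bartlett's statistic with multiplier N − (p + q + 3)/2 (the slide's own formula is not in the transcript). The function name and the example eigenvalues are hypothetical; the eigenvalues are taken to be sorted in decreasing order.

```python
import math

def bartlett_sequence(eigenvalues, n, p, q):
    """Residual Bartlett statistics V_j after removing the first j CVs.

    eigenvalues: squared canonical correlations, largest first.
    Returns a list of (j, V_j, df_j), with df_j = (p - j)(q - j).
    """
    factor = n - 0.5 * (p + q + 3)
    results = []
    for j in range(len(eigenvalues)):
        # V_j uses only the eigenvalues that remain after dropping the first j.
        v_j = -factor * sum(math.log(1.0 - lam) for lam in eigenvalues[j:])
        results.append((j, v_j, (p - j) * (q - j)))
    return results

# Hypothetical example: two canonical correlations (0.8 and 0.3), N = 50.
for j, v, df in bartlett_sequence([0.64, 0.09], n=50, p=2, q=2):
    print(f"CVs {j + 1} onward: V = {v:.2f}, df = {df}")
```

Each V_j would then be referred to a χ² distribution with df_j degrees of freedom; a routine such as `scipy.stats.chi2.sf(v, df)` could supply the p-value if SciPy is available.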

L11.17 Caveats/assumptions: tests of significance
- Tests of significance assume that observations have a multivariate normal distribution.
- Tests of significance can be very misleading because the jth canonical correlation in the population may not appear as the jth canonical correlation in the sample due to sampling errors.
- So be careful, especially if the sample is small!

L11.18 Interpreting canonical variates I
- Procedure: Examine the standardized coefficients of the canonical variates.
- Inference: variables with large (in absolute value) coefficients are most important.
- Example (coefficients not recovered): U1 is mainly a contrast between X3 and X4 on the one hand, and X2 on the other.

L11.19 Interpreting canonical variates II
- Procedure: Examine the correlations of the original variables with the canonical variates (canonical factor loadings).
- Inference: variables with large (in absolute value) correlations are most important to a particular canonical variate.
- Example: X4 is not associated with U2.
[Table: correlations of variables X1 to X4 with canonical variates U1 and U2; values not recovered.]
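Canonical loadings are ordinary correlations between the original variables and the canonical variate scores, so they can be computed as below. This is a sketch under stated assumptions: `canonical_loadings` is a hypothetical helper, and the variate U1 is constructed by hand as a contrast of X3 and X4 against X2 purely for illustration, not estimated from real data.

```python
import numpy as np

def canonical_loadings(variables, variates):
    """Correlations of each original variable (rows) with each variate (columns)."""
    k = variables.shape[1]
    R = np.corrcoef(np.hstack([variables, variates]), rowvar=False)
    # Keep only the variable-by-variate block of the joint correlation matrix.
    return R[:k, k:]

# Hypothetical scores for four X variables on 200 objects.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Hand-built variate: X3 and X4 on the one hand, X2 on the other.
U = (X[:, 2] + X[:, 3] - X[:, 1]).reshape(-1, 1)
loadings = canonical_loadings(X, U)
print(np.round(loadings, 2))
```

In this toy case the loadings of X3 and X4 are positive and that of X2 is negative, which is the pattern one would read as a contrast.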

L11.20 Judging reliability of canonical correlates
- Make sure that the ratio of sample size to total number of variables, N/(p + q), is large (> 20 for the first correlate, > 40 for the first two correlates).
- If this condition does not hold, both canonical variates and correlations are unreliable.
- In such cases, try to increase N/(p + q) by reducing the number of variables: (1) do a preliminary PCA and use the components as variables; (2) select a small subset of X and Y variables; (3) use another technique for variable set reduction (e.g., canonical ridge regression).

L11.21 Canonical variate rotation
- As with PCA, DF etc., interpretation of canonical variates may be easier if they are rotated.
- But remember that under rotation, the maximization property may be lost, i.e., the correlation between the first canonical variates need not be the largest possible.
- Remember also that here, pairs of variables are being rotated simultaneously, so that a rotation which assists in the interpretation of U1 may make interpretation of V1 more difficult.

L11.22 Stewart-Love redundancy index
- In MR, multiple R² gives the proportion of variance in Y “accounted for” (“extracted”) by the set of independent variables {X}.
- But a squared canonical correlation Ci² is a measure of the variance shared between Ui and Vi.
- Therefore, we cannot expect canonical variates to “extract” much variance from their associated sets (e.g., U1 may not extract much variance from {Y}).

L11.23 Canonical variate variance “extraction”
- The amount of variance in set {Y} extracted by canonical variate Ui is the average of the squared correlations between the canonical variate and the y variables: [equation not recovered]
- The redundancy in set {Y}, given canonical variates U1 to Ui, is: [equation not recovered]
- The total (S-L) redundancy is: [equation not recovered]
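The slide's formulas are not in the transcript, so the sketch below uses one standard formulation of the Stewart-Love index as an assumption: the variance extracted by the ith variate is the mean squared loading of the q y variables, the redundancy contributed by that variate is this quantity times the squared canonical correlation Ci², and the redundancy given variates 1 to i is the running sum. The function name and the numbers are hypothetical.

```python
import numpy as np

def redundancy(y_loadings, canonical_corrs):
    """Stewart-Love style redundancy for the Y set (assumed formulation).

    y_loadings: q x r correlations of the y variables with their variates.
    canonical_corrs: the r canonical correlations C_i.
    Returns (variance extracted per variate, redundancy per variate,
             cumulative redundancy; the last entry is the total).
    """
    y_loadings = np.asarray(y_loadings, dtype=float)
    c2 = np.asarray(canonical_corrs, dtype=float) ** 2
    # Average squared loading over the q variables, one value per variate.
    extracted = (y_loadings ** 2).mean(axis=0)
    per_variate = extracted * c2
    return extracted, per_variate, np.cumsum(per_variate)

# Hypothetical loadings for q = 2 y variables on r = 2 canonical variates.
loadings = [[0.9, 0.1],
            [0.8, 0.6]]
extracted, per_variate, cumulative = redundancy(loadings, [0.8, 0.3])
print(extracted, per_variate, cumulative[-1])
```

With these made-up numbers the first variate accounts for almost all of the redundancy, because it combines high loadings with the larger canonical correlation.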

L11.24 Redundancy indices: summary
- A high redundancy indicates that the corresponding set of canonical variates “explains” a large proportion of the variance in set {Y}.
- In principle, redundancy can also be calculated for canonical variates Vi, i.e., how much variance in set {X} is explained by Vi.

L11.25 Example: Pgi frequencies in California Euphydryas editha colonies in relation to environmental factors
[Table with two column groups, "Dependent set" and "Independent set"; contents not recovered.]

L11.26 Within dependent variable set correlations
[Table: within basic set y correlations among PGI_80, PGI_100, PGI_116 and PGI_4060; values not recovered.]

L11.27 Within independent variable set correlations
[Table: correlations among ALTITUDE, ANN_PRECIP, MAXTEMP and MINTEMP; values not recovered.]

L11.28 Between dependent and independent variable set correlations
[Table: columns PGI_80, PGI_100, PGI_116, PGI_4060; rows ALTITUDE, ANN_PRECIP, MAXTEMP, MINTEMP; values not recovered.]

L11.29 Partial regression coefficients, individual Ys on X
[Table: betas predicting basic y (column) from basic x (row) variables; columns PGI_80, PGI_100, PGI_116, PGI_4060; rows ALTITUDE, ANN_PRECIP, MAXTEMP, MINTEMP; values not recovered.]

L11.30 Partial regression coefficient standard errors, individual Ys on X
[Table: standard errors of betas; columns PGI_80, PGI_100, PGI_116, PGI_4060; rows ALTITUDE, ANN_PRECIP, ANN_MAXTEMP, ANN_MINTEMP; values not recovered.]

L11.31 Partial regression coefficient t-statistics, individual Ys on X
[Table: t-statistics for betas; columns PGI_80, PGI_100, PGI_116, PGI_4060; rows ALTITUDE, ANN_PRECIP, ANN_MAXTEMP, ANN_MINTEMP; values not recovered.]

L11.32 Canonical correlations, Bartlett’s V
[Canonical correlations: values not recovered.]
Bartlett test of residual correlations:
- Correlations 1 through 4: chi-square statistic = [not recovered], df = 16, prob = [not recovered]
- Correlations 2 through 4: chi-square statistic = [not recovered], df = 9, prob = 0.902

L11.33 Canonical variates (V)
[Table: canonical coefficients for the dependent (y) set, one row per PGI variable; variable suffixes and values not recovered.]

L11.34 Canonical variates (U)
[Table: canonical coefficients for the independent (x) set; rows ALTITUDE, ANN_PRECIP, ANN_MAXTEMP, ANN_MINTEMP; values not recovered.]

L11.35 Canonical variate-y variable correlations and redundancies
[Table: canonical loadings (y variable by factor correlations), one row per PGI variable; variable suffixes and values not recovered.]
[Table: canonical redundancies for dependent set; values not recovered.]