# Covariance and correlation


Covariance and correlation
Dr David Field

Sir Francis Galton created what is now called the correlation coefficient; his student, Karl Pearson, later gave his name to the "Pearson correlation". The image shows various scatter plots between X and Y variables and the associated correlation values. Note that the slope (gradient) of the relationship does not influence the size of the correlation; slopes are assessed with linear regression. The central scatter plot is a special case where one of the variables has zero variance, so there can be no covariance and the correlation is 0.

Summary

- Correlation is covered in Chapter 6 of Andy Field, *Discovering Statistics Using SPSS*, 3rd edition
- Assessing the co-variation of two variables
- Scatter plots
- Calculating the covariance and the "Pearson product moment correlation" between two variables
- Statistical significance of correlations
- Interpretation of correlations
- Limitations and non-parametric alternatives

Introduction

- Sometimes it is not possible for researchers to experimentally manipulate an IV using random allocation to conditions and measure a dependent variable, e.g. the relationship between income and self esteem
- But you can still measure two or more variables and ask what relationship, if any, they have
- One way to assess this relationship is the "Pearson product moment correlation"
- Another example would be the relationship between alcohol consumption and exam performance in students: it would be unethical to manipulate alcohol intake

For example, if you are interested in how income influences self esteem (confidence), you can't easily manipulate participants' income. You just have to make observations of income and self esteem as they exist "in the field" rather than in the lab. What you will do is collect a sample, measure the variation in income, and see whether differences from the mean income are matched by similar differences from the mean of self esteem (or directly opposite differences).

| Units alcohol per week | Exam % |
|---|---|
| 13 | 63 |
| 10 | 60 |
| 24 | 55 |
| 3 | 70 |
| 5 | 80 |
| 35 | 41 |
| 20 | 50 |
| 14 | 58 |
| 17 | 61 |
| 19 | 63 |

For each participant we record units of alcohol consumed per week and exam %. Don't forget that this is not an experiment, and any observed dependence of the two variables on each other could be due to both variables being caused by a third variable (e.g. stress). Before performing any statistical analysis, the first step is to visualise the relationship between the two variables using a scatter plot.

4.8.1 Scatterplot: alcohol versus exam %

In a scatter plot, each pair of data points results in a single point on the graph: you are plotting the value of some variable Y against the value of another variable X. In this imaginary example, as units of alcohol increase, exam mark decreases. If the relationship were perfect, all the points would lie on a single straight line. As it is, there is some scatter around the best fitting straight line, indicating that other unmeasured factors are influencing exam performance as well as alcohol consumption. The correlation coefficient quantifies the extent to which the points on the graph are spread around the best fitting straight line, compared to lying exactly on the line. Note that correlation does not tell you what that best fitting straight line is (linear regression does that).

Calculating the covariance of two variables
6.3.1 Covariance is a measure of how much two variables change together. It presumes that for each participant in the sample two variables have been measured. If the two variables tend to vary together (that is, when one of them is above its mean, the other tends to be above its mean too), then the covariance between them will be positive. If, when one of them is above its mean value, the other tends to be below its mean value, then the covariance will be negative. Covariance is similar in conception to the variance, so first, let's revisit variance. Calculating the correlation between two variables involves first calculating the covariance, and so understanding covariance is the key to understanding correlation.

Variance of one variable
To calculate variance:

1. Subtract the mean from each score
2. Square the results
3. Add up the squared scores
4. Divide by the number of scores - 1

Squaring makes sure that the variance will not be negative, and it emphasizes the effect of very large and very small scores that are far from the mean. If all the scores are close to the mean, the variable has restricted variance and it is unlikely that any other variable will co-vary with it.
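As a sketch, here are those steps in Python (the alcohol data from this lecture's worked example are assumed):

```python
# Sample variance, following the four steps above.
# Data: the "units of alcohol per week" column from the worked example.
alcohol = [13, 10, 24, 3, 5, 35, 20, 14, 17, 19]

mean = sum(alcohol) / len(alcohol)            # mean = 16.0
deviations = [x - mean for x in alcohol]      # step 1: subtract the mean
squared = [d ** 2 for d in deviations]        # step 2: square the results
variance = sum(squared) / (len(alcohol) - 1)  # steps 3-4: sum, divide by N - 1

print(round(variance, 2))  # 87.78
```

Taking the square root of this (9.37) gives the SD of alcohol units used later in the r calculation.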

Covariance of two variables, X and Y
For each pair of scores:

1. Subtract the mean of variable X from each score in X
2. Subtract the mean of variable Y from each score in Y
3. Multiply each of the pairs of difference scores together
4. Sum the results
5. Divide by the number of scores - 1

Step 3 replaces squaring in the variance formula, but otherwise the two formulas are similar. The -1 is included because it has been shown that small samples tend to underestimate the underlying population covariance (as is also the case for variance). The -1 has negligible effect on the estimate when the sample is large, but a noticeable effect when the sample is small.
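The five steps can be sketched in Python, assuming the ten alcohol/exam pairs from this lecture's worked example:

```python
# Sample covariance, following the five steps above.
alcohol = [13, 10, 24, 3, 5, 35, 20, 14, 17, 19]
exam    = [63, 60, 55, 70, 80, 41, 50, 58, 61, 63]

mean_x = sum(alcohol) / len(alcohol)  # 16.0
mean_y = sum(exam) / len(exam)        # 60.1

# Steps 1-3: pair up the deviation scores and multiply them together
products = [(x - mean_x) * (y - mean_y) for x, y in zip(alcohol, exam)]

# Steps 4-5: sum the products and divide by N - 1
cov = sum(products) / (len(alcohol) - 1)

print(round(cov, 2))  # -87.33
```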

Multiply difference scores
| alcohol | exam % | alcohol - mean (16) | exam - mean (60.1) | multiply difference scores |
|---|---|---|---|---|
| 13 | 63 | -3 | 2.9 | -8.7 |
| 10 | 60 | -6 | -0.1 | 0.6 |
| 24 | 55 | 8 | -5.1 | -40.8 |
| 3 | 70 | -13 | 9.9 | -128.7 |
| 5 | 80 | -11 | 19.9 | -218.9 |
| 35 | 41 | 19 | -19.1 | -362.9 |
| 20 | 50 | 4 | -10.1 | -40.4 |
| 14 | 58 | -2 | -2.1 | 4.2 |
| 17 | 61 | 1 | 0.9 | 0.9 |
| 19 | 63 | 3 | 2.9 | 8.7 |

Sum the right hand column and divide by the number of participants - 1 to estimate the population covariance: -786 / 9 = -87.3.

If a participant has scores above the mean on both variables, this makes a positive contribution to the covariance. Equally, if a participant falls below the mean on both variables, this also results in a positive contribution (- × - = +). On the other hand, if the participant is above the mean on one variable but below the mean on the other, this produces a negative contribution to the covariance. Considering the sample as a whole, if roughly half of the participants make a negative contribution to the covariance and roughly half make a positive contribution, these contributions will tend to cancel each other out and the covariance will be small or close to zero, indicating that the two variables are not related. A positive value of covariance indicates that as one variable increases, so does the other. A negative value indicates that as one variable increases, the other goes down, as in our example of alcohol consumption going up and exam performance coming down. In terms of the example: row 4 is a person who drinks less than average and scores higher than average in the exam; row 6 is a person with above-average alcohol units and a below-average exam score. Both of these people make a negative contribution to the covariance.
Green rectangles (on the slide): both examples of participants making positive contributions to the covariance. Blue rectangles: both examples of participants making negative contributions to the covariance. Red oval: three participants that dominate the covariance statistic, because multiplying exaggerates the bigger differences, just as squaring does in the calculation of the variance.

Covariance formula

cov(x,y) = Σ (X − X̄)(Y − Ȳ) / (N − 1)

The bar on top refers to the mean of the variable. Σ (sigma) means "the sum of". X and Y refer to each of the scores in turn.

Under what circumstances would cov(x,y) equal approximately zero? Answer: if roughly half of the participants make a negative contribution to the covariance and roughly half make a positive contribution of about the same magnitude, these contributions will tend to cancel each other out and the covariance will be small or close to zero, indicating that the two variables are not related. Cov will also be about zero if one or both variables have very little variance.

Converting covariance to correlation
6.3.2 Knowing that the covariance of two variables is positive is useful, as it indicates that as one increases, so does the other. But the actual value of covariance depends on the measurement units of the variables: if the exam scores had been given out of 45 instead of as percentages, the covariance with alcohol consumption would be -39.3 (that is, -87.3 × 0.45) instead of -87.3, but the real strength of the relationship would be the same. Because the covariance is dependent upon the measurement units used, it is hard to interpret unless we first standardize it. Conversion of covariance to a correlation coefficient is analogous to making the variance more useful by taking the square root of it to convert it into the SD, which is in the same units as the original DV.

Converting covariance to correlation
Ideally we'd like to be able to ask whether the covariation of alcohol consumption and exam scores is stronger or weaker than the covariation of alcohol consumption and hours studied. The standard deviation provides the answer, because it is a universal unit of measurement into which any other scale of measurement can be converted. Because the covariance uses the deviation scores of two variables, to standardize the covariance we need to make use of the SDs of both variables.

Pearson's r correlation coefficient

r = cov(x,y) / (SDx × SDy)

This means dividing by the total variation in both variables. The reason you multiply the SDs of the two variables together (rather than adding them) is that the cov(x,y) formula also involved multiplying the variables together (or rather, multiplying their differences from their means, see earlier).

What is the biggest value r could take? r can take a maximum value of 1. This is because the total shared variation represented by cov(x,y) can be as large as the total variation in the variables, but no bigger. If the shared variation is smaller than the total variation, the magnitude of r will be less than 1.

Pearson's r correlation coefficient
The result of standardisation is that r has a minimum of -1 and a maximum of 1:

- -1: perfect negative relationship
- -0.5: moderate negative relationship
- 0: no relationship
- 0.5: moderate positive relationship
- 1: perfect positive relationship

To achieve a correlation of 1 (or -1), the shared variation cov(x,y) has to be as big as the total variation in the data, represented by the two SDs multiplied together.

covariance of percentage exam score and alcohol units is -87.3
SD of exam scores is 10.58; SD of alcohol units per week is 9.37.

Pearson's r = -87.3 / (10.58 × 9.37) ≈ -87.3 / 99.20 ≈ -.88
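The whole calculation can be re-derived from the raw data in a few lines of Python (a sketch assuming the example data from earlier in the lecture; small rounding differences aside, it reproduces the values above):

```python
import math

# r = cov(x,y) / (SDx * SDy) for the alcohol/exam example data.
alcohol = [13, 10, 24, 3, 5, 35, 20, 14, 17, 19]
exam    = [63, 60, 55, 70, 80, 41, 50, 58, 61, 63]
n = len(alcohol)
mx, my = sum(alcohol) / n, sum(exam) / n

cov = sum((x - mx) * (y - my) for x, y in zip(alcohol, exam)) / (n - 1)
sd_x = math.sqrt(sum((x - mx) ** 2 for x in alcohol) / (n - 1))
sd_y = math.sqrt(sum((y - my) ** 2 for y in exam) / (n - 1))

r = cov / (sd_x * sd_y)
print(round(r, 2))  # -0.88
```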

Scatter plots and correlation values
This figure, taken from Wikipedia, shows a series of scatter plots with their correlation values. The vertical and horizontal axes of the graphs have been removed for visual clarity.

Scatter plots and correlation values
The red circles highlight correlations of 1 or -1, where all the points of the X versus Y plot lie on a single straight line. 1 and -1 are equally strong correlations; the sign indicates the direction of the correlation. Positive correlations are where the value of Y increases as X increases; for negative correlations the value of Y decreases as X increases. The green circles highlight two more correlations of 1, illustrating the fact that the slope of the best fitting line on the scatter plot does not influence the size of the correlation coefficient, and so the correlation coefficient gives no information about the slope of the line relating two variables. Slopes are important, but they are assessed through regression, not correlation. Finally, the black circle highlights a special case where the relationship appears to be perfect but the correlation is 0 rather than 1. This is because one of the variables has zero variation, and so it cannot covary with anything: as one variable varies, the other stays the same.

Scatter plots and correlation values
Red circles indicate that as the spread of the data points around the best fitting straight line increases, the size of the correlation coefficient reduces, until the situation described by the black circle is reached, where there is no best fitting straight line, just a cloud of points. In this case there is no relationship at all between the two variables (put another way, the relationship is random), giving a correlation coefficient of zero.

The scatter plot where the relationship between X and Y is random, and the correlation is zero, has a special status: it represents a null hypothesis distribution for the relationship between any two variables. The correlation coefficient between two variables is itself a descriptive statistic, analogous to the effect size of the difference between two sample means. As with the t test, we are able to calculate the probability that an observed set of data could occur by random sampling from the null distribution. This is the p value of the correlation, and we can use it to judge whether an observed correlation is statistically significant.

Statistical significance of correlations
6.3.3 SPSS reports a two-tailed p value for correlations: this is the probability of obtaining the data by random sampling from a population scatter plot with 0 correlation. If p is less than 0.05 you can reject the null hypothesis and declare the correlation to be statistically significant. If you predicted the direction of the correlation, the p value can be divided by 2 (one-tailed test). The p value is very dependent on sample size: if the sample size is large, then very small values of the correlation coefficient will easily reach significance. Only report correlations that reach significance, but beyond this you should place more emphasis on interpretation of the direction and size of the correlation coefficient itself. Examples of reporting correlation will be given in the workshop.
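Behind the p value is a t test with n − 2 degrees of freedom: t = r√(n − 2)/√(1 − r²). A minimal sketch for the example data, comparing |t| to the standard two-tailed critical value t(8) = 2.306 at alpha = .05 (SPSS reports the exact p value instead):

```python
import math

# Significance test of a correlation via the t statistic.
# Example data: r = -0.88 with n = 10 participants, so df = 8.
r, n = -0.88, 10
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(round(t, 2))     # -5.24
print(abs(t) > 2.306)  # True: the correlation is significant at p < .05
```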

The coefficient of determination (R²)

[Figure: Venn diagrams showing the proportion of variance shared between X and Y, for a 0 correlation, a weak correlation, and a strong (but not perfect) correlation.]

The correlation coefficient squared is known as the coefficient of determination or "r square". This quantity reflects the proportion of variance shared by the two variables. The null hypothesis scatter plot, where the relationship between the two variables is nothing more than random, can also be represented as a Venn diagram of the variance of the two variables and their proportion of overlap (no overlap in the 0 correlation case).

The coefficient of determination
To express quantitatively what is expressed visually by the Venn diagrams:

- Square the correlation coefficient (multiply it by itself)
- The result will always be a positive number
- It describes the proportion of variance that the two variables have in common
- It is also referred to as R²

[Figure: plot of shared variance (r²) against r, showing the rapid decline of the coefficient as the correlation reduces.]

r = 0.9: 81% shared variance; r = 0.5: 25% shared variance; r = 0.3: 9% shared variance. Try to keep this non-linear relationship in mind when you read about correlation coefficients in the literature. An r of 0.4 is not half as strong as an r of 0.8; it is actually 25% as strong. Doubling the correlation multiplies the strength of the relationship by 4, not 2! Psychologists are usually quite pleased when they obtain a correlation of 0.3 between theoretically important variables, but this is only 9% of shared variance. This is an indication that psychology is not terribly good at explaining the things it seeks to explain.
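The non-linear drop-off is easy to verify by squaring a few values of r:

```python
# Shared variance (r squared) falls off much faster than r itself.
shared = {r: r ** 2 for r in (0.9, 0.8, 0.5, 0.4, 0.3)}
for r, r2 in shared.items():
    print(f"r = {r}: {r2:.0%} shared variance")
# r = 0.9: 81% shared variance
# r = 0.8: 64% shared variance
# r = 0.5: 25% shared variance
# r = 0.4: 16% shared variance
# r = 0.3: 9% shared variance
```

Note that r = 0.4 gives 16% shared variance versus 64% for r = 0.8, i.e. a quarter of the strength.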

Correlation - limitations
Before running a correlation between any pair of variables, produce a scatter plot in SPSS. If there is a relationship between the two variables but it appears to be non-linear, then correlation is not an appropriate statistic. Non-linear relationships can be U shaped or n shaped, or like the graph on the previous slide. As an example of a non-linear relationship between two variables: my head of department when I was an undergraduate, Ian Howard, claimed that exam performance increased as the number of hours studied increased, but only up to a point; after that point, further increasing your workload actually reduced your exam mark. This is an n shaped relationship.

Nonparametric correlations
6.5.3 Spearman's rho may be used instead of Pearson's r if:

- frequency histograms of the individual variables are skewed
- a scatter plot of X and Y reveals outliers (outliers have a disproportionate influence on the value of Pearson's r)
- the individual variables are ordinal with few levels

Spearman's rho is computationally identical to Pearson's r; the difference is that the data are first converted to ranks, so that any extreme scores are no longer very different from the bulk of the scores. Converting the data to ranks before analysis entails a loss of information, as you are probably reducing ratio level data to ordinal data. So, use Pearson's if possible.
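A sketch of the rank-then-correlate idea in Python, using a small made-up dataset (not the lecture's data) with one extreme score, to show that rho is unaffected by the outlier while Pearson's r on the raw scores is pulled around by it:

```python
import math

def ranks(values):
    """1-based ranks; tied scores share the average of their ranks."""
    order = sorted(values)
    return [order.index(v) + order.count(v) / 2 + 0.5 for v in values]

def pearson(xs, ys):
    """Pearson's r, computed directly from the deviation scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 100]   # hypothetical data: 100 is an outlier

r = pearson(x, y)                   # outlier drags r well below 1
rho = pearson(ranks(x), ranks(y))   # on ranks the relationship is perfect

print(round(r, 2), round(rho, 2))   # 0.74 1.0
```

Spearman's rho is exactly Pearson's r applied to the ranked data, so the same `pearson` function serves for both.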

Pearson's r for the example data is -0.88
Spearman's rho is -0.82, which is very similar. In the next slide, we will consider what happens if we replace one data point, which was already the most extreme, with an outlier.

Pearson's r for the modified data has increased in size to -0.95
But you can see that this is "driven" by the extreme case. What's the value of Spearman's rho for the modified data? It remains unchanged at -0.82. This example illustrates the wisdom of using non-parametric methods when there are outliers in the data.

An example of perfect correlation….
My age and my brother's age have a positive correlation of 1. But our ages are not causally related. Remember that correlation does not equal causation!