Presentation on theme: "Covariance and correlation"— Presentation transcript:
1 Covariance and correlation
Dr David Field
Sir Francis Galton. His student, Karl Pearson, gave his name to the Pearson correlation, but it was created by Galton. The image shows various scatter plots between X and Y variables and the associated correlation values. Note that the slope (gradient) of the relationship does not influence the size of the correlation; slopes are assessed with linear regression. The central scatter plot is a special case where one of the variables has zero variance. Therefore, there can be no covariance and the correlation is 0.
2 Summary
Correlation is covered in Chapter 6 of Andy Field, Discovering Statistics Using SPSS (3rd edition)
Assessing the co-variation of two variables
Scatter plots
Calculating the covariance and the "Pearson product moment correlation" between two variables
Statistical significance of correlations
Interpretation of correlations
Limitations and non-parametric alternatives
3 Introduction
Sometimes it is not possible for researchers to experimentally manipulate an IV using random allocation to conditions and measure a dependent variable, e.g. the relationship between income and self esteem. But you can still measure two or more variables and ask what relationship, if any, they have. One way to assess this relationship is the "Pearson product moment correlation". Another example would be the relationship between alcohol consumption and exam performance in students; it would be unethical to manipulate alcohol intake.
For example, if you are interested in how income influences self esteem (confidence), you can't easily manipulate participants' income. You just have to make observations of income and self esteem as they exist "in the field" rather than in the lab. What you will do is collect a sample, measure the variation in income, and see whether differences from the mean income are accompanied by similar differences from the mean of self esteem (or directly opposite differences).
4 Units of alcohol per week and exam %

Units alcohol per week | Exam %
13 | 63
10 | 60
24 | 55
3 | 70
5 | 80
35 | 41
20 | 50
14 | 58
17 | 61
19 | 63

For each participant we record units of alcohol consumed per week and exam %. Don't forget that this is not an experiment, and any observed dependence of the 2 variables on each other could be due to both variables being caused by a 3rd variable (e.g. stress). Before performing any statistical analysis, the first step is to visualise the relationship between the two variables using a scatter plot.
5 Scatterplot (4.8.1)
[Figure: scatter plot of exam % against units of alcohol per week, for the data in the table on the previous slide.]
In a scatter plot, each pair of data points results in a single point on the graph. You are plotting the value of some variable Y against the value of another variable X. In this imaginary example, as units of alcohol increase, exam mark decreases. If the relationship were perfect, all the points would lie on a single straight line. As it is, there is some scatter around the best fitting straight line, indicating that other unmeasured factors influence exam performance as well as alcohol consumption. The correlation coefficient quantifies the extent to which the points on the graph are spread around the best fitting straight line, compared to lying exactly on the line. Note that correlation does not tell you what that best fitting straight line is (linear regression does that).
6 Calculating the covariance of two variables (6.3.1)
Covariance is a measure of how much two variables change together. It presumes that for each participant in the sample two variables have been measured. If two variables tend to vary together (that is, when one of them is above its mean, the other tends to be above its mean too), then the covariance between the two variables will be positive. If, when one of them is above its mean value, the other tends to be below its mean value, then the covariance between the two variables will be negative.
First, let's revisit variance. Covariance is similar in conception to the variance. Calculating the correlation between two variables involves first calculating the covariance, and so understanding covariance is the key to understanding correlation.
7 Variance of one variable
To calculate variance:
subtract the mean from each score
square the results
add up the squared scores
divide by the number of scores - 1
Squaring makes sure that the variance will not be negative, and it emphasizes the effect of very large and very small scores that are far from the mean. If all the scores are close to the mean, the variable has restricted variance and it is unlikely that any other variable will co-vary with it.
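The four steps above can be sketched in a few lines of Python, using the exam scores from the example data (a sketch only; the helper name is invented for illustration, and this is not SPSS output):

```python
# A minimal sketch of the four variance steps from the slide.
def sample_variance(scores):
    mean = sum(scores) / len(scores)
    deviations = [x - mean for x in scores]    # subtract the mean from each score
    squared = [d ** 2 for d in deviations]     # square the results
    return sum(squared) / (len(scores) - 1)    # add up, divide by N - 1

exam = [63, 60, 55, 70, 80, 41, 50, 58, 61, 63]
print(sample_variance(exam) ** 0.5)  # the SD of the exam scores, about 10.59
```

Taking the square root at the end converts the variance back into the SD, in the same units as the original scores.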
8 Covariance of two variables, X and Y
For each pair of scores:
subtract the mean of variable X from each score in X
subtract the mean of variable Y from each score in Y
multiply each pair of difference scores together
Then sum the results and divide by the number of scores - 1. The third step replaces squaring in the variance formula, but otherwise the two formulas are similar.
The -1 is included because it has been shown that small samples tend to underestimate the underlying population covariance (as is also the case for variance). It has negligible effect on the estimate when the sample is large, but a noticeable effect when the sample is small.
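These steps can be sketched in Python on the example data (again a hypothetical helper, not the SPSS procedure):

```python
# Sketch of the covariance steps from the slide.
def sample_covariance(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # multiply each pair of difference scores together...
    products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
    # ...then sum and divide by N - 1
    return sum(products) / (len(x) - 1)

alcohol = [13, 10, 24, 3, 5, 35, 20, 14, 17, 19]
exam = [63, 60, 55, 70, 80, 41, 50, 58, 61, 63]
print(round(sample_covariance(alcohol, exam), 1))  # -87.3
```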
9 Multiply the difference scores

alcohol | exam % | alcohol - mean (16) | exam - mean (60.1) | product of differences
13 | 63 | -3 | 2.9 | -8.7
10 | 60 | -6 | -0.1 | 0.6
24 | 55 | 8 | -5.1 | -40.8
3 | 70 | -13 | 9.9 | -128.7
5 | 80 | -11 | 19.9 | -218.9
35 | 41 | 19 | -19.1 | -362.9
20 | 50 | 4 | -10.1 | -40.4
14 | 58 | -2 | -2.1 | 4.2
17 | 61 | 1 | 0.9 | 0.9
19 | 63 | 3 | 2.9 | 8.7

Sum the right hand column and divide by the number of participants - 1 to estimate the population covariance: -786 / 9 = -87.3.
If a participant has scores above the mean on both variables, this makes a positive contribution to the covariance. Equally, if a participant falls below the mean on both variables, this also results in a positive contribution (- × - = +). On the other hand, if a participant is above the mean on one variable but below the mean on the other, this produces a negative contribution to the covariance. Considering the sample as a whole, if roughly half of the participants make a negative contribution and roughly half make a positive contribution, these contributions will tend to cancel each other out and the covariance will be small or close to zero, indicating that the two variables are not related. A positive value of covariance indicates that as one variable increases, so does the other. A negative value indicates that as one variable increases, the other goes down, as in our example of alcohol consumption going up and exam performance coming down.
Note: explain this in terms of the example, e.g. row 4 is a person who drinks less than average and scores higher than average in the exam. Row 6 is a person with above-average alcohol units and a below-average exam score. Both of these people make a negative contribution to the covariance.
Green rectangles: both examples of participants making positive contributions to the covariance. Blue rectangles: both examples of participants making negative contributions to the covariance. Red oval: 3 participants that dominate the covariance statistic, because multiplying exaggerates the bigger differences, just as squaring does in the calculation of the variance.
10 Covariance formula

cov(x,y) = Σ (X - X̄)(Y - Ȳ) / (N - 1)

The bar on top refers to the mean of the variable. Σ (sigma) means "the sum of". X and Y both refer to each of the scores in turn.
Under what circumstances would cov(x,y) equal approximately zero? Answer: if roughly half of the participants make a negative contribution to the covariance and roughly half make a positive contribution of about the same magnitude, these contributions will tend to cancel each other out and the covariance will be small or close to zero, indicating that the two variables are not related. Cov will also be about zero if one or both variables have very little variance.
11 Converting covariance to correlation (6.3.2)
Knowing that the covariance of two variables is positive is useful, as it indicates that as one increases, so does the other. But the actual value of covariance depends on the measurement units of the variables. If the exam scores had been given out of 45 instead of as percentages, the covariance with alcohol consumption would be about -39.3 (0.45 × -87.3) instead of -87.3, but the real strength of the relationship would be the same. Because the covariance is dependent upon the measurement units used, it is hard to interpret unless we first standardize it.
Conversion of covariance to a correlation coefficient is analogous to making the variance more useful by taking its square root to convert it into the SD, which is in the same units as the original DV.
12 Converting covariance to correlation
Ideally we'd like to be able to ask whether the covariation of alcohol consumption and exam scores is stronger or weaker than the covariation of alcohol consumption and hours studied. The standard deviation provides the answer, because it is a universal unit of measurement into which any other scale of measurement can be converted. Because the covariance uses the deviation scores of two variables, to standardize the covariance we need to make use of the SD of both variables.
13 Pearson's r correlation coefficient

r = cov(x,y) / (SDx × SDy)

This means dividing by the total variation in both variables. The reason you multiply the SDs of the two variables together (rather than adding them) is that the cov(x,y) formula also involved multiplying the variables together (or rather, multiplying the differences from their means, see earlier).
What is the biggest value r could take? r can take a maximum value of 1. This is because the total shared variation represented by cov(x,y) can be as large as the total variation in the variables, but no bigger. If the shared variation is smaller than the total variation, the magnitude of r will be less than 1.
14 Pearson's r correlation coefficient
The result of standardisation is that r has a minimum of -1 and a maximum of 1:
-1 perfect negative relationship
-0.5 moderate negative relationship
0 no relationship
0.5 moderate positive relationship
1 perfect positive relationship
To achieve a correlation of 1 (or -1), the shared variation cov(x,y) has to be as big as the total variation in the data, represented by the two SDs multiplied together.
15 The covariance of percentage exam score and alcohol units is -87.3. The SD of exam scores is 10.58, and the SD of alcohol units per week is 9.37.

r = -87.3 / (10.58 × 9.37) = -87.3 / 99.20 = -.88
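The whole calculation can be checked in a few lines of Python (a sketch; the function name is invented for illustration, and the tiny discrepancies with the slide come from rounding the intermediate values there):

```python
import math

# r = cov(x,y) / (SDx * SDy), computed from the raw example data.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)

alcohol = [13, 10, 24, 3, 5, 35, 20, 14, 17, 19]
exam = [63, 60, 55, 70, 80, 41, 50, 58, 61, 63]
print(round(pearson_r(alcohol, exam), 2))  # -0.88
```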
16 Scatter plots and correlation values
This figure, taken from Wikipedia, shows a series of scatter plots with their correlation values. The vertical and horizontal axes of the graphs have been removed for visual clarity.
17 Scatter plots and correlation values
The red circles highlight correlations of 1 or -1, where all the points of the X versus Y plot lie on a single straight line. 1 and -1 are equally strong correlations; the sign indicates the direction of the correlation. Positive correlations are where the value of Y increases as X increases; for negative correlations the value of Y decreases as X increases.
The green circles highlight two more correlations of 1, illustrating the fact that the slope of the best fitting line on the scatter plot does not influence the size of the correlation coefficient, and so the correlation coefficient gives no information about the slope of the line relating two variables. Slopes are important, but they are assessed through regression, not correlation.
Finally, the black circle highlights a special case, where the relationship appears to be perfect but the correlation is 0 rather than 1. This is because the Y variable has zero variation, and so it cannot covary with anything: as X varies, Y stays the same.
18 Scatter plots and correlation values
The correlation coefficient between two variables is itself a descriptive statistic, analogous to the effect size of the difference between two sample means. The scatter plot with 0 correlation provides a null hypothesis and null distribution for calculating an inferential statistic: we can calculate the p value of an observed correlation (data) being obtained by random sampling from the null scatter plot.
Red circles indicate that as the spread of the data points around the best fitting straight line increases, the size of the correlation coefficient reduces, until the situation described by the black circle is reached, where there is no best fitting straight line, just a cloud of points. In this case there is no relationship at all between the two variables (put another way, the relationship is random). This gives a correlation coefficient of zero.
The scatter plot where the relationship between X and Y is random, and the correlation is zero, has a special status, because it represents a null hypothesis distribution for the relationship between any 2 variables. As with the t test, we are able to calculate the probability that an observed set of data could occur by random sampling from the null distribution. This is the p value of the correlation, and we can use it to judge whether an observed correlation is statistically significant.
19 Statistical significance of correlations (6.3.3)
SPSS reports a 2-tailed p value for correlations: the probability of obtaining the data by random sampling from a population scatter plot with 0 correlation. If p is less than 0.05, you can reject the null hypothesis and declare the correlation statistically significant. If you predicted the direction of the correlation, the p value can be divided by 2 (a one-tailed test).
The p value is very dependent on sample size: if the sample size is large, then even very small values of the correlation coefficient will easily reach significance. Only report correlations that reach significance, but beyond this you should place more emphasis on interpretation of the direction and size of the correlation coefficient itself. Examples of reporting correlation will be given in the workshop.
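SPSS computes the exact p value; as a rough sketch of the arithmetic behind it, r can be converted to a t statistic with N - 2 degrees of freedom and compared against a tabled critical value (the conversion formula is the standard one for testing a correlation, not anything specific to SPSS, and the helper name is invented):

```python
import math

# Convert a correlation r from a sample of size n into a t statistic.
def r_to_t(r, n):
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = r_to_t(-0.88, 10)  # the example correlation, N = 10
print(round(t, 2))     # about -5.24
# 2.306 is the two-tailed .05 critical value for df = 8 (standard t table),
# so |t| exceeds it and the correlation is significant at p < .05.
print(abs(t) > 2.306)  # True
```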
20 The coefficient of determination (R2)
Venn diagrams can show the proportion of variance shared between X and Y: no overlap for a 0 correlation, a small overlap for a weak correlation, and a large overlap for a strong (but not perfect) correlation.
The correlation coefficient squared is known as the coefficient of determination, or "r square". This quantity reflects the proportion of variance shared by the two variables. The null hypothesis scatter plot, where the relationship between the two variables is nothing more than random, can also be represented as a Venn diagram of the variance of the two variables and their proportion of overlap (no overlap in the 0 correlation case).
21 The coefficient of determination
To express quantitatively what is expressed visually by the Venn diagrams, square the correlation coefficient (multiply it by itself). The result will always be a positive number, and it describes the proportion of variance that the two variables have in common. It is also referred to as R2.
22 [Figure: r squared plotted against r; vertical axis ticks 0.2, 0.4, 0.6, 0.8.]
Note the rapid decline of the coefficient as the correlation reduces:
r = 0.9: 81% shared variance
r = 0.5: 25% shared variance
r = 0.3: 9% shared variance
Try to keep this non-linear relationship in mind when you read about correlation coefficients in the literature. An r of 0.4 is not half as strong as an r of 0.8; it is actually 25% as strong (0.16 versus 0.64 shared variance). Doubling the correlation multiplies the strength of the relationship by 4, not 2!
Psychologists are usually quite pleased when they obtain a correlation of 0.3 between theoretically important variables, but this is only 9% of shared variance. This is an indication that Psychology is not terribly good at explaining the things it seeks to explain.
23 Correlation - limitations
Before running a correlation between any pair of variables, produce a scatter plot in SPSS. If there is a relationship between the two variables but it appears to be non-linear, then correlation is not an appropriate statistic. Non-linear relationships can be U shaped or n shaped, or like the graph on the previous slide.
As an example of a non-linear relationship between 2 variables, my head of department when I was an undergraduate, Ian Howard, claimed that exam performance increased as the number of hours studied increased, but only up to a point. After that point, further increasing your workload actually reduced your exam mark. This is an n shaped relationship.
24 Nonparametric correlations (6.5.3)
Spearman's rho may be used instead of Pearson's r if:
frequency histograms of the individual variables are skewed
a scatter plot of X and Y reveals outliers (outliers will have a disproportionate influence on the value of Pearson's r)
individual variables are ordinal with few levels
Spearman's rho is computationally identical to Pearson's r; the difference is that the data are first converted to ranks, so that any extreme scores are no longer very different from the bulk of the scores. Converting the data to ranks before analysis entails a loss of information, as you are probably reducing ratio level data to ordinal data. So, use Pearson's if possible.
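The "rank first, then apply the Pearson formula" idea can be sketched directly (helper names are invented; tied scores get the average of their ranks, which is one common convention):

```python
import math

def to_ranks(values):
    # 1-based ranks; tied scores share the average of their ranks
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sd_x * sd_y)

def spearman_rho(x, y):
    # computationally identical to Pearson's r, but applied to the ranks
    return pearson_r(to_ranks(x), to_ranks(y))

alcohol = [13, 10, 24, 3, 5, 35, 20, 14, 17, 19]
exam = [63, 60, 55, 70, 80, 41, 50, 58, 61, 63]
print(round(spearman_rho(alcohol, exam), 2))
```

For this data the sketch gives about -0.81, very close to the -0.82 reported on the following slides; the small difference presumably comes down to rounding or to how the tied exam scores are handled.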
25 Pearson's r for the example data is -0.88
Spearman's rho is -0.82, which is very similar. In the next slide, we will consider what happens if we replace one data point, which was already the most extreme, with an outlier.
26 Pearson's r for the modified data has increased in size to -0.95
But you can see that this is "driven" by the extreme case. What's the value of Spearman's rho for the modified data? It remains unchanged at -0.82. This example illustrates the wisdom of using non-parametric methods when there are outliers in the data.
27 An example of perfect correlation
My age and my brother's age have a positive correlation of 1, but our ages are not causally related. Remember that correlation does not equal causation!