Paul Fryers Deputy Director, EMPHO Technical Advisor, APHO Introduction to Correlation and Regression Contributors Shelley Bradley, EMPHO Mark Dancox,


1 Paul Fryers, Deputy Director, EMPHO; Technical Advisor, APHO. Introduction to Correlation and Regression. Contributors: Shelley Bradley, EMPHO; Mark Dancox, SWPHO

2 What is ‘correlation’? ‘A relationship between two or more things’. When the value of one variable increases, the value of the other variable tends either to increase (positive correlation) or to decrease (negative correlation); if it tends to do neither, there is no correlation. [Figure: three scatter plots showing positive correlation, negative correlation and no correlation]

3 Correlation coefficient Correlation coefficients measure the strength of (linear) association between continuous variables. The product-moment correlation coefficient r measures linear association, i.e. do the points lie on a straight line?
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$
where $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$ respectively. The statistic can range from –1 (perfect negative correlation) to 1 (perfect positive correlation). This is calculated in Excel using the PEARSON function.
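As an illustration outside Excel, the product-moment coefficient can be sketched in a few lines of Python. The function name `pearson_r` and the data values are made up for this example; the calculation is the standard one (sum of products of deviations divided by the square root of the product of the sums of squared deviations).

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data: deprivation score against life expectancy (years)
deprivation = [10, 20, 30, 40, 50]
life_expectancy = [81.0, 79.5, 78.9, 77.2, 76.0]
r = pearson_r(deprivation, life_expectancy)
print(round(r, 3))  # negative: life expectancy falls as deprivation rises
```

On real data this should agree with Excel's PEARSON function up to rounding.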

4 Exercise 1 Open the spreadsheet: ‘Day 6 Correlation and regression exercise.xls’ 1. Produce scatter plots of life expectancy against each of the deprivation measures. 2. Describe the relationships between life expectancy and deprivation. 3. Calculate the correlation coefficient for life expectancy and the deprivation measures. Which measure of deprivation shows the strongest relationship with life expectancy? Is this the same for both men and women?

5 Interpreting correlation coefficients [Figure: six example scatter plots with r = 1, r = –1, r = 0, r = 0.3, r = –0.5 and r = 0.7]

6 A common mistake Correlation coefficients can be used as a test for a relationship, against a null hypothesis that there is no relationship: i.e. we assume there is no relationship until we can prove otherwise. They must not be used to demonstrate the accuracy of a proxy variable or clinical measurement. E.g. we have a sphygmomanometer that we want to check against a gold standard machine by measuring 100 patients’ blood pressures with both machines. Achieving a ‘significant’ correlation coefficient does not demonstrate that the machine is accurate: of course the two variables are related, because they are two measures of the same thing.

7 Analysing the relationship between two variables

8 Fitting a regression line Look at the scatter plots of life expectancy against deprivation. Is there evidence that the two variables are related to each other, i.e. as one increases, the other tends to increase or decrease consistently? Does the relationship look linear, i.e. for every increase of 1 in the x-variable, y changes by the same amount across the whole graph? If so, we can find the straight line which best fits the relationship and use that to describe the relationship. Excel will add a linear regression line to a scatter plot.

9 Exercise 2 Go back to the spreadsheet: ‘Day 6 Correlation and regression exercise.xls’ 4. Is the relationship between life expectancy and deprivation different for males and females?

10 Defining a line on a graph A straight line is defined as y = a + bx, where a is the intercept (the nominal value of y when x is zero) and b is the gradient (the amount by which y increases for each unit increase in x). [Figure: a line with intercept = 25 and gradient = 25/100 = 0.25, i.e. y = 25 + 0.25x]

11 Fitting a linear regression line The ‘least squares’ method finds the straight line which fits the points most closely. Specifically, it finds the line which minimises the sum of the squared vertical distances between the points and the line: ‘the line of best fit’. Excel’s LINEST function calculates the intercept and the gradient for the line of best fit. (LINEST is an ‘array function’: you need to select two adjacent cells, type in the formula and then hold the CTRL and SHIFT keys down while you press ENTER. The first cell is the gradient, the second is the intercept.)
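For readers who want to see what the least squares calculation involves, here is a minimal Python sketch. The function name `least_squares_line` and the data points are assumptions for illustration. It uses the standard closed-form result for a single x-variable: the gradient is the sum of products of deviations divided by the sum of squared x-deviations, and the intercept is chosen so that the line passes through the point of means.

```python
def least_squares_line(xs, ys):
    """Return (intercept, gradient) of the least squares line y = a + b*x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("gradient is undefined when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x  # line passes through (mean_x, mean_y)
    return intercept, gradient

# Made-up points scattered around the line y = 25 + 0.25x
xs = [0, 20, 40, 60, 80, 100]
ys = [26.0, 29.5, 34.0, 41.0, 44.5, 50.5]
a, b = least_squares_line(xs, ys)
print(f"y = {a:.2f} + {b:.3f}x")
```

Note that this returns the intercept first, whereas LINEST places the gradient in the first cell.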

12 Exercise 3 Go back to the spreadsheet: ‘Day 6 Correlation and regression exercise.xls’ 5. By how much does life expectancy tend to decrease for each point increase in IMD or for each percentage point increase in employment deprivation?

13 Complicating factors – inconsistency Is a straight line appropriate? Very often not. Example 1: the relationship is not consistent across the range of values. We can only overcome this by splitting the analysis into separate parts and looking for reasons why the two subsets may be different.

14 Rank correlation Example 2: it could be linear, but it looks odd. If there are reasons to think there might be subsets then treat it as example 1; otherwise we can use an alternative correlation measure. If we rank the data according to each variable, we can analyse the ranks: this is a common statistical technique which has the advantage of not needing to understand the nature of the underlying data, but the tests are less powerful. For analysing correlation, the most useful statistic is Kendall’s τ (tau).
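The rank-based statistic can be sketched as follows. This is the simplest variant (often called tau-a), which is an assumption on our part since the slides do not say which variant is intended: it counts pairs of points that are ordered the same way on both variables (concordant) against pairs ordered oppositely (discordant), and applies no correction for ties. The function name and data are illustrative.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # pair ordered the same way on x and y
        elif direction < 0:
            discordant += 1      # pair ordered in opposite ways
        # tied pairs (direction == 0) count as neither
    return (concordant - discordant) / (n * (n - 1) / 2)

# A curved but perfectly monotonic relationship still gives tau = 1,
# because only the ordering of the values matters.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
print(kendall_tau(xs, ys))
```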

15 Ecological fallacy Using group data to make inferences about individuals within that group can lead to false conclusions. This is the ‘ecological fallacy’. Example 3 shows an apparently strong negative correlation between x and y. Suppose these are data for different countries, where x represents the average food consumption and y the overall cancer mortality rate. We may conclude that higher food consumption protects against cancer.

16 Ecological fallacy But it is people, not countries, who get cancer. It could be that within countries, people who eat more are more likely to die from cancer. Food consumption is an indicator of general affluence at a national level, and other risk factors for cancer are strongly associated with affluence. Hence, analysing grouped data can lead to precisely the wrong conclusion.

17 Transforming data Example 4: the relationship looks to have a distinct non-linear shape. We can use Kendall’s τ, but it is better to transform the data, i.e. perform a calculation on each value to ‘make them linear’; this could apply to either or both variables. In this example, the most common transformation, taking the logarithm of the y-variable, results in a linear relationship.

18 Complicating factors Example 5: the variability of the y-variable appears to increase as the x-variable increases, as well as the value of y. Again we need to transform the data if possible to give consistent variability across the range of values.
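One situation where a transformation achieves this is when the scatter is proportional to the size of y. The sketch below uses invented data in which each reading is the underlying value 3x multiplied by a factor of 0.5, 1 or 2 (all of these numbers are assumptions for illustration). The spread of raw y values grows with x, but the spread of log(y) is the same at every x.

```python
import math

# Invented data: at each x, three readings scattered multiplicatively around 3x
xs = [1, 2, 4, 8, 16]
factors = [0.5, 1.0, 2.0]

raw_spread = []
log_spread = []
for x in xs:
    readings = [3 * x * f for f in factors]
    # Range of the raw readings at this x: grows in proportion to x
    raw_spread.append(max(readings) - min(readings))
    # Range of the logged readings at this x: always log(2.0 / 0.5) = log(4)
    log_spread.append(math.log(max(readings)) - math.log(min(readings)))

print("spread of y:     ", raw_spread)
print("spread of log(y):", [round(s, 3) for s in log_spread])
```

Whether a log transformation is appropriate for real data depends on how the variability actually behaves, so the plot should be checked again after transforming.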

19 Summary To examine relationships between two variables, we should always start by plotting a graph of the data. Never fit a linear regression model or calculate a correlation coefficient unless the data look linear and consistent. Correlation coefficients give an indication of the strength of a relationship. If the data are clearly related but look odd, look for non-parametric statistics such as Kendall’s τ.

