Stats Club Marnie Brennan


1 Linear regression
Stats Club Marnie Brennan

2 Have you used any type of regression before?

3 References
Petrie and Sabin - Medical Statistics at a Glance: Chapters 27 & 28 (Good)
Petrie and Watson - Statistics for Veterinary and Animal Science: Pages (Good)
Dohoo, Martin and Stryhn - Veterinary Epidemiologic Research: Chapter 14

4 What is regression?
Use it to look at the relationship between variables
  To see if one variable can be explained by another (or by several), or to see if one variable predicts another
Linear regression – to look at the relationship between continuous (numerical) variables
  This describes the linear relationship between two variables
  You can determine the ‘line of best fit’ to describe the relationship
  Our focus for today!

5 When would you use regression?
To determine if certain outcomes are associated with certain explanatory variables
  Risk factors for disease
To determine if certain variables can ‘predict’ an outcome variable
Other reasons??

6 Simple linear regression versus multiple linear regression
Simple/univariable linear regression
  A single factor affecting a single outcome
Multiple/multivariable linear regression
  Several factors affecting a single outcome
  Can have binary, nominal and ordinal explanatory variables
  With some manipulations, e.g. creating dummy variables
  We won’t focus on this today!
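
As a minimal sketch of the dummy-variable idea mentioned above: a nominal variable can be recoded into 0/1 indicator columns, leaving one category out as the reference level. The `breed` data and the `make_dummies` helper are hypothetical and only for illustration.

```python
def make_dummies(values, reference):
    """Return one 0/1 indicator list per category other than the reference."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical nominal explanatory variable (not from the slides).
breed = ["collie", "labrador", "terrier", "labrador", "collie"]

# "collie" is the reference level, so it gets no column of its own.
dummies = make_dummies(breed, reference="collie")
for level, column in dummies.items():
    print(level, column)
```

Each indicator column can then enter a multiple regression as an ordinary numeric explanatory variable.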

7 First step – descriptive stats
Scatter plot
  What does the relationship look like between the two variables?
Calculate, using an equation, the regression line – the line that best fits the data
  I.e. the line that minimises the differences between our actual values and the predicted values given by the regression line

8 Scatter plot Taken from Petrie and Watson

9 Regression line Taken from Petrie and Watson

10 Regression equation (this is it for equations, I promise!)
Y = a + bx
Y = outcome variable
a = intercept: the point at which the regression line crosses the Y axis (the value of Y when x = 0)
b = slope of the regression line
  How much the average value of Y changes with each unit change in x
x = explanatory variable used to predict Y
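
To make the roles of a and b concrete, here is a small sketch with made-up coefficients (the values 2.0 and 0.5 are assumptions, not estimates from any data in the slides). It shows that the prediction at x = 0 is the intercept, and that moving x up by one unit changes the prediction by the slope.

```python
def predict(a, b, x):
    """Predicted outcome from the regression equation Y = a + b*x."""
    return a + b * x


# Hypothetical intercept and slope, for illustration only.
a, b = 2.0, 0.5

y_at_zero = predict(a, b, 0.0)                               # equals the intercept a
change_per_unit = predict(a, b, 11.0) - predict(a, b, 10.0)  # equals the slope b

print(y_at_zero, change_per_unit)
```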

11 Creation of the regression line
Minimisation of residuals
  Residual = The difference between an observed and a predicted value
  Residual = Observed value – fitted value
Aiming to minimise the sum of the squared residuals (deviations) between the line and your points
Terminology = Method of least squares OR Ordinary least squares
You can put confidence intervals around the regression line (these indicate how precisely the line has been estimated)
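
The least squares idea can be sketched in a few lines of Python. This uses the standard closed-form solution for one explanatory variable (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x); the data and the `fit_line` name are hypothetical, and in practice a package such as Minitab or SPSS does this for you.

```python
# Hypothetical paired observations, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]


def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_sq_residuals(a, b, x, y):
    """Sum of squared (observed - fitted) values for the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum_sq_residuals(a, b, x, y)

print(f"intercept a = {a:.3f}, slope b = {b:.3f}, SSE = {sse:.4f}")
```

Any other line through the same points, for example one with a slightly different slope, gives a larger sum of squared residuals, which is what "line of best fit" means here.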

12 Regression line Residuals Taken from Petrie and Watson

13 What are the assumptions that have to be satisfied?
There is a linear relationship between x and y
X is measured without error
There are not multiple observations from the same subject (independence)
For each value of x, there is a distribution of values of y in the population, and this distribution is Normal
The variability of the distribution of the y values is the same for all values of x, i.e. the variance is constant

14 Determining if we meet our assumptions
Residuals!
No relationship apparent in the diagram = linear relationship assumption met
Distribution of residuals is approximately Normal = Normality assumption met
Taken from Petrie and Watson

15 Determining if we meet our assumptions
Residuals!
No tendency for the residuals to increase or decrease systematically with fitted values = constant variance assumption met
Taken from Petrie and Watson
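
The residual checks above are normally done by eye on plots. As a rough numeric companion (my own crude sketch, not a method from the references), the code below fits a line to hypothetical data, then compares the spread of the residuals at low and high fitted values and counts how many residuals fall within 2 standard deviations of zero.

```python
import statistics

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]

# Least-squares fit of y = a + b*x.
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b = sxy / sxx
a = mean_y - b * mean_x

fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Order residuals by fitted value and compare their spread in the lower
# and upper halves. A ratio far from 1 hints at non-constant variance.
ordered = [r for _, r in sorted(zip(fitted, residuals))]
half = n // 2
spread_low = statistics.pstdev(ordered[:half])
spread_high = statistics.pstdev(ordered[half:])
spread_ratio = spread_high / spread_low

# Share of residuals within 2 residual standard deviations of zero;
# roughly 95% would be expected if they are approximately Normal.
resid_sd = statistics.pstdev(residuals)
share_within_2sd = sum(abs(r) <= 2 * resid_sd for r in residuals) / n

print(f"spread ratio = {spread_ratio:.2f}, within 2 SD = {share_within_2sd:.0%}")
```

With so few points these numbers are only indicative; the residual plots remain the main tool.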

16 What if our assumptions cannot be met?
Assumption: There is a linear relationship between x and y
Can try to transform the data, e.g. a logarithmic transformation
If outliers are a problem:
  Rule of thumb – if a point lies more than 2 standard deviations away from the group mean, you can remove it, OR
  If a point has a leverage greater than 4/n (n = number of pairs of observations), it should be investigated further
  Run the analysis, remove the outliers and re-run the analysis
  See if there is any difference
Other ways, e.g. Cook’s distance
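
A sketch of the 4/n leverage rule and the "run, remove, re-run" comparison, assuming the usual simple-regression leverage formula h = 1/n + (x - mean of x)² / Sxx. The data are made up so that the last point has an extreme x value; the `slope` helper is hypothetical.

```python
# Hypothetical data with one extreme x value, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 25.0]


def slope(x, y):
    """Least-squares slope for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


n = len(x)
mean_x = sum(x) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)

# Leverage of each point: how far its x value sits from the mean of x.
leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
threshold = 4 / n
flagged = [i for i, h in enumerate(leverage) if h > threshold]

# Re-run the fit without the flagged points and compare the slopes.
# Flagged points should be investigated, not dropped automatically.
x_kept = [xi for i, xi in enumerate(x) if i not in flagged]
y_kept = [yi for i, yi in enumerate(y) if i not in flagged]
slope_all = slope(x, y)
slope_without = slope(x_kept, y_kept)

print(f"flagged indices: {flagged}")
print(f"slope with all points = {slope_all:.3f}, without flagged = {slope_without:.3f}")
```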

17 Outliers Taken from Shelton et al. (2012) Journal of Agricultural Science Vol 150

18 What if our assumptions cannot be met?
Assumptions:
  X is measured without error
  There are not multiple observations from the same subject (independence)
  The variability of the distribution of the y values is the same for all values of x, i.e. the variance is constant
  For each value of x, there is a distribution of values of y in the population, and this distribution is Normal
For the previous two points:
  If there are moderate departures from these assumptions, we can potentially proceed with the analysis, but
  The estimates of the standard errors and the p-values may be affected
  This is quite vague and not necessarily useful!

19 How good is the model? ANOVA table and residual variance (F-ratio)
Investigating the slope
  Testing the null hypothesis that the true slope is zero
Examine the F-ratio in the ANOVA table – this is the ratio of how good the model is to how bad it is (the variation explained by the regression relative to the residual variation)
Calculate the test statistic (the P-value is given) – if the P-value is small, an F value this large would be unlikely if the null hypothesis were true
A good model will have a large F-ratio (at least greater than 1)
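
As a sketch of where the F-ratio comes from in simple linear regression: the total sum of squares is split into a regression part and a residual part, each is divided by its degrees of freedom (1 and n - 2), and F is the ratio of the two mean squares. The data are hypothetical, and the P-value (from the F distribution with 1 and n - 2 degrees of freedom) is not computed here.

```python
# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

# Least-squares fit of y = a + b*x.
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b = sxy / sxx
a = mean_y - b * mean_x
fitted = [a + b * xi for xi in x]

# ANOVA decomposition: total = regression + residual.
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_regression = sum((fi - mean_y) ** 2 for fi in fitted)

# Mean squares: one degree of freedom for the slope, n - 2 for the residuals.
ms_regression = ss_regression / 1
ms_residual = ss_residual / (n - 2)
f_ratio = ms_regression / ms_residual

print(f"F = {f_ratio:.1f} on 1 and {n - 2} degrees of freedom")
```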

20 Assessing how well the predictor explains the outcome
R² value (square of the correlation coefficient) = Coefficient of determination
  The percentage of the variability of y that can be explained by its relationship with x
  E.g. R² = 0.33: 33% of the variation in y is explained by the explanatory variables in the analysis
  100% minus R² (expressed as a percentage) is the unexplained variation
Adjusted R² value
  Accounts for the number of variables included in the model
  Important for multiple linear regression
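
A short sketch of these quantities on hypothetical data: R² as one minus the residual share of the total variation, the check that it equals the squared correlation coefficient in simple regression, and the usual adjusted R² formula 1 - (1 - R²)(n - 1)/(n - k - 1), where k is the number of explanatory variables.

```python
import math

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
k = 1  # number of explanatory variables in the model
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares fit and residual sum of squares.
b = sxy / sxx
a = mean_y - b * mean_x
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - ss_residual / syy          # proportion of variation explained
unexplained = 1 - r_squared                # proportion left unexplained
correlation = sxy / math.sqrt(sxx * syy)   # Pearson correlation coefficient
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"R2 = {r_squared:.4f}, r squared = {correlation ** 2:.4f}")
print(f"adjusted R2 = {adjusted_r_squared:.4f}, unexplained = {unexplained:.4f}")
```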

21 Other analyses
Additional things to consider: Model ‘validation’
  Does the model explain what it seeks to explain?
  Different ways of doing this, e.g. fitting the model on one portion of the data and testing it on the remaining portion
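
One way to picture the split-sample approach is the sketch below: fit the line on a training portion, then measure the prediction error on the held-out portion. The data, the fixed choice of held-out indices and the use of root mean squared error are all assumptions made for illustration; a random split is more common in practice.

```python
import math

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.2, 3.9, 6.1, 8.0, 9.8, 12.1, 14.2, 15.9, 18.1, 19.8]

# Fixed hold-out set so the example is reproducible.
test_idx = {2, 5, 8}
train = [(xi, yi) for i, (xi, yi) in enumerate(zip(x, y)) if i not in test_idx]
test = [(xi, yi) for i, (xi, yi) in enumerate(zip(x, y)) if i in test_idx]

# Least-squares fit on the training portion only.
n = len(train)
mean_x = sum(xi for xi, _ in train) / n
mean_y = sum(yi for _, yi in train) / n
sxx = sum((xi - mean_x) ** 2 for xi, _ in train)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in train)
b = sxy / sxx
a = mean_y - b * mean_x

# Root mean squared prediction error on the held-out portion.
squared_errors = [(yi - (a + b * xi)) ** 2 for xi, yi in test]
rmse = math.sqrt(sum(squared_errors) / len(test))

print(f"fitted on {n} points, tested on {len(test)}: RMSE = {rmse:.3f}")
```

A hold-out error much larger than the residual spread in the training data would suggest the model does not generalise well.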

22 Minitab Or via Regression, Regression

23 Minitab

24 SPSS

25

26 SPSS
R² = The proportion of the variability in the outcome that is explained by the variables in the model
F statistic = Ratio of how good the model is to how bad it is (and significance level)
Coefficients = How much the outcome changes, on average, for each unit change in this particular variable

27 How does this fit with what you do or have seen/experienced?

28 Summary
Used to look at whether there is an explanatory (linear) relationship between continuous variables
Eyeball first with a scatter plot, followed by a regression line (residuals)
The residuals can be used to check the assumptions that must be met
If the data don’t fit the assumptions, there are adjustments that can be made
How well the explanatory variables explain the outcome variable can be measured via the coefficient of determination

29 Next time… REML and mixed effects models

