Stats Methods at IC Lecture 3: Regression
Linear regression
Worked Example

Measurements of percentage body fat and thigh circumference (in cm) were taken from twenty healthy females aged 20 to 34. The body fat measurements were obtained by a procedure requiring the immersion of the person in water. It would therefore be very helpful if a regression model could provide reliable predictions of the amount of body fat from other measurements which are easier to obtain.
The scatter plot indicates that there is a positive relationship between the two variables, i.e. increased thigh circumference is associated with increased body fat. Analysis shows:

Number of observations (N) = 20
Intercept (c) = -23.6
Slope (m) = 0.86

The equation for this example is therefore:

body fat = 0.86 × thigh circumference − 23.6

This means that for every one-cm increase in thigh circumference we can predict an increase in body fat of 0.86 percentage points. We now have an equation with which we can predict body fat from thigh circumference. For example, the predicted % body fat for a thigh circumference of 50 cm is 0.86 × 50 − 23.6 = 19.4%.
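As a minimal sketch, the fitted equation can be wrapped in a small Python function. The coefficient values are the ones reported above; the function name is our own.

def predict_body_fat(thigh_cm):
    """Predict % body fat from thigh circumference (cm) using the fitted line."""
    slope = 0.86        # m: change in % body fat per cm of thigh circumference
    intercept = -23.6   # c
    return slope * thigh_cm + intercept

print(predict_body_fat(50))  # approximately 19.4 (% body fat)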
Limitations

There are important limitations which must be considered when drawing conclusions from a regression model such as this.

We can only use this equation to make reliable predictions of body fat within the range of the observed data, i.e. between the lowest and highest values of thigh circumference in the sample; outside this range we have no knowledge of what might be happening.

We have produced this equation from a small sample of the total population. Had we picked a different set of 20 women, we would almost certainly have obtained different values for the slope and intercept. We can therefore construct a confidence interval for the slope to see whether it includes zero, which would indicate no relationship between the explanatory and response variables. Again, a larger sample will lead to a narrower confidence interval, and thus we will have more confidence in our estimates of the coefficients.
Non-linear regression
It is important to realise that not all relationships will follow such a straightforward pattern. In the example given, within the range of the observed data, the relationship appears to be well represented by a straight line; however, this cannot be the case at very high values (for which the response has an upper limit of 100%) or very low values (where the limit is 0%). For predictions over the entire range of possible values of the explanatory variable we would have to consider possible non-linear relationships. Techniques are available to overcome these problems, and it is possible to fit highly complex curves to data, as sketched below.
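As an illustration (not from the lecture), a bounded S-shaped curve can be fitted with scipy's curve_fit. The logistic form, the parameter values, and the synthetic data below are purely illustrative choices of ours.

import numpy as np
from scipy.optimize import curve_fit

def logistic(x, k, x0):
    # Bounded between 0 and 100, respecting the limits of a percentage response
    return 100.0 / (1.0 + np.exp(-k * (x - x0)))

rng = np.random.default_rng(0)
x = np.linspace(45, 60, 20)                        # synthetic thigh circumferences (cm)
y = logistic(x, 0.3, 52.0) + rng.normal(0, 2, 20)  # synthetic responses with noise

params, cov = curve_fit(logistic, x, y, p0=[0.1, 50.0])
print(params)  # estimated k and x0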
T-tests for regression coefficients
The t-test is used to assess whether the association between two variables in a regression model is significant. It applies when a regression model has been fitted to assess the relationship between a response variable and one or more explanatory variables. It tells you whether the coefficient associated with a particular explanatory variable is significantly different from zero, and thus whether a significant relationship exists. The test relies on the assumption that the sampling distributions of the estimates of the regression coefficients are normally distributed.
Example

The output from a regression model is often reported in terms of the coefficients (in this case the intercept and the slope associated with thigh circumference) together with their standard errors and possibly t-statistics and associated p-values. For this example, we are interested in the coefficient (slope) for thigh and not the intercept. The t-statistic is calculated as

t = (m from the sample − m under the null hypothesis) / SE = 7.79

The degrees of freedom for a test of a regression coefficient are (N − p), where p is the number of parameters that have been fitted in the model, in this case two (the intercept and slope). Therefore df = 20 − 2 = 18, giving a one-sided p-value of less than 0.001. This means that there is a significant positive association (positive because the sign of the slope is positive) between thigh circumference and body fat in this sample of women.
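A sketch of this calculation in Python using scipy. The standard error is backed out from the reported numbers (0.86 / 7.79 ≈ 0.110), since the coefficient table itself is not reproduced here.

from scipy import stats

m_hat = 0.86                       # slope estimated from the sample
m_null = 0.0                       # slope under the null hypothesis of no relationship
se = m_hat / 7.79                  # standard error implied by the reported t-statistic
t = (m_hat - m_null) / se          # 7.79
df = 20 - 2                        # N - p, with p = 2 fitted parameters
p_one_sided = stats.t.sf(t, df)    # upper-tail probability
print(t, p_one_sided)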
Multiple regression

Multiple regression is a natural extension of the simple linear regression model, used to predict values of an outcome from several explanatory variables. Each explanatory variable has its own coefficient, and the outcome variable is predicted from a combination of all the variables multiplied by their respective coefficients.
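In symbols, keeping the slide's notation of c for the intercept and m for a slope, a model with k explanatory variables takes the standard form

\hat{y} = c + m_1 x_1 + m_2 x_2 + \dots + m_k x_k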
Example

In addition to body fat and thigh circumference, there were also measurements of triceps skinfold and midarm circumference for the sample of twenty women. A multiple regression model using all three possible explanatory variables will give estimates and standard errors for each variable (plus the intercept), as sketched below.
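A sketch of fitting this three-variable model with statsmodels, assuming the data sit in a file and columns with these hypothetical names (the raw data table is not reproduced in the slides):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bodyfat.csv")  # hypothetical file with columns bodyfat, triceps, thigh, midarm
model = smf.ols("bodyfat ~ triceps + thigh + midarm", data=df).fit()
print(model.summary())           # estimates, standard errors, t-statistics and p-values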
In this example, the coefficient for a particular variable is estimated allowing for the fact that the other variables are also in the model. We can see that, of these variables, thigh has a much stronger relationship with body fat than midarm, which is non-significant and thus would not be considered to add any additional predictive power.
Assessment of the regression model
The output from a regression analysis will often be expressed in terms of an analysis of variance (ANOVA). The idea is to partition the total variability in the response variable into a component that can be explained by the regression line and a component left as unexplained, or random, variation. We want as much variability as possible to be explained by the regression line rather than left unexplained. The ratio of the different components of variation can then be assessed by an F-test.
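In symbols (a standard formulation rather than one shown on the slide), with n observations and p fitted parameters:

\text{SS}_{\text{total}} = \text{SS}_{\text{regression}} + \text{SS}_{\text{residual}},
\qquad
F = \frac{\text{SS}_{\text{regression}} / (p - 1)}{\text{SS}_{\text{residual}} / (n - p)}

The F-statistic is compared against an F-distribution with (p − 1, n − p) degrees of freedom.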
Model choice

Other summary statistics that are commonly presented for a regression analysis are:

R2 — the proportion of the total variation that can be explained by the regression line.
R2 (adjusted) — adjusts for the number of explanatory terms in a model. Unlike R2, the adjusted R2 increases only if a new term improves the model more than would be expected by chance.

We are often interested in whether a particular explanatory variable should be included in the regression model, i.e. does it make a significant contribution to the prediction of the response given the other variables in the model? We can assess the contribution of a variable by omitting it from the regression model and seeing whether the explanatory power has decreased, either by comparing the values of R2 or, more formally, by performing an ANOVA comparing the residual sums of squares from the two models.
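For reference, a standard definition (not given on the slide), with n observations and p counting all fitted parameters including the intercept, as above:

R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p}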
Dealing with a number of explanatory variables raises the important issue of which ones should be included in the model. If adding a particular variable causes a significant improvement in the fit of the model, i.e. a significant reduction in the error, then it should be included. There are several approaches to this, which include:

- Including all explanatory variables, known as the 'full model'.
- The experimenter decides which variables are included in the model (and in which order they are entered).
- Automatic methods, such as 'stepwise selection', where variables are automatically selected according to pre-defined criteria (a sketch follows this list).

Care is needed because the values of the regression coefficients will depend on which other variables are included in the model, and, especially when explanatory variables are correlated, the order in which they are entered into the model can have a great effect.
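A minimal sketch of one automatic method, forward selection. The entry criterion (a p-value threshold) and the function name are our own choices; real stepwise procedures differ in their criteria.

import statsmodels.formula.api as smf

def forward_select(data, response, candidates, alpha=0.05):
    """Greedily add the variable with the smallest p-value until none pass alpha."""
    selected = []
    remaining = list(candidates)
    while remaining:
        # Try adding each remaining candidate to the current model
        pvals = {}
        for var in remaining:
            formula = response + " ~ " + " + ".join(selected + [var])
            pvals[var] = smf.ols(formula, data=data).fit().pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break  # no remaining variable earns its place
        selected.append(best)
        remaining.remove(best)
    return selected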
Model checking

When assessing the validity of a multiple regression model, it is important to try and check whether the assumptions of the model seem to hold. These include:

- Are the relationships linear?
- Are the deviations (residuals) from the regression line normally distributed around the line?
- Are the variances constant over all values of the explanatory variables?
- If two explanatory variables are correlated with the outcome but also with each other, which (if any) is having an effect?

A sketch of some common diagnostic checks follows.
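These checks are sketched below, assuming a fitted statsmodels result called model as in the earlier sketches:

import matplotlib.pyplot as plt
from scipy import stats

resid = model.resid
fitted = model.fittedvalues

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(fitted, resid)       # curvature suggests non-linearity; a funnel shape suggests non-constant variance
ax1.axhline(0, color="grey")
ax1.set_xlabel("fitted values")
ax1.set_ylabel("residuals")
stats.probplot(resid, plot=ax2)  # points close to a straight line suggest normally distributed residuals
plt.show()

Correlation between the explanatory variables themselves can be inspected with, for example, df[["triceps", "thigh", "midarm"]].corr().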
Example

The weight and various physical measurements of 22 male subjects aged 16 to 30 were recorded. Subjects were randomly chosen volunteers. Running a multiple regression with mass as the response variable against all the variables (known as the full model) resulted in the following estimates of the coefficients.
What do we conclude?
The ANOVA table for this regression, including all of the possible explanatory variables, is as follows, with the values of R2 and R2 (adjusted) being 98% and 96% respectively. The F-statistic gives a p-value of less than 0.001. The regression model (as a whole) is clearly a significant predictor of the variability in the response variable.
Often ANOVA is used to assess whether variables need to be included in the model, i.e. whether they contribute significant explanatory power. In this example, it does not look as though the width of a subject's shoulders is a powerful predictor of their body mass, at least when the other variables are included in the model.
In addition to the t-test of the coefficient in the model, which is non-significant (t = -0.122, p = 0.905), we can compare models with and without shoulder included in the variable list. The following table gives an ANOVA comparison between two such models.
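A sketch of this comparison with statsmodels. Only mass and shoulder are named on the slides; the file name and the other column names below are hypothetical stand-ins for the remaining measurements.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

body = pd.read_csv("males.csv")  # hypothetical file of the 22 subjects' measurements
full = smf.ols("mass ~ shoulder + height + chest + waist", data=body).fit()
reduced = smf.ols("mass ~ height + chest + waist", data=body).fit()
print(anova_lm(reduced, full))   # F-test on the increase in residual sum of squares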
Again, we conclude that the effect of shoulder is non-significant as a predictor of body mass when the other variables are included in the model. Note that the p-value for the F-test is the same as that for the t-test of the coefficient: when two models differ by a single parameter, the F-statistic is the square of the corresponding t-statistic, so the two tests are equivalent.