28. Multiple regression The Practice of Statistics in the Life Sciences Second Edition.

1 28. Multiple regression The Practice of Statistics in the Life Sciences Second Edition

2 Objectives (PSLS Chapter 28) Multiple regression
- The multiple linear regression model
- Indicator variables
- Two parallel regression lines
- Interaction
- Inference for multiple linear regression

3 The multiple linear regression model
- In previous chapters we examined a simple linear regression model expressing a response variable y as a linear function of one explanatory variable x. In the population, this model has the form y = α + βx.
- We now examine multiple linear regression models in which the response variable y is a linear combination of k explanatory variables. In the population, this model takes the form μy = β0 + β1x1 + β2x2 + … + βkxk.
- The parameters can be estimated from sample data, giving the fitted model ŷ = b0 + b1x1 + b2x2 + … + bkxk.
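As a sketch, the estimates b0, b1, …, bk are obtained by least squares. Here is a minimal example with k = 2; the data and the true coefficients (2.0, 1.5, –0.8) are simulated purely for illustration:

```python
import numpy as np

# Simulated toy data (not from the textbook): two explanatory variables.
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 5, 50)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(0, 0.5, 50)

# Design matrix with a leading column of 1s for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)  # b = [b0, b1, b2]
print(b)  # estimates should be close to (2.0, 1.5, -0.8)
```

With only moderate noise, the least-squares estimates land close to the coefficients used to generate the data.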

4 Assumptions
- The mean response μy has a linear relationship with the k explanatory variables taken together.
- The y responses are independent of each other.
- For any set of fixed values of the k explanatory variables, the response y varies Normally.
- The standard deviation σ of y is the same for all values of the explanatory variables. In inference, the value of σ is unknown.

5 Indicator variables The multiple regression model can accommodate categorical explanatory variables by coding them in binary mode (0, 1). In particular, we can compare individuals from different groups (independent SRSs in an observational study or randomized groups in an experiment) by using an indicator variable. To compare two groups, we simply create an indicator variable Ind such that
- Ind = 0 for individuals in one group and
- Ind = 1 for individuals in the other group.
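A minimal sketch of this coding in Python; the group labels are hypothetical:

```python
# Code a two-level categorical variable as a 0/1 indicator.
groups = ["control", "treatment", "control", "treatment", "treatment"]
ind = [0 if g == "control" else 1 for g in groups]
print(ind)  # [0, 1, 0, 1, 1]
```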

6 Two parallel regression lines When plotting the linear regression pattern of y as a function of x for two groups, we sometimes find that the two groups have roughly parallel simple regression lines. In such instances, we can model the data using a single multiple linear regression model with two parallel regression lines, built from the quantitative variable x1 and an indicator variable Indx2 for the groups: y = β0 + β1x1 + β2Indx2
- β1 is the slope of both lines
- β0 is the intercept of the Indx2 = 0 line
- (β0 + β2) is the intercept of the Indx2 = 1 line
[Figure: two parallel regression lines, one for Indx2 = 0 and one for Indx2 = 1, vertically offset by β2]
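The parallel-lines model can be fit in one least-squares pass by adding the indicator as a column of the design matrix. A sketch on simulated data, with true values (intercept 1.0, common slope 2.0, group shift –3.0) made up for illustration:

```python
import numpy as np

# Simulated data: one common slope, two intercepts 3.0 apart.
rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, 80)
ind = rng.integers(0, 2, 80)  # 0/1 group indicator
y = 1.0 + 2.0 * x1 - 3.0 * ind + rng.normal(0, 0.3, 80)

X = np.column_stack([np.ones_like(x1), x1, ind])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
# b1: shared slope; b0: intercept of the ind = 0 line; b0 + b2: intercept of the ind = 1 line.
print(b0, b1, b0 + b2)
```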

7 Male fruit flies were randomly assigned to either reproduce (IndReprod = 1) or not (IndReprod = 0), and their thorax length and longevity were recorded. Two separate simple linear regression models have similar slopes, so a single multiple regression model with an indicator variable gives two parallel lines: ŷ = –44.29 + 133.39x1 – 23.55IndReprod.
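The fitted coefficients can be read directly: the indicator's coefficient, –23.55, is the vertical gap between the two parallel lines at any fixed thorax length. A quick sketch using the equation from the slide (the function name and the example thorax length 0.8 are mine, for illustration):

```python
# Fitted equation from the slide: y-hat = -44.29 + 133.39*x1 - 23.55*Ind.
def predicted_longevity(thorax_length, reproducing):
    return -44.29 + 133.39 * thorax_length - 23.55 * reproducing

# At any fixed thorax length, the two lines differ by 23.55 days.
gap = predicted_longevity(0.8, 0) - predicted_longevity(0.8, 1)
print(gap)
```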

8 Interaction When plotting the linear regression pattern of y as a function of x for two groups, we may find two non-parallel simple regression lines. We can model such data with a single multiple linear regression model using a quantitative variable x1, an indicator variable Indx2 for the groups, and an interaction term x1Indx2: y = β0 + β1x1 + β2Indx2 + β3x1Indx2. Each line has its own slope and intercept:
- β1 is the slope of the Indx2 = 0 line
- (β1 + β3) is the slope of the Indx2 = 1 line
- β0 and (β0 + β2) are the intercepts of the Indx2 = 0 and Indx2 = 1 lines, respectively
[Figure: two non-parallel regression lines, one for Indx2 = 0 and one for Indx2 = 1]
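Adding the interaction is just one more column in the design matrix, the elementwise product x1 · Indx2. A sketch on simulated data (the true values 1.0, 2.0, 0.5, 1.5 are made up for illustration):

```python
import numpy as np

# Simulated data: the ind = 1 group has a steeper slope (2.0 + 1.5).
rng = np.random.default_rng(2)
x1 = rng.uniform(0, 10, 100)
ind = rng.integers(0, 2, 100)
y = 1.0 + 2.0 * x1 + 0.5 * ind + 1.5 * x1 * ind + rng.normal(0, 0.3, 100)

# Interaction column x1*ind lets each group have its own slope.
X = np.column_stack([np.ones_like(x1), x1, ind, x1 * ind])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]
# Slopes: b1 for the ind = 0 line, b1 + b3 for the ind = 1 line.
print(b1, b1 + b3)
```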

9 Note that an interaction term can be computed between any two variables (not just between a quantitative variable and an indicator variable). An interaction effect between the variables x1 and x2 means that the relationship between the mean response μy and the explanatory variable x1 is different for varying values of the explanatory variable x2. When comparing two groups (x2 is an indicator variable), this means that the two regression lines will not be parallel.

10 A random sample of children was taken and their lung capacity (forced expiratory volume, or FEV) was plotted as a function of their age and sex (IndSex = 0 for female and IndSex = 1 for male).

11 Using an interaction term to take into account the non-parallel lines, software gives the following multiple regression model: ŷ = 0.6739 + 0.18209x1 – 0.7314Indx2 + 0.10613x1Indx2
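With the interaction term, the two sex-specific slopes fall out of the coefficients on the slide; a quick check of the arithmetic:

```python
# Coefficients from the fitted FEV model on the slide.
b1, b3 = 0.18209, 0.10613
female_slope = b1        # IndSex = 0 line: slope is b1
male_slope = b1 + b3     # IndSex = 1 line: slope is b1 + b3
print(female_slope, male_slope)
```

So each additional year of age is associated with a larger FEV increase for boys (about 0.288) than for girls (about 0.182).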

12 Inference for multiple regression
- We first want to run an overall test. We use an ANOVA F test to test H0: β1 = β2 = … = βk = 0 against Ha: H0 is not true (at least one coefficient is not equal to 0).
- The squared multiple correlation coefficient R² is given in the ANOVA output as R² = SSModel/SSTotal and indicates how much of the variability in the response variable y can be explained by the specific model tested. A higher R² indicates a better-fitting model.
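A sketch of computing R² = SSModel/SSTotal from a least-squares fit, on simulated data (coefficients and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.uniform(0, 10, 60)
x2 = rng.uniform(0, 5, 60)
y = 1.0 + 0.8 * x1 + 0.5 * x2 + rng.normal(0, 1.0, 60)

X = np.column_stack([np.ones_like(x1), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b

sst = np.sum((y - y.mean()) ** 2)    # total variability of y
ssm = np.sum((yhat - y.mean()) ** 2) # variability explained by the model
r2 = ssm / sst
print(r2)  # fraction of the variability in y explained by the model
```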

13 Estimating the regression coefficients
- If the ANOVA test is significant, we can run individual t tests on each regression coefficient: H0: βi = 0 versus Ha: βi ≠ 0 in this specific model, using t = bi/SEbi, which follows the t distribution with n – k – 1 degrees of freedom when H0 is true.
- We can also compute individual level-C confidence intervals for each of the k regression coefficients in the specific model: bi ± t* SEbi, where t* is the critical value for a t distribution with n – k – 1 degrees of freedom.
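A sketch of both formulas; the standard error, sample size, and t* value below are hypothetical, not taken from the slides:

```python
# Hypothetical numbers for one coefficient in a fitted model.
b_i = 133.39         # estimated coefficient
se_i = 20.0          # hypothetical standard error of b_i
t_stat = b_i / se_i  # t statistic for H0: beta_i = 0

# 95% CI: b_i +/- t* * SE, with t* for n - k - 1 = 22 df (hypothetical n, k)
t_star = 2.074       # from a t table
ci = (b_i - t_star * se_i, b_i + t_star * se_i)
print(t_stat, ci)
```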

14 The ANOVA test is significant, indicating that at least one regression coefficient is not zero. R² = 0.81, so this is a very good model that explains 81% of the variation in longevity of male fruit flies in the lab. The individual t tests are all significant, indicating that in this model the regression coefficients are significantly different from zero. The confidence intervals give a range of likely values for these parameters. Because this is a model with two parallel lines, we can conclude that reproducing male fruit flies live between 19 and 28 days less on average than those that do not reproduce, when thorax length is taken into account. [SPSS output]

15 The ANOVA test is significant, indicating that at least one regression coefficient is not zero. R² = 0.67, so this is a good model that explains 67% of the variation in FEV among children. The individual t tests are all significant, indicating that in this model the regression coefficients are significantly different from zero. Because this is a model with a significant interaction effect, we conclude that both age and sex influence FEV in children, but that the effect of age on FEV is different for males and for females. The scatterplot indicates that the effect of age is more pronounced for males.

16 Checking the conditions for inference The best way to check the conditions for inference is to examine graphically the scatterplot(s) of y as a function of each xi, along with the residuals (y – ŷ) from the multiple regression model. Look for:
- Linear trends in the scatterplot(s)
- Normality of the residuals (histogram of residuals)
- Constant σ for all combinations of the xi's (residual plot with no particular pattern and approximately equal vertical spread)
- Independence of observations (check the study design or a plot of the residuals sorted by order of data acquisition)
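The graphical checks above can be backed up numerically; a sketch on simulated data (model and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.uniform(0, 10, 50)
y = 2.0 + 1.0 * x1 + rng.normal(0, 0.5, 50)

X = np.column_stack([np.ones_like(x1), x1])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b  # residuals y - y-hat

# Least-squares residuals average to zero when the model has an intercept.
print(resid.mean())
# Roughly constant spread: compare the residual SD in the two halves of x1.
lo, hi = resid[x1 < 5].std(), resid[x1 >= 5].std()
print(lo, hi)
```

Similar spread in the two halves is consistent with the constant-σ condition; a histogram or Normal quantile plot of `resid` would check Normality.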

