Multiple Linear Regression
Linear regression with two or more predictor variables
Introduction
In linear regression, you often want to investigate the relationship between an outcome and more than one predictor variable. In that case, your model contains more than one independent variable. It is also often important to investigate a possible interaction between two or more independent variables.
Consider the following situation: The file air.txt contains a subsample of data from a study of the effect of air pollution on lung function. The variables measured were age, gender, height, weight, forced vital capacity (FVC), and forced expiratory volume in 1 second (FEV1). FVC is the total volume of air, in liters, that an individual can expel regardless of how long it takes. FEV1 is the volume of air expelled during the first second when an individual has been told to breathe in deeply and then expel as much air as possible. (Dunn and Clark (1987), Applied Statistics: Analysis of Variance and Regression, p. 354.)
Input the file air.txt into SAS with the following code (adjusting the location of the file as necessary):

DATA air;
  INFILE 'C:\air.txt' dlm = ' ' firstobs = 2;
  INPUT sex age height weight fvc fev1;
  height_age = height*age;
RUN;

The statement height_age = height*age; creates a new variable that represents the interaction between height and age.
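Before modeling, it can help to confirm that the file was read as intended. The following is a minimal sketch, assuming the DATA step above ran without errors; the choice to print five observations is arbitrary and not part of the original tutorial.

```sas
/* Print the first five observations to check the column order */
PROC PRINT DATA = air (OBS = 5);
RUN;

/* Summary statistics, including the derived interaction variable */
PROC MEANS DATA = air N MEAN STD MIN MAX;
  VAR age height weight fvc fev1 height_age;
RUN;
```

If the minimum or maximum of any variable looks implausible, the delimiter or the firstobs setting in the INFILE statement is a likely place to check first.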
Exploring the Data
We are interested in which factors may predict FVC. To explore this before analyzing the data, create two plots: one of FVC vs. height, the other of FVC vs. age:

PROC GPLOT DATA = air;
  PLOT fvc * height;
  PLOT fvc * age;
RUN;
Plot of FVC * Height
Plot of FVC * Age
A linear relationship between FVC and height appears justified, although it is unclear whether a linear relationship exists between FVC and age. Create a multiple linear regression model using both height and age to predict FVC:

PROC REG DATA = air;
  MODEL fvc = height age;
RUN;
QUIT;
Multiple Regression Output
Interpreting Output
The multiple regression equation is:

Yhat = -6.67 + 0.18(height) - 0.03(age)

The R-Square value is interpreted the same way as in simple linear regression: 67% of the variance in FVC is explained by height and age in the model. Because the model includes more than one predictor variable, you may want to use the adjusted R-Square (Adj R-Sq), which accounts for the number of predictors in the model, instead of the R-Square when interpreting the amount of variance explained by the independent variables.
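As a worked illustration of the fitted equation, consider a hypothetical subject with height = 65 and age = 40, in whatever units air.txt records them. These two values are assumptions chosen only to show the arithmetic; they are not taken from the study. The second line gives the usual definition of the adjusted R-Square, where n is the number of observations and p is the number of predictors (p = 2 here). Because n is not stated in this tutorial, no value is computed.

```latex
\begin{align*}
\hat{Y} &= -6.67 + 0.18(65) - 0.03(40) = -6.67 + 11.70 - 1.20 = 3.83 \\
R^2_{\text{adj}} &= 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
\end{align*}
```

The prediction uses the rounded coefficients shown above, so it will differ slightly from a value computed with the full-precision estimates in the SAS output.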
Overall F-test
To test whether all of the variables taken together significantly predict the outcome variable (FVC), use the overall F-test. The test statistic (F* = 36.96) is found under F Value. The associated p-value (<0.001) is found under Pr > F.

H0: β1 = β2 = 0 vs. Ha: At least one β ≠ 0.

Because the p-value is less than 0.05, we reject the null hypothesis and conclude that, taken together, height and age are significantly related to FVC.
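For reference, the overall F statistic is conventionally built from the Analysis of Variance table as the ratio of the model mean square to the error mean square. The sketch below uses standard notation that does not appear in the tutorial itself: SSR and SSE are the model and error sums of squares, p is the number of predictors, and n is the number of observations.

```latex
\begin{align*}
F^{*} &= \frac{\mathrm{MSR}}{\mathrm{MSE}}
       = \frac{\mathrm{SSR}/p}{\mathrm{SSE}/(n - p - 1)}
       = \frac{R^2/p}{(1 - R^2)/(n - p - 1)}
\end{align*}
```

Under the null hypothesis, F* is compared with an F distribution with p and n - p - 1 degrees of freedom, which is where the Pr > F value comes from.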
Testing Significance of One Variable
To test the significance of an individual variable in predicting FVC, use the test statistic (t Value) and associated p-value (Pr > |t|) for that particular variable. For example, the test of whether height is significantly related to FVC [H0: β1 = 0 vs. Ha: β1 ≠ 0] has t* = 8.15, p < 0.0001. Reject the null hypothesis and conclude that height is significantly related to FVC after accounting for age.
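The t Value for each variable is, by the standard construction, its estimated coefficient divided by its standard error; both quantities appear in the Parameter Estimates table. The notation below (b_j for the estimate and SE for its standard error) is assumed for illustration and is not used elsewhere in this tutorial.

```latex
\begin{align*}
t^{*}_{j} &= \frac{b_j}{\mathrm{SE}(b_j)}, \qquad \text{with } n - p - 1 \text{ degrees of freedom}
\end{align*}
```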
Testing for an Interaction
Because we have more than one predictor variable, it is important to consider whether they interact in some way. To test whether the interaction between height and age is significant, create another model in SAS that contains the main effects of height and age as well as the interaction term you created:

PROC REG DATA = air;
  MODEL fvc = height age height_age;
RUN;
QUIT;
Output with Interaction Term
Is the interaction significant?
Notice that the p-value for the interaction is 0.39, which is greater than 0.05. Therefore, the interaction between age and height is not significant, and we do not need to include it in the model. Additionally, notice that the R-Square is 0.679, indicating that 68% of the variability in FVC is explained by height, age, and height_age. This is not much larger than the R-Square from the model with just height and age. Because R-Square cannot decrease when a predictor is added, such a small increase is another indication that the interaction term is not necessary. The final model only needs to include height and age as predictors of FVC.
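As an alternative sketch, PROC GLM accepts a product of two continuous variables directly in the MODEL statement, so the interaction does not have to be created in the DATA step. This is not the approach used in this tutorial; it is shown only as a possible cross-check, and it assumes the air data set from earlier.

```sas
/* height*age is treated as the product of the two continuous variables */
/* because neither one is listed in a CLASS statement */
PROC GLM DATA = air;
  MODEL fvc = height age height*age / SOLUTION;
RUN;
QUIT;
```

The test for height*age in this output should lead to the same conclusion as the height_age row in the PROC REG output.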
Conclusions
Multiple linear regression in SAS is very similar to simple linear regression. The major differences are that more variables are added to the MODEL statement and that interaction terms need to be considered. Use the same options as before to create confidence intervals (clb, cli, clm), identify outliers (r), and identify influential points (influence).
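The options named above can be requested together on the MODEL statement of the final model. The following is a minimal sketch, assuming the air data set from earlier; in practice you may prefer to request only the options you need, since the combined output is long.

```sas
PROC REG DATA = air;
  /* CLB: confidence limits for the parameter estimates */
  /* CLM and CLI: limits for the mean response and for an individual prediction */
  /* R: residual analysis, INFLUENCE: influence diagnostics */
  MODEL fvc = height age / CLB CLM CLI R INFLUENCE;
RUN;
QUIT;
```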
Linear regression is used with continuous outcome variables. If the outcome variable of interest is categorical, logistic regression analysis is used instead. The next tutorial is an introduction to logistic regression.