Multiple regression analysis requires meeting several assumptions. We will: (1) identify some of these assumptions; (2) describe how to tell if they have been met; and (3) suggest how to overcome or adjust for violations of the assumptions, if violations are detected.
Some Assumptions Underlying Multiple Regression
1. No specification error
2. Continuous variables
3. Additivity
4. No multicollinearity
5. Normally distributed error term
6. No heteroscedasticity

A. Absence of Specification Error
This refers to three different things:
1. No relevant variable is absent from the model.
2. No unnecessary variable is present in the model.
3. The model is estimated using the proper functional form.
A1. No relevant variable is absent from the model. There is no statistical check for this type of error. Theory and knowledge of the dependent variable (that is, the phenomenon of interest) are the only checks. The strength of causal claims is directly proportional to the adequacy and completeness of model specification.
A2. No unnecessary variable is present in the model. Another, less serious, type of specification error is the inclusion of one or more variables that are NOT associated with the dependent variable. You discover this when an independent variable proves NOT to be statistically significant. However, if theory and knowledge of the subject demanded that the variable be included, then this is not really specification error. Be careful not to remove statistically insignificant variables simply in order to re-estimate the model without them. This smacks of one of the sins of multiple regression analysis, stepwise regression.
A3. Proper functional form. A third aspect of proper model specification relates to what is called the functional form of the analysis. Multiple regression analysis assumes that the model has been estimated using the correct mathematical function. Recall that our discussions of the line of best fit, etc., have all emphasized the idea that the data can be described by a straight line, however imperfectly, rather than by some other mathematical function. This is what functional form is all about.
To determine whether the assumption of linear form is violated, simply create scatterplots for the relationship between each independent variable and the dependent variable. Examine each scatterplot to see if there is overwhelming evidence of nonlinearity. Use PROC PLOT for this:

    libname old 'a:\';
    libname library 'a:\';
    proc plot data=old.cities;
      plot crimrate*(policexp incomepc stress74);
      title1 'Plots of Dependent and All Independent Variables';
    run;
[Three PROC PLOT scatterplots. Vertical axis in each: NUMBER OF SERIOUS CRIMES PER 1,000. Horizontal axes: (1) POLICE EXPENDITURES PER CAPITA; (2) INCOME PER CAPITA, IN $10S; (3) LONGTERM DEBT PER CAPITA, 1974.]
If one or more of the relationships look nonlinear, then the INDEPENDENT variable can be transformed. Let's pretend that the INCOMEPC variable in our annual salary model showed an especially nonlinear relationship with salary. The most frequent transformations involve converting raw values of the independent variable to their natural logarithms, their squares, or their inverses. This is easily done in a SAS DATA step:

    data temp1;
      set old.cities;
      logofinc = log(incomepc);
      incsqrd = incomepc**2;
      incinv = 1 / incomepc;
    run;

Then the transformed values are replotted against the dependent variable. Inspection will tell which transformation has produced the most linear relationship:

    proc plot data=temp1;
      plot salary * (logofinc incsqrd incinv);
      title1 'New Scatterplots After Transformations';
    run;

We would estimate the model using the logarithmic values of income rather than raw values; the other variables would be used in their original forms.

    proc reg data=temp1;
      model salary = educ logofinc age;
      title1 'Regression with Transformed Variable';
    run;

Notice that the variable 'logofinc' exists ONLY in the temporary data set, 'temp1', which we use in the analysis. With this variable transformed, the model is estimated with the proper functional form, and specification error is avoided.
B. Continuous Variables
We have stressed throughout our discussions of simple and multiple regression analysis that continuous variables are required. Our use of graphs to introduce the concept of the scatterplot in fact REQUIRED variables that were measured on at least an equal-interval scale. However, in constructing multiple regression models, one invites specification error if one does not include discrete variables such as ethnicity or gender as independent variables. How does one do this when these are not continuous variables? The answer is to create what are called dummy variables.
Dummy variables are binary variables, that is, they have values of 0 and 1. The value 0 means that the phenomenon is absent; the value 1 means that it is present. These are by definition equal-interval variables since the distance between 0 and 1 is equal to the distance between 1 and 0. Actually, this works: if you calculate the mean for a binary variable, the result is the proportion of observations in the "1" category. With a variable like GENDER, creating a dummy is simple: recode the category "f" (female) as 1 and the category "m" (male) as 0 (or vice versa) to create the new variable, FEMALE:
    libname old 'a:\';
    libname library 'a:\';
    data old.somedata;
      set old.somedata;
      if gender = 'f' then female = 1;
      else if gender = 'm' then female = 0;
      else female = .;
    run;

The new variable, "female," can then be added as a (continuous) control variable.
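The point that the mean of a 0/1 variable equals the proportion of cases coded 1 can be checked directly. The following is a minimal sketch, assuming the variable FEMALE has already been created in old.somedata by the DATA step above; the title text is only illustrative.

```sas
/* Sketch: the MEAN of a 0/1 dummy is the proportion of cases coded 1. */
/* Assumes FEMALE already exists in old.somedata (see the DATA step).  */
proc means data=old.somedata n nmiss mean;
  var female;
  title1 'Proportion Female (Mean of Dummy Variable)';
run;
```

The NMISS column shows how many cases had a GENDER value other than 'f' or 'm' and were therefore set to missing.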
What about a variable like ethnicity (ETHNIC)? Let's say that ETHNIC is coded as follows:

    1 = Anglo
    2 = Hispanic
    3 = African American
    4 = Asian American
    9 = All other

To create continuous variables, we would create J - 1 dummy variables, where J is the number of categories of the original variable. This is easy:
    libname old 'a:\';
    libname library 'a:\';
    data old.somedata;
      set old.somedata;
      /* each dummy is 1 for its own category, 0 for the other valid codes, */
      /* and missing for any other value of ETHNIC                          */
      if ethnic = 1 then anglo = 1;
      else if ethnic in (2, 3, 4, 9) then anglo = 0;
      else anglo = .;
      if ethnic = 2 then hispan = 1;
      else if ethnic in (1, 3, 4, 9) then hispan = 0;
      else hispan = .;
      if ethnic = 3 then afam = 1;
      else if ethnic in (1, 2, 4, 9) then afam = 0;
      else afam = .;
      if ethnic = 4 then asianam = 1;
      else if ethnic in (1, 2, 3, 9) then asianam = 0;
      else asianam = .;
    run;
Now we can include the four new dummy variables in the multiple regression model:

    libname old 'a:\';
    libname library 'a:\';
    options nodate nonumber ps=66;
    proc reg data=old.somedata;
      model salary = educ age logofinc anglo hispan afam asianam;
      title1 'Regression with Dummy Variables';
    run;
These four dummy variables will behave as continuous variables. Notice that the number of dummies is one less than the total number of categories in the original variable. This is because the "All Other" category is already represented: cases in the "All Other" ethnic category are the observations whose values are ANGLO = 0, HISPAN = 0, AFAM = 0, and ASIANAM = 0. To create a fifth variable, OTHER, would be redundant and would create a problem of multicollinearity, which we will look at shortly.
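Before using the dummies in a model, it can help to confirm that the recodes came out as intended. The following is a minimal sketch, assuming the four dummies have been created in old.somedata as described; the title text is only illustrative.

```sas
/* Sketch: cross-tabulate the original ETHNIC codes against each dummy  */
/* to confirm the recodes. Assumes the dummies exist in old.somedata.   */
proc freq data=old.somedata;
  tables ethnic*(anglo hispan afam asianam) / missing norow nocol nopercent;
  title1 'Check of Dummy Variable Recodes';
run;
```

In each table, every case with the matching ETHNIC code should fall in the column for 1, and cases coded 9 ("All other") should fall in the column for 0 in all four tables.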
C. Additivity
If you recall the algebraic version of our multiple regression model,

    $Y_i = \alpha + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + \varepsilon_i$

you will remember that the individual terms are joined together by addition (+) signs. This is an additive model, in other words. What this means as far as cause and effect is concerned is that each independent variable has its own separate, individual influence on the dependent variable. The additive model does not recognize the influence of combinations of independent variables that may exist over and above the separate influence of those variables.
Let's use an analogy: Hydrogen and oxygen have chemical properties different from their combination, H2O. A regression model that had only H2 and O2 in it would not contain an H2O molecule. In the same way, the joint effect of $X_2$ and $X_3$ is carried by a separate product term, $b_4 X_2 X_3$.
In our example model, the specification states that education and parents' income have separate and independent (net) effects on respondents' annual salary. However, there are probably clusters of education-parent income connections that affect salary over and above the two separate variables. To capture such influences, and to avoid committing specification error, we can create an interaction term and add it to the model. Interaction terms for continuous variables are created by multiplication. Because of this, such variables are sometimes called "product" terms. For example, to create an interaction term for education and parents' income with SAS, the DATA step would be:

    libname old 'a:\';
    libname library 'a:\';
    data old.somedata;
      set old.somedata;
      educinc = educ*income;
    run;

The interaction term would then be added to the model:

    proc reg data=old.somedata;
      model salary = educ age income educinc;
      title1 'Regression with Interaction Term';
    run;
Symbolically, the multiple regression model now is:

    $Y_i = \alpha + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + b_4 X_{2i} X_{3i} + \varepsilon_i$

where $b_4$ is the multiple regression coefficient for the interaction term, $X_{2i} X_{3i}$. If this coefficient is statistically significant (as evaluated by a t-test), then we conclude that there is a joint influence over and above the separate influence of the two variables, $X_{2i}$ and $X_{3i}$.
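One way to read the interaction coefficient is to collect the terms that involve $X_{2i}$. The rearrangement below is a sketch of that algebra, writing the intercept as $\alpha$ and the error term as $\varepsilon_i$ (symbol choices assumed here for illustration):

```latex
\begin{aligned}
Y_i &= \alpha + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + b_4 X_{2i} X_{3i} + \varepsilon_i \\
    &= \alpha + b_1 X_{1i} + \left( b_2 + b_4 X_{3i} \right) X_{2i} + b_3 X_{3i} + \varepsilon_i
\end{aligned}
```

Written this way, the slope for $X_{2i}$ is $b_2 + b_4 X_{3i}$: it is no longer a single number but changes with the level of $X_{3i}$. When $b_4 = 0$ the expression collapses back to the purely additive model.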
D. Absence of Multicollinearity
When we discussed the creation of dummy variables, we mentioned that J - 1 dummies were created to avoid redundancy. Sometimes we have two or more independent variables in a multiple regression model that are, unknown to us, in reality measures of the same underlying phenomenon. They are seemingly different measures of the same thing (e.g., gender and education in a world in which men have all the advanced education and women have none). Having "advanced education" really means the same thing as being "male," and "lack of education" is really the same as being in the "female" category. This is known as multicollinearity. It results in extremely strong associations between two (or more) independent variables thought to be measures of different things.
Multicollinearity is identified by regressing each INDEPENDENT variable on all other INDEPENDENT variables. Model R-squares are then examined to see if any two (or more) are greater than 0.90. If so, the variables which are the dependent variable in the one model and an independent variable in the other model are said to be collinear. The process is easier than it sounds. For our model:

    proc reg data=temp2;
      model educ = age pincome;
      model age = educ pincome;
      model pincome = educ age;
      title1 'Test for Multicollinearity';
    run;
If Models 1 and 2 have R-squares greater than 0.90, this means that education and age are collinear. SAS now has two options that diagnose multicollinearity automatically. One is the creation of the "tolerance" statistic. Tolerance is simply $1 - R^2$, where $R^2$ comes from regressing that independent variable on all the others. Thus, two or more tolerance measures equal to or less than 0.10 (that is, $R^2 \geq 0.90$) indicate the presence of multicollinearity. To produce the tolerance measure, simply add the optional subcommand tol to the MODEL statement after the "/".
    proc reg data=old.somedata;
      model salary = educ age pincome / tol;
      title1 'Regression with Test for Multicollinearity';
    run;

Common solutions for the presence of multicollinearity include: (1) dropping one of the variables from the model; or (2) creating a new variable by combining the variables, such as in an interaction term or through factor analysis. Creating a new variable through factor analysis is probably preferable, provided that it makes sense substantively.
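For the diagnostic step, PROC REG also accepts a VIF option on the MODEL statement, which prints the variance inflation factor. The sketch below requests both statistics together; treating VIF as the second automatic diagnostic is an assumption here, since the text names only tolerance.

```sas
/* Sketch: request tolerance and the variance inflation factor together. */
/* VIF = 1 / tolerance, so tolerance <= 0.10 corresponds to VIF >= 10.   */
/* Using VIF alongside TOL is an addition, not taken from the text.      */
proc reg data=old.somedata;
  model salary = educ age pincome / tol vif;
  title1 'Regression with Tolerance and VIF';
run;
```

Because the two statistics are reciprocals, they always lead to the same conclusion; printing both simply makes the output easier to compare with sources that report one or the other.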
OLS REGRESSION RESULTS

    Model: MODEL1
    Dependent Variable: CRIMRATE   NUMBER OF SERIOUS CRIMES PER 1,000

                          Analysis of Variance

                            Sum of          Mean
    Source        DF       Squares        Square    F Value    Prob>F
    Model          3    7871.03150    2623.67717     11.126    0.0001
    Error         59   13912.52405     235.80549
    C Total       62   21783.55556

    Root MSE    15.35596     R-square    0.3613
    Dep Mean    44.44444     Adj R-sq    0.3289
    C.V.        34.55091

                          Parameter Estimates

                    Parameter      Standard     T for H0:
    Variable  DF     Estimate         Error   Parameter=0   Prob > |T|
    INTERCEP   1    14.482581   12.95814942         1.118       0.2683
    POLICEXP   1     0.772946    0.15818555         4.886       0.0001
    INCOMEPC   1     0.020073    0.03573539         0.562       0.5764
    STRESS74   1     0.005875    0.00800288         0.734       0.4658

                  Standardized
    Variable  DF      Estimate     Tolerance
    INTERCEP   1    0.00000000     .            Intercept
    POLICEXP   1    0.55792749    0.83163748    POLICE EXPENDITURES PER CAPITA
    INCOMEPC   1    0.05911456    0.97889220    INCOME PER CAPITA, IN $10S
    STRESS74   1    0.08431770    0.82285609    LONGTERM DEBT PER CAPITA, 1974
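The tolerance column in this output can be converted back to the R-square from regressing each independent variable on the others, since tolerance is 1 minus that R-square. The arithmetic below is a worked sketch using the printed tolerances, rounded to four decimals; the subscript $j$ is notation introduced here.

```latex
\begin{aligned}
R^2_j &= 1 - \text{tolerance}_j \\
\text{POLICEXP:}\quad R^2 &= 1 - 0.8316 = 0.1684 \\
\text{INCOMEPC:}\quad R^2 &= 1 - 0.9789 = 0.0211 \\
\text{STRESS74:}\quad R^2 &= 1 - 0.8229 = 0.1771
\end{aligned}
```

All three values are far below 0.90 (equivalently, every tolerance is far above 0.10), so by the rule of thumb given earlier this model shows no sign of multicollinearity.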
E. Normally Distributed Error Term
There are several assumptions underlying multiple regression analysis that involve the pattern of the distribution of the residuals ($\varepsilon_i$). You will recall that the residual is the error term, the unexplained variance, that is, the difference between the actual value $Y_i$ and the predicted value $\hat{Y}_i$, given the values of the variables in the model. One serious departure from normality is the presence of outliers. Outliers are data points that are extremely different from all the others. For example, if a random sample of cities from across the U.S. ranged in size from 25,000 to 500,000, with the single exception of New York, then New York would be an outlier. Its values on almost any variable would be vastly different from those of the other cities.
Outliers can be detected by requesting studentized residuals. These are like standard scores (i.e., z-scores) except that they are expressed as Student's t values, because each residual is divided by an estimate of its standard error. As a rule of thumb, any studentized residual greater than +3.00 or less than -3.00 is considered an outlier. The studentized residuals may be requested by simply adding the optional subcommand "r" to the MODEL statement after the "/":

    proc reg data=old.somedata;
      model salary = educ age pincome / r;
      title1 'Regression with Test for Outliers';
    run;
If one or more outliers are detected, the solution is to re-estimate the model, deleting the outlying observation(s). Both sets of results would be presented so that you can show results for all cases as well as separately for the cases that are most alike.
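The re-estimation step can be sketched as follows. This is one possible approach, not the only one: it saves the studentized residuals with the STUDENT= keyword of the OUTPUT statement, drops cases beyond the plus-or-minus 3.00 rule of thumb, and runs the model again. The data set names temp3 and temp4 and the variable name stresid are hypothetical.

```sas
/* Sketch: save studentized residuals, drop cases beyond +/- 3.00,     */
/* and re-estimate. temp3, temp4 and stresid are hypothetical names.   */
proc reg data=old.somedata;
  model salary = educ age pincome;
  output out=temp3 student=stresid;
  title1 'Regression on All Cases';
run;

data temp4;
  set temp3;
  /* keep only cases with a non-missing studentized residual within +/- 3.00 */
  if stresid ne . and abs(stresid) <= 3;
run;

proc reg data=temp4;
  model salary = educ age pincome;
  title1 'Regression with Outliers Deleted';
run;
```

Cases with a missing studentized residual are cases that had missing values on a model variable, so PROC REG would have excluded them from estimation in any event.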
F. Absence of Heteroscedasticity
A second assumption involving the distribution of residuals has to do with their variance. It is assumed that the variance of the residuals is constant throughout the range of the model. Since the residual reflects how accurately the model predicts actual Y values, constant variance of residuals means that the model is equally predictive in low, medium, and high values of the model (i.e., of $\hat{Y}_i$). If it is not, then this suggests that the dependent variable is explained better in some of the ranges of the model but not in others. The property of constant variance is called homoscedasticity. Its absence is called heteroscedasticity.
The presence of heteroscedasticity is detected by examination of a plot of the residuals against the predicted-Y values. This is easy to do. In SAS, you include an OUTPUT statement in the regression procedure in which you give names to the predicted Y-values (following p= ) and to the residuals (following r = ).
    libname old 'a:\';
    libname library 'a:\';
    proc reg data=old.somedata;
      model salary = educ age pincome;
      output out=temp5 p=yhat r=yresid;
      title1 'Regression Output for Heteroscedasticity Test';
    run;
    proc plot data=temp5;
      plot yresid*yhat = '*' / vref = 0;
      title1 'Plot of Residuals';
    run;
If heteroscedasticity is absent, the plot should look like this:

[Residual plot: residuals (vertical axis) against predicted Y-values (horizontal axis), scattered in a band of roughly constant width on both sides of the reference line at 0.0.]

Here there is no marked change in the magnitude of the residuals throughout the range of the model (defined by Y-hat values). Compare this with the following pattern:

[Residual plot: residuals (vertical axis) against predicted Y-values (horizontal axis), widely spread at low predicted Y-values and narrowing toward the reference line at 0.0 as predicted Y-values increase.]

Here, the residuals are smaller at greater predicted Y-values. That is, the residuals are not uniformly scattered about the reference line (at residual = 0). This means that the model does not give consistent predictions. This is heteroscedasticity. The solution is to create a weighting variable and to perform weighted least squares.
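A weighted regression can be sketched with the WEIGHT statement in PROC REG. The choice of weight is the hard part and depends on how the residual variance actually behaves; the example below simply ASSUMES, for illustration only, that the error variance is proportional to PINCOME and that PINCOME is positive. The names temp6 and wt are hypothetical.

```sas
/* Sketch of weighted least squares. ASSUMPTION (not from the text):    */
/* the error variance is proportional to PINCOME, and PINCOME > 0, so   */
/* each case is weighted by 1/PINCOME. temp6 and wt are hypothetical.   */
data temp6;
  set old.somedata;
  if pincome > 0 then wt = 1 / pincome;
  else wt = .;
run;

proc reg data=temp6;
  model salary = educ age pincome;
  weight wt;
  title1 'Weighted Least Squares Regression';
run;
```

Which variable the residual variance depends on, and in which direction, has to be judged from the residual plot and from knowledge of the data; the weight should be the inverse of whatever the variance is believed to be proportional to. After re-estimating, the residual plot should be examined again to see whether the pattern has been removed.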
Summary of Multiple Regression Assumptions
1. Absence of specification error
   - no relevant variables omitted
   - no irrelevant variables included
   - proper functional form (e.g., linear relationships)
2. Continuous variables
   - can use dummy variables as independent variables
3. Additivity
   - can create interaction (product) variables, if necessary
4. No multicollinearity
   - may need to combine two (or more) of the independent variables
5. Normally distributed error term
   - may need to repeat analysis without outlier(s)
6. No heteroscedasticity
   - may need to perform weighted regression