Regression Assumptions and Diagnostic Statistics


Slide 1: Regression Assumptions and Diagnostic Statistics
The purpose of this document is to demonstrate the impact of violations of regression assumptions on the diagnostic statistics and plots, both before and after the regression is computed. I created a simulated data set of 100 cases using SPSS's random number generation facility that contains variables with predefined statistical properties. The SPSS syntax file which produces this output can be found on the web page for downloading files. The following examples demonstrate what happens when a violation of an underlying regression assumption occurs. In an actual problem, the impact of a violation may be more or less severe than in the problem simulated here, so the visible impact on the diagnostic tests may be more or less apparent than shown here. In all of the examples, the violation of the assumption weakens the relationship between the set of independent variables and the dependent variable, and weakens the relationship between the individual independent variable and the dependent variable. Furthermore, the impact on the relationship is stronger when the variable violating the assumption is the dependent variable than when it is an independent variable.
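The simulated data set described above can be sketched in Python rather than SPSS. This is only an illustration: the means, standard deviations, coefficients, seed, and the helper name `simulate_cases` are all assumptions, since the original syntax file is not reproduced here.

```python
import random

def simulate_cases(n=100, seed=1):
    """Return n simulated cases as dicts with keys IV1, IV2, IV3, DV1."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        iv1 = rng.gauss(0.0, 1.0)               # normal metric predictor
        iv2 = rng.gauss(0.0, 1.0)               # second normal metric predictor
        iv3 = 1 if rng.random() < 0.5 else 0    # dichotomous predictor
        noise = rng.gauss(0.0, 0.3)             # normal error term
        # Assumed linear model; these coefficients are illustrative only.
        dv1 = 5.0 + 1.0 * iv1 + 0.5 * iv2 + 0.5 * iv3 + noise
        cases.append({"IV1": iv1, "IV2": iv2, "IV3": iv3, "DV1": dv1})
    return cases

cases = simulate_cases()
```

Fixing the seed makes the simulated cases reproducible, so every later variation (rounding, skewing, squaring) starts from the same baseline.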

Slide 2: Regression with All Assumptions Met
For the first problem, we will use a normally distributed dependent variable (DV1), two normally distributed independent variables (IV1 and IV2), and a dichotomous independent variable (IV3).

Slide 3: Assumption of Normality
The distributions and the tests of normality for the three metric variables are shown below. All three metric variables meet the statistical test for normality.

Slide 4: Assumption of Linearity
The following scatterplots indicate that we satisfy the linearity assumptions between the metric independent variables and the dependent variable.

Slide 5: Assumption of Homogeneity of Variance
For the dichotomous independent variable, the box plot and the homogeneity of variance test indicate that the assumption of constant variance is met.
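As a rough illustration of a homogeneity of variance check, the sketch below computes a Levene-type statistic from absolute deviations around each group mean. Whether this matches the exact test in the original output is an assumption, and the function name `levene_statistic` is hypothetical. Only the statistic is computed; a p-value would require the F distribution.

```python
def levene_statistic(groups):
    """Levene's W using absolute deviations from each group's mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every value from its own group mean.
    z = []
    for g in groups:
        m = sum(g) / len(g)
        z.append([abs(x - m) for x in g])
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    # Between-group and within-group sums of squares of the deviations.
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

equal_spread = levene_statistic([[1, 2, 3], [11, 12, 13]])
unequal_spread = levene_statistic([[1, 2, 3], [0, 10, 20]])
```

A value near zero means the groups have similar spread; larger values point toward unequal variances.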

Slide 6: Regression Results
When we run standard multiple regression with the three independent variables, we find that there is a very strong relationship between the dependent variable and the set of independent variables. Furthermore, each of the independent variables has a statistically significant relationship with the dependent variable.
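For readers without SPSS, a standard multiple regression and its R² can be sketched with the normal equations in plain Python. This is a minimal teaching sketch, not the procedure SPSS uses internally; the helper names `solve` and `ols` and the toy data are assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                     # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(xs, y):
    """Fit y = b0 + b1*x1 + ... by least squares; return (coefficients, R^2)."""
    rows = [[1.0] + list(r) for r in xs]               # prepend intercept column
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
beta, r2 = ols(xs, y)
```

With noise-free data the fit recovers the generating coefficients and R² is 1; adding error or violating an assumption lowers R², which is the pattern the later slides report.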

Slide 7: Residual Analysis
The plot of residuals is a null plot, i.e., it contains no pattern of nonlinearity and demonstrates constant variance across the predicted values of the dependent variable. The normality plot of the residuals supports a conclusion of normality. The partial plots show no evidence of a nonlinear relationship. In sum, all of the diagnostic statistics and plots support the conclusion that this analysis meets all of the assumptions for multiple regression.

Slide 8: Outliers and Influential Cases
No casewise plot was printed in the output, indicating that no case had a standardized residual beyond ±3.0, which would indicate an outlier on the dependent variable. This is verified in the table of residual statistics, which shows that the largest standardized residual is 1.870 and the smallest standardized residual is -1.856.
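The ±3.0 rule for standardized residuals is simple enough to express directly. The sketch below assumes residuals are supplied in case order, numbered from 1; the function name `flag_outliers` is hypothetical.

```python
def flag_outliers(std_residuals, cutoff=3.0):
    """Return (case_number, value) pairs whose |standardized residual| exceeds cutoff."""
    return [
        (case, z)
        for case, z in enumerate(std_residuals, start=1)
        if abs(z) > cutoff
    ]

# The extreme residuals reported for the all-assumptions-met model.
no_flags = flag_outliers([1.870, -1.856])
# A made-up example with one residual beyond the cutoff.
one_flag = flag_outliers([0.5, -3.2, 3.0])
```

An empty result corresponds to the situation described above, where no casewise plot is printed.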

Slide 9: Outliers and Influential Cases
In the table of extreme values for the probability of Mahalanobis D², we see that one case, 78, is potentially an outlier on the set of independent variables. The criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. With 100 cases and three independent variables, the criterion is 4/(100 - 3 - 1) = 0.042. Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see one case, 74, whose value for Cook's distance is right on the borderline for being considered an influential case.
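The Cook's distance criterion used throughout can be computed as follows. The function names and the example distances are illustrative assumptions; only the 4/(n - k - 1) rule and the values n = 100, k = 3 come from the slides.

```python
def cooks_cutoff(n, k):
    """Rule-of-thumb cutoff 4 / (n - k - 1) for Cook's distance."""
    return 4.0 / (n - k - 1)

def influential_cases(distances, n, k):
    """Return the case ids whose Cook's distance exceeds the cutoff."""
    cutoff = cooks_cutoff(n, k)
    return sorted(case for case, d in distances.items() if d > cutoff)

cutoff = cooks_cutoff(100, 3)
print(round(cutoff, 3))  # 0.042

# Made-up Cook's distances keyed by case number.
flagged = influential_cases({11: 0.30, 78: 0.01}, n=100, k=3)
```

Because the cutoff depends only on n and k, it stays at 0.042 for every model in this document.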

Slide 10: Regression with a Discrete Dependent Variable
We can round the values of the continuous dependent variable (DV1) to create a discrete dependent variable (DV2) that has a limited number of categories. In the following section, we will examine the impact that a discrete dependent variable has on the analysis. The statistical measures of the distribution of the dependent variable (mean, standard deviation, etc.) change only slightly.
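Rounding a continuous variable into a discrete one can be sketched as below. The simulated values, seed, and coefficients are assumptions and will not reproduce the numbers in the original output; the point is only that rounding collapses the variable to a few distinct values while its correlation with a predictor can still be computed and compared.

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(7)
iv1 = [rng.gauss(0.0, 1.0) for _ in range(100)]
dv1 = [5.0 + v + rng.gauss(0.0, 0.3) for v in iv1]   # continuous DV (assumed model)
dv2 = [float(round(v)) for v in dv1]                 # discrete DV: rounded to integers

r_continuous = pearson_r(iv1, dv1)
r_discrete = pearson_r(iv1, dv2)
```

Comparing `r_continuous` with `r_discrete` shows how much of the relationship is lost to the coarser measurement.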

Slide 11: Assumption of Normality
The distribution of the discrete dependent variable (DV2) looks close to a normal distribution, but fails the normality test. The normality tests for the metric independent variables are not changed.

Slide 12: Assumption of Linearity
The discrete dependent variable retains a linear relationship with the independent variables, but the distinctive banding for the limited number of values for a discrete variable is evident.

Slide 13: Assumption of Homogeneity of Variance
The use of the discrete dependent variable did not introduce any problem with homogeneity of variance for the nonmetric variable.

Slide 14: Differences in Correlations with the Continuous and Discrete Dependent Variable
In the correlation matrix, we can see that for all three independent variables, the relationship with the discrete dependent variable in the DV2 column is slightly smaller than the relationship with the continuous dependent variable in the DV1 column.

Slide 15: Regression Results
Consistent with weaker correlations between the discrete dependent variable and the independent variables, the results of the regression analysis in the tables below show that the R² value decreased from 0.951 to 0.906. Each of the individual independent variables retains its statistically significant relationship to the dependent variable.

Slide 16: Residual Analysis
The residual plot shows the impact of the discrete coding for the dependent variable, but otherwise has the same general shape as the null plot for the continuous dependent variable. The normality plot does not indicate any problem with normality. Similarly, the partial plots show evidence of the weaker relationships, but otherwise do not demonstrate any departure from linearity.

Slide 17: Outliers and Influential Cases
No casewise plot was printed in the output, indicating that no case had a standardized residual beyond ±3.0, which would indicate an outlier on the dependent variable. This is verified in the table of residual statistics, which shows that the largest standardized residual is 2.427 and the smallest standardized residual is -2.091.

Slide 18: Outliers and Influential Cases
In the table of extreme values for the probability of Mahalanobis D², we see that one case, 78, is potentially an outlier on the set of independent variables. The criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. With 100 cases and three independent variables, the criterion is 4/(100 - 3 - 1) = 0.042. Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see three cases, 21, 61, and 49, whose values for Cook's distance are right on the borderline for being considered influential cases.

Slide 19: Regression with a Skewed Dependent Variable
The following skewed distribution was created by randomly increasing the value of the original dependent variable (DV1) for about 20 of the cases in the original distribution, creating a new dependent variable, DV3.
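One way to produce this kind of skew is to double the value of the dependent variable for a random subset of cases, which is consistent with the later remark that the affected cases had their values doubled. The sketch below is an assumption-laden illustration: the mean and standard deviation of DV1, the seed, and the choice of exactly 20 cases are made up.

```python
import random

def skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(3)
dv1 = [rng.gauss(5.0, 1.0) for _ in range(100)]   # assumed normal DV
chosen = set(rng.sample(range(100), 20))          # 20 randomly selected cases
# Double the selected cases to stretch the upper tail.
dv3 = [2 * v if i in chosen else v for i, v in enumerate(dv1)]

skew_before = skewness(dv1)
skew_after = skewness(dv3)
```

Doubling only some positive values pulls those cases far above the rest, so `skew_after` is positive, matching the right-skewed histogram described in the slides that follow.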

Slide 20: Assumption of Normality
The histogram and normality plot show the impact of the skewing in the dependent variable. As we would expect when we introduce skewness in the variable, the K-S Lilliefors test would support a conclusion of nonnormality.

Slide 21: Assumption of Linearity
The scatterplots of the dependent variable with the metric independent variables retain their linear pattern, though the skewed cases spread upward, away from the rectangular band.

Slide 22: Assumption of Homogeneity of Variance
The box plot for the nonmetric independent variable shows the effects of skewing, i.e., the presence of extreme values, but neither the box plot nor the homogeneity of variance test indicates a problem with constant variance.

Slide 23: Differences in Correlations with the Normal and Skewed Dependent Variable
Skewing the dependent variable considerably weakens the relationships with the independent variables, as shown in the second column of the correlation matrix.

Slide 24: Regression Results
Similarly, the overall relationship between the set of independent variables and the dependent variable declines in strength from .951 to .474, but is still statistically significant. The relationships between two of the individual independent variables and the dependent variable remain significant. The individual relationship between the second metric independent variable and the dependent variable is no longer statistically significant.

Slide 25: Residual Analysis
The residual plot shows the funnel-shaped pattern associated with heteroscedasticity. At the left of the plot, the variance of the residuals is very limited, growing larger as we move to the right side of the plot. Heteroscedasticity, in this instance, is associated with skewing of the dependent variable. The normal probability plot of the residuals departs substantially from the green line of expected frequencies, indicating that the residuals are not normally distributed. The partial plots of the dependent variable with the metric independent variables also show the effects of skewing, but do not indicate any nonlinearity.

Slide 26: Outliers and Influential Cases
The casewise plot shows the presence of outliers on the dependent variable because of the skewing. If we examine case 11 in the data editor, we see that the formula for skewing the dependent variable increased the value of the dependent variable from 5.675 to 11.351, making it about three units larger than any other value for the dependent variable.

Slide 27: Outliers and Influential Cases
In the table of extreme values for the probability of Mahalanobis D², we continue to see that the same case, 78, is potentially an outlier on the set of independent variables. Thus far we have not changed the values for any independent variables. The criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. With 100 cases and three independent variables, the criterion is 4/(100 - 3 - 1) = 0.042. Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see that we have at least five cases that have a Cook's distance larger than the criterion of 0.042. Case 11, with the largest value for the dependent variable, has the largest Cook's distance. All of these cases had their value doubled to produce the skewing in the distribution, but they were not the only cases in the distribution that had the value of the dependent variable doubled.

Slide 28: Regression with a Skewed Independent Variable
The next sequence uses the normally distributed dependent variable (DV1) and substitutes a skewed version of the first independent variable (IV1S) for the original normally distributed independent variable (IV1).

Slide 29: Assumption of Normality
The histogram and the K-S Lilliefors test both indicate non-normality for the new independent variable.

Slide 30: Assumption of Linearity
The scatterplot on the left shows the original metric independent variable IV1. The scatterplot on the right shows the effect of skewing some values of the original IV1 variable. The main band of points is pushed somewhat to the left by the addition of larger skewed values for IV1S.

Slide 31: Assumption of Homogeneity of Variance
The homogeneity of variance assumption is not impacted by the change in the metric independent variable.

Slide 32: Differences in Correlations with the Normal and Skewed Independent Variable
The correlation matrix shows that the relationship between IV1S and DV1 (.749) is weaker than the relationship between IV1 and DV1 (.900), which we would expect for a variable whose relationship to the dependent variable is no longer linear.

Slide 33: Regression Results
The strength of the relationship measured by R² dropped from .951 to .910 due to the skewed independent variable, though the overall relationship between the independent variables and the dependent variable is still statistically significant. Each of the individual independent variables retained its statistically significant relationship to the dependent variable.

Slide 34: Residual Analysis
The scatterplot of residuals is still a null plot. The normality plot of the residuals indicates a normal distribution. The partial plot for the skewed independent variable shows evidence of nonlinearity. We introduced the nonlinearity when we changed the values of one variable in the relationship to produce the skewing while retaining the values of the other variable. Whether the nonlinearity is evident depends on the severity of the change we made. If we saw this nonlinear pattern in a partial plot, we might consider a transformation of this independent variable.

Slide 35: Outliers and Influential Cases
In this analysis, we reverted to the normally distributed dependent variable, so we would anticipate that our results would be similar to the prior results when we used the same form of the dependent variable. No casewise plot was printed in the output, indicating that no case had a standardized residual beyond ±3.0, which would indicate an outlier on the dependent variable. This is verified in the table of residual statistics, which shows that the largest standardized residual is 2.362 and the smallest standardized residual is -2.040.

Slide 36: Outliers and Influential Cases
In the table of extreme values for the probability of Mahalanobis D², we find additional cases that are potential outliers for the combination of the set of independent variables. We can attribute this to skewing one of the variables in the set of independent variables. The criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. With 100 cases and three independent variables, the criterion is 4/(100 - 3 - 1) = 0.042. Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see that we have two cases that have a Cook's distance larger than the criterion of 0.042. This is one more case than we had in the analysis with all normal variables, but fewer cases than we had when we skewed the dependent variable.

Slide 37: Regression with a Nonlinear Dependent Variable
To form a nonlinear dependent variable, I took the original dependent variable DV1 and squared it to produce DV5.

Slide 38: Assumption of Normality
This also has the effect of skewing the variable, as shown in the histogram below. The skewness produces a distribution that is not normally distributed, as shown in the normality plot and the K-S Lilliefors test.

Slide 39: Assumption of Linearity
The scatterplots of the metric independent variables with the nonlinear dependent variable show the nonlinear pattern in the dependent variable. At both ends of the fit lines, there are points above the line, but not below the line.
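The pattern of points lying above the fit line at both ends can be reproduced with a tiny noise-free example. This sketch assumes a predictor that takes the values 0 through 10 and a dependent variable that is exactly its square; the helper `fit_line` is hypothetical and the data are not those of the original analysis.

```python
def fit_line(x, y):
    """Simple least-squares line; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

x = [float(v) for v in range(11)]   # 0, 1, ..., 10
y = [v ** 2 for v in x]             # squared "dependent variable"
intercept, slope = fit_line(x, y)
# Residual = observed value minus value predicted by the straight line.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
```

The residuals are positive at the smallest and largest x values and negative in the middle, which is the curved residual pattern a straight-line fit leaves behind when the true relationship is quadratic.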

Slide 40: Assumption of Homogeneity of Variance
While some difference in the heights of the bars is visible in the boxplot, the statistical test does not indicate any difference in variance for the two groups on the nonmetric variable when we test the nonlinear dependent variable.

Slide 41: Differences in Correlations with the Normal and Nonlinear Dependent Variable
The correlations between the metric independent variables and the nonlinear dependent variable, DV5 in column 3, are smaller than the corresponding correlations with the linear form of the dependent variable, DV1 in column 2. The exception is the nonmetric variable, which had a higher correlation with the nonlinear dependent variable.

Slide 42: Regression Results
The coefficient of determination between the independent variables and the nonlinear dependent variable declined from its value for the linear dependent variable. The ANOVA test confirms that this R² is statistically larger than zero. The statistical tests for the individual coefficients indicated that all were statistically significant.

Slide 43: Residual Analysis
The uncorrected nonlinearity problem of the dependent variable is evident in the residual plot, which clearly shows a nonlinear pattern. The normality plot would support a conclusion that the residuals are normally distributed. The partial plots do not reflect the nonlinearity of the dependent variable.

Slide 44: Outliers and Influential Cases
When we squared the normally distributed dependent variable to introduce nonlinearity into the variable, the cases with the smallest (case 78) and largest (case 74) values of the dependent variable became outliers in the distribution and had the largest residual values.

Slide 45: Outliers and Influential Cases
In this analysis, we again used the original form of the independent variables, so we reverted to the circumstances where only a single case was a potential outlier on the combined set of independent variables. The criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. With 100 cases and three independent variables, the criterion is 4/(100 - 3 - 1) = 0.042. Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see that we have at least five cases that have a Cook's distance larger than the criterion of 0.042. These cases had either the largest or smallest values for the original dependent variable, such that squaring their value to produce the new dependent variable had the largest impact on their position in the distribution.

Slide 46: Regression with a Nonlinear Independent Variable
To form a nonlinear independent variable named IV1CUBE, I cubed the value of IV1, and entered it into a regression with the dependent variable DV1.

Slide 47: Assumption of Normality
The histogram, the normality plot, and the K-S Lilliefors test all indicate the lack of normality in the nonlinear variable 'IV1cube'.

Slide 48: Assumption of Linearity
The scattergram showing the curvilinear relationship between IV1CUBE and DV1 is shown on the left. The spread of the points above the center of the fit line is greater than the spread below the center of the fit line. The linearity of the relation between the normally distributed dependent variable DV1 and the normally distributed independent variable IV2 is unaffected by the change from IV1 to IV1cube.

Slide 49: Assumption of Homogeneity of Variance
The change from IV1 to IV1cube has no effect on the relationship between DV1 and IV3, the nonmetric homogeneous independent variable.

Slide 50: Differences in Correlations with the Normal and Nonlinear Independent Variable
The correlation of IV1CUBE with DV1 of .878 is not much smaller than the correlation of IV1 and DV1, suggesting that the curvature of IV1CUBE is slight.

Slide 51: Regression Results
The change in IV1CUBE was minimal, as we just noted. Consistent with this observation, our regression statistics are not appreciably different from the model in which IV1 had a linear relationship to the dependent variable.

Slide 52: Residual Analysis
The residual plot is very close to a null plot. There might be a slight curve to the plot associated with the three points in the lower left-hand corner with no points to the right. The nonlinear pattern in the partial plot for the IV1CUBE variable was expected. The pattern in the partial plot of DV1 and IV2 is similar to the original partial plot obtained with the linear form of IV1.

Slide 53: Outliers and Influential Cases
In this analysis, we reverted to the normally distributed dependent variable, so we would anticipate that our results would be similar to the prior results when we used the same form of the dependent variable. No casewise plot was printed in the output, indicating that no case had a standardized residual beyond ±3.0, which would indicate an outlier on the dependent variable. This is verified in the table of residual statistics, which shows that the largest standardized residual is 1.902 and the smallest standardized residual is -2.562.

Slide 54: Outliers and Influential Cases
In the table of extreme values for the probability of Mahalanobis D², we find additional cases that are potential outliers for the combination of the set of independent variables. We can attribute this to the nonlinear variable in the set of independent variables. The criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. With 100 cases and three independent variables, the criterion is 4/(100 - 3 - 1) = 0.042. Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see that we have at least five cases that have a Cook's distance larger than the criterion of 0.042. The cases with large Cook's distance values have either a very large value for the variable IV1CUBE or a very small value of IV1CUBE, relative to the other cases in the data set.

Slide 55: Regression with a Nonmetric Independent Variable with Unequal Subgroup Variance
A new nonmetric independent variable (IV3uneq) was created based on the original nonmetric independent variable (IV3). The difference between the variables is that some of the subjects were reassigned from one group to the other to make the variance of the two groups heterogeneous. Some of the subjects with larger deviations from the mean on the dependent variable DV1 were assigned to group 1, and some of the subjects with smaller deviations from the mean on the dependent variable DV1 were assigned to group 0.

Slide 56: Assumption of Normality
The distributions and the tests of normality for the three metric variables are not affected by the change in the nonmetric independent variable.

Slide 57: Assumption of Linearity
The check of linearity is not affected by the change in the nonmetric independent variable.

Slide 58: Assumption of Homogeneity of Variance
The results of this change are shown in the following boxplot and test of homogeneity of variance, where the variance in group 1 is much larger than the variance in group 0.

Slide 59: Differences in Correlations with the Nonmetric Independent Variable with Homogeneous and Heterogeneous Variance
The correlation between the homogeneous version of the IV3 variable and the dependent variable DV1 (.321) is higher than the correlation between IV3UNEQ and the dependent variable DV1 (.073).

Slide 60: Regression Results
The strength of the overall relationship between the dependent variable and the set of independent variables declined from an R² of .951 to an R² of .876, consistent with the decrease in the correlation for IV3UNEQ and DV1. All of the independent variables retained their statistically significant individual relationships with the dependent variable.

Slide 61: Residual Analysis
The residual plot looks very much like a null plot and the normality plot would support a conclusion of a normal distribution. The residual plots for both metric variables do not show any pattern of nonlinearity.

Slide 62: Outliers and Influential Cases
In this analysis, we reverted to the normally distributed dependent variable, so we would anticipate that our results would be similar to the prior results when we used the same form of the dependent variable. No casewise plot was printed in the output, indicating that no case had a standardized residual beyond ±3.0, which would indicate an outlier on the dependent variable. This is verified in the table of residual statistics, which shows that the largest standardized residual is 2.985 and the smallest standardized residual is -2.702.

Slide 63: Outliers and Influential Cases
In the table of extreme values for the probability of Mahalanobis D², we find additional cases that are potential outliers for the combination of the set of independent variables. The criterion for Cook's distance is 4/(n - k - 1), where n is the number of cases in the analysis and k is the number of independent variables. With 100 cases and three independent variables, the criterion is 4/(100 - 3 - 1) = 0.042. Applying this criterion to the values for Cook's distance in the table of Extreme Values, we see that we have at least five cases that have a Cook's distance larger than the criterion of 0.042. The method used to change the variance of the two groups on the IV3 variable, reassigning cases in the tails of the distribution of the dependent variable to a different group, contributed to the presence of influential cases.

Slide 64: Summary Table
The following table summarizes the changes that we have seen in our diagnostic plots and statistics with each change we have made to the dependent or independent variables:

