Testing Assumptions of Linear Regression




1 Testing Assumptions of Linear Regression
The four assumptions of regression The assumption of linearity The assumption of homoscedasticity The assumption of independence The assumption of normality Logic for testing assumptions

2 Assumptions of regression
The text lists four assumptions for multiple regression:
1. The relationship is linear
2. The errors have the same variance
3. The errors are independent of each other
4. The errors are normally distributed
When we satisfy the assumptions, it means that we have used all of the information available from the patterns in the data. When we violate an assumption, it usually means that there is a pattern to the data that we have not included in our model, and we could actually find a model that fits the data better.

3 Assumptions of regression
There are two strategies for testing the conformity of a particular relationship to the assumptions:
1. Examining the degree to which the variables satisfy the criteria, e.g. normality and linearity, before the regression is computed, by plotting relationships and computing diagnostic statistics
2. Studying plots of residuals and computing diagnostic statistics after the regression has been computed
This presentation takes the second approach.

4 Assumption of linearity - 1
This is an example of a strong linear relationship. The red lowess (loess in SPSS) smoother is almost completely straight throughout the range of the data. The rate of change in the dependent variable is the same for all values of the independent variable.

5 Assumption of linearity - 2
This is an example of a non-linear relationship. The red lowess (loess in SPSS) smoother has been added to show the curvature of the relationship. At lower values of fertrate, the change in the dependent variable per unit change in the independent variable is larger; at higher values of fertrate, the change in the dependent variable is smaller than it was at lower values. In algebraic terminology, we would characterize this as a quadratic relationship because the equation that fits it includes the square of the independent variable, e.g. y = a + bx + cx²

6 Assumption of linearity - 3
In algebraic terminology, we would characterize this as a quadratic relationship because the equation that fits it includes the square of the independent variable, e.g. pctchild = a + b(fertrate) + c(fertrate²)
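Outside SPSS, a quadratic model of this form can be fit with ordinary polynomial least squares. The sketch below uses simulated data in Python; the names fertrate and pctchild echo the slides, but the values and coefficients are invented for illustration.

```python
import numpy as np

# Simulated data following a quadratic pattern; the variable names
# mirror the slides, but these numbers are invented.
rng = np.random.default_rng(0)
fertrate = np.linspace(1, 8, 50)
pctchild = 5 + 6 * fertrate - 0.4 * fertrate**2 + rng.normal(0, 0.5, 50)

# Fit pctchild = a + b*fertrate + c*fertrate^2.
# np.polyfit returns coefficients highest power first: [c, b, a].
c, b, a = np.polyfit(fertrate, pctchild, deg=2)
print(f"a = {a:.2f}, b = {b:.2f}, c = {c:.2f}")
```

A negative c coefficient reproduces the slowing growth the smoother reveals: the curve rises quickly at low values of the independent variable and flattens at higher values.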

7 Assumption of linearity - 4
This is an example of a pattern that is more ambiguous. The red lowess (loess in SPSS) smoother shows that there is slight curvature, but the blue regression line appears to fit the data about as well. For ambiguous situations, a statistical test can be helpful by providing clearer guidance on whether we should accept the linear regression model of this relationship or try to find a better model.

8 Assumption of linearity - 5
The lack of fit test is a hypothesis test of whether the pattern between the variables is linear, or whether a higher-order polynomial term (square or cube) is needed in the regression equation. The null hypothesis for this test is: a linear model is appropriate. The alternative hypothesis is: a linear model is not appropriate. Failure to reject Ho satisfies the assumption.
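In SPSS the test comes from the Univariate GLM procedure (shown next), but the computation itself can be sketched in Python. The data below are invented for illustration, and the test requires repeated values of the independent variable so a pure-error term can be formed.

```python
import numpy as np
from scipy import stats

# Toy data with replicated x values (needed for the pure-error term).
# The values are invented and follow a roughly linear trend.
x = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5], dtype=float)
y = np.array([2.1, 1.9, 4.2, 3.8, 5.8, 6.4, 8.2, 7.8, 9.9, 10.1])

# Ordinary least squares line and its total error sum of squares.
b, a = np.polyfit(x, y, 1)
sse = np.sum((y - (a + b * x))**2)

# Pure error: variation of y around the mean of y at each distinct x.
levels = np.unique(x)
sspe = sum(np.sum((y[x == v] - y[x == v].mean())**2) for v in levels)
sslf = sse - sspe                      # lack-of-fit sum of squares

# F test: lack of fit (m - 2 df) against pure error (n - m df).
m, n = len(levels), len(x)
f_stat = (sslf / (m - 2)) / (sspe / (n - m))
p_value = stats.f.sf(f_stat, m - 2, n - m)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

Here the large p-value means the null hypothesis that a linear model is appropriate is not rejected, the same decision logic applied to the SPSS output below.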

9 Testing linearity with the lack of fit test - 1
The Lack of fit test is available in the procedure for Univariate General Linear Models. Select General Linear Model > Univariate… from the Analyze menu.

10 Testing linearity with the lack of fit test - 2
First, move the dependent variable, pctchild, to the Dependent Variable list box. Second, move the independent variable, gdpagric, to the Covariates list box. Third, click on the Options button to add the Lack of fit test to the output.

11 Testing linearity with the lack of fit test - 3
First, mark the check box to display the Lack of fit test. Second, click on the Continue button to close the dialog box.

12 Testing linearity with the lack of fit test - 4
When we return to the Univariate dialog box, we click on OK to obtain the output.

13 Testing linearity with the lack of fit test - 5
In the lack of fit test, the probability of the F test statistic (F = 0.87) was p = .742, greater than the alpha level for diagnostic tests (p = .010). The null hypothesis that "a linear regression model is appropriate" is not rejected. The research hypothesis that "a linear regression model is not appropriate" is not supported by this test. The assumption of linearity is satisfied.

14 Assumption of homogeneity of errors - 1
This assumption has a variety of names: homogeneity of variance, homoscedasticity, uniform variance, etc. All of these refer to the pattern of the errors, or residuals, when plotted against the predicted values. If the assumption is met, the pattern of residuals will have about the same spread on either side of a horizontal line drawn through the average residual. Both residuals and predicted values are standardized or studentized to make the plot easier to understand.

15 Assumption of homogeneity of errors - 2
This is an example of homogeneous residuals. The vertical spread of the residuals is approximately the same as you look from left to right on the plot. The height is the spread of the residuals. Homogeneous residuals are also referred to as uniform or homoscedastic.

16 Assumption of homogeneity of errors - 3
This is an example of heterogeneous residuals. The vertical spread of the residuals is narrower on the left side of the chart and wider on the right side of the chart. At the left end of this chart, the residuals fall within the -2 and +2 lines. At the right end of this chart, the residuals fall outside the -2 and +2 lines. Heterogeneous residuals are also referred to as non-uniform or heteroscedastic.

17 Assumption of homogeneity of errors - 4
The regression line assumes the variance is uniform across scores, so its predictions at one end are consistently too high, and at the other end are consistently too low. Regression will predict scores that fall throughout the band between -2 and +2. For higher values of the dependent variable, the estimates will be consistently too small. For lower values of the dependent variable, the estimates will be consistently too large.

18 Assumption of homogeneity of errors - 5
The plot of the residuals in this chart is more ambiguous. The spread is consistent except for the upper right-hand corner of the chart. Again, a statistical test can be helpful by providing clearer guidance for ambiguous situations.

19 Assumption of homogeneity of errors - 6
The Breusch-Pagan test of homoscedasticity is a hypothesis test of whether the pattern of the residuals is consistent across the range of predicted values. The null hypothesis for this test is: the variance of the residuals is the same for all values of the independent variable. The alternative hypothesis is: the variance of the residuals is different for some values of the independent variable. Failure to reject Ho satisfies the assumption.
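The slides use an SPSS script for this test; as a rough cross-check, the Koenker (studentized) form of the Breusch-Pagan statistic can be sketched in Python. Everything below, data and variable names alike, is simulated for illustration, not the World Factbook variables.

```python
import numpy as np
from scipy import stats

def breusch_pagan(x, y):
    """Koenker's studentized Breusch-Pagan test for one predictor.
    Returns (LM statistic, p-value). A sketch, not the SPSS script."""
    n = len(x)
    b, a = np.polyfit(x, y, 1)          # step 1: OLS of y on x
    e2 = (y - (a + b * x))**2           # squared residuals
    bb, aa = np.polyfit(x, e2, 1)       # step 2: OLS of e^2 on x
    fitted = aa + bb * x
    r2 = 1 - np.sum((e2 - fitted)**2) / np.sum((e2 - e2.mean())**2)
    lm = n * r2                         # LM = n * R^2 ~ chi-square(1)
    return lm, stats.chi2.sf(lm, df=1)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y_hom = 3 + 2 * x + rng.normal(0, 1, 200)        # uniform variance
y_het = 3 + 2 * x + rng.normal(0, 1, 200) * x    # variance grows with x
lm1, p1 = breusch_pagan(x, y_hom)
lm2, p2 = breusch_pagan(x, y_het)
print(f"homoscedastic p = {p1:.3f}, heteroscedastic p = {p2:.6f}")
```

Regressing the squared residuals on the predictor is the heart of the test: any systematic trend in the squared residuals signals non-uniform variance, and a small p-value rejects the null hypothesis of homoscedasticity.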

20 Testing homogeneity of error variance
The Breusch-Pagan test for homogeneity of variance is not available in SPSS. I have created an SPSS script file that you can use to compute it. To download the script, navigate to the course web page and click on the link SPSS Scripts and Syntax Files.

21 Testing homogeneity of error variance
On the page that opens from the link, follow the directions to download the script file to your computer. Right click on the Breusch-Pagan & Koenker Test.SBS link and choose Save Link As from the pop-up menu.

22 Testing homogeneity of error variance
Note the directory where you save the file, so we can open it in SPSS. Click on the Save button to save the script on your computer.

23 Testing homogeneity of error variance
To use the script, select Run Script from the Utilities menu.

24 Testing homogeneity of error variance
Open the directory where you saved the file and click on the name of the script. Click on the Run button to run the script on your computer.

25 Testing homogeneity of error variance
The script dialog opens in a new window. First, move the dependent variable, pctchild, from the list of variables to the Dependent variable (DV) list box. Second, move the independent variable, gdpagric, to the Independent variables (IV) list box. Third, click on the OK button to run the test.

26 Testing homogeneity of error variance
The output from the script will be located in the SPSS output window. It is ready when the message "All calculations are complete." appears in the Feedback text box. The script does not close itself. To close it, click on the Cancel button.

27 Testing homogeneity of error variance
The homogeneity of error variance is tested with the Breusch-Pagan test. For this analysis, the probability of the Breusch-Pagan statistic was p = .397, which was greater than the alpha level for diagnostic tests (p = .010). The null hypothesis that "the variance of the residuals is the same for all values of the independent variable" is not rejected. The research hypothesis that "the variance of the residuals is different for some values of the independent variable" is not supported. The assumption of homogeneity of error variance is satisfied.

28 Assumption of independence of errors - 1
This assumption concerns another type of systematic error in the residuals that is produced by estimated values of the dependent variable that are correlated from one case to the next, i.e. serially correlated. This violation occurs most frequently when the independent variable is time, though it can occur if the cases are sorted on one of the predictors. Though it rarely happens in our type of problem, it is easy to check for it because SPSS computes the Durbin-Watson statistic that we need for the test.

29 Assumption of independence of errors - 2
This is an example of a relationship that fails the Durbin-Watson test. The relationship between these two variables is also very weak. In the center of the chart, points fall on both sides of the regression line. On the right side of the chart, the regression line is consistently and sequentially above the actual scores. On the left side of the chart, the regression line is consistently and sequentially below the actual scores.

30 Testing independence of errors - 1
The Durbin-Watson statistic is produced by the regression procedure. Select Regression > Linear… from the Analyze menu.

31 Testing independence of errors - 2
Move the dependent and independent variables to their list boxes. Click on the Statistics button to specify the Durbin-Watson statistic.

32 Testing independence of errors - 3
Mark the check box for the Durbin-Watson statistic. Click on the Continue button to close the dialog box.

33 Testing independence of errors - 4
We will need the studentized residuals for the test of normality, so we can save them to the data editor now. Click on the Save button to specify variables to be saved to the data editor.

34 Testing independence of errors - 5
Mark the check box for Studentized residuals. Click on the Continue button to close the dialog box.

35 Testing independence of errors - 6
Click on the OK button to produce the output.

36 Testing independence of errors - 7
The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 to 2.50. The Durbin-Watson statistic for this problem is 2.10, which falls within the acceptable range from 1.50 to 2.50. The assumption of independence of errors is satisfied.
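The statistic itself is simple to compute from the residuals in case order; a Python sketch with simulated residuals (not the SRE_1 values from this analysis) is shown below.

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    return np.sum(np.diff(e)**2) / np.sum(e**2)

rng = np.random.default_rng(2)

# Independent residuals: the statistic should be near 2.
independent = rng.normal(0, 1, 500)
dw_ind = durbin_watson(independent)

# Positively autocorrelated residuals push the statistic toward 0.
auto = np.empty(500)
auto[0] = independent[0]
for t in range(1, 500):
    auto[t] = 0.9 * auto[t - 1] + independent[t]
dw_auto = durbin_watson(auto)

print(f"independent: {dw_ind:.2f}, autocorrelated: {dw_auto:.2f}")
```

Values near 0 indicate positive serial correlation and values near 4 indicate negative serial correlation, which is why the rule of thumb treats the band around 2 as acceptable.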

37 Testing independence of errors - 8
The studentized residuals are saved to the data editor as SRE_1. We will use these in the next test.

38 Assumption of normality of errors
The final assumption for regression is the assumption that the residuals are normally distributed. For individual variables, we have used the +/- 1.0 rule of thumb for assessing normality. Though we could use this for testing residuals, we will use the Shapiro-Wilk test of normality. The null hypothesis for this test is: the distribution of the residuals is normal. The alternative hypothesis is: the distribution of the residuals is not normal. Failure to reject Ho satisfies the assumption.
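SPSS runs this test through the Explore procedure (next slides); the same test is available in scipy, sketched here on simulated residuals rather than the saved SRE_1 variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Stand-in for studentized residuals; these draws are invented.
residuals = rng.normal(0, 1, 150)
w, p = stats.shapiro(residuals)          # Shapiro-Wilk W and p-value

# A clearly non-normal comparison: a strongly skewed sample.
skewed = rng.exponential(1.0, 150)
w2, p2 = stats.shapiro(skewed)

print(f"normal-ish: W = {w:.3f}, p = {p:.3f}; skewed: p = {p2:.6f}")
```

A p-value above the diagnostic alpha of .010 means the null hypothesis of normality is not rejected; for the skewed sample the test rejects decisively.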

39 Testing normality of errors -1
The Explore procedure tests and plots statistics for evaluating normality. Select Descriptive Statistics > Explore… from the Analyze menu.

40 Testing normality of errors - 2
Move the variable for studentized residuals to the Dependent List box. Click on the Plots button to request the normality plot.

41 Testing normality of errors - 3
Mark the check box for Normality plots with tests. I prefer to clear the other check boxes and option buttons to limit the volume of output. Click on the Continue button to close the dialog box.

42 Testing normality of errors - 4
Click on the OK button to produce the output.

43 Testing normality of errors - 5
The Shapiro-Wilk test of studentized residuals yielded a test statistic of 0.989, which had a probability of p = .144, which was greater than the alpha level for diagnostic tests (p = .010). The null hypothesis that "the distribution of the residuals is normal" is not rejected. The research hypothesis that "the distribution of the residuals is not normal" is not supported. The assumption of normality of errors is satisfied.

44 Sample homework problem: Simple linear regression – testing assumptions
Based on information from the data set 2001WorldFactbook.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that the assumptions of linear regression are satisfied. Use .05 for alpha in the regression analysis and .01 for the diagnostic tests. Simple linear regression revealed a strong, positive relationship between "percent of the GDP from agriculture" [gdpagric] and "percent of the population aged 0-14 years" [pctchild] (β = 0.625, t(197) = 11.23, p < .001). Countries that had a higher percentage of GDP from agriculture had a higher percentage of the population between the ages of 0 and 14. The accuracy of predicting scores for the dependent variable "percent of the population aged 0-14 years" will improve by approximately 39% if the prediction is based on scores for the independent variable "percent of the GDP from agriculture" (r² = 0.390). The assumptions of regression analysis are satisfied. True True with caution False Incorrect application of a statistic The general framework for the problems in the homework assignment on testing assumptions in simple linear regression has only two differences from the problems on simple linear regression.

45 Sample homework problem: Simple linear regression – testing assumptions
Based on information from the data set 2001WorldFactbook.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that the assumptions of linear regression are satisfied. Use .05 for alpha in the regression analysis and .01 for the diagnostic tests. Simple linear regression revealed a strong, positive relationship between "percent of the GDP from agriculture" [gdpagric] and "percent of the population aged 0-14 years" [pctchild] (β = 0.625, t(197) = 11.23, p < .001). Countries that had a higher percentage of GDP from agriculture had a higher percentage of the population between the ages of 0 and 14. The accuracy of predicting scores for the dependent variable "percent of the population aged 0-14 years" will improve by approximately 39% if the prediction is based on scores for the independent variable "percent of the GDP from agriculture" (r² = 0.390). The assumptions of regression analysis are satisfied. True True with caution False Incorrect application of a statistic First, we set a lower probability for the diagnostic tests used to test assumptions. This implies that we are only identifying problems when they are severe. Second, the problem requires that all of the assumptions be satisfied in order for the statement to be true. The steps for making the decision about each assumption were shown in the sections above on each assumption. Note that we are using a different data set.

46 Logic for simple linear regression: Level of measurement
Since testing assumptions is added to the end of solving simple linear regression problems, we will repeat the steps from last week's assignment. Measurement level of independent variable? Nominal Interval/Ordinal/Dichotomous Inappropriate application of a statistic Measurement level of dependent variable? Interval/Ordinal Nominal/Dichotomous Strictly speaking, the test requires an interval level variable. We will allow ordinal level variables with a caution. Inappropriate application of a statistic

47 Logic for simple linear regression: Sample size requirement
This is also a slide repeated from last week's assignment. Compute linear regression including descriptive statistics Valid cases satisfies computed requirement? No The sample size requirement is the larger of: 8 x the number of independent variables + 50, or the number of independent variables + 105 Yes Inappropriate application of a statistic

48 Logic for simple linear regression: Significant, non-trivial relationship
This is also a slide repeated from last week's assignment. Probability for t-test of B coefficient less than or equal to alpha? No Yes False Effect size (Multiple R) is not trivial by Cohen's scale, i.e. equal to or larger than 0.10? No In simple linear regression, r and Beta have the same numeric value as Multiple R, but may have a different sign. They are also measures of effect size. Yes False

49 Logic for simple linear regression: Strength of relationship
This is also a slide repeated from last week's assignment. Strength of relationship (effect size) correctly interpreted based on Multiple R? No Yes False

50 Logic for simple linear regression: Direction of the relationship
This is also a slide repeated from last week's assignment. Direction of relationship correctly interpreted based on B or Beta coefficient? No Yes False

51 Logic for simple linear regression: Proportional reduction in error
This is also a slide repeated from last week's assignment. Reduction in error correctly interpreted based on Multiple R²? No Yes False The statistics in the SPSS output match all of the statistics cited in the problem? No Add caution if dependent or independent variable is ordinal. Yes False Next we will begin testing assumptions.

52 Logic for simple linear regression: Testing the assumption of linearity
This is the start of the new slides. Use Univariate GLM to compute the lack of fit test. Probability for F-test for lack of fit greater than the alpha for diagnostic tests? No Yes False

53 Logic for simple linear regression: Testing the assumption of homoscedasticity
Use the script for the Breusch-Pagan Test. Probability for Breusch-Pagan statistic greater than the alpha for diagnostic tests? No Yes False

54 Logic for simple linear regression: Testing the assumption of independence
Request the Durbin-Watson statistic in a regression. Size of the Durbin-Watson statistic between 1.5 and 2.5? No Yes False

55 Logic for simple linear regression: Testing the assumption of normality
Use the Explore procedure to request the normality test. Probability for Shapiro-Wilk test statistic greater than the alpha for diagnostic tests? No Yes False True

