Presentation transcript: "Extending the relationships found in linear regression to a population is procedurally similar to what we have done for t-tests and chi-square tests."

Slide 1 (1/11/2016): Extending the relationships found in linear regression to a population is procedurally similar to what we have done for t-tests and chi-square tests. In regression, the null hypothesis is that there is no relationship between the dependent and independent variables. When there is no relationship, the predicted values for the dependent variable are the same for all values of the independent variable. For this to happen, the slope in the regression equation would have to be zero, i.e. estimated dependent variable = intercept + 0 x independent variable. The value of the independent variable would be multiplied by zero and would not change the prediction. The null hypothesis of no relationship therefore translates to slope = 0, or b = 0. Without a relationship, our best estimate of the value of the dependent variable is the mean of the dependent variable (best = smallest total error). The alternative hypothesis is that there is a relationship, i.e. knowing the value of the independent variable helps us do a more accurate job of predicting values of the dependent variable (more accurate = less total error). If we reject the null hypothesis, we interpret the strength and direction of the relationship for the population represented by the sample. If we fail to reject the null hypothesis, we find that the data does not support the research hypothesis.
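The logic on this slide can be sketched outside SPSS. The following Python fragment uses made-up data (not the world2007.sav file from the problem): a slope test of H0: b = 0, and the observation that under the null hypothesis the prediction collapses to the mean of the dependent variable.

```python
import numpy as np
from scipy import stats

# Hypothetical data standing in for an independent and a dependent variable
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=50)
y = 2.0 + 0.5 * x + rng.normal(0, 5, size=50)

# linregress performs the regression slope t-test: the null is slope = 0
res = stats.linregress(x, y)

# Under H0 the equation reduces to: estimate = intercept + 0 * x,
# so the best single prediction for every case is the mean of y
pred_under_h0 = np.full_like(y, y.mean())
print(res.slope, res.pvalue)
```

A small p-value leads us to reject b = 0 and interpret the relationship, exactly as the slide describes.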

Slide 2: To test the inference in linear regression, we are required to satisfy the conditions stated for linear regression (linearity, equal variance of the residuals, and an absence of outliers). In addition, to use the normal distribution to accurately compute probabilities for the statistical test, the distribution of the residuals must be normal. Support for the normality of the residuals mirrors the criteria used for the normality of the dependent variable in t-tests: the variables are normally distributed, or, if they are not, the sample size is large enough to apply the Central Limit Theorem. Since it is difficult to accurately evaluate the scatterplots to support equality of variance and normality of residuals, we introduce diagnostic statistical tests, which provide the same kind of numeric criteria for making decisions that we use in hypothesis tests. We will use the Breusch-Pagan test for evaluating equality of variance for the residuals and the Shapiro-Wilk test for normality. Diagnostic tests have a null hypothesis that the data meets the condition we are testing for, e.g. equality of variance or normality. Rejection of the null hypothesis implies that the condition is not satisfied.

Slide 3: Our objective in these tests is to fail to reject the null hypothesis, i.e. to conclude that the variance of the residuals is uniform or that the residuals are normally distributed. The goal is thus the opposite of what we hope to find in ordinary hypothesis tests. Our purpose is to assess or diagnose our data rather than to make inferences about the population. SPSS computes the Shapiro-Wilk test, but does not compute the Breusch-Pagan test. The script for Simple Linear Regression has been modified to include the Breusch-Pagan test in a table of statistics for homoscedasticity. The modified script file is named SimpleLinearRegressionInferenceTest.SBS and is available on the course web site. Due to the difficulties in running scripts, I have also provided a syntax file that computes the Breusch-Pagan statistic and probability. Syntax files do not usually have the same problems running on different versions of SPSS that we experience with script files, but their solutions are more cumbersome. Demonstration of the syntax file is included in this tutorial. The syntax file is named BreuschPaganSyntax.sps and is available on the course web site. There is an SPSS macro on the web for computing Breusch-Pagan, but I find that it does not produce correct answers (or at least not the answers produced by SAS and R). While I would usually set a more conservative alpha of 0.01 for diagnostic tests to make sure we only respond to serious violations, we will use 0.05 for this week's problems.

Slide 4: The introductory statement in the question indicates:
- The data set to use (world2007.sav)
- The task to accomplish (a regression slope t-test)
- The variables to use in the analysis: the independent variable slum population as percentage of urban population [slumpct] and the dependent variable infant mortality rate [infmort]
- The alpha level of significance for the hypothesis test: 0.05
- The criteria for evaluating strength: Cohen's criteria

Slide 5: These problems also contain a second paragraph of instructions that provides the formulas to use if the analysis requires us to re-express or transform a variable to satisfy the conditions for linear regression.

Slide 6: The first statement asks about the level of measurement. The t-test of a regression slope requires that both the dependent variable and the independent variable be quantitative.

Slide 7: Since both the independent variable slum population as percentage of urban population [slumpct] and the dependent variable infant mortality rate [infmort] are quantitative, we mark the check box for a correct answer.

Slide 8: The next statement asks about the size of the sample. To answer this question, we run the linear regression in SPSS.

Slide 9: To compute a simple linear regression, select Regression > Linear from the Analyze menu.

Slide 10: First, move the dependent variable, infmort, to the Dependent text box. Second, move the independent variable, slumpct, to the Independent(s) list box. Third, click on the Statistics button to request basic descriptive statistics.

Slide 11: First, in addition to the defaults marked by SPSS, mark the check box for Descriptives so that we get the number of cases used in the analysis. Second, click on the Casewise diagnostics check box to produce the table with information about outliers and influential cases. Third, click on the Continue button to close the dialog box.

Slide 12: Next, click on the Plots button to request the residual plot.

Slide 13: First, move *ZRESID (for standardized residuals) to the Y axis text box. Second, move *ZPRED (for standardized predictions) to the X axis text box. Third, mark the check boxes for a histogram and a normal probability plot of the residuals. Fourth, click on the Continue button to close the dialog box.

Slide 14: Next, click on the Save button to include Cook's distance in the output.

Slide 15: Mark the check box for Cook's Distances to include this value in the data view and the output. Mark the check box for Standardized Residuals, which we will need in the test for the condition of normality of the residuals. Click on the Continue button to close the dialog box.

Slide 16: Click on the OK button to request the output.

Slide 17: The number of cases with valid data to analyze the relationship between "slum population as percentage of urban population" and "infant mortality rate" was 99, out of the total of 192 cases in the data set.

Slide 18: The number of cases with valid data to analyze the relationship between "slum population as percentage of urban population" and "infant mortality rate" was 99, out of the total of 192 cases in the data set. Mark the check box for a correct statement.

Slide 19: The next statement asks us to determine whether or not the data for the variables satisfy the conditions required for linear regression. Making inferences about the population based on linear regression requires four conditions or assumptions: a linear relationship between the variables, equal variance of the residuals across the predicted values, no outliers or influential cases distorting the relationship, and a normal distribution for the residuals.

Slide 20: To evaluate the linearity condition, we create a scatterplot. To create the scatterplot, select Legacy Dialogs > Scatter/Dot from the Graphs menu.

Slide 21: In the Scatter/Dot dialog box, we click on Simple Scatter as the type of plot we want to create. Click on the Define button to go to the next step.

Slide 22: First, move the dependent variable infmort to the Y axis text box. Second, move the independent variable slumpct to the X axis text box. Third, click on the OK button to produce the plot.

Slide 23: The scatterplot appears in the SPSS output window. To facilitate our determination about the linearity of the plot, we will add a linear fit line, a loess fit line, and a confidence interval to the plot. See slides 8 through 18 in the PowerPoint titled SimpleLinearRegression-Part2.ppt for directions on adding the fit lines and confidence interval to the plot.

Slide 24: The criterion we use for evaluating linearity is a comparison of the loess fit line to the linear fit line. If the loess fit line falls within a 99% confidence interval around the linear fit line, we characterize the relationship as linear. Minor fluctuations over the boundaries of the confidence interval are ignored. The loess fit line in the scatterplot of the relationship between "slum population as percentage of urban population" and "infant mortality rate" does not lie within the confidence interval around the linear fit line. The pattern of points in the scatterplot shows an obvious curve, indicating non-linearity. We will re-express one or both variables if they are badly skewed to see if the relationship using the transformed variables satisfies the assumption of linearity.

Slide 25: Since we did not satisfy the linearity condition, the statement is not marked. We do not need to test the other conditions, since we know we will not meet all of them. We will re-express one or both variables if they are badly skewed to see if the relationship using the transformed variables satisfies the assumption of linearity.

Slide 26: When the raw data does not satisfy the conditions of linearity and equal variance, we examine the skewness of the variables to identify problematic skewing for one or both variables that might be corrected with re-expression. We re-express variables that have skewness equal to or less than -1.0 or equal to or greater than +1.0. This statement suggests that the correct transformation is a log of infant mortality rate.
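As an illustration of the ±1.0 rule, the screening can be sketched in Python with simulated values (not the actual world2007 variables). Note that SPSS reports an adjusted Fisher-Pearson skewness coefficient, which differs slightly from scipy's default, so the numbers would not match SPSS exactly.

```python
import numpy as np
from scipy import stats

def needs_reexpression(values, cutoff=1.0):
    """Flag a variable whose skewness is <= -1.0 or >= +1.0."""
    g = stats.skew(values)
    return bool(g <= -cutoff or g >= cutoff)

# Simulated stand-ins: a badly right-skewed variable and a symmetric one
rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=3.0, sigma=0.8, size=99)
symmetric = rng.normal(loc=50, scale=15, size=99)

print(needs_reexpression(skewed))     # badly skewed, so re-express
print(needs_reexpression(symmetric))  # within -1.0 to +1.0, so leave as is
```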

Slide 27: We will use the Descriptives procedure to obtain skewness for both variables. Select Descriptive Statistics > Descriptives from the Analyze menu.

Slide 28: First, move the variables infmort and slumpct to the Variable(s) list box. Second, click on the Options button to specify our choice for statistics.

Slide 29: Next, mark the check boxes for Kurtosis and Skewness in addition to the defaults marked by SPSS. Finally, click on the Continue button to close the dialog box.

Slide 30: Click on the OK button to produce the output.

Slide 31: The skewness for "infant mortality rate" [infmort] was 1.470. The skewness for "slum population as percentage of urban population" [slumpct] was -0.178. Since the skew for the dependent variable "infant mortality rate" [infmort] (1.470) was equal to or greater than +1.0, we attempt to correct the violation of assumptions by re-expressing "infant mortality rate" on a logarithmic scale. Since the skew for the independent variable "slum population as percentage of urban population" [slumpct] (-0.178) was between -1.0 and +1.0, we do not attempt to correct the violation of assumptions by re-expressing it.

Slide 32: Since the skew for the dependent variable "infant mortality rate" [infmort] (1.470) was equal to or greater than +1.0, we attempt to correct the violation of assumptions by re-expressing "infant mortality rate" on a logarithmic scale. We mark the statement as correct.

Slide 33: The next statement asks us to determine whether or not the data using the re-expressed variable satisfy the conditions required for linear regression. We check to see if the re-expressed variables satisfy the four conditions or assumptions required to make inferences about the population based on linear regression: a linear relationship between the variables, equal variance of the residuals across the predicted values, no outliers or influential cases distorting the relationship, and a normal distribution for the residuals.

Slide 34: We first create the transformed variable, the logarithm of infmort. Select the Compute Variable command from the Transform menu.

Slide 35: First, type the name for the re-expressed variable in the Target Variable text box. Second, type the formula in the Numeric Expression text box. The directions for the problem give us the formulas for transforming "infant mortality rate": "LG10(infmort)" and "(infmort)**2". Since the skew is positive, we use the logarithmic transformation, LG10(infmort). Third, click on the OK button to compute the transformation.
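The SPSS LG10 function is a base-10 logarithm. As a sketch with simulated positive values (standing in for infmort, not the real data), the re-expression and its effect on skewness look like this:

```python
import numpy as np
from scipy import stats

# Simulated right-skewed, strictly positive values standing in for infmort
# (a log transform requires all values to be positive)
rng = np.random.default_rng(2)
infmort_like = rng.lognormal(mean=3.0, sigma=0.8, size=99)

# LG10(infmort) in the SPSS Compute Variable dialog = base-10 logarithm
lg_infmort = np.log10(infmort_like)

# The log pulls in the long right tail, reducing the positive skew
print(stats.skew(infmort_like), stats.skew(lg_infmort))
```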

Slide 36: Next, we create the scatterplot for the relationship with the re-expressed variable. To create the scatterplot, select Legacy Dialogs > Scatter/Dot from the Graphs menu.

Slide 37: In the Scatter/Dot dialog box, we click on Simple Scatter as the type of plot we want to create. Click on the Define button to go to the next step.

Slide 38: First, move the dependent variable LG_infmort to the Y axis text box. Second, move the independent variable slumpct to the X axis text box. Third, click on the OK button to produce the plot.

Slide 39: The scatterplot looks linear, but to make sure we will add fit lines and a confidence interval. The criterion we use for evaluating linearity is a visual comparison of the loess fit line to the linear fit line. If the loess fit line falls within the 99% confidence interval around the linear fit line, we characterize the relationship as linear. Minor fluctuations within the confidence interval or over its boundary are ignored.

Slide 40: The loess fit line in the scatterplot of the relationship between "slum population as percentage of urban population" and the log transformation of "infant mortality rate" lies within the confidence interval around the linear fit line. The relationship is sufficiently linear to satisfy the assumption of linearity.

Slide 41: We next run the regression analysis using the transformed variable, creating the residual plot and the normality plot in the process. To compute a simple linear regression, select Regression > Linear from the Analyze menu.

Slide 42: First, move the dependent variable, LG_infmort, to the Dependent text box. Second, move the independent variable, slumpct, to the Independent(s) list box. Third, click on the Statistics button to request basic descriptive statistics.

Slide 43: First, in addition to the defaults marked by SPSS, mark the check box for Descriptives so that we get the number of cases used in the analysis. Second, click on the Casewise diagnostics check box to produce the table with information about outliers and influential cases. Third, click on the Continue button to close the dialog box.

Slide 44: Next, click on the Plots button to request the residual plot and the normality plot.

Slide 45: First, move *ZRESID (for standardized residuals) to the Y axis text box. Second, move *ZPRED (for standardized predictions) to the X axis text box. Third, mark the check boxes for a histogram and a normal probability plot of the residuals. Fourth, click on the Continue button to close the dialog box.

Slide 46: Next, click on the Save button to include Cook's distance in the output.

Slide 47: Mark the check box for Cook's Distances to include this value in the data view and the output. Mark the check box for Standardized Residuals, which we will need to test for the condition of normality of the residuals. Click on the Continue button to close the dialog box.

Slide 48: Click on the OK button to request the output.

Slide 49: The criterion we use for evaluating equal variance is a visual inspection of the residual plot to determine whether the horizontal pattern of the points is more rectangular or more funnel-shaped, i.e. narrowly spread at one end of the plot and widely spread at the other end. If the plot of the residuals is more rectangular, the assumption of equal variance is satisfied. If the plot of the residuals is more funnel-shaped, the assumption of equal variance is not satisfied.

Slide 50: Because it is often difficult to distinguish whether the pattern of the points is rectangular or funnel-shaped, we will supplement the evaluation of equal variance with a diagnostic statistical test: the Breusch-Pagan test. The Breusch-Pagan statistic tests the null hypothesis that the variance of the residuals is the same for all values of the independent variable. When the probability of the Breusch-Pagan statistic is less than or equal to alpha, we reject the null hypothesis, supporting a finding that the variance of the residuals is not constant, and we do not satisfy the equal variance assumption.
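Because SPSS has no built-in Breusch-Pagan procedure, it may help to see the idea spelled out. The sketch below implements the studentized (Koenker) form of the test for a single predictor on simulated data; the course script and syntax file may compute a different variant, so treat this only as an illustration of the core idea: regress the squared residuals on the predictor and refer n times R-squared to a chi-square distribution.

```python
import numpy as np
from scipy import stats

def breusch_pagan(x, y):
    """Studentized Breusch-Pagan test for one predictor (sketch).

    Fit y on x, regress the squared residuals on x, and refer the
    Lagrange multiplier statistic n * R^2 to chi-square with 1 df.
    """
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u2 = (y - X @ beta) ** 2                      # squared residuals
    gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
    fitted = X @ gamma
    r2 = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    lm = n * r2
    return lm, stats.chi2.sf(lm, df=1)

rng = np.random.default_rng(3)
x = rng.uniform(0, 100, size=200)
# Equal variance: residual spread does not depend on x
y_eq = 1.0 + 0.3 * x + rng.normal(0, 4, size=200)
# Unequal variance: residual spread grows with x (funnel shape)
y_fun = 1.0 + 0.3 * x + rng.normal(0, 1, size=200) * (0.5 + 0.05 * x)

print(breusch_pagan(x, y_eq))   # p should usually exceed alpha here
print(breusch_pagan(x, y_fun))  # strong funnel, so p should be small
```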

Slide 51: Download the syntax file, BreuschPaganSyntax.SPS, from the course web site. To use the syntax file, select the Open > Syntax command from the File menu.

Slide 52: Highlight the syntax file, BreuschPaganSyntax.SPS. Click on the Open button to open the syntax file.

Slide 53: The file opens in the SPSS Syntax Editor. The syntax file uses the Data Editor to store its results, creating a number of additional variables. The DELETE commands remove these extra variables. If the syntax is run when these variables do not yet exist, SPSS will issue warning messages, which have no real consequence. If the file is run more than once without the DELETE commands, SPSS will generate warning messages that it will not replace variables that were previously created, and we may not be looking at the correct results for our problem. We need to replace the names for the dependent and independent variables. Highlight the text for dependentVariableName.

Slide 54: Type the name of the dependent variable, LG_infmort. Then highlight the text for independentVariableName.

Slide 55: First, replace the highlighted text with the name of the independent variable. Entering the names of the variables is all that we need to change. Second, to execute the commands in the syntax file, select All from the Run menu. Note: be careful that the period at the end of each command line is not deleted.

Slide 56: Since we had not run the syntax file before, SPSS produces a warning message for each of the variable names on the DELETE commands. It thinks we are asking it to delete a variable that does not exist, and it wants to let us know. These warning messages have no consequence.

Slide 57: The syntax file added all of these variables (and more to the left) to the Data Editor. The syntax file omits cases with missing data from the analysis.

Slide 58: The variable bp contains the Breusch-Pagan statistic, and the column bpSig contains the p-value for the statistic. The interpretation of equal variance based on visual inspection of the residual plot is supported by the Breusch-Pagan statistic of 3.300 with a probability of p = .069, greater than the alpha of p = .050. The null hypothesis is not rejected, and the assumption of equal variance is supported. Having satisfied the condition for equal variance, we next check for influential cases.

Slide 59: Outliers and influential cases can alter the regression model that would otherwise represent the majority of cases in the analysis. SPSS will save Cook's distances as a measure of influence to the Data Editor so we can identify cases that have a large Cook's distance. We will operationally define a large Cook's distance as a value of 0.5 or more. When we ran the regression using LG_infmort as the dependent variable, we requested that Cook's distances be saved to the Data Editor and that our output include Casewise diagnostics. In the table titled "Residuals Statistics", we see that the maximum Cook's distance was .152, less than the criterion of 0.5. In this problem, there were no cases with a Cook's distance of 0.5 or greater that would qualify as influential cases. Since we have no outliers or influential cases, we will test the final criterion, normality of the residuals.
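The Cook's distances that SPSS saves can be reproduced by hand from the standard formula. This sketch (simulated data, not the course data set) computes the distance for every case of a simple regression and applies the 0.5 cutoff used in these slides:

```python
import numpy as np

def cooks_distances(x, y):
    """Cook's distance for each case in a simple linear regression.

    D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2, where p = 2 parameters
    (intercept and slope) and h_i is the leverage of case i.
    """
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    hat = X @ np.linalg.inv(X.T @ X) @ X.T   # projection ("hat") matrix
    h = np.diag(hat)                         # leverages
    resid = y - hat @ y                      # residuals e_i
    p = X.shape[1]
    mse = np.sum(resid ** 2) / (n - p)
    return resid ** 2 / (p * mse) * h / (1 - h) ** 2

# Simulated well-behaved data: no case should approach the 0.5 cutoff
rng = np.random.default_rng(4)
x = rng.uniform(0, 100, size=99)
y = 2.0 + 0.02 * x + rng.normal(0, 0.3, size=99)

d = cooks_distances(x, y)
print(d.max(), (d >= 0.5).sum())
```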

Slide 60: The linear regression model expects the residuals to have a normal distribution. The distribution of the residuals is evaluated with the normality plot, which compares the points for the actual distribution of the cases to a diagonal line that represents the expected pattern for a normally distributed variable. If the points deviate substantially and consistently from the diagonal line, the residuals are not normally distributed. Minor fluctuations around the line or at either end of the line can be ignored. In this problem, the plot of standardized regression residuals follows the diagonal, indicating that the residuals are normally distributed.

Slide 61: Because it is often difficult to distinguish whether or not points deviate substantially and consistently from the diagonal line, we will supplement the evaluation of normality of residuals with a diagnostic statistical test: the Shapiro-Wilk test. The Shapiro-Wilk statistic tests the null hypothesis that the distribution of the residuals is normal. When the probability of the Shapiro-Wilk statistic is less than or equal to alpha, we reject the null hypothesis, supporting a finding that the residuals are not normally distributed, and we do not satisfy the assumption of normality.
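Outside SPSS's Explore procedure, the same Shapiro-Wilk test is available in scipy. This sketch uses simulated residuals (not the saved ZRE values from the actual analysis) to show the decision logic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
normal_resid = rng.normal(0, 1, size=99)       # residuals from a clean fit
skewed_resid = rng.exponential(1.0, size=99)   # clearly non-normal residuals

# Null hypothesis: the residuals are normally distributed;
# reject normality when the p-value is less than or equal to alpha
w_norm, p_norm = stats.shapiro(normal_resid)
w_skew, p_skew = stats.shapiro(skewed_resid)
print(p_norm, p_skew)
```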

Slide 62: The normality tests are part of the Explore procedure. Select Descriptive Statistics > Explore from the Analyze menu.

Slide 63: The normal condition requires that the residuals be normally distributed. We saved standardized residuals when we ran the regression. The correct choice is the standardized residuals from the second analysis (ZRE_2), in which we used the transformed variable, LG_infmort. If we had satisfied the regression conditions without re-expressing the data, we would not have run the second regression and would have selected ZRE_1 to test for normality. Move the variable ZRE_2 to the Dependent List. The normality statistical tests are included with the plots, so we click on the Plots button.

Slide 64: Mark the check box for Normality plots with tests. Click on the Continue button to close the dialog box.

Slide 65: Click on the OK button to produce the output.

Slide 66: The interpretation of normal residuals is supported by the Shapiro-Wilk statistic of 0.989 with a probability of p = .612, greater than the alpha of p = .050. The null hypothesis is not rejected, and the assumption of normal residuals is supported.

Slide 67: We have satisfied all four of the conditions for making inferences based on linear regression. Mark the check box for a correct answer.

Slide 68: When the p-value for the statistical test is less than or equal to alpha, we reject the null hypothesis and interpret the results of the test. If the p-value is greater than alpha, we fail to reject the null hypothesis and do not interpret the result.

Slide 69: The p-value for this test (p < .001) is less than or equal to the alpha level of significance (p = .050), supporting the conclusion to reject the null hypothesis.

Slide 70: The p-value for this test (p < .001) is less than or equal to the alpha level of significance (p = .050), supporting the conclusion to reject the null hypothesis. Mark the question as correct. Rejection of the null hypothesis supports the research hypothesis, and we interpret the results.

Slide 71: Since we know that we re-expressed the data to satisfy the conditions for linear regression, we skip the question that interprets the raw variables.

Slide 72: The final question focuses on the strength and direction of the relationship.

Slide 73: The strength of the relationship is based on the multiple R statistic in the Model Summary table. Applying Cohen's criteria for effect size (less than ±0.10 = trivial; ±0.10 up to ±0.30 = weak or small; ±0.30 up to ±0.50 = moderate; ±0.50 or greater = strong or large), the relationship was correctly characterized as a strong relationship (R = .795). Note: in SPSS output, the R statistic is always positive, so it does not show the direction of the relationship. The direction of the relationship is based on the b coefficient.
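Cohen's criteria as stated above are fixed cutoffs on the absolute value of R, so they can be captured in a small helper (the function name is my own, for illustration):

```python
def cohen_strength(r):
    """Classify a correlation using Cohen's criteria from the slides:
    below 0.10 trivial; 0.10 up to 0.30 weak; 0.30 up to 0.50 moderate;
    0.50 or greater strong. The sign carries direction, not strength."""
    r = abs(r)
    if r < 0.10:
        return "trivial"
    if r < 0.30:
        return "weak"
    if r < 0.50:
        return "moderate"
    return "strong"

print(cohen_strength(0.795))  # the R reported in this problem -> "strong"
```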

Slide 74: Since the sign of the b coefficient was positive (b = .01), the relationship is positive and the values for the variables move in the same direction. Higher scores on the variable "slum population as percentage of urban population" were associated with higher scores on the log transformation of "infant mortality rate".

Slide 75: The strength and direction of the relationship were both correctly stated. The question is marked as correct.

Slide 76: Logic outline for homework problems.
Both variables are quantitative? Yes: mark the statement check box. No: do not mark the check box; mark only "None of the above" and stop.
Number of valid cases stated correctly? Yes: mark the statement check box. No: do not mark the check box.

Slide 77: Regression conditions.
Relationship between variables is linear (linear pattern in scatterplot)? Variance of residuals is homogeneous (residual plot and Breusch-Pagan test)? No outliers impacting the regression solution (Cook's distance < 0.5)? Residuals are normally distributed (normality plot and Shapiro-Wilk test, or Central Limit Theorem)?
Yes to all: mark the check box for regression conditions. No to any: do not mark the check box.

Slide 78: Skew of variables ≤ -1.0 or ≥ +1.0?
Yes: re-express the badly skewed variables. No: do not mark the re-expression check box and stop; with no skewed variables, we do not have a strategy for meeting the conditions.
If we have already satisfied the regression conditions, the question on re-expressing data is skipped.

Slide 79: Conditions checked again with the re-expressed data.
Relationship between variables is linear (linear pattern in scatterplot)? Variance of residuals is homogeneous (residual plot and Breusch-Pagan test)? No outliers impacting the regression solution (Cook's distance < 0.5)? Residuals are normally distributed (normality plot and Shapiro-Wilk test, or Central Limit Theorem)?
Yes to all: mark the check box for regression conditions. No to any: do not mark the check box and stop; we cannot meet the conditions.
If we satisfied the regression conditions without re-expressing, we do not re-express and do not check these conditions.

Slide 80: Reject H0 is the correct decision (p ≤ alpha)?
Yes: mark the statement check box. No: do not mark the check box and stop; we interpret results only if we reject the null hypothesis.
Interpretation is stated correctly? Yes: mark the statement check box. No: do not mark the check box. The interpretation is stated for both the raw data and the re-expressed data.

