
1 Regression Diagnostics
SRM 625 Applied Multiple Regression, Hutchinson

2 Before interpreting your regression results, you should use various diagnostic techniques to examine your data for potential problems that could affect your findings.

3 Types of possible problems
Assumption violations
Outliers and influential cases
Multicollinearity

4 Regression Assumptions
Error-free measurement
Correct model specification
Assumptions about residuals

5 Assumption that variables are measured without error
Presence of measurement error in Y leads to an increase in the standard error of estimate.
If the standard error of estimate is inflated, what happens to the F test for R²? (Hint: think about the relationship between the standard error and mean square error.)

6 In a bivariate regression, measurement error in X always leads to underestimation of the regression coefficient.
What are the implications of this for interpreting results regarding X?

7 What are the possible consequences of measurement error when one or more IVs have poor reliability in a multiple regression model?

8 Evidence to assess violation of the assumption of error-free measurement:
Reliability estimates for your independent and dependent variables
What would constitute "acceptable" reliability?

9 How might you attempt to minimize violation of the assumption during the design and planning phase of your study?

10 Assumption that the regression model has been correctly specified
Linearity
Inclusion of all relevant independent variables
Exclusion of irrelevant independent variables

11 Assumption of Linearity
Violation of this assumption can lead to downward bias of regression coefficients.
If variables are curvilinearly related, there are methods for dealing with the curvilinearity; these require the use of multiple regression and transformation of variables.
Note: we will discuss methods for addressing nonlinear relationships later in the course.

12 Detecting nonlinearity
In the bivariate case, you can examine scatterplots of X and Y, but this is not sufficient in multiple regression.
In multiple regression, you can examine partial regression plots between each IV and the DV, controlling for the other IVs (see the sketch below); however, residuals plots are primarily used.
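A minimal sketch of partial regression (added-variable) plots using Python's statsmodels, with simulated data standing in for a real dataset; the variables and coefficients here are hypothetical, and later sketches in this deck reuse the `X`, `y`, `Xc`, and `model` defined below.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Hypothetical simulated data: three IVs and a DV built from two of them.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                    # IVs only
y = 2 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=n)

Xc = sm.add_constant(X)                        # IVs plus an intercept column
model = sm.OLS(y, Xc).fit()

# Partial regression plots: each IV against the DV, controlling for the
# other IVs in the model.
sm.graphics.plot_partregress_grid(model)
plt.show()
```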

13 Residuals plots
These typically involve scatterplots with standardized, studentized, or unstandardized residuals plotted against the predicted values of Y (i.e., versus Ŷ), as in the sketch below.
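A minimal residuals-versus-predicted plot, reusing the fitted `model` from the sketch above:

```python
import matplotlib.pyplot as plt

# Studentized residuals plotted against predicted Y; look for a broad
# horizontal band of points with no discernible pattern.
resid = model.get_influence().resid_studentized_internal
plt.scatter(model.fittedvalues, resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Y")
plt.ylabel("Studentized residual")
plt.show()
```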

14 A residuals scatterplot should reflect a broad horizontal band of points (i.e., it should look like the scatterplot for r = 0). If the plot forms some type of pattern, it could indicate an assumption violation; specifically, nonlinearity would appear as a curve in the plot.

15 Sample residuals plot. Does this appear to reflect a correlation of 0?

16 Sample partial regression plot

17 Assumption that all important independent variables have been included
If omitted variables are correlated with variables in the equation, violation of this assumption can lead to biased parameter estimates (i.e., incorrect values of regression coefficients). This is a fairly serious violation.

18 Violation can also lead to non-random residuals (i.e., residuals that include systematic variance associated with the omitted variables).
If the omitted variables are not correlated with variables in the model, parameter estimates are not biased, but the standard errors associated with the independent variables are biased upward (i.e., inflated).

19 For example, suppose job satisfaction is regressed on salary alone. The error term then includes autonomy, task enjoyment, working conditions, etc. Therefore, if autonomy, task enjoyment, etc. are correlated with job satisfaction, the residuals (which reflect autonomy, task enjoyment, etc.) would be correlated with predicted job satisfaction.

20 How do we determine if this assumption is violated?
You can examine residuals plots: again, plot the residuals against the predicted values of Y and hope to see a broad horizontal band of points. If the plot reflects some type of discernible pattern, e.g., a linear pattern, it could suggest omitted variables.

21 What can you do if it appears you have violated this assumption?

22 How might we attempt to prevent violation of this assumption?

23 Assumption that no irrelevant independent variables have been included
Inclusion of irrelevant variables will lead to inflated standard errors for the regression coefficients (not just those corresponding to the irrelevant variables).
What effect could this have on conclusions you draw about the contributions of your independent variables?

24 How can you determine if you have violated this assumption?

25 What might you do to avoid this potential assumption violation?

26 Assumptions about errors
Residuals have a mean of zero
Residuals are random
Residuals are normally distributed
Residuals have equal variance (i.e., homoscedasticity)

27 Residuals (or errors) are random
Residuals should be uncorrelated with both Y and predicted Y.
Residuals should be uncorrelated with the independent variables.
Residuals should be uncorrelated with one another; this is comparable to the independence-of-observations assumption. What this means is that the reason for prediction error for one person should be unrelated to the reason for prediction error for another person.

28 If this assumption is violated, tests of significance cannot be trusted; the F and t tests are not robust to violations of this assumption.
This assumption is most likely to be violated in longitudinal studies, when important variables have been left out of the equation, or when observations are clustered (e.g., when subjects are sampled from intact groups or in cluster sampling).

29 Residuals are normally distributed
Residuals are assumed to be normally distributed around the regression line for all values of X. This is analogous to the normality assumption in a t-test or ANOVA.

30 Illustration of data that violate the assumption of normality

31 Normal probability plot of residuals
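A plot like this can be produced with statsmodels' qqplot; a minimal sketch, reusing the fitted `model` from the earlier example:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Normal probability (Q-Q) plot of the residuals; points falling near the
# 45-degree line are consistent with normally distributed errors.
sm.qqplot(model.resid, line="45", fit=True)
plt.show()
```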

32 Residuals have equal variance
Residuals should be evenly spread around the regression line. This is known as the assumption of homoscedasticity.
It is the same as the assumption of homogeneity of variance in ANOVA, but with equal variances on Y for each value of X.

33 Illustration of homoscedastic data

34 Illustration of heteroscedasticity

35 Further evidence of heteroscedasticity and nonnormality

36 Why is violation of the homoscedasticity assumption a problem?

37 What can you do if your data are heteroscedastic?
You can use weighted least squares (WLS) instead of ordinary least squares (OLS) as your estimation procedure, as in the sketch below.
WLS weights each case so that cases with larger error variances receive less weight (in OLS each case receives a weight of 1).
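A minimal WLS sketch in statsmodels, using new simulated data in which, by construction, the error standard deviation grows with x; each case is then weighted by the reciprocal of its error variance (the data and weighting scheme are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

# Heteroscedastic data: the error SD is proportional to x.
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=200)
yh = 3 + 2 * x + rng.normal(scale=x)

Xh = sm.add_constant(x)
ols_fit = sm.OLS(yh, Xh).fit()
wls_fit = sm.WLS(yh, Xh, weights=1.0 / x**2).fit()  # weight = 1 / variance

print(ols_fit.bse)  # coefficient standard errors under OLS
print(wls_fit.bse)  # standard errors under WLS
```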

38 Outliers and Influential Cases
Influential observations
Leverage
Extreme on both X and Y

39 What is an outlier? A case with an extreme value of Y. The presence of outliers can be detected by examining the residuals.

40 Types of residuals used in outlier detection
Standardized residuals
Studentized residuals
Studentized deleted residuals

41 Standardized Residuals
Unstandardized residuals that have been converted to z-scores.
Not recommended by some because their calculation assumes that all residuals have the same variance (as measured by the overall Sy.x).

42 Studentized Residuals
Similar to standardized residuals, but a different standard deviation is used for each residual.
Generally more sensitive than standardized residuals.
Follow an approximate t distribution.

43 Studentized Deleted Residuals
Studentized deleted residuals are the same as studentized residuals, except that the case with the extreme value is removed from the calculation.
This addresses a potential problem with studentized residuals, which include the outlier in their calculation (thus increasing the risk of an inflated standard error). A sketch of all three types follows.
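All three residual types can be obtained in statsmodels; a minimal sketch, reusing `model` from the first example (the `influence` object is reused in later sketches):

```python
import numpy as np

influence = model.get_influence()

standardized = model.resid / np.sqrt(model.mse_resid)  # residuals as z-scores
studentized = influence.resid_studentized_internal     # per-case SD, case included
deleted = influence.resid_studentized_external         # per-case SD, case deleted

# A common screen: flag cases whose studentized deleted residual exceeds |2|.
print(np.where(np.abs(deleted) > 2)[0])
```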

44 Comparing the three types of residuals

45 Leverage
Reflects cases with extreme values on one or more of the independent variables. Such cases may or may not exert influence on the equation.

46 How does one identify cases with high leverage?
SPSS produces values of leverage (h), which can range between 0 and 1.
One "rule of thumb" suggests h > 2(k + 1)/N as a high leverage value; a sketch implementing it follows.
Another rule of thumb is that h ≤ .2 indicates trivial leverage, whereas values above .5 suggest substantial leverage requiring further examination.
Other researchers recommend looking at relative differences among the leverage values.
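A minimal leverage check, reusing `model`; the cutoff implements the 2(k + 1)/N rule of thumb above:

```python
import numpy as np

# Leverage values are the diagonal of the hat matrix.
h = model.get_influence().hat_matrix_diag
k = model.df_model                       # number of IVs
cutoff = 2 * (k + 1) / model.nobs        # rule-of-thumb threshold

print(np.where(h > cutoff)[0])           # cases with high leverage
```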

47 Leverage example (based on 3 IVs, N = 171)

48 Mahalanobis distance (D²)
A method for detecting multivariate outliers, i.e., cases with unexpected combinations of values on the independent variables.
D² represents the distance of a case from the centroid of the remaining cases, where the centroid is the intersection of the means of all the variables.
One rule of thumb suggests that high values are those exceeding the critical χ² value with degrees of freedom equal to the number of IVs in the model. A sketch follows.
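A minimal D² sketch with numpy and scipy, reusing the IV matrix `X` from the first example; it measures distance from the full-sample centroid (a common simplification of the leave-one-out definition) and compares against the χ² critical value with df equal to the number of IVs:

```python
import numpy as np
from scipy import stats

# Mahalanobis D² for each case in the n x k IV matrix X.
diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

cutoff = stats.chi2.ppf(0.999, df=X.shape[1])  # chi-square critical value, alpha = .001
print(np.where(d2 > cutoff)[0])                # flagged multivariate outliers
```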

49 Mahalanobis D² example (note: model based on 6 IVs)

50 It should be noted that just because a case is an outlier and/or exhibits high leverage does not necessarily mean it is influential.

51 Influential Observations
Influential observations tend to be outliers on both X and Y (although they do not have to be).
They are considered influential because their presence (or absence) makes a difference in the regression equation: coefficients, R², etc. tend to change depending on whether or not the influential observations are in the sample.

52 How are influential cases identified?
DFBETAs
Cook's D

53 DFBeta
Represents the estimated change in an unstandardized regression coefficient when a particular case is deleted. (Standardized values of DFBeta can also be requested.)
There will be a DFBeta value for each IV and for each subject/participant.
Larger values indicate greater influence exerted by a particular case. One common rule of thumb flags standardized values > 2/√N. A sketch follows.
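A minimal DFBETAS sketch, reusing `model` and `influence` from the earlier sketches; the 2/√N cutoff is the Belsley et al. rule of thumb for standardized values:

```python
import numpy as np

dfbetas = influence.dfbetas               # shape: (n cases, k + 1 coefficients)
cutoff = 2 / np.sqrt(model.nobs)          # rule of thumb for standardized values

flagged = np.where(np.abs(dfbetas) > cutoff)
print(list(zip(*flagged)))                # (case, coefficient) pairs to inspect
```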

54 Cook's D
A measure of influence that flags observations which might be influential due to their values on one or more X's, Y, or a combination.
One rule of thumb is to consider values of Cook's D > 1 as indicating potential influence; another is to look for "gaps" between the largest D values and the rest. A sketch follows.
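A minimal Cook's D sketch, reusing `influence`; the values are reused in the refit sketch later in the deck:

```python
import numpy as np

cooks_d, _ = influence.cooks_distance    # statsmodels also returns p-values
print(np.where(cooks_d > 1)[0])          # rule of thumb: D > 1 is influential
print(np.sort(cooks_d)[-5:])             # inspect the largest values for "gaps"
```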

55 Cook's D example

56 If cases are identified as outliers, high leverage cases, or potentially influential observations, what should you do with them? Keep or drop?

57 General Recommendations
Identify cases which are outliers on Y, checking first for coding errors.
Identify cases which are outliers on X, again checking for coding errors.
Identify points that are flagged as potentially influential.

58 For those cases flagged as potentially influential, run the regression analysis with and without those points (deleting one at a time) to see what effect they have on the regression results, as in the sketch below.
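A minimal sketch of this with-and-without comparison, reusing `y`, `Xc`, `model`, and `cooks_d` from the earlier sketches; here the three largest Cook's D cases stand in for whatever cases your diagnostics flag:

```python
import numpy as np
import statsmodels.api as sm

flagged = np.argsort(cooks_d)[-3:]        # e.g., the three largest Cook's D values
for i in flagged:
    keep = np.ones(len(y), dtype=bool)
    keep[i] = False                       # delete one case at a time
    refit = sm.OLS(y[keep], Xc[keep]).fit()
    print(i, refit.params - model.params) # change in coefficients without case i
```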

59 What will you look for? How will you decide what to do with the outlying case(s)?

60 Regardless of whether or not an outlier is influential, you should attempt to find out the reasons for such extreme scores. How might you do that, and why?

61 Collinearity
In general, collinearity refers to overlap, or correlation, between 2 independent variables.
In the extreme case, the 2 variables are identical, i.e., in a scatterplot the observations for the 2 variables would fall exactly on the same line.
Multicollinearity refers to collinearity among more than 2 variables.

62 Collinearity, cont'd
Redundancy and repetitiveness are two related concepts.
Redundancy indicates two variables that are telling us something similar but which may or may not represent the same concept.
Repetitiveness occurs when the researcher includes more than 1 measure of the same construct; in this case, it might be preferable to test the variables as a set rather than as individual variables.

63 Effects of Collinearity
Can produce misleading regression results, e.g., where 2 (highly correlated) independent variables correlate similarly with the dependent variable, but only one is statistically significant in the multiple regression.
Can lead to underestimates of regression coefficients.
Can inflate standard errors of regression coefficients: standard errors are at a minimum when the IVs are completely uncorrelated, and when r = 1 between 2 or more IVs, standard errors cannot be computed at all (the determinant of the matrix is 0, so the matrix cannot be inverted).

64 Detection of Collinearity
Bivariate correlations are inadequate for detecting multicollinearity. Instead, look for:
Large changes in regression coefficients as variables are added to (or deleted from) the model
Large standard errors, or signs of coefficients in unexpected directions
VIF
Tolerance
Condition numbers

65 VIF (Variance Inflation Factor)
Indicates inflation in the variance of the b's or betas as a result of collinearity among the independent variables.
Larger VIF values indicate greater collinearity; VIF = 1 (its lowest value) when r = 0 among the IVs.
Some have suggested VIF > 10 as indicating collinearity; however, problematic collinearity can occur even with VIF considerably below 10.
VIF = 1 / tolerance. A sketch follows.
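A minimal VIF sketch using statsmodels' variance_inflation_factor helper on the design matrix `Xc` from the first example; tolerance is printed as 1 / VIF:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

for j in range(1, Xc.shape[1]):           # skip column 0, the constant
    vif = variance_inflation_factor(Xc, j)
    print(f"x{j}: VIF = {vif:.2f}, tolerance = {1 / vif:.3f}")
```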

66 Tolerance
For any given independent variable, tolerance reflects the proportion of its variance that is NOT accounted for by the remaining independent variables; therefore, small values indicate collinearity.
SPSS uses a very small tolerance value as its default criterion for halting analyses on the basis of collinearity; however, collinearity will lead to problems long before tolerance reaches such an extreme level.
As tolerance values become small, problems will occur in the accuracy of calculating the parameter estimates. A sketch follows.

67 Tolerance, cont'd

68 Condition Numbers and Eigenvalues
Eigenvalues can also be used as a diagnostic for collinearity, with smaller eigenvalues indicating greater collinearity; an eigenvalue of 0 indicates a linear dependency.
An index based on the eigenvalues is the condition number. Larger values indicate greater collinearity, with values > 15 suggesting some collinearity and values > 30 suggesting a serious problem. A sketch follows.
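A minimal condition-number sketch with numpy, reusing `Xc`; following the usual convention, the columns are first scaled to unit length, and the condition number is the square root of the ratio of the largest to the smallest eigenvalue of X'X:

```python
import numpy as np

Xs = Xc / np.linalg.norm(Xc, axis=0)      # scale each column to unit length
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)   # eigenvalues, smallest first
cond = np.sqrt(eigvals.max() / eigvals.min())
print(eigvals, cond)                      # cond > 30 suggests a serious problem
```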

69 Condition number, cont'd

70 What to do if faced with collinearity
Could omit one of the "problem" variables, but this risks model misspecification.
Avoid multiple indicators of the same construct; if they are not too highly correlated, they could be tested as a block of variables. But if the correlations between indicators are excessively high, the collinearity could still cause problems for other variables in the model.

71 If it makes conceptual sense to do so, you can combine or aggregate the correlated independent variables.
You can also use another type of regression, such as ridge regression, for which collinearity is not as much of a problem.
Centering can be used as well, but it is only appropriate for non-essential collinearity.

