Presentation on theme: "Model Adequacy Running a Real Regression Analysis"— Presentation transcript:
1. Model Adequacy: Running a Real Regression Analysis
  Testing Assumptions, Checking for Outliers, and More
2. Hold up
  Before the data was collected, did you bother to do a power analysis to estimate the sample size needed?
  The same ingredients are necessary for determining N as before: alpha, effect size, and desired power (1 - beta).
  See an example in the commentary.
  Example in R (p is the number of predictors):
    library(MBESS)
    ss.power.R2(Population.R2 = .5, alpha.level = .05, desired.power = .85, p = 5)
3. Linearity
  One should obviously check to see whether the relation is a linear one.
  Check the regression of the DV on the composite.
  Tests can be done which examine curvilinear possibilities and whether those would be viable.
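One simple way to examine a curvilinear possibility is to add a polynomial term and compare the nested models. The sketch below is only an illustration under stated assumptions: the data are simulated (not the lecture's), and a quadratic term stands in for whatever curvilinear form theory might suggest.

```r
# Simulated data (an assumption for illustration): the true relation is curved
set.seed(1)
x <- runif(100, -2, 2)
y <- 1 + 0.5 * x + 0.8 * x^2 + rnorm(100, sd = 0.5)

fit_linear <- lm(y ~ x)            # straight-line model
fit_quad   <- lm(y ~ x + I(x^2))   # adds a quadratic term

# F test for the improvement due to the quadratic term
cmp <- anova(fit_linear, fit_quad)
p_quad <- cmp[["Pr(>F)"]][2]
print(cmp)
```

A small p-value here suggests the straight line is missing curvature; a plot of residuals against fitted values from the linear fit would show the same thing visually.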
4. Normal distribution of residuals
  Our normality assumption applies to the residuals.
  One can simply save them and plot a density curve or histogram.
  Often a quantile-quantile plot is readily available, and here we hope to find most of our data along a 45 degree line.*
  *After fitting the model: Models/Graphs/Basic diagnostic plots in R-commander. One click provides all of them.
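Outside R-commander, the same checks can be sketched with base R. The model below (mpg on wt and hp from the built-in mtcars data) is a stand-in chosen for illustration, not the lecture's example.

```r
# Stand-in model on built-in data (an assumption for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Density curve of the saved residuals
plot(density(res), main = "Density of residuals")

# Q-Q plot of standardized residuals; points near the line y = x
# are what we hope to see under normality
z <- rstandard(fit)
qqnorm(z)
abline(0, 1)
```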
5. Homoscedasticity
  We can check a plot of the residuals vs. our predicted values to get a sense of the spread along the regression line.
  We prefer to see a kind of blob about the zero line (our mean), with no readily discernible pattern.
  This would mean that the residuals don’t get overly large for certain areas of the regression line relative to others.
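A minimal sketch of that plot in base R follows. The model and data are stand-ins (built-in mtcars), and the split at the median fitted value is just a crude numeric companion to the plot, not a formal test.

```r
# Stand-in model on built-in data (an assumption for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residuals vs. predicted values; we hope for a patternless blob around zero
plot(fitted(fit), residuals(fit),
     xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Crude check: residual spread for low vs. high predicted values
upper  <- fitted(fit) > median(fitted(fit))
spread <- tapply(residuals(fit), upper, sd)
print(spread)
```

Markedly different spreads in the two halves, or a funnel shape in the plot, would suggest heteroscedasticity.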
6. Collinearity
  Multiple regression is capable of analyzing data with correlated predictor variables.
  However, problems can arise from situations in which two or more variables are highly intercorrelated.
  Perfect collinearity
    Occurs if predictors are linear functions of each other (e.g., age and year of birth), when the researcher creates dummy variables for all values of a categorical variable rather than leaving one out, and when there are fewer observations than variables.
    No unique regression solution.
  Less than perfect (the usual problem)
    Inflates standard errors and makes assessment of the relative importance of the predictors unreliable.
    Also means that a small number of cases can potentially affect results strongly.
7. Collinearity
  Simple and multi-collinearity
    When two or more variables are highly correlated.
    Can be detected by looking at the zero-order correlations.
    Better is to regress each predictor on all other variables and look for large R2s.*
  Although our estimates of our coefficients are not necessarily biased, they become inefficient.
    They jump around a lot from sample to sample.
  *You don’t have to actually do that. The tolerance statistic is just 1 minus that R2, and if the Variance Inflation Factor is given, tolerance is its reciprocal.
8. Collinearity diagnostics
  Tolerance*
    Proportion of a predictor’s variance not accounted for by other variables.
    Looking for tolerance values that are small, close to zero, as problematic.
    Means the predictor is not contributing anything new to the model.
    tolerance = 1/VIF
  VIF
    Variance inflation factor.
    Looking for VIF values that are large.
    E.g., an individual VIF greater than 10 should be inspected.
    VIF = 1/tolerance
  Other indicators of collinearity
    Eigenvalues: small values, close to zero.
    Condition index: large values (15+).
  *Essentially, it is 1 - the R2 for the model in which the other predictors are predicting that predictor. While I prefer it on an intuitive level, the VIF is often reported.
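To make the definitions concrete, tolerance and VIF can be computed by hand exactly as described: regress each predictor on the others and take 1 - R2. This is a sketch on built-in data; the predictor set (wt, hp, disp from mtcars) is an assumption chosen for illustration.

```r
# Stand-in predictors from built-in data (an assumption for illustration)
predictors <- c("wt", "hp", "disp")

# Tolerance: 1 - R2 from regressing each predictor on all the others
tolerance <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 - r2
})

# VIF is the reciprocal of tolerance
vif <- 1 / tolerance
print(round(cbind(tolerance, vif), 3))

# Cross-check: the VIFs are the diagonal of the inverse correlation matrix
vif_check <- diag(solve(cor(mtcars[, predictors])))
```

In practice a packaged function would usually be used instead; the hand computation is only meant to show where the numbers come from.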
9. Dealing with collinearity
  Collinearity is not necessarily a problem if the goal is to predict, not explain.
    Inefficiency of coefficients may not pose a real problem.
  Larger N might help reduce the standard error of our coefficients.
  Combine variables to create a composite, or remove a variable.
    Must be theoretically feasible.
  Centering the data (subtracting the mean).
    Interpretation of coefficients will change as variables are now centered on zero.
  Recognize its presence and live with the consequences.
10. Independence of residuals
  We need to have our data points be independent of one another, as we have in other statistical analyses.
  As an example, subsets of various demographic categories could in theory result in a lack of independence.
    Include the variable in the model.
  The Durbin-Watson statistic is usually examined.
    It ranges from 0 to 4; 2 is good, and deviations from it are to be examined.
    < 1 indicates positive serial correlation, > 3 negative.
    Better, use a statistical test for it.
  Like many tests regarding assumptions, we would prefer not to have a significant result.
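The Durbin-Watson statistic is simple enough to compute directly from the residuals, which helps show why 2 is the "good" value. The sketch below uses simulated data (an assumption for illustration) with independent errors in one model and positively autocorrelated AR(1) errors in the other; `durbin_watson` is a hypothetical helper name.

```r
# Hypothetical helper: sum of squared successive differences of the
# residuals divided by the residual sum of squares
durbin_watson <- function(fit) {
  e <- residuals(fit)
  sum(diff(e)^2) / sum(e^2)
}

# Simulated data (an assumption for illustration)
set.seed(2)
n <- 200
x <- rnorm(n)
e_iid <- rnorm(n)                                   # independent errors
e_ar  <- as.numeric(stats::filter(rnorm(n), filter = 0.8,
                                  method = "recursive"))  # AR(1), rho = 0.8

y_iid <- 1 + 2 * x + e_iid
y_ar  <- 1 + 2 * x + e_ar

dw_iid <- durbin_watson(lm(y_iid ~ x))   # should be near 2
dw_ar  <- durbin_watson(lm(y_ar ~ x))    # should be well below 2
print(c(independent = dw_iid, autocorrelated = dw_ar))
```

For an actual significance test, add-on packages are typically used (for example, `durbinWatsonTest` in car or `dwtest` in lmtest, assuming they are installed).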
11. Regression Diagnostics
  Of course all of the previous information would be relatively useless if we are not meeting our assumptions and/or have overly influential data points.
    In fact, you shouldn’t really be looking at the results unless you test assumptions and look for outliers, even though this requires running the analysis to begin with.
  Various tools are available for the detection of outliers.
  Classical methods
    Standardized Residuals (ZRESID)
    Studentized Residuals (SRESID)
    Studentized Deleted Residuals (SDRESID)
  Ways to think about outliers
    Leverage
    Discrepancy
    Influence
  Thinking ‘robustly’
12. Regression Diagnostics
  Standardized Residuals (ZRESID)
    Standardized errors in prediction.
    Mean 0, SD = standard error of estimate.
    To standardize, divide each residual by the standard error of estimate.
    At best an initial indicator (e.g. the ±2 rule of thumb), but because the case itself helps determine what the variance would be, almost useless.
  Studentized Residuals (SRESID)
    Same thing, but the studentized residual recognizes that the error associated with predicting values far from the mean of X is larger than the error associated with predicting values closer to the mean of X.
    The standard error is multiplied by a value that allows the result to take this into account.
  Studentized Deleted Residuals (SDRESID)
    Studentized residuals in which the standard error is calculated with the case in question removed from the others.
13. Regression Diagnostics
  Mahalanobis’ Distance
    Mahalanobis distance is the distance of a case from the centroid of the remaining points (the point where the means meet in n-dimensional space).
  Cook’s Distance
    Identifies an influential data point, whether in terms of predictor or DV.
    A measure of how much the residuals of all cases would change if a particular case were excluded from the calculation of the regression coefficients.
    With larger (relative) values, excluding a case would change the coefficients substantially.
  DfBeta
    Change in the regression coefficient that results from the exclusion of a particular case.
    Note that you get DfBetas for each coefficient associated with the predictors.
14. Regression Diagnostics
  Leverage
    Assesses outliers among the predictors.
    Mahalanobis distance: a relatively high Mahalanobis distance suggests an outlier on one or more variables.
  Discrepancy
    Measures the extent to which a case is in line with others.
  Influence
    A product of leverage and discrepancy.
    How much would the coefficients change if the case were deleted?
    Cook’s distance, dfBetas.
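The measures from the last few slides are all available in base R, and collecting them in one table makes the leverage/discrepancy/influence distinction easier to see. This is a sketch on a stand-in model (built-in mtcars data, an assumption for illustration). Two naming caveats: R's `rstandard()` corresponds to the studentized residual (SRESID) and `rstudent()` to the studentized deleted residual (SDRESID); and `mahalanobis()` as used here measures distance from the centroid of all cases, including the case itself.

```r
# Stand-in model on built-in data (an assumption for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)
X   <- mtcars[, c("wt", "hp")]
n   <- nrow(mtcars)

diag_tab <- data.frame(
  zresid   = residuals(fit) / summary(fit)$sigma,       # ZRESID
  sresid   = rstandard(fit),                            # SRESID
  sdresid  = rstudent(fit),                             # SDRESID
  leverage = hatvalues(fit),                            # 'hat' values
  mahal    = mahalanobis(X, colMeans(X), cov(X)),       # squared distance
  cooks    = cooks.distance(fit)                        # influence
)

# Change in each coefficient when a case is dropped (one column per coefficient)
dfb <- dfbeta(fit)

# Cases with the largest Cook's distance
print(head(diag_tab[order(-diag_tab$cooks), ], 5))
```

Leverage and Mahalanobis distance carry the same information here: the hat value equals 1/n plus the squared Mahalanobis distance divided by n - 1.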
15. Outliers
  Influence plots
    With a couple of measures of ‘outlierness’ we can construct a scatterplot to note especially problematic cases.
    After fitting a regression model in R-commander, i.e. running the analysis, this graph is available via point and click.
  Here we have what is actually a 3-d plot, with 2 outlier measures on the x and y axes (studentized residuals and ‘hat’ values, a measure of leverage) and a third in terms of the size of the circle (Cook’s distance).
  For this example, case 35 appears to be a problem.
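A plot of this kind can be approximated with base graphics. The sketch below is not the R-commander plot itself: the model is a stand-in on built-in data, which axis holds which measure is a choice made here, and the 4/n cutoff used for labelling is a common rule of thumb adopted as an assumption.

```r
# Stand-in model on built-in data (an assumption for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit)        # leverage
r <- rstudent(fit)         # studentized deleted residuals
d <- cooks.distance(fit)   # influence, drawn as circle size

symbols(h, r, circles = sqrt(d), inches = 0.25,
        xlab = "Hat values", ylab = "Studentized residuals")
abline(h = c(-2, 2), lty = 2)

# Label cases above a rule-of-thumb Cook's distance cutoff (assumed: 4/n)
cutoff  <- 4 / length(d)
flagged <- names(d)[d > cutoff]
text(h, r, labels = ifelse(d > cutoff, names(d), ""), pos = 3, cex = 0.7)
print(flagged)
```

The car package offers a ready-made version of this display (`influencePlot`), assuming the package is installed.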
16. Summary: Outliers
  No matter the analysis, some cases will be the ‘most extreme’. However, none may really qualify as being overly influential.
  Whatever you do, always run some diagnostic analysis and do not ignore influential cases.
  It should be clear to interested readers whatever has been done to deal with outliers.
  As noted before, the best approach to dealing with outliers when they do occur is to run a robust regression with capable software.
17. Suppressor variables
  There are a couple of ways in which suppression can occur or be talked of, but the gist is that a third variable masks the impact the predictor would have on the dependent variable if that third variable did not exist.
  In general, suppression occurs when βi falls outside the range of 0 to rYi.
  Suppression in MR can entail some different relationships among predictors.
  For example, one suppressor relationship would be where two variables, X1 and X2, are positively related to Y, but when the equation comes out we get
    Y-hat = b1X1 - b2X2 + a
  Three kinds to be discussed
    Classical
    Net
    Cooperative
18. Suppression: Technical side
  When dealing with standardized regression coefficients, note that [equation not preserved in the transcript]
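The formula on this slide did not survive in the transcript. What follows is the standard two-predictor result for standardized coefficients, offered as a likely reconstruction rather than a copy of the slide; the subscript notation (r with subscripts Y1, Y2, 12) follows the later slides.

```latex
\beta_1 = \frac{r_{Y1} - r_{Y2}\, r_{12}}{1 - r_{12}^{2}}, \qquad
\beta_2 = \frac{r_{Y2} - r_{Y1}\, r_{12}}{1 - r_{12}^{2}}, \qquad
R^{2}_{Y.12} = \beta_1 r_{Y1} + \beta_2 r_{Y2}
```

When the predictors are uncorrelated (r12 = 0), each beta reduces to the simple correlation with Y; suppression arises from the r12 terms.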
19. Suppression
  Consider the following relationships:
    a. Complete independence: R2Y.12 = 0
    b. Partial independence: R2Y.12 = 0 but r12 ≠ 0
    d. Partial independence again: both rY1 and rY2 ≠ 0, but r12 = 0
20. Suppression
  e. Normal situation, redundancy: no simple correlation = 0
    Each semi-partial correlation, and the corresponding beta, will be less than the simple correlation between Xi and Y. This is because the variables share variance and influence.
  f. Classical suppression: rY2 = 0
21. Suppression
  Recall from previously: if rY2 = 0, then [equation not preserved in the transcript]
  With increasingly shared variance between X1 and X2 we will have an inflated beta coefficient for X1.
  X2 is suppressing the error variance in X1.
  In other words, even though X2 is not correlated with Y, having it in the equation raises the R2 from what it would have been with just X1.
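The equation referred to here is missing from the transcript. Setting rY2 = 0 in the standard two-predictor formulas for standardized coefficients gives the following, which is presumably what the slide showed:

```latex
\beta_1 = \frac{r_{Y1}}{1 - r_{12}^{2}}, \qquad
\beta_2 = \frac{-\, r_{Y1}\, r_{12}}{1 - r_{12}^{2}}, \qquad
R^{2}_{Y.12} = \frac{r_{Y1}^{2}}{1 - r_{12}^{2}} \;\ge\; r_{Y1}^{2}
```

Because the denominator shrinks as the predictors share more variance, the beta for X1 grows beyond its simple correlation with Y, and R2 exceeds what X1 would achieve alone.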
22. Suppression
  Other suppression situations: net and cooperative
  Net
    All rs positive.
    β2 ends up with a sign opposite that of its simple correlation with Y.
    It is always the X which has the smaller rYi which ends up with a β of opposite sign.
    β falls outside of the range 0 to rYi, which is always true with any sort of suppression.
  Cooperative
    Predictors negatively correlated with one another, both positive with the DV.
    Or positively with one another and negatively with Y.
    Example of cooperative suppression:
      Correlation between social aggressiveness (X1) and sales success (Y) = .29
      Correlation between record keeping (X2) and sales success (Y) = .24
      r12 = -.30
      Regression coefficients for predictors = .398 and .359 respectively
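The cooperative example can be checked by hand with the standard two-predictor formula for standardized coefficients, beta1 = (rY1 - rY2 r12) / (1 - r12^2), and likewise for beta2. The function name `std_betas` below is a hypothetical helper introduced only for this check.

```r
# Hypothetical helper: standardized coefficients for two predictors,
# computed from the three correlations
std_betas <- function(r_y1, r_y2, r_12) {
  denom <- 1 - r_12^2
  c(beta1 = (r_y1 - r_y2 * r_12) / denom,
    beta2 = (r_y2 - r_y1 * r_12) / denom)
}

# Values from the sales example: rY1 = .29, rY2 = .24, r12 = -.30
b <- std_betas(0.29, 0.24, -0.30)
print(round(b, 3))
```

Both betas come out larger than the corresponding simple correlations (.29 and .24), matching the .398 and .359 reported on the slide and illustrating that each beta falls outside the range 0 to rYi.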
23. Suppression
  Gist: weird stuff can happen in MR, so take note of the relationship of the predictors and how it may affect your overall interpretation.
  Compare the simple correlations of each predictor with the DV to their respective beta coefficients.*
  If a coefficient is noticeably larger than the simple correlation (in absolute value) or of opposite sign, one should suspect possible suppression.
  *For predictors contributing notably to the model.
25. Overfitting
  External validity
    In some cases, some of the variation the chosen parameters are explaining is variation that is idiosyncratic to the sample.
    We would not see this variability in the population.
    So the fit of the model is good, but it doesn’t generalize as well as one would think.
  Capitalization on chance
26. Overfitting
  Example from Lattin, Carroll, Green
    Randomly generated 30 variables to predict an outcome variable.
    Using a best subsets approach, 3 variables were found that produce an R2 of .33, or 33% variance accounted for.
    As one can see, even random data has the capability of appearing to be a decent fit.
27. Validation
  One way to deal with such a problem is with a simple random split.
  With large datasets one can randomly split the sample into two sets.
    Calibration sample: used to estimate the coefficients.
    Holdout sample: used to validate the model.
    Some suggest a 2:1 or 4:1 split.
    This would typically require large samples for the holdout sample to be viable.
  Using the coefficients from the calibration set, one can create predicted values for the holdout set.
    The squared correlation between the predicted values and observed values can then be compared to the R2 of the calibration set.
    In the previous example of randomly generated data, the R2 for the holdout set was 0.
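The split-sample idea can be sketched on pure noise, in the spirit of the random-data example. Everything here is an assumption for illustration: the data are simulated, the sample sizes are arbitrary, and instead of a true best subsets search the code simply keeps the 3 predictors most correlated with the outcome in the calibration sample.

```r
# Simulated noise (an assumption for illustration): y is unrelated to all 30 predictors
set.seed(3)
n <- 150
p <- 30
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
dat <- data.frame(y = rnorm(n), X)

# 2:1 random split into calibration and holdout samples
cal  <- sample(n, size = 100)
hold <- setdiff(seq_len(n), cal)

# Stand-in for best subsets: top 3 predictors by |r| in the calibration sample
r_cal <- abs(cor(dat[cal, -1], dat$y[cal]))[, 1]
top3  <- names(sort(r_cal, decreasing = TRUE))[1:3]

# Estimate coefficients on the calibration sample only
fit    <- lm(reformulate(top3, response = "y"), data = dat[cal, ])
r2_cal <- summary(fit)$r.squared

# Apply those coefficients to the holdout sample
pred_hold <- predict(fit, newdata = dat[hold, ])
r2_hold   <- cor(pred_hold, dat$y[hold])^2

print(c(calibration = r2_cal, holdout = r2_hold))
```

Because the predictors were selected for their chance correlations in the calibration sample, the calibration R2 is expected to look better than the holdout value, which is the capitalization on chance described above.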
28. Other approaches
  Jackknife validation
    Create estimates with a particular case removed.
    Use the coefficients obtained from analysis of the n-1 remaining cases to create a predicted value for the case removed.
    Do this for all cases, and then compare the jackknifed R2 to the original.
  Subsets approach
    Create several samples of the data of roughly equal size.
    Use the holdout approach with one sample, and obtain estimates from the others.
    Do this for each sample, and obtain average estimates.
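The jackknife procedure translates almost line for line into a loop. This is a sketch on a stand-in model (built-in mtcars data, an assumption for illustration); the squared correlation between the leave-one-out predictions and the observed values is used as the jackknifed R2.

```r
# Stand-in data and model (an assumption for illustration)
dat <- mtcars
n   <- nrow(dat)

# Leave each case out, refit on the remaining n - 1, predict the omitted case
pred_loo <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- lm(mpg ~ wt + hp, data = dat[-i, ])
  pred_loo[i] <- predict(fit_i, newdata = dat[i, , drop = FALSE])
}

# Jackknifed R2 vs. the original R2
r2_jack <- cor(pred_loo, dat$mpg)^2
fit_all <- lm(mpg ~ wt + hp, data = dat)
r2_orig <- summary(fit_all)$r.squared
print(c(original = r2_orig, jackknifed = r2_jack))

# Shortcut for linear models: the same predictions without refitting
pred_short <- dat$mpg - residuals(fit_all) / (1 - hatvalues(fit_all))
```

The subsets approach follows the same pattern with groups of cases held out in turn instead of single cases.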
29. Bootstrap
  With relatively smaller samples*, cross-validation may not be as feasible.
  One may instead resample (with replacement) from the original data to obtain estimates for the coefficients.
    Use what is available to create a sampling distribution for the values of interest.
  *But still large enough such that the bootstrap estimates would be viable. There are numerous ways to do so in R, but easiest are the specific functions to do so, such as the validate function from the ‘Design’ library.
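The resampling idea can also be written out by hand in base R, which shows what the packaged functions are doing. This is a sketch on a stand-in model (built-in mtcars data); the number of resamples (500) and the percentile intervals are choices made here for illustration.

```r
# Stand-in data and model (an assumption for illustration)
set.seed(4)
B <- 500
n <- nrow(mtcars)

# Resample cases with replacement and refit the model each time
boot_coefs <- t(replicate(B, {
  idx <- sample(n, replace = TRUE)
  coef(lm(mpg ~ wt + hp, data = mtcars[idx, ]))
}))

# Percentile intervals from the bootstrap sampling distribution
ci <- apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))
print(round(ci, 3))
```

Each row of `boot_coefs` holds the coefficients from one resample, so histograms of its columns display the estimated sampling distributions directly.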
30. Summary
  There is a lot to consider when performing multiple regression analysis.
  Actually running the analysis is just the first step, and if that’s all we are doing, we haven’t done much.
    Inferences are likely incomplete at best, inaccurate at worst.
  A lot of work will be necessary to make sure that the conclusions drawn will be worthwhile.
  And that’s ok, you can do it!
31. Summary of how one could do regression
  Idea pops into your head.
  Have some loose hypotheses about correlations among some variables.
  Collect some data.
  Run the regression analysis.
  Use R2 and standard-fare metrics of variable importance.
  Rely on statistical significance when you don’t have any real effects to talk about.
  This would be a bad way to do regression.
32. Summary of how to do a real regression analysis
  1. Have an idea.
  2. Propose a theoretical (possibly causal) model in which you have thought about other viable models (including how predictors might predict one another, moderating and mediating possibilities*, etc.).
  3. Collect appropriate and enough data (based on a power analysis). Must have reliable measures.
  4. Spend time with initial examination of data, including obtaining a healthy understanding of the variables descriptively, missing values analysis if necessary, inspection of correlations, etc.
  5. Run the analysis. Might as well ignore for now.
  6. With the model in place, test assumptions, look for collinearity, identify outliers. Take appropriate steps necessary to deal with any issues, including bootstrapped regression or robust regression.
  7. Rerun the analysis.
  8. Validate the model. Note any bias.
  9. Interpret results. Focus on bias-corrected estimates of R2, interval estimates of coefficients, and interpretable measures of variable importance (test for differences on them).
  *Note that moderating and mediating situations must be theoretically plausible; it’s not something one ‘explores’ unless you really are doing an exploratory regression (e.g. of a stepwise nature). Several have come into RSS wanting to ‘see if there might be moderators or mediators’. They are entirely different theoretical models. We’ll talk more on the distinction later.