2 CorrelationAny multivariate technique concerns the relationships among variablesCorrelation is a measure of the strength and direction of a linear relationship between two variablesTwo variables’ values vary about their mean (variance)They may also covary with one another, having the tendency to move in the same or opposite directions (covariance)As such a correlation is an effect size in and of itself*When squared it represents the shared variance between the two variables*Again I suggest you not ‘flag’ a correlation matrix. If you are really concerned with the statistical outcome and if it is different from zero, two rules of thumb can suffice for a test at alpha = .05: N = 30 means you only need r = .35 for statistical significance, N = 100, about .20.
3 Factors affecting correlation LinearityNonlinear relationships will have an adverse effect on a measure designed to find a linear relationshipReliability of measurementIf you are not reliably distinguishing individuals on some measure, you will not capture the covariance that measure may have with another adequatelyHeterogeneous subsamplesSub-samples may artificially increase or decrease overall r.Solution - calculate r separately for sub-samples & overall, look for differencesCan be caused by lack of reliabilityRange restrictionsLimiting the variability of your data can in turn limit the possibility for covariability between two variables, thus attenuating r.Common example occurs with Likert scalesE.g vsHowever it is also the case that restricting the range can actually increase r if by doing so, highly influential data points would be kept outWilcox 2001Outliers can artificially increase or decrease r
4 RegressionUsing the covariances among a set of variables, regression is a technique that allows us to predict an outcome based on information provided by one or more other variablesWe also can get a sense of the overall correlation between a set of variables and some outcomeMultiple RSquaring that value has the same interpretation as beforeRegression equation:
6 Interpreting Regression Results: Basics InterceptValue of Y if X(s) is 0Often not meaningful, particularly if it’s practically impossible to have an X of 0 (e.g. weight)Slope, the regression coefficientAmount of change in Y seen with 1 unit change in XIn MR that is with the others held constantStandardized regression coefficientAmount of change in Y seen in standard deviation units with 1 standard deviation unit change in XIn simple regression it is equivalent to the r for the two variablesStandard error of estimateGives a measure of the accuracy of predictionR2Proportion of variance in the outcome explained by the modelEffect size6
7 Real regression: Starting out Doing a real regression involves much more than obtaining outputFirst one must discern the goal of the regression, and there are two main kinds that are for the most part exclusive to one anotherPredictionWith prediction there is much less concern over variable importance etc.One is more concerned with whether it (the model) works rather than how it worksCoefficients will be used to predict future dataExplanationThe goal here is to understand the individual relationships of variables with some outcomeCausal notions are implicit in this approachMajority of psych usage, and the one that will typically be of focus here
8 Real regression: Starting out Specifying the model is obviously something to be done before data collection* unless one is doing a truly exploratory endeavorHowever one must consider relations (potentially causal ones) among all the variables under considerationGiven 3 variables of which none are initially set as an outcome, there are possibly 20 or more models which might best represent the state of affairsIn addition one must consider interactions (moderating effects) among the variablesReliable measures must be used, or it will be all for naughtFor some reason people try anywayVariable XVariable YVariable ZCriterionABC*even if it is in the hypothetical sense with an available data set.
9 Real regression: IED Initial examination of data includes: Getting basic univariate descriptive statistics for all variablesGraphical depiction of their distributionAssessment of simple bivariate correlationsInspection for missing data and determining how to deal with it*Start noting potential outliers, but realize that univariate outliers may not be model outliersThe gist is you should know the data inside and out before the analysis is conducted*Methods for dealing with missing data employ maximum likelihood techniques such as the EM algorithm to estimate missing data. You’ll have to fork up the change for the Missing Values package if you want to do it in SPSS. All SEM packages have a means to estimate missing data. Things not to do: mean replacement (attenuates correlation), pairwise deletion (if doing MR in matrix fashion using the covariance or correlation matrix), listwise deletion for more than say 3% of the data.
10 Real regression: Running the Analysis ANOVA summary table- you should know every aspect of an ANOVA table and how it is derivedThe following is for the simple case, for MR, the Model dF is the number of predictors, the Error df is N- K – 1The square root of the MSerror is the standard error of estimateR2 is the Model SS/ Total SS, the amount of variance explainedAll of this is the same for the ‘analysis of variance’
11 Starting point: Statistical significance of the model The ANOVA summary table tells us whether our model is statistically adequateR2 different from zeroThe regression equation is a better predictor than the meanAs with simple regression, the analysis involves the ratio of variance predicted to residual varianceAs we can see, it is reflective of the relationship of the predictors to the DV (R2), the number of predictors in the model, and sample size11
12 The Multiple correlation coefficient The multiple correlation coefficient is the correlation between the DV and the linear combination of predictors which minimizes the sum of the squared residualsMore simply, it is the correlation between the observed values and the values that would be predicted by our modelIts squared value (R2) is the amount of variance in the dependent variable accounted for by the independent variables
13 Back up Wait a second what just happened? You ‘did’ an MR analysis, but how did you do it?Click the iconYou don’t have to get crazy with it, but you should get a sense of how the data is manipulated to produce coefficients, regular or standardizedYou also need to be able to produce the formula a predicted value given the coefficients**Anyone that can’t do that is not doing regression analysis.
14 Variable ImportanceAs the goal of much of social science research is explanation, determining variable importance is a key concernUnfortunately, the way it is typically done is crude, difficult to make sense of, and not done with much thoughtWe will go over the standard fare then provide a ‘new’* technique that is highly interpretable and a means to test for differences among variable importance metrics*about 25 years old
15 “Controlling for”Let’s start with raw coefficients- What do they tell us?How much does Y change with a one unit change in X… controlling for the other variables in the modelNow for some fun go ask random researchers what ‘controlling for’ actually meansAt a conceptual level it simply means we are considering this variable’s effects on the DV with respect to the other predictorsWhat it actually means is that this is the average effect (slope) seen at each of the levels of the other variablesE.g. average effect of sex role identity on marital satisfaction for each value of ageIt does not mean experimental control, and it also doesn’t mean we’ve controlled for what might be potentially important variables that weren’t included in the model
16 Variable importance: Statistical significance We can examine the output to determine which variables statistically significantly contribute to the modelStandard errormeasure of the variability that would be found among the different slopes estimated from other samples drawn from the same populationThis is typically not a good way to determine variable importance and should probably only be noted casually if at allHowever the standard error does allow us to get confidence intervals for the coefficients, which are desirable and should be a standard part of reporting
17 The standardized coefficient If we standardized our variables before running the regression analysis (i.e. used the correlation matrices) we would have a standardized regression coefficientHow much Y would typically change given a one standard deviation change in XIn the simple setting it equals the Pearson r, in MR it is a type of partial correlationWe typically prefer this because most of the measures used in psychology are on arbitrary scales (e.g. Likert)In this manner we can more easily compare one variable to another
18 Variable importance: Partial correlation Partial correlation is the contribution of a predictor after the contributions of the other predictors have been taken out of both that predictor and the DVA+B+C+D represents all the variability in the DV to be explained (fig. 1 and 2)A+B+C = R2 for the modelThe squared partial correlation is the amount a variable explains relative to the amount in the DV that is left to explain after the contributions of the other predictors have been removed from both the predictor and criterionIt is A/(A+D) (fig. 3)For Predictor 2 it would be B/(B+D) (fig. 4)Figure 1Figure 2DVPredictor 1Predictor 2Figure 3Figure 4
19 Variable importance: Semipartial correlation The semipartial correlation (squared) is perhaps the more useful measure of contributionIt refers to the unique contribution of A to the model, i.e. the relationship between the DV and predictor after the contributions of the other predictors have been removed from the predictor in questionA/(A+B+C+D)For predictor 2B/(A+B+C+D)Interpretation (of the squared value):Out of all the variance to be accounted for, how much does this variable explain that no other predictor does?orHow much would R2 drop if the variable were removed?
20 Variable importanceNote that exactly how partial and semi-partial will be figured will depend on the type of multiple regression employed.The previous examples concerned a ‘standard’ multiple regression situationAs a preview, for sequential (i.e. hierarchical) regression the partial correlation would bePredictor1 = (A+C)/(A+C+D)Predictor2 = B/(B+D)Predictor 1Predictor 2
21 Semipartial correlation For semi-partial correlationPredictor1 = (A+C)/(A+B+C+D)Predictor2 same as beforeThe result for the addition of the second variable is the same as it would be in standard MRThus if the goal is to see the unique contribution of a single variable after all others have been controlled for, the results of a sequential regression can be determined to an extent with the squared semipartial correlations from a regular MRIn general terms, it is the unique contribution of the variable at the point it enters the equation (sequential or stepwise)
22 Variable importance: Example data The semipartial correlation is labeled as ‘part’ correlation in SPSS*Here we can see that education level is really doing all the work in this model*This seems to be an SPSS quirk. The vast majority of references to semi-partial correlation call it semi-partial.22
23 Another exampleMental health symptoms predicted by number of doctor visits, physical health symptoms, number of stressful life events
24 Another exampleHere we see that physical health symptoms and stressful life events both significantly contribute to the model in a statistical senseExamining the unique contributions, physical health symptoms seem more ‘important’, but unless you run a bootstrapping procedure to provide a statistical test for their difference, you have zero evidence to suggest one is contributing significantly more than the otherIn other words, that difference my just be due to sampling variability and unless you test that you cannot say that e.g and .223 are statistically different
25 Variable Importance: Comparison Comparison of standardized coefficients, partial, and semi-partial correlation coefficientsAll of them are ‘partial’ correlations in the sense that they give an account of the relationship among a predictor and the DV after removing the effects of others in some fashion
26 Variable Importance: A more intuitive metric The problem with all of the previous measures of variable importance is that they don’t decompose R2 into the relative contributions to it from the predictorsThere is one that provides an average R2 increase, depending on the order a predictor enters into the model3 predictor example A B C; B A C, C A B etc.One way to think about it using what you’ve just learned is thinking of the squared semi-partial correlation whether a variable is first second third etc.This statistic is just an average semi-partial, and note that the average is for all possible permutationsE.g. the R-square contribution for B being first in the model includes B A C and B C A, both of which would of course be the same value
27 As Predictor 2: R2 change = .639 and .087 Note there are 2 models in which war would be first and have that value(war, math, bush; war, bush, math)Outcome: Score on a Social Conservativism/Liberalism scaleAs Predictor 2: R2 change = .639 and .087Here there are two models in which war would be second(math, war, bush; bush, war, math)As Predictor 3: R2 change = .098There are 2 models in which war would be last and have that value.
28 InterpretationThe average of these* is the average contribution to R square for a particular variable over all possible orderingsIn this case for war it is ~.36, i.e. on average, it increases R square 36% of variance accounted forFurthermore, if we add up the average R-squared contribution for all three…= .65.65 is the R2 for the model*Again the average includes all values, so the six from previous ( )/6 = .363 for the predictor War.
29 Example The output of interest is labeled as LMG below LMG stands for Lindemann, Merenda and Gold, authors who introduced it in a textbook in the early 80s‘Last’ is simply the squared semi-partial correlation‘First’ is just the square of the simple bivariate correlation between predictor and DV‘Beta squared’ is the square of the beta coefficient with all variables in‘Pratt’ is the product of the standardized coefficient and the simple bivariate correlationIt too will add up to the model R2 but is not recommended, one reason being that it can actually be negativelibrary(relaimpo)RegModel.1 <- lm(SOCIAL~BUSH+MTHABLTY+WAR, data=Dataset)calc.relimp(RegModel.1, type = c("lmg", "last", "first", "betasq", "pratt"))*Note the relaimpo package is equipped to provide bootstrapped estimateslmg last first betasq prattBUSHMATHWAR29
30 Relative Importance Summary There are multiple ways to estimate a variable’s contribution to the model, and some may be better than othersA general approach:Check simple bivariate relationshipsIf you don’t see worthwhile correlations with the DV there you shouldn’t expect much from your results regarding the model*Check for outliers and compare with robust measures alsoYou may detect that some variables are so highly correlated that one is redundantStatistical significance is not a useful means of assessing relative importance, nor is the raw coefficient typicallyStandardized coefficients and partial correlations are a first stepCompare standardized to simple correlations as a check on possible suppressionOf typical output the semi-partial correlation is probably the more intuitive assessmentThe LMG is also intuitive, and is a natural decomposition of R2, unlike the others*You can’t create something out of nothing. For some reason I’ve found many that have tried that were surprised to learn that nothing came of a regression technique even though their simple correlations showed clearly there wasn’t anything to begin with.30
31 Multiple Regression: Initial Summary So far we have gone through some basics of interpreting a MR analysisHowever an adequate, responsible MR will have much of the analysis spent testing assumptions, identifying problem cases, determining model adequacy, and validating the modelWe turn to that now
32 How does MR work?Recall our formula for a straight line in simple regressionNow we have multiple predictors, so the formula will now represent vectors and matricesX here is the dataframe with a column for the constant
33 How does MR work? Dimensions of Y = n,1 Dimensions of X = n, k+1 Dimensions for the vector of coefficients b = k+1,1Dimensions of the vector of residuals e = n,1
34 How does MR work? First we need the coefficients Take the transpose of the data matrix and multiply it by the original datamatrix, as well as the vector of values of the DVTake the inverse of the first product and multiple that by the second product to obtain the vector coefficientsMultiply the data matrix by the vector of coefficientsThe result will be a vector of predicted valuesSubtract from the vector of observed to get the vector of residuals
35 How does MR work? Another way: Bi= column vector of standardized regression coefficientsRii = matrix of correlations among the predictorsRiy = column vector of correlations of the DV and predictorsRyi = is now the row vector of correlations of the DV and predictors