# Unit 7: Statistical control in depth: Correlation and collinearity

© Andrew Ho, Harvard Graduate School of Education


Unit 7: Statistical control in depth: Correlation and collinearity http://wondermark.com/553/

Incorporating research funding into our US News model

Comparing coefficients across models

As before, we look at how coefficients vary across models and look for large changes in magnitude, sign, or significance.

Research funding, although a statistically significant and substantively notable predictor on its own, is mediated by GRE scores and the size of the doctoral cohort such that it is not a significant predictor with these other terms in the model. GRE scores and doctoral student cohort size appear to be a better explanation of the variation in peer ratings than research funding.

Research funding has no statistically significant predictive utility given all else in the model. It still adds slightly to the overall predictive utility of the model, though not significantly. Should we keep it?

Model selection criteria and adjusted R-sq

- Does the model chosen reflect your underlying theory?
- Does the model allow you to address the effects of your key question predictor(s)?
- Are you excluding predictors that are statistically significant? If so, why?
- Are you including predictors you could reasonably set aside (parsimony principle)?

No model is ever “final.” We’ll spend more time on this in Unit 11.

Because R-sq can never decrease with additional predictors, adj-R-sq adds a penalty term that discourages frivolous variable addition.
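The penalty in adjusted R-sq can be made concrete with a small sketch. It assumes the common textbook definition, adj-R-sq = 1 - (1 - R-sq)(n - 1)/(n - k - 1), for n observations and k predictors. The function name and the numbers are hypothetical and are not taken from the US News model.

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R-sq for a model with k predictors fit to n observations.

    Assumes the common definition 1 - (1 - R-sq) * (n - 1) / (n - k - 1).
    """
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical numbers: a third predictor nudges R-sq up only slightly.
two_predictors = adjusted_r_squared(0.700, n=30, k=2)
three_predictors = adjusted_r_squared(0.702, n=30, k=3)
print(round(two_predictors, 4), round(three_predictors, 4))
```

With these illustrative values, R-sq rises from 0.700 to 0.702 while the adjusted value falls, which is the sense in which the penalty discourages adding a predictor that contributes little.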

A flawed but useful framework for “statistical control”

Think of the correlation between variables as the overlap of circles in a Venn diagram.

[Figure: Venn diagram of three overlapping circles labeled Y, X, and Z]

In this diagram, the pairwise correlations between the variables are all approximately equal. We can think of the goal of prediction as accounting for as much of the Y circle (the variance) as possible.

[Figure: two Venn diagrams of Y, X, and Z, left and right]

The left-hand figure shows the proportion of Y variance accounted for by X. The right-hand figure shows the proportion of Y variance accounted for by Z. Note that X and Z are themselves correlated.

From confounding to multicollinearity

When predictors are highly correlated, we can imagine that it might be difficult for the regression model to tell which predictor is accounting for the variance of Y. Particularly severe cases of confounding between two or more predictors are known as multicollinearity.

[Figure: Venn diagram of Y, X, and Z]

As the diagram makes clear, this is not a problem for the prediction of Y: the overall goal of accounting for Y variance. Nor is it a threat to the baseline assumptions of regression. Residuals can still be identically and normally distributed and centered on zero.

Multicollinearity threatens substantive interpretations by increasing the volatility of regression coefficients, clouding interpretation of the magnitude, sign, and significance of coefficients.

The body fat example

Predicting the percentage of body fat in 20 healthy females 25-34 years old. Can we predict the percentage of body fat using a much less expensive and time-consuming procedure: measuring triceps skinfold thickness (mm), thigh circumference (cm), and midarm circumference (cm)?

Comparing regression coefficients across models

The triceps and thigh variables are each significant as single predictors, but the two-predictor equation shows that the triceps variable does not account for a significant amount of body fat variance above and beyond the thigh variable. The adjusted R-sq value also declines relative to the model that contains the thigh variable alone.

The arm variable seems to act as a suppressor variable in the two-predictor equation with triceps; however, there is little substantive justification for this.

The three-predictor model has no significant predictors, although the omnibus null hypothesis can be rejected, and the adjusted R-sq is the highest of all possible combinations of models. This is consistent with multicollinearity.

Visualizing multicollinearity

Multicollinearity arises from an inability of the estimation procedure to decide from among a number of feasible equations for the best-fit plane.

These problems occur more often than one might expect. One might accidentally include two predictors that are on different but linearly related scales (meters and feet, scale scores and z-scores, frequencies and percentages). One might include variables that are linear combinations of other variables (try predicting a future score with a Time 1 score, a Time 2 score, and growth defined as Time 2 – Time 1).

In perfect collinearity, Stata will spit out one of your variables in protest: “note: growth omitted due to collinearity.”
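The "number of feasible equations" under perfect collinearity can be illustrated with a minimal sketch. The scores and coefficients below are hypothetical; the only structure taken from the slide is that growth is defined as Time 2 minus Time 1.

```python
# Hypothetical Time 1 and Time 2 scores; growth is an exact linear combination.
time1 = [50, 60, 70, 80]
time2 = [55, 58, 77, 90]
growth = [t2 - t1 for t1, t2 in zip(time1, time2)]


def predict(intercept, b_time1, b_time2, b_growth):
    """Fitted values from a linear equation in the three predictors."""
    return [
        intercept + b_time1 * t1 + b_time2 * t2 + b_growth * g
        for t1, t2, g in zip(time1, time2, growth)
    ]


# Putting weight c on growth is offset exactly by adding c to the Time 1
# coefficient and subtracting c from the Time 2 coefficient, because
# c * growth = c * time2 - c * time1.
base = predict(10, 2, 3, 0)
shifted = [predict(10, 2 + c, 3 - c, c) for c in (1, 5, -4)]
print(all(p == base for p in shifted))  # True: identical fitted values
```

Every one of these coefficient sets produces the same fitted values, so no estimation procedure can prefer one over another. Dropping one of the three variables, as Stata does, removes the ambiguity.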

The Variance Inflation Factor (VIF)

Our primary concern is the increase in our standard errors, born of an inability of the estimation procedure to estimate individual regression coefficients. We can anticipate the standard errors, because they are related to how well each predictor is predicted by all the other predictors.

We can call this $R_j^2$, one for each $j$ of $k$ predictors, where $R_j^2$ is the proportion of variance in $X_j$ accounted for by the other predictors. We can then define a quantity called the Variance Inflation Factor (VIF) that indicates how troublesome a predictor’s standard error is likely to be, due to strong prediction by other predictors:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$
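A short sketch shows how quickly the VIF grows as a predictor becomes more predictable from the others. It assumes the standard definition, VIF for predictor j = 1 / (1 - R-sq for predictor j), where that R-sq comes from regressing predictor j on all the other predictors. The function name and the R-sq values are illustrative and are not taken from the body fat data.

```python
import math


def vif(r_squared_j: float) -> float:
    """VIF for predictor j, given R-sq from regressing X_j on the other predictors."""
    if not 0.0 <= r_squared_j < 1.0:
        raise ValueError("R-sq must be in [0, 1); R-sq = 1 is perfect collinearity")
    return 1.0 / (1.0 - r_squared_j)


# Illustrative R-sq values (hypothetical, not estimated from any dataset).
for r2 in (0.0, 0.5, 0.8, 0.9, 0.99):
    # sqrt(VIF) is the factor by which the standard error is inflated,
    # relative to a predictor that is uncorrelated with the others.
    print(f"R-sq_j = {r2:.2f}  VIF = {vif(r2):6.1f}  SE inflation = {math.sqrt(vif(r2)):5.2f}")
```

Under this definition, an R-sq of 0.8 among the predictors corresponds to a VIF of 5, and an R-sq of 0.9 corresponds to a VIF of 10.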

Addressing multicollinearity

As a loose rule of thumb, any VIF greater than 10 suggests remedial action, although VIFs greater than 5 should be noted.

If interpretation of coefficients is desired, one drastic response is to remove a variable. In this example, where prediction is the goal, there may be no need to do so; alternatively, we might keep only the thigh variable.

Other advanced techniques involve combining predictors with factor analysis or principal components analysis, or using a technique called ridge regression.

Model selection philosophies (more in Unit 11)

Different disciplines and subdisciplines have different norms driving variable and model selection.

- More substantively driven fields that typically deal with smaller sample sizes tend to include predictors on theoretical grounds.
  - Remember that including a nonsignificant predictor can still reduce bias in the coefficients of other predictors, if it really functions as a covariate in the population.
- More statistically driven fields that typically deal with larger sample sizes tend to prune predictors for parsimony and because there is limited evidence supporting the inclusion of nonsignificant variables in the absence of strong theory.

We want you to appreciate this continuum while also emphasizing the value of sensitivity studies: coefficient comparisons across models. Regardless of your philosophy, robust interpretations of coefficients across plausible models should be reassuring, and volatility should be reported and explained substantively to the extent possible.

Even if prediction alone is the goal, beware of chasing R-sq or even adj-R-sq blindly. This can lead to overfitting: models that fit the sample data well but do so by chasing random noise. In these cases, a more parsimonious model will fit the next sample from the population better.

Start looking at the results sections of papers in your field

Michal Kurlaender & John Yun (2007). Measuring school racial composition and student outcomes in a multiracial society. American Journal of Education, 113, 213-242.

Get to know the norms in your discipline

Barbara Pan, Meredith Rowe, Judith Singer, & Catherine Snow (2005). Maternal correlates of growth in toddler vocabulary production in low-income families. Child Development, 76(4), 763-782.

What are the takeaways from this unit?

- Multiple regression is a very powerful tool
  - The ability to estimate certain associations while accounting for others greatly expands the utility of statistical models
  - It can support inferences about the effects of some predictors while holding other predictors constant
- But with great power comes great responsibility
  - Coefficients and their interpretations are conditional on all the variables in a model. Inclusion or exclusion of variables can change interpretations dramatically.
  - It is our responsibility to explore plausible rival models and anticipate the effects of including variables beyond the scope of our dataset.
- The pattern of correlations can help presage multiple regression results
  - Learn how to examine a correlation matrix and foreshadow how the predictors will behave in a multiple regression model
  - If you have one (or more) control predictors, consider examining a partial correlation matrix that removes that effect
- Multicollinearity is an extreme example of confounding
  - We can anticipate it with correlations, but metrics like VIF are better at flagging problems.
  - Collinearity can be addressed by variable exclusion, variable combination, or more advanced techniques.
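The partial correlation mentioned in the takeaways can be sketched for the simplest case of a single control predictor Z. The sketch assumes the standard first-order formula, which takes the three pairwise correlations as inputs. The function name and the correlation values are hypothetical.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between X and Y after removing the linear effect of Z from both."""
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denominator == 0.0:
        raise ValueError("Z is perfectly correlated with X or Y")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical case: all three pairwise correlations equal 0.5.
print(round(partial_correlation(0.5, 0.5, 0.5), 3))
```

In this illustrative case the correlation between X and Y drops from 0.5 to about 0.33 once Z is controlled, the kind of shrinkage a correlation matrix lets you anticipate before fitting the multiple regression.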
