Unit 7: Statistical control in depth: Correlation and collinearity
© Andrew Ho, Harvard Graduate School of Education
Incorporating research funding into our US News model
Comparing coefficients across models

As before, we look at how coefficients vary across models, watching for large changes in magnitude, sign, or significance. Research funding, although a statistically significant and substantively notable predictor on its own, is mediated by GRE scores and the size of the doctoral cohort: it is no longer a significant predictor once these other terms are in the model. GRE scores and doctoral cohort size appear to explain the variation in peer ratings better than research funding does. Given everything else in the model, research funding has no statistically significant predictive utility, although it still adds slightly to the model's overall predictive power. Should we keep it?
Model selection criteria and adjusted R-sq

Does the chosen model reflect your underlying theory? Does it allow you to address the effects of your key question predictor(s)? Are you excluding predictors that are statistically significant? If so, why? Are you including predictors you could reasonably set aside (the parsimony principle)? No model is ever “final.” We'll spend more time on this in Unit 11. Because R-sq can never decrease when predictors are added, adjusted R-sq adds a penalty term that discourages frivolous variable addition.
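The penalty works through the standard formula adj-R² = 1 − (1 − R²)(n − 1)/(n − k − 1). A minimal sketch (in Python rather than Stata, purely for illustration; the sample sizes and R-sq values are made up) shows how a second predictor that barely raises R-sq can still lower adjusted R-sq:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: penalizes R-squared for the number of predictors k,
    given sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With n = 20 cases, a second predictor that raises R-sq only from .70 to .71
# still lowers adjusted R-sq.
one_predictor = adjusted_r2(0.70, n=20, k=1)
two_predictors = adjusted_r2(0.71, n=20, k=2)
print(round(one_predictor, 3))   # → 0.683
print(round(two_predictors, 3))  # → 0.676
```

The ratio (n − 1)/(n − k − 1) grows with every added predictor, so R-sq must rise by a nontrivial amount for adjusted R-sq not to fall.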
A flawed but useful framework for “statistical control”

Think of the correlation between variables as the overlap of circles in a Venn diagram, with one circle each for Y, X, and Z. In this diagram, the pairwise correlations between the variables are all approximately equal. We can think of the goal of prediction as accounting for as much of the Y circle (the variance of Y) as possible. The left-hand figure shows the proportion of Y variance accounted for by X; the right-hand figure shows the proportion of Y variance accounted for by Z. Note that X and Z are themselves correlated.

[Figure: Venn diagrams of overlapping circles Y, X, and Z.]
From confounding to multicollinearity

When predictors are highly correlated, it can be difficult for the regression model to tell which predictor is accounting for the variance of Y. A particularly severe case of confounding between two or more predictors is known as multicollinearity. As the diagram makes clear, this is not a problem for the prediction of Y, the overall goal of accounting for Y variance. Nor is it a threat to the baseline assumptions of regression: residuals can still be identically normally distributed and centered on zero. Rather, multicollinearity threatens substantive interpretation by increasing the volatility of regression coefficients, clouding the magnitude, sign, and significance of individual coefficients.
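The volatility can be made concrete with a small simulation (a Python sketch, not part of the original slides; all values are simulated). When Z is nearly a copy of X, two very different coefficient pairs predict Y almost equally well, so the data can barely distinguish between them even though prediction itself is unharmed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
z = x + 0.05 * rng.normal(size=n)   # Z is nearly a copy of X
y = x + z + rng.normal(size=n)      # true model: Y = X + Z + noise

# Two very different coefficient pairs leave nearly identical residual
# variance -- exactly why the fitted coefficients are so volatile.
mse = {}
for b_x, b_z in [(1.0, 1.0), (2.0, 0.0)]:
    resid = y - (b_x * x + b_z * z)
    mse[(b_x, b_z)] = float(np.mean(resid ** 2))

print(mse)
```

Both mean squared errors come out close to the noise variance; the likelihood surface is nearly flat along the line b_x + b_z = 2, which is what inflates the standard errors.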
The body fat example

Predicting the percentage of body fat in 20 healthy females. Can we predict the percentage of body fat using a much less expensive and less time-consuming procedure: measuring triceps skinfold thickness (mm), thigh circumference (cm), and midarm circumference (cm)?
Comparing regression coefficients across models

The triceps and thigh coefficients are significant as single predictors, but the two-predictor equation shows that the triceps variable does not account for a significant amount of body fat variance above and beyond the thigh variable. The adjusted R-sq value also declines relative to the model containing the thigh variable alone. The arm variable seems to act as a suppressor variable in the two-predictor equation with triceps; however, there is little substantive justification for this. The three-predictor model has no significant predictors, even though the omnibus null hypothesis can be rejected and its adjusted R-sq is the highest of all the candidate models. This pattern is consistent with multicollinearity.
Visualizing multicollinearity

Multicollinearity arises from the inability of the estimation procedure to decide among a number of nearly equivalent equations for the best-fit plane. These problems occur more often than one might expect. One might accidentally include two predictors that are on different but linearly related scales (meters and feet, scale scores and z-scores, frequencies and percentages). One might include variables that are linear combinations of other variables (try predicting a future score from a Time 1 score, a Time 2 score, and growth defined as Time 2 – Time 1). Under perfect collinearity, Stata will spit out one of your variables in protest, “note: growth omitted due to collinearity.”
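The Time 1 / Time 2 / growth example can be checked directly. In this illustrative Python sketch (the course itself uses Stata; the simulated scores are hypothetical), the design matrix containing all three variables is rank-deficient, which is precisely the condition that triggers Stata's omission note:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
time1 = rng.normal(size=n)
time2 = rng.normal(size=n)
growth = time2 - time1                  # an exact linear combination

# Design matrix: intercept, Time 1 score, Time 2 score, and growth.
X = np.column_stack([np.ones(n), time1, time2, growth])

# Four columns, but only rank 3: growth carries no new information,
# so the normal equations have no unique solution.
print(np.linalg.matrix_rank(X))  # → 3
```

With the growth column included, infinitely many coefficient vectors reproduce the same fitted values, so the software must drop one variable to estimate anything at all.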
The Variance Inflation Factor (VIF)

Our primary concern is the increase in our standard errors, born of the inability of the estimation procedure to estimate individual regression coefficients. We can anticipate these standard errors, because they are related to how well each predictor is predicted by all the other predictors. We can call this R²_j, one for each j of the k predictors, where R²_j is the proportion of variance in X_j accounted for by the other predictors. We can then define a quantity called the Variance Inflation Factor (VIF) that indicates how troublesome a predictor's standard error is likely to be, due to strong prediction by the other predictors:

VIF_j = 1 / (1 − R²_j)
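The definition can be computed by hand: regress each predictor on all the others, take R²_j, and form 1/(1 − R²_j). A minimal Python sketch (illustrative only; the `vif` helper and the simulated predictors are ours, not a library routine):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factors: for each column j of X, regress X_j on the
    remaining columns (plus an intercept) and return 1 / (1 - R^2_j)."""
    n, k = X.shape
    factors = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1.0 - resid.var() / X[:, j].var()
        factors.append(1.0 / (1.0 - r2))
    return factors

# Two nearly redundant predictors earn large VIFs; an unrelated one does not.
rng = np.random.default_rng(2)
a = rng.normal(size=100)
b = a + 0.1 * rng.normal(size=100)   # b is nearly a duplicate of a
c = rng.normal(size=100)
vifs = vif(np.column_stack([a, b, c]))
print([round(v, 1) for v in vifs])
```

In the body fat example, the same calculation would be expected to flag very large VIFs, consistent with the volatile coefficients on the previous slide.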
Addressing multicollinearity

As a loose rule of thumb, any VIF greater than 10 suggests remedial action, although VIFs greater than 5 should be noted. If interpretation of coefficients is desired, one drastic response is to remove a variable. In this example, where prediction is the goal, there may be no need, or we might simply keep the thigh variable alone. Other, more advanced techniques involve combining predictors with factor analysis or principal components analysis, or using a technique called ridge regression.
Model selection philosophies (more in Unit 11)

Different disciplines and subdisciplines have different norms driving variable and model selection. More substantively driven fields, which typically deal with smaller sample sizes, tend to include predictors on theoretical grounds.
– Remember that including a nonsignificant predictor can still reduce bias in other predictors, if it really functions as a covariate in the population.
More statistically driven fields, which typically deal with larger sample sizes, tend to prune predictors for parsimony, and because there is limited evidence supporting the inclusion of nonsignificant variables in the absence of strong theory. We want you to appreciate this continuum while also emphasizing the value of sensitivity studies: coefficient comparisons across models. Regardless of your philosophy, robust interpretations of coefficients across plausible models should be reassuring, and volatility should be reported and explained substantively to the extent possible. Even if prediction alone is the goal, beware of chasing R-sq or even adjusted R-sq blindly. This can lead to overfitting: models that fit the sample data well but do so by chasing random noise. In these cases, a more parsimonious model will fit the next sample from the population better.
Start looking at the results sections of papers in your field

Michal Kurlaender & John Yun (2007). Measuring school racial composition and student outcomes in a multiracial society. American Journal of Education, 113.
Get to know the norms in your discipline

Barbara Pan, Meredith Rowe, Judith Singer, & Catherine Snow (2005). Maternal correlates of growth in toddler vocabulary production in low-income families. Child Development, 76(4).
What are the takeaways from this unit?

Multiple regression is a very powerful tool.
– The ability to estimate certain associations while accounting for others greatly expands the utility of statistical models.
– It can support inferences about the effects of some predictors while holding other predictors constant.
But with great power comes great responsibility.
– Coefficients and their interpretations are conditional on all the variables in a model. Inclusion or exclusion of variables can change interpretations dramatically.
– It is our responsibility to explore plausible rival models and to anticipate the effects of including variables beyond the scope of our dataset.
The pattern of correlations can help presage multiple regression results.
– Learn how to examine a correlation matrix and foreshadow how the predictors will behave in a multiple regression model.
– If you have one (or more) control predictors, consider examining a partial correlation matrix that removes that effect.
Multicollinearity is an extreme example of confounding.
– We can anticipate it with correlations, but metrics like the VIF are better at flagging problems.
– Collinearity can be addressed by variable exclusion, variable combination, or more advanced techniques.