Published by Abner Mills
Modified about 1 year ago
Unit 7: Statistical control in depth: Correlation and collinearity
http://wondermark.com/553/
Unit 7 / Page 1 © Andrew Ho, Harvard Graduate School of Education
Incorporating research funding into our US News model
Comparing coefficients across models
As before, we look at how coefficients vary across models and look for large changes in magnitude, sign, or significance.
Research funding, although a statistically significant and substantively notable predictor on its own, is mediated by GRE scores and the size of the doctoral cohort: it is not a significant predictor once these other terms are in the model. GRE scores and doctoral cohort size appear to explain the variation in peer ratings better than research funding does.
Research funding has no statistically significant predictive utility given all else in the model. It still adds to the predictive utility of the model, though not significantly. Should we keep it?
Model selection criteria and adjusted R-sq
– Does the model chosen reflect your underlying theory?
– Does the model allow you to address the effects of your key question predictor(s)?
– Are you excluding predictors that are statistically significant? If so, why?
– Are you including predictors you could reasonably set aside (parsimony principle)?
No model is ever “final.” We’ll spend more time on this in Unit 11.
Because R-sq can never decrease when predictors are added, adj-R-sq applies a penalty term that discourages frivolous variable addition.
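The penalty in adj-R-sq can be made concrete with the usual textbook formula, adj-R-sq = 1 - (1 - R-sq)(n - 1)/(n - k - 1), for n cases and k predictors plus an intercept. The sketch below is only an illustration: the function name and the R-sq values are made up, not taken from the US News model.

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R-squared for a model with k predictors (plus an intercept) and n cases."""
    if n - k - 1 <= 0:
        raise ValueError("need more cases than estimated coefficients (n > k + 1)")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical values: a third predictor nudges R-sq up from 0.60 to 0.61.
two_predictors = adjusted_r_squared(0.60, n=20, k=2)
three_predictors = adjusted_r_squared(0.61, n=20, k=3)

print(round(two_predictors, 4), round(three_predictors, 4))
```

With these made-up numbers R-sq rises but adj-R-sq falls, which is the sense in which the penalty discourages adding a predictor that contributes little.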
A flawed but useful framework for “statistical control”
Think of the correlation between variables as the overlap of circles in a Venn diagram.
[Figure: Venn diagram of overlapping circles Y, X, and Z]
In this diagram, the pairwise correlations between the variables are all approximately equal. We can think of the goal of prediction as accounting for as much of the Y circle (the variance of Y) as possible.
[Figures: left-hand and right-hand Venn diagrams of Y, X, and Z]
The left-hand figure shows the proportion of Y variance accounted for by X. The right-hand figure shows the proportion of Y variance accounted for by Z. Note that X and Z are themselves correlated.
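The overlap idea can be checked numerically with the standard two-predictor formula, R-sq = (r_yx^2 + r_yz^2 - 2 r_yx r_yz r_xz) / (1 - r_xz^2). This is a minimal sketch: the function name and the correlation values are illustrative assumptions, not numbers from the slides.

```python
def r_squared_two_predictors(r_yx: float, r_yz: float, r_xz: float) -> float:
    """R-squared for Y regressed on X and Z, computed from the three pairwise correlations."""
    if abs(r_xz) >= 1.0:
        raise ValueError("X and Z are perfectly collinear; R-squared is not identified this way")
    return (r_yx ** 2 + r_yz ** 2 - 2.0 * r_yx * r_yz * r_xz) / (1.0 - r_xz ** 2)


# All pairwise correlations equal to 0.5, as in the equal-overlap diagram.
overlapping = r_squared_two_predictors(0.5, 0.5, 0.5)

# Same correlations with Y, but X and Z uncorrelated: no shared overlap.
separate = r_squared_two_predictors(0.5, 0.5, 0.0)

# A case circles cannot draw: Z is unrelated to Y but correlated with X.
suppression = r_squared_two_predictors(0.5, 0.0, 0.5)

print(round(overlapping, 3), round(separate, 3), round(suppression, 3))
```

When X and Z overlap, the joint R-sq is smaller than the sum of the two separate squared correlations. The third case gives a joint R-sq larger than that sum, which is one reason the Venn picture is described as flawed.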
From confounding to multicollinearity
When predictors are highly correlated, we can imagine that it might be difficult for the regression model to tell which predictor is accounting for the variance of Y. Particularly severe cases of confounding between two or more predictors are known as multicollinearity.
[Figure: Venn diagram of Y, X, and Z, with X and Z overlapping heavily]
As the diagram makes clear, this is not a problem for the prediction of Y, the overall goal of accounting for Y variance. Nor is it a threat to the baseline assumptions of regression: residuals can still be identically normally distributed and centered on zero.
Multicollinearity threatens substantive interpretations by increasing the volatility of regression coefficients, clouding interpretation of their magnitude, sign, and significance.
The body fat example
Predicting the percentage of body fat in 20 healthy females 25-34 years old. Can we predict the percentage of body fat using a much less expensive and time-consuming procedure: measuring triceps skinfold thickness (mm), thigh circumference (cm), and midarm circumference (cm)?
Comparing regression coefficients across models
The triceps and thigh variables are each significant as single predictors, but the two-predictor equation shows that the triceps variable does not account for a significant amount of body fat variance above and beyond the thigh variable. The adjusted R-sq value also declines relative to the model that contains the thigh variable alone.
The arm variable seems to act as a suppressor variable in the two-predictor equation with triceps. However, there is little substantive justification for this.
The three-predictor model has no significant predictors, although the omnibus null hypothesis can be rejected, and its adjusted R-sq is the highest of all possible combinations of models. This is consistent with multicollinearity.
Visualizing multicollinearity
Multicollinearity arises from an inability of the estimation procedure to decide among a number of feasible equations for the best-fit plane.
These problems occur more often than one might expect. One might accidentally include two predictors that are on different but linearly related scales (meters and feet, scale scores and z-scores, frequencies and percentages). One might include variables that are linear combinations of other variables (try predicting a future score with a Time 1 score, a Time 2 score, and growth defined as Time 2 – Time 1).
In perfect collinearity, Stata will spit out one of your variables in protest: “note: growth omitted due to collinearity.”
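Perfect collinearity can be seen directly in the matrix that least squares has to invert. The sketch below uses made-up scores and, as a simplifying assumption, leaves out the intercept column. It builds the cross-product (Gram) matrix of Time 1, Time 2, and growth and shows that its determinant is exactly zero, so no unique set of coefficients exists.

```python
def gram_matrix(columns):
    """Cross-product matrix X'X, where each element of `columns` is one predictor."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Hypothetical integer scores for five students.
time1 = [10, 12, 15, 11, 14]
time2 = [13, 15, 16, 15, 19]
growth = [t2 - t1 for t1, t2 in zip(time1, time2)]  # exact linear combination
unrelated = [1, 0, 2, 1, 3]                         # not a combination of time1 and time2

det_collinear = det3(gram_matrix([time1, time2, growth]))
det_ok = det3(gram_matrix([time1, time2, unrelated]))

print(det_collinear, det_ok)
```

A zero determinant means the normal equations have infinitely many solutions, which is why software has to drop one of the three variables before it can report coefficients.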
The Variance Inflation Factor (VIF)
Our primary concern is the increase in our standard errors, born of an inability of the estimation procedure to estimate individual regression coefficients. We can anticipate the standard errors, because they are related to how well each predictor is predicted by all the other predictors.
We can call this $R_j^2$, one for each $j$ of $k$ predictors, where $R_j^2$ is the proportion of variance in $X_j$ accounted for by the other predictors.
We can then define a quantity called the Variance Inflation Factor (VIF) that indicates how troublesome a predictor’s standard error is likely to be, due to strong prediction by other predictors:
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$
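As a small worked illustration of VIF_j = 1 / (1 - R_j^2): with only two predictors, R_j^2 is simply the squared correlation between them. The sketch below assumes that two-predictor case, and the data values are invented for the example rather than taken from the body fat study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif(r_squared_j: float) -> float:
    """Variance inflation factor for a predictor whose R-squared on the other predictors is given."""
    if r_squared_j >= 1.0:
        raise ValueError("perfect collinearity: VIF is undefined")
    return 1.0 / (1.0 - r_squared_j)


# Hypothetical predictor values (not from the slides).
x = [1, 2, 3, 4, 5]
z = [2, 1, 4, 3, 5]

r_xz = pearson_r(x, z)
vif_x = vif(r_xz ** 2)            # with two predictors, both share this VIF
se_inflation = math.sqrt(vif_x)   # factor by which the standard error grows

print(round(r_xz, 3), round(vif_x, 3), round(se_inflation, 3))
```

Because the VIF multiplies the sampling variance of a coefficient, its square root is the factor applied to the standard error. An R_j^2 of 0.80 gives a VIF of 5 and an R_j^2 of 0.90 gives a VIF of 10.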
Addressing multicollinearity
As a loose rule of thumb, any VIF greater than 10 suggests that remedial action is needed, although VIFs greater than 5 should be noted.
If interpretation of coefficients is desired, one drastic response is to remove a variable. In this example, where prediction is the goal, there may be no need to remove anything, or we might simply keep the thigh variable alone.
Other advanced techniques involve combining predictors with factor analysis or principal components analysis, or using a technique called ridge regression.
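To give a rough feel for what ridge regression does, the sketch below works in a deliberately simplified setting: two predictors and an outcome that are all standardized, so the coefficients solve (R + penalty * I) beta = r, where R is the predictor correlation matrix and r holds the correlations with Y. The function name, the penalty scale, and the correlation values are assumptions chosen for illustration. A penalty of zero reproduces ordinary least squares.

```python
def standardized_coefficients(r_y1: float, r_y2: float, r_12: float, penalty: float = 0.0):
    """Solve (R + penalty*I) beta = r for two standardized predictors.

    penalty=0 gives the ordinary least-squares standardized coefficients;
    penalty>0 gives a ridge-style solution on the correlation scale.
    """
    d = 1.0 + penalty
    det = d * d - r_12 * r_12
    if det == 0.0:
        raise ValueError("singular system: predictors are perfectly collinear")
    beta1 = (d * r_y1 - r_12 * r_y2) / det
    beta2 = (d * r_y2 - r_12 * r_y1) / det
    return beta1, beta2


# Hypothetical correlations: two predictors that are highly correlated with each other.
ols_b1, ols_b2 = standardized_coefficients(0.84, 0.88, 0.92)
ridge_b1, ridge_b2 = standardized_coefficients(0.84, 0.88, 0.92, penalty=0.5)

print(round(ols_b1, 3), round(ols_b2, 3))
print(round(ridge_b1, 3), round(ridge_b2, 3))
```

In this example the unpenalized solution splits the credit very unevenly between two nearly interchangeable predictors, while the penalized solution pulls the two coefficients toward each other. The cost is that ridge estimates are biased, so the penalty is a trade of bias for stability.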
Model selection philosophies (more in Unit 11)
Different disciplines and subdisciplines have different norms driving variable and model selection.
More substantively driven fields that typically deal with lower sample sizes tend to include predictors on theoretical grounds.
– Remember that including a nonsignificant predictor can still reduce bias in other predictors, if it really functions as a covariate in the population.
More statistically driven fields that typically deal with larger sample sizes tend to prune predictors for parsimony and because there is limited evidence supporting the inclusion of nonsignificant variables in the absence of strong theory.
We want you to appreciate this continuum while also emphasizing the value of sensitivity studies: coefficient comparisons across models. Regardless of your philosophy, robust interpretations of coefficients across plausible models should be reassuring, and volatility should be reported and explained substantively to the extent possible.
Even if prediction alone is the goal, beware of chasing R-sq or even adj-R-sq blindly. This can lead to overfitting: models that fit the sample data well but do so by chasing random noise. A more parsimonious model will fit the next sample from the population better in these cases.
Start looking at the results sections of papers in your field
Michal Kurlaender & John Yun (2007). Measuring school racial composition and student outcomes in a multiracial society. American Journal of Education, 113, 213-242.
Get to know the norms in your discipline
Barbara Pan, Meredith Rowe, Judith Singer & Catherine Snow (2005). Maternal correlates of growth in toddler vocabulary production in low-income families. Child Development, 76(4), 763-782.
What are the takeaways from this unit?
Multiple regression is a very powerful tool
– The ability to estimate certain associations while accounting for others greatly expands the utility of statistical models.
– It can support inferences about the effects of some predictors while holding other predictors constant.
But with great power comes great responsibility
– Coefficients and their interpretations are conditional on all the variables in a model. Inclusion or exclusion of variables can change interpretations dramatically.
– It is our responsibility to explore plausible rival models and anticipate the effects of including variables beyond the scope of our dataset.
The pattern of correlations can help presage multiple regression results
– Learn how to examine a correlation matrix and foreshadow how the predictors will behave in a multiple regression model.
– If you have one (or more) control predictors, consider examining a partial correlation matrix that removes that effect.
Multicollinearity is an extreme example of confounding
– We can anticipate it with correlations, but metrics like VIF are better at flagging problems.
– Collinearity can be addressed by variable exclusion, variable combination, or more advanced techniques.
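For the partial-correlation takeaway, the case of a single control predictor Z has a closed form: r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The sketch below is an illustration with made-up correlations; the function name is hypothetical.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between X and Y after removing the linear effect of Z from both."""
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denominator == 0.0:
        raise ValueError("Z is perfectly correlated with X or Y; partial correlation is undefined")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical correlations: X and Y correlate 0.5, and both correlate 0.5 with Z.
controlled = partial_correlation(0.5, 0.5, 0.5)

# If Z is unrelated to both X and Y, controlling for it changes nothing.
unchanged = partial_correlation(0.5, 0.0, 0.0)

print(round(controlled, 3), round(unchanged, 3))
```

Here the association between X and Y drops from 0.5 to about 0.33 once Z is held constant, which foreshadows a smaller coefficient for X when Z enters the regression.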