Regression: Primary Goals
We are usually focused on one of two goals: predicting the response variable from a set of predictors (reliability), or quantifying the relationship between the predictors and the response (interpretability). In both situations, confounding and interaction can be concerns.
What is “Confounding”?
We saw this with the Smoking and Age predictors in our SBP example. We consider the relationship of SBP to smoking status alone, and to smoking status along with age. Our interest is in determining whether smoking raises blood pressure.
Smoking is confounded with Age
Smoking by itself is not significant. Without age, we are not able to see a difference between the smoking groups. (The groups are actually different, but we cannot see it until we add age, a covariate.)
Smoking is confounded with Age (2)
The smoking variable now tests significant. After adjusting for age, the two smoking groups are clearly different!
Estimates
The effect of smoking is confounded with age: if we don’t first adjust for age, we won’t accurately see the effect of smoking.
Confounding
Confounding exists if meaningfully different interpretations of a relationship of interest can be made depending on whether or not a nuisance variable (or covariate) is included in the model.
How to find confounding? Get lucky and stumble upon it (like we did), or look for it intentionally by running a lot of different models and watching for variables that aren’t significant at first but become significant when other variables (covariates) are added.
Confounding (2)
If confounding is present, it may lead to inaccurate results if we are not careful: important covariates MUST be included (even if they aren’t significant themselves!). Making the variable of interest significant is enough to warrant including the covariate. If we had failed to adjust for age, we would not have gotten a good estimate of the difference due to smoking, and would also have wrongly concluded that smoking status doesn’t matter.
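The masking effect described above can be reproduced on simulated data. The sketch below is a minimal illustration, not the SBP analysis itself: the sample size, coefficients, and age distributions are all made-up assumptions, chosen so that smokers are younger on average and age raises SBP. It fits SBP on smoking alone, then on smoking plus age, and compares the smoking estimate and its standard error.

```python
import numpy as np

def ols(predictors, y):
    """Ordinary least squares with an intercept.
    Returns (coefficients, standard errors), intercept first."""
    X = np.column_stack([np.ones(len(y)), predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])   # residual mean square
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

# Simulated (hypothetical) data, NOT the SBP dataset from the slides.
rng = np.random.default_rng(0)
n = 200
smk = np.repeat([0.0, 1.0], n // 2)                          # 100 non-smokers, 100 smokers
age = np.where(smk == 1, 50.0, 56.0) + rng.normal(0, 7, n)   # smokers are younger
sbp = 60 + 1.5 * age + 10 * smk + rng.normal(0, 5, n)        # assumed true smoking effect = 10

b_alone, se_alone = ols(smk, sbp)                            # SBP ~ smk
b_adj, se_adj = ols(np.column_stack([smk, age]), sbp)        # SBP ~ smk + age

t_alone = b_alone[1] / se_alone[1]
t_adj = b_adj[1] / se_adj[1]
print(f"smk alone:     est={b_alone[1]:.2f}  se={se_alone[1]:.2f}  t={t_alone:.2f}")
print(f"smk given age: est={b_adj[1]:.2f}  se={se_adj[1]:.2f}  t={t_adj:.2f}")
```

Under these assumed settings the unadjusted difference is pulled toward zero (in expectation about 10 - 1.5 * 6 = 1), while adjusting for age should bring the estimate back near 10 with a smaller standard error, which is the pattern described for confounding.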
Confounding vs. Multicollinearity
Parameter estimates will change wildly when (multi)collinearity is involved too! But the two are almost opposites.
(MULTI)COLLINEARITY: SE’s increase and X1 (added last) becomes insignificant when X2 is in the model. This (usually) works both ways: both variables “fight”.
CONFOUNDING: SE’s decrease and X1 (added last) becomes significant only when X2 is in the model. Confounding is usually only one way: the covariate (Z) helps the confounded variable (X). Age is helping Smoking.
Confounding vs. Multicollinearity (2)
We can catch (multi)collinearity in the correlation matrix. Any single correlation > 0.9 indicates collinearity between just those two predictors. Any predictor that has several correlations between 0.5 and 0.9 with other predictors indicates multicollinearity. For confounding, there will usually be some correlation between X and Z, but it will not be very large. Our example:
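The correlation-matrix screen can be sketched in a few lines. The cutoffs 0.9 and 0.5 come from the rules of thumb above; reading "several" as "at least two" is an assumption, and the function name `screen_correlations` and the toy data are hypothetical.

```python
import numpy as np

def screen_correlations(X, names, pair_cut=0.9, low=0.5, min_count=2):
    """Flag predictor pairs with |r| > pair_cut (collinearity) and predictors
    with at least min_count correlations in [low, pair_cut] (multicollinearity)."""
    r = np.corrcoef(X, rowvar=False)          # columns of X are predictors
    k = len(names)
    pairs = [(names[i], names[j])
             for i in range(k) for j in range(i + 1, k)
             if abs(r[i, j]) > pair_cut]
    multi = []
    for i in range(k):
        others = np.delete(np.abs(r[i]), i)   # drop the diagonal entry
        if np.sum((others >= low) & (others <= pair_cut)) >= min_count:
            multi.append(names[i])
    return r, pairs, multi

# Toy data (made up): x2 is nearly a copy of x1, x3 is unrelated to both.
x1 = np.arange(10.0)
x3 = np.array([1.0, -1.0] * 5)
x2 = x1 + 0.1 * x3
r, pairs, multi = screen_correlations(np.column_stack([x1, x2, x3]),
                                      ["x1", "x2", "x3"])
print(pairs)   # pairs flagged as collinear
print(multi)   # predictors flagged for multicollinearity
```

On this toy data only the (x1, x2) pair is flagged, since x3 is only weakly correlated with the other two.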
Interaction
Interaction is (sort of) one step beyond confounding: not only does it make a difference to adjust for Z, but the relationship between Y and X is fundamentally different at different levels of Z. We can think of this as having a different regression line for each fixed level of Z. With no interaction, these lines would be parallel.
SBP Example
We found Age and Smk to both be important. Is it possible that they are interacting? Let X = age and Z = 0 for non-smokers, 1 for smokers.
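To see why interaction means non-parallel lines, it may help to write out a model with a product term for this coding of Z. The coefficient symbols $\beta_0, \dots, \beta_3$ and the error term $\varepsilon$ are notation assumed here for illustration:

```latex
\begin{aligned}
Y &= \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 XZ + \varepsilon \\
Z = 0 \text{ (non-smokers)}:\quad E[Y] &= \beta_0 + \beta_1 X \\
Z = 1 \text{ (smokers)}:\quad E[Y] &= (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X
\end{aligned}
```

The two lines differ in slope by $\beta_3$, so they are parallel exactly when $\beta_3 = 0$; $\beta_2$ alone only shifts the intercept.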
Interaction
Looking at plots can give us some idea of interaction (are the lines parallel?). However, it is very easy to just test whether the XZ interaction term is important. Treat it just as you would any other variable and do a partial F-test. Note that if a model includes the XZ interaction term, it should also include the X and Z main effects. We would never just look at the XZ term by itself.
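The partial F-test for the interaction term compares the model with X and Z only against the model with X, Z, and XZ. Below is a minimal sketch on simulated data; the data, coefficients, and helper names (`sse_and_params`, `partial_f`) are assumptions for illustration, not the SBP analysis.

```python
import numpy as np

def sse_and_params(predictors, y):
    """Residual sum of squares and number of fitted parameters (incl. intercept)."""
    X = np.column_stack([np.ones(len(y)), predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid), X.shape[1]

def partial_f(reduced, full, y):
    """Partial F statistic for the terms in `full` that are not in `reduced`."""
    sse_r, p_r = sse_and_params(reduced, y)
    sse_f, p_f = sse_and_params(full, y)
    df_num = p_f - p_r                 # number of terms being tested
    df_den = len(y) - p_f              # residual df of the full model
    f_stat = ((sse_r - sse_f) / df_num) / (sse_f / df_den)
    return f_stat, df_num, df_den

# Simulated (hypothetical) data: one response without and one with interaction.
rng = np.random.default_rng(1)
n = 100
x = rng.uniform(40, 70, n)
z = rng.integers(0, 2, n).astype(float)
noise = rng.normal(0, 5, n)
y_parallel = 60 + 1.5 * x + 10 * z + noise     # parallel lines
y_interact = y_parallel + 1.0 * x * z          # slopes differ by 1.0

main = np.column_stack([x, z])                 # main effects only
full = np.column_stack([x, z, x * z])          # main effects plus XZ
f_parallel, df1, df2 = partial_f(main, full, y_parallel)
f_interact, _, _ = partial_f(main, full, y_interact)
print(f"no interaction:   F({df1},{df2}) = {f_parallel:.2f}")
print(f"with interaction: F({df1},{df2}) = {f_interact:.2f}")
```

The statistic is compared against an F distribution with (df_num, df_den) degrees of freedom. If SciPy is available, `scipy.stats.f.sf(f_stat, df_num, df_den)` gives the p-value.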
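Age/Smk Interaction Model
Interaction is described mathematically using a product term: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$. Or just: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$, where $X_3$ is $X_1 X_2$.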
SBP Example
The interaction tests insignificant, so there is no significant interaction between age and smk. Suppose it were significant. We would then have to keep the age_smk interaction term AS WELL AS both the age and smk variables (even if age and smk themselves are insignificant).
Confounding vs. Interaction
Y = response, X = predictor, Z = covariate / 2nd predictor.
Is the estimated relationship between Y and X dramatically different if one adjusts or does not adjust for Z? If so: Confounding.
Is the estimated relationship between Y and X meaningfully different at different values of Z? If so: Interaction.
Correlations
One problem with using interaction terms is that they tend to be highly correlated with one or both of the original variables. In our example, the correlation between SMK and AGE_SMK turned out to be 0.98. This is NOT REAL!!! It is a form of “fake” collinearity; the variables aren’t really “fighting” to explain SS. To remove this “fake” collinearity, just center the variables: subtract the mean from all predictors. This doesn’t change the test of the interaction term or the overall fit of the model; it only removes what we are calling fake collinearity. (The main-effect estimates and their tests can change, because after centering they describe the effect of each variable at the mean of the other.)
How to center? SBP Example
Mean age was 53.25: subtract 53.25 from all the ages in the dataset and use these new values in the analysis. Mean smk was 0.53125 (do the same thing). After centering, the correlation between SMK and AGE_SMK is now 0.017 (so they weren’t really fighting, it just looked like it because we didn’t center). Maybe we should always center???
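The effect of centering can be checked numerically. The sketch below uses simulated data (an assumption; these are not the SBP values, so the correlations will not match 0.98 and 0.017 exactly). It compares the correlation between smk and the product term before and after centering, and confirms that the fitted interaction coefficient is the same either way.

```python
import numpy as np

def fit(predictors, y):
    """Least-squares coefficients for y on the predictors plus an intercept."""
    X = np.column_stack([np.ones(len(y)), predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Simulated (hypothetical) data, NOT the SBP dataset.
rng = np.random.default_rng(2)
n = 200
age = rng.uniform(40, 70, n)
smk = rng.integers(0, 2, n).astype(float)
sbp = 60 + 1.5 * age + 10 * smk + 0.3 * age * smk + rng.normal(0, 5, n)

# Center each predictor by subtracting its own mean.
age_c = age - age.mean()
smk_c = smk - smk.mean()

# Correlation of smk with the product term, raw vs. centered.
r_raw = np.corrcoef(smk, age * smk)[0, 1]
r_centered = np.corrcoef(smk_c, age_c * smk_c)[0, 1]

# Fit the interaction model both ways; index 3 is the product term.
b_raw = fit(np.column_stack([age, smk, age * smk]), sbp)
b_cen = fit(np.column_stack([age_c, smk_c, age_c * smk_c]), sbp)

print(f"corr(smk, age*smk), raw:      {r_raw:.3f}")
print(f"corr(smk, age*smk), centered: {r_centered:.3f}")
print(f"interaction coefficient, raw / centered: {b_raw[3]:.4f} / {b_cen[3]:.4f}")
```

The raw correlation is high mainly because age is far from zero, so age*smk is close to a rescaled copy of smk; centering removes that artifact without altering the interaction estimate.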
General Uses
Polynomial models are used in situations where the relationship between Y and X is non-linear. We can usually see it in scatterplots, and should definitely catch it in residual plots! Somewhat dangerous, since a polynomial model of order n - 1 will always fit n data points (with distinct X values) exactly. Example?
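As one possible example of the exact-fit danger, the sketch below fits five made-up points (chosen arbitrarily, with no polynomial pattern behind them) with a straight line and with a polynomial of order n - 1 = 4.

```python
import numpy as np

# Five made-up points; the values are arbitrary.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(x)

def r_squared(degree):
    """R-square of a least-squares polynomial fit of the given degree."""
    coefs = np.polyfit(x, y, degree)
    fitted = np.polyval(coefs, x)
    sse = np.sum((y - fitted) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

r2_linear = r_squared(1)        # straight line
r2_full = r_squared(n - 1)      # order n - 1 passes through every point
print(f"linear R-square:    {r2_linear:.2f}")
print(f"order-{n - 1} R-square:   {r2_full:.2f}")
```

The straight line gives an R-square of 0.64, while the order-4 polynomial reaches 1 simply because it has as many parameters as there are data points, not because the relationship is truly quartic.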
Strategy for fitting
CENTER your variables to avoid the “fake” (multi)collinearity. Use a special type of backward elimination procedure: test the highest order term first! If a higher order term is significant, you MUST include all lower order terms for that variable.
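The strategy can be sketched as a loop that starts at the highest order and stops at the first significant term. This is a rough illustration under stated assumptions: the function names are hypothetical, the data are simulated, and `f_crit=4.0` is only a stand-in cutoff; in practice the critical value (or p-value) should come from the F distribution with 1 and n - degree - 1 degrees of freedom.

```python
import numpy as np

def highest_order_f(xc, y, degree):
    """Partial F statistic for the x**degree term, added last."""
    def sse(d):
        X = np.vander(xc, d + 1, increasing=True)   # columns 1, x, ..., x**d
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)
    sse_full, sse_reduced = sse(degree), sse(degree - 1)
    df_den = len(y) - (degree + 1)
    return (sse_reduced - sse_full) / (sse_full / df_den)

def choose_degree(x, y, max_degree=3, f_crit=4.0):
    """Backward elimination on polynomial order, highest order first.
    f_crit is a placeholder cutoff (assumption), not a tabulated value."""
    xc = x - x.mean()                                # center first
    for d in range(max_degree, 0, -1):
        if highest_order_f(xc, y, d) > f_crit:
            return d          # keep x**d AND all lower order terms
    return 0                  # nothing significant: intercept only

# Simulated (hypothetical) data with a genuine cubic trend, 12 points.
rng = np.random.default_rng(3)
x = np.linspace(0, 5, 12)
y = 2 + x - 1.5 * x**2 + 0.4 * x**3 + rng.normal(0, 0.3, 12)
degree = choose_degree(x, y)
print("selected polynomial order:", degree)
```

Returning the first significant order, rather than testing each term independently, is what enforces the rule that lower order terms stay in the model whenever a higher order term is kept.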
Example
Problem 15.7 (sas/data available online): X = amount of vaccine, Y = measure of skin response in rats, 12 data points. If we run just a simple linear regression, the R-square is only 45%, so we will consider a polynomial model and try to do better!
Cubic Model
x is X, x2 is X^2 = X*X, x3 is X^3 = X*X*X, etc. X^3 is important, so we must keep X^2 and X. Why? The cubic model (the model with X, X^2, and X^3) now explains 82% of the variation (was only 45% for the linear model).