# Unit 11: Regression in Practice (Class 25)


Unit 11: Regression in Practice. Class 25.

http://xkcd.com/675/

http://chaospet.com/2009/12/14/164-it-goes-both-ways/

© Andrew Ho, Harvard Graduate School of Education

## Where is Unit 11 in our 11-Unit Sequence?

Building a solid foundation
- Unit 1: Introduction to simple linear regression
- Unit 2: Correlation and causality
- Unit 3: Inference for the regression model

Mastering the subtleties
- Unit 4: Regression assumptions: Evaluating their tenability
- Unit 5: Transformations to achieve linearity

Adding additional predictors
- Unit 6: The basics of multiple regression
- Unit 7: Statistical control in depth: Correlation and collinearity

Generalizing to other types of predictors and effects
- Unit 8: Categorical predictors I: Dichotomies
- Unit 9: Categorical predictors II: Polychotomies
- Unit 10: Interactions and quadratic effects

Pulling it all together
- Unit 11: Regression in practice. Common extensions.

## Design anticipates analysis. Design constrains analysis. Design is analysis. But…

Example: Brown, JD, L’Engle, KL, Pardun, CJ, Guo, G, Kenneavy, K, & Jackson, C (2006). Sexy media matter: Exposure to sexual content in music, movies, television, and magazines predicts Black and White adolescents’ sexual behavior. Pediatrics, 117(4), 1018-1027. http://www.unc.edu/depts/jomc/teenmedia/pdf/pediatrics%20longitudinal%204-3-06.pdf

Research Design: An ideal scenario
- Begin with a research question: Does exposure to sexy media predict sexual behavior?
- Obtain a sample from a target population: 1017 Black and White adolescents from 14 middle schools in NC (887 in our dataset).
- Experimental: Outcome, Treatment, Control. Sexual behavior, exposure to sexy media, exposure to media.
- Nonexperimental: Outcome, Question Predictor, Covariates (baseline exposure, age, race/eth, gender, SES, parents, friends, religion, early puberty).

Secondary Analysis: A common scenario
- Start a job working for some PI.
- Get handed a massive dataset.
- “I’m kind of interested in this outcome variable, can you tell me anything about it? Thanks.”
- Or, you find a dataset online, or obtain it from some source, or get assigned it as a final project in some stats class.

We know that it’s ideal to gather data motivated by a clear research question in anticipation of an eventual analysis. But we should be able to conduct responsible secondary analyses as well.

## Follow your instincts: Exploratory Data Analysis

Instinct 1: Get your eyes on the data.

Stata’s command: `codebook`
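A first look might run along the following lines. This is only a minimal sketch: it assumes the study dataset is already loaded in Stata (no file name is given here), it borrows the variable names `activity` and `sexmedia` from the commands later in this unit, and listing observation 6 is just an illustration of reading a single case.

```stata
* Assumes the dataset is already in memory; the file name is not given in these notes
* Variable names, storage types, and labels
describe
* Ranges, missing values, and value labels for two variables of interest
codebook activity sexmedia
* Percentiles, skewness, and kurtosis
summarize activity sexmedia, detail
* Read one observation in full and tell yourself its story
list in 6
```

Running `codebook` with no variable list describes every variable in the dataset, which is usually the better starting point for a dataset you have just been handed.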

## Follow your instincts: Visualize!

Instinct 2: Visualize. Univariate summaries.

`hist activity, kdens freq discrete`

Statistical Sleuth; Data Explorer

- Each histogram can tell you an incredible amount about the nature of the scale and the sample being measured… but don’t obsess just yet; regression is fairly robust (save this for S-061!).
- The goal: Pick an observation, and tell yourself its story. ID#6 is a 15-year-old White girl with sexual activity at the 75th percentile, average SES, reporting high religious importance (with 1/3 of participants) and average Sexual Media Diet, with above-average school engagement…
- Ask yourself: Should any of these mediate, moderate, or be confounded with my question predictor?

## Follow your instincts: Continue to fill your lab notebook

Instinct 3: Visualize. Bivariate summaries.

- After you understand your variables, their meanings, and their distributions, you may have the beginnings of some hypotheses, some “pet” variables that might be of particular interest to you.
- Begin to understand how these variables relate to each other.

`graph twoway (scatter activity sexmedia, msize(tiny) jitter(7)) (lfit activity sexmedia) (lpoly activity sexmedia, lcolor(blue)), legend(off) ytitle(Index of self-reported sexual activity)`

The lab notebook is a good metaphor, because it reminds you of the distinction between what you do and what you report.

- To be clear, do not describe univariate and bivariate distributions for pages on end. Describe the scales of the most important variables and their distributions to the extent that they a) motivate any remedial action or b) inform interpretation of regression results, and move on.

## Follow your instincts: Default tables

Means, standard deviations, correlations.

`estpost correlate activity sexmedia baseact age pubearly male black sesindex schengag religion parengag pdisapp fabstain, matrix`

`esttab, p unstack compress nonumbers`

## Managing and categorizing multiple predictors

If you are managing the study from the design stage, you will likely have a solid understanding of your outcome, your question predictor, and your covariates/controls. As a secondary data analyst, classifying variables as “outcomes,” “question predictors,” and “covariates/controls” is part of your job. Exploratory data analysis and substantive theory will guide you.

Outcome = Question Predictor(s) + Key Control Predictor(s) + Additional Control Predictor(s) + [Rival Hypothesis Predictor(s)]

| Role | Variables | Narrative |
| --- | --- | --- |
| Outcome | Sexual activity | |
| Question Predictor(s) | Sexy media exposure | Sexy media exposure does happen to be correlated with sexual activity, but this is clearly incomplete. |
| Key Control Predictor(s) | Baseline sexual activity; age; socioeconomic status | The association is maintained above and beyond obvious covariates, with perhaps some mediation. |
| Additional Control Predictor(s) | School engagement; parental engagement; importance of religion | …and less obvious covariates. |
| [Rival Hypothesis Predictor(s)] | Parental disapproval of sex; whether friends have had sex | …and even some covariates you definitely didn’t think of, but I did. |

The model cannot distinguish between these kinds of predictors. The distinctions are substantive, and they guide your narrative flow.

## Model selection algorithms (that you should never use!)

- “All models are wrong, but some are useful.” George E.P. Box (1979)
- “Far better an approximate answer to the right question… than an exact answer to the wrong question.” John W. Tukey (1962)
- “The hallmark of good science is that it uses models and ‘theory’ but never believes them.” Attributed to Martin Wilk in Tukey (1962)
- Occam’s razor: entia non sunt multiplicanda praeter necessitatem. Entities must not be multiplied beyond necessity. If two competing theories lead to the same predictions, the simpler one is better. William of Occam (14th century)

In contrast to our recommended approach, I hope this seems tempting but loony. Of these approaches, Best Subsets can be the most informative in a many-predictor context, reminding you that many possible models can lead to very similar global fit statistics. In general, however, these data mining approaches should be avoided.

## Approaches to variable selection and model building

A continuum of variable selection approaches anchored by two extreme caricatures.

Data-driven
- Start with an outcome and a set of predictors.
- Run a variable selection algorithm that maximizes global prediction along some criterion (R-sq, adj-R-sq…).
- Do very little if any adjustment of the best-fitting model: no questioning of irrelevant, redundant, or confounded variables, no effort at identifying substantively central variables, no sensitivity study.

## Model selection criteria

- Does the model chosen reflect your underlying theory?
- Does the model allow you to address the effects of your key question predictor(s)?
- Are you excluding predictors that are statistically significant? If so, why?
- Are you including predictors you could reasonably set aside (parsimony principle)?

One of the most valuable frameworks you’ll take from this class is the tabular display of multiple plausible models. It allows you to evaluate the robustness of your conclusions under plausible model specifications. The goal is as much to explain dependencies as it is to arrive at a final model or models.
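A tabular display of plausible models can be assembled with the same `esttab` tools used for the correlation table. The sketch below is one possible specification, not the set of models fitted in the lecture: the variable names come from the `estpost correlate` command earlier in this unit, and their assignment to key controls, additional controls, and rival-hypothesis predictors is an assumption based on those names. It requires the user-written `estout` package.

```stata
* Requires the estout package: ssc install estout
eststo clear

* M1: question predictor only
eststo m1: regress activity sexmedia

* M2: add key controls (assumed: baseline activity, age, SES index)
eststo m2: regress activity sexmedia baseact age sesindex

* M3: add additional controls (assumed: school and parental engagement, religion)
eststo m3: regress activity sexmedia baseact age sesindex schengag parengag religion

* M4: add rival-hypothesis predictors (assumed: parental disapproval, friends abstaining)
eststo m4: regress activity sexmedia baseact age sesindex schengag parengag religion pdisapp fabstain

* Side-by-side table with standard errors, R-squared, and adjusted R-squared
esttab m1 m2 m3 m4, se r2 ar2 compress nonumbers mtitles("M1" "M2" "M3" "M4")
```

Reading across the `sexmedia` row shows how the estimated coefficient on the question predictor changes as each block of controls enters, which is the robustness check the table is meant to support.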

## It’s worth remembering what “statistical control” can and can’t do

Residuals are Zagat ratings above and beyond what is expected by accounting for cost. Are residuals here “the best bang for your buck”? Are they “the best value”? If restaurants were schools, ratings were test score gains, and cost were socioeconomic status, would residuals show “value added”?
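The residuals in question can be computed directly after fitting the regression. This is a minimal sketch that assumes a restaurant dataset is in memory with hypothetical variable names `rating` and `cost`; neither name appears in these notes.

```stata
* Hypothetical variable names: rating (Zagat rating) and cost (price of a meal)
regress rating cost

* Rating expected for a restaurant at each cost
predict rating_hat, xb

* Rating above or below what cost alone predicts
predict rating_resid, residuals

* Restaurants with the largest positive residuals
gsort -rating_resid
list rating cost rating_hat rating_resid in 1/10
```

A large positive residual says only that a restaurant is rated higher than others of similar cost in this sample. Whether to call that “value” is a substantive judgment the regression cannot make for you.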

## Slippery slopes to unwarranted causal inference

From the most defensible phrasing to the least:

- Cost is related to ratings; cost is associated with ratings. Cost predicts ratings.
- A \$1 difference in cost predicts/is associated with a .186 difference in ratings. A \$1 increase in cost predicts/is associated with a .186 increase in ratings.
- As cost increases \$1, ratings are predicted to increase by .186.
- Cost has an effect on ratings. The effect of cost on ratings is significant. Cost impacts/has an impact on ratings. A significant impact of cost on ratings. A \$1 increase in cost will lead to a .186 increase in ratings. Cost drives/causes ratings. Cost is a significant determinant of ratings.

I typically draw the line about here, though you’d be safer higher.

We’ve been drifting here in technical writing, in the “results” section, but always circle back to defensible interpretations, in the “discussion” section.
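As a worked example of the more defensible phrasing, take the fitted slope of .186 from the statements above. The intercept, written $\hat\beta_0$ here, is not given in these notes, and the \$10 gap is chosen only for illustration. Comparing two restaurants whose costs differ by \$10:

$$
\hat{Y}(\text{cost}+10) - \hat{Y}(\text{cost}) = \left[\hat\beta_0 + 0.186\,(\text{cost}+10)\right] - \left[\hat\beta_0 + 0.186\,\text{cost}\right] = 0.186 \times 10 = 1.86
$$

So restaurants that differ by \$10 in cost are predicted to differ by 1.86 rating points on average. The arithmetic compares different restaurants; it says nothing about what would happen to one restaurant’s rating if it raised its prices.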

## What we’ve learned: The world of S-030

A single outcome variable
- Continuous, interval scaled (noncategorical)
- May be transformed to meet regression assumptions of normally distributed residuals (e.g., log(Y)) (Unit 5)
- Independent and identically normally distributed residuals centered on 0 (Unit 4)

A single predictor variable (Units 1-4)
- May be transformed to achieve linearity (Unit 5)
- May be dichotomous (Unit 8) or polychotomous (Unit 9)

Multiple predictor variables (Units 6-7)
- Interactions: Products of predictors (Unit 10)
- Quadratic/Polynomial Regression for nonlinear relationships (Unit 10)

## What we haven’t learned (yet): The S-052 Roadmap, by Dr. John Willett

What might a next course in regression analysis look like? Starting from Multiple Regression Analysis:

- Do your residuals meet the required assumptions?
  - Test for residual normality.
  - Use influence statistics to detect atypical datapoints.
- If your residuals are not independent, replace OLS by GLS regression analysis.
  - Specify a Multilevel Model.
  - Are the data longitudinal? Use Individual growth modeling.
- If time is a predictor, you need discrete-time survival analysis…
- If your outcome is categorical, you need to use…
  - Discriminant Analysis
  - Multinomial logistic regression analysis (polychotomous outcome)
  - Binomial logistic regression analysis (dichotomous outcome)
- If you have more predictors than you can deal with,
  - Create taxonomies of fitted models and compare them.
  - Form composites of the indicators of any common construct.
  - Conduct a Principal Components Analysis.
  - Use Cluster Analysis.
- If your outcome vs. predictor relationship is non-linear,
  - Transform the outcome or predictor.
  - Use non-linear regression analysis.
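The first branch of the roadmap, checking residual normality and looking for atypical datapoints, can be sketched with standard Stata post-estimation commands. This is an illustration under assumptions, not material from S-052: the model and variable names are borrowed from the commands earlier in this unit, and the choice of predictors is arbitrary.

```stata
* Illustrative model; the predictor set is an assumption, not the lecture's final model
regress activity sexmedia baseact age sesindex

* Residual normality: Shapiro-Wilk test and a normal quantile plot
predict ehat, residuals
swilk ehat
qnorm ehat

* Influence statistics: leverage, Cook's distance, and DFBETAs for each predictor
predict lev, leverage
predict cooksd, cooksd
dfbeta

* Inspect the observations with the largest Cook's distances
gsort -cooksd
list activity sexmedia lev cooksd in 1/5
```

The listing is a starting point for the “tell yourself its story” habit from exploratory analysis: look at who the flagged observations are before deciding whether anything needs to be done about them.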

## You’ve come a long way…

- Building a solid foundation: Unit 1 (Introduction to simple linear regression), Unit 2 (Correlation and causality), Unit 3 (Inference for the regression model)
- Mastering the subtleties: Unit 4 (Regression assumptions: Evaluating their tenability), Unit 5 (Transformations to achieve linearity)
- Adding additional predictors: Unit 6 (The basics of multiple regression), Unit 7 (Statistical control in depth: Correlation and collinearity)
- Generalizing to other types of predictors and effects: Unit 8 (Categorical predictors I: Dichotomies), Unit 9 (Categorical predictors II: Polychotomies), Unit 10 (Interactions and quadratic effects)
- Pulling it all together: Unit 11 (Regression in practice. Common extensions.)

## You’ve come a long way…

- Coasting on intro stat knowledge
- Sampling distributions!? Type II error?!
- Multiple regression isn’t so ba- Gummy bears in Jello?!
- I can kind of “see” diagnostics and transformations.
- Why do the slopes keep changing?
- Starting to get categoricals…
- Effects of variables on other effects?!
- Final project