
Unit 11: Regression in Practice Class 25… © Andrew Ho, Harvard Graduate.


1 Unit 11: Regression in Practice Class 25… http://xkcd.com/675/ http://chaospet.com/2009/12/14/164-it-goes-both-ways/ © Andrew Ho, Harvard Graduate School of Education Unit 11 / Page 1

2 Where is Unit 11 in our 11-Unit Sequence?
Building a solid foundation
– Unit 1: Introduction to simple linear regression
– Unit 2: Correlation and causality
– Unit 3: Inference for the regression model
Mastering the subtleties
– Unit 4: Regression assumptions: Evaluating their tenability
– Unit 5: Transformations to achieve linearity
Adding additional predictors
– Unit 6: The basics of multiple regression
– Unit 7: Statistical control in depth: Correlation and collinearity
Generalizing to other types of predictors and effects
– Unit 8: Categorical predictors I: Dichotomies
– Unit 9: Categorical predictors II: Polychotomies
– Unit 10: Interactions and quadratic effects
Pulling it all together
– Unit 11: Regression in practice. Common Extensions.

3 Design anticipates analysis. Design constrains analysis. Design is analysis. But…
Example: Brown, JD, L’Engle, KL, Pardun, CJ, Guo, G, Kenneavy, K, & Jackson, C (2006). Sexy media matter: Exposure to sexual content in music, movies, television, and magazines predicts Black and White adolescents’ sexual behavior. Pediatrics, 117(4), 1018-1027. http://www.unc.edu/depts/jomc/teenmedia/pdf/pediatrics%20longitudinal%204-3-06.pdf
Research Design: An ideal scenario
– Begin with a research question: Does exposure to sexy media predict sexual behavior?
– Obtain a sample from a target population: 1017 Black and White adolescents from 14 middle schools in NC (887 in our dataset).
– Experimental: Outcome, Treatment, Control. Here: sexual behavior, exposure to sexy media, exposure to media.
– Nonexperimental: Outcome, Question Predictor, Covariates (baseline exposure, age, race/eth, gender, ses, parents, friends, religion, early puberty).
Secondary Analysis: A common scenario
– Start a job working for some PI.
– Get handed a massive dataset.
– “I’m kind of interested in this outcome variable, can you tell me anything about it? Thanks.”
– Or, you find a dataset online, or obtain it from some source, or get assigned it as a final project in some stats class.
We know that it’s ideal to gather data motivated by a clear research question in anticipation of an eventual analysis. But we should be able to conduct responsible secondary analyses as well.

4 Follow your instincts: Exploratory Data Analysis
Instinct 1: Get your eyes on the data.
Stata’s command: codebook
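A minimal sketch of what this first look might be in a Stata do-file. It assumes the study dataset is already in memory and uses the variable names activity and sexmedia that appear in the commands on later slides; the dataset file itself is not named in the slides, so no use command is shown.

```stata
* Assumption: the dataset is already loaded; variable names follow the later slides.
describe                              // variable names, storage types, and labels
codebook, compact                     // one line per variable: N, unique values, mean, min, max
codebook activity sexmedia            // full detail for the outcome and the question predictor
summarize activity sexmedia, detail   // percentiles, skewness, and kurtosis
list in 1/10                          // read a few rows and tell yourself each observation's story
```

Starting with the compact codebook keeps the first pass short; the full codebook output is most useful for the handful of variables that will carry the analysis.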

5 Follow your instincts: Visualize!
Instinct 2: Visualize. Univariate Summaries
hist activity, kdens freq discrete
Statistical Sleuth; Data Explorer
– Each histogram can tell you an incredible amount about the nature of the scale and the sample being measured… but don’t obsess just yet, regression is fairly robust (save this for S-061!)
– The goal: Pick an observation, and tell yourself its story. ID#6 is a 15-year-old White girl with sexual activity at the 75th percentile, average SES, reporting high religious importance (as do 1/3 of participants) and average Sexual Media Diet, with above average school engagement…
– Ask yourself: Should any of these mediate, moderate, or be confounded with my question predictor?

6 Follow your instincts: Continue to fill your lab notebook
Instinct 3: Visualize. Bivariate Summaries
– After you understand your variables, their meanings, and their distributions, you may have the beginnings of some hypotheses, some “pet” variables that might be of particular interest to you.
– Begin to understand how these variables relate to each other.
graph twoway (scatter activity sexmedia, msize(tiny) jitter(7)) (lfit activity sexmedia) ///
    (lpoly activity sexmedia, lcolor(blue)), legend(off) ytitle(Index of self-reported sexual activity)
The lab notebook is a good metaphor, because it reminds you of the distinction between what you do and what you report.
– To be clear, do not describe univariate and bivariate distributions for pages on end. Describe the scales of the most important variables and their distributions to the extent that they a) motivate any remedial action or b) inform interpretation of regression results, and move on.

7 Follow your instincts: Default tables
Means, Standard Deviations, Correlations
* estpost and esttab come from the user-written estout package (ssc install estout)
estpost correlate activity sexmedia baseact age pubearly male black sesindex schengag religion parengag pdisapp fabstain, matrix
esttab, p unstack compress nonumbers

8 Managing and categorizing multiple predictors
If you are managing the study from the design stage, you will likely have a solid understanding of your outcome, your question predictor, and your covariates/controls. As a secondary data analyst, classifying variables as “outcomes,” “question predictors,” and “covariates/controls” is part of your job. Exploratory data analysis and substantive theory will guide you.
Outcome = Question Predictor(s) + Key Control Predictor(s) + Additional Control Predictor(s) + [Rival Hypothesis Predictor(s)]
– Outcome: Sexual activity
– Question Predictor(s): Sexy media exposure
– Key Control Predictor(s): Baseline sexual activity, age, socioeconomic status
– Additional Control Predictor(s): School engagement, parental engagement, importance of religion
– [Rival Hypothesis Predictor(s)]: Parental disapproval of sex, whether friends have had sex
The model cannot distinguish among these kinds of predictors. These are substantive distinctions, and they guide your narrative flow:
– Sexy media exposure does happen to be correlated with sexual activity, but this is clearly incomplete.
– The association is maintained above and beyond obvious covariates, with perhaps some mediation.
– …and less obvious covariates.
– …and even some covariates you definitely didn’t think of, but I did.
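One way to keep these categories straight in a do-file is to store each group of variable names in a macro. The sketch below is only an illustration: the macro names are hypothetical, and the assignment of variable names (taken from the correlation command on the previous slide) to the category labels is an assumption, including the guess that pdisapp and fabstain are the two rival hypothesis predictors. The variables pubearly, male, and black are left out because the slide does not place them in a category.

```stata
* Hypothetical grouping; the mapping of variable names to slide labels is an assumption.
* Assumes the dataset is already in memory.
global outcome  "activity"
global question "sexmedia"
global keyctl   "baseact age sesindex"
global addctl   "schengag parengag religion"
global rival    "pdisapp fabstain"

* The regression treats every predictor the same way; the groups only organize the narrative.
regress $outcome $question $keyctl $addctl $rival
```

Globals are used here so the groups persist across interactive commands; inside a single do-file, local macros would work equally well.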

9 Order of Operations
Strategy 1: Question predictors first
– Model 1: Question Predictors. What interests you most. For nonexperimental data, this is a misleading baseline.
– Model 2: Adding Key Controls. Is the coefficient for the question predictors similar? May keep these in the model regardless of statistical significance.
– Model 3: Add Additional Controls. May keep these only as necessary, perhaps accepting/rejecting them as a block if they are substantively linked and do not change the question predictor’s coefficient when added/removed individually.
– Model 4: Check Rival Hypotheses. See whether “effects” of the question predictor continue to be robust.
Strategy 2: Control model first
– Model 1: Key controls. Start with the status quo.
– Model 2: Add additional controls. May keep only if statistically significant, perhaps as a block.
– Model 3: Add question predictors. See if these make a difference over and above the control predictors.
– Model 4: Check Rival Hypotheses. As before.
The choice between strategies is motivated just as much by substance as by statistics. If key controls are so well established (and generally strong) that they demand being addressed, then Strategy 2, or something approaching it, may be more natural.
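Strategy 1 can be sketched as a sequence of stored regressions followed by one side-by-side table. This is an illustration under assumptions, not the course's own do-file: it assumes the dataset is in memory, that the estout package is installed, and that the variable names map to the categories as noted in the comments.

```stata
* Sketch of Strategy 1 (question predictors first).
* Assumed mapping of slide labels to variable names:
*   key controls:        baseact age sesindex
*   additional controls: schengag parengag religion
*   rival hypotheses:    pdisapp fabstain
* Requires the user-written estout package: ssc install estout

* Model 1: question predictor only
regress activity sexmedia
estimates store m1

* Model 2: add key controls
regress activity sexmedia baseact age sesindex
estimates store m2

* Model 3: add additional controls, then test them jointly as a block
regress activity sexmedia baseact age sesindex schengag parengag religion
estimates store m3
test schengag parengag religion

* Model 4: add rival hypothesis predictors
regress activity sexmedia baseact age sesindex schengag parengag religion pdisapp fabstain
estimates store m4

* Side-by-side table: coefficients, standard errors, R-squared, adjusted R-squared
esttab m1 m2 m3 m4, se r2 ar2 compress mtitles("Model 1" "Model 2" "Model 3" "Model 4")
```

When reading the table, follow the sexmedia row across columns to see how its coefficient changes, and check the N row: if predictors have missing values, the models may be fitted to different samples. Strategy 2 uses the same commands with the predictors entered in a different order.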

10 Model selection algorithms (that you should never use!)
– “All models are wrong, but some are useful.” George E.P. Box (1979)
– “Far better an approximate answer to the right question…than an exact answer to the wrong question.” John W. Tukey (1962)
– “The hallmark of good science is that it uses models and ‘theory’ but never believes them.” Attributed to Martin Wilk in Tukey (1962)
– Occam’s razor: entia non sunt multiplicanda praeter necessitatem (“Entities must not be multiplied beyond necessity”). If two competing theories lead to the same predictions, the simpler one is better. William of Occam (14th century)
In contrast to our recommended approach, I hope this seems tempting but loony. Of these approaches, Best Subsets can be the most informative in a many-predictor context, reminding you that many possible models can lead to very similar global fit statistics. In general, however, these data mining approaches should be avoided.

11 Approaches to variable selection and model building
A continuum of variable selection approaches anchored by two extreme caricatures
Data-driven
– Start with an outcome and a set of predictors.
– Run a variable selection algorithm that maximizes global prediction along some criterion (R-sq, adj-R-sq…)
– Do very little, if any, adjustment of the best-fitting model: no questioning of irrelevant, redundant, or confounded variables, no effort at identifying substantively central variables, no sensitivity study.

12 Model selection criteria
– Does the model chosen reflect your underlying theory?
– Does the model allow you to address the effects of your key question predictor(s)?
– Are you excluding predictors that are statistically significant? If so, why?
– Are you including predictors you could reasonably set aside (parsimony principle)?
One of the most valuable frameworks you’ll take from this class is the tabular display of multiple plausible models. It allows you to evaluate the robustness of your conclusions under plausible model specifications. The goal is as much to explain dependencies as it is to arrive at a final model or models.

13 It’s worth remembering what “statistical control” can and can’t do
Residuals are Zagat ratings above and beyond what is expected by accounting for cost. Are residuals here “the best bang for your buck”? Are they “the best value”? If restaurants were schools, ratings were test score gains, and cost were socioeconomic status, would residuals show “value added”?
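To make the question concrete, the residuals in this example could be computed as follows. The variable names rating, cost, and name are hypothetical, since the slide does not show the dataset, and the sketch assumes one row per restaurant with the data already in memory.

```stata
* Hypothetical variable names: rating (Zagat rating), cost, name (restaurant name).
regress rating cost

* Fitted value: the rating expected for a restaurant at this cost
predict rating_hat, xb

* Residual: observed rating minus the rating expected given cost
predict rating_resid, residuals

* Restaurants with the largest positive residuals
gsort -rating_resid
list name cost rating rating_hat rating_resid in 1/10
```

A large positive residual only says that a restaurant is rated higher than others at the same cost under this one-predictor linear model. Whether that deserves the label “best value” is the interpretive question the slide raises, and the code does not settle it.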

14 Slippery slopes to unwarranted causal inference
Cost is related to ratings; cost is associated with ratings. Cost predicts ratings.
– A $1 difference in cost predicts/is associated with a 0.186 difference in ratings. A $1 increase in cost predicts/is associated with a 0.186 increase in ratings.
– As cost increases $1, ratings are predicted to increase by 0.186.
» Cost has an effect on ratings. The effect of cost on ratings is significant. Cost impacts/has an impact on ratings. A significant impact of cost on ratings. A $1 increase in cost will lead to a 0.186 increase in ratings. Cost drives/causes ratings. Cost is a significant determinant of ratings.
I typically draw the line about here, though you’d be safer higher. We’ve been drifting here in technical writing, in the “results” section, but always circle back to defensible interpretations, in the “discussion” section.
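The statements at the safe end of this list are all restatements of the fitted line. As a worked sketch, with the intercept left symbolic because the slide reports only the slope, and with a $10 comparison chosen purely for illustration:

```latex
% Fitted line; only the slope 0.186 is given on the slide, the intercept is left symbolic.
\[
  \widehat{\mathrm{rating}} = \hat{\beta}_0 + 0.186 \times \mathrm{cost}
\]
% Comparing two restaurants A and B whose costs differ by \$10:
\[
  \widehat{\mathrm{rating}}_A - \widehat{\mathrm{rating}}_B
  = 0.186 \times (\mathrm{cost}_A - \mathrm{cost}_B)
  = 0.186 \times 10
  = 1.86
\]
```

The arithmetic supports the comparative phrasing: restaurants that differ by $10 in cost are predicted to differ by 1.86 rating points. It says nothing about what would happen to one restaurant's rating if its prices were raised, which is where the causal phrasings overreach.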

15 What we’ve learned: The world of S-030
A single outcome variable
– Continuous, interval scaled (noncategorical)
– May be transformed to meet regression assumptions of normally distributed residuals (e.g., log(Y)) (Unit 5)
– Independent and identically normally distributed residuals centered on 0 (Unit 4)
A single predictor variable (Units 1-4)
– May be transformed to achieve linearity (Unit 5)
– May be dichotomous (Unit 8) or polychotomous (Unit 9)
Multiple predictor variables (Units 6-7)
– Interactions: Products of predictors (Unit 10)
– Quadratic/Polynomial Regression for nonlinear relationships (Unit 10)
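The pieces on this slide can be collected into one model statement. The sketch below shows a two-predictor case; the notation (the betas, epsilon, sigma) is an assumption for illustration and is not taken from the slide.

```latex
% Two-predictor illustration with an interaction and a quadratic term.
% Y may itself be a transformed outcome, e.g. log(Y) (Unit 5).
% X_1 and X_2 may be continuous, transformed, or dummy-coded categories (Units 5, 8, 9).
\[
  Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}
        + \beta_3 X_{1i} X_{2i} + \beta_4 X_{1i}^{2} + \varepsilon_i ,
  \qquad
  \varepsilon_i \sim N(0, \sigma^{2}) \ \text{independently and identically}
\]
```

Dropping the last two terms gives the basic multiple regression of Units 6-7; dropping the second predictor as well gives the simple linear regression of Units 1-4.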

16 What we haven’t learned (yet): The S-052 Roadmap, by Dr. John Willett
What might a next course in regression analysis look like?
Multiple Regression Analysis
– Do your residuals meet the required assumptions? Test for residual normality. Use influence statistics to detect atypical datapoints.
– Are the data longitudinal? Use individual growth modeling.
– If your residuals are not independent, replace OLS by GLS regression analysis. Specify a multilevel model.
– If time is a predictor, you need discrete-time survival analysis…
– If your outcome is categorical, you need to use… discriminant analysis, multinomial logistic regression analysis (polychotomous outcome), or binomial logistic regression analysis (dichotomous outcome).
– If you have more predictors than you can deal with: create taxonomies of fitted models and compare them, conduct a Principal Components Analysis, form composites of the indicators of any common construct, or use Cluster Analysis.
– If your outcome vs. predictor relationship is non-linear: transform the outcome or predictor, or use non-linear regression analysis.

17 You’ve come a long way…
Building a solid foundation
– Unit 1: Introduction to simple linear regression
– Unit 2: Correlation and causality
– Unit 3: Inference for the regression model
Mastering the subtleties
– Unit 4: Regression assumptions: Evaluating their tenability
– Unit 5: Transformations to achieve linearity
Adding additional predictors
– Unit 6: The basics of multiple regression
– Unit 7: Statistical control in depth: Correlation and collinearity
Generalizing to other types of predictors and effects
– Unit 8: Categorical predictors I: Dichotomies
– Unit 9: Categorical predictors II: Polychotomies
– Unit 10: Interactions and quadratic effects
Pulling it all together
– Unit 11: Regression in practice. Common Extensions.

18 You’ve come a long way…
– Coasting on intro stat knowledge
– Sampling distributions!? Type II error?!
– Multiple regression isn’t so ba- Gummy bears in Jello?!
– I can kind of “see” diagnostics and transformations.
– Why do the slopes keep changing?
– Starting to get categoricals…
– Effects of variables on other effects?!
– Final project

19 And there are next steps to be taken…
Some themes of this course can help you to keep up your momentum:
– Treat statistics like a language. Use it (in active, participatory practice) or lose it.
– Create study groups at your place of employment.
– Collaborate on projects. Submit papers to conferences. Attend conferences.
– Read quantitative blogs and publications (538, Information is Beautiful, … xkcd).
– Keep in touch with us. We can add you to future course websites and keep you thinking about next steps.
– Keep in touch with each other.
– Google (but with prejudice). UCLA is particularly solid. YouTube, hit and miss.
– Ask to take on quantitative work at work. Look for data analysis opportunities. Everywhere.

20 Acknowledgements
– Huge thanks to Judy Singer for a course blueprint that even I couldn’t completely flub.
– Thanks to the LTC for support throughout this semester.
– Thanks to Melita Garrett for his administrative support.
– Thanks to xkcd…
The S-030 Experience:
– Huge thanks to the TFs, Ann, Beth, Dave, James, Marc, Priya, and Vidur, for supporting the better half of learning, outside of lecture.
– And thanks to you all for making my second and final time teaching S-030 so enjoyable and memorable.
http://xkcd.com/1038//

