Published by Amberly Gibbs
Modified over 2 years ago
Unit 11: Regression in Practice
Class 25
http://xkcd.com/675/
http://chaospet.com/2009/12/14/164-it-goes-both-ways/
© Andrew Ho, Harvard Graduate School of Education. Unit 11 / Page 1

Where is Unit 11 in our 11-Unit Sequence?
Building a solid foundation
– Unit 1: Introduction to simple linear regression
– Unit 2: Correlation and causality
– Unit 3: Inference for the regression model
Mastering the subtleties
– Unit 4: Regression assumptions: Evaluating their tenability
– Unit 5: Transformations to achieve linearity
Adding additional predictors
– Unit 6: The basics of multiple regression
– Unit 7: Statistical control in depth: Correlation and collinearity
Generalizing to other types of predictors and effects
– Unit 8: Categorical predictors I: Dichotomies
– Unit 9: Categorical predictors II: Polychotomies
– Unit 10: Interactions and quadratic effects
Pulling it all together
– Unit 11: Regression in practice. Common extensions.

Design anticipates analysis. Design constrains analysis. Design is analysis. But…
Example: Brown, JD, L’Engle, KL, Pardun, CJ, Guo, G, Kenneavy, K, & Jackson, C (2006). Sexy media matter: Exposure to sexual content in music, movies, television, and magazines predicts Black and White adolescents’ sexual behavior. Pediatrics, 117(4), 1018-1027. http://www.unc.edu/depts/jomc/teenmedia/pdf/pediatrics%20longitudinal%204-3-06.pdf
Research Design: An ideal scenario
– Begin with a research question: Does exposure to sexy media predict sexual behavior?
– Obtain a sample from a target population: 1017 Black and White adolescents from 14 middle schools in NC (887 in our dataset).
– Experimental: Outcome, Treatment, Control. Sexual behavior, exposure to sexy media, exposure to media.
– Nonexperimental: Outcome, Question Predictor, Covariates (baseline exposure, age, race/eth, gender, SES, parents, friends, religion, early puberty).
Secondary Analysis: A common scenario
– Start a job working for some PI.
– Get handed a massive dataset.
– “I’m kind of interested in this outcome variable, can you tell me anything about it? Thanks.”
– Or, you find a dataset online, or obtain it from some source, or get assigned it as a final project in some stats class.
We know that it’s ideal to gather data motivated by a clear research question in anticipation of an eventual analysis. But we should be able to conduct responsible secondary analyses as well.

Follow your instincts: Exploratory Data Analysis
Instinct 1: Get your eyes on the data
Stata’s command: codebook

Follow your instincts: Visualize!
Instinct 2: Visualize. Univariate Summaries
hist activity, kdens freq discrete
Statistical Sleuth; Data Explorer
– Each histogram can tell you an incredible amount about the nature of the scale and the sample being measured… but don’t obsess just yet: regression is fairly robust (save this for S-061!).
– The goal: Pick an observation, and tell yourself its story. ID#6 is a 15-year-old White girl with sexual activity at the 75th percentile, average SES, reporting high religious importance (along with 1/3 of participants) and average Sexual Media Diet, with above-average school engagement…
– Ask yourself: Should any of these mediate, moderate, or be confounded with my question predictor?

Follow your instincts: Continue to fill your lab notebook
Instinct 3: Visualize. Bivariate Summaries
– After you understand your variables, their meanings, and their distributions, you may have the beginnings of some hypotheses, some “pet” variables that might be of particular interest to you.
– Begin to understand how these variables relate to each other.
graph twoway (scatter activity sexmedia, msize(tiny) jitter(7)) (lfit activity sexmedia) ///
    (lpoly activity sexmedia, lcolor(blue)), legend(off) ytitle(Index of self-reported sexual activity)
The lab notebook is a good metaphor, because it reminds you of the distinction between what you do and what you report.
– To be clear, do not describe univariate and bivariate distributions for pages on end. Describe the scales of the most important variables, describe their distributions to the extent that they a) motivate any remedial action or b) inform interpretation of regression results, and move on.

Follow your instincts: Default tables
Means, Standard Deviations, Correlations
estpost correlate activity sexmedia baseact age pubearly male black sesindex schengag religion parengag pdisapp fabstain, matrix
esttab, p unstack compress nonumbers

Managing and categorizing multiple predictors
If you are managing the study from the design stage, you will likely have a solid understanding of your outcome, your question predictor, and your covariates/controls. As a secondary data analyst, classifying variables as “outcomes,” “question predictors,” and “covariates/controls” is part of your job. Exploratory data analysis and substantive theory will guide you.
Outcome
– Sexual activity
Question Predictor(s)
– Sexy media exposure
Key Control Predictor(s)
– Baseline sexual activity
– Age
– Socioeconomic status
Additional Control Predictor(s)
– School engagement
– Parental engagement
– Importance of religion
[Rival Hypothesis Predictor(s)]
– Parental disapproval of sex
– Whether friends have had sex
The model itself cannot distinguish among these kinds of predictors. The distinctions are substantive, and they guide your narrative flow:
– Sexy media exposure does happen to be correlated with sexual activity, but this is clearly incomplete.
– The association is maintained above and beyond obvious covariates, with perhaps some mediation.
– …and less obvious covariates.
– …and even some covariates you definitely didn’t think of, but I did.

Order of Operations
Strategy 1: Question predictors first
– Model 1: Question predictors. What interests you most. For nonexperimental data, this is a misleading baseline.
– Model 2: Add key controls. Is the coefficient for the question predictors similar? May keep these in the model regardless of statistical significance.
– Model 3: Add additional controls. May keep these only as necessary, perhaps accepting/rejecting as a block if they are substantively linked and don’t change the question predictor if added/removed individually.
– Model 4: Check rival hypotheses. See whether “effects” of the question predictor continue to be robust.
Strategy 2: Control model first
– Model 1: Key controls. Start with the status quo.
– Model 2: Add additional controls. May keep only if statistically significant, perhaps as a block.
– Model 3: Add question predictors. See if these make a difference over and above the control predictors.
– Model 4: Check rival hypotheses. As before.
The choice is motivated just as much by substance as by statistics. If key controls are so well established (and generally strong) that they demand being addressed, then Strategy 2, or something approaching it, may be more natural.

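Strategy 1 can be sketched in Stata as a sequence of nested fits that are stored and then tabulated side by side. This is a minimal sketch under stated assumptions, not the course's own code: the data are simulated stand-ins with arbitrary coefficients, the variable names are borrowed from the correlation command earlier in the deck, and treating pdisapp as the rival-hypothesis predictor is an assumption made for illustration.

```stata
* Simulated stand-in data: NOT the study's data; all coefficients are arbitrary
clear
set seed 12345
set obs 887
generate baseact  = rnormal()
generate age      = 12 + floor(3*runiform())
generate sesindex = rnormal()
generate schengag = rnormal()
generate religion = rnormal()
generate pdisapp  = rnormal()
generate sexmedia = 0.4*baseact + rnormal()
generate activity = 0.3*sexmedia + 0.5*baseact + 0.1*sesindex + rnormal()

* Model 1: question predictor only (a misleading baseline for nonexperimental data)
regress activity sexmedia
estimates store m1

* Model 2: add key controls
regress activity sexmedia baseact age sesindex
estimates store m2

* Model 3: add additional controls
regress activity sexmedia baseact age sesindex schengag religion
estimates store m3

* Model 4: check a rival hypothesis (assumed here to be pdisapp)
regress activity sexmedia baseact age sesindex schengag religion pdisapp
estimates store m4

* Side-by-side taxonomy of fitted models: coefficients, standard errors, N, R-squared
estimates table m1 m2 m3 m4, b(%9.3f) se(%9.3f) stats(N r2)
```

Reading across the sexmedia row of the resulting table shows how the question predictor's coefficient moves as each block of controls enters. Strategy 2 would use the same commands with the blocks entered in a different order.
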
Model selection algorithms (that you should never use!)
– “All models are wrong, but some are useful.” George E.P. Box (1979)
– “Far better an approximate answer to the right question…than an exact answer to the wrong question.” John W. Tukey (1962)
– “The hallmark of good science is that it uses models and ‘theory’ but never believes them.” Attributed to Martin Wilk in Tukey (1962)
– Occam’s razor: entia non sunt multiplicanda praeter necessitatem. Entities must not be multiplied beyond necessity. If two competing theories lead to the same predictions, the simpler one is better. William of Occam (14th century)
In contrast to our recommended approach, I hope this seems tempting but loony. Of these approaches, Best Subsets can be the most informative in a many-predictor context, reminding you that many possible models can lead to very similar global fit statistics. In general, however, these data mining approaches should be avoided.

Approaches to variable selection and model building
A continuum of variable selection approaches anchored by two extreme caricatures
Data-driven
– Start with an outcome and a set of predictors.
– Run a variable selection algorithm that maximizes global prediction along some criterion (R-sq, adj-R-sq…).
– Do very little if any adjustment of the best-fitting model: no questioning of irrelevant, redundant, or confounded variables, no effort at identifying substantively central variables, no sensitivity study.

Model selection criteria
– Does the model chosen reflect your underlying theory?
– Does the model allow you to address the effects of your key question predictor(s)?
– Are you excluding predictors that are statistically significant? If so, why?
– Are you including predictors you could reasonably set aside (parsimony principle)?
One of the most valuable frameworks you’ll take from this class is the tabular display of multiple plausible models. It allows you to evaluate the robustness of your conclusions under plausible model specifications. The goal is as much to explain dependencies as it is to arrive at a final model or models.

It’s worth remembering what “statistical control” can and can’t do
Residuals are Zagat ratings above and beyond what is expected by accounting for cost. Are residuals here “the best bang for your buck”? Are they “the best value”?
If restaurants were schools, ratings were test score gains, and cost were socioeconomic status, would residuals show “value added”?

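The residual described above can be computed directly after a simple regression. The following is a minimal Stata sketch, assuming hypothetical variables named rating and cost. The data are simulated stand-ins, not the Zagat data from the slide, and the numbers in the generate lines are arbitrary.

```stata
* Simulated stand-in data: NOT the Zagat data; names and numbers are hypothetical
clear
set seed 2013
set obs 200
generate cost   = 20 + 40*runiform()
generate rating = 15 + 0.2*cost + rnormal(0, 2)

* Fit the rating-on-cost line and save what is left over
regress rating cost
predict rating_resid, residuals

* OLS residuals average to zero and are uncorrelated with cost by construction
summarize rating_resid
correlate rating_resid cost

* Restaurants rated furthest above what their cost predicts
gsort -rating_resid
list cost rating rating_resid in 1/5
```

The listed restaurants are the ones rated highest relative to their cost. Whether that deserves to be called “value” is a substantive question that the arithmetic cannot answer.
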
Slippery slopes to unwarranted causal inference
– Cost is related to ratings; cost is associated with ratings. Cost predicts ratings.
– A $1 difference in cost predicts/is associated with a .186 difference in ratings. A $1 increase in cost predicts/is associated with a .186 increase in ratings.
– As cost increases $1, ratings are predicted to increase by .186.
– Cost has an effect on ratings. The effect of cost on ratings is significant. Cost impacts/has an impact on ratings. A significant impact of cost on ratings. A $1 increase in cost will lead to a .186 increase in ratings. Cost drives/causes ratings. Cost is a significant determinant of ratings.
I typically draw the line about here, though you’d be safer higher.
We’ve been drifting here in technical writing, in the “results” section, but always circle back to defensible interpretations in the “discussion” section.

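The descriptive end of this ladder is just arithmetic on the fitted line. As a worked sketch using the slope quoted on the slide (the intercept is not given there, so it is left symbolic as an assumption-free placeholder):

```latex
\widehat{\text{rating}}(c) = \hat\beta_0 + 0.186\, c
\qquad\Longrightarrow\qquad
\widehat{\text{rating}}(c + 1) - \widehat{\text{rating}}(c)
  = \bigl(\hat\beta_0 + 0.186\,(c + 1)\bigr) - \bigl(\hat\beta_0 + 0.186\, c\bigr)
  = 0.186
```

Here c is cost in dollars. The difference compares the predicted ratings of two restaurants whose costs differ by $1. Nothing in the equation says that changing one restaurant's cost would change its rating, which is why the “difference” and “is associated with” phrasings are the safest.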
What we’ve learned: The world of S-030
A single outcome variable
– Continuous, interval scaled (noncategorical)
– May be transformed to meet regression assumptions of normally distributed residuals (e.g., log(Y)) (Unit 5)
– Independent and identically normally distributed residuals centered on 0 (Unit 4)
A single predictor variable (Units 1-4)
– May be transformed to achieve linearity (Unit 5)
– May be dichotomous (Unit 8) or polychotomous (Unit 9)
Multiple predictor variables (Units 6-7)
– Interactions: Products of predictors (Unit 10)
– Quadratic/Polynomial Regression for nonlinear relationships (Unit 10)

What we haven’t learned (yet): The S-052 Roadmap, by Dr. John Willett
What might a next course in regression analysis look like?
Starting point: Multiple Regression Analysis
– Do your residuals meet the required assumptions? Test for residual normality. Use influence statistics to detect atypical datapoints.
– Are the data longitudinal? Use individual growth modeling.
– If your residuals are not independent, replace OLS by GLS regression analysis. Specify a multilevel model.
– If time is a predictor, you need discrete-time survival analysis…
– If your outcome is categorical, you need to use… discriminant analysis, multinomial logistic regression analysis (polychotomous outcome), or binomial logistic regression analysis (dichotomous outcome).
– If you have more predictors than you can deal with, create taxonomies of fitted models and compare them. Conduct a principal components analysis. Form composites of the indicators of any common construct. Use cluster analysis.
– If your outcome vs. predictor relationship is non-linear, transform the outcome or predictor. Use non-linear regression analysis.

You’ve come a long way…
Building a solid foundation
– Unit 1: Introduction to simple linear regression
– Unit 2: Correlation and causality
– Unit 3: Inference for the regression model
Mastering the subtleties
– Unit 4: Regression assumptions: Evaluating their tenability
– Unit 5: Transformations to achieve linearity
Adding additional predictors
– Unit 6: The basics of multiple regression
– Unit 7: Statistical control in depth: Correlation and collinearity
Generalizing to other types of predictors and effects
– Unit 8: Categorical predictors I: Dichotomies
– Unit 9: Categorical predictors II: Polychotomies
– Unit 10: Interactions and quadratic effects
Pulling it all together
– Unit 11: Regression in practice. Common extensions.

You’ve come a long way…
– Coasting on intro stat knowledge
– Sampling distributions!? Type II error?!
– Multiple regression isn’t so ba-
– Gummy bears in Jello?!
– I can kind of “see” diagnostics and transformations.
– Why do the slopes keep changing?
– Starting to get categoricals…
– Effects of variables on other effects?!
– Final project

And there are next steps to be taken…
Some themes of this course can help you to keep up your momentum:
– Treat statistics like a language. Use it (in active, participatory practice) or lose it.
– Create study groups at your place of employment.
– Collaborate on projects. Submit papers to conferences. Attend conferences.
– Read quantitative blogs and publications (538, information is beautiful, … xkcd).
– Keep in touch with us. We can add you to future course websites and keep you thinking about next steps.
– Keep in touch with each other.
– Google (but with prejudice). UCLA is particularly solid. YouTube, hit and miss.
– Ask to take on quantitative work at work.
– Look for data analysis opportunities. Everywhere.

Acknowledgements
– Huge thanks to Judy Singer for a course blueprint that even I couldn’t completely flub.
– Thanks to the LTC for support throughout this semester.
– Thanks to Melita Garrett for his administrative support.
– Thanks to xkcd…
The S-030 Experience: Huge thanks to the TFs, Ann, Beth, Dave, James, Marc, Priya, and Vidur, for supporting the better half of learning, outside of lecture.
And thanks to you all for making my second and final time teaching S-030 so enjoyable and memorable.
http://xkcd.com/1038/