Published by Oswaldo Passey. Modified over 2 years ago.

Slide 1: Roadmap of the Course – What Is Today’s Topic?
S052/§I.1(a): Applied Data Analysis
© Willett, Harvard University Graduate School of Education

More details can be found in the “Course Objectives and Content” handout on the course webpage.

Multiple Regression Analysis (MRA)

Do your residuals meet the required assumptions? Test for residual normality; use influence statistics to detect atypical datapoints.
If your residuals are not independent, replace OLS by GLS regression analysis: use individual growth modeling; specify a multi-level model.
If your sole predictor is continuous, MRA is identical to correlational analysis.
If your sole predictor is dichotomous, MRA is identical to a t-test.
If your several predictors are categorical, MRA is identical to ANOVA.
If time is a predictor, you need discrete-time survival analysis …
If your outcome is categorical, you need to use … binomial logistic regression analysis (dichotomous outcome) or multinomial logistic regression analysis (polytomous outcome).
If you have more predictors than you can deal with: create taxonomies of fitted models and compare them; form composites of the indicators of any common construct; conduct a Principal Components Analysis; use Cluster Analysis; use Factor Analysis: EFA or CFA?
If your outcome vs. predictor relationship is non-linear: use non-linear regression analysis; transform the outcome or predictor.

Roadmap callout: Today’s Topic Area

Slide 2: Where Does Today’s Topic Appear in the Printed Syllabus?

In future, I ask you to keep tabs automatically on the interconnections among the Roadmap, the Daily Topic Area, the Printed Syllabus, and the content of the day’s class, when you first download and pre-read the required day’s class materials.

Today’s topic, Deciding Which Regression Models to Fit, is from Syllabus Section I.1(a) and includes:
Slides 3-4: Introducing the ILLCAUSE data-example.
Slides 5-6: A “Universe Of All Possible Models”.
Slide 7: Two Strategies For Choosing Subsets of Regression Models To Fit.
Slide 8: Where Is My Strategy Documented?
Slides 9-21: Exploratory Univariate & Bivariate Analyses in the ILLCAUSE Dataset.
Slide 22: Establishing Priorities Among the Predictors.
Slides 22-25: Fitting a Sensible Taxonomy of Regression Models in the ILLCAUSE Dataset.
Slide 26: Decoding Standard Regression Output.
Slide 27: APA-Style Table Displaying a Taxonomy Of Fitted Regression Models.
Slides 29-30: Appendix I.
Slide 31: Appendix II.

Slide 3: What Question, and Dataset, Will Drive Our Presentation Today?
S052/§I.1(a): Deciding Which Multiple Regression Models To Fit

RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and if so, by how much?

Dataset on website: ILLCAUSE.txt
Codebook on website: ILLCAUSE_info

Dataset: ILLCAUSE.txt
Overview: Data for investigating differences in children’s understanding of the causes of illness, by their health status.
Source: Perrin E.C., Sayer A.G., and Willett J.B. (1991). Sticks And Stones May Break My Bones: Reasoning About Illness Causality And Body Functioning In Children Who Have A Chronic Illness, Pediatrics, 88(3), 608-19.
Sample size: 301 children, including a sub-sample of 205 who were described as asthmatic, diabetic, or healthy. After further reductions due to the list-wise deletion of cases with missing data on one or more variables, the analytic sub-sample used in class ends up containing 33 diabetic children, 68 asthmatic children and 93 healthy children.
More info: Chronically-ill children were recruited into the study through their pediatricians; healthy children were a matched random sample drawn from the same schools as the ill children.
Updated: September 16, 2005

Slide 4: What Variables Will We Focus On In Our Analyses?

Slide 5: Even With A Few Predictors, There Are So Many Models You Can Possibly Fit!

To address the RQ about children’s Understanding of Illness Causality:
Choose ILLCAUSE as your outcome.
Choose HEALTH, and perhaps AGE and SES, as your predictors.
And proceed with a multiple regression analysis …

The task seems OK until you begin to enumerate how many possible models you can actually specify using just these few predictors …
Three models with 1 main effect.
Three models with 2 main effects.
Three models with 2 main effects and 1 two-way interaction.
One model with 3 main effects.
Three models with 3 main effects and 1 two-way interaction.
Three models with 3 main effects and 2 two-way interactions.
One model with 3 main effects and 3 two-way interactions.
And so on ...

… and then, what about non-linear expressions of the continuous predictors, or categorical versions, or what if you add another predictor like gender or race, or …

How Many Potential Models Would There Be, Then?
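The tally on this slide can be checked by brute force. The sketch below is not from the handout. It assumes the slide's implicit rule that a two-way interaction may enter a model only when both of its main effects are already present, and it counts every such model that can be built from three predictors. The loop indices h, a and s are illustrative stand-ins for HEALTH, AGE and SES.

```stata
* Count models built from 3 predictors using main effects and two-way
* interactions, where an interaction is eligible only if both of its
* main effects are in the model (an assumption made for this sketch).
local total = 0
forvalues h = 0/1 {
    forvalues a = 0/1 {
        forvalues s = 0/1 {
            * Number of main effects included in this combination:
            local k = `h' + `a' + `s'
            if `k' > 0 {
                * Eligible two-way interactions among the k main effects:
                local pairs = `k'*(`k'-1)/2
                * Each eligible interaction is either in or out:
                local total = `total' + 2^`pairs'
            }
        }
    }
}
display "Models with main effects and two-way interactions: `total'"
```

Under that assumption the count is 3 + 6 + 8 = 17, which agrees with the seven groups listed on the slide (3 + 3 + 3 + 1 + 3 + 3 + 1), before any three-way interaction, transformation or additional predictor is considered.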

Slide 6: How Big Is the Universe of All Possible Models, and How Can You Map it?

Willett’s Rule: The approximate number of feasible regression models that can be specified increases exponentially with the number of potential predictors.

This means that, with one outcome and thirteen predictors, the “Universe of All Possible Models” contains 73,566,892 potential model specifications …

Map of the universe (diagram labels):
“Initial” model contains the main effect of the question predictor, HEALTH?
Second model adds the main effect of control predictor, AGE?
Third model adds the two-way interaction of HEALTH and AGE?
Fourth model adds the main effect of control predictor, SES?
Next?
You are here!

It seems plausible to ask, then … In this Universe, What Strategy Can Lead You To The “Best” Subset Of Models?

Slide 7: Two Broad Strategies For Deciding Which Models To Specify And Fit

Two Broad Classes of Model-Specification Strategy?

More Thoughtful and Reasonable Methods: Use the Research Question and the supporting Substantive Theory and Logic to specify and fit a Systematic Taxonomy of Regression Models. Make sure your decisions are driven by your need to Answer Specific Research Questions, to Test Reasonable Hypotheses, and to Tell A Good Story.

Example Follows … ILLCAUSE Data: We’ll use the same example later to refine our ability to conduct regression analyses by adding new tools: General Linear Hypothesis (GLH) Testing, Influence Statistics, Innovations in Residual Analysis, Strategies For Improved Interpretation Of Fitted Models.

Well-known “Automated” Methods: Forward Selection … Backward Elimination … Stepwise Regression … All-Possible-Subsets Regression … erk!

I Don’t Recommend These Methods At All:
The choice of model specifications is abdicated to a computer.
The choice of next model can be strongly impacted by relationships among the predictors already present in the model and any potential subsequent predictors.

Please Don’t Ever Use These Methods … (but do read about them in the readings associated with DAM #1, so that you can recognize what to avoid!)

Slide 8: Example of Specifying a Sensible Taxonomy of Regression Models to Data

RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and if so, by how much?

Here starts the data-example – my illustrative data-analyses are contained in:
Data-Analytic Handout I.1(a).1: Available on class website. Features exploratory univariate and bivariate analyses of the ILLCAUSE data.
Data-Analytic Handout I.1(a).2: Available on class website. Features the fitting of one sensible taxonomy of regression models to the ILLCAUSE data.

Other kinds of support? A “Do it Yourself” STATA Activity is available on the course website, under Additional Support Materials.

All S052 Data-Analytic Handouts contain “model” STATA Code and Statistical Output. I asked you to print these handouts out, and include them in your package of materials for today’s class. They serve as “models” for your own future data-analyses (such as our regular Data-Analytic Memos (DAMs), and your future research). Consult them carefully as you work on the DAM assignments. A few programming comments follow …

Slide 9: Annotated STATA Code, From The Exploratory Analysis

Data-Analytic Handout I.1(a).1 starts like this…

    *--------------------------------------------------------------------------------
    * S-052: APPLIED DATA ANALYSIS
    * Data-Analytic Handout I.1(a).1
    *
    * I: Conducting Sensible Multiple Regression Analyses.
    * 1(a): Fitting Taxonomies of Multiple Regression Models.
    * #1: Introducing the Data.
    *
    * RQ: Do Children Who Are Chronically Ill Understand the Causes of Illness
    * Better Than Do Healthy Children?
    *
    * Programming:
    * Stata Version: Stata 11 SE.
    * Authors: Andres Molano, Monica Yudron & John B. Willett.
    * Last Modified: Jan 3, 2011.
    *--------------------------------------------------------------------------------
    * Set the critical parameters of the computing environment.
    *--------------------------------------------------------------------------------
    * Specify the version of Stata to be used in the analysis:
    version 11.0
    * Clear all computer memory and delete any existing stored graphs:
    clear
    graph drop _all
    * Clear and set initial matrix and memory parameters:
    clear matrix
    set mat 400
    set mem 200m
    * Define the local directory:
    cd "C:\My Documents\My Course Stuff\S052\Data Analytic Handouts\Stata\Section I\"

Annotations:
You can title STATA programs & code, using comments. Include: the name of the Handout, the link to the Syllabus, the substantive theme (RQ), and the programming logistics.
Any line that begins with an asterisk is a comment. It doesn’t matter what it ends with.
version: Any current version of STATA can recognize code written according to the rules of any previous version of the software.
clear, graph drop _all, clear matrix: Clear everything out, before the current program executes.
cd: Define the local directory, so STATA knows where to write its logs, etc.

Slide 10: Annotated STATA Code, From The Exploratory Analysis

And continues like this…

    *--------------------------------------------------------------------------------
    * Open a log to contain a permanent record of the syntax and analytic output.
    *--------------------------------------------------------------------------------
    log using "I_1a_1.log", replace
    *--------------------------------------------------------------------------------
    * Input the raw dataset, name and label the variables and their values.
    *--------------------------------------------------------------------------------
    * Input the dataset:
    infile ID ILLCAUSE SES PPVT AGE GENREAS HEALTH ///
        using "C:\My Documents\My Course Stuff\S052\Data\Datasets\ILLCAUSE.txt"
    * Label the variables in the dataset:
    label variable ID "Child Identification Code"
    label variable ILLCAUSE "Understanding of Illness Causality Score"
    label variable SES "Hollingshead SES"
    label variable PPVT "Peabody Picture Vocabulary Test Score"
    label variable AGE "Chronological Age (Months)"
    label variable GENREAS "General Reasoning Ability Score"
    label variable HEALTH "Health Status"
    * Label the values of categorical question predictor HEALTH:
    label define HEALTHLBL 3 "Diabetic" 5 "Asthmatic" 6 "Healthy"
    label values HEALTH HEALTHLBL

Annotations:
log using: Open a log file to contain all output. STATA writes it to the local directory defined previously.
infile: In the infile command, specify the names of the variables (in order of their appearance in the dataset) and the location of the data-file that contains the raw data. The data are read into the STATA active file, and held in memory.
label variable: Variables are easily labeled with informative names.
label define, label values: You can also label the values of particular variables, with informative names. Here, we will subsequently focus on only the Diabetic, Asthmatic and Healthy children to simplify the analysis, so I have named only those values of HEALTH.

Slide 11: Annotated STATA Code For Conducting Exploratory Analysis

And like this …

    *--------------------------------------------------------------------------------
    * Subset an analytic sample with only healthy, asthmatic, & diabetic children.
    *--------------------------------------------------------------------------------
    keep if HEALTH==3 | HEALTH==5 | HEALTH==6
    *--------------------------------------------------------------------------------
    * List and check values of the variables in dataset, for the first 30 cases.
    *--------------------------------------------------------------------------------
    list in 1/30
    *--------------------------------------------------------------------------------
    * Obtain univariate summary statistics on selected variables, in analytic sample
    *--------------------------------------------------------------------------------
    * On continuous outcome ILLCAUSE:
    sum ILLCAUSE
    * On categorical question predictor, HEALTH:
    hist HEALTH, discrete frequency name(I_1a_1_g1,replace)
    * On continuous control predictors AGE and SES:
    sum AGE SES

Annotations:
keep if: Select out the sub-samples of children who are to be compared. Here, to simplify the analyses, I focus on the subsamples of children who are either Diabetic (HEALTH=3), Asthmatic (HEALTH=5) or Healthy (HEALTH=6).
list in 1/30: List out 30 cases, to check if the data have been input correctly, listing the variables in the order in which you want them to list.
sum ILLCAUSE: Request univariate descriptive statistics on the outcome, ILLCAUSE.
hist HEALTH: Request univariate descriptive statistics on the categorical question predictor, HEALTH.
sum AGE SES: Request univariate descriptive statistics on the continuous covariates, AGE and SES.
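The subsetting condition on this slide can also be written with Stata's inlist() function. This is a minimal sketch rather than the handout's code, assuming HEALTH is in memory and coded as described (3, 5, 6); the assert and tabulate lines are added checks that do not appear in the original program.

```stata
* Keep only Diabetic (3), Asthmatic (5) and Healthy (6) children.
* Equivalent to: keep if HEALTH==3 | HEALTH==5 | HEALTH==6
keep if inlist(HEALTH, 3, 5, 6)

* Added check: stop with an error if any other HEALTH code remains.
assert inlist(HEALTH, 3, 5, 6)

* Added check: show how many children fall in each retained group.
tabulate HEALTH
```

The inlist() form is easier to extend if further groups are compared later, while the handout's chained comparisons make each retained code explicit.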

Slide 12: Annotated STATA Code For Conducting Exploratory Analysis

    *--------------------------------------------------------------------------------
    * Explore bivariate relationships between the outcome ILLCAUSE & all predictors.
    *--------------------------------------------------------------------------------
    * Between outcome ILLCAUSE and categorical question predictor, HEALTH:
    graph box ILLCAUSE, over(HEALTH) name(I_1a_1_g2,replace)
    tabstat ILLCAUSE, statistics(mean sd max min count) by(HEALTH)
    * Between outcome ILLCAUSE and continuous "social" control, SES:
    graph twoway scatter ILLCAUSE SES, msymbol(+) name(I_1a_1_g3,replace)
    pwcorr ILLCAUSE SES, sig
    * Between outcome ILLCAUSE and continuous "design" covariate, child AGE.
    * Here, potential issues of curvilinearity in the relationship are revealed that
    * are subsequently addressed via transformation:
    * First, examine bivariate relationship of ILLCAUSE & untransformed child AGE :
    graph twoway scatter ILLCAUSE AGE, msymbol(+) name(I_1a_1_g4,replace)
    pwcorr ILLCAUSE AGE, sig
    * Second, examine bivariate relationship, after taking natural log of child AGE:
    generate LAGE = log(AGE)
    graph twoway scatter ILLCAUSE LAGE, msymbol(+) name(I_1a_1_g5,replace)
    pwcorr ILLCAUSE LAGE, sig
    *--------------------------------------------------------------------------------
    * Examine bivariate relationships among predictors.
    *--------------------------------------------------------------------------------
    * Between covariates AGE & SES and the HEALTH status question predictor.
    tabstat AGE SES, statistics(mean sd max min count) by(HEALTH)
    *--------------------------------------------------------------------------------
    * Close the log
    *--------------------------------------------------------------------------------
    log close

Annotations (LN_AGE in these notes is the variable named LAGE in the code):
Examine the bivariate relationship between continuous outcome ILLCAUSE and categorical question predictor HEALTH: parallel box-plots of ILLCAUSE, by HEALTH; univariate descriptive stats on ILLCAUSE, by HEALTH.
Examine the bivariate relationship between continuous outcome ILLCAUSE and continuous covariate SES: scatterplot of ILLCAUSE versus SES; estimated bivariate correlation of ILLCAUSE and SES.
Examine the bivariate relationship between continuous outcome ILLCAUSE and continuous covariate AGE. Before the transformation of AGE: scatterplot of ILLCAUSE versus AGE; estimated bivariate correlation of ILLCAUSE and AGE. After the log-transformation of AGE: scatterplot of ILLCAUSE versus LN_AGE; estimated bivariate correlation of ILLCAUSE and LN_AGE.
Check out the hypothesized interconnections among the question predictor and the covariates.
Close the log.
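One additional way to look for the curvilinearity mentioned in the code comments is to overlay a nonparametric smooth on each scatterplot. The sketch below is a suggestion, not part of the handout: it assumes ILLCAUSE and AGE are in memory, and the graph names lowess_age and lowess_lage are hypothetical.

```stata
* Create log-AGE if it does not already exist (capture suppresses the
* "variable already defined" error when LAGE is present).
capture generate LAGE = log(AGE)

* Lowess smooth of ILLCAUSE on untransformed AGE; a bend in the smoothed
* curve would suggest a curvilinear relationship.
lowess ILLCAUSE AGE, name(lowess_age, replace)

* Same smooth after the log-transformation; a straighter curve would
* suggest the transformation has helped.
lowess ILLCAUSE LAGE, name(lowess_lage, replace)
```

Comparing the two smooths by eye complements the before-and-after correlations requested with pwcorr, which summarize only the linear part of each relationship.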

Slide 13: Listing a Few Cases from the ILLCAUSE Dataset

Selected output from Data-Analytic Handout I.1(a).1 … data on a few early cases

         +--------------------------------------------------------+
         |  ID   ILLCAUSE   SES   PPVT   AGE   GENREAS     HEALTH |
         |--------------------------------------------------------|
      1. | 301          .     2    138   128     4.802   Diabetic |
      2. | 302      2.857     2    102    79     2.188   Diabetic |
      3. | 303      3.429     3     84   151     3.302   Diabetic |
      4. | 304      4.286     3     98   178     5.219   Diabetic |
      5. | 305      4.286     4     80   113       2.5   Diabetic |
         |--------------------------------------------------------|

Annotations:
Leftmost column (1., 2., …): This column contains the values of a STATA “system” variable that counts and identifies the observations in the order in which they appear in the active file. We will make much use of it later.
ILLCAUSE: The values of the outcome, ILLCAUSE, are listed here for each child. Notice that a period (.) is the code that is used in STATA as the default missing value code.
SES: Notice that the children's values of SES are heterogeneous and remember that higher values mean lower SES!!!
AGE: Notice the heterogeneous ages of the sampled children (in months).
HEALTH: Notice that the health status of the children has been reformatted by my STATA program from a numerical to an alphabetic label.

Slide 14: Sample Bivariate Relationship Of Outcome ILLCAUSE & Question Predictor HEALTH

What Do You Notice, in this Bivariate Scatterplot, That May Most Usefully Inform Subsequent Regression Analyses?

STATA output from Data-Analytic Handout I.1(a).1 … Not APA-Style
Figure: “Parallel box-plots illustrating the sample bivariate relationship of continuous outcome, ILLCAUSE, with children’s HEALTH status”

Slide 15: Sample Bivariate Relationship Of Outcome ILLCAUSE & Question Predictor HEALTH

What Do You Notice, in this Scatter-Plot, That May Most Usefully Inform Our Subsequent Regression Analyses?

When We Fit Future Multiple Regression Models To These Data, We May Find That …
1. There is serious residual heteroscedasticity.
2. A small number of atypical data-points may exert an inordinate impact on the fit.
3. There are issues of non-linearity in the hypothesized relationship between the outcome and this predictor.
4. The residuals may not be normally distributed.
5. The residuals for different children may not be independent of one another.
6. The figure provides other insights that have not been exhausted by this list of responses.

Slide 16: Sample Bivariate Relationship Between Outcome ILLCAUSE & Covariate AGE

What Do You Notice, in this Scatterplot, That May Most Usefully Inform Subsequent Regression Analyses?

STATA output from Data-Analytic Handout I.1(a).1 … Not APA-Style
Figure: “Scatterplot of the sample bivariate relationship between continuous outcome, ILLCAUSE and continuous covariate AGE …”
r = 0.671 ***

Slide 17: Sample Bivariate Relationship Of Outcome ILLCAUSE & Covariate AGE

What Do You Notice, in this Scatter-Plot, That May Most Usefully Inform Our Subsequent Regression Analyses?
r = 0.671 ***

When We Fit Future Multiple Regression Models To These Data, We May Find That …
1. There is serious residual heteroscedasticity.
2. A small number of atypical data-points may exert an inordinate impact on the fit.
3. There are issues of non-linearity in the hypothesized relationship between the outcome and this predictor.
4. The residuals may not be normally distributed.
5. The residuals for different children may not be independent of one another.
6. The figure provides other insights that have not been exhausted by this list of responses.

Slide 18: Sample Bivariate Relationship Between Outcome ILLCAUSE & Covariate LogAGE

Do You Feel Any Better Now?

STATA output from Data-Analytic Handout I.1(a).1 … Not APA-Style
Figure: “Scatterplot of the sample bivariate relationship between the continuous outcome, ILLCAUSE and continuous covariate LN_AGE …”
r = 0.683 ***

Slide 19: Exploratory Bivariate Scatterplot of the ILLCAUSE/SES Relationship

What Do You Notice, in this Scatterplot, That May Most Usefully Inform Subsequent Regression Analyses?

STATA output from Data-Analytic Handout I.1(a).1 … Not APA-Style
Figure: “Scatterplot of the sample bivariate relationship between the continuous outcome, ILLCAUSE and continuous covariate SES …”
r = -0.247 ***

Slide 20: Sample Bivariate Relationship Of Outcome ILLCAUSE & Covariate SES

What Do You Notice, in this Scatter-Plot, That May Most Usefully Inform Our Subsequent Regression Analyses?
r = -0.247 ***

When We Fit Future Multiple Regression Models To These Data, We May Find That …
1. There is serious residual heteroscedasticity.
2. A small number of atypical data-points may exert an inordinate impact on the fit.
3. There are issues of non-linearity in the hypothesized relationship between the outcome and this predictor.
4. The residuals may not be normally distributed.
5. The residuals for different children may not be independent of one another.
6. The figure provides other insights that have not been exhausted by this list of responses.

Slide 21: Exploratory Bivariate Tabulations of the ILLCAUSE Data

What Do You Notice, in this Table, That May Most Usefully Inform Subsequent Regression Analyses?

STATA output from Data-Analytic Handout I.1(a).1 … Not APA-Style
Table: “Sample distribution of children’s AGE and SES, by their HEALTH status …”
Within each HEALTH group, the rows are the mean, sd, max, min and count, in the order requested in the tabstat command.

       HEALTH |       AGE       SES
    ----------+--------------------
     Diabetic |    136.75  2.805556
              |   35.9105  1.141914
              |       194         5
              |        62         1
              |        36        36
    ----------+--------------------
    Asthmatic |  129.0822   2.69863
              |  40.72187  .8446717
              |       200         4
              |        61         1
              |        73        73
    ----------+--------------------
      Healthy |  131.9896   1.78125
              |  42.02267  .7567677
              |       203         4
              |        64         1
              |        96        96
    ----------+--------------------
        Total |  131.7902  2.287805
              |   40.4458  .9852329
              |       203         5
              |        61         1
              |       205       205
    -------------------------------

Slide 22: Then You Have To Fit Regression Models … Which Ones Should You Fit?

Fit a sensible subset of multiple regression models that address your research question directly; do NOT explore the entire universe of possible models:
First, identify your outcome variable -- here, ILLCAUSE (Duh!).
Second, establish important classes of predictors, based on substance (i.e., your research questions and theoretical framework), your research design, … etc, and establish the priorities among them.
Third, choose a sensible order in which to enter the predictor classes into the regression model, again based on substance and your research questions (an example follows, but see the Appendix).
Fourth, enter the predictors systematically within their classes, exhausting each class before proceeding to the next. At each step, once the main effects have been exhausted, consider including the interactions.

Priorities among the predictors (Priority, Predictor, Comment):
High priority, HEALTH: HEALTH is the Key Question Predictor. Without the presence of HEALTH in the final model, we cannot address the research question!
Medium priority, AGE: AGE is a key “Design” Control Predictor because it represents the multi-cohort nature of the research design: Our sample contains multiple sub-samples (“cohorts”) of children at different ages. By controlling for AGE, we can pool all the children into the same analysis, regardless of their age, rather than doing an “age-by-age slice” analysis (as was suggested by one ill-informed reviewer!!!)
Low priority, SES: SES is a subsidiary substantive Control Predictor. It is often worth including because some twit will always ask you if it matters. In these data, descriptive analyses suggest that ill children have lower SES, on average. So, if understanding illness also depends on home resources, the effect of SES could masquerade as an effect of HEALTH.

S052/I.1(a) – Slide 23
Conducting the Requisite Multiple Regression Analyses?

From Data-Analytic Handout I.1(a).2 … input the data and recode/create any variables needed for the subsequent regression analysis …

*--------------------------------------------------------------------------------
* Create additional predictors.
*--------------------------------------------------------------------------------
* Create illness-group dichotomies to serve as principal question predictors:
*   Diabetic children:
    generate D=.
    replace D=1 if HEALTH==3
    replace D=0 if HEALTH!=3
*   Asthmatic children:
    generate A=.
    replace A=1 if HEALTH==5
    replace A=0 if HEALTH!=5
*   Healthy children:
    generate H=.
    replace H=1 if HEALTH==6
    replace H=0 if HEALTH!=6
* Take the natural log-transform of child AGE:
    generate LAGE = log(AGE)
* Create the complete set of two-way health status by log-AGE interactions:
    generate DxLAGE = D*LAGE
    generate AxLAGE = A*LAGE
    generate HxLAGE = H*LAGE
* Create the complete set of two-way health status by SES interactions:
    generate DxSES = D*SES
    generate AxSES = A*SES
    generate HxSES = H*SES
* Create the two-way log-AGE by SES interaction:
    generate LAGExSES = LAGE*SES
* Create the complete set of three-way health status by log-AGE by SES interactions:
    generate DxLAGExSES = D*LAGE*SES
    generate AxLAGExSES = A*LAGE*SES
    generate HxLAGExSES = H*LAGE*SES

Notes on the code:
Convert categorical HEALTH status into a vector of appropriate dichotomous (“dummy”) predictors.
Transform AGE by taking its natural logarithm.
Create a set of two-way HEALTH by log-AGE interactions.
Create a set of two-way HEALTH by SES interactions.
Create a log-AGE by SES interaction.
Create a set of three-way HEALTH by log-AGE by SES interactions.
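For readers working outside STATA, the same recoding can be sketched in plain Python. This is a minimal illustration, not the handout's code: the HEALTH codes (3, 5, 6) and variable names follow the STATA commands above, while the function name `add_predictors` and the two sample records are invented for the example.

```python
import math

# HEALTH codes follow the handout: 3 = diabetic, 5 = asthmatic, 6 = healthy.
HEALTH_CODES = {"D": 3, "A": 5, "H": 6}


def add_predictors(child):
    """Return a copy of one child's record with dummies, log-AGE and products added."""
    row = dict(child)
    # One 0/1 dummy per health-status group.
    for name, code in HEALTH_CODES.items():
        row[name] = 1 if child["HEALTH"] == code else 0
    # Natural logarithm of AGE (STATA's log() is also the natural log).
    row["LAGE"] = math.log(child["AGE"])
    # Two-way and three-way interactions are simple products.
    for name in HEALTH_CODES:
        row[f"{name}xLAGE"] = row[name] * row["LAGE"]
        row[f"{name}xSES"] = row[name] * row["SES"]
        row[f"{name}xLAGExSES"] = row[name] * row["LAGE"] * row["SES"]
    row["LAGExSES"] = row["LAGE"] * row["SES"]
    return row


# Hypothetical records, invented for illustration only.
children = [
    {"HEALTH": 3, "AGE": 120, "SES": 2},
    {"HEALTH": 6, "AGE": 96, "SES": 1},
]
rows = [add_predictors(c) for c in children]
print(rows[0]["D"], rows[0]["A"], rows[0]["H"])
```

Because each interaction is a product with a 0/1 dummy, it equals the continuous predictor for children in that group and zero for everyone else.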

S052/I.1(a) – Slide 24
Conducting the Requisite Multiple Regression Analyses?

And here are the multiple regression analyses …

*--------------------------------------------------------------------------------
* Fit a sensible taxonomy of nested regression models to investigate the
* impact of health status on children's understanding of illness causality.
*--------------------------------------------------------------------------------
* Using the "check out the effect of the major question predictor first, and then
* control for everything else" strategy:
* First, estimate the uncontrolled "total effect" of health status:
*   Model 1: Include the main effects of D and A simultaneously,
*            Omit predictor H to make healthy children the "reference category":
    regress ILLCAUSE D A
* Second, account for the multi-cohort research design by controlling for
* heterogeneity in the ages of the participating children:
*   Model 2: Include the main effect of the child's log-AGE:
    regress ILLCAUSE D A LAGE
*   Model 3: Check the two-way interaction of health status and child log-AGE:
    regress ILLCAUSE D A LAGE DxLAGE AxLAGE
* Third, control for additional substantive covariate, socio-economic status:
*   Model 4: Check the main effect of family socioeconomic status:
    regress ILLCAUSE D A LAGE DxLAGE AxLAGE SES
*   Model 5: Check whether all interactions with SES are required in the model:
    regress ILLCAUSE D A LAGE DxLAGE AxLAGE ///
        SES DxSES AxSES LAGExSES DxLAGExSES AxLAGExSES

Notes on the code:
Regress is the STATA procedure for conducting multiple regression analysis.
The first variable to appear in the list is the outcome, by default.
Specify the predictors you want included in each hypothesized model.
You can specify many regression models one after another, sequentially.
The /// symbol permits any line of code to break onto the next line.
Categorical question predictor, HEALTH, is represented by only two of the health status dummies: D (for diabetics) and A (for asthmatics). For a discussion of this, see Appendix I.
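The taxonomy is called nested because each model contains every predictor of the model before it. As a hedged illustration, the sketch below copies the predictor lists from the five regress commands above and checks that property; the dictionary `MODELS` and the helper `added_terms` are names invented for this example.

```python
# Predictor lists copied from the five regress commands above.
MODELS = {
    "M1": ["D", "A"],
    "M2": ["D", "A", "LAGE"],
    "M3": ["D", "A", "LAGE", "DxLAGE", "AxLAGE"],
    "M4": ["D", "A", "LAGE", "DxLAGE", "AxLAGE", "SES"],
    "M5": ["D", "A", "LAGE", "DxLAGE", "AxLAGE", "SES",
           "DxSES", "AxSES", "LAGExSES", "DxLAGExSES", "AxLAGExSES"],
}


def added_terms(smaller, larger):
    """Predictors in `larger` but not in `smaller`; raises if the pair is not nested."""
    missing = set(smaller) - set(larger)
    if missing:
        raise ValueError(f"not nested, larger model lacks: {sorted(missing)}")
    return [p for p in larger if p not in smaller]


names = list(MODELS)
steps = {
    f"{a}->{b}": added_terms(MODELS[a], MODELS[b])
    for a, b in zip(names, names[1:])
}
for step, terms in steps.items():
    print(step, "adds", terms)
```

Reading the printed steps in order retraces the strategy: question predictors first, then the design control and its interactions, then SES and its interactions.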

S052/I.1(a) – Slide 25
Conducting the Requisite Multiple Regression Analyses?

The code for Models 1 through 5 is repeated unchanged from Slide 24. The handout then continues:

*--------------------------------------------------------------------------------
* Make Model 4 more parsimonious by specifying health status as one dichotomy
*--------------------------------------------------------------------------------
* Create a new question predictor to distinguish ill children from healthy:
    generate ILL=.
    replace ILL=1 if D==1|A==1
    replace ILL=0 if H==1
* Create the two-way interaction of predictor ILL & design covariate log-AGE:
    generate ILLxLAGE = ILL*LAGE
* Simplify Model 4 by replacing the pair of former health-status dummies, D & A,
* by the new single question predictor ILL:
*   Model 6:
    regress ILLCAUSE ILL LAGE ILLxLAGE SES

Can You Discern How The Logic of Slide #22 Drove The Specification Of This Nested Taxonomy Of Regression Models?
We will unpack this carefully in our next class!

S052/I.1(a) – Slide 26
What Does Regular Multiple Regression Output Look Like?

      Source |       SS       df       MS              Number of obs =     194
-------------+------------------------------           F(  2,   191) =   23.45
       Model |  39.7373141     2   19.868657           Prob > F      =  0.0000
    Residual |  161.809826   191  .847171864           R-squared     =  0.1972
-------------+------------------------------           Adj R-squared =  0.1888
       Total |   201.54714   193   1.0442857           Root MSE      =  .92042

------------------------------------------------------------------------------
    ILLCAUSE |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           D |  -.8373226   .1864973    -4.49   0.000    -1.205182   -.4694638
           A |  -.9355971   .1468597    -6.37   0.000    -1.225272   -.6459219
       _cons |   4.603656    .095443    48.23   0.000     4.415398    4.791914
------------------------------------------------------------------------------

Can You Navigate & Interpret Typical Regression Output?
Can you interpret the “Sum of Squares Model” -- or SSModel -- statistic?
Can you interpret the “Sum of Squares Error” -- or SSError -- statistic?
Can you interpret the “Sum of Squares Total” -- or SSTotal -- statistic?
What hypothesis do these statistics (F and Prob > F) test?
Can you interpret the R² statistic?
Can you interpret the “Root MSE” statistic?
Can you interpret the estimated coefficient associated with predictors D and A?
Can you interpret the estimated intercept?
Conceptually, what is standard error?
What hypothesis does each of these pairs of statistics (t and P>|t|) test?
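Several of the summary statistics in the output above are simple functions of the sums of squares and degrees of freedom, and recomputing them is a useful check on how the pieces fit together. The sketch below uses only the numbers printed in the Model 1 output; the variable names are chosen for this example.

```python
import math

# Values copied from the Model 1 output above.
ss_model, df_model = 39.7373141, 2
ss_resid, df_resid = 161.809826, 191

# Total sum of squares and degrees of freedom are the sums of the two rows.
ss_total = ss_model + ss_resid
df_total = df_model + df_resid

# Mean squares are sums of squares divided by their degrees of freedom.
ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid
ms_total = ss_total / df_total

r_squared = ss_model / ss_total          # proportion of outcome variation predicted
adj_r_squared = 1 - ms_resid / ms_total  # same idea, penalized for model size
f_stat = ms_model / ms_resid             # tests that all slopes are zero
root_mse = math.sqrt(ms_resid)           # residual standard deviation

# Each t statistic is the estimate divided by its standard error.
t_d = -0.8373226 / 0.1864973
t_a = -0.9355971 / 0.1468597

print(f"R-squared     = {r_squared:.4f}")
print(f"Adj R-squared = {adj_r_squared:.4f}")
print(f"F             = {f_stat:.2f}")
print(f"Root MSE      = {root_mse:.5f}")
print(f"t for D, A    = {t_d:.2f}, {t_a:.2f}")
```

The printed values can be compared, line by line, against the right-hand panel and the t column of the STATA output.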

S052/I.1(a) – Slide 27
Finally, Can You Assemble The Taxonomy of Fitted Models In An APA-Style Table?

What are the critical features of APA formatting for tables?
Consult the Style Manuals and exemplars on the course website.
We’ll dissect this taxonomy, and interpret its substantive story in great detail, in our subsequent classes...

S052/I.1(a) – Slide 28
Using Clickers In Future S-052 Classes

Should We Use Clickers in Future S-052 Classes?
1. Yes, I never want to put my Clicker down again. You will have to pry it from my cold dead hands.
2. Ok, I give in, in a few classes, maybe, if you must.
3. Hell no, never again!
4. You should take all the Clickers in the school and pound them into eMulch.
5. I will bring an egg sandwich for lunch next time.

S052/I.1(a) – Slide 29
Appendix I: Why Can Two Dummy Predictors Distinguish Among Three Groups?

The first fitted regression model (M1) from Data-Analytic Handout I.1(a).1 is (estimates taken from the Slide 26 output, rounded):

    Predicted ILLCAUSE = 4.604 - 0.837(D) - 0.936(A)

From it, you can estimate the predicted value of ILLCAUSE in each health status group by substituting numerical values of the health status predictors that represent prototypical individuals in the dataset:

    Healthy   (D = 0, A = 0):  4.604                  = about 4.60
    Diabetic  (D = 1, A = 0):  4.604 - 0.837          = about 3.77
    Asthmatic (D = 0, A = 1):  4.604 - 0.936          = about 3.67

Notice that the predicted outcome value for one of the groups, the reference, omitted or comparison group (here, healthy children), is obtained when the two dichotomous predictors that distinguish the chronically-ill children are both set to zero.

This means that, if you have an intercept in the model, you need one fewer dummy predictor in the model than there are groups compared, as the fitted value for the “reference (or omitted) group” is provided by the estimated intercept.

Another way of thinking about this is to understand that, although there are three distinct health status groups present, only two independent pieces of information are needed to indicate the health status of a child, because if a child is neither diabetic nor asthmatic then s/he must be healthy, by default.

Of course, you get to choose which of the health status groups serves as the reference, because you are the one who picks which dummy predictor is omitted from the regression model. Typically, you make this choice for substantive, not statistical, reasons.

S052/I.1(a) – Slide 30
Appendix I: Why Can Two Dummy Predictors Distinguish Among Three Groups?

Inspection of the fitted values computed on the previous slide indicates that the fitted regression parameters that we obtained in the analysis (that is, the estimated intercept parameter and the two estimated slope parameters associated with the dummy predictors representing health status) can be interpreted as follows:

The fitted intercept represents the predicted value of ILLCAUSE (4.60) for those in the reference (or omitted) category. It is our best estimate of the understanding of healthy children, on average, in the population.

The fitted slope parameter associated with dummy predictor D represents the difference in the predicted value of ILLCAUSE between diabetic and “reference” healthy children. It is our best estimate of the difference between diabetic and healthy children, on average, in the population (-0.84).

The fitted slope parameter associated with dummy predictor A represents the difference in the predicted value of ILLCAUSE between asthmatic and “reference” healthy children. It is our best estimate of the difference between asthmatic and healthy children, on average, in the population (-0.94).
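The interpretation above can be checked numerically. The sketch below is illustrative: it plugs the Model 1 estimates reported on Slide 26 into the fitted equation for each prototypical child and confirms that each slope equals that group's gap from the healthy reference group. The function name `predicted_illcause` is invented for this example.

```python
# Estimates copied from the Model 1 output on Slide 26.
INTERCEPT = 4.603656
SLOPE_D = -0.8373226
SLOPE_A = -0.9355971


def predicted_illcause(d, a):
    """Fitted value of ILLCAUSE for a child with dummy values D=d and A=a."""
    return INTERCEPT + SLOPE_D * d + SLOPE_A * a


# Prototypical children: (D, A) pairs for each health-status group.
groups = {"Healthy": (0, 0), "Diabetic": (1, 0), "Asthmatic": (0, 1)}
fitted = {name: predicted_illcause(d, a) for name, (d, a) in groups.items()}

for name, value in fitted.items():
    gap = value - fitted["Healthy"]
    print(f"{name:<10} fitted = {value:.2f}   gap from healthy = {gap:+.2f}")
```

The healthy group's fitted value is the intercept itself, which is why no third dummy is needed.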

S052/I.1(a) – Slide 31
Appendix II: Other Reasonable Strategies For Specifying Taxonomies Of Regression Models

“Baseline Control Model” Approach:
Form a baseline control model by sequentially adding control predictors, highest priority first, and testing for appropriate interactions as you go along.
Then, add the main effects of the question predictors to the new baseline control model.
Then, add interactions between the question predictors and the control predictors in the baseline control model, sequentially.
Finally, add interactions between the question predictors.
Here, the objective is to obtain a parsimonious model that controls away all extraneous variation first, and then to focus attention on the impact of the question predictors. While this approach refines your view of the impact of the question predictors, removing that part of their effect that may depend on the inter-relationships with the controls, it never reveals the “total” impact of the question predictors on the outcome for a person who has been randomly selected from the population without regard to any of their other characteristics.

“Work Back From The End” Approach:
Include all possible predictors in the model, both their main effects and interactions.
Then, remove statistically unimportant predictors sequentially to achieve a more parsimonious model, starting with those of lowest declared priority that do not appear to have statistically significant effects (i.e., remove question predictors last).
Make sure that you remove any statistically unimportant interactions ahead of any of the main effects from which they are constituted.
Here, the objective is to obtain a final parsimonious model by sequentially removing predictors that appear unimportant. The idea is that you get to see the impact of “everything” to start with, and then you can “slim down” the fitted model to a final model. However, the impact of main effects is always masked when interactions are present in the model, and you still may remove an important predictor whose correlation with another predictor makes it look unimportant.

Devise Your Own Strategy?
It’s acceptable to devise your own strategy; in fact, it’s probably the best approach, as you know the field best! But remember that your strategy must be systematic and sensible, and that you must explain it explicitly to your reader, describing the logic that underpins its organization.
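The rule that interactions must be removed ahead of the main effects from which they are constituted can be expressed as a small check. The sketch below is an assumption-laden illustration, not part of the course materials: it relies on the handout's naming convention, in which an interaction's name joins its components with a lowercase "x" (for example, DxLAGE), and the helper `can_remove` is invented for this example. It says nothing about statistical importance, only about which terms are eligible for removal.

```python
def can_remove(term, model):
    """True if no other term still in the model is built from this term's parts."""
    parts = set(term.split("x"))
    return not any(
        other != term and parts < set(other.split("x"))
        for other in model
    )


# Model 5 from the taxonomy: every main effect and interaction considered.
model_5 = ["D", "A", "LAGE", "DxLAGE", "AxLAGE", "SES",
           "DxSES", "AxSES", "LAGExSES", "DxLAGExSES", "AxLAGExSES"]

# Only the highest-order terms are eligible at the first step.
eligible = [t for t in model_5 if can_remove(t, model_5)]
print(eligible)
```

Applied to Model 5, only the two three-way interactions are eligible at first; once they are gone, the two-way interactions become eligible, and main effects come last.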
