# More details can be found in the “Course Objectives and Content” handout on the course webpage. Multiple Regression Analysis (MRA) Multiple Regression.


© Willett, Harvard University Graduate School of Education. S052/§I.1(a): Applied Data Analysis.

## Slide 1: Roadmap of the Course – What Is Today’s Topic?

More details can be found in the “Course Objectives and Content” handout on the course webpage. The slide also carries a “Today’s Topic Area” marker.

Multiple Regression Analysis (MRA):

- Do your residuals meet the required assumptions?
  - Test for residual normality.
  - Use influence statistics to detect atypical datapoints.
- If your residuals are not independent, replace OLS by GLS regression analysis.
  - Use individual growth modeling.
  - Specify a multi-level model.
- If your sole predictor is continuous, MRA is identical to correlational analysis.
- If your sole predictor is dichotomous, MRA is identical to a t-test.
- If your several predictors are categorical, MRA is identical to ANOVA.
- If time is a predictor, you need discrete-time survival analysis…
- If your outcome is categorical, you need to use…
  - Binomial logistic regression analysis (dichotomous outcome).
  - Multinomial logistic regression analysis (polytomous outcome).
- If you have more predictors than you can deal with,
  - Create taxonomies of fitted models and compare them.
  - Form composites of the indicators of any common construct.
    - Conduct a Principal Components Analysis.
    - Use Cluster Analysis.
    - Use Factor Analysis: EFA or CFA?
- If your outcome vs. predictor relationship is non-linear,
  - Use non-linear regression analysis.
  - Transform the outcome or predictor.
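
The roadmap's claim that MRA with a sole dichotomous predictor is identical to a t-test can be checked directly. The sketch below is an illustration only: it assumes Stata's shipped `auto` example dataset rather than the course data, with `price` standing in for the outcome and `foreign` for the dichotomous predictor.

```stata
* Illustration only: the shipped auto dataset stands in for the course data.
sysuse auto, clear

* Two-sample t-test (equal variances assumed, which is the default):
ttest price, by(foreign)
scalar t_ttest = r(t)

* The same comparison, fitted as a regression on the 0/1 predictor:
regress price foreign
scalar t_reg = _b[foreign]/_se[foreign]

* The two t-statistics agree in magnitude. The sign differs only because
* ttest reports (group 0 - group 1) while regress reports (group 1 - group 0).
display "t from ttest = " t_ttest "   t from regress = " t_reg
```

The p-values also agree, because both procedures use the same pooled-variance standard error and the same degrees of freedom.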

## Slide 2: Where Does Today’s Topic Appear in the Printed Syllabus?

In the future, I ask you to keep tabs automatically on the interconnections among the Roadmap, the Daily Topic Area, the Printed Syllabus, and the content of the day’s class, when you first download and pre-read the required day’s class materials.

Today’s topic, Deciding Which Regression Models to Fit, is from Syllabus Section I.1(a) and includes:

- Slides 3-4: Introducing the ILLCAUSE data-example.
- Slides 5-6: A “Universe Of All Possible Models”.
- Slide 7: Two Strategies For Choosing Subsets of Regression Models To Fit.
- Slide 8: Where Is My Strategy Documented.
- Slides 9-21: Exploratory Univariate & Bivariate Analyses in the ILLCAUSE Dataset.
- Slide 22: Establishing Priorities Among the Predictors.
- Slides 22-25: Fitting a Sensible Taxonomy of Regression Models in the ILLCAUSE Dataset.
- Slide 26: Decoding Standard Regression Output.
- Slide 27: APA-Style Table Displaying a Taxonomy Of Fitted Regression Models.
- Slides 29-30: Appendix I.
- Slide 31: Appendix II.

## Slide 3: What Question, and Dataset, Will Drive Our Presentation Today?

S052/§I.1(a): Deciding Which Multiple Regression Models To Fit

RQ: Do children who suffer from chronic illness understand the causes of illness better than healthy children and if so, by how much?

- Dataset on website: ILLCAUSE.txt
- Codebook on website: ILLCAUSE_info

| Item | Details |
|---|---|
| Dataset | ILLCAUSE.txt |
| Overview | Data for investigating differences in children’s understanding of the causes of illness, by their health status. |
| Source | Perrin E.C., Sayer A.G., and Willett J.B. (1991). Sticks And Stones May Break My Bones: Reasoning About Illness Causality And Body Functioning In Children Who Have A Chronic Illness, Pediatrics, 88(3), 608-19. |
| Sample size | 301 children, including a sub-sample of 205 who were described as asthmatic, diabetic, or healthy. After further reductions due to the list-wise deletion of cases with missing data on one or more variables, the analytic sub-sample used in class ends up containing 33 diabetic children, 68 asthmatic children and 93 healthy children. |
| More info | Chronically-ill children were recruited into the study through their pediatricians; healthy children were a matched random sample drawn from the same schools as the ill children. |
| Updated | September 16, 2005 |

## Slide 4: What Variables Will We Focus On In Our Analyses?

## Slide 6: How Big Is the Universe of All Possible Models, and How Can You Map It?

Willett’s Rule: The approximate number of feasible regression models that can be specified increases exponentially with the number of potential predictors.

This means that, with one outcome and thirteen predictors, the “Universe of All Possible Models” contains 73,566,892 potential model specifications.

The slide maps a path through this universe, with a “You are here!” marker:

- “Initial” model contains the main effect of the question predictor, HEALTH?
- Second model adds the main effect of control predictor, AGE?
- Third model adds the two-way interaction of HEALTH and AGE?
- Fourth model adds the main effect of control predictor, SES?
- Next?

It seems plausible to ask, then: In this Universe, What Strategy Can Lead You To The “Best” Subset Of Models?
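
To get a feel for how quickly the count grows, here is a small sketch of a much simpler tally. It is not Willett's Rule, whose formula is not reproduced in this transcript. It counts only the models that differ in which main effects they include (no interactions, no transformations), so it is a lower bound on any count that also allows interactions.

```stata
* Lower bound only: each of k predictors is either in or out of the model,
* so there are 2^k main-effects-only specifications (including the empty one).
forvalues k = 1/13 {
    display "predictors = " %2.0f `k' "   main-effects-only models = " %6.0fc 2^`k'
}
```

Even this restricted count reaches 8,192 with thirteen predictors, which is already far too many models to fit and compare one by one.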

## Slide 9: Annotated STATA Code, From The Exploratory Analysis

Data-Analytic Handout I.1(a).1 starts like this:

    *--------------------------------------------------------------------------------
    * S-052: APPLIED DATA ANALYSIS
    * Data-Analytic Handout I.1(a).1
    *
    * I: Conducting Sensible Multiple Regression Analyses.
    * 1(a): Fitting Taxonomies of Multiple Regression Models.
    * #1: Introducing the Data.
    *
    * RQ: Do Children Who Are Chronically Ill Understand the Causes of Illness
    *     Better Than Do Healthy Children?
    *
    * Programming:
    *   Stata Version: Stata 11 SE.
    *   Authors: Andres Molano, Monica Yudron & John B. Willett.
    *   Last Modified: Jan 3, 2011.
    *--------------------------------------------------------------------------------
    * Set the critical parameters of the computing environment.
    *--------------------------------------------------------------------------------
    * Specify the version of Stata to be used in the analysis:
    version 11.0
    * Clear all computer memory and delete any existing stored graphs:
    clear
    graph drop _all
    * Clear and set initial matrix and memory parameters:
    clear matrix
    set mat 400
    set mem 200m
    * Define the local directory:
    cd "C:\My Documents\My Course Stuff\S052\Data Analytic Handouts\Stata\Section I\"

Slide annotations:

- You can title STATA programs & code, using comments. Include: the name of the handout, a link to the syllabus, the substantive theme (RQ), and programming logistics.
- Any line that begins with an asterisk is a comment. It doesn’t matter what it ends with.
- Any current version of STATA can recognize code written according to the rules of any previous version of the software.
- Clear everything out, before the current program executes.
- Define the local directory, so STATA knows where to write its logs, etc.

## Slide 10: Annotated STATA Code, From The Exploratory Analysis

And continues like this:

    *--------------------------------------------------------------------------------
    * Open a log to contain a permanent record of the syntax and analytic output.
    *--------------------------------------------------------------------------------
    log using "I_1a_1.log", replace
    *--------------------------------------------------------------------------------
    * Input the raw dataset, name and label the variables and their values.
    *--------------------------------------------------------------------------------
    * Input the dataset:
    infile ID ILLCAUSE SES PPVT AGE GENREAS HEALTH ///
        using "C:\My Documents\My Course Stuff\S052\Data\Datasets\ILLCAUSE.txt"
    * Label the variables in the dataset:
    label variable ID "Child Identification Code"
    label variable ILLCAUSE "Understanding of Illness Causality Score"
    label variable SES "Hollingshead SES"
    label variable PPVT "Peabody Picture Vocabulary Test Score"
    label variable AGE "Chronological Age (Months)"
    label variable GENREAS "General Reasoning Ability Score"
    label variable HEALTH "Health Status"
    * Label the values of categorical question predictor HEALTH:
    label define HEALTHLBL 3 "Diabetic" 5 "Asthmatic" 6 "Healthy"
    label values HEALTH HEALTHLBL

Slide annotations:

- Open a log file to contain all output. STATA writes it to the local directory defined previously.
- In the infile command, specify the names of the variables (in order of their appearance in the dataset) and the location of the data-file that contains the raw data. The data are read into the STATA active file, and held in memory.
- Variables are easily labeled with informative names.
- You can also label the values of particular variables, with informative names. Here, we will subsequently focus on only the Diabetic, Asthmatic and Healthy children to simplify the analysis, so I have named only those values of HEALTH.

## Slide 11: Annotated STATA Code For Conducting Exploratory Analysis

And like this:

    *--------------------------------------------------------------------------------
    * Subset an analytic sample with only healthy, asthmatic, & diabetic children.
    *--------------------------------------------------------------------------------
    keep if HEALTH==3 | HEALTH==5 | HEALTH==6
    *--------------------------------------------------------------------------------
    * List and check values of the variables in dataset, for the first 30 cases.
    *--------------------------------------------------------------------------------
    list in 1/30
    *--------------------------------------------------------------------------------
    * Obtain univariate summary statistics on selected variables, in analytic sample
    *--------------------------------------------------------------------------------
    * On continuous outcome ILLCAUSE:
    sum ILLCAUSE
    * On categorical question predictor, HEALTH:
    hist HEALTH, discrete frequency name(I_1a_1_g1,replace)
    * On continuous control predictors AGE and SES:
    sum AGE SES

Slide annotations:

- Select out the sub-samples of children who are to be compared. Here, to simplify the analyses, I focus on the subsamples of children who are either Diabetic (HEALTH=3), Asthmatic (HEALTH=5) or Healthy (HEALTH=6).
- List out 30 cases, to check if the data have been input correctly, listing the variables in the order in which you want them to list.
- Request univariate descriptive statistics on the outcome, ILLCAUSE.
- Request univariate descriptive statistics on the categorical question predictor, HEALTH.
- Request univariate descriptive statistics on the continuous covariates, AGE and SES.

## Slide 12: Annotated STATA Code For Conducting Exploratory Analysis

    *--------------------------------------------------------------------------------
    * Explore bivariate relationships between the outcome ILLCAUSE & all predictors.
    *--------------------------------------------------------------------------------
    * Between outcome ILLCAUSE and categorical question predictor, HEALTH:
    graph box ILLCAUSE, over(HEALTH) name(I_1a_1_g2,replace)
    tabstat ILLCAUSE, statistics(mean sd max min count) by(HEALTH)
    * Between outcome ILLCAUSE and continuous "social" control, SES:
    graph twoway scatter ILLCAUSE SES, msymbol(+) name(I_1a_1_g3,replace)
    pwcorr ILLCAUSE SES, sig
    * Between outcome ILLCAUSE and continuous "design" covariate, child AGE.
    * Here, potential issues of curvilinearity in the relationship are revealed that
    * are subsequently addressed via transformation:
    * First, examine bivariate relationship of ILLCAUSE & untransformed child AGE:
    graph twoway scatter ILLCAUSE AGE, msymbol(+) name(I_1a_1_g4,replace)
    pwcorr ILLCAUSE AGE, sig
    * Second, examine bivariate relationship, after taking natural log of child AGE:
    generate LAGE = log(AGE)
    graph twoway scatter ILLCAUSE LAGE, msymbol(+) name(I_1a_1_g5,replace)
    pwcorr ILLCAUSE LAGE, sig
    *--------------------------------------------------------------------------------
    * Examine bivariate relationships among predictors.
    *--------------------------------------------------------------------------------
    * Between covariates AGE & SES and the HEALTH status question predictor.
    tabstat AGE SES, statistics(mean sd max min count) by(HEALTH)
    *--------------------------------------------------------------------------------
    * Close the log
    *--------------------------------------------------------------------------------
    log close

Slide annotations:

- Examine the bivariate relationship between continuous outcome ILLCAUSE and categorical question predictor HEALTH:
  - Parallel box-plots of ILLCAUSE, by HEALTH.
  - Univariate descriptive stats on ILLCAUSE, by HEALTH.
- Examine the bivariate relationship between continuous outcome ILLCAUSE and continuous covariate SES:
  - Scatterplot of ILLCAUSE versus SES.
  - Estimated bivariate correlation of ILLCAUSE and SES.
- Examine the bivariate relationship between continuous outcome ILLCAUSE and continuous covariate AGE.
  - Before the transformation of AGE: scatterplot of ILLCAUSE versus AGE, and estimated bivariate correlation of ILLCAUSE and AGE.
  - After the log-transformation of AGE: scatterplot of ILLCAUSE versus LN_AGE (named LAGE in the code), and estimated bivariate correlation of ILLCAUSE and LN_AGE.
- Check out the hypothesized interconnections among the question predictor and the covariates.
- Close the log.
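
The before-and-after comparison of AGE and log-AGE can be tried out without the course dataset. The sketch below uses simulated data only: the variable `Y`, the seed, and every coefficient are hypothetical values chosen so that the outcome is linear in log-AGE, and nothing here reproduces the ILLCAUSE results.

```stata
* Simulated data, for illustration only (not the ILLCAUSE dataset).
clear
set seed 52
set obs 200
* Ages in months, spread over roughly 60 to 200:
generate AGE = 60 + 140*runiform()
* In Stata, log() is the natural logarithm:
generate LAGE = log(AGE)
* Hypothetical outcome, built to be linear in log-AGE:
generate Y = -6 + 2*LAGE + rnormal(0, 0.8)

* Compare the linear association before and after the transformation:
pwcorr Y AGE, sig
pwcorr Y LAGE, sig

* ln() is a synonym for log(), so either spelling creates the same variable:
generate LAGE2 = ln(AGE)
assert LAGE == LAGE2
```

With a modest age range the two correlations can be close, as they are on the course slides, so the scatterplots are usually more informative than the correlation alone.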

## Slide 13: Listing a Few Cases from the ILLCAUSE Dataset

Selected output from Data-Analytic Handout I.1(a).1, showing data on a few early cases:

         +--------------------------------------------------------+
         |  ID   ILLCAUSE   SES   PPVT   AGE   GENREAS     HEALTH |
         |--------------------------------------------------------|
      1. | 301          .     2    138   128     4.802   Diabetic |
      2. | 302      2.857     2    102    79     2.188   Diabetic |
      3. | 303      3.429     3     84   151     3.302   Diabetic |
      4. | 304      4.286     3     98   178     5.219   Diabetic |
      5. | 305      4.286     4     80   113       2.5   Diabetic |
         |--------------------------------------------------------|

Slide annotations:

- The leftmost column contains the values of a STATA “system” variable that counts and identifies the observations in the order in which they appear in the active file. We will make much use of it later.
- The values of the outcome, ILLCAUSE, are listed here for each child.
- Notice that a period (.) is the code that is used in STATA as the default missing value code.
- Notice that the children's values of SES are heterogeneous and remember that higher values mean lower SES!!!
- Notice the heterogeneous ages of the sampled children (in months).
- Notice that the health status of the children has been reformatted by my STATA program from a numerical to an alphabetic label.

## Slide 14: Sample Bivariate Relationship Of Outcome ILLCAUSE & Question Predictor HEALTH

What Do You Notice, in this Bivariate Scatterplot, That May Most Usefully Inform Subsequent Regression Analyses?

STATA output from Data-Analytic Handout I.1(a).1 (not APA-style): “Parallel box-plots illustrating the sample bivariate relationship of continuous outcome, ILLCAUSE, with children’s HEALTH status”

## Slide 15: Sample Bivariate Relationship Of Outcome ILLCAUSE & Question Predictor HEALTH

What Do You Notice, in this Scatter-Plot, That May Most Usefully Inform Our Subsequent Regression Analyses?

When We Fit Future Multiple Regression Models To These Data, We May Find …

1. Serious residual heteroscedasticity.
2. That a small number of atypical data-points may exert an inordinate impact on the fit.
3. There are issues of non-linearity in the hypothesized relationship between outcome and this predictor.
4. The residuals may not be normally distributed.
5. The residuals for different children may not be independent of one another.
6. The figure provides other insights that have not been exhausted by this list of responses.

## Slide 16: Sample Bivariate Relationship Between Outcome ILLCAUSE & Covariate AGE

What Do You Notice, in this Scatterplot, That May Most Usefully Inform Subsequent Regression Analyses?

STATA output from Data-Analytic Handout I.1(a).1 (not APA-style): “Scatterplot of the sample bivariate relationship between continuous outcome, ILLCAUSE and continuous covariate AGE …”

r = 0.671 ***

## Slide 17: Sample Bivariate Relationship Of Outcome ILLCAUSE & Covariate AGE

What Do You Notice, in this Scatter-Plot, That May Most Usefully Inform Our Subsequent Regression Analyses?

r = 0.671 ***

When We Fit Future Multiple Regression Models To These Data, We May Find That …

1. There is serious residual heteroscedasticity.
2. A small number of atypical data-points may exert an inordinate impact on the fit.
3. There are issues of non-linearity in the hypothesized relationship between the outcome and this predictor.
4. The residuals may not be normally distributed.
5. The residuals for different children may not be independent of one another.
6. The figure provides other insights that have not been exhausted by this list of responses.

## Slide 18: Sample Bivariate Relationship Between Outcome ILLCAUSE & Covariate LogAGE

Do You Feel Any Better Now?

STATA output from Data-Analytic Handout I.1(a).1 (not APA-style): “Scatterplot of the sample bivariate relationship between the continuous outcome, ILLCAUSE and continuous covariate LN_AGE …”

r = 0.683 ***

## Slide 19: Exploratory Bivariate Scatterplot of the ILLCAUSE/SES Relationship

What Do You Notice, in this Scatterplot, That May Most Usefully Inform Subsequent Regression Analyses?

STATA output from Data-Analytic Handout I.1(a).1 (not APA-style): “Scatterplot of the sample bivariate relationship between the continuous outcome, ILLCAUSE and continuous covariate SES …”

r = -0.247 ***

## Slide 20: Sample Bivariate Relationship Of Outcome ILLCAUSE & Covariate SES

What Do You Notice, in this Scatter-Plot, That May Most Usefully Inform Our Subsequent Regression Analyses?

r = -0.247 ***

When We Fit Future Multiple Regression Models To These Data, We May Find That …

1. There is serious residual heteroscedasticity.
2. A small number of atypical data-points may exert an inordinate impact on the fit.
3. There are issues of non-linearity in the hypothesized relationship between the outcome and this predictor.
4. The residuals may not be normally distributed.
5. The residuals for different children may not be independent of one another.
6. The figure provides other insights that have not been exhausted by this list of responses.

## Slide 21: Exploratory Bivariate Tabulations of the ILLCAUSE Data

What Do You Notice, in this Table, That May Most Usefully Inform Subsequent Regression Analyses?

STATA output from Data-Analytic Handout I.1(a).1 (not APA-style): “Sample distribution of children’s AGE and SES, by their HEALTH status …” Within each HEALTH group, the rows are the mean, standard deviation, maximum, minimum and count, in the order requested in the tabstat command.

       HEALTH |       AGE       SES
    ----------+--------------------
     Diabetic |    136.75  2.805556
              |   35.9105  1.141914
              |       194         5
              |        62         1
              |        36        36
    ----------+--------------------
    Asthmatic |  129.0822   2.69863
              |  40.72187  .8446717
              |       200         4
              |        61         1
              |        73        73
    ----------+--------------------
      Healthy |  131.9896   1.78125
              |  42.02267  .7567677
              |       203         4
              |        64         1
              |        96        96
    ----------+--------------------
        Total |  131.7902  2.287805
              |   40.4458  .9852329
              |       203         5
              |        61         1
              |       205       205
    -------------------------------

## Slide 23: Conducting the Requisite Multiple Regression Analyses?

From Data-Analytic Handout I.1(a).2: input the data and recode/create any variables needed for the subsequent regression analysis.

    *--------------------------------------------------------------------------------
    * Create additional predictors.
    *--------------------------------------------------------------------------------
    * Create illness-group dichotomies to serve as principal question predictors:
    * Diabetic children:
    generate D=.
    replace D=1 if HEALTH ==3
    replace D=0 if HEALTH !=3
    * Asthmatic children:
    generate A=.
    replace A=1 if HEALTH ==5
    replace A=0 if HEALTH !=5
    * Healthy children:
    generate H=.
    replace H=1 if HEALTH ==6
    replace H=0 if HEALTH !=6
    * Take the natural log-transform of child AGE:
    generate LAGE = log(AGE)
    * Create the complete set of two-way health status by log-AGE interactions:
    generate DxLAGE = D*LAGE
    generate AxLAGE = A*LAGE
    generate HxLAGE = H*LAGE
    * Create the complete set of two-way health status by SES interactions:
    generate DxSES = D*SES
    generate AxSES = A*SES
    generate HxSES = H*SES
    * Create the two-way logAGE by SES interaction:
    generate LAGExSES = LAGE*SES
    * Create the complete set of three-way health status by log-AGE by SES interactions:
    generate DxLAGExSES = D*LAGE*SES
    generate AxLAGExSES = A*LAGE*SES
    generate HxLAGExSES = H*LAGE*SES

Slide annotations:

- Convert categorical HEALTH status into a vector of appropriate dichotomous (“dummy”) predictors.
- Transform AGE, by taking its natural logarithm.
- Create a set of two-way HEALTH by LN_AGE interactions.
- Create a set of two-way HEALTH by SES interactions.
- Create a Log(AGE) by SES interaction.
- Create a set of three-way HEALTH by Log(AGE) by SES interactions.
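
As a side note, the three-step dummy construction can be written more compactly, and it is worth checking that the dummies partition the sample. The sketch below is an alternative phrasing under stated assumptions, not the handout's code; it runs on a tiny made-up HEALTH variable rather than the ILLCAUSE data.

```stata
* Made-up HEALTH codes, for illustration only (not the ILLCAUSE dataset).
clear
set obs 9
generate HEALTH = 3
replace HEALTH = 5 in 4/6
replace HEALTH = 6 in 7/9

* A logical expression evaluates to 1 (true) or 0 (false), so each dummy
* can be created in one line. The if-qualifier leaves D, A and H missing
* whenever HEALTH itself is missing.
generate D = (HEALTH == 3) if !missing(HEALTH)
generate A = (HEALTH == 5) if !missing(HEALTH)
generate H = (HEALTH == 6) if !missing(HEALTH)

* Check: every child belongs to exactly one health-status group.
assert D + A + H == 1
tabulate HEALTH D
```

One difference to be aware of: in the three-step version, a child with a missing HEALTH code would receive 0 on every dummy, because Stata treats `missing != 3` as true. That only matters if the sample has not already been restricted to the three coded groups.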

## Slide 24: Conducting the Requisite Multiple Regression Analyses?

And here are the multiple regression analyses:

    *--------------------------------------------------------------------------------
    * Fit a sensible taxonomy of nested regression models to investigate the
    * impact of health status on children's understanding of illness causality.
    *--------------------------------------------------------------------------------
    * Using the "check out the effect of the major question predictor first, and then
    * control for everything else" strategy:
    * First, estimate the uncontrolled "total effect" of health status:
    * Model 1: Include the main effects of D and A simultaneously,
    *          Omit predictor H to make healthy children the "reference category":
    regress ILLCAUSE D A
    * Second, account for the multi-cohort research design by controlling for
    * heterogeneity in the ages of the participating children:
    * Model 2: Include the main effect of the child's log-AGE:
    regress ILLCAUSE D A LAGE
    * Model 3: Check the two-way interaction of health status and child log-AGE:
    regress ILLCAUSE D A LAGE DxLAGE AxLAGE
    * Third, control for additional substantive covariate, socio-economic status:
    * Model 4: Check the main effect of family socioeconomic status:
    regress ILLCAUSE D A LAGE DxLAGE AxLAGE SES
    * Model 5: Check whether all interactions with SES are required in the model:
    regress ILLCAUSE D A LAGE DxLAGE AxLAGE ///
        SES DxSES AxSES LAGExSES DxLAGExSES AxLAGExSES

Slide annotations:

- Regress is the STATA procedure for conducting multiple regression analysis.
- The first variable to appear in the list is the outcome, by default.
- Specify the predictors you want included in each hypothesized model.
- You can specify many regression models one after another, sequentially.
- Categorical question predictor, HEALTH, is represented by only two of the health status dummies: D (for diabetics), and A (for asthmatics). For a discussion of this, see Appendix 1.
- The `///` symbol permits any line of code to break onto the next line.
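
The choice of healthy children as the reference category can also be expressed with Stata's factor-variable notation (available from Stata 11). The sketch below is an illustration on simulated data, not the handout's approach; the seed and the coefficients used to generate the outcome are hypothetical, while the group sizes mirror the analytic sub-sample described on Slide 3.

```stata
* Simulated data, for illustration only (not the ILLCAUSE dataset).
clear
set seed 52
set obs 194
generate HEALTH = 6
replace HEALTH = 3 in 1/33
replace HEALTH = 5 in 34/101
generate ILLCAUSE = 4.6 - 0.9*(HEALTH != 6) + rnormal(0, 0.9)

* Hand-made dummies, with healthy children omitted as the reference category:
generate D = (HEALTH == 3)
generate A = (HEALTH == 5)
regress ILLCAUSE D A
scalar b_D = _b[D]

* The same model via factor-variable notation; ib6. makes HEALTH==6 the base:
regress ILLCAUSE ib6.HEALTH
scalar b_3 = _b[3.HEALTH]

display "D coefficient = " b_D "   3.HEALTH coefficient = " b_3
```

Both fits give the same coefficients: each one is the difference between that group's mean and the mean of the omitted (healthy) group, and the intercept is the healthy group's mean.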

## Slide 25: Conducting the Requisite Multiple Regression Analyses?

The slide repeats the Model 1 to Model 5 code from Slide 24 and then continues:

    *--------------------------------------------------------------------------------
    * Make Model 4 more parsimonious by specifying health status as one dichotomy
    *--------------------------------------------------------------------------------
    * Create a new question predictor to distinguish ill children from healthy:
    generate ILL=.
    replace ILL=1 if D ==1|A==1
    replace ILL=0 if H ==1
    * Create the two-way interaction of predictor ILL & design covariate log-AGE:
    generate ILLxLAGE = ILL*LAGE
    * Simplify Model 4 by replacing the pair of former health-status dummies, D & A,
    * By the new single question predictor ILL:
    * Model 6:
    regress ILLCAUSE ILL LAGE ILLxLAGE SES

Can You Discern How The Logic of Slide #22 Drove The Specification Of This Nested Taxonomy Of Regression Models? We will unpack this carefully in our next class!
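
Model 6 is Model 4 with two equality constraints imposed: D and A share one coefficient, and DxLAGE and AxLAGE share another. One possible way to check whether that simplification is tenable is a post-estimation Wald test after fitting Model 4. The transcript does not say which comparison the course itself uses, so treat the sketch below as one option, run here on simulated data with hypothetical coefficients.

```stata
* Simulated data, for illustration only (not the ILLCAUSE dataset).
clear
set seed 52
set obs 194
generate HEALTH = 6
replace HEALTH = 3 in 1/33
replace HEALTH = 5 in 34/101
generate D = (HEALTH == 3)
generate A = (HEALTH == 5)
generate LAGE = log(60 + 140*runiform())
generate SES = 1 + floor(5*runiform())
generate DxLAGE = D*LAGE
generate AxLAGE = A*LAGE
generate ILLCAUSE = -6 + 2*LAGE - 0.9*(D + A) - 0.2*SES + rnormal(0, 0.8)

* Model 4: separate effects for diabetic and asthmatic children.
regress ILLCAUSE D A LAGE DxLAGE AxLAGE SES

* Model 6 constrains D and A (and their LAGE interactions) to share
* coefficients, so test those two constraints jointly:
test (D = A) (DxLAGE = AxLAGE)
display "F = " r(F) "   numerator df = " r(df) "   p = " r(p)
```

A large p-value would suggest that collapsing the two illness groups into the single predictor ILL loses little; a small one would argue for keeping D and A separate.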

S052/I.1(a) – Slide 26. What Does Regular Multiple Regression Output Look Like?

          Source |       SS       df       MS              Number of obs =     194
    -------------+------------------------------           F(  2,   191) =   23.45
           Model |  39.7373141     2   19.868657           Prob > F      =  0.0000
        Residual |  161.809826   191  .847171864           R-squared     =  0.1972
    -------------+------------------------------           Adj R-squared =  0.1888
           Total |   201.54714   193   1.0442857           Root MSE      =  .92042

    ------------------------------------------------------------------------------
        ILLCAUSE |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               D |  -.8373226   .1864973    -4.49   0.000    -1.205182   -.4694638
               A |  -.9355971   .1468597    -6.37   0.000    -1.225272   -.6459219
           _cons |   4.603656    .095443    48.23   0.000     4.415398    4.791914
    ------------------------------------------------------------------------------

Can You Navigate & Interpret Typical Regression Output?

- Can you interpret the estimated intercept?
- Can you interpret the estimated coefficient associated with predictors D and A?
- Conceptually, what is standard error?
- What hypothesis does each of these pairs of statistics test?
- Can you interpret the R² statistic?
- Can you interpret the “Root MSE” statistic?
- What hypothesis do these statistics test?
- Can you interpret the “Sum of Squares Model” (SSModel) statistic?
- Can you interpret the “Sum of Squares Error” (SSError) statistic?
- Can you interpret the “Sum of Squares Total” (SSTotal) statistic?
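One way to check your reading of the output is to recompute its summary statistics from the sums of squares and degrees of freedom. The sketch below is a minimal Python illustration, not part of the original handout; the variable names are illustrative, and every number is copied from the printed Stata output for Model 1.

```python
import math

# Values copied from the printed regression output (ILLCAUSE on D and A).
ss_model, df_model = 39.7373141, 2
ss_resid, df_resid = 161.809826, 191
ss_total, df_total = 201.54714, 193

# Mean squares are sums of squares divided by their degrees of freedom.
ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid
ms_total = ss_total / df_total

r_squared = ss_model / ss_total            # share of outcome variation predicted
adj_r_squared = 1 - ms_resid / ms_total    # R-squared penalized for model df
f_stat = ms_model / ms_resid               # omnibus test of all slopes being zero
root_mse = math.sqrt(ms_resid)             # residual standard deviation

# Each t-statistic is the coefficient divided by its standard error.
coefs = {
    "D": (-0.8373226, 0.1864973),
    "A": (-0.9355971, 0.1468597),
    "_cons": (4.603656, 0.095443),
}
t_stats = {name: b / se for name, (b, se) in coefs.items()}

print(round(r_squared, 4), round(adj_r_squared, 4),
      round(f_stat, 2), round(root_mse, 5))
print({name: round(t, 2) for name, t in t_stats.items()})
```

The recomputed values agree with the printed R-squared, Adj R-squared, F, Root MSE and t columns to the displayed precision, which shows how the header block and the coefficient table are derived from the same few quantities.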

S052/I.1(a) – Slide 27. Finally, Can You Assemble The Taxonomy of Fitted Models In An APA-Style Table?

What are the critical features of APA formatting for tables? Consult the style manuals and exemplars on the course website. We’ll dissect this taxonomy, and interpret its substantive story in great detail, in our subsequent classes...

S052/I.1(a) – Slide 28. Using Clickers In Future S-052 Classes

Should We Use Clickers in Future S-052 Classes?

1. Yes, I never want to put my Clicker down again. You will have to pry it from my cold dead hands.
2. Ok, I give in, in a few classes, maybe, if you must.
3. Hell no, never again!
4. You should take all the Clickers in the school and pound them into eMulch.
5. I will bring an egg sandwich for lunch next time.

S052/I.1(a) – Slide 29. Appendix I: Why Can Two Dummy Predictors Distinguish Among Three Groups?

The first fitted regression model (M1) from Data-Analytic Handout I.1(a).1 is, using the rounded estimates from the output on Slide 26: predicted ILLCAUSE = 4.60 - 0.84 D - 0.94 A.

From it, you can estimate the predicted value of ILLCAUSE in each health status group by substituting numerical values of the health status predictors that represent prototypical individuals in the dataset: D = 0 and A = 0 for healthy children, D = 1 and A = 0 for diabetic children, and D = 0 and A = 1 for asthmatic children.

Notice that the predicted outcome value for one of the groups, the reference (omitted, or comparison) group (here, healthy children), is obtained when the two dichotomous predictors that distinguish the chronically-ill children are both set to zero. This means that, if you have an intercept in the model, you need one fewer dummy predictor in the model than there are groups compared, as the fitted value for the “reference (or omitted) group” is provided by the estimated intercept.

Another way of thinking about this is to understand that, although there are three distinct health status groups present, only two independent pieces of information are needed to indicate the health status of a child, because if a child is neither diabetic nor asthmatic then s/he must be healthy, by default.

Of course, you get to choose which of the health status groups serves as the reference, because you are the one who picks which dummy predictor is omitted from the regression model. Typically, you make this choice for substantive, not statistical, reasons.
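The substitution described above can be carried out directly. The following minimal Python sketch is not part of the original handout: it takes the Model 1 coefficients from the regression output on Slide 26, and the function name `predict_illcause` is hypothetical.

```python
# Model 1 estimates copied from the printed regression output.
B_CONS = 4.603656    # intercept
B_D = -0.8373226     # slope on dummy predictor D (diabetic)
B_A = -0.9355971     # slope on dummy predictor A (asthmatic)


def predict_illcause(d: int, a: int) -> float:
    """Fitted value of ILLCAUSE for a child with dummy values D=d and A=a."""
    return B_CONS + B_D * d + B_A * a


# Prototypical children: one (D, A) pair per health status group.
prototypes = {"healthy": (0, 0), "diabetic": (1, 0), "asthmatic": (0, 1)}
predicted = {group: predict_illcause(d, a) for group, (d, a) in prototypes.items()}

for group, value in predicted.items():
    print(f"{group:10s} {value:.2f}")
```

The healthy group's predicted value is the intercept itself, because both dummies are zero. The diabetic and asthmatic values (about 3.77 and 3.67) differ from it by exactly the corresponding slope.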

S052/I.1(a) – Slide 30. Appendix I: Why Can Two Dummy Predictors Distinguish Among Three Groups?

Inspection of the fitted values computed on the previous slide indicates that the fitted regression parameters obtained in the analysis (the estimated intercept and the two estimated slopes associated with the dummy predictors representing health status) can be interpreted as follows:

- The fitted slope parameter associated with dummy predictor A represents the difference in the predicted value of ILLCAUSE between asthmatic and “reference” healthy children. It is our best estimate of the difference between asthmatic and healthy children, on average, in the population (-0.94).
- The fitted slope parameter associated with dummy predictor D represents the difference in the predicted value of ILLCAUSE between diabetic and “reference” healthy children. It is our best estimate of the difference between diabetic and healthy children, on average, in the population (-0.84).
- The fitted intercept represents the predicted value of ILLCAUSE (4.60) for those in the reference (or omitted) category. It is our best estimate of the understanding of healthy children, on average, in the population.
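These interpretations follow from a general property of least squares with dummy predictors and an intercept: the intercept equals the reference group's sample mean, and each slope equals that group's mean minus the reference mean. The sketch below illustrates this in plain Python on a tiny made-up dataset. The scores are hypothetical (they are not ILLCAUSE data), and the `ols` helper is an illustrative normal-equations solver, not the routine Stata uses.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    M = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(k):
            if r != c:
                factor = M[r][c]
                M[r] = [v - factor * w for v, w in zip(M[r], M[c])]
    return [M[i][k] for i in range(k)]


# Made-up outcome scores for three groups (for illustration only).
scores = {
    "healthy": [5.0, 4.0, 6.0],
    "diabetic": [4.0, 3.0, 5.0],
    "asthmatic": [3.0, 4.0, 2.0],
}
dummies = {"healthy": (0, 0), "diabetic": (1, 0), "asthmatic": (0, 1)}

# Design matrix columns: intercept, D, A. Healthy is the omitted reference group.
X, y = [], []
for group, values in scores.items():
    d, a = dummies[group]
    for v in values:
        X.append([1.0, float(d), float(a)])
        y.append(v)

b_cons, b_d, b_a = ols(X, y)
means = {g: sum(v) / len(v) for g, v in scores.items()}
print(b_cons, b_d, b_a)
print(means)
```

In this toy example the fitted intercept matches the healthy group's mean, and the two slopes match the diabetic-minus-healthy and asthmatic-minus-healthy mean differences, which is the pattern described for the ILLCAUSE estimates above.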