# Advanced Stata Workshop Nealia Khan Tom Tomberlin Learning Technologies Center Harvard University, Graduate School of Education.



Contact Information

Located in Gutman Library, 3rd floor. Contact us at stathelp@gse.harvard.edu. You can make an appointment or have us respond to your request via email.

Generating Variables

Generate (gen): creates a new variable; the companion command replace changes the contents of an existing one. The expression on the right-hand side can use mathematical functions and conditional (if) statements, and many built-in functions can be used within a gen statement.

Extensions to generate (egen): creates new variables using its own set of specialized functions. egen cannot be used interchangeably with gen; you must use egen-specific functions with this command.
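To make the gen/egen distinction concrete, here is a minimal sketch. It assumes Stata's bundled auto dataset rather than the workshop data, and the variable names price_k, heavy, and mean_price are made up for illustration.

```stata
* Sketch only: uses Stata's bundled auto dataset, not the workshop data
sysuse auto, clear

* gen: arithmetic and a logical expression
gen price_k = price / 1000
gen heavy = (weight > 3000) if !missing(weight)

* egen: mean() here is an egen function computed within groups
egen mean_price = mean(price), by(foreign)

list make price_k heavy mean_price in 1/5
```

The egen line has no direct gen equivalent: gen evaluates its expression one observation at a time, while the egen mean() function summarizes across the observations in each group.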

Gen and Egen Functions

This section covers group and concat (egen functions) and cond (a function used with gen).

Group

Assign a unique, three-digit numeric value to each district and school name. The code below handles districts; ids start at 101.

    sort district
    egen group = group(district)
    sort group
    gen districtid = group + 100
    drop group

Apply Your Knowledge

Generate the variable schoolid, a three-digit number beginning with 301 that uniquely identifies each school in the schname variable. Hint: sorting by both district and schname will keep schools within a district together when the ids are created.

Concat

Join the two newly generated ids (district and school) into one six-digit number that uniquely identifies each school.

    egen id = concat(districtid schoolid)
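One detail worth checking when you run this: concat() returns a string variable, even when its inputs are numeric. The sketch below uses made-up toy values and hypothetical names (id_str, id_num) to show the string result next to a numeric alternative built with arithmetic.

```stata
* Toy data; the id values are invented for illustration
clear
input districtid schoolid
101 301
101 302
102 303
end

* concat() produces a string variable
egen id_str = concat(districtid schoolid)

* Numeric alternative: shift the district id three places, then add the school id
gen long id_num = districtid*1000 + schoolid

describe id_str id_num
list
```

Either form identifies schools uniquely; the numeric version is convenient if the id will later be used in sorting or arithmetic, and the string version if it will only be a label.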

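Condition (cond)

We can use the cond function with the generate command to count duplicate observations of an id number. The new variable dup is 0 when an id occurs only once; otherwise it numbers the observations that share the id 1, 2, 3, and so on.

    sort id race female
    quietly by id: gen dup = cond(_N==1, 0, _n)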
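A small toy dataset makes the behavior of _N (group size) and _n (position within the group) easier to see. The id values here are invented for illustration.

```stata
* Toy data: id 1 occurs once, id 2 twice, id 3 three times
clear
input id
1
2
2
3
3
3
end

sort id
quietly by id: gen dup = cond(_N==1, 0, _n)

* dup is 0 for id 1; 1 and 2 for id 2; 1, 2, and 3 for id 3
list, sepby(id)
```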

Apply Your Knowledge

Check how many duplicates there are for each value of id. Create the variable student and assign it a value of 0 if there are 3 or fewer duplicates of id, and a value of 1 if there are 4-6 duplicates. Generate a new variable, studentid, which joins id and student together. Drop the id, dup, and student variables from the dataset.

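Forming Composites

We can create a new variable, risk, that is the sum of the four dummy variables.

    gen risk = lep + sped + lo_read + frlunch

With gen, risk is missing whenever any one of the four components is missing. The egen rowtotal function avoids this problem by treating missing values as 0. (If risk already exists from the previous command, drop it first with drop risk.)

    egen risk = rowtotal(lep sped lo_read frlunch)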
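The difference between the two approaches shows up only in rows with missing values, so here is a sketch on three invented rows. The names risk_sum, risk_tot, and risk_tot_m are hypothetical, chosen so that all three versions can sit side by side.

```stata
* Toy data: a complete row, a row with one missing item, an all-missing row
clear
input lep sped lo_read frlunch
1 0 1 1
0 . 1 0
. . . .
end

* Plain addition: any missing component makes the sum missing (3, ., .)
gen risk_sum = lep + sped + lo_read + frlunch

* rowtotal(): missing values count as 0 (3, 1, 0)
egen risk_tot = rowtotal(lep sped lo_read frlunch)

* The missing option returns missing only when every component is missing
egen risk_tot_m = rowtotal(lep sped lo_read frlunch), missing

list
```

Note that plain rowtotal() scores the all-missing row as 0, which is indistinguishable from a student with no risk factors; the missing option is one way to keep those cases apart.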

Forming Composites

We can use principal components analysis (PCA) to generate weights for each of the items in risk. The predict command then generates a score on the first component for each observation in the dataset. (predict creates a new variable, so drop any existing risk variable first.)

    pca lep sped lo_read frlunch
    predict risk
    browse lep sped lo_read frlunch risk

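Categorical Variables

Our dataset contains the categorical variable race. We can deal with this variable in one of two ways:

1. Form dummy variables for each race subgroup. Note that the replace line below also sets race1 to 0 when race is missing; add & !missing(race) to the condition if missing values should stay missing.

    gen race1 = race if race==1
    replace race1 = 0 if race~=1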

Categorical Variables

Our dataset contains the categorical variable race. We can deal with this variable in one of two ways:

2. Indicate the categorical nature of the variable in the regression model.

    regress mathraw i.race
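The two approaches give the same fitted model. The sketch below illustrates this with Stata's bundled auto dataset (an assumption; rep78 stands in for race and price for mathraw), using tabulate with the generate() option to build all the dummies in one step.

```stata
* Sketch only: auto dataset, with rep78 (values 1-5) as the categorical variable
sysuse auto, clear

* Way 1: one dummy per category; rows with missing rep78 stay missing
tabulate rep78, generate(rep_d)
regress price rep_d2 rep_d3 rep_d4 rep_d5

* Way 2: factor-variable notation (Stata 11 or later); category 1 is the reference
regress price i.rep78
```

Both regress commands omit the first category as the reference group, so the coefficients should match.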

Interactions with Categorical Variables

Does the effect of risk vary by racial group? The user-written xi3 prefix command lets us form interactions from within the regression model.

    xi3: regress mathraw i.race*risk
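xi3 is not part of official Stata, so it must be installed before use. In Stata 11 or later, factor-variable notation offers a built-in way to specify the same kind of model. The sketch below assumes the bundled auto dataset, with foreign as the categorical variable and weight as the continuous one.

```stata
* Sketch only: auto dataset stands in for the workshop data
sysuse auto, clear

* ## includes both main effects and their interaction; c. marks weight as continuous
regress price i.foreign##c.weight

* Joint test of the interaction terms
testparm i.foreign#c.weight
```

With the workshop variables, the analogous specification would be regress mathraw i.race##c.risk, though the coefficient names will differ from the _I* names that xi3 creates.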

Apply Your Knowledge

Use the xi3 prefix to fit a regression model that includes interaction effects between race and class size. Do we see a differential effect of class size for any race groups?

Use the xi3 prefix to fit a regression model that includes interaction effects between race and both class size and time on the bus.

Creating Regression Tables

Stata can create formatted tables for regression models. These tables can be displayed within Stata or exported to a text format. Both methods rely on the eststo command, which stores estimation results. eststo is part of the user-written estout package and wraps Stata's official estimates store command.

Creating Regression Tables

There are two methods of storing the estimates of a regression model with eststo:

1. Invoke eststo immediately after a regression command and assign a name to the stored values.

    regress mathraw i.race risk
    eststo m1, title(Model 1)

Creating Regression Tables

There are two methods of storing the estimates of a regression model with eststo:

2. Use eststo: as a prefix before a regression command. Stata automatically assigns consecutive model names to the stored values and confirms with a message such as (est1 stored).

    eststo: regress mathraw risk

Creating Regression Tables

Two commands give access to the stored estimates: estout and esttab. estout is the more flexible of the two in modifying the appearance of the formatted regression table, but it requires more code to achieve APA-style tables. esttab is a "wrapper" for estout and simplifies the coding process.
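The whole store-then-tabulate workflow fits in a few lines. This sketch assumes the estout package has been installed from SSC and uses the bundled auto dataset; the model names m1 and m2 are arbitrary.

```stata
* Requires the user-written estout package: ssc install estout
sysuse auto, clear

* Clear any previously stored estimates so the table holds only these models
eststo clear

* Prefix form with explicit names
eststo m1: regress price weight
eststo m2: regress price weight mpg

* Side-by-side table with standard errors, R-squared, and variable labels
esttab m1 m2, se r2 label
```

Running eststo clear at the top of a do-file is a useful habit: without it, models stored earlier in the session can show up in later tables.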

ESTOUT

First, store three models. Stata confirms each one with (est1 stored), (est2 stored), and (est3 stored).

    eststo: xi3: regress mathraw i.race
    eststo: xi3: regress mathraw i.race risk class_sz bus_time
    eststo: xi3: regress mathraw i.race*risk class_sz bus_time

ESTOUT

    estout est1 est2 est3

Now let's try to format this table into something suitable for a research paper:

    estout using models_out.rtf, ///
        cells(b(star fmt(3)) se(par fmt(2))) ///
        legend label title(Regression Models) ///
        mlabels("Model A" "Model B" "Model C") ///
        varlabels(_cons INTERCEPT) ///
        stats(N r2 df_r, fmt(0 3 0) labels(N R2 DF)) ///
        style(fixed)

The slide then shows the resulting table.

ESTTAB

    esttab

We can make modifications to the standard esttab table:

    esttab using models.rtf, se r2 ar2 label ///
        title({\b Table 1.} {\i Hierarchy of Fitted Models}) ///
        nonumbers mtitles("Model A" "Model B" "Model C") ///
        varlabels(_cons INTERCEPT) ///
        order(_Irace_2 _Irace_3 _Irace_4 _Irace_5 class_sz bus_time risk _Ira2Xri _Ira3Xri _Ira4Xri _Ira5Xri) ///
        style(fixed)

The next slide shows the resulting table.


Here is the code for estout to produce the same table we just created in esttab:

    estout using `"models1.rtf"', ///
        cells(b(fmt(a3) star) se(fmt(a3) par)) ///
        stats(N r2 r2_a, fmt(%18.0g 3 3) labels(`"Observations"' `"{\i R}{\super 2}"' `"Adjusted {\i R}{\super 2}"')) ///
        starlevels("{\super *}" 0.05 "{\super **}" 0.01 "{\super ***}" 0.001, label(" {\i p} < ")) ///
        varwidth(20) modelwidth(12) ///
        begin({\trowd\trgaph108\trleft-108@rtfrowdefbrdr\pard\intbl\ql {) ///
        delimiter(}\cell \pard\intbl\qc {) ///
        end(}\cell\row}) ///
        title({\b Table 1.} {\i Hierarchy of Fitted Models}) ///
        prehead(`"{\rtf1\ansi\deff0 {\fonttbl{\f0\fnil Times New Roman;}}"' ///
            `"{\info {\author.}{\company.}{\title.}{\creatim\yr2010\mo3\dy31\hr14\min14}}"' ///
            `"\deflang1033\plain\fs24"' ///
            `"{\footer\pard\qc\plain\f0\fs24\chpgn\par}"' ///
            `"{\pard\keepn\ql @title\par}"' {) ///
        posthead() prefoot() ///
        postfoot(`"{\pard\ql\fs20 Standard errors in parentheses\par}"' ///
            `"{\pard\ql\fs20 @starlegend\par}"' } `"{\pard \par}"' `"}"') ///
        label varlabels(_cons INTERCEPT) ///
        mlabels("Model A" "Model B" "Model C",) nonumbers collabels(, none) ///
        eqlabels(, begin("{\trowd\trgaph108\trleft-108@rtfrowdefbrdrt\pard\intbl\ql {") replace nofirst) ///
        notype level(95) replace ///
        order(_Irace_2 _Irace_3 _Irace_4 _Irace_5 class_sz bus_time risk _Ira2Xri _Ira3Xri _Ira4Xri _Ira5Xri) ///
        style(fixed)

Graphing - Scatterplots

Bivariate scatterplot: mathraw on risk.

    scatter mathraw risk

Because risk takes only a few discrete values, the points line up in vertical columns and the graph has a "stacked" appearance. We can lessen this effect with the jitter option.

    scatter mathraw risk, jitter(4)

The slide shows the resulting graph.

Graphing - Scatterplots

Now let's add a fitted trend line to the scatterplot of mathraw on risk.

    twoway scatter mathraw risk, jitter(4) || lfit mathraw risk

The lfit plot type draws a linear fitted trend line. Two other plot types are available for the trend line: qfit (quadratic fit) and fpfit (fractional polynomial fit). The slide shows three graphs that illustrate these three fitted line options.
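Since the slide graphs are not reproduced here, the following sketch shows one way to draw the three fits side by side. It assumes the bundled auto dataset, and the graph names g_lfit, g_qfit, and g_fpfit are arbitrary.

```stata
* Sketch only: auto dataset stands in for the workshop data
sysuse auto, clear

* One named graph per fit type, so they can be combined afterwards
twoway (scatter price weight) (lfit price weight), name(g_lfit, replace) title("lfit")
twoway (scatter price weight) (qfit price weight), name(g_qfit, replace) title("qfit")
twoway (scatter price weight) (fpfit price weight), name(g_fpfit, replace) title("fpfit")

* ycommon puts all three panels on the same y-axis scale
graph combine g_lfit g_qfit g_fpfit, ycommon
```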

Graphing - Scatterplots

We can easily add 95% confidence intervals to the fitted trend line for any of the fitted trend line options.

    twoway scatter mathraw risk, jitter(4) || lfitci mathraw risk, ciplot(rline)
    twoway scatter mathraw gpa || qfitci mathraw gpa, ciplot(rline)
    twoway scatter mathraw risk, jitter(4) || fpfitci mathraw risk, ciplot(rline)

The slide shows the three graphs from the code above.

Graphing - Residual Scatterplots

Let's begin by fitting our regression model:

    xi3: regress mathraw female class_sz bus_time i.race*risk

Stata has two postestimation commands for checking (raw) residuals: rvpplot plots residuals against a predictor, and rvfplot plots them against the fitted values.

    rvpplot class_sz, yline(0)
    rvfplot, yline(0)

The slide shows the graphs for these two commands.

Graphing - Residual Scatterplots

Suppose we want to plot the studentized residuals against the predictors and fitted values. We must generate studentized residuals for each observation and also predict fitted values of mathraw for each observation.

    predict student if e(sample), rstudent
    predict fitted
    scatter student fitted, yline(2) yline(-2)
    scatter student class_sz, yline(2) yline(-2)

Graphing - Regression Lines

We can create fitted regression lines in Stata by using the xi3 prefix with the regression command. The graph is generated by the postgr3 command (also user-written) followed by the variable of interest.

    xi3: regress mathraw i.race*risk class_sz bus_time female
    postgr3 risk

The slide shows the graph that results from this code.

Graphing - Regression Lines

We can enhance our graph to show the effect of risk on mathraw by including prototypical values of class_sz in the graph. Stata treats missing values as larger than any number, so the last replace excludes them explicitly; otherwise observations with missing class_sz would be coded 3.

    gen class_cat = 1 if class_sz<=17
    replace class_cat = 2 if class_sz>17 & class_sz<=30
    replace class_cat = 3 if class_sz>30 & !missing(class_sz)
    xi3: regress mathraw i.race*risk class_cat bus_time female
    postgr3 risk, by(class_cat)

The slide shows the graph from this code.

Graphing - Regression Lines

We can also generate a graph of the regression lines that shows the interactions between risk and race.

    postgr3 risk, by(race)

The slide shows the graph produced by this code.
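If postgr3 is not installed, official Stata (version 12 or later) can draw a comparable plot of fitted lines by group with margins and marginsplot. This is a swapped-in technique, not the workshop's method; the sketch assumes the bundled auto dataset, with foreign standing in for race and weight for risk.

```stata
* Sketch only: margins/marginsplot as a built-in alternative to postgr3
sysuse auto, clear

* Interaction model in factor-variable notation, plus one covariate
regress price i.foreign##c.weight mpg

* Predicted price for each group across a grid of weight values
margins foreign, at(weight=(2000(500)4500))

* One fitted line per group, with confidence intervals
marginsplot
```

By default, margins averages over the remaining covariates as observed (here mpg), so the lines may not match a postgr3 graph that fixes covariates at specific values.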

Graphing - Regression Lines

We can use the graph combine command in Stata to join two graphs together.

    xi3: regress mathraw class_sz bus_time female i.race*risk
    postgr3 risk, by(race) x(female=1) name(female)
    postgr3 risk, by(race) x(female=0) name(male)
    graph combine female male, ycommon

The slide shows the graph for the preceding code.

Questions? Please complete the evaluation of this workshop. Thank you!
