
Ec 2390: Section 1 Useful STATA commands Jack Willis September 14th, 2015.


1 Ec 2390: Section 1 Useful STATA commands Jack Willis September 14th, 2015

2 Good coding practice
Not the focus of today’s section
Very important if you do empirical work
Essential if you collaborate
Different approaches. Common themes:
– Version control
– Automation. Replicability
– Directories. Make code portable. Don’t overwrite raw data
– Documentation. Comment your code
– (Abstraction and testing)
Gentzkow and Shapiro detail a very thorough approach here:
– http://web.stanford.edu/~gentzkow/research/CodeAndData.pdf

3 General
Use do files
Set a directory, e.g. cd "C:\data"
– Makes collaborating much simpler
Ctrl+D runs the do file (or the selected lines)
Save your initial dataset (don’t reimport the data each time)
Use help files. Type help command_name
To find third-party packages: findit package_name
Google your questions! Ask your classmates.
Learn from the code of others. AER replication code is a good source

4 Topics
Importing data
Local variables
Loops
* and -
Time series / panel mode
bysort
egen
collapse
reshape
_n and _N
Regression
estout
Factor variables

5 Importing data
Commands: import, insheet
– Treating the first line as variable names is an option
(You can also paste data directly into the editor)
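As a rough illustration of the "first line as variable names" option, the sketch below writes Stata's bundled auto dataset to a temporary CSV file and reads it back. The dataset and the temporary file name are assumptions made for the example; they are not part of the original slides.

```stata
* Load a bundled example dataset (assumed here for illustration)
sysuse auto, clear

* Write it out as a CSV file in a temporary location
tempfile csvfile
export delimited using "`csvfile'.csv", replace

* Read the CSV back in, treating the first row as variable names
import delimited using "`csvfile'.csv", varnames(1) clear
describe
```

Once the data are imported, saving them with save lets later runs start from the .dta file instead of re-importing.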

6 Local variables
Local variables enable you to write flexible code
They also allow you to store scalars
Define a local:
– local indep_varname education
Reference a local by wrapping its name in a left quote (`) and a right quote ('):
– regress schooling `indep_varname'
E.g.:
    local i 1
    display "i = `i'"
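A minimal sketch of defining and referencing locals, assuming Stata's bundled auto dataset and the variables price and mpg (neither appears in the original slides):

```stata
* Load a bundled example dataset (assumed here for illustration)
sysuse auto, clear

* Store a variable name and a number in locals
local indep_varname mpg
local i 1

* Stata substitutes the contents of each local before running the line
regress price `indep_varname'
display "i = `i'"

* With an equals sign, the local stores the result of an expression
local j = `i' + 1
display "j = `j'"
```

Note that locals only exist while the do file (or the selected lines) is running, so define and use them in the same run.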

7 for loops
The foreach command loops through a section of code. Basic syntax:
    foreach i in list {
        code referencing `i'
    }
The syntax is based on local variables
Different loops:
– foreach lname in list (for example: foreach file in file1.dta file2.dta file3.dta)
– foreach lname of varlist varlist
– foreach num of numlist numlist (for example: foreach x of numlist 1/1000)
– Etc.
The forvalues command also repeats a block of code, but loops specifically through numeric values:
– forvalues num = 5/13 loops from 5 to 13
– forvalues i = 1(2)100 loops from 1 toward 100 in steps of 2 (1, 3, ..., 99)
– Etc.
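The loop forms above can be sketched as follows. The auto dataset, the variable names, and the word list are assumptions chosen for illustration.

```stata
* Load a bundled example dataset (assumed here for illustration)
sysuse auto, clear

* foreach ... of varlist: summarize each variable in turn
foreach v of varlist price mpg weight {
    summarize `v'
}

* foreach ... in: loop over an arbitrary list of words
foreach word in alpha beta gamma {
    display "`word'"
}

* foreach ... of numlist: add up the numbers 1 to 5
local total 0
foreach x of numlist 1/5 {
    local total = `total' + `x'
}
display "Sum of 1 to 5 = `total'"

* forvalues: count the iterations over 1, 3, 5, 7, 9
local count 0
forvalues i = 1(2)9 {
    local count = `count' + 1
}
display "Number of iterations = `count'"
```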

8 * and -
When defining a variable list:
– X* refers to all variables whose names start with X
– X-Z refers to all variables between X and Z, inclusive, in the current variable order
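A small sketch of both shortcuts, assuming the bundled auto dataset, in which price, mpg, rep78 and headroom are adjacent in the variable order:

```stata
* Load a bundled example dataset (assumed here for illustration)
sysuse auto, clear

* Show the current variable order
describe, simple

* All variables from price through headroom, in dataset order
summarize price-headroom

* All variables whose names start with t
summarize t*

* unab expands a variable list and stores the full names in a local
unab expanded : price-headroom
display "`expanded'"
```

Because X-Z depends on the current variable order, commands such as order can change what the range picks up.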

9 Binary logic
A == 1 returns 1 (true) when A equals 1, and 0 (false) otherwise
A != 1 reads “A not equal to 1”
A == 1 & B > 2 is “and”
A == 1 | B > 2 is “or”
if can be added to the end of many commands, in which case the command uses only the observations for which the statement is true (i.e. = 1)
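A short sketch of these operators in use, assuming the bundled auto dataset. The last two commands illustrate a point the slide does not mention: Stata treats numeric missing values as larger than any number, so a condition like rep78 > 4 is also true for observations where rep78 is missing.

```stata
* Load a bundled example dataset (assumed here for illustration)
sysuse auto, clear

* A logical expression evaluates to 1 (true) or 0 (false)
gen byte cheap_foreign = (foreign == 1 & price < 6000)
tabulate cheap_foreign

* if restricts a command to the observations where the condition holds
summarize price if foreign == 1 | mpg > 30

* Missing values count as very large numbers in comparisons
count if rep78 > 4
local n_all = r(N)
count if rep78 > 4 & !missing(rep78)
local n_valid = r(N)
display "Including missing: `n_all', excluding missing: `n_valid'"
```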

10 Time series / panel
xtset sets up time series / longitudinal analysis. It enables many useful operators; for this week’s problem set the lag operator L is the most relevant.
Suppose I had a country-product-year level dataset on exports:
    egen panel = group(product countrycode)
    sort panel year
    xtset panel year
    gen exports_plus10 = F10.exports
F10.exports is the value of exports 10 periods (here, years) ahead. Stata automatically leaves missing values where no such observation exists.
Alternatively, once you have xtset, rather than generating the variables you can include F and L in your regression directly, e.g.:
    reg income L(1/3).income
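The following sketch shows the lag and lead operators on a tiny made-up panel. The data values are invented purely for illustration; only the variable names follow the slide.

```stata
* Build a small hypothetical country-product-year panel
clear
input str2 countrycode str5 product year exports
"US" "wheat" 2000 10
"US" "wheat" 2001 12
"US" "wheat" 2002 15
"FR" "wine"  2000 20
"FR" "wine"  2001 21
"FR" "wine"  2002 25
end

* One panel identifier per product-country pair
egen panel = group(product countrycode)
xtset panel year

* L. is the previous year's value, F. is the next year's value
gen exports_lag1  = L.exports
gen exports_lead1 = F.exports

* The first year has no lag and the last year has no lead
list, sepby(panel)
```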

11 Execute commands by subgroups: bysort
bysort runs a Stata command separately for each value of a variable
Using just by requires the data to be sorted by the variable in consideration; bysort does that for you
Examples:
– Run separate regressions for the observations where foreign is “domestic” and where foreign is “foreign”
– Summarize the variables price and mpg where foreign is “domestic” and where foreign is “foreign”
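The slide's code is not in the transcript, so the following is only a plausible reconstruction. It assumes the bundled auto dataset (which contains foreign, price and mpg), and the regression specification is made up for the example.

```stata
* Load a bundled example dataset (assumed here for illustration)
sysuse auto, clear

* Summarize price and mpg separately for domestic and foreign cars
bysort foreign: summarize price mpg

* Run one regression per group (specification chosen for illustration)
bysort foreign: regress price mpg
```

After bysort the data remain sorted by foreign, so later commands can use plain by foreign: without sorting again.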

12 _n and _N
_n is a running count of observations (within group)
_N is the total count of observations (within group)
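A brief sketch of _n and _N with and without a by prefix, assuming the bundled auto dataset. The new variable names are hypothetical.

```stata
* Load a bundled example dataset (assumed here for illustration)
sysuse auto, clear

* Without by: _n is the observation number, _N the dataset size
gen obs_id    = _n
gen total_obs = _N

* With by: both are counted within each group
* (price in parentheses is used for sorting only, not for grouping)
bysort foreign (price): gen rank_in_group = _n
by foreign: gen group_size = _N

list foreign price rank_in_group group_size in 1/5
```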

13 by then sort
You can use by and sort separately, in which case you must sort before using by. So bysort is easier.
But sometimes you want to sort by more variables than you subsequently use for by. E.g.:
    sort country product year
    by country product: gen first_year = 1 if _n == 1
This creates an indicator for the first year a product is reported for the country (1 in that year, missing otherwise)
Note the “if”: it can be used with many commands
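bysort can also do this in one line: variables placed in parentheses are used for sorting but not for grouping. The sketch below uses invented data and also shows a 0/1 version of the indicator as an alternative; the name first_year01 is hypothetical.

```stata
* Build a small hypothetical dataset, deliberately out of order
clear
input str2 country str5 product year
"US" "wheat" 2002
"US" "wheat" 2000
"US" "wheat" 2001
"FR" "wine"  2001
"FR" "wine"  2000
end

* Group by country and product, sort by year within each group
bysort country product (year): gen first_year = 1 if _n == 1

* Alternative: a true 0/1 dummy instead of 1/missing
by country product: gen byte first_year01 = (_n == 1)

list, sepby(country product)
```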

14 Using bysort to identify duplicates
(Example output on the slide: 4 groups of duplicates)
Note that bysort cannot be used with every Stata command, e.g. scatter, histogram, etc.
Stata also has built-in duplicates commands. Type help duplicates
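The slide's example is not in the transcript; the sketch below shows one common way to flag duplicates with bysort, next to the built-in duplicates commands. The data and variable names are invented for illustration.

```stata
* Build a small hypothetical dataset with repeated rows
clear
input id value
1 10
1 10
2 20
3 30
3 30
3 30
end

* Number the copies within each set of identical rows
bysort id value: gen copy_number = _n
by id value: gen n_copies = _N

* Built-in alternatives
duplicates report id value
duplicates tag id value, generate(dup_tag)

list, sepby(id)
```

Here copy_number > 1 marks the surplus copies, while dup_tag holds the number of other identical rows (0 for unique observations).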

15 Using bysort to generate lagged variables
You can also index variables by observation number; _n is particularly useful here
    sort country product year
    by country product: gen lagged_var = var[_n-1]
What will this do for _n == 1?
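To explore the question above, here is a small sketch on invented data (the variable exports stands in for var). Under a by prefix, subscripts are counted within each group, and a reference to observation 0 returns a missing value, so the first year of each group ends up missing.

```stata
* Build a small hypothetical dataset
clear
input str2 country str5 product year exports
"US" "wheat" 2000 10
"US" "wheat" 2001 12
"US" "wheat" 2002 15
"FR" "wine"  2000 20
"FR" "wine"  2001 21
end

sort country product year
by country product: gen lagged_exports = exports[_n-1]

* The first observation of each group has no predecessor
list, sepby(country product)
```

Unlike L.exports after xtset, this takes the previous row regardless of gaps in year, so the two approaches can differ when years are missing.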

16 egen (within variables)
Create new variables that are statistical functions of individual original variables across all, or groups of, the observations
– Means for the whole sample
– Means for subgroups
egen covers many functions: check the help file
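The slide's code is not in the transcript. A minimal sketch of both cases, assuming the bundled auto dataset and hypothetical new variable names:

```stata
* Load a bundled example dataset (assumed here for illustration)
sysuse auto, clear

* Mean for the whole sample: every observation gets the same value
egen mean_price = mean(price)

* Means for subgroups: one value per group of foreign
bysort foreign: egen mean_price_grp = mean(price)

* Example use: each car's price relative to its group mean
gen price_dev = price - mean_price_grp
list make foreign price mean_price mean_price_grp in 1/5
```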

17 egen (across variables)
Create new variables that are statistical functions of multiple original variables for each observation
(Example statistical functions were shown on the slide)
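The row-wise egen functions are one way to do this. The sketch below uses invented test scores; the slide's own examples are not in the transcript.

```stata
* Build a small hypothetical dataset with one missing score
clear
input test1 test2 test3
80 90 70
60 .  75
50 55 60
end

* Each function works across the listed variables, row by row
egen avg_score   = rowmean(test1 test2 test3)
egen total_score = rowtotal(test1 test2 test3)
egen n_missing   = rowmiss(test1 test2 test3)
egen best_score  = rowmax(test1 test2 test3)

list
```

Note how missing values are handled: rowmean averages over the non-missing scores, while rowtotal treats a missing score as zero.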

18 collapse
Produce a new file with a single observation for each group of records in the original dataset.
The slide's example produces the group means and medians (its labels: college, frequency weights, aggregated).
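A minimal sketch of collapsing to group means and medians, assuming the bundled auto dataset rather than the dataset shown on the slide:

```stata
* Load a bundled example dataset (assumed here for illustration)
sysuse auto, clear

* Replace the data with one observation per value of foreign
collapse (mean) mean_price=price (median) med_price=price, by(foreign)

list
```

collapse replaces the dataset in memory, so save the original first (or use preserve and restore) if you still need it.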

19 reshape
You might find it easier just to follow the two examples in the Stata help file for reshape whenever you need to do it.

20 Reshape wide to long
When you have a wide dataset but need a long one, you can reshape the data from wide to long
Why would you do this?
– Some Stata statistical procedures (e.g. xtreg for panel data) require the data to be in long form

21 Let’s look at the code in depth
The annotations on the reshape command:
– We want our data to end up in long form
– The two vars that currently have numbers tacked on the end of their names are the ones we want to reshape. In Stata these are called “stubs”
– Take the numbers off the end of the reshape vars, and put them in a new var called “year”
– The ID variable specifies a unique individual
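The command being annotated is not in the transcript. A sketch of what such a call typically looks like, using invented data, the stubs inc and ue mentioned later in the deck, and an assumed ID variable named id:

```stata
* Build a small hypothetical wide dataset
clear
input id sex inc80 inc81 inc82 ue80 ue81 ue82
1 0 5000 5500 6000 0 1 0
2 1 2000 2200 3300 1 0 0
3 0 3000 2000 1000 0 0 1
end

* long     : the target form
* inc ue   : the stubs (variable names without the trailing numbers)
* i(id)    : identifies a unique individual
* j(year)  : new variable that receives the numbers from the names
reshape long inc ue, i(id) j(year)

list, sepby(id)
```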

22 Reshape Wide to Long Without ID What if there is no ID variable? Let’s create one
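One simple way to create an ID, sketched on invented data: number the rows with _n before reshaping. The variable name id is an assumption.

```stata
* Build a small hypothetical wide dataset without an identifier
clear
input inc80 inc81 ue80 ue81
5000 5500 0 1
2000 2200 1 0
end

* Each row is one individual, so the row number can serve as the ID
gen id = _n

reshape long inc ue, i(id) j(year)
list, sepby(id)
```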

23 Reshape long to wide
When you have a long dataset but need a wide one, you can reshape the data from long to wide
… and optionally reorder the variables
– The order command serves only to rearrange the sequence of the variables in the file

24 Let’s look at the code in depth
– We want our data to end up in wide form
– inc and ue are the two vars that change each year, and that we want to stick numbers on the end of
– Take the values in the variable “year”, and stick them on the end of inc and ue
– The ID variable specifies a unique individual
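A sketch of the long-to-wide direction on invented data, including the optional order step. The ID variable name id is an assumption; inc, ue and year follow the slide.

```stata
* Build a small hypothetical long dataset
clear
input id year inc ue
1 80 5000 0
1 81 5500 1
2 80 2000 1
2 81 2200 0
end

* wide     : the target form
* inc ue   : the variables that change each year
* i(id)    : identifies a unique individual
* j(year)  : its values are appended to inc and ue
reshape wide inc ue, i(id) j(year)

* Optional: group the inc variables and the ue variables together
order id inc80 inc81 ue80 ue81

list
```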

25 Regression
Many commands
Simplest: regress `y_var' `x_vars'
Use the robust option
Has a cluster option
Outputs a table but also stores results locally: see “Returned results” in the help file
To save results: estimates store name
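A sketch tying these points together, assuming the bundled auto dataset. The choice of variables, the clustering variable, and the stored names are all made up for illustration.

```stata
* Load a bundled example dataset (assumed here for illustration)
sysuse auto, clear

local y_var price
local x_vars mpg weight

* Heteroskedasticity-robust standard errors
regress `y_var' `x_vars', robust

* Results stay available after the command finishes
local n_obs = e(N)
display "Observations: `n_obs'"
display "Coefficient on mpg: " _b[mpg]
estimates store robust_se

* Clustered standard errors (cluster variable is illustrative only)
regress `y_var' `x_vars', cluster(rep78)
estimates store clustered_se

* List everything the last regression stored
ereturn list

* Compare the stored results side by side
estimates table robust_se clustered_se, se
```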

26 Exporting regression tables
You will have to re-run your empirical work a lot
It is essential to automate table output rather than copy-pasting
There are numerous ways to do this
Many people use estout. You can install it by typing ssc install estout
outreg and outreg2 are two alternatives

27 estout
Store regression estimates, then run estout to export the table. Many options: check the help file
1. Run regressions
    regress growth X
    estimates store regression1
    regress growth gamma2-gamma20
    estimates store region_FE
    regress growth gamma2-gamma20, robust
    estimates store robust
    ...

28 2. Output table
    #delimit ;
    estout regression1 region_FE robust using filename, style(tex)
        cells(b(star fmt(%9.3f)) se(par))
        stats(r2_a N, fmt(%9.3f %9.0g) labels(R-squared))
        legend label collabels(none) varlabels(_cons Constant)
        drop(gamma*) starlevels(* 0.10 ** 0.05 *** 0.01) ;
    #delimit cr
The drop() option prevents Stata from reporting the coefficients of all the gamma dummies.
The style(tex) option produces tables that you can just copy into LaTeX (just take it out if you don't want that).

29 3. Include the table in your LaTeX doc:
    \begin{table}[!h]
    \begin{center}
    \input{filename}
    \caption{Growth regressions}
    \label{Table: Growth regressions I}
    \end{center}
    \end{table}

30 Factor variables
Factor variables are a convenient way to include dummy or indicator variables when fitting a model
For instance, assume that the categorical variable agegrp contains 1 for ages 20–24, 2 for ages 25–39, 3 for ages 40–44, etc. Typing
    logistic outcome weight i.agegrp
estimates a logistic regression of outcome on weight and dummies for each agegrp category
Examples:
– i.group: indicators for levels of group
– i.group#i.sex: indicators for each combination of levels of group and sex, a two-way interaction
– group#sex: same as i.group#i.sex
– group##sex: same as i.group i.sex group#sex
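A sketch of these operators in regressions, assuming the bundled nlsw88 dataset and its variables wage, race, collgrad and tenure (none of which appear in the original slides). The last regression uses the c. prefix, which marks a variable as continuous inside an interaction.

```stata
* Load a bundled example dataset (assumed here for illustration)
sysuse nlsw88, clear

* Indicators for each level of race (the lowest level is the base)
regress wage i.race

* Main effects plus the two-way interaction
regress wage i.race##i.collgrad

* Interaction between an indicator and a continuous variable
regress wage i.collgrad##c.tenure
display "Extra return to tenure for graduates: " _b[1.collgrad#c.tenure]
```

With factor variables there is no need to generate the dummies by hand, and the base category is dropped automatically.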

