Data Management in Stata

Samuel DeWitt Part I – March 2nd, 2016

Organization
1. Importance of Data Management
2. Sample Data and Definitions
3. Basic Functions to Know
4. Data Exercises with Basic Functions
5. Advanced Functions to Know
6. Data Exercises with Advanced Functions
7. Some Other Functions to Learn
8. Special Topics
9. El Fin

1) Why is Data Management Important?
How often do we get “clean” data?
  Extremely rarely (never, in my experience)
Gap in methods/statistics education
  We are taught the “how” of analysis, not the data management needed to get to that point
Typical cause of delay
  For newcomers or basic users, it could take weeks or months to get data ready
Enables use of different kinds of statistical analyses

2-A) Sample Data
NLSY97
  Longitudinal sample of American youth, ages as of Wave 1
Motivating Question
  What is the relationship between employment (dichotomous, weeks worked) and the variety of crimes (0-17 different types of self-reported crimes) a youth reports in any given wave?
Data
  Waves 1-3
Variables
  Employment: Yes/No
  Weeks worked (0 to max)
  Crime Variety Score
  Demographics
  Job Satisfaction
  And more…

2-B) Some Preliminary Definitions
Data in “Wide” Format

    id      variety1  variety2
    24
    1341
    4755

Data in “Long” Format

    id      wave  variety
    24      1
    24      2
    1341
    4755

2-C) Some Preliminary Definitions
Information to Know: Data in “Wide” Format
  Observations are simply persons; the matrix is as wide as the number of covariates in the data set.
  It is helpful to use a numerical suffix to denote observation sequences, hence the variety1 and variety2 depicted here.
  As long as variables are named this way, you can use shortcuts in reshape commands (and in other commands, too).
  For example, to compute descriptive statistics across variety scores, I can use an asterisk to request them for all variety variables:
    tabstat variety*, stats(mean n) column(statistics)

    id      variety1  variety2
    24
    1341
    4755
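The wildcard shortcut above can be tried on a toy wide-format data set. This is only a sketch: the ids echo the slide, but the variety values are made up for illustration.

```stata
* Toy wide-format data; the variety values are invented for this example.
clear
input id variety1 variety2
24   0 2
1341 1 1
4755 3 0
end

* variety* expands to every variable whose name begins with "variety",
* so one command reports the mean and N for each wave's score.
tabstat variety*, statistics(mean n) columns(statistics)
```

The same `variety*` pattern works in most commands that accept a varlist, such as `summarize` or `describe`.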

2-D) Some Preliminary Definitions
Information to Know: Data in “Long” Format
  Observations are now person-waves; the matrix is narrower but much taller, because the number of rows is now N * t, the number of original cases times the number of waves.
  Suffixes are no longer needed, but descriptive statistics by wave take slightly more work to obtain:
    bysort wave: sum variety, detail
  You could also compute person-means across waves:
    bysort id: sum variety, detail
    bysort id: egen varmean=mean(variety)

    id      wave  variety
    24      1
    24      2
    1341
    4755
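As a rough illustration of the long-format commands above, here is a small person-wave data set. The variety values are hypothetical; only the variable names follow the slide.

```stata
* Toy long-format data: one row per person-wave (values are invented).
clear
input id wave variety
24   1 0
24   2 2
1341 1 1
1341 2 1
4755 1 3
4755 2 0
end

* Descriptive statistics separately for each wave
bysort wave: summarize variety, detail

* Person-mean across waves, repeated on every row for that person
bysort id: egen varmean = mean(variety)
list, sepby(id)
```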

3) Basic Functions to Know – A Short, Incomplete List
A) By, sort and Bysort
B) Egen
C) Reshape
D) Gsort
E) Lags and leads
F) Carryforward

3-A) Basic Functions to Know: By & Bysort
Various: runs another Stata command within groups defined with “by” or “bysort”.
Use
  In practice, “by, sort” and “bysort” are the same. To see descriptive statistics of variety1 by race, I would write:
    bysort race: sum variety1, detail
  Or:
    by race, sort: sum variety1, detail
Helpful Hint
  Suppose you need to sort by a variable but do not want statistics computed separately for each of its values. To generate a mean crime variety score for the first 3 waves, you would write:
    bysort nlsyid (wave): egen varmean3=mean(variety) if wave<=3
  Variables in parentheses are used for sorting only, not for grouping. I want the data sorted by person-waves but do not want a different mean for every person-wave, which is what “bysort nlsyid wave: egen varmean3=mean(variety) if wave<=3” would produce.
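To see the difference the parentheses make, the two versions can be run side by side on invented data. The variable `rowmean` is a hypothetical name added here for comparison.

```stata
* Toy person-wave data (values are invented for illustration).
clear
input nlsyid wave variety
1 1 2
1 2 0
1 3 1
1 4 5
2 1 0
2 2 0
2 3 3
2 4 1
end

* Grouped by person only; wave just controls the sort order.
* Rows with wave > 3 are excluded by the if condition and stay missing.
bysort nlsyid (wave): egen varmean3 = mean(variety) if wave <= 3

* Grouped by person AND wave: each group is a single row, so the
* "mean" simply reproduces that row's own variety score.
bysort nlsyid wave: egen rowmean = mean(variety) if wave <= 3

list, sepby(nlsyid)
```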

3-B) Basic Functions to Know: Egen
Runs a number of mathematical procedures over a set of pre-defined variables.
Use
  When data are in long format, egen tends to be helpful when combined with “bysort”. Say I want cluster means for gross household income over time; I would write:
    bysort nlsyid: egen hhinc_mean=mean(hhinc)
Helpful Hint
  Perhaps the most useful egen function when dealing with panel data is “tag”. Here is what it looks like in action: I have data in person-observations and want to flag just one row per person to put together static descriptives (sex, race, etc.):
    egen id_tag=tag(nlsy_id)
  This results in one row per person having a value of 1 for id_tag, and all other rows having 0s.
  It is also very helpful when you want to tag unique observations within person (e.g., arrest cycles within person ids).
  Keep in mind that some egen functions, such as total() and rowtotal(), treat missing values as 0 by default, while others, such as mean(), ignore them. This can often be handled by adding a conditional statement to the end of the command (the behavior holds for many egen functions, not all).
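The tagging and missing-value behavior described above can be checked on a small invented panel. The variable names `male` and `hhinc_total` are assumptions made for this sketch.

```stata
* Toy panel with missing household income (values are invented).
clear
input nlsy_id wave hhinc male
1 1 30000 1
1 2 .     1
1 3 34000 1
2 1 .     0
2 2 .     0
2 3 .     0
end

* Flag exactly one row per person, then tabulate a static trait once per person
egen id_tag = tag(nlsy_id)
tabulate male if id_tag == 1

* mean() skips missing values; total() counts them as 0,
* so person 2 (all missing) gets a missing mean but a total of 0.
bysort nlsy_id: egen hhinc_mean  = mean(hhinc)
bysort nlsy_id: egen hhinc_total = total(hhinc)
list, sepby(nlsy_id)
```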

3-C) Basic Functions to Know: Reshape
Switches data from long to wide (and vice versa).
Use
  Suppose I have data in “wide” format that I need in “long” format to run panel data analyses. I would use “reshape long …” to accomplish this:
    reshape long varnames, i(person id variable) j(panel series varname)
Helpful Hint
  It is always good to keep a version of your data in each format, in case you need the other format for variable creation or some type of analysis.
  Also, reshape the data in a different do-file than the one you use to run analyses. Otherwise, if you mess something up, you will have to re-run the reshape every time (I will break this rule later).
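A minimal round trip between the two formats might look like the sketch below. The data values are invented, and `wave` is the name chosen here for the j() variable.

```stata
* Toy wide-format data (values are invented for illustration).
clear
input id variety1 variety2
24   0 2
1341 1 1
4755 3 0
end

* Wide to long: "variety" is the stub, and the numeric suffix
* (1, 2) becomes the new variable wave.
reshape long variety, i(id) j(wave)
list, sepby(id)

* Long back to wide: recreates variety1 and variety2.
reshape wide variety, i(id) j(wave)
list
```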

3-D) Basic Functions to Know: Gsort
An alternative to “sort” that allows you to sort variables in ascending or descending order.
Use
  Say you need to sort one variable in ascending order and another in descending order; gsort allows you to do this:
    gsort +nlsyid -wave
  This sorts nlsyid in ascending order (+) and wave in descending order (-).
Helpful Hints
  Often helpful when putting data in their initial order, or when computing variables within IDs that are sensitive to sort order.
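A short sketch of the mixed sort on invented data:

```stata
* Toy data entered in scrambled order (values are invented).
clear
input nlsyid wave
2 1
1 2
2 2
1 1
end

* Ascending person id, descending wave within person
gsort +nlsyid -wave
list, sepby(nlsyid)
```

After the sort, each person's most recent wave appears first within that person's block of rows.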

3-E) Basic Functions to Know: Lags and Leads
Allow you to reference observations before (lag) or after (lead) a certain observation of interest.
Use
  Say I want to create a difference between this year’s and last year’s crime variety score, but my data are now in long format. No need to worry; a lag can solve that easily:
    gen vardif=variety-variety[_n-1] if nlsyid==nlsyid[_n-1] & year!=1
  This tells Stata to generate a difference value for each variety score comparison (2 to 1, 3 to 2, 4 to 3, etc.) as long as nlsyid is the same for both rows and it is not the first observation (no difference exists there, of course). The data must already be sorted by person and time for this to work.
Helpful Hint
  Lagged values can be helpful in a number of ways, but be mindful that they can create some pretty serious multicollinearity problems at times.
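The lag logic can be sketched on invented data in three ways: the explicit subscript from the slide, the same subscript under `bysort`, and Stata's time-series operators. The names `vardif2`, `vardif3`, and `variety_lead` are hypothetical and used only for comparison.

```stata
* Toy person-wave data with no gaps between waves (values are invented).
clear
input nlsyid wave variety
1 1 2
1 2 0
1 3 1
2 1 0
2 2 3
2 3 3
end

* Explicit subscript: requires the data to be sorted first
sort nlsyid wave
gen vardif = variety - variety[_n-1] if nlsyid == nlsyid[_n-1]

* Same idea under bysort: _n restarts within each person, so the
* first row per person refers to observation 0 and comes out missing
bysort nlsyid (wave): gen vardif2 = variety - variety[_n-1]

* Time-series operators after declaring the panel structure:
* D. is the first difference, F. is the one-period lead
xtset nlsyid wave
gen vardif3 = D.variety
gen variety_lead = F.variety

list, sepby(nlsyid)
```

With consecutive waves all three differences agree. If a person skips a wave, the subscript versions compare adjacent rows, while `D.` returns missing because the previous period is absent.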

3-F) Basic Functions to Know: Carryforward
Allows you to copy and replace values down rows. (carryforward is a user-written command; it can be installed with “ssc install carryforward”.)
Use
  Suppose I have a variable measured in only the first wave, but I want to copy it into the following rows. carryforward can do that easily, but you need to be careful:
    bysort person_id: carryforward varname, replace
Helpful Hint
  Always use carryforward with “by, sort” or “bysort”; otherwise it will copy values into rows belonging to other IDs.
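If installing a user-written command is not an option, the same fill-down can be approximated with built-in commands. This is a sketch of an alternative technique, not the carryforward command itself; the data and the variable `sex` are invented.

```stata
* Toy data: sex is recorded only in each person's first wave (invented).
clear
input person_id wave sex
1 1 1
1 2 .
1 3 .
2 1 0
2 2 .
end

* Within person, copy the previous row's value into missing rows.
* replace works through observations in order, so the value cascades
* down all later rows of the same person and never crosses IDs.
bysort person_id (wave): replace sex = sex[_n-1] if missing(sex) & _n > 1

list, sepby(person_id)
```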

4) Data Exercises with Basic Functions
Or, the complicated nature of things that should, by all means, be easier than they are.

5) Advanced Functions to Know – Another Short, Incomplete List
A) Import and Export
B) Merge
C) Global and Local
D) Foreach and Forvalues
E) While
F) Statsby

5-A) Advanced Functions to Know: Import and Export
Allow you to port various types of data into or out of Stata (most helpful with Excel files).
Use
  Say someone you collaborate with uses a different data analysis program than you do. You will have to share a file that both programs recognize; often, this is a spreadsheet-type file (.csv; .xls; .xlsx).
  Luckily, the import function now accepts Excel files with multiple sheets, so you do not always have to use a .csv (like I used to).
    import excel using "file path & name here", sheet("sheetname here") other options…
    export excel using "file path & name here"
Helpful Hint
  Did you know you can combine dish detergent with vinegar to create a super-cleaning agent? Great for cleaning tile in bathrooms…
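A round trip through Excel can be sketched with Stata's bundled auto data set. The file name `auto_example.xlsx` and sheet name `cars` are placeholders chosen for this example; the file is written to the current working directory.

```stata
* Load an example data set that ships with Stata
sysuse auto, clear

* Write it to an Excel workbook, with variable names in the first row
export excel using "auto_example.xlsx", sheet("cars") firstrow(variables) replace

* Read the same sheet back in, treating the first row as variable names
import excel using "auto_example.xlsx", sheet("cars") firstrow clear
describe
```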

5-B) Advanced Functions to Know: Merge
For combining data sets.
Use
  There are three different kinds of merges: 1:1; 1:m/m:1; and m:m (it is almost never right to use this last one).
  The first numeral or letter corresponds to the “master” data, the second to the “using” data.
  You can match on just one or on multiple identifiers:
    merge 1:m person_id quarter using "data set name here"
  A merge always returns a _merge variable, which equals 1 for observations found only in the master data, 2 for observations found only in the using data, and 3 for matched observations. Depending upon the purpose of the merge, some of these values may signal a problem.
Helpful Hint
  Always double-check the _merge variable after a merge, and identify a validation observation from both data sets beforehand so it is easier to check how things went wrong (if they did!).
  You can suppress the _merge variable (with the nogenerate option), but don’t unless you really know what you are doing!
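The _merge values are easiest to understand on a tiny example. The sketch below merges invented person-quarter records (master) with invented person-level records (using), so it is a many-to-one merge; `wage` and `male` are hypothetical variables.

```stata
* Person-level file: one row per person (values are invented)
clear
input person_id male
1 1
2 0
3 1
end
tempfile persons
save `persons'

* Person-quarter file: several rows per person (values are invented)
clear
input person_id quarter wage
1 1 10
1 2 11
2 1 9
4 1 12
end

* Many rows in master match one row in using
merge m:1 person_id using `persons'

* Persons 1 and 2 match (_merge == 3), person 4 appears only in
* the master data (1), and person 3 only in the using data (2)
tabulate _merge
list, sepby(person_id)
```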

5-C) Advanced Functions to Know: Global and Local
Allow you to create temporary (local) or persistent (global) macros.
Use
  Varied. Local macros are helpful for storing temporary values used in other calculations or loops. I personally use global macros to streamline my code so that I am not typing the same variable names over and over.
  Local macros may be referred to like so:
    local tempvar=1
    di `tempvar'
  Global macros may be referred to like so:
    global tempvar 1
    di $tempvar
Helpful Hint
  Multiple “for” loops in Stata can get kind of tricky, and Stata may not treat them properly, either (anecdotal evidence), so replacing a “forval” with a local macro can be a good solution.
  Use global macros and loops together! Need higher-order or logged terms for a bunch of continuous variables? Create a global macro and just reference that!
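As a rough sketch of the macro usage described above, using Stata's bundled auto data set: the macro names `cutoff` and `controls` and the `ln_` prefix are choices made for this example.

```stata
sysuse auto, clear

* Local macro: a temporary value, referenced with `name'
local cutoff = 20
count if mpg > `cutoff'

* Global macro: a reusable list of variable names, referenced with $name
global controls weight length
regress price mpg $controls

* Global macro plus loop: logged versions of every listed variable
foreach v of varlist $controls {
    gen ln_`v' = ln(`v')
}
summarize ln_weight ln_length
```

Run these lines together (for example, from one do-file): a local macro disappears once the do-file or program that defined it finishes, while a global macro stays defined for the rest of the session.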

5-D) Advanced Functions to Know: Foreach and Forvalues
Allow you to loop code across many variables or values (among other things).
Use
  If you need to recode numerical missing-value codes to Stata missing values across a number of variables, foreach is very helpful:
    foreach var in varnames {
        recode `var' (-9999 = .)
    }
  Or, say you want to rename a series of numbered variables:
    forvalues num=1/5 {
        rename var`num' newvar`num'
    }
Helpful Hint
  Use these as often as you can, but when you first begin using them, always double-check that they worked as intended.
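Both loops can be tested on a few invented rows, followed by the kind of check the hint recommends. The data values are made up; -9999 stands in for a survey missing code as on the slide.

```stata
* Toy data with -9999 as a numeric missing code (values are invented)
clear
input id var1 var2 var3
1 4     -9999 2
2 -9999 1     0
3 5     3     -9999
end

* Recode the missing code to Stata's system missing in each variable
foreach v of varlist var1-var3 {
    recode `v' (-9999 = .)
}

* Rename the numbered series
forvalues num = 1/3 {
    rename var`num' newvar`num'
}

* Check the result: each row should now have one missing value
count if missing(newvar1, newvar2, newvar3)
list
```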

5-E) Advanced Functions to Know: While
Pretty versatile. I use it for controlling loops, but it can also be used when the “by” prefix cannot be used (as is true in some egen functions).
Use
  Here is how I have used it in the past, with a foreach loop inside it:
    local i=99
    while `i'>=1 {
        foreach var in varnames {
            replace `var'=`i' if varname<=`i'
        }
        local i=`i'-1
    }
Helpful Hint
  Useful when you need multiple loops, and especially when forvalues misbehaves in the loop or you just cannot figure out how to get it to work.
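The counter pattern above can be sketched on its own, next to the forvalues loop it stands in for. The counter starts at 3 rather than 99 only to keep the output short.

```stata
* while loop with a manually decremented local macro
local i = 3
while `i' >= 1 {
    display "while: i = `i'"
    local i = `i' - 1
}

* The same countdown written with forvalues and a negative step
forvalues j = 3(-1)1 {
    display "forvalues: j = `j'"
}
```

The `local i = `i' - 1` line is what ends the while loop; leaving it out makes the loop run indefinitely.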

5-F) Advanced Functions to Know: Statsby
Allows you to create separate data sets from results stored in Stata memory.
Use
  I’ve used it to extract results out of Stata just about in their entirety and then export them to Excel.
  Say I suspect that my results would differ in male/female subsamples. I can estimate the model in each subsample and save the results to another data set:
    statsby _b _se, saving("data set name here", replace) by(male): regress depvar indepvars
  This takes all coefficients (_b) and standard errors (_se) from both models and stores them in a new data set.
Helpful Hint
  This can be pretty useful for obtaining summary statistics across subsamples in your data, especially if you have many of them. It saves a lot of copying and pasting if you can export the results right into Excel.
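The statsby-to-Excel workflow could look roughly like this, using Stata's bundled auto data set with `foreign` standing in for the male/female split. The output file name is a placeholder, and the `clear` option is used here instead of `saving()` so the results replace the data in memory.

```stata
sysuse auto, clear

* Run the regression separately for each value of foreign and
* replace the data in memory with the coefficients and standard errors
statsby _b _se, by(foreign) clear: regress price mpg weight

* One row per subsample, one column per coefficient or standard error
list

* Send the results to Excel (hypothetical file name)
export excel using "results_by_foreign.xlsx", firstrow(variables) replace
```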

6) Data Exercises with Advanced Functions
Or, the simple nature of things that should, by all means, be harder than they are.

7) Some Other Functions to Learn
This is a list not just for you, but for myself also (which is why I don’t go over them at all).
Else
  Helpful in loop commands
Outreg/Outreg2
  Creates output for you, almost like LaTeX
  Can set up multiple formats for tables that work for different journals
Project
  User-written command that allows you to link multiple do-files and data sets within a particular named project you are working on
  Appears to be very helpful with file pathing issues
Program
  The basic foundation of user-written programs; this function allows you to write your own Stata programs and test them out

8) Special Topics
Or, where you ask me questions I may or may not know the answer to.

9) El Fin
Or, where I go home and put my pajamas back on.
