Research Methods Lecture 3: More STATA. Ian Walker, Room S2.109, 02475. Slides available at:
Stat-Transfer
Use Stat-Transfer to convert data between file formats.
Stat-Transfer is "point and click": tell it the file name and format of the input and the format you want the output in, then click "Transfer".
Stat-Transfer options
Useful options for creating a manageable dataset from a large one:
–Keep or drop variables
–Change variable format, e.g. float to integer
–Select observations, e.g. "where (income + benefits)/famsize < 4500"
Can be used for reading a large STATA dataset and writing a smaller one, which avoids doing this in STATA itself.
Customising STATA
profile.do runs automatically when STATA starts. Edit it to include commands you want to invoke every time:
. set mem 200m
. log using justincase.log, replace
Define preferences for STATA's look and feel:
–Click on Prefs in the menu
–Colours, graph scheme, etc.
–Save window positioning
Merging data - 1
file1 has id x1 x2 x3; file2 has id x4 x5 x6. You can merge using a "key" variable that is in BOTH files (id), but you need to sort both files on it first.
. use file1
. sort id                 sorts file1 according to the id variable
. save, replace
. use file2
. sort id                 sorts file2 according to the id variable
. merge id using file1    matches rows on id (without the key variable, rows are paired by position)
. drop if _merge~=3       drops obs that are not in both files
. save file3
Merging data - 2
For each row (id), all vars in file1 are added to the corresponding row of file2 (if there is one).
. merge creates a new variable, _merge
–which =1 for those obs only in the file in memory (file2), =2 for those only in the using file (file1), and =3 for those in both.
So the syntax above drops those obs that don't have data in both files
–and saves the result, containing x1-x6, in file3.
Use . append to add more obs on the same vars.
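The merge steps above can be sketched in the newer merge syntax (Stata 11 onwards), which names the type of match (1:1 here) and does not need the files to be sorted first. The two small datasets and their values are made up for illustration, and the block is meant to be run as a single do-file so that the temporary file name survives.

```stata
* Illustrative data only: the values below are invented.
clear
input id x4 x5 x6
1 10 11 12
2 20 21 22
4 40 41 42
end
tempfile file2
save `file2'

clear
input id x1 x2 x3
1 1 2 3
2 4 5 6
3 7 8 9
end

* Master = data in memory (ids 1 2 3); using = `file2' (ids 1 2 4)
merge 1:1 id using `file2'
tabulate _merge          // 1 = master only, 2 = using only, 3 = in both
keep if _merge == 3      // keeps ids 1 and 2
drop _merge
assert _N == 2
```

Tabulating _merge before dropping anything is a cheap check that the match rate is what you expected.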
Collapsing data (use with care)
collapse converts the data in memory into a dataset of means (or sums, medians, etc.), replacing the original data.
This is useful when you want to provide summary info at a higher level of aggregation
–For example, suppose a dataset contains data on individuals, say their region and whether they are unemployed (u/e)
–To find the average u/e rate in each region:
. collapse unemp, by(region)    leaves 1 obs for each region, holding the mean u/e rate
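A minimal sketch of collapse using the auto dataset that ships with Stata, rather than the region data on the slide. Wrapping the call in preserve and restore is one way to take the "use with care" warning seriously, since collapse throws away the individual records.

```stata
sysuse auto, clear
preserve                           // keep a copy of the car-level data
collapse (mean) price mpg (count) n = rep78, by(foreign)
list
assert _N == 2                     // one row per value of foreign
restore                            // back to the original 74 cars
assert _N == 74
```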
Reshaping files
Data may be "long" but thin
–E.g. each record is a household member
–But there are few vars, say wage and hours
Data may be "wide" but short
–each record is a household and has lots of vars
–(e.g. w1 w2 w3 hours1 hours2 hours3)
. reshape long inc ue, i(id) j(year)    wide to long
. reshape wide inc ue, i(id) j(year)    back to wide
Handy for merging data together and for panel data
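A small worked example of the wide/long round trip, using the w1 w2 w3 hours1 hours2 hours3 layout mentioned above. The values and the name of the j() variable (member) are invented for illustration.

```stata
clear
input id w1 w2 w3 hours1 hours2 hours3
1 10 11 12 40 38 35
2 20 21 22 30 30 32
end

* wide -> long: the stubs are w and hours; j() receives the numeric suffix
reshape long w hours, i(id) j(member)
list, sepby(id)
assert _N == 6

* long -> wide again
reshape wide w hours, i(id) j(member)
assert _N == 2
```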
Syntax to remember
>= means "greater than or equal to", & means "and", | means "or"
= means "set equal to"
== means "is it equal to?"
~= means "not equal to" (or use !=)
. means missing value
For example
. keep if x1>=1 & x1<=3 | x1==7 & x1~=.
. gen x = log(y)
. reg y x if z==1 & y!=.
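Two details are easy to miss in conditions like the keep example above: & is evaluated before |, and a numeric missing value counts as larger than any number. The sketch below checks both on a five-row toy dataset (values invented).

```stata
clear
input x1
0
2
7
9
.
end

* & binds more tightly than |, so this reads (x1>=1 & x1<=3) | (x1==7 & x1~=.)
count if x1>=1 & x1<=3 | x1==7 & x1~=.
assert r(N) == 2                 // the rows with x1 = 2 and x1 = 7

* Missing (.) is treated as larger than any number
count if x1 > 5
assert r(N) == 3                 // 7, 9 and the missing value
count if x1 > 5 & x1 < .
assert r(N) == 2                 // 7 and 9 only
```

Adding parentheses costs nothing and makes the intended grouping explicit.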
Using STATA as a calculator
The display command (dis, disp and di are abbreviations):
–. dis 22/7
–. disp log(250)
–. di exp(3.6)
–. di chiprob(2,6.45)    (i.e. 2 df, deviance 6.45) returns a p-value of about 0.04 (i.e. it's significant at the 5% level)
–. display _N    returns the sample size (_N is the number of the last obs)
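The same calculations as a short do-file. chiprob() is an older function name; as far as I know chi2tail() is its current equivalent, so that is what the sketch uses.

```stata
display 22/7
display log(250)                 // natural log
display exp(3.6)
display chi2tail(2, 6.45)        // upper-tail chi-squared probability with 2 df
sysuse auto, clear
display _N                       // number of observations in memory (74 for auto.dta)
```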
Using the data editor
Open a datafile (e.g. auto.dta)
Click on the Data Editor icon, or type . edit
You can edit datapoints!
Or just browse the datafile
STATA's editor
STATA has an editor that allows you to create do files
–Enter commands, 1 per line
–Save the commands in a "do" file
–Highlight commands and click the button showing a page (or a page with text) and a down arrow to "run" (or "do") them
Saving output
You can scroll back through the Results window, but it is best to open a "log file" (and close it when you finish).
Click on File, Log, Begin. Or type
. log using myoutput
Then type some commands, and then
. log close
The log command allows the replace and append options
The default is a .smcl file extension (open it with "view")
It doesn't save graphs
–Copy graphs (use cut and paste or use the menus)
Saving commands
You might prefer your own extension, say, .log
–then you get an ASCII file that anything can edit
–you can translate files to and from smcl format: click on File, Log, Translate and fill in the dialog box
Logging your output is a good way of developing a .do file
–since it saves the commands as well as the output
Or you can just log the commands
–type . cmdlog using xxx
You can turn logging off and back on
–. log off, then . log on when ready to resume
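Put together, a logging session might look like the sketch below. The file name is the one used on the slide; the text option asks for a plain-text log rather than smcl, and the commands being logged are arbitrary examples on the auto dataset.

```stata
capture log close                        // close any log left open earlier
log using myoutput.log, replace text     // plain-text log, overwritten if it exists
sysuse auto, clear
summarize price mpg
log off                                  // pause logging
describe                                 // not recorded in the log
log on                                   // resume logging
regress price mpg
log close
```

Starting with capture log close avoids the error Stata raises when a log is already open.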
Useful tips for .do files
. #delimit ;                        /* makes ; the end-of-line character */
. set more off ;
. set mem 200m ;                    /* set memory before loading any data */
. set matsize 200 ;
. use mydata, clear ;
. log using xxxx.log, replace ;
...
. log close ;
. exit, clear ;
Handling string variables
encode
Use encode when the original var is a character (string) var (e.g. gender is "m" or "f").
The encode command does not produce dummy variables; it just assigns a number to each group defined by the character variable.
In this example, gender is the original character var and sex is the new numeric var:
. encode gender, gen(sex)
decode does the opposite
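A toy illustration of encode and decode (the four observations are invented). It also shows tabulate with gen(), which is one way to get the dummy variables that encode itself does not create.

```stata
clear
input str1 gender
"m"
"f"
"f"
"m"
end

encode gender, gen(sex)       // codes follow alphabetical order: f = 1, m = 2
list gender sex, nolabel
assert sex == 1 if gender == "f"
assert sex == 2 if gender == "m"

decode sex, gen(gender2)      // back to a string variable
assert gender2 == gender

tabulate gender, gen(g)       // dummies, if you want them: g1 (f) and g2 (m)
```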
Extended generate (. egen)
Useful when you need a new variable that is the mean, median, etc. of another variable
–for all observations or for groups of observations.
Also useful when you need to simply number groups of observations based on some classification variables.
Great when you have panel data
. egen examples
. egen sumvar1 = sum(var1)    creates sumvar1 as the sum of all values of var1
. egen meanvar1 = mean(var1), by(var3)    creates meanvar1 as the mean of var1 within each var3 group
. egen counter = count(id), by(company)    creates counter as the number of obs with nonmissing id in each company
. egen groupid = group(month year)    assigns a number to each month/year group
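The same four ideas on the auto dataset, as a hedged sketch. It uses total(), the current name for egen's sum(), and the bysort prefix in place of the by() option; the new variable names are arbitrary.

```stata
sysuse auto, clear
egen totprice = total(price)                   // one grand total, repeated on every row
bysort foreign: egen meanprice = mean(price)   // group mean, repeated within each group
bysort foreign: egen nrep = count(rep78)       // nonmissing rep78 values per group
egen groupid = group(foreign rep78)            // 1, 2, ... per combination; missing if either var is missing
list make foreign price meanprice nrep groupid in 1/5
```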
Saving typing - 1
macro
For defining lists of vars (globally or locally).
. local macvar x1 x2 x3 x4 x5 x6
. reg y1 `macvar'
. reg y2 `macvar'
. reg y3 `macvar'
. reg y4 `macvar'
macvar becomes the string "x1 x2 x3 x4 x5 x6"
Pay careful attention to the different types of quotation mark: a left quote (`) before the name and an apostrophe (') after it
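A runnable version of the macro idea on the auto dataset, showing both flavours. The macro names controls and outcomes are made up. A local macro exists only while the do-file (or program) that defined it is running, so run the block in one go.

```stata
sysuse auto, clear
local  controls weight length foreign     // local: referenced as `controls'
global outcomes price mpg                 // global: referenced as $outcomes

regress price `controls'
regress mpg   `controls'
display "`controls'"
display "$outcomes"
```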
Saving typing - 2
for
Performs the same command on several vars. It can use several types of variable lists
. for var1-var25:    (replaces 99 by missings)
. for var*:    (replaces 99 by missings)
. for 1-3, ltype(numeric): gen    (creates q1=0, q2=0, q3=0)
. for a b c, ltype(any): gen    (creates a=x, b=x, c=x)
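In more recent Stata releases the for command has been superseded by foreach and forvalues. The sketch below does the four jobs described above with those commands; the toy data are invented, and var1 stands in for the slide's x.

```stata
clear
input var1 var2 var3
1 99 3
99 5 6
7 8 99
end

* Replace 99 by missing in every variable matching var*
foreach v of varlist var* {
    replace `v' = . if `v' == 99
}
assert var1 != 99 & var2 != 99 & var3 != 99

* Create q1, q2, q3, all equal to 0
forvalues i = 1/3 {
    generate q`i' = 0
}

* Create a, b, c as copies of an existing variable (var1 here)
foreach name in a b c {
    generate `name' = var1
}
describe
```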
Regression models - I
Linear regression and related models, when the outcome variable is continuous
–OLS, 2SLS, 3SLS, IV, quantile reg, Box-Cox …
Binary outcome data
–the outcome variable is 0 or 1 (or y/n)
–probit, logit, nested logit ...
Multiple outcome data
–the outcome variable is 1, 2, ...
–conditional logit, ordered probit
Regression models - II
Count data
–the outcome variable is 0, 1, 2, ... occurrences
–Poisson regression, negative binomial
Choice models
–multinomial choice: A, B or C
–Multinomial logit, random utility model, unordered probit, nested logit, etc.
Selection models
–Truncated, censored
–Tobit, Heckman selection models; linear regression or probit with selection
Regression models - III
STATA supports several special data types. Once the type is defined, special commands work.
Time series
–Estimate ARIMA and ARCH models
–Estimators for autocorrelation and heteroscedasticity
–Estimate MA and other smoothers
–Tests for autocorrelation, heteroscedasticity and unit roots: h, d, LM, Q, ADF, P-P …
–TS graphs
Special data types: survey
Non-random sampling can make unweighted OLS estimates and their standard errors misleading
STATA can handle non-random survey data
–see the "svy***" commands
–Example (stratified sample of medical cases):
. webuse nhanes2f, clear
. svyset psuid [pweight=finalwgt], strata(stratid)
. svy: reg zinc age age2 weight female black orace rural
. reg zinc age age2 weight female black orace rural    (for comparison: ignores the survey design)
Special data types: duration
Survival time data
–See the "st***" commands
. stset failtime    /* sets the var that defines duration */
STATA estimates a wide variety of models to explain duration
Weibull example ….
ST regression supports Weibull, Cox PH and other options
. streg load bearings, distribution(weibull)
After streg you can plot the estimated cumulative hazard with
. stcurve, cumhaz
STATA allows functions to be plotted by specifying the function:
–E.g. Weibull "hazard" model
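A sketch of the whole sequence. It assumes the slide's variables (failtime, load, bearings) come from the kva example dataset used in the Stata survival manuals, which is a guess on my part. The last command plots a Weibull hazard of the form h(t) = p*lambda*t^(p-1) directly; p = 1.5 and lambda = 0.2 are made-up values chosen only to give a rising curve.

```stata
webuse kva, clear                        // assumed source of failtime, load, bearings
stset failtime                           // declare the duration variable
streg load bearings, distribution(weibull)
stcurve, hazard                          // fitted hazard at the covariate means
stcurve, cumhaz                          // fitted cumulative hazard

* Plot a Weibull hazard function directly (illustrative parameter values)
twoway function y = 1.5*0.2*x^(1.5-1), range(0 10) ///
    xtitle("t") ytitle("h(t)")
```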
Special data types: Panel data
STATA can handle "panel" data easily
–see the "xt***" commands
Common commands are
. xtdes     Describe pattern of xt data
. xtsum     Summarize xt data
. xttab     Tabulate xt data
. xtline    Line plots with xt data
. xtreg     Fixed and random effects
Panel data
An xt dataset looks like this:
pid yr_visit fev age sex height smokes
xt*** cmds need vars that identify the person and the "wave":
. iis pid
. tis yr_visit
Or use the tsset command
. tsset pid yr_visit, yearly
Panel regression
Once STATA has been told how to read the data, it can perform regressions quite quickly:
. xtreg y x, fe
. xtreg y x, re
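A hedged end-to-end panel example using the nlswork dataset from the Stata web repository rather than the pid/yr_visit data on the slides. It declares the panel with xtset, which in current releases replaces iis and tis, and the choice of regressors is arbitrary.

```stata
webuse nlswork, clear
xtset idcode year                 // person identifier and wave
xtdescribe                        // pattern of the panel
xtsum ln_wage                     // between and within variation
xtreg ln_wage age tenure, fe
estimates store fixed
xtreg ln_wage age tenure, re
estimates store random
hausman fixed random              // compare the two sets of estimates
```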
Further advice
See Stephen Jenkins' excellent course on duration modelling in STATA
Steve Pudney's excellent panel course
–Beware: his example dataset is 30mb+
To get up and running
–Just have a go - you won't break it!
–Try some of the commands in this lecture
To start to get proficient
–Sign up for netcourses