Adrián de la Garza Jeremy Green 27 March 2009


1 Intermediate STATA
Adrián de la Garza, Jeremy Green
27 March 2009

2 Getting Help
STATA Help: Just type help in the STATA main Command window.
STATA listserv:
UCLA Stat Computing:
Yale StatLab Consultants, online help and FAQs:
Manuals are also available at SSL and Yale StatLab.
0. Introduction

3 Today’s Workshop
1. Programming/Project Management Tips
2. Data Management
3. Analyzing Data
   - Graphs
   - Statistical Analysis
Latest version: STATA v. 10. Commands throughout this presentation refer to this version, although most are backwards-compatible.

4 Using DO files (1/2)
DO files let you run a whole program at once or run only selected portions of it.
AVOID making changes to your original data interactively using the STATA command window. Use DO files instead.
Use DO files to make changes to your data and to run your statistical and graphical analyses.
Keep track of your progress.
1. Programming/Project Management Tips

5 Using DO files (2/2)
Keep your DO files organized: it helps to create a main DO file from which you run other DO files that perform smaller tasks on your data.
Write lots of comments in your DO file to record what each command or section does. This will help you remember what you did months ago.
To open a DO file, use the FILE menu or the DO-file button.

6 Log files
Syntax
Open log file:
    log using filename [, append replace [text|smcl] name(logname)]
Close log, temporarily suspend logging, or resume logging:
    log {close|off|on} [logname]
Examples
    . log using mylog
    . log close
    . log using mylog, append
    . log using "filename containing spaces"
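The DO-file and log-file advice above can be combined in a single skeleton. This is only a minimal sketch: the file name mylog and the choice of the replace and text options are assumptions for illustration, not part of the original slides.

```stata
* Hypothetical do-file skeleton: open a log, do the work, close the log.
version 10                      // interpret commands as STATA 10 would
clear
capture log close               // close any log left open by an earlier run
log using mylog, replace text   // assumed file name; text gives a plain-text log

sysuse census                   // example dataset shipped with STATA
summarize pop                   // everything from here on is recorded in the log

log close
```

Starting with capture log close means the file can be rerun after an error without first closing the old log by hand.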

7 Managing Your Data
Back up all Master Data Files
- CD, USB drive, network
Keep a detailed codebook
- Describes each variable and values
- Adding variables, cases, computing new variables
Keep a roadmap
- Keep a log of all analyses with what you have done
- Save syntax files
2. Data Management

8 Inspecting Your Data (1/3)
    cd "C:\Documents and Settings\Adrian\My Documents\stata files"
    clear
    set mem 80m
    log using "C:\Documents and Settings\Adrian\My Documents\stata files\logs\mylog"
    sysuse census
    browse
    list state region pop if _n <= 3       /* shows first 3 obs */
    l state region pop if _N - _n <= 2     /* shows last 3 obs */
    l state region pop in 1/3              /* shows first 3 obs */
    l state region pop in -3/l             /* shows last 3 obs */

9 Inspecting Your Data (2/3)
    generate agesq = medage^2       /* creates variable equal to medage squared */
    sum pop                         /* shows summary stats for pop */
    scalar popmean = r(mean)        /* saves mean of pop to scalar popmean */
    /* create variable equal to 1 when pop > popmean and 0 otherwise */
    g dummy = 0
    replace dummy = 1 if pop > popmean
    /* how many states have population higher than average? */
    count if dummy == 1
    /* how many states NOT IN THE SOUTH have pop > popmean? */
    count if dummy == 1 & region != 3

10 Inspecting Your Data (3/3)
    describe
    label list                      /* shows all labels attached to dataset */
    label list cenreg               /* shows label cenreg attached to variable region */
    sum pop
    browse
    /* summarize population by region */
    sum pop if region == "NE"       /* this gives an error since region is not a string */
    sum pop if region == 1          /* this does work */

11 Calculate mean population by region
Method 1
    sum pop if region == 1
    sum pop if region == 2
    sum pop if region == 3
    sum pop if region == 4
Downside: We have to type the sum command for each individual region. If the dataset contained population data by city and we had to compute means for each of the 50 states, typing the sum command 50 times would be very painful!

12 Calculate mean population by region
Method 2
    bysort region: sum pop
Downside: This method shows the population means by region, as we wanted, but it also shows a bunch of other stats we may not care about. Also, the means are held in memory only temporarily: each region's results overwrite the previous ones, so they are not readily available if we want to use them in further calculations.

13 Calculate mean population by region
Method 3
    table region, c(m pop)
This method is great for presentation purposes: it shows exactly the information we want.
Downside: The information is still not readily available if we want to store the population means by region for further analyses.

14 Calculate mean population by region
Method 4
    sysuse census, clear
    collapse (mean) pop, by(region)
The collapse command converts the dataset in memory into a set of means, standard deviations, or other summary stats. In our case, the new dataset now contains population means by region.
Downside: All variables other than the collapsed variable (pop) and the grouping variable (region) are NOT collapsed and hence disappear from the dataset. Can we do any further analyses without the rest of the variables?
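One possible way around the downside of Method 4 is to wrap collapse in preserve and restore, so the full dataset comes back once the means have been inspected. This is a sketch of one option, not necessarily what the presenters had in mind; the variable name meanpop is an assumption for illustration.

```stata
sysuse census, clear
preserve                                    // keep a temporary copy of the full dataset
collapse (mean) meanpop = pop, by(region)   // data in memory: one row per region
list                                        // inspect the regional means
restore                                     // bring back all 50 states and all variables
describe, short                             // confirm the original dataset is back
```

The means are still only displayed, not stored, so this helps when the goal is to look at them without losing the other variables.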

15 Calculate mean population by region
Method 5
    sysuse census, clear
    by region, sort: egen meanpop = mean(pop)
Downside: Do we really want an additional variable in the dataset that contains information on population means by region, a number that is repeated for each observation (state) within the same region? In very large datasets, one additional variable may lead to memory constraints. Use scalars?
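The closing question, "Use scalars?", could be answered along the following lines. This is a minimal sketch assuming the regions are coded 1 to 4, as in the census example; the scalar names meanpop1 to meanpop4 are hypothetical.

```stata
sysuse census, clear
* Loop over the four region codes and store each mean in its own scalar.
forvalues r = 1/4 {
    quietly summarize pop if region == `r'
    scalar meanpop`r' = r(mean)      // copy the result before the next summarize overwrites r()
}
scalar list                          // the four regional means, available for later calculations
```

Scalars take up almost no memory and add no variable to the dataset, which addresses the concern raised for Method 5.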

16 Reshaping Data
    sysuse bplong, clear
    br
Suppose we want to take the difference in bp before and after treatment. This is difficult to calculate if the data are organized in long format, so we need to convert to wide format.
    reshape wide bp, i(patient sex agegrp) j(when)
    g bpdiff = bp2 - bp1
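The round trip between long and wide format can be sketched as follows. Using i(patient) alone is an assumption that sex and agegrp are constant within each patient, which holds for the bplong example data.

```stata
sysuse bplong, clear
reshape wide bp, i(patient) j(when)   // one row per patient: bp1 (before), bp2 (after)
generate bpdiff = bp2 - bp1           // change in blood pressure
summarize bpdiff
reshape long bp, i(patient) j(when)   // back to two rows per patient; bpdiff is carried along
```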

17 Value Labels (1/2)
    g gender = sex
    br
Why do gender and sex look different? -> Value labels
Why use value labels?
* They save space (e.g., “0” instead of “male” for each obs.)
* More informative to the researcher (e.g., what region is 3?)
* Regression, lists, tables… display labels instead of values
    table sex, c(m bp1 m bp2)
    table gender, c(m bp1 m bp2)

18 Value Labels (2/2)
    label value gender sex      /* note that sex refers to label, not var */
    br patient sex gender
    label value gender          /* detaches sex label from gender variable */
    br pat sex gend
    label define genderlbl 0 "man" 1 "woman"
    label value gender genderlbl
What do the following commands do?
    label define genderlbl 2 "na", add
    label define genderlbl 0 "Man" 1 "Woman" 2 "NA", modify

19 Dummy Variables (1/3)
Suppose we want to create dummy vars for each of the 4 regions in the census database:
    g dum1 = 0
    replace dum1 = 1 if region == 1
What problems may these commands lead to?

20 Dummy Variables (2/3)
To create four dummies, we need to type those two commands four times. More importantly, the previous method generates 0s even when we have missing values.
    tab region, g(d)
This second method tabulates the variable region, showing a list of the four regions, and correctly creates 4 separate dummies, accounting for missing values.
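If a single dummy is needed, the missing-value problem can also be avoided by hand. The sketch below is one possible alternative, not the presenters' method: the if !missing(region) qualifier leaves dum1 missing wherever region is missing, and the assert checks that the result matches the first dummy created by tab.

```stata
sysuse census, clear
* 1 if region == 1, 0 for other regions, missing if region itself is missing
generate byte dum1 = (region == 1) if !missing(region)
tabulate region, generate(d)        // creates d1 to d4, as on the slide
assert dum1 == d1                   // both approaches agree on this dataset
```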

21 Dummy Variables (3/3)
One more command that will be useful in regressions:
    xi i.region, noomit
This third alternative yields the same results as the tab method described in the previous slide.

22 Merging Data (1/4)
    sysuse census, clear
    keep state-popurban
    sort state                    /* both master and using data must be sorted */
    save census1, replace
    sysuse census, clear          /* reload: medage-divorce were dropped by the keep above */
    keep state region medage-divorce    /* note region is kept in both */
    sort state
    save census2, replace
    use census1, clear
    merge state using census2     /* remember: both files must be sorted */
    table _merge                  /* _merge keeps track of how good merge was */
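The slides use the STATA 10 merge syntax. In STATA 11 and later, merge takes an explicit match type and no longer requires the files to be sorted beforehand. The sketch below shows that newer form as an aside for readers on a later version; it does not run in version 10, and it reuses the file names census1 and census2 from the slide.

```stata
* Requires STATA 11 or later.
sysuse census, clear
keep state-popurban
save census1, replace
sysuse census, clear
keep state medage-divorce
save census2, replace

use census1, clear
merge 1:1 state using census2   // 1:1 = state uniquely identifies rows in both files
tabulate _merge                 // 3 = matched in both master and using
```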

23 Merging Data (2/4)
Important!!! If a non-merging variable (e.g. region) is in both files, the data in the master file will be kept, while the data in the using file will be lost.
    use census1, clear
    l state region in 1
    replace region = 2 in 1
    sort state
    merge state using census2
    table _merge
    l state region in 1           /* region data in master file is kept */

24 Merging Data (3/4)
Now suppose that each of the two databases contains information about only SOME (non-overlapping) of the 50 states. Do we lose information after merging the two datasets?
    use census2, clear
    drop in 3/6
    sort state
    save, replace
    use census1, clear
    drop in 22/23
    merge state using census2
    table _merge

25 Merging Data (4/4)
Finally, it’s important to note that when a variable has value labels attached in both datasets, the labels in the master dataset prevail. This may cause serious trouble, for example, when we merge datasets from surveys taken in different years in which the same answer codes mean different things.
Example 1: Change in scale (1 to 4 in 1980; 1 to 5 in 1990).
Example 2: A country is omitted in the second survey, but all countries, sorted in alphabetical order, are assigned consecutive values.

26 Other Data Management Issues
Use Stat/Transfer software to convert Excel, SAS, SPSS, … files into STATA format.
Use the compress command to make your dataset as small as possible and use less memory.
Some very large datasets won’t open in STATA due to STATA’s memory limitations. In this case, it is recommended that you open a subset of the dataset, delete variables/observations that don’t interest you and try again:
    use varlist using filename
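Loading only part of a file and then compressing it might look like the sketch below. The file name census_full is hypothetical and is created here only so the example is self-contained.

```stata
sysuse census, clear
save census_full, replace                      // stand-in for a large file on disk

* Load only three variables, and only the observations for region 3.
use state region pop if region == 3 using census_full, clear
compress                                       // store each variable in the smallest type that fits
describe
```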

27 Analyzing Data: Make a List
Dependent Variable(s) (response, outcome, criterion)
Independent Variables (explanatory or predictor variables)
Treatment Variable
Covariates / Confounding Variables
Categorical and Continuous Variables
Remember: Types of variables determine the statistics we use
Time period
Scope and type of analysis
3. Analyzing Data

28 Analyzing Data: Graphs (1/2)
Draw a histogram:
    sysuse auto, clear
    histogram price
Create a scatter plot:
    scatter price mpg
Draw line of best fit (linear regression):
    twoway lfit price mpg
Put two graphs together:
    twoway scatter price mpg || lfit price mpg

29 Analyzing Data: Graphs (2/2)
Type help graph to:
* create other graphs (pie and bar charts, box plots, etc.);
* adjust graph settings (change labels, axes, colors…)
An easier (although less customizable) option is to use the GRAPH menu.
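As an illustration of adjusting graph settings from the command line, the sketch below adds a title, axis titles, and legend labels to the combined scatter and fit plot. The title and label texts are assumptions chosen for this example.

```stata
sysuse auto, clear
twoway (scatter price mpg) (lfit price mpg), ///
    title("Price versus mileage") ///
    xtitle("Mileage (mpg)") ytitle("Price (USD)") ///
    legend(order(1 "Observed" 2 "Linear fit"))
```

The /// at the end of a line continues the command on the next line, so this form works in a DO file rather than in the Command window.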

30 Analyzing Data: Statistical Analysis (1/2)
Correlation: quantify relationships between variables
Regression: predict dependent variable from independent variable(s)
Group differences: t-test & ANOVA
Chi-square for categorical and frequency data
Significance v. effect size
More Complex Models
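The group-difference and chi-square items in the list above correspond to the commands sketched below. The choice of variables from the auto dataset is an assumption for illustration only.

```stata
sysuse auto, clear
ttest price, by(foreign)          // two-sample t-test: mean price, domestic vs. foreign
oneway price rep78, tabulate      // one-way ANOVA: price across repair-record groups
tabulate foreign rep78, chi2      // chi-square test of independence for two categorical variables
```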

31 Analyzing Data: Statistical Analysis (2/2)
cor var1 var2 gives the basic (Pearson) correlation between two variables.
    cor price mpg
regress var1 var2 fits a linear regression of var1 on var2; the coefficient on var2 is the estimated change in var1 for a one-unit change in var2.
    reg price mpg
Useful textbook for more on stats for social sciences: Agresti, Alan, and Barbara Finlay (2008): Statistical Methods for the Social Sciences, Prentice Hall, 4th edition.
Textbook examples with STATA:

32 Thank you!!

