Intermediate STATA. Adrián de la Garza, Jeremy Green. 27 March 2009.


1 Intermediate STATA. Adrián de la Garza, Jeremy Green. 27 March 2009.

2 Getting Help
* STATA Help: just type help in the main STATA Command window.
* STATA listserv
* UCLA Stat Computing
* Yale StatLab Consultants, online help and FAQs
* Manuals are also available at SSL and Yale StatLab.
0. Introduction

3 Today's Workshop
1. Programming/Project Management Tips
2. Data Management
3. Analyzing Data
   - Graphs
   - Statistical Analysis
Latest version: STATA v. 10. Commands throughout this presentation refer to this version, although most are backwards-compatible.
0. Introduction

4 Using DO files (1/2)
* DO files allow you to run a whole program interactively; you can run it all at once or select portions of the program.
* AVOID making changes to your original data interactively using the STATA command window. Use DO files instead.
* Use DO files to make changes to your data and to run your statistical and graphical analyses. Keep track of your progress.
1. Programming/Project Management Tips

5 Using DO files (2/2)
* Keep your DO files organized: it helps to create a main DO file from which you run other DO files that perform smaller tasks on your data.
* Write lots of comments in your DO file to record what each command or section does. This will help you remember what you did months ago.
* To open a DO file, use the FILE menu or the DO-file button.
1. Programming/Project Management Tips
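The "main DO file" idea can be sketched as follows. This is only an illustration: the file names are hypothetical and the split into three steps is an assumption, not something prescribed by the slides.

```stata
* master.do: runs the whole project from start to finish
* (all file names below are hypothetical)
version 10
clear
set more off

do "01_clean_data.do"     // builds the analysis dataset from the raw files
do "02_descriptives.do"   // summary tables and graphs
do "03_regressions.do"    // statistical models
```

Running `do master.do` then reproduces every step in order, and each smaller DO file can still be run on its own while you work on it.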

6 Log files
Syntax
Open log file:
    log using filename [, append replace [text|smcl] name(logname)]
Close log, temporarily suspend logging, or resume logging:
    log {close|off|on} [logname]
Examples
    . log using mylog
    . log close
    . log using mylog, append
    . log close
    . log using "filename containing spaces"
1. Programming/Project Management Tips
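Inside a DO file, logging is often wrapped around the commands as in the sketch below. The log name `mylog` comes from the slide; the `capture` line and the choice of the `replace text` options are assumptions about what is convenient, not part of the original examples.

```stata
capture log close              // close any log left open; ignore the error if none is open
log using mylog, replace text  // start a fresh plain-text log (mylog.log)

sysuse census, clear
summarize pop

log close
```

With `replace`, each run overwrites the previous log; use `append` instead to keep a running record across sessions.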

7 Managing Your Data
* Back up all master data files (CD, USB drive, network).
* Keep a detailed codebook: describe each variable and its values; record added variables, added cases, and newly computed variables.
* Keep a roadmap: a log of all analyses with what you have done.
* Save syntax files.
2. Data Management

8 Inspecting Your Data (1/3)
    cd "C:\Documents and Settings\Adrian\My Documents\stata files"
    clear
    set mem 80m
    log using "C:\Documents and Settings\Adrian\My Documents\stata files\logs\mylog"
    sysuse census
    browse
    list state region pop if _n <= 3       /* shows first 3 obs */
    l state region pop if _N - _n <= 2     /* shows last 3 obs */
    l state region pop in 1/3              /* shows first 3 obs */
    l state region pop in -3/l             /* shows last 3 obs */
2. Data Management

9 Inspecting Your Data (2/3)
    generate agesq = medage^2     /* creates variable equal to medage squared */
    sum pop                       /* shows summary stats for pop */
    scalar popmean = r(mean)      /* saves mean of pop to scalar popmean */
    /* create variable equal to 1 when pop > popmean and 0 otherwise */
    g dummy = 0
    replace dummy = 1 if pop > popmean
    /* how many states have population higher than average? */
    count if dummy == 1
    /* how many states NOT IN THE SOUTH have pop > popmean? */
    count if dummy == 1 & region != 3
2. Data Management

10 Inspecting Your Data (3/3)
    describe
    label list                    /* shows all labels attached to dataset */
    label list cenreg             /* shows label cenreg attached to variable region */
    sum pop
    browse
    /* summarize population by region */
    sum pop if region == "NE"     /* this gives an error since region is not a string */
    sum pop if region == 1        /* this does work */
2. Data Management

11 Calculate mean population by region
Method 1
    sum pop if region == 1
    sum pop if region == 2
    sum pop if region == 3
    sum pop if region == 4
Downside: We have to type the sum command for each individual region. If the dataset contained population data by city and we had to compute means for each of the 50 states, typing the sum command 50 times would be very painful!
2. Data Management

12 Calculate mean population by region
Method 2
    bysort region: sum pop
Downside: This method shows the population means by region, as we wanted, but it also shows several other statistics we may not care about. Also, the means are not readily available for further calculations: once the command finishes, r(mean) holds only the mean of the last region.
2. Data Management

13 Calculate mean population by region
Method 3
    table region, c(m pop)
Downside: This method is great for presentation purposes: it shows exactly the information we want. One problem, however, is that the means are still not stored anywhere, so they are not available for further analyses.
2. Data Management

14 Calculate mean population by region
Method 4
    sysuse census, clear
    collapse (mean) pop, by(region)
Downside: The collapse command replaces the dataset in memory with a dataset of means, standard deviations, or other summary statistics. In our case, the new dataset contains one observation per region with the mean population. All variables other than the collapsed variable (pop) and the grouping variable (region) are NOT collapsed and hence disappear from the dataset. Can we run any further analyses without the rest of the variables?
2. Data Management

15 Calculate mean population by region
Method 5
    sysuse census, clear
    by region, sort: egen meanpop = mean(pop)
Downside: Do we really want an additional variable in the dataset that contains the population mean by region, a number that is repeated for each observation (state) within the same region? In very large datasets, one additional variable may lead to memory constraints. Use scalars?
2. Data Management
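One way to follow up on the "Use scalars?" question is to loop over the regions and store each mean in its own scalar. This is a minimal sketch, assuming region is coded 1 to 4 as in the census example; the scalar names `meanpop1` to `meanpop4` are made up for illustration.

```stata
sysuse census, clear

* store the mean population of each region in its own scalar
forvalues r = 1/4 {
    quietly summarize pop if region == `r'
    scalar meanpop`r' = r(mean)
}

scalar list                                              // show all stored scalars
display "Mean population, region 1: " scalar(meanpop1)
```

The scalars stay in memory after the loop, so they can be used in later expressions without adding a variable to the dataset.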

16 Reshaping Data
    sysuse bplong, clear
    br
Suppose we want to take the difference in bp before and after treatment. This is difficult to calculate when the data are organized in long format, so we first convert to wide format.
    reshape wide bp, i(patient sex agegrp) j(when)
    br
    g bpdiff = bp2 - bp1
2. Data Management
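The round trip can be sketched as below: reshape to wide, compute the difference, then return to long format. Going back to long is an addition for illustration and is not part of the original slide.

```stata
sysuse bplong, clear

* long -> wide: one row per patient, with bp1 (before) and bp2 (after)
reshape wide bp, i(patient sex agegrp) j(when)
generate bpdiff = bp2 - bp1
summarize bpdiff

* wide -> long: with no arguments, reshape long reverses the previous reshape
reshape long
describe
```

After returning to long format, bpdiff is simply repeated on both rows of each patient.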

17 Value Labels (1/2)
    g gender = sex
    br
Why do gender and sex look different? Because of value labels.
Why use value labels?
* They save space (e.g., "0" instead of "male" for each obs.)
* More informative to the researcher (e.g., what region is 3?)
* Regressions, lists, tables… display labels instead of values
    table sex, c(m bp1 m bp2)
    table gender, c(m bp1 m bp2)
2. Data Management

18 Value Labels (2/2)
    label value gender sex        /* note that sex refers to label, not var */
    br patient sex gender
    label value gender            /* detaches sex label from gender variable */
    br pat sex gend
    label define genderlbl 0 "man" 1 "woman"
    label value gender genderlbl
    br pat sex gend
What do the following commands do?
    label define genderlbl 2 "na", add
    label define genderlbl 0 "Man" 1 "Woman" 2 "NA", modify
2. Data Management

19 Dummy Variables (1/3)
Suppose we want to create dummy vars for each of the 4 regions in the census database:
    g dum1 = 0
    replace dum1 = 1 if region == 1
    …
What problems may these commands lead to?
2. Data Management

20 Dummy Variables (2/3)
* To create four dummies, we need to type those two commands four times.
* More importantly, the previous method generates 0s even when we have missing values.
    tab region, g(d)
This second method tabulates the variable region, showing a list of the four regions, and creates 4 separate dummies that are left missing wherever region is missing.
2. Data Management
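If only one dummy is needed, the missing-value problem can also be avoided by hand. The one-line form below is a sketch of an alternative that the slides do not show; the variable name `dum1` follows the earlier slide.

```stata
sysuse census, clear

* 1 if region is 1, 0 if it is another region, missing if region is missing
generate byte dum1 = (region == 1) if !missing(region)

tabulate dum1, missing    // the missing option would reveal any missing values
```

In the census data region has no missing values, so the result here matches the two-command version; the difference only shows up in datasets with gaps.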

21 Dummy Variables (3/3)
One more command that will be useful in regressions:
    xi i.region, noomit
This third alternative yields the same results as the tab method described in the previous slide.
2. Data Management

22 Merging Data (1/4)
    sysuse census, clear
    keep state-popurban
    sort state                          /* both master and using data must be sorted */
    save census1, replace
    sysuse census, clear
    keep state region medage-divorce    /* note region is kept in both */
    sort state
    save census2, replace
    use census1, clear
    merge state using census2           /* remember: both files must be sorted */
    table _merge                        /* _merge keeps track of how good merge was */
2. Data Management
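The slides use the Stata 10 merge syntax. For readers on Stata 11 or later, the same kind of match merge is usually written with `merge 1:1`, which does not require the files to be sorted first. The sketch below assumes a newer Stata than the one used in the workshop, and the file names `census_a` and `census_b` are hypothetical.

```stata
* requires Stata 11 or later
sysuse census, clear
keep state pop
save census_a, replace

sysuse census, clear
keep state medage
save census_b, replace

use census_a, clear
merge 1:1 state using census_b    // one observation per state in each file
tabulate _merge                   // 3 = matched in both files
```

The 1:1 prefix also makes Stata stop with an error if state does not uniquely identify observations in either file.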

23 Merging Data (2/4)
Important! If a non-merging variable (e.g. region) is in both files, the data in the master file will be kept, while the data in the using file will be lost.
    use census1, clear
    l state region in 1
    replace region = 2 in 1
    sort state
    merge state using census2
    table _merge
    l state region in 1           /* region data in master file is kept */
2. Data Management

24 Merging Data (3/4)
Now suppose that each of the two databases contains information about only SOME (non-overlapping) of the 50 states. Do we lose information after merging the two datasets?
    use census2, clear
    drop in 3/6
    sort state
    save, replace
    use census1, clear
    drop in 22/23
    sort state
    merge state using census2
    table _merge
2. Data Management

25 Merging Data (4/4)
Finally, it's important to note that when a variable has value labels attached in both datasets, the labels in the master dataset prevail. This may cause serious trouble, for example, when we merge datasets from surveys taken in different years in which the same answer codes mean different things.
Example 1: Change in scale (1 to 4 in 1980; 1 to 5 in 1990).
Example 2: A country is omitted in the second survey, but all countries, sorted in alphabetical order, are assigned consecutive values.
2. Data Management

26 Other Data Management Issues
* Use StatTransfer software to convert Excel, SAS, SPSS, … files into STATA.
* Use the compress command to make your dataset as small as possible and use less memory.
* Some very large datasets won't open in STATA due to STATA's memory limitations. In this case, it is recommended that you open a subset of the dataset, delete variables/observations that don't interest you and try again:
    use varlist using filename
2. Data Management
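Loading only part of a file and then compressing it can be sketched as follows. The file name `census_full` is hypothetical, and the census data stand in for a dataset that would really be too large to open in full.

```stata
sysuse census, clear
save census_full, replace     // hypothetical file standing in for a large dataset

* load three variables, and only the observations from region 3
use state region pop if region == 3 using census_full, clear

compress                      // store each variable in the smallest type that fits
describe
```

Note that the if condition goes before using, and that it may refer to variables in the file even though the file has not been loaded yet.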

27 Analyzing Data: Make a List
* Dependent variable(s) (response, outcome, criterion)
* Independent variables (explanatory or predictor variables)
* Treatment variable
* Covariates / confounding variables
* Categorical and continuous variables. Remember: types of variables determine the statistics we use.
* Time period
* Scope and type of analysis
3. Analyzing Data

28 Analyzing Data: Graphs (1/2)
Draw a histogram:
    sysuse auto, clear
    histogram price
Create a scatter plot:
    scatter price mpg
Draw line of best fit (linear regression):
    twoway lfit price mpg
Put two graphs together:
    twoway scatter price mpg || lfit price mpg
3. Analyzing Data

29 Analyzing Data: Graphs (2/2)
Type help graphs to:
* create other graphs (pie and bar charts, box plots, etc.);
* adjust graph settings (change labels, axes, colors…)
An easier (although less customizable) option is to use the GRAPH menu.
3. Analyzing Data
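As an illustration of adjusting graph settings from the command line, the sketch below adds a title, axis titles and legend text to the combined scatter and fitted-line graph. The wording of the titles is made up for the example.

```stata
sysuse auto, clear

twoway (scatter price mpg) (lfit price mpg), ///
    title("Price and mileage")               ///
    xtitle("Mileage (mpg)")                  ///
    ytitle("Price (USD)")                    ///
    legend(order(1 "Observed" 2 "Linear fit"))
```

The `///` at the end of a line continues the command on the next line, which keeps long graph commands readable in a DO file.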

30 Analyzing Data: Statistical Analysis (1/2)
* Correlation: quantify relationships between variables
* Regression: predict dependent variable from independent variable(s)
* Group differences: t-test & ANOVA
* Chi-square for categorical and frequency data
* Significance v. effect size
* More complex models
3. Analyzing Data

31 Analyzing Data: Statistical Analysis (2/2)
cor var1 var2 gives the basic (Pearson) correlation between two variables.
    cor price mpg
regress var1 var2 regresses var1 on var2, i.e., it estimates the linear effect of var2 on var1.
    reg price mpg
Useful textbook for more on stats for social sciences: Agresti, Alan, and Barbara Finlay (2008): Statistical Methods for the Social Sciences, Prentice Hall, 4th edition.
Textbook examples with STATA:
3. Analyzing Data
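The other analyses listed on the previous slide can be tried on the same auto dataset. The commands below are a minimal sketch; the choice of variables is for illustration only and is not taken from the slides.

```stata
sysuse auto, clear

pwcorr price mpg, sig           // correlation together with its p-value
regress price mpg weight        // regression with two predictors
display _b[mpg]                 // estimated coefficient on mpg from the last regression

ttest price, by(foreign)        // compare mean price of domestic and foreign cars
oneway price rep78              // one-way ANOVA of price across repair-record groups
tabulate foreign rep78, chi2    // chi-square test for two categorical variables
```

Each command leaves its results in memory (type `return list` or `ereturn list` right after it), so estimates can be reused in later calculations.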

32 Thank you!!

