1
STATA Tutorial
Jill Furzer, Institute of Health Policy, Management, and Evaluation; Canadian Centre for Health Economics. September 29, 2017
2
Outline
- Why use STATA?
- Reading/Cleaning data
- Regression Analysis
- Post-estimation Diagnostic Checks
- Advanced Topics in STATA
- STATA Resources
3
Why use STATA?
Easy to work with:
- Interactive
- Menu driven
- Prior programming experience not required (but can be helpful)
- Smooth learning curve
4
Software Learning Curves
Source:
5
Why use STATA? (continued)
- Strong data set management tool
6
Review: Data Types
- Cross-sectional: a collection of observations in one time period (micro-data, surveys of persons, countries, etc.).
- Time series: many points in time, but for one individual entity. Usually in aggregated form, like rates or percentages over time.
- Panel: cross-sectional plus time-series data, e.g. a survey of the same individuals over many years.
STATA is particularly useful for panel data, with a wide range of features to handle the problems faced with this kind of data.
7
Reading and Cleaning data
8
First Steps
Step 0: Double-click on the Stata icon.
Stata is case sensitive, so be careful. For example:
regress y x    Success! (if everything else is right)
Regress y x    error message
Hint: commands show up as blue
Hint: errors show up as red
9
Stata's main windows: Variables Window, Review Window, Results Window, Command Window.
The windows can be arranged based on preferences.
10
Step 1: Start a Log File
From the menu: File > Log > Begin
Stata will prompt you to name the file. Pick a creative name (e.g. logfile1), then click OK.
Or by code: log using "/Users/jillfurzer/Documents/STATAtutoriallog.smcl"
This saves the log as a .smcl file.
Stata will now record everything you do (importing data, running commands, etc.). It will even store your regression output, which is handy to look at later.
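A minimal sketch of the code route, under some assumptions: the file name tutorial_log is a placeholder, and the built-in auto dataset stands in for your own data.

```stata
* Close any log left open earlier (capture suppresses the error if none is open)
capture log close
* Start recording; replace overwrites an existing file of the same name
log using "tutorial_log.smcl", replace
sysuse auto, clear
summarize price
* Stop recording
log close
* Optional: convert the SMCL log to a plain-text log
translate "tutorial_log.smcl" "tutorial_log.log", replace
```

Closing the log explicitly matters: until log close runs, the file may not contain the most recent output.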
11
Step 2: Import Data
From the menu: File > Import, then choose the appropriate option:
- .csv (comma separated) is a common option
- .xls (Microsoft Excel format) and other formats are compatible too
12
Example: Importing a .xls File
Make sure “Import first row as variable names” is checked, then click ok
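The same imports can be done by code. The sketch below is only an illustration: it first writes two demo files (auto_demo.csv and auto_demo.xlsx, both placeholder names) from the built-in auto dataset so that it runs on its own, and it assumes a Stata version that has import delimited (13 or later).

```stata
* Build small demo files from the built-in auto dataset
sysuse auto, clear
export delimited using "auto_demo.csv", replace
export excel using "auto_demo.xlsx", firstrow(variables) replace

* Read the CSV; varnames(1) takes variable names from the first row
import delimited "auto_demo.csv", varnames(1) clear
describe

* Read the Excel file; firstrow is the code equivalent of ticking
* "Import first row as variable names"
import excel "auto_demo.xlsx", firstrow clear
describe
```

import excel reads both .xls and .xlsx files, so the same command applies to the .xls example on the slide.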
14
Step 3: Look at your Data
Type describe to obtain some useful information about your dataset.
String refers to data that contain non-numeric characters (like saying male or female instead of 0/1). You will need to destring such data to convert them to numeric before you can run calculations on them.
To look at your data, type browse
15
In the data browser:
- Black text is for numeric variables
- Blue text is for labeled numeric variables, so Gender is stored as 0/1 but displayed as male/female for descriptive purposes
- Red text is for character variables (called string variables in Stata)
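A short sketch of these inspection commands, assuming the built-in auto dataset as a stand-in for your own data. In it, make is a string variable and foreign is a labeled numeric variable.

```stata
sysuse auto, clear
* Names, storage types, and labels of all variables
describe
* foreign is stored as 0/1 but displayed as Domestic/Foreign
codebook foreign
* Print the first five observations in the Results window
list make price foreign in 1/5
* Open the Data Browser (interactive use only)
* browse
```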
16
Step 4: Clean Data
Example: Convert a character variable to numeric
Make use of Stata's destring command:
destring [varlist] , {generate(newvarlist)|replace} [destring_options]
E.g.: destring Age, replace ignore(NA)
This replaces the original string data with the destringed data. The ignore() option strips the listed characters before converting, so a value that contains only NA becomes missing.
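To see the conversion at work, here is a small sketch with three made-up values typed in by hand (the data are hypothetical, chosen only to include one NA).

```stata
clear
* Age is read as a 3-character string variable
input str3 Age
"25"
"NA"
"40"
end
describe Age
* Strip the characters N and A, then convert to numeric in place
destring Age, replace ignore("NA")
describe Age
list
```

Note that ignore() removes individual characters, not whole words, so it would also strip a stray N or A from other values.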
17
Step 4: Clean Data
Example: Sorting the observations and ordering the variables
Sorting changes the order in which the observations appear. We can sort numbers, letters, etc.
- Example: sort x
Ordering changes the order in which the variables appear in the dataset.
- Example: order x y z
Ordering is especially important if you are doing panel data or any kind of multivariate logit/probit.
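A brief sketch of both commands, using the built-in auto dataset as an assumed stand-in. gsort is included as a related built-in command for descending sorts; it is not mentioned on the slide.

```stata
sysuse auto, clear
* Ascending sort: cheapest cars first
sort price
list make price in 1/3
* Descending sort: most expensive cars first
gsort -price
list make price in 1/3
* Move price and mpg to the front of the variable list
order price mpg
describe, simple
```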
18
Step 4: Clean Data
Example: Renaming existing variables
Command: rename
To rename the variable 'ZGMFX10A' as 'height':
rename ZGMFX10A height
19
Step 4: Clean Data
Example: Labeling data
Command: label
Gives descriptions to variables or data sets.
To label the dataset in memory:
label data "National Population Health Survey"
To label a variable:
label var healthstat "Self-Reported Health Status"
To label the different numeric values the variable may take:
label define vlhealthstat 1 "Excellent" 2 "Very Good" 3 "Good" 4 "Fair" 5 "Poor"
label values healthstat vlhealthstat
Name the value label however you like; it is common to just put vl in front of the variable name. Another example:
label define vlgender 1 "Male" 0 "Female"
If you make a mistake and want to drop a label, the procedure is the same as dropping a variable, but put the word label in front to let STATA know that you are talking about the label and not the variable (e.g. label drop vlgender).
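Put together, the labeling steps might look like the sketch below. The three observations are made up purely so that the commands have something to act on.

```stata
clear
input healthstat gender
1 1
2 0
5 1
end
label data "National Population Health Survey"
label variable healthstat "Self-Reported Health Status"
* Labels that contain spaces must be quoted
label define vlhealthstat 1 "Excellent" 2 "Very Good" 3 "Good" 4 "Fair" 5 "Poor"
label values healthstat vlhealthstat
label define vlgender 1 "Male" 0 "Female"
label values gender vlgender
tabulate healthstat gender
* Dropping a value label leaves the underlying 0/1 codes untouched
label drop vlgender
```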
20
Step 4: Clean Data
Exploring missing values
Missing values are shown as "." in STATA.
To count the number of missing values in a variable, use the user-written command tabmiss:
- To install, type findit tabmiss in the Command window
- To use, type tabmiss varname
Important note: you can use findit to locate and install other user-written commands, as well as to find help files for commands in STATA.
You can also use frequency tables with the missing option: tab female year, m
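If you would rather not install anything, built-in commands cover similar ground. This sketch swaps in count and misstable for tabmiss and assumes the built-in auto dataset, where rep78 has some missing values.

```stata
sysuse auto, clear
* Number of observations with rep78 missing
count if missing(rep78)
* One-line summary of missingness for every variable that has any
misstable summarize
* Frequency table that keeps missing values as their own category
tabulate rep78 foreign, missing
```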
21
Step 5: Basic Analysis
Example: Obtaining basic summary statistics
Command: summarize
Use to obtain basic summary statistics (mean, standard deviation, min, max, etc.) of one or more variables.
summarize [varlist] [if] [in] [weight] [, options]
sum weight height
Typing summarize alone reports on all of your variables; summarize followed by specific variable names reports on just those variables.
correlate is good for looking for relationships, and is also a good check before you start running regressions to see whether there will be collinearity between variables. Ideally each variable contributes different explanatory power for the dependent variable; two highly correlated explanatory variables can skew results.
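A few variations on summarize, sketched with the built-in auto dataset as an assumed example. The detail option and the stored result r(mean) are standard Stata features that the slide does not cover.

```stata
sysuse auto, clear
* All variables
summarize
* Selected variables only
summarize price mpg
* Percentiles, skewness, and kurtosis as well
summarize price, detail
* Restrict to a subsample with if
summarize price if foreign == 1
* Results are stored for later use, e.g. the mean
display r(mean)
```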
22
Step 5: Basic Analysis
Example: Obtaining basic summary statistics
Command: correlate
Creates a matrix of correlation or covariance coefficients for two or more variables.
correlate [varlist] [if] [in] [weight] [, correlate_options]
corr height weight
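As an illustration, assuming the built-in auto dataset, the sketch below shows the correlation matrix, the covariance matrix, and pwcorr, a related built-in command that adds significance levels.

```stata
sysuse auto, clear
* Correlation matrix
correlate price mpg weight
* Covariance matrix instead
correlate price mpg weight, covariance
* Pairwise correlations with p-values and observation counts
pwcorr price mpg weight, sig obs
```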
23
Step 5: Basic Analysis
Example: Frequency tables
Command: tabulate
Calculates and displays frequencies for one or two variables.
tabulate varname [if] [in] [weight] [, options]
tab KEYSEX
tab KEYSEX Age, r
You can tabulate one variable or two. With two variables, the row option (abbreviated r) adds row percentages.
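A sketch of one-way and two-way tables, with foreign and rep78 from the built-in auto dataset standing in for KEYSEX and Age.

```stata
sysuse auto, clear
* One-way frequency table
tabulate foreign
* Two-way table with row percentages
tabulate rep78 foreign, row
* Column percentages, keeping missing values as a category
tabulate rep78 foreign, column missing
```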
24
Step 5: Basic Analysis
Example: More detailed descriptive statistics
Command: tabstat
tabstat varlist [if] [in] [weight] [, options]
The slide's example calculates the sum of a variable.
The default statistic in tabstat is the mean (when none is specified).
Other statistics: min, max, skewness, kurtosis...
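The options mentioned above could be used as follows. This is a sketch on the built-in auto dataset, not the slide's original example, which is not preserved in the transcript.

```stata
sysuse auto, clear
* Default: the mean
tabstat price
* The sum of a variable
tabstat price, statistics(sum)
* Several statistics at once, one column per statistic
tabstat price mpg, statistics(n mean sd min max skewness kurtosis) columns(statistics)
* The same statistics by group
tabstat price mpg, statistics(mean sd) by(foreign)
```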
25
Step 6: Variable Generation
Changing existing variables
Command: replace
Changes the contents of an existing variable. Most useful in the following cases:
- Creating binary and categorical variables
- Fixing the missing values
Syntax: replace oldvar = exp [if exp] [in range]
Ex: Replace responses coded as "no response" (-1 in this case) with missing values:
replace variable = . if variable == -1
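A small worked example of the recode, using four made-up income values in which -1 marks "no response". The variable name income is hypothetical.

```stata
clear
input income
52000
-1
87000
-1
end
* Turn the -1 codes into Stata's missing value
replace income = . if income == -1
count if missing(income)
* Alternative for recoding several variables at once:
* mvdecode income, mv(-1)
```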
26
Step 6: Variable Generation
Creating a new variable
Command: generate
Syntax: generate newvar = exp [if exp] [in range]
Example: generate age_sq=age*age
Note: you can type generate, or gen for short
27
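Step 6: Variable Generation
Create a binary variable (0/1)
- Generate a variable equal to 0 for all observations
- Replace it with 1 for selected observations
Ex: Create a binary for people with income of $80,000 or more:
gen highinc=0
replace highinc=1 if hh_inc>=80000
Note: Stata treats missing values as larger than any number, so observations with missing hh_inc also satisfy hh_inc>=80000 and would be coded 1. If hh_inc has missing values, set highinc to missing for them afterwards: replace highinc=. if missing(hh_inc)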
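The sketch below walks through generate and the binary-variable recipe on four made-up observations, one of which has missing income, to show why missing values need separate handling. The data are hypothetical.

```stata
clear
input hh_inc age
45000 30
80000 41
120000 55
. 62
end
* New variable from an expression
generate age_sq = age^2
* Two-step binary variable
generate highinc = 0
replace highinc = 1 if hh_inc >= 80000 & !missing(hh_inc)
* Missing income should give a missing indicator, not 0 or 1
replace highinc = . if missing(hh_inc)
* One-line equivalent: the expression is 1 when true and 0 when false
generate highinc2 = (hh_inc >= 80000) if !missing(hh_inc)
list
```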
28
Step 7: Graphs and Plots
Plain text plot:
plot yvar1 [yvar2 [yvar3]] xvar [if exp] [in range] [, columns(#) encode hlines(#) lines(#) vlines(#) ]
ex: plot weight height
Graphics plot (generates an image):
[graph] twoway plot [if] [in] [, twoway_options]
ex: graph twoway scatter weight height
29
Graph Examples
Two-way scatter plot:
twoway scatter yvar xvar
Two-way line plot:
twoway line yvar xvar
Two-way scatter plot with linear prediction from a regression of y on x:
twoway (scatter yvar xvar) (lfit yvar xvar)
Two-way scatter plot with linear prediction from a regression of y on x, with 95% CI:
twoway (scatter yvar xvar) (lfitci yvar xvar)
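These plots can be tried on the built-in auto dataset, where car weight and length stand in for weight and height. The output file name is a placeholder, and PNG export is assumed to be available on your platform.

```stata
sysuse auto, clear
* Scatter plot
twoway scatter weight length
* Scatter plot with fitted line
twoway (scatter weight length) (lfit weight length)
* Drawing the confidence band first keeps it from covering the points
twoway (lfitci weight length) (scatter weight length)
* Save the current graph as an image; the extension sets the format
graph export "weight_length.png", replace
```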
30
Saving data
If you've imported data into STATA from a spreadsheet, text file, etc., you may want to save it as a STATA dataset. From the STATA menu, go to File > Save (you will be given the option to replace the data if it already exists).
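The code equivalent is the save command. In this sketch the file name auto_copy.dta is a placeholder and the built-in auto dataset stands in for imported data.

```stata
sysuse auto, clear
* Write the data in memory to a Stata dataset, overwriting any existing file
save "auto_copy.dta", replace
* Load it back later
use "auto_copy.dta", clear
```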
31
Regression Analysis
32
Fitting a Linear Model
General syntax for linear regression:
regress depvar [indepvars] [if] [in] [weight] [, options]
Where:
- depvar (Y) is our dependent variable
- indepvars (X) are our independent variable(s)
Which variable plays which role is usually determined by theory.
Research question: Is there a relationship between weight and height?
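The tutorial's weight and height data are not included in this transcript, so the sketch below uses an assumed analogue: car weight regressed on car length in the built-in auto dataset.

```stata
sysuse auto, clear
* Dependent variable first, then the independent variable(s)
regress weight length
* Stored coefficients: slope and constant
display _b[length]
display _b[_cons]
* The same model on a subsample
regress weight length if foreign == 0
```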
33
Graphical Representation
- Ŷi: estimated (or predicted) value of Y based on the regression coefficients
- Yi: actual value of Y
- ei: residual (actual Y minus estimated Y, ei = Yi - Ŷi)
- β1: constant term
- β2: slope of the line
34
Linear Model Output
Follows the notation reg Y X: in the output table, β2 is the coefficient on X and β1 is the constant (_cons).
35
Post Estimation
36
Post Estimation
Obtaining residuals:
predict residuals, residuals
NB: The first "residuals" after predict is just the name you want to give to the new variable. You can change this if you want to.
Obtaining fitted values:
predict fittedvalues, xb
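A quick sketch confirming that residuals are the actual values minus the fitted values. It assumes the built-in auto dataset, and the variable names resid, yhat, and check are arbitrary.

```stata
sysuse auto, clear
regress weight length
* Residuals, stored in a new variable called resid
predict resid, residuals
* Fitted values, stored in a new variable called yhat
predict yhat, xb
* Rebuild the residuals by hand for comparison
generate check = weight - yhat
summarize resid check
```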
37
Heteroscedasticity testing
OLS assumes homoskedasticity. We can test for this after running a regression.
Option 1: Examine the pattern in the residual plot (residuals vs. fitted values):
rvfplot, yline(0)
Option 2: Formal test (Breusch-Pagan test):
estat hettest
38
RVF Plot
39
Breusch-Pagan Test
A small p-value means we reject the null (no heteroskedasticity) in favour of the alternative (there is heteroskedasticity of some form).
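Both options can be run back to back after a regression. This sketch assumes the built-in auto dataset and a price model chosen only for illustration.

```stata
sysuse auto, clear
regress price mpg weight
* Option 1: residuals vs. fitted values, with a reference line at zero
rvfplot, yline(0)
* Option 2: Breusch-Pagan test
estat hettest
* The p-value is stored for later use
display r(p)
```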
42
Linearity testing
OLS assumes that the relationship between Y and the X's is linear.
To test for this after a regression:
Command: acprplot var, lowess
acprplot height, lowess
43
ACPRPLOT Stata
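A sketch of the plot for each regressor in turn, assuming the built-in auto dataset. If the lowess curve tracks the straight line closely, the linearity assumption looks reasonable for that variable.

```stata
sysuse auto, clear
regress price mpg weight
* Augmented component-plus-residual plot for mpg, with a lowess smoother
acprplot mpg, lowess
* The same check for weight
acprplot weight, lowess
```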
44
Testing for multicollinearity
OLS assumes the independent variables (x's) are not collinear.
To test for this, use a correlation matrix.
Command: correlate varlist (before regression)
Rule of thumb: a correlation above 0.8 signals a problem.
Remedy: drop the variable or transform it (e.g. log income).
45
Testing for multicollinearity
Variance Inflation Factor
Command: vif (after regression)
Rule of thumb: VIF < 5
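Both checks in sequence, sketched on the built-in auto dataset, where weight and length are strongly related. Current Stata versions document the command as estat vif; the shorter vif is the older spelling.

```stata
sysuse auto, clear
* Before the regression: pairwise correlations among the regressors
correlate mpg weight length
regress price mpg weight length
* After the regression: variance inflation factors
estat vif
```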
46
Specification testing
Testing for omitted variables or model misspecification: RESET test
Syntax: estat ovtest
In the slide's example the null is rejected at the 0.01 level: the model has omitted variables.
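As a sketch, the RESET test on an assumed model from the built-in auto dataset. The null hypothesis is that the model has no omitted variables.

```stata
sysuse auto, clear
regress price mpg weight
* Ramsey RESET test using powers of the fitted values
estat ovtest
display r(p)
```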
47
Testing Normality of Residuals
OLS assumes errors are normally distributed. We can use the residuals to test this assumption.
Commands:
predict r, residuals
kdensity r, normal
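The sketch below runs the slide's two commands on the built-in auto dataset and adds two related built-in checks that the slide does not mention: a normal quantile plot (qnorm) and the Shapiro-Wilk test (swilk).

```stata
sysuse auto, clear
regress price mpg weight
predict r, residuals
* Kernel density of the residuals with a normal density overlaid
kdensity r, normal
* Quantiles of the residuals against normal quantiles
qnorm r
* Formal test; the null is that the residuals are normally distributed
swilk r
```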
48
Parameter Hypothesis Testing
Test whether a parameter equals zero, i.e. height has no impact on weight:
testparm height
or
test height
Test whether both parameters equal zero, i.e. both height and age have no impact on weight:
test height age
Test whether the coefficients on two variables are equal, i.e. the effect of height on weight is the same as the effect of age on weight:
test height = age
Simple t-tests
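The three kinds of test, sketched on the built-in auto dataset with mpg and weight standing in for height and age. Only regressors from the fitted model can appear in test, never the dependent variable.

```stata
sysuse auto, clear
regress price mpg weight
* Single coefficient equal to zero (two equivalent commands)
testparm mpg
test mpg
* Both coefficients jointly equal to zero
test mpg weight
* The two coefficients equal to each other
test mpg = weight
```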
49
Storing Estimation Results
To store the results of a regression:
estimates store name
This is useful for analyzing regression results after running multiple models and for comparing models.
To list multiple results side by side:
estimates table name1 name2 … name5
To export results to Excel, Word, or LaTeX, use the user-written command esttab.
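A sketch of the store-and-compare workflow with two assumed models from the built-in auto dataset. The esttab lines are left as comments because esttab must be installed first (it is part of the user-written estout package), and results.csv is a placeholder file name.

```stata
sysuse auto, clear
regress price mpg
estimates store m1
regress price mpg weight
estimates store m2
* Side-by-side table with standard errors, sample size, and R-squared
estimates table m1 m2, b(%9.3f) se stats(N r2)
* To export with esttab, install the estout package once, then export:
* ssc install estout
* esttab m1 m2 using "results.csv", replace se
```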
50
Advanced Topics
51
Non-continuous outcome variables
- Binary outcomes: probit or logit (help probit; help probit postestimation)
- Ordered discrete outcomes: oprobit (help oprobit; help oprobit postestimation)
- Categorical outcomes: mlogit (help mlogit; help mlogit postestimation)
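For a feel of the syntax, here is a sketch on the built-in auto dataset, treating foreign as a binary outcome and rep78 as an ordered one. The variable choices are illustrative assumptions; mlogit follows the same depvar-then-regressors pattern.

```stata
sysuse auto, clear
* Binary outcome: probit, then average marginal effects
probit foreign mpg weight
margins, dydx(*)
* Binary outcome: logit
logit foreign mpg weight
* Ordered outcome: repair record on a 1 to 5 scale
oprobit rep78 mpg weight
```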
52
Panel Data Econometrics
Pooled linear regression:
regress depvar [indepvars] [if] [in] [weight] [, options]
reg weight height age
Random effects:
xtreg depvar [indepvars] [if] [in] [, re RE_options]
xtreg weight height age year, re
Fixed effects:
xtreg depvar [indepvars] [if] [in] [weight] , fe [FE_options]
xtreg weight height age year, fe
xtreg requires the panel structure to be declared first with xtset.
Notes:
- Pooled linear regressions are used when you can't specify the point in time at which an observation takes place, so you pool all observations together as if they were a cross-section (no time trends).
- Random effects assigns parameters to the effect that the underlying common characteristics have on the dependent variable.
- Fixed effects is a way to control for endogeneity by assuming that there is an underlying relationship stemming from certain unifying characteristics (country effects, province effects, sibling effects, individual effects). It makes no assumption about the underlying distribution of these effects, whereas random effects does (FE is thus semi-nonparametric).
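The three estimators can be compared on a simulated panel. Everything in this sketch is an assumption made for illustration: 100 hypothetical individuals observed for 5 years, an unobserved individual effect u_i that is correlated with height, and a made-up regressor called exercise.

```stata
clear
set seed 2017
* 100 hypothetical individuals
set obs 100
generate id = _n
* Unobserved, time-invariant individual effect
generate u_i = rnormal(0, 5)
* Five yearly observations per individual
expand 5
bysort id: generate year = 2012 + _n
* height is correlated with the individual effect
generate height = 170 + 0.5*u_i + rnormal(0, 2)
generate exercise = 10*runiform()
generate weight = 0.6*height - 0.8*exercise + u_i + rnormal(0, 3)
* Declare the panel structure before using xtreg
xtset id year
* Pooled OLS, random effects, and fixed effects
regress weight height exercise
xtreg weight height exercise, re
xtreg weight height exercise, fe
```

Because u_i is correlated with height by construction, the fixed-effects estimate of the height coefficient should sit closer to the true value of 0.6 than the pooled estimate.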
53
Standard Errors
Heteroskedasticity-robust standard errors:
reg weight height age year, vce(robust)
Cluster-robust standard errors:
reg weight height age year, vce(cluster id)
Bootstrapped clustered standard errors:
reg weight height age year, vce(bootstrap, cluster(id))
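A sketch of the three options on the built-in auto dataset. The cluster variable id is invented here by grouping observations in fours, purely so that the commands run; with real data it would be the person, household, or region identifier. The reps() and seed() values are arbitrary choices.

```stata
sysuse auto, clear
* Hypothetical cluster identifier: groups of four consecutive observations
generate id = ceil(_n/4)
* Heteroskedasticity-robust
regress price mpg weight, vce(robust)
* Cluster-robust
regress price mpg weight, vce(cluster id)
* Cluster bootstrap with a fixed seed so results can be reproduced
regress price mpg weight, vce(bootstrap, cluster(id) reps(200) seed(2017))
```

The coefficient estimates are identical across the three runs; only the standard errors change.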
54
Working With Do-Files
Motivation: why bother?
- We can avoid tediously running the same set of commands over and over again
- Creates a document listing all the commands we've run in plain text form
- Increases our productivity with STATA!
55
How to get to the do-file editor:
File > New > Do-file
Or use the "Do-file Editor" button at the top (depending on which version of STATA you have)
56
Input commands here
Press to execute
57
Careful with comment slashes (// and ///): they are ignored only when the code is run as a do-file, not when it is typed line by line into the Command window.
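A do-file is just a plain-text list of commands. The sketch below shows one possible layout; the log file name is a placeholder, the version number is an assumption about your installation, and the built-in auto dataset stands in for real data. Comments use a leading asterisk, which works both in do-files and in the Command window.

```stata
* example.do: a minimal do-file skeleton
* Interpret the commands as this Stata version would
version 14
clear all
capture log close
log using "example_log.smcl", replace

* Load and inspect the data
sysuse auto, clear
summarize price mpg weight

* Analysis
regress price mpg weight

log close
```

Run the whole file from the do-file editor, or with do followed by the file name in the Command window.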
58
STATA Resources
59
STATA Online Resources
STATA manuals are freely downloadable from the above site Typing help [topic] in the command window is also useful, but the online manuals generally contain more detail/examples
60
STATA Online Resources
UCLA Institute for Digital Research and Education List of topics and STATA resources can be found here:
61
Other STATA Resources
- Jones, A.M., Rice, N., d'Uva, T.B., Balia, S. Applied Health Economics, Second Edition. Routledge Advanced Texts in Economics and Finance. Taylor & Francis.
- Cameron, A.C., Trivedi, P.K. Microeconometrics Using Stata, Revised Edition. Stata Press.
- Allison, P.D. Fixed Effects Regression Models. Quantitative Applications in the Social Sciences. SAGE Publications.
62
Useful Data Sites
- Ontario Data Documentation, Extraction Service and Infrastructure (ODESI) website:
- Computing in the Humanities and Social Sciences (CHASS) at U of T
63
Thanks for Listening Good luck with STATA!