# CIQLE Workshop: Introduction to Longitudinal Data Analysis with Stata: Panel Models and Event History Analysis (Silke Aisenbrey, Yale University)


CIQLE Workshop: Introduction to longitudinal data analysis with Stata: panel models and event history analysis. Silke Aisenbrey, Yale University. About this workshop: how longitudinal data can look and what methods can be used. It is an intro to these methods and to Stata, not an expert seminar, even though there are experts in the room for the different areas, and we will use their knowledge; hopefully an interactive seminar. Janet and Hannah will contribute. Work in groups or individually (experienced and beginners); take your time. Schedule: today until approximately 6; tomorrow from 10 until 5; food. The data we will use are examples.

Goals for the workshop:
- Intro to Stata
- Modeling change over time: panel regression models (fixed, between, and random effects)
- Modeling whether and/or when events occur: event history analysis (data management for event history data, Kaplan-Meier, Cox, piecewise constant)

Open Stata. The interface windows: VARIABLES (the variables of the open file), RESULTS (results and syntax), REVIEW (review of past syntax; commands can be issued by command line or menu), COMMAND (where commands are typed).

Open the data with the menu (Stata data --> eventex.dta).

Use the data browser/editor to see real data and to make changes directly in the data: erase variables or cases, or make single changes in cases -->

These are real data…

basic descriptive commands
relational and logical operators in stata:
== is equal to
~= is not equal (also !=)
> greater than
< less than
>= greater than or equal
<= less than or equal
& and
| or
~ not (also !)
Note: Stata is case sensitive.

basic descriptive commands
sum var
tab var1 var2
tab var1 var2, col
Combine with:
… if var1==2 & var3>0
by var1: …
sort …
exercise, e.g.:
tab abitur sex, col
tab abitur sex if cohort==1930, col
sort cohort
by cohort: tab abitur sex
Do one example.

basic commands for data management
help "command"
gen var1 = var2
recode var1 (0=.) (1/8=2) (9=3)
rename var1 var100
Use the following variables: cohort (indicator of cohort membership), sex (1=male, 2=female), agemaryc (age at first marriage).
exercise, e.g.:
sum agemaryc
Recode age at first marriage into groups: generate a new variable, recode the new variable into groups, recode if marcens==0. We talk more about censoring later, BUT because these are real data, this is what it is. Tab your new variable with sex or cohort.
Hannah later: the real stuff.
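The recoding steps above can be put together as a short do-file. This is only a sketch, not the workshop's solution: it assumes eventex.dta is loaded with the variables named on the slide (agemaryc, marcens, sex, cohort), and the cut-points for the age groups are illustrative assumptions.

```stata
* sketch: grouped age at first marriage, non-censored cases only
sum agemaryc
gen agemargrp = agemaryc if marcens == 0   // new variable (assumed name)
recode agemargrp (min/20 = 1) (20/25 = 2) (25/30 = 3) (30/max = 4)   // illustrative cut-points
tab agemargrp sex, col
tab agemargrp cohort, col
```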

possible break

Intro to panel regression with stata:
- panel data
- fixed effects
- between effects
- random effects
- fixed or random?

panel data (panelex1.dta)
open panelex1

Panel data: Panel data, also called cross-sectional time-series data, are data where multiple cases (people, firms, countries, etc.) were observed at two or more time periods. Cross-sectional data contain only information about variance between subjects. Panel data contain two kinds of information, between and within subjects --> two sources of variance (cross-sectional time series: e.g. more countries, pooled): the cross-sectional information reflected in the differences between subjects, and the time-series or within-subject information reflected in the changes within subjects over time. Because of these two types of information, there are also two kinds of variance in panel data: variance between subjects and variance within subjects over time. Panel data regression techniques allow you to take advantage of these different types of information. While it is possible to use ordinary multiple regression techniques on panel data, they may not be optimal. The estimates of coefficients derived from regression may be subject to omitted variable bias, a problem that arises when there is some unknown variable or variables that cannot be controlled for that affect the dependent variable. With panel data, it is possible to control for some types of omitted variables even without observing them, by observing changes in the dependent variable over time. This controls for omitted variables that differ between cases but are constant over time. It is also possible to use panel data to control for omitted variables that vary over time but are constant between cases.
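One way to see the two sources of variance directly in Stata is xtsum, which splits each variable's variation into between-subject and within-subject components. A minimal sketch, assuming panelex1.dta with the id, t, income, and childrn variables used on the following slides:

```stata
use panelex1.dta, clear
tsset id t               // declare the panel structure
xtsum income childrn     // reports overall, between, and within SDs
```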

Janet: Basics of panel regression models

cross sectional vs. panel analyses open panelex1
Cross-sectional vs. panel analyses. Open panelex1.dta. Ignore the fact that we have repeated measures: regress income childrn. Going through one example: a small positive relation. Conclusion: more children --> higher income.

Fixed effects model. Answers the question: what is the effect of x when x changes within persons over time? E.g., Person A has two children at the first point of time and three children at the second; what effect does this change have on income? Information used: fixed effects estimates use the time-series information in the data. Variance analyzed: within. Problem: only time-variant variables can be included. It lets you use the changes in the variables over time to estimate the effects of the independent variables on your dependent variable, and is the main technique used for analysis of panel data. Fixed effects regression is the model to use when you want to control for omitted variables that differ between cases but are constant over time, like unmeasured socialization variables. How to do this?

Fixed effects exercise: separate regression for each unit and then average it:
regress income childrn if id==1 regress income childrn if id==2

(b1 + b2) / 2 = -2.5 (b1 and b2 are the childrn slopes from the two separate regressions)
conclusion: more children --> lower income. Remember: ignoring the time structure of our data, the effect was positive. Instead of dividing by n, we can include dummies: because we assume that the error for an individual is time constant, we can model this error term by including a dummy per individual. The inclusion of dummy variables makes it obvious why we can't include time-invariant variables in the fixed effects model. Leave one id dummy out!
exercise: generate dummy variables for each person and regress with the dummy variables:
tab id, g(iddum)
reg income childrn iddum1 iddum2

Fixed effects -define data set as panel data
Fixed effects: define the data set as panel data:
tsset id t
Regression with the fixed effects command:
xtreg income childrn, fe
Explain tsset: it goes over time, then to the next id. Introduce: trigger. In the next step we concentrate on the variance between.
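The dummy-variable regression from the exercise and xtreg, fe are two routes to the same within estimate. A sketch, assuming panelex1.dta is loaded and tsset; the i.id factor-variable notation (Stata 11+) generates the person dummies automatically, and the childrn coefficient should be identical in both lines:

```stata
tsset id t
regress income childrn i.id   // LSDV: one dummy per person, first omitted
xtreg income childrn, fe      // within (fixed effects) estimator
```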

Between effects model. Answers the question: what is the effect of x when x is different (changes) between persons? Person A has on average three children and Person B has on average five children; what effect does this difference have on their income? In the between effects model we model the mean response, where the means are calculated for each of the units. Information used: cross-sectional information (between subjects). Variance analyzed: between variance. Time-variant and time-invariant variables can be included. The between effects estimator is mostly important because it is used to produce the random effects estimator. Regression with between effects is the model to use when you want to control for omitted variables that change over time but are constant between cases. It allows you to use the variation between cases to estimate the effect of the omitted independent variables on your dependent variable.

Between effects conclusion: more children --> more income
Average --> regress income childrn (shown only on the slide). Running xtreg with between effects is equivalent to taking the mean of each variable for each case across time and running a regression on the collapsed dataset of means. As this results in a loss of information, between effects are not used much in practice. Researchers who want to look at time effects without considering panel effects will generally use a set of time dummy variables, which is the same as running time fixed effects. Conclusion: more children --> more income. Define the data as panel data, then:
xtreg dependent independent, be
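The equivalence described above can be checked by hand: collapse to person means and run OLS on the means. A sketch, assuming panelex1.dta (id, income, childrn); preserve/restore keeps the original data intact:

```stata
xtreg income childrn, be         // between estimator
preserve
collapse (mean) income childrn, by(id)
regress income childrn           // same slope as xtreg, be
restore
```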

Random effects model. Assumption: no difference between the answers to these two questions: 1) What is the effect of x when x changes within the person? Person A has two children at the first point of time and three children at the second; what effect does this change have on their income? 2) What is the effect of x when x is different (changes) between persons? Person A has two children and Person B has three children; what effect does this difference have on their income? Information used: panel and cross-sectional (between and within subjects). Variance analyzed: between variance and within variance. Time-variant and time-invariant variables can be included. Assumes that coefficients are constant across cases, that the unit-specific error term has mean zero, and that all omitted variables are uncorrelated with the independent variables.

Random effects model:
- a matrix-weighted average of the fixed (within) and the between estimates
- assumes b1 has the same effect in the cross section as in the time series
- requires that individual error terms are treated as random variables and follow the normal distribution
use: xtreg dependent independent if var==x, re
Note that this output gives estimates of sigma_u and sigma_e, where sigma_u refers to the standard deviation of the unit-specific intercepts. rho refers to the proportion of the total variance that is due to the unit-specific intercepts. Stata also provides a number of measures of R2: overall R2 is simply the standard R2 from regressing y on x; between R2 is the R2 from the regression of the means of y on the means of x (the between estimator); within R2 is similar and amounts to the R2 from the within prediction equation. The biggest problem with the RE model is, again, the requirement that there is no correlation between the alpha_i and x. If there are some unmeasured factors that go into alpha_i and they are correlated with the x's, then the estimates of those slopes will be biased.
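As a sketch of how to read the random effects output (variable names as in panelex1.dta):

```stata
xtreg income childrn, re
* in the output:
*   sigma_u : sd of the unit-specific intercepts
*   sigma_e : sd of the idiosyncratic error
*   rho     : sigma_u^2 / (sigma_u^2 + sigma_e^2),
*             the share of total variance due to the unit-specific intercepts
```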

Explain the parentheses.

possible break

open data: panelex2.dta, varlist:
Explain the variables: hhinc is a relative measure and can be negative; edu. Split all analyses by sex, because the data are at the household level!!! Dependent variable: hhincome.

Tell Stata the structure of the data:
tsset X Y
X = caseid, Y = time/wave
Summary statistics: xtdes, xtsum
(xtdes reports participation patterns as Freq. / Percent / Cum. / Pattern, e.g. XXXXXX for cases observed in every wave, plus other patterns.)

Use the effects:
xtreg dependent independent if sex==1, fe
xtreg dependent independent if sex==1, be
xtreg dependent independent if sex==1, re
exercise: compare/discuss the models, e.g.:
xtreg indvar1 indvar2 … if sex==1, fe
Try to include time-invariant variables. Try to make a theoretical/empirical argument for why you use which model. Be careful with time-invariant variables in the fixed effects model.

Problems/Tests/Solutions:
What's the right model: fixed or random effects? Test: Hausman test. Null hypothesis: the coefficients estimated by the efficient random effects estimator are the same as those estimated by the consistent fixed effects estimator. If they are the same (insignificant p-value, Prob>chi2 larger than .05) --> safe to use random effects. If you get a significant p-value --> use fixed effects.
xtreg y x1 x2 x3 ... , fe
estimates store fixed
xtreg y x1 x2 x3 ... , re
estimates store random
hausman fixed random
The generally accepted way of choosing between fixed and random effects is running a Hausman test. Statistically, fixed effects are always a reasonable thing to do with panel data (they always give consistent results), but they may not be the most efficient model to run. Random effects will give you better p-values, as it is a more efficient estimator, so you should run random effects if it is statistically justifiable to do so. The Hausman test checks a more efficient model against a less efficient but consistent model, to make sure that the more efficient model also gives consistent results. To run a Hausman test comparing fixed with random effects in Stata, you first estimate the fixed effects model, store the coefficients so that you can compare them with the results of the next model, estimate the random effects model, and then do the comparison. hausman performs Hausman's specification test. To use hausman, perform the following steps:
(1) obtain an estimator that is consistent whether or not the hypothesis is true;
(2) store the estimation results under a name ("consistent") using estimates store;
(3) obtain an estimator that is efficient (and consistent) under the hypothesis that you are testing, but inconsistent otherwise;
(4) store the estimation results under a name ("efficient") using estimates store;
(5) use hausman to perform the test: hausman name-consistent name-efficient [, options]
The order of computing the two estimators may be reversed, but you have to be careful to specify the models to hausman in the order "always consistent" first and "efficient under H0" second. It is possible to skip storing the second model and refer to the last estimation results by a period (.). hausman may be used in any context. The order in which you specify the regressors in each model does not matter, but it is your responsibility to ensure that the estimators and models are comparable and satisfy the theoretical conditions (see (1) and (3) above).
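The five steps, applied to the workshop's panel example. A sketch: hhincome is the dependent variable from panelex2.dta, but x1 and x2 are placeholders for whatever regressors your own model uses.

```stata
xtreg hhincome x1 x2 if sex==1, fe
estimates store fixed            // consistent estimator, stored first
xtreg hhincome x1 x2 if sex==1, re
estimates store random           // efficient-under-H0 estimator, stored second
hausman fixed random
* Prob>chi2 > .05 --> safe to use random effects; otherwise use fixed
```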


Problems/Tests/Solutions: Autocorrelation? What is autocorrelation:
Last time period's values affect current values. Test: xtserial. Install the user-written program: type findit xtserial or net search xtserial. Then:
xtserial depvar indepvars
Wooldridge (2002) derives a simple test for autocorrelation in panel-data models. There is a user-written program, called xtserial, written by David Drukker to perform this test in Stata. xtserial implements a test for serial correlation in the idiosyncratic errors of a linear panel-data model discussed by Wooldridge (2002). Drukker (2003) presents simulation evidence that this test has good size and power properties in reasonable sample sizes. Under the null of no serial correlation, the residuals from the regression of the first-differenced variables should have an autocorrelation of -.5. This implies that the coefficient on the lagged residuals, in a regression of the residuals on their lags, should be -.5. xtserial performs a Wald test of this hypothesis. See Drukker (2003) and Wooldridge (2002) for further details.

Significant test statistic indicates presence of serial correlation.
Solution: use a model that corrects for autocorrelation: xtregar instead of xtreg.
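Putting the test and the fix together; a sketch with placeholder names (depvar, x1, x2). xtserial must be installed once per machine:

```stata
findit xtserial             // locate and install the user-written command
xtserial depvar x1 x2       // Wooldridge test for serial correlation
* significant p-value --> serial correlation present, so e.g.:
xtregar depvar x1 x2, fe    // fixed effects with AR(1) disturbance
```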

possible break

different data structure
Panel data:
- waves
- number (e.g. of children) in wave 1 / 2 / 3 / 4
- regression models: dependent variable is continuous
Event data:
- dates of events: birth of first child 1963, birth of second child 1966, …, start of first job, start of second job, …
- time information in event data is more precise
- dependent variable: event happens, 0/1

Different Faces of Event History Data
Time: continuous or discrete

Types of censoring: the subject does not experience the event of interest.
- incomplete follow-up
- lost to follow-up
- withdraws from the study
- left or right censored
(X = dies.)
Key assumption for censoring: dropping out is not death and is therefore not informative!

What does this mean in terms of schooling, marriage, and first job?

open data eventex.dta

tell stata that our data is “survival data” stset
stset X, failure(Y) id(Z)
X = time at which the event happens or the case is right censored; this is always needed.
Y = 0 or missing means censored; all other values are interpreted as an event taking place (failure).
Z = id.
Three examples:
stset ageendsch (event: end of school; time: end of school)
stset agemaryc, failure(marcens) id(caseid) (event: marriage)
stset agestjob, failure(stjob) id(caseid) (event: first job)
End of schooling happens to everybody; marriage and first job do not! Best would be three groups for three models.
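After each stset it is worth checking what Stata actually declared. A sketch using the marriage example from the slide (eventex.dta with agemaryc, marcens, caseid):

```stata
stset agemaryc, failure(marcens) id(caseid)   // event: first marriage
stdes                           // overview: subjects, records, failures
list caseid _t0 _t _d in 1/10   // the variables stset created
```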

DATA MANAGEMENT: HANNAH

Different Models of Event History
Time continuous:
- non-parametric: Kaplan-Meier, Nelson-Aalen, log-rank test for comparison between groups (only qualitative covariates; compare survival experiences between groups, e.g. sex, cohorts)
- semi-parametric: Cox, piecewise constant (inclusion of covariates in the models; univariate and multivariate)
- parametric: exponential, Weibull, log-logistic, lognormal, Gompertz, generalized gamma
Time discrete:
- logistic, log-log
Extended from Jenkins 2005.

survivor function and hazard function
Survivor function, S(t): defines the probability of surviving longer than time t. Hazard (instantaneous hazard, force of mortality): the risk that an event will occur during a time interval Δt at time t, given that the subject did not experience the event before that time. Survivor and hazard functions can be converted into each other. Survivor: this is what Kaplan-Meier curves show. Hazard: this is what Nelson-Aalen curves show; it is an incidence rate, where higher values indicate more events per unit of time.
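The conversion between the two functions can be written out explicitly (standard notation, with T the event time and H(t) the cumulative hazard):

```latex
S(t) = P(T > t), \qquad
h(t) = \lim_{\Delta t \to 0}
       \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
```

and the two are linked by

```latex
H(t) = \int_0^t h(u)\, du, \qquad
S(t) = \exp\{-H(t)\}, \qquad
h(t) = -\frac{d}{dt} \ln S(t)
```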

non-parametric: kaplan-meier
List the Kaplan-Meier survivor function:
sts list
sts list, by(sex)
compare
Graph the Kaplan-Meier survivor function:
sts graph
sts graph, by(sex)

non-parametric: kaplan-meier
exercise: stset your data for marriage, end of school, or first job. E.g.:
1) sts list
2) sts graph
3) sts list, by(…) and compare
4) sts graph, by(…)

non-parametric: Nelson-Aalen
List the Nelson-Aalen cumulative hazard function:
sts list, na
sts list, na by(sex)
compare
Graph the Nelson-Aalen cumulative hazard function:
sts graph, na
sts graph, na by(sex)

non-parametric: Nelson-Aalen
exercise: stset your data for marriage, end of school, or first job.
1) sts list, na
2) sts graph, na
3) sts list, na by(…) and compare
4) sts graph, na by(…)
Are the differences significant?

Comparing Kaplan-Meier curves
non-parametric: kaplan-meier. Comparing Kaplan-Meier curves. The log-rank test can be used to compare survival curves. Hypothesis test (test of significance): H0: the curves are statistically the same; H1: the curves are statistically different. It compares observed to expected cell counts. Expected events: these are the events that would be expected if there were NO difference between the groups; in this case, the observed and expected values are different enough to produce a significant chi-squared value (cl108). A less commonly used test: the Wilcoxon test, which places greater weight on events near time 0.
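The comparison above as commands; a sketch, assuming the data are already stset (here for marriage, as earlier):

```stata
sts graph, by(sex)       // visual comparison of the curves
sts test sex             // log-rank test: H0, the curves are the same
sts test sex, wilcoxon   // alternative test, weights events near time 0
```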

Comparing Kaplan-Meier curves
non-parametric: kaplan-meier. Comparing Kaplan-Meier curves. exercise: test the equality of survivor functions, e.g.:
sts test abitur

Limit of Kaplan-Meier curves
non-parametric: kaplan-meier. Limit of Kaplan-Meier curves. What happens when you have several covariates that you believe contribute to survival? Example: education, marital status, children, and gender contribute to job change. You can use stratified K-M curves for 2 or maybe 3 covariates. Beyond that you need another approach: a multivariate one. The Cox proportional hazards model is the most common for many covariates (think multivariate regression or logistic regression rather than a Student's t-test or the odds ratio from a 2 x 2 table).

Cox proportional hazards model
semi-parametric models: cox. Cox proportional hazards model. Can handle both continuous and categorical predictor variables. Without knowing the baseline hazard h0(t), we can still calculate coefficients for each covariate, and therefore hazard ratios. Assumes multiplicative risk --> the proportional hazards assumption.

semi-parametric models: cox
Example: age of first marriage.
stcox sex
Interpretation: because the Cox model does not estimate a baseline, there is no intercept in the output. With sex (male=1, female=2): whatever the hazard rate at a particular time is for men, it is 1.5 times higher for women. What does this mean in our case? Women get married younger than men do. Limitations: the basic model does not accommodate variables that change over time. Luckily, most variables (e.g. gender, ethnicity, or a congenital condition) are constant; if necessary, one can program time-dependent variables. When might you want this? The baseline hazard function h0(t) is never specified, but you can estimate h0(t) accurately if you need to estimate S(t).
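The example on this slide as a short do-file; a sketch, assuming the marriage stset from before and sex coded 1=male, 2=female:

```stata
stset agemaryc, failure(marcens) id(caseid)
stcox sex
* the output reports a hazard ratio (an exponentiated coefficient):
* a ratio of about 1.5 for sex means that, at any given age, women's
* marriage hazard is about 1.5 times men's (proportionality assumed)
```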

Interpretation of the regression coefficients
semi-parametric models: cox Interpretation of the regression coefficients An estimated hazard rate ratio greater than 1 indicates the covariate is associated with an increased hazard of experiencing the event of interest An estimated hazard rate ratio less than 1 indicates the covariate is associated with a decreased hazard of experiencing the event of interest Estimated hazard rate ratio of 1 indicates no association between covariate and hazard.

Graphically: estimates for functions:
stcox sex, basehc(H0)
stcurve, hazard at1(sex=0) at2(sex=1)
stcox sex, basesurv(S0)
stcurve, survival at1(sex=0) at2(sex=1)
Don't forget to drop H0/S0 before redoing the command.

exercise: make your own cox model and estimate the hazard and survival

Proportional hazards assumption: covariates are independent with respect to time, and their hazard ratios are constant over time. Three general ways to examine model adequacy:
Graphically: do the survival curves intersect? Observed vs. expected plots (stcoxkm).
Mathematically: Schoenfeld test.
Computationally: time-dependent variables (extended model).

stcoxkm, by(sex): compare with Kaplan-Meier.
exercise: do this with one of your estimates (works only with one variable). stcoxkm plots the Kaplan-Meier observed survival curves and compares them to the Cox predicted curves for the same variable. The closer the observed values are to the predicted, the less likely it is that the proportional-hazards assumption has been violated.

stphplot, by (sex) "log-log" plots
exercise: do this with one of your estimates; stphplot can be adjusted --> look in the stphplot help. stphplot plots -ln{-ln(survival)} curves for each category of a nominal or ordinal covariate versus ln(analysis time). These are often referred to as "log-log" plots. Optionally, these estimates can be adjusted for covariates. The proportional-hazards assumption is not violated when the curves are parallel.

Mathematically: Schoenfeld Test
Tests whether the log hazard function is constant over time; a rejection of the null hypothesis thus indicates a deviation from the proportional hazards assumption.
stcox sex, schoenfeld(sch*) scaledsch(sca*)
estat phtest (if more variables: estat phtest, detail)
exercise: do this with your model; try to find a model that fits (e.g. stcox with a quadratic age term; check Cleves p. 141).
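The test sequence on this slide, assembled into one sketch (the sch*/sca* variables must be dropped before the model is re-run):

```stata
stcox sex, schoenfeld(sch*) scaledsch(sca*)
estat phtest, detail   // per-covariate and global PH tests
drop sch* sca*         // clean up before re-running the model
```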

Summary: Survival analysis quantifies time to a single, dichotomous event. It handles censored data well. Survival and hazard can be mathematically converted into each other. Kaplan-Meier survival curves can be compared graphically. Cox proportional hazards models help distinguish the individual contributions of covariates to survival, provided certain assumptions are met.

It can get a lot more complicated than this
The proportional hazards model as shown only works when the time-to-event data are relatively simple. Complications:
- non-proportional hazard rates
- time-dependent covariates
- competing risks
- multiple failures
- non-absorbing events
- etc.
There is an extensive literature for these situations, and software is available to handle them. We will just look briefly at one: piecewise constant.

Semi-parametric models: Piecewise constant
- the transition rate is assumed not to be constant over the observed time
- the data are split into user-defined time pieces
- transition rates are constant within each time piece
- but: transition rates can change between time pieces
In Stata, the command stpiece automates the definition of time pieces in our models. In addition, the option tv specifies variables whose effects are thought to be non-proportional and may vary between time pieces.

Semi-parametric models: piecewise constant
In Stata this is a user-written command, an "ado file" by J. Sorensen: stpiece.
net search stpiece
Install the file, then:
stpiece abitur, tp( ) tv(sex)
tp: time pieces, intervals; tv: covariates whose influence might vary over time pieces.
Background: in most applications of transition rate models, the assumption that the forces of change are constant over time is not theoretically justified. Most models deal with this by including time-dependent covariates. BUT sometimes you cannot include those variables, because you did not measure them, or for other reasons. ==> Split the time axis into time periods and assume that transition rates are constant in each of these intervals but can change between them. Hannah: check whether the install works in the classroom?
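The stpiece call with the time pieces filled in; a sketch: since the slide leaves tp( ) empty, the cut-points below are illustrative assumptions, not values from the workshop.

```stata
net search stpiece                       // locate and install the ado file
stpiece abitur, tp(15 20 25 30) tv(sex)  // tp(): interval boundaries (assumed)
* tv(): covariates whose effect may differ across the time pieces
```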

the end
