Panel Data Analysis
From a workshop at NSO, 22-26 September 2008
Outline
- What are panel data?
- Why use panel data?
- Handling panel data in Stata
- Describing panel data
- Within and between variation
- Unobservables
- Testing the FE and RE assumptions
What are panel data?
- Panel data are a form of longitudinal data, involving regularly repeated observations on the same individuals.
- Individuals may be people, households, firms, areas, etc.
- Repeat observations may be different time periods or units within clusters (e.g. workers within firms).
Why use panel data?
- Repeated observations on individuals allow for the possibility of isolating the effects of unobserved differences between individuals.
- We can study dynamics.
- The ability to make causal inferences is enhanced by temporal ordering.
- Some phenomena are inherently longitudinal (e.g. poverty persistence; unstable employment).
But don't expect too much
- Variation between people usually far exceeds variation over time for an individual, so a panel with T waves doesn't give T times the information of a cross-section.
- Variation over time may not exist for some important variables, or may be inflated by measurement error.
Some terminology
- A balanced panel has the same number of time observations ($T$) on each of the $n$ individuals.
- An unbalanced panel has a different number of time observations ($T_i$) on each individual.
- A compact panel covers only consecutive time periods for each individual: there are no gaps.
- Attrition is the process of drop-out of individuals from the panel, leading to an unbalanced and possibly non-compact panel.
- A short panel has a large number of individuals but few time observations on each (e.g. BHPS has 5,500 households and 15 waves).
- A long panel has a long run of time observations on each individual, permitting separate time-series analysis for each.
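The definitions above can be checked mechanically. The following is a minimal Python sketch, not part of the original workshop material; the function name `describe_panel` and the toy (pid, wave) pairs are assumptions made for illustration.

```python
from collections import defaultdict

def describe_panel(rows):
    """rows: iterable of (pid, wave) pairs. Returns (balanced, compact)."""
    waves = defaultdict(set)
    for pid, wave in rows:
        waves[pid].add(wave)
    # Balanced: every individual has the same number of time observations
    balanced = len({len(w) for w in waves.values()}) == 1
    # Compact: each individual's waves are consecutive, with no gaps
    compact = all(max(w) - min(w) + 1 == len(w) for w in waves.values())
    return balanced, compact

# Hypothetical toy panel
rows = [(1, 1), (1, 2), (1, 3),   # observed in every wave
        (2, 1), (2, 2),           # drops out after wave 2 (attrition)
        (3, 1), (3, 3)]           # misses wave 2 (a gap)
print(describe_panel(rows))
```

Person 2 makes the toy panel unbalanced and person 3 makes it non-compact, so both flags come back false.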
Handling panel data in Stata
For our purposes, the unit of analysis or case is either the person or the household:
- If case = person, the case contains information on the person's state, perhaps at different dates.
- If case = household, the case contains information on some or all household members (cross-sectional only!).
The data can be organized in two ways:
- Wide form: data are sometimes supplied in this format.
- Long form: usually the most convenient, and needed for most panel data commands in Stata.
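Wide file format
- One row per case.
- Observations on a variable for different time periods (or dates) are held in different columns.
- The variable name identifies the time period (via a prefix).
Columns: PID, awage (wage at W1), bwage (wage at W2), cwage (wage at W3). Some cells are missing. [The data rows of the example table are not preserved in this transcript.]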
Long file format
- Potentially multiple rows per case.
- Observations on a variable for different time periods (or dates) are held in extra rows for each individual.
- The case and time identifiers together identify each row (e.g. PID, wave).
Columns: PID, wave, wage. [The data rows of the example table are not preserved in this transcript.]
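To make the difference between the two layouts concrete, here is a small Python sketch of a wide-to-long conversion. It is an illustration only: the wage values are invented, the helper `wide_to_long` is hypothetical, and it assumes the wave letters a, b, c map to waves 1, 2, 3. Unlike a typical reshape, this sketch drops person-waves whose value is missing.

```python
# Hypothetical wide-format rows: one dict per person, wave letter as a prefix.
wide = [
    {"pid": 1, "awage": 7.5, "bwage": 8.0, "cwage": None},
    {"pid": 2, "awage": None, "bwage": 6.2, "cwage": 6.9},
]

def wide_to_long(rows, stub, prefixes="abc"):
    """Return one record per person-wave, skipping missing values."""
    long_rows = []
    for row in rows:
        for wave, prefix in enumerate(prefixes, start=1):
            value = row.get(prefix + stub)
            if value is not None:
                long_rows.append({"pid": row["pid"], "wave": wave, stub: value})
    return long_rows

long_rows = wide_to_long(wide, "wage")
for record in long_rows:
    print(record)
```

Each person now contributes one row per observed wave, identified by the pair (pid, wave), which is the layout the panel commands expect.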
Panel and time variables
Use tsset to tell Stata which are the panel and time variables:

    . tsset pid wave
           panel variable:  pid (unbalanced)
            time variable:  wave, 1 to 14, but with gaps

Note that tsset automatically sorts the data accordingly.
Describing panel data
Ways of describing/summarizing panel data:
- Basic patterns of available cases
- Between- and within-group components of variation
- Transition tables
Some basic notation:
- $y_{it}$ is the dependent variable to be analysed
- $i$ indexes the individual (pid), $i = 1, 2, \dots, n$
- $t$ indexes the repeated observation / time period (wave), $t = 1, 2, \dots, T_i$
The dependent variable $y_{it}$ may be:
- Continuous (e.g. wages)
- Mixed discrete/continuous (e.g. hours of work)
- Binary (e.g. employed/not employed)
- Ordered discrete (e.g. Likert scale for degree of happiness)
- Unordered discrete (e.g. occupation)
Describe patterns of panel data: xtdes

    . xtdes
         pid:  ..., 1.497e+08                      n = [lost]
        wave:  1, 2, ..., 14                       T = 14
               Delta(wave) = 1; (14-1)+1 = 14
               (pid*wave uniquely identifies each observation)

    Distribution of T_i:   min   5%   25%   50%   75%   95%   max

    Freq.  Percent  Cum. |  Pattern
    ...                  |  ...
    ...                  |  (other patterns)
    ...                  |  XXXXXXXXXXXXXX

[The numeric values and individual participation patterns are not preserved in this transcript.]
Describe the pattern of panel data: tabulate

    . tabulate wave
    wave | Freq.  Percent  Cum.

[The frequencies are truncated in this transcript: they begin with "9," for waves 1 to 4, 6 and 7, and with "8," for wave 5 and waves 8 to 14; the total begins with "124,".]
The number of observations declines across waves. This is consistent with attrition from the panel.
Between- and within-group variation: xtsum
The Stata command xtsum summarizes within and between variation, but it does not give an exact decomposition:
- It converts sums of squares to variances using different degrees of freedom, so they are not comparable.
- It reports the square roots (i.e. standard deviations) of these variances.
- The documentation is not very clear!
It is still useful as a good approximation.
Between- and within-group variation: xtsum

    . xtsum paygu
    Variable         |  Mean  Std. Dev.  Min  Max |  Observations
    paygu    overall |                            |  N =
             between |                            |  n =
             within  |                            |  T-bar =

    . display r(sd_w)
    . display r(sd)
    . display r(sd_w)^2 / r(sd)^2    // proportion of within variation
    . display r(sd_b)^2 / r(sd)^2    // proportion of between variation

[The numeric values are not preserved in this transcript.]
paygu (gross monthly earnings) varies more between people than it changes over time for the same people. This has implications for panel analysis, because we often rely on changes over time.
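For comparison with the approximate shares above, the sums of squares themselves do decompose exactly: total = between + within. The Python sketch below illustrates this on invented numbers; the function name `decompose` and the toy data are assumptions, and this is not the calculation xtsum performs.

```python
from collections import defaultdict

def decompose(rows):
    """rows: list of (pid, y). Returns (total_ss, between_ss, within_ss)."""
    groups = defaultdict(list)
    for pid, y in rows:
        groups[pid].append(y)
    n_obs = sum(len(v) for v in groups.values())
    grand = sum(sum(v) for v in groups.values()) / n_obs
    means = {pid: sum(v) / len(v) for pid, v in groups.items()}
    # Deviations of every observation from the grand mean
    total = sum((y - grand) ** 2 for v in groups.values() for y in v)
    # Deviations of individual means from the grand mean, weighted by T_i
    between = sum(len(v) * (means[pid] - grand) ** 2 for pid, v in groups.items())
    # Deviations of every observation from its own individual mean
    within = sum((y - means[pid]) ** 2 for pid, v in groups.items() for y in v)
    return total, between, within

# Hypothetical toy data: two people, unbalanced
rows = [(1, 10.0), (1, 12.0), (2, 20.0), (2, 22.0), (2, 24.0)]
total, between, within = decompose(rows)
print(between / total, within / total)   # exact shares, summing to one
```

In this toy example almost all of the variation is between the two people, which mirrors the pattern described for earnings.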
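Between- and within-group variation for a discrete variable: xttab
Example: part-time work = 30 hours or less per week.

    . xttab pt
              Overall             Between          Within
     pt |  Freq.  Percent     Freq.  Percent      Percent
    ----+------------------------------------------------
        |
        |
  Total |                     (n = 11056)

[The numeric values other than n are not preserved in this transcript.]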
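The three kinds of count in such a table can be sketched in Python as follows. This is a hedged illustration of the idea rather than Stata's exact algorithm: the function `xttab_like` and the toy part-time indicator are hypothetical, and the within figure is computed here as the pooled share of observations with the value among people who ever have it.

```python
from collections import defaultdict

def xttab_like(rows, value):
    """rows: list of (pid, x). Returns (overall_freq, between_freq, within_pct)."""
    by_pid = defaultdict(list)
    for pid, x in rows:
        by_pid[pid].append(x)
    # Overall: number of person-wave observations with the value
    overall_freq = sum(x == value for _, x in rows)
    # Between: number of people who ever have the value
    ever = [xs for xs in by_pid.values() if value in xs]
    between_freq = len(ever)
    # Within: among those people, the share of their observations with the value
    n_ever_obs = sum(len(xs) for xs in ever)
    within_pct = 100.0 * sum(xs.count(value) for xs in ever) / n_ever_obs if ever else 0.0
    return overall_freq, between_freq, within_pct

# Hypothetical part-time indicator (1 = part-time) for three people over three waves
rows = [(1, 1), (1, 1), (1, 1),
        (2, 0), (2, 0), (2, 1),
        (3, 0), (3, 0), (3, 0)]
print(xttab_like(rows, 1))
```

A within percentage close to 100 indicates a state that people rarely leave once they are in it; a low one indicates frequent movement in and out.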
Describing panel data: summary
- Panel data involve 2 dimensions, group (typically individual) and time. We need to examine variation along each dimension to get a feel for the data.
- To fully exploit panel data, we need enough within-group (cross-time) variation.
- We can evaluate the amount of within (and between) variation in different ways:
  - Continuous variables: between and within standard deviation (and variance) using xtsum
  - Categorical variables: between and within variation using xttab
  - Binary variables: simple sequence description if there are not too many waves
Some basic identification problems
1. Unobservable variables
- Can we identify the impact of unobservables?
- Can we distinguish the impact of unobservables from the impact of time-invariant observables?
2. Age, cohort and time effects: can they be distinguished?
- Behavior may change with age.
- Current behavior may be affected by experience in formative years.
- Time may affect behavior through the changing social environment.
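The difficulty with age, cohort and time effects can be written out in one line. The sketch below assumes that cohort is measured as year of birth and time as calendar year, and the coefficients $\gamma_a$, $\gamma_c$, $\gamma_p$ are hypothetical labels introduced only for this illustration.

```latex
% Age is an exact linear function of period and cohort:
\text{age}_{it} = \text{year}_t - \text{cohort}_i
% so a linear model containing all three collapses to two identifiable terms:
\gamma_a\,\text{age}_{it} + \gamma_c\,\text{cohort}_i + \gamma_p\,\text{year}_t
  = (\gamma_a + \gamma_p)\,\text{year}_t + (\gamma_c - \gamma_a)\,\text{cohort}_i
```

Only the two combinations on the right-hand side can be estimated; separating the three linear effects needs a further restriction.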
Identification of unobservables (1)
Example: wage models based on human capital theory:

$$y_{it} = z_i \alpha + x_{it} \beta + u_i + \varepsilon_{it}$$

where $i = 1, \dots, n$ and $t = 1, \dots, T_i$, and
- $y_{it}$ = log wage
- $z_i$ = observable time-invariant factors (e.g. sex, year of birth)
- $x_{it}$ = observable time-varying factors (e.g. job tenure)
- $u_i$ = unobservable ability (assumed not to change over time)
- $\varepsilon_{it}$ = luck
Can we identify the effect of $u_i$ if we can't observe it?
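Identification of unobservables (2)
The identification of the effect of $u_i$ rests on assumptions about the correlation structure of the compound residual $v_{it}$:

$$v_{it} = u_i + \varepsilon_{it}$$

If individuals have been sampled at random, there is no correlation across different individuals: for any two (different) sampled individuals $i$ and $j$,

$$\operatorname{cov}(u_i, u_j) = 0, \qquad \operatorname{cov}([\varepsilon_{i1} \dots \varepsilon_{iT_i}], [\varepsilon_{j1} \dots \varepsilon_{jT_j}]) = 0$$

But there may be some correlation over time for any individual: $\operatorname{cov}(v_{is}, v_{it}) \neq 0$ for two different periods $s \neq t$, since (with $u_i$ uncorrelated with $\varepsilon_{is}$ and $\varepsilon_{it}$)

$$\operatorname{cov}(v_{is}, v_{it}) = \operatorname{cov}(u_i + \varepsilon_{is}, u_i + \varepsilon_{it}) = \operatorname{var}(u_i) + \operatorname{cov}(\varepsilon_{is}, \varepsilon_{it})$$

If we assume $\operatorname{cov}(\varepsilon_{is}, \varepsilon_{it}) = 0$, then $u_i$ is the only source of correlation over time, so its variance can be identified from the correlation of the residuals.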
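A short worked step shows how the variance of $u_i$ is recovered from the residual correlation. It assumes, in addition to the slide's assumptions, that $u_i$ is uncorrelated with $\varepsilon_{it}$ and that both have constant variances, written here as $\sigma_u^2$ and $\sigma_\varepsilon^2$ (notation introduced for this sketch).

```latex
\operatorname{var}(v_{it}) = \sigma_u^2 + \sigma_\varepsilon^2,
\qquad
\operatorname{cov}(v_{is}, v_{it}) = \sigma_u^2 \quad (s \neq t)
% so the correlation between any two periods for the same individual is
\rho = \operatorname{corr}(v_{is}, v_{it})
     = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2}
```

This $\rho$ is the quantity that xtreg output labels rho, the fraction of variance due to u_i.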
Pooled regression for panel data
The standard panel data regression model is:

$$y_{it} = z_i \alpha + x_{it} \beta + u_i + \varepsilon_{it}$$

We have observations indexed by $t = 1, \dots, T_i$ and $i = 1, \dots, n$.
- A pooled regression of $y$ on $z$ and $x$ using all the data together would assume that there is no correlation across individuals, nor across time periods for any individual.
- This would ignore the individual effect $u_i$, which generates correlation between the values of $u_i + \varepsilon_{it}$ in different periods for each individual $i$.
- So pooled regression doesn't make the best use of the data.
- Under favorable conditions (if $u_i$ is uncorrelated with $z_i$ and $x_{it}$), pooled regression gives unbiased but inefficient results, with incorrect standard errors, t-ratios, etc.
- If $u_i$ is correlated with $z_i$ or $x_{it}$, pooled regression is also biased.
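The bias in the unfavorable case can be seen in a small simulation. The Python sketch below is an illustration under assumed values, not an analysis of the BHPS: it uses one time-varying regressor that is correlated with $u_i$, a true slope of 1, standard normal errors, and an arbitrary seed. It compares the pooled slope with the slope obtained after subtracting each individual's own means (the within transformation).

```python
import random

random.seed(12345)
BETA = 1.0          # true slope (assumed for the simulation)
N, T = 2000, 5      # individuals and waves (assumed)

data = []           # (pid, x, y), stored in pid order
for pid in range(N):
    u = random.gauss(0, 1)                  # unobserved individual effect
    for _ in range(T):
        x = u + random.gauss(0, 1)          # regressor correlated with u
        y = BETA * x + u + random.gauss(0, 1)
        data.append((pid, x, y))

def slope(pairs):
    """OLS slope of y on x (with intercept) for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

pooled = slope([(x, y) for _, x, y in data])

# Within transformation: subtract each individual's own means of x and y
demeaned = []
for pid in range(N):
    block = data[pid * T:(pid + 1) * T]
    mx = sum(x for _, x, _ in block) / T
    my = sum(y for _, _, y in block) / T
    demeaned += [(x - mx, y - my) for _, x, y in block]
within = slope(demeaned)

print(round(pooled, 2), round(within, 2))
```

Under this design the pooled slope converges to 1.5 rather than 1, because half of the variance of x comes from $u_i$, while the within slope stays close to the true value.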
Fixed effects or random effects?
Concepts and interpretation:
- If individuals are randomly sampled from the population, then $u_i$ is random.
- In practice, with randomly sampled data, the FE/RE choice is based on whether a further assumption holds: that $u_i$ is uncorrelated with the regressors,

$$E(u_i \mid z_i, X_i) = 0$$

- If this assumption holds, RE is appropriate; if it fails, RE is biased and FE should be used.
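Testing the hypothesis of uncorrelated effects
The random effects estimator (and any estimator that uses between-group variation) is only unbiased if the following hypothesis is true:

$$H_0: E(u_i \mid z_i, X_i) = 0$$

It is important to test $H_0$. There are various equivalent ways of doing so, including:
- Hausman test: is the difference between the within-group (FE) and random effects estimates large?
- Between-within comparison: is the difference between the between-group and within-group estimates large?
- Mundlak approach: estimate the model, augmented with the individual means of the time-varying regressors, by GLS and test $H_0$: the coefficients on those means = 0.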
BHPS example: feasible GLS estimates

    . xtreg lwage age cohort, re

    Random-effects GLS regression        Number of obs      =
    Group variable (i): pid              Number of groups   =
    R-sq:  within  =                     Obs per group: min = 1
           between =                                    avg = 5.9
           overall =                                    max = 14
    Random effects u_i ~ Gaussian        Wald chi2(2)       =
    corr(u_i, X) = 0 (assumed)           Prob > chi2        =

     lwage |  Coef.  Std. Err.  z  P>|z|  [95% Conf. Interval]
       age |
    cohort |
     _cons |
   sigma_u |
   sigma_e |
       rho |  (fraction of variance due to u_i)

[The numeric values are not preserved in this transcript.]
BHPS example: within-group estimates

    . xtreg lwage age cohort, fe

    Fixed-effects (within) regression    Number of obs      =
    Group variable (i): pid              Number of groups   =
    R-sq:  within  =                     Obs per group: min = 1
           between =                                    avg = 5.9
           overall =                                    max = 14
                                         F(1,49537)         =
    corr(u_i, Xb) =                      Prob > F           =

     lwage |  Coef.  Std. Err.  t  P>|t|  [95% Conf. Interval]
       age |
    cohort |  (dropped)
     _cons |
   sigma_u |
   sigma_e |
       rho |  (fraction of variance due to u_i)

    F test that all u_i=0:  F(10076, 49537) =        Prob > F =

[The numeric values are not preserved in this transcript.]
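The reason cohort is reported as dropped follows from the within transformation that the fixed-effects estimator applies. As a sketch in the notation of the model $y_{it} = z_i \alpha + x_{it} \beta + u_i + \varepsilon_{it}$, with bars denoting individual means over time (notation introduced here):

```latex
% Average the model over time for individual i:
\bar{y}_i = z_i \alpha + \bar{x}_i \beta + u_i + \bar{\varepsilon}_i
% Subtract the averaged equation from the original one:
y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)\,\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)
```

Both $u_i$ and every time-invariant regressor in $z_i$, such as cohort, cancel out, so $\alpha$ cannot be estimated from within-group variation alone.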
Example: BHPS Hausman test

    . hausman fixed random

             ---- Coefficients ----
         |   (b)      (B)      (b-B)       sqrt(diag(V_b-V_B))
         |  fixed    random   Difference        S.E.
     age |

    b = consistent under Ho and Ha; obtained from xtreg
    B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test: Ho: difference in coefficients not systematic
          chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B) =
          Prob>chi2 =

[The numeric values are not preserved in this transcript.]
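With a single coefficient, as here, the chi2(1) formula in the output reduces to a scalar calculation. The Python sketch below shows that arithmetic; the function `hausman_scalar` is hypothetical and the input numbers are invented for illustration, not the BHPS estimates.

```python
import math

def hausman_scalar(b_fe, se_fe, b_re, se_re):
    """Hausman chi2(1) statistic and p-value for a single coefficient."""
    var_diff = se_fe ** 2 - se_re ** 2          # V_b - V_B
    if var_diff <= 0:
        raise ValueError("V_b - V_B is not positive; the statistic is undefined")
    stat = (b_fe - b_re) ** 2 / var_diff        # (b-B)'[(V_b-V_B)^(-1)](b-B)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi2 with 1 df
    return stat, p_value

# Hypothetical inputs, NOT the BHPS estimates
stat, p = hausman_scalar(b_fe=0.030, se_fe=0.005, b_re=0.020, se_re=0.003)
print(round(stat, 2), round(p, 4))
```

A small p-value means the FE and RE estimates differ by more than sampling error would explain, which counts against the zero-correlation assumption and therefore against RE.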
Summary of random effects model
- Unlike a cross-sectional model, the RE model allows for an unobserved, time-invariant individual effect.
- The key assumption of the RE model is that the individual effect is uncorrelated with the regressors.
- We can test the key zero-correlation assumption using a Hausman or Mundlak test.
- RE is more efficient than FE because it uses between-group variation as well as within-group variation.