# Panel Data Analysis (Workshop at NSO, 22-26 September 2008)


## Outline

- What are panel data?
- Why use panel data?
- Handling panel data in Stata
- Describing panel data
- Within and between variation
- Unobservables
- Testing the FE and RE assumptions

## What are panel data?

- Panel data are a form of longitudinal data, involving regularly repeated observations on the same individuals.
- Individuals may be people, households, firms, areas, etc.
- Repeat observations may be different time periods, or units within clusters (e.g. workers within firms).

## Why use panel data?

- Repeated observations on individuals allow for the possibility of isolating the effects of unobserved differences between individuals.
- We can study dynamics.
- The ability to make causal inferences is enhanced by temporal ordering.
- Some phenomena are inherently longitudinal (e.g. poverty persistence; unstable employment).

## But don't expect too much

- Variation between people usually far exceeds variation over time for an individual.
- A panel with T waves doesn't give T times the information of a cross-section.
- Variation over time may not exist for some important variables, or may be inflated by measurement error.

## Some terminology

- A balanced panel has the same number of time observations (T) on each of the n individuals.
- An unbalanced panel has a different number of time observations (T_i) on each individual.
- A compact panel covers only consecutive time periods for each individual: there are no gaps.
- Attrition is the process of drop-out of individuals from the panel, leading to an unbalanced and possibly non-compact panel.
- A short panel has a large number of individuals but few time observations on each (e.g. the BHPS has 5,500 households and 15 waves).
- A long panel has a long run of time observations on each individual, permitting separate time-series analysis for each.
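These definitions can be made concrete with a small sketch. This is illustrative Python, not part of the workshop's Stata material, and the `classify_panel` helper and toy data are hypothetical:

```python
# Classify a panel as balanced and/or compact from each individual's
# observed time periods. Toy data below is hypothetical.

def classify_panel(waves_by_id):
    """Return (balanced, compact) flags for a panel.

    waves_by_id: dict mapping individual id -> sorted list of observed waves.
    balanced: every individual has the same number of observations T.
    compact:  each individual's waves are consecutive (no gaps).
    """
    counts = {len(w) for w in waves_by_id.values()}
    balanced = len(counts) == 1
    compact = all(w == list(range(w[0], w[0] + len(w)))
                  for w in waves_by_id.values())
    return balanced, compact

# A panel with attrition: individual 2 drops out early, individual 3 has a gap,
# so the panel is unbalanced and non-compact.
panel = {1: [1, 2, 3], 2: [1, 2], 3: [1, 3]}
print(classify_panel(panel))  # (False, False)
```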

## Handling panel data in Stata

For our purposes, the unit of analysis or case is either the person or the household:

- If case = person, the case contains information on the person's state, perhaps at different dates.
- If case = household, the case contains information on some or all household members (cross-sectional only!).

The data can be organized in two ways:

- Wide form: data is sometimes supplied in this format.
- Long form: usually most convenient, and needed for most panel data commands in Stata.

## Wide file format

- One row per case.
- Observations on a variable for different time periods (or dates) held in different columns.
- Variable name identifies time (via prefix).

| PID | awage (wage at W1) | bwage (wage at W2) | cwage (wage at W3) |
|-------|-----|---------|---------|
| 10001 | 7.2 | 7.5 | 7.7 |
| 10002 | 6.3 | missing | 6.3 |
| 10003 | 5.4 | 5.4 | missing |
| … | … | … | … |

## Long file format

- Potentially multiple rows per case, with observations on a variable for different time periods (or dates) held in extra rows for each individual.
- Case-row identifier identifies time (e.g. PID, wave).

| PID | wave | wage |
|-------|---|-----|
| 10001 | 1 | 7.2 |
| 10001 | 2 | 7.5 |
| 10001 | 3 | 7.7 |
| 10002 | 1 | 6.3 |
| 10002 | 3 | 6.3 |
| 10003 | 1 | 5.4 |
| 10003 | 2 | 5.4 |
| … | … | … |
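The wide-to-long conversion (Stata's `reshape long`) can be sketched in a few lines. This is an illustrative Python sketch using the toy wage records from the slides; the prefix-to-wave mapping follows the `awage`/`bwage`/`cwage` naming shown above:

```python
# Reshape the slide's wide records into long form: one (pid, wave, wage)
# row per non-missing observation. The a/b/c prefix encodes the wave.

wide = [
    {"pid": 10001, "awage": 7.2, "bwage": 7.5, "cwage": 7.7},
    {"pid": 10002, "awage": 6.3, "bwage": None, "cwage": 6.3},
    {"pid": 10003, "awage": 5.4, "bwage": 5.4, "cwage": None},
]

prefixes = {"a": 1, "b": 2, "c": 3}  # wave prefix -> wave number

long_rows = []
for row in wide:
    for prefix, wave in prefixes.items():
        wage = row[prefix + "wage"]
        if wage is not None:  # long form simply omits missing cells
            long_rows.append({"pid": row["pid"], "wave": wave, "wage": wage})

for r in long_rows:
    print(r)
```

Note how the two missing cells in the wide table become absent rows in long form, matching the long-format slide.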

## Panel and time variables

Use `tsset` to tell Stata which are the panel and time variables:

```
. tsset pid wave
       panel variable: pid (unbalanced)
        time variable: wave, 1 to 14, but with gaps
```

Note that `tsset` automatically sorts the data accordingly.

## Describing panel data

Ways of describing/summarizing panel data:

- Basic patterns of available cases
- Between- and within-group components of variation
- Transition tables

Some basic notation:

- y_it is the dependent variable to be analysed
- i indexes the individual (pid), i = 1, 2, …, n
- t indexes the repeated observation / time period (wave), t = 1, 2, …, T_i

## Dependent variable

y_it may be:

- Continuous (e.g. wages)
- Mixed discrete/continuous (e.g. hours of work)
- Binary (e.g. employed/not employed)
- Ordered discrete (e.g. Likert scale for degree of happiness)
- Unordered discrete (e.g. occupation)

## Describe patterns of panel data: xtdes

```
. xtdes
     pid:  10002251, 10004491, ..., 1.497e+08        n =    16442
    wave:  1, 2, ..., 14                             T =       14
           Delta(wave) = 1; (14-1)+1 = 14
           (pid*wave uniquely identifies each observation)

Distribution of T_i:   min   5%   25%   50%   75%   95%   max
                         1    1     2     7    14    14    14

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+----------------
     4410     26.82   26.82 |  11111111111111
      995      6.05   32.87 |  1.............
      646      3.93   36.80 |  11............
      ...       ...     ... |  ..............
       35      0.21   84.69 |  .........111..
       33      0.20   84.89 |  1.1...........
     2485     15.11  100.00 |  (other patterns)
 ---------------------------+----------------
    16442    100.00         |  XXXXXXXXXXXXXX
```
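The participation patterns in the `xtdes` output ("1" = observed in that wave, "." = not observed) are easy to reconstruct by hand. A minimal Python sketch, where the `pattern` helper is hypothetical:

```python
# Rebuild xtdes-style participation patterns from each individual's
# set of observed waves: "1" = observed, "." = not observed.

T = 14  # number of waves in this panel

def pattern(waves, T=T):
    observed = set(waves)
    return "".join("1" if t in observed else "." for t in range(1, T + 1))

print(pattern(range(1, 15)))  # 11111111111111  (present in all 14 waves)
print(pattern([1]))           # 1.............  (wave 1 only)
print(pattern([1, 3]))        # 1.1...........  (waves 1 and 3)
```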

## Describe the pattern of panel data: tabulate

```
. tabulate wave

       wave |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      9,912        7.97        7.97
          2 |      9,459        7.61       15.58
          3 |      9,024        7.26       22.84
          4 |      9,060        7.29       30.13
          5 |      8,827        7.10       37.23
          6 |      9,137        7.35       44.58
          7 |      9,118        7.33       51.91
          8 |      8,940        7.19       59.11
          9 |      8,820        7.09       66.20
         10 |      8,701        7.00       73.20
         11 |      8,590        6.91       80.11
         12 |      8,383        6.74       86.85
         13 |      8,264        6.65       93.50
         14 |      8,080        6.50      100.00
------------+-----------------------------------
      Total |    124,315      100.00
```

The number of observations declines across waves. This is consistent with attrition from the panel.

## Between- and within-group variation

The Stata command `xtsum` summarizes within and between variation. But it does not give an exact decomposition:

- It converts sums of squares to variances using different degrees of freedom, so they are not comparable.
- It reports the square root (i.e. the standard deviation) of these variances.
- The documentation is not very clear!

Still, it is useful as a good approximation.

## Between- and within-group variation: xtsum

```
. xtsum paygu

Variable         |      Mean   Std. Dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
paygu    overall |  1224.762    1054.031  .0833333   72055.43 |     N =   67666
         between |              812.5707  8.666667      11323 |     n =   11149
         within  |              640.9227 -7782.167   64965.64 | T-bar = 6.06924

. display r(sd_w)
640.92268

. display r(sd)
1054.031

. display r(sd_w)^2 / r(sd)^2    // proportion of within variation
.36974691

. display r(sd_b)^2 / r(sd)^2    // proportion of between variation
.59431354
```

paygu (gross monthly earnings) varies more between people than it changes over time for the same person. This has implications for panel analysis, because we often rely on changes over time. Note that the two proportions do not sum exactly to one, reflecting the degrees-of-freedom issue noted above.
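The decomposition that `xtsum` approximates can be computed exactly on a toy panel: the total sum of squares around the grand mean splits into a between part (group means vs the grand mean) and a within part (observations vs their group mean). A minimal Python sketch with hypothetical data, using a single divisor N rather than xtsum's differing degrees of freedom:

```python
# Exact within/between decomposition of a toy panel variable:
# total SS = between SS + within SS (around the grand mean).

from statistics import mean

# Hypothetical toy panel: id -> observations over time.
data = {1: [10.0, 12.0, 14.0], 2: [20.0, 20.0, 20.0], 3: [5.0, 7.0]}

obs = [(i, y) for i, ys in data.items() for y in ys]
N = len(obs)
grand = mean(y for _, y in obs)
gmeans = {i: mean(ys) for i, ys in data.items()}

total_ss = sum((y - grand) ** 2 for _, y in obs)
within_ss = sum((y - gmeans[i]) ** 2 for i, y in obs)
between_ss = sum(len(ys) * (gmeans[i] - grand) ** 2 for i, ys in data.items())

print(total_ss == within_ss + between_ss)  # exact with a common divisor
print(within_ss / total_ss)                # share of within-group variation
```

Individual 2 never changes, so it contributes nothing to the within sum of squares, mirroring the point that variation between people can dwarf variation over time.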

## Between- and within-group variation for a discrete variable: xttab

Example: part-time work = 30 hours or less per week.

```
. xttab pt

                  Overall             Between            Within
       pt |    Freq.  Percent      Freq.  Percent       Percent
----------+----------------------------------------------------
        0 |    48119    72.55       8820    79.78         83.77
        1 |    18204    27.45       5027    45.47         57.14
----------+----------------------------------------------------
    Total |    66323   100.00      13847   125.24         74.10
                                   (n = 11056)
```

## Describing panel data: summary

- Panel data involve two dimensions, group (typically individual) and time. We need to examine variation along each dimension to get a feel for the data.
- To fully exploit panel data, we need enough within-group (cross-time) variation.
- We can evaluate the amount of within (and between) variation in different ways:
  - Continuous variables: between and within standard deviation (and variance) using xtsum
  - Categorical variables: between and within variation using xttab
  - Binary variables: simple sequence description if there are not too many waves

## Some basic identification problems

1. Unobservable variables
   - Can we identify the impact of unobservables?
   - Can we distinguish the impact of unobservables from the impact of time-invariant observables?
2. Age, cohort and time effects: can they be distinguished?
   - Behavior may change with age.
   - Current behavior may be affected by experience in formative years.
   - Time may affect behavior through a changing social environment.

## Identification of unobservables (1)

Example: wage models based on human capital theory:

y_it = z_i α + x_it β + u_i + ε_it,   where i = 1, …, n and t = 1, …, T_i

- y_it = log wage
- z_i = observable time-invariant factors (e.g. sex, year of birth)
- x_it = observable time-varying factors (e.g. job tenure)
- u_i = unobservable ability (assumed not to change over time)
- ε_it = luck

Can we identify the effect of u_i if we can't observe it?

## Identification of unobservables (2)

The identification of the effect of u_i rests on assumptions about the correlation structure of the compound residual v_it = u_i + ε_it.

If individuals have been sampled at random, there is no correlation across different individuals:

- cov(u_i, u_j) = 0
- cov([ε_i1 … ε_iT], [ε_j1 … ε_jT]) = 0

for any two (different) sampled individuals i and j. But there may be some correlation over time for any one individual: cov(v_is, v_it) ≠ 0 for two different periods s ≠ t, since:

cov(v_is, v_it) = cov(u_i + ε_is, u_i + ε_it) = var(u_i) + cov(ε_is, ε_it)

If we assume cov(ε_is, ε_it) = 0, then u_i is the only source of correlation over time, so its variance can be identified from the correlation of the residuals.
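The key step, cov(v_is, v_it) = var(u_i) when the ε draws are independent over time, can be checked by simulation. An illustrative Python sketch with synthetic data (the sample size and variances are arbitrary choices):

```python
# Simulate v_it = u_i + eps_it for two periods s != t and check that the
# covariance across individuals is approximately var(u_i), since the
# eps draws are independent over time.

import random

random.seed(42)
n = 200_000
sd_u, sd_eps = 1.0, 0.5  # so var(u_i) = 1.0

u = [random.gauss(0, sd_u) for _ in range(n)]
v1 = [ui + random.gauss(0, sd_eps) for ui in u]  # period s
v2 = [ui + random.gauss(0, sd_eps) for ui in u]  # period t

m1, m2 = sum(v1) / n, sum(v2) / n
cov = sum((a - m1) * (b - m2) for a, b in zip(v1, v2)) / n
print(round(cov, 2))  # close to var(u_i) = 1.0
```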

## Pooled regression for panel data

The standard panel data regression model is:

y_it = z_i α + x_it β + u_i + ε_it

We have observations indexed by t = 1, …, T_i and i = 1, …, n.

- A pooled regression of y on z and x, using all the data together, would assume that there is no correlation across individuals, nor across time periods for any individual.
- This would ignore the individual effect u_i, which generates correlation between the values of (u_i + ε_i1, …, u_i + ε_iT) for each individual i.
- So pooled regression doesn't make best use of the data.
- Under favorable conditions (if u_i is uncorrelated with z_i and x_it), pooled regression gives unbiased but inefficient results, with incorrect standard errors, t-ratios, etc.
- If u_i is correlated with z_i or x_it, pooled regression is also biased.

## Fixed effects or random effects? Concepts and interpretation

If individuals are randomly sampled from the population, then u_i is random. In practice, with randomly sampled data, the FE/RE choice is based on whether a further assumption holds: that u_i is uncorrelated with the regressors:

E(u_i | z_i, X_i) = 0

## Testing the hypothesis of uncorrelated effects

The random effects estimator (and any estimator that uses between-group variation) is only unbiased if the hypothesis H_0: E(u_i | z_i, X_i) = 0 is true. It is important to test H_0. There are various equivalent ways of doing so, including:

- Hausman test: is the difference between the FE and RE estimates large?
- Between-within comparison: is the difference between the between-group and within-group estimates large?
- Mundlak approach: estimate the model by GLS with the group means of the regressors added, and test H_0: their coefficients = 0.

## BHPS example: feasible GLS estimates

```
. xtreg lwage age cohort, re

Random-effects GLS regression                   Number of obs      =     59615
Group variable (i): pid                         Number of groups   =     10077

R-sq:  within  = 0.1296                         Obs per group: min =         1
       between = 0.0589                                        avg =       5.9
       overall = 0.0503                                        max =        14

Random effects u_i ~ Gaussian                   Wald chi2(2)       =   7967.85
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0305788   .0003524    86.78   0.000     .0298882    .0312694
      cohort |   .0183379   .0004847    37.84   0.000      .017388    .0192879
       _cons |  -35.09007   .9586169   -36.60   0.000    -36.96892   -33.21121
-------------+----------------------------------------------------------------
     sigma_u |  .48687179
     sigma_e |  .28128391
         rho |  .74974873   (fraction of variance due to u_i)
------------------------------------------------------------------------------
```

## BHPS example: within-group estimates

```
. xtreg lwage age cohort, fe

Fixed-effects (within) regression               Number of obs      =     59615
Group variable (i): pid                         Number of groups   =     10077

R-sq:  within  = 0.1296                         Obs per group: min =         1
       between = 0.0543                                        avg =       5.9
       overall = 0.0363                                        max =        14

                                                F(1,49537)         =   7377.78
corr(u_i, Xb)  = -0.4386                        Prob > F           =    0.0000

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0308941   .0003597    85.89   0.000     .0301892    .0315991
      cohort |  (dropped)
       _cons |   .8987139   .0135417    66.37   0.000     .8721721    .9252558
-------------+----------------------------------------------------------------
     sigma_u |  .57521051
     sigma_e |  .28128107
         rho |  .80702022   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(10076, 49537) =    18.00    Prob > F = 0.0000
```

## BHPS example: Hausman test

```
. hausman fixed random

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |     fixed        random       Difference          S.E.
-------------+----------------------------------------------------------------
         age |    .0308941     .0305788        .0003153        .0000722
------------------------------------------------------------------------------
             b = consistent under Ho and Ha; obtained from xtreg
             B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic

                  chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =    19.08
                Prob>chi2 =      0.0000
```
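With a single tested coefficient, the Hausman statistic reduces to a scalar: (b − B)² / (V_b − V_B). Plugging in the rounded figures from the output above reproduces the reported chi-squared statistic (small rounding differences aside):

```python
# Hausman statistic for one coefficient: (b - B)^2 / (V_b - V_B),
# using the rounded values printed in the output above.

b_fe, b_re = 0.0308941, 0.0305788  # FE and RE coefficients on age
se_diff = 0.0000722                # sqrt(V_b - V_B) from the output

stat = ((b_fe - b_re) / se_diff) ** 2
print(round(stat, 1))  # about 19.1, vs the reported 19.08 (rounding)
```

The statistic is large relative to a chi2(1) distribution, so the zero-correlation hypothesis is rejected, consistent with Prob>chi2 = 0.0000.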

## Summary of the random effects model

- Unlike a cross-sectional model, the RE model allows for an unobserved, time-invariant individual effect.
- The key assumption of the RE model is that the individual effect is uncorrelated with the regressors.
- We can test the key zero-correlation assumption using a Hausman or Mundlak test.
- RE is more efficient than FE because it uses between-group variation as well as within-group variation.
