SC968 Panel data methods for sociologists Lecture 2, part 1 Introducing panel data
Overview. Panel data: what it is and how to get to know the data. Change over time: tabulating and calculating transition probabilities.
What is panel data? A data set containing observations on multiple phenomena observed at a single point in time is called cross-sectional data. A data set containing observations on a single phenomenon observed over multiple time periods is called time series data. Observations on multiple phenomena over multiple time periods are panel data. Cross-sectional and time series data are one-dimensional; panel data are two-dimensional. Panel data can be used to answer both longitudinal and cross-sectional questions!
Using panel data in Stata. Data on n cases over t time periods give a total of n × t observations. There is one record per observation, i.e. long format. Stata tools for analyzing panel data begin with the prefix xt. You first need to tell Stata that you have panel data, using xtset.
Telling Stata you have panel data
. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit
pid is the unique cross-wave identifier and wave is the time variable. "(unbalanced)" means that cases are not observed for every time period. delta is the period between observations, in units of the time variable.
Transition probability matrices in Stata. If you leave out the if statement, you get the mean transition probabilities across all waves, from t to t+1.
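Change in a categorical variable over time: a decision tree. The figure is not reproduced in the transcript; the values that survive are listed below (columns empl, unemp, olf; each row sums to 1).
empl  unemp  olf
0.90  0.03   0.07
0.91  0.03   0.06
0.26  0.49   0.25
0.10  0.03   0.87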
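As a rough illustration of how such wave-to-wave transition probabilities can be computed from long-format data, here is a minimal Python sketch. The records are hypothetical (not the slide's data) and the function name is made up for this example. Stata's xttrans command produces a comparable table, although the slides do not name the command they used.

```python
from collections import Counter, defaultdict

# Hypothetical long-format records: (pid, wave, state). Not taken from the slides.
records = [
    (1, 1, "empl"), (1, 2, "empl"), (1, 3, "unemp"),
    (2, 1, "unemp"), (2, 2, "empl"), (2, 4, "empl"),  # wave 3 is missing for pid 2
    (3, 1, "olf"), (3, 2, "olf"), (3, 3, "empl"),
]


def transition_probabilities(rows):
    """Row-normalised counts of moves from the state at wave t to the state at wave t+1."""
    state_at = {(pid, wave): state for pid, wave, state in rows}
    counts = defaultdict(Counter)
    for (pid, wave), origin in state_at.items():
        destination = state_at.get((pid, wave + 1))
        if destination is not None:  # skip gaps and each person's last wave
            counts[origin][destination] += 1
    return {
        origin: {dest: n / sum(dests.values()) for dest, n in dests.items()}
        for origin, dests in counts.items()
    }


probs = transition_probabilities(records)
for origin, row in probs.items():
    print(origin, row)
```

Only pairs of consecutive waves contribute, so the gap at wave 3 for person 2 drops one potential transition rather than being treated as a move from wave 2 to wave 4.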
Change in a continuous variable over time: size transition matrix, quantile transition matrix, mean transition matrix, median transition matrix.
Size transition matrix. Measures absolute mobility, e.g. movement in and out of poverty. Boundaries are set exogenously, i.e. predetermined: for example, poverty defined a priori as an income below £5,000. The boundaries do not depend on the distribution under investigation, so comparing mobility in, say, the 1990s and the 2000s incorporates both movements in the positions of individuals and economic growth.
Quantile transition matrix. Treats mobility as a relative concept. There is the same number of individuals in each class, so the matrix only records movements that involve re-ranking. It cannot take account of economic growth, for example, when comparing matrices, and so cannot draw a complete picture when comparing mobility in different cohorts, countries or welfare regimes.
Mean/median transition matrices. Both absolute and relative approaches are incorporated into the matrices. Class boundaries are defined as percentages of the mean or median income of the origin and destination distributions, for example 25%, 50% and 75% of median income. Note that this is not the same as quartiles.
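To make the difference between the three ways of drawing class boundaries concrete, the following Python sketch classifies one hypothetical income distribution three times. The incomes and helper names are assumptions made for illustration; only the £5,000 poverty line and the 25%/50%/75% of median cut-offs come from the slides.

```python
from statistics import median

# Hypothetical incomes for eight people in one wave (not from the slides).
incomes = [4000, 6000, 8000, 10000, 12000, 20000, 40000, 60000]


def classify(value, boundaries):
    """Index of the class a value falls in (0 = lowest), given ascending boundaries."""
    return sum(value >= b for b in boundaries)


def quartile_classes(values):
    """Equal-sized classes based on rank only (0 = bottom quarter)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    classes = [0] * len(values)
    for rank, i in enumerate(order):
        classes[i] = rank * 4 // len(values)
    return classes


# Size matrix: boundary fixed in advance (the slides' example poverty line).
size_classes = [classify(v, [5000]) for v in incomes]

# Median matrix: boundaries at 25%, 50% and 75% of median income.
med = median(incomes)
median_classes = [classify(v, [0.25 * med, 0.50 * med, 0.75 * med]) for v in incomes]

# Quantile matrix: same number of people in every class.
quantile_classes = quartile_classes(incomes)

print(size_classes)
print(median_classes)
print(quantile_classes)
```

A transition matrix would then cross-tabulate the class in the origin wave against the class in the destination wave. The median-based classes are clearly unequal in size here, which is the sense in which they are not quartiles.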
Warning! Measurement error causes an over-estimation of mobility. If mothers' and babies' weights are reported to the nearest half pound, this can affect which band an observation falls in. A respondent may describe their marital status as separated in year 1 and single in year 2.
Finally: panel data pose greater challenges for understanding and checking the data. Transition matrices are a good way to summarise mobility patterns. Different methods of constructing matrices lead to distinct interpretations. You may need to take account of measurement error when modelling change.
SC968 Panel data methods for sociologists Lecture 2, part 2 Concepts for panel data analysis
Overview. Types of questions and types of variables: time-invariant, time-varying and trend. Between- and within-individual variation. The concept of individual heterogeneity. From OLS to models that allow causal interpretations: fixed effects and random effects models. The basics of implementing these models in Stata.
Types of variable
Those which vary between individuals but hardly ever over time: sex; ethnicity; parents' social class when you were 14; the type of primary school you attended (once you've become an adult).
Those which vary over time, but not between individuals: the retail price index; national unemployment rates; age, in a cohort study.
Those which vary both over time and between individuals: income; health; psychological wellbeing; number of children you have; marital status.
Trend variables, which vary between individuals and over time, but in highly predictable ways: age; year.
Between- and within-individual variation. If you have a sample with repeated observations on the same individuals, there are two sources of variance within the sample: the fact that individuals are systematically different from one another (between-individual variation), and the fact that individuals' behaviour varies between observations over time (within-individual variation). Total variation is the sum, over all individuals and years, of the square of the difference between each observation of x and the overall mean. Within variation is the sum of the squared differences between each individual's observations and his or her own mean. Between variation is the sum of squares of differences between individual means and the whole-sample mean. Remember: from the variation you get to the variance (divide by the degrees of freedom), and from the variance to the standard deviation (take the square root).
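The three definitions above can be checked numerically. The sketch below uses the x1 values of the small three-person, six-wave toy panel that appears later in this lecture; the variable names are chosen for this example only. The final identity holds as written because the panel is balanced with T = 6 waves per person.

```python
import math
from statistics import mean

# x1 by person: 3 people observed in 6 waves (toy panel used later in the lecture).
x = {
    1: [0, 5, 10, 15, 20, 25],
    2: [5, 10, 15, 20, 25, 30],
    3: [10, 15, 20, 25, 30, 35],
}

grand_mean = mean(v for xs in x.values() for v in xs)
person_mean = {pid: mean(xs) for pid, xs in x.items()}

# Sums of squares, following the slide's verbal definitions.
total = sum((v - grand_mean) ** 2 for xs in x.values() for v in xs)
within = sum((v - person_mean[pid]) ** 2 for pid, xs in x.items() for v in xs)
between = sum((m - grand_mean) ** 2 for m in person_mean.values())

n_obs = sum(len(xs) for xs in x.values())
overall_sd = math.sqrt(total / (n_obs - 1))  # variation -> variance -> SD

print(total, within, between, round(overall_sd, 2))
```

In a balanced panel, total variation equals within variation plus T times the between variation as defined here, because each individual mean stands in for T observations.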
xtsum in Stata. Similar to the ordinary sum command. Annotations on the output: for one variable, all variation is between; for another, all variation is within, because a balanced sample has been chosen; for the partner variable, most variation is between, because it's fairly rare to switch between having and not having a partner.
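More on xtsum. The output reports the number of observations with a non-missing value of the variable, the number of individuals, and the average number of time-points per individual. In the between row, min and max refer to the individual means $\bar{x}_i$. In the within row, min and max refer to individuals' deviations from their own averages, with the global average added back in.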
The xttab command. For simplicity, the jbstat values of missing, maternity leave, government training and other are omitted. Overall: the pooled sample, broken down by person-years. Between: the number of people who spent any time in this state. Within: of those who spent any time in this state, the proportion of their time (on average) that they spent in it.
Which statistical model for panel data? Your research question will guide which models are most suitable, but the nature of your data is also important. Example questions: What is the effect on income of having more children? What is the difference in income between individuals who have a different number of children? What is the difference in income before and after the birth of a child? What is the difference in income between men and women, and before and after the birth of a child? How does income change in the time leading up to the birth of a child? (Survival analysis: later in this course!) Is your research question cross-sectional, longitudinal, or both? Cross-sectional questions exploit variation between individuals. Longitudinal questions exploit variation within individuals over time, permit a causal interpretation of effects, and can consider between variation if needed.
Longitudinal analysis is concerned with modelling individual heterogeneity. A very simple concept: people are different! In social science, when we talk about heterogeneity, we are really talking about unobservable (or unobserved) heterogeneity. Observed heterogeneity: differences in education levels, or parental background, or anything else that we can measure and control for in regressions. Unobserved heterogeneity: anything which is fundamentally unmeasurable, or which is rather poorly measured, or which does not happen to be measured in the particular data set we are using. With panel data we can do something about unobserved heterogeneity, as we can differentiate between person-level unobserved x that are identical over time and those that vary over time!
OLS with panel data
pid  wave     y   x1
  1     1  2340    0
  1     2  2405    5
  1     3  2730   10
  1     4  3250   15
  1     5  3705   20
  1     6  4030   25
  2     1  1885    5
  2     2  2145   10
  2     3  2275   15
  2     4  2470   20
  2     5  2762   25
  2     6  3120   30
  3     1   780   10
  3     2  1170   15
  3     3  1365   20
  3     4  2405   25
  3     5  2405   30
  3     6  2470   35
OLS at t=1: y = 2448 - 156*x1. Pooled OLS: y = 1925 + 29*x1.
The cross-sectional estimate may be quite misleading (omitted variable bias)! By adding more data points from the same units at different points in time we can get better estimates. But the assumptions of OLS may be violated!
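The two regression lines quoted on this slide can be reproduced from the toy data with a few lines of Python. This is only a sketch of one-regressor least squares using the textbook formulas; the helper name simple_ols is made up for this example.

```python
# Toy panel from the slide: (pid, wave, y, x1).
panel = [
    (1, 1, 2340, 0), (1, 2, 2405, 5), (1, 3, 2730, 10),
    (1, 4, 3250, 15), (1, 5, 3705, 20), (1, 6, 4030, 25),
    (2, 1, 1885, 5), (2, 2, 2145, 10), (2, 3, 2275, 15),
    (2, 4, 2470, 20), (2, 5, 2762, 25), (2, 6, 3120, 30),
    (3, 1, 780, 10), (3, 2, 1170, 15), (3, 3, 1365, 20),
    (3, 4, 2405, 25), (3, 5, 2405, 30), (3, 6, 2470, 35),
]


def simple_ols(xs, ys):
    """Return (intercept, slope) of a least-squares fit of y on a single x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Cross-section: wave 1 only.
wave1 = [row for row in panel if row[1] == 1]
a_wave1, b_wave1 = simple_ols([r[3] for r in wave1], [r[2] for r in wave1])

# Pooled: all person-wave observations treated as independent.
a_pooled, b_pooled = simple_ols([r[3] for r in panel], [r[2] for r in panel])

print(round(a_wave1), round(b_wave1))
print(round(a_pooled), round(b_pooled))
```

The sign of the slope flips between the two fits, which is the slide's point: the wave-1 cross-section mostly picks up differences between people.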
An illustration of how unobserved heterogeneity matters. Considering that this is from panel data, two problems become apparent: the error terms for persons 1, 2 and 3 differ systematically, and the association between x and y appears to be biased. Panel data allow you to break down the error term ($w_i$) into two components, the unobservable characteristics of the person ($u_i$) and genuine error ($e_i$), and then to model $u_i$ and $e_i$.
Expanding the OLS model to consider unobserved heterogeneity. Analytically, think of splitting the error term into its two components: $u_i$, which is individual-specific and fixed over time, and $\varepsilon_{it}$, which varies over time and to which the usual assumptions apply (mean zero, homoscedastic, uncorrelated with x or u or itself). Then consider that you have repeated observations over time, and either reduce the complexity of the information available in some way or add further assumptions. Your options: focus on between variation and lose information on within variation; focus on within variation and lose information on between variation; or model both types of variation, making further assumptions.
Within and between estimators. The error term has two parts: $u_i$, individual-specific and fixed over time, and $\varepsilon_{it}$, which varies over time and to which the usual assumptions apply (mean zero, homoscedastic, uncorrelated with x or u or itself). Not interested in within variation? Use the means of all observations for each person i: this is the between estimator. Not interested in between variation? Then why not remove it: this is the within estimator (fixed effects). Interested in both? Then treat $\bar{x}_i$ as an imperfect measure of the person fixed effect and use between variation where within variation is poorly captured. θ measures the weight given to between-group variation, and is derived from the variances of $u_i$ and $\varepsilon_{it}$.
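The equations shown on this slide did not survive in the transcript. As a hedged reconstruction in standard textbook notation (the coefficient symbol β and the single regressor x are assumptions; α, $u_i$, ε and θ follow the slide text), the model and the three transformations can be written as:

```latex
\begin{align}
y_{it} &= \alpha + \beta x_{it} + u_i + \varepsilon_{it}
  && \text{(model with split error term)} \\
\bar{y}_i &= \alpha + \beta \bar{x}_i + u_i + \bar{\varepsilon}_i
  && \text{(between estimator)} \\
y_{it} - \bar{y}_i &= \beta (x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)
  && \text{(within estimator, fixed effects)} \\
y_{it} - \theta \bar{y}_i &= (1-\theta)\alpha + \beta (x_{it} - \theta \bar{x}_i)
  + (1-\theta) u_i + (\varepsilon_{it} - \theta \bar{\varepsilon}_i)
  && \text{(random effects)}
\end{align}
```

Subtracting the person mean removes $u_i$ entirely in the within equation, whereas the random effects transformation removes only a fraction θ of it.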
Between estimator. Interpret as how much y changes between different people. Not much used, except to calculate the θ parameter for random effects, but Stata does this, not you! It's inefficient compared to random effects: it doesn't use as much information as is available in the data (it only uses means). Assumption required: that $u_i$ is uncorrelated with $x_i$. It is easy to see why: if they were correlated, how could one decide how much of the variation in y to attribute to the x's (via the betas) as opposed to the correlation? It can't estimate the effects of variables whose mean is invariant over individuals, e.g. age in a cohort study or macro-level variables.
Focusing on within variation: the fixed effects family
Fixed effects estimator: xtreg y x, fe. Basic idea: for each individual, calculate the mean of x and the mean of y. Then run OLS on a transformed dataset where each $y_{it}$ is replaced by $(y_{it} - \bar{y}_i)$ and each $x_{it}$ is replaced by $(x_{it} - \bar{x}_i)$.
Identical to Least Squares Dummy Variables regression: areg y x, absorb(pid). Include a dummy indicator for each individual; all individual-level differences, including the individual-specific error term $u_i$, will then be captured in the person-specific intercept.
Members of the same family, which you may come across in the literature:
First differences: regress D.(y x). For each individual, and each time period's y and x, calculate the difference between the value in this period and that in the last period. Then run OLS on a transformed dataset where each $y_{it}$ is replaced by $(y_{it} - y_{i,t-1})$ and each $x_{it}$ is replaced by $(x_{it} - x_{i,t-1})$.
Hybrid models: regress y x mean_x z. Run standard OLS but add the person-level mean of each time-varying variable as an additional regressor.
Fixed effects estimator. Fixed effects: y = 65*x1. It ignores between-group variation, so it's an inefficient estimator. However, few assumptions are required for FE to be consistent: $u_i$ is allowed to correlate with $x_i$. Disadvantage: it can't estimate the effects of any time-invariant variables. You also need to consider the change in the interpretation of effects.
pid  wave     y   x1  mean y  mean x1  y - mean y  x1 - mean x1
  1     1  2340    0  3076.7     12.5      -736.7         -12.5
  1     2  2405    5  3076.7     12.5      -671.7          -7.5
  1     3  2730   10  3076.7     12.5      -346.7          -2.5
  1     4  3250   15  3076.7     12.5       173.3           2.5
  1     5  3705   20  3076.7     12.5       628.3           7.5
  1     6  4030   25  3076.7     12.5       953.3          12.5
  2     1  1885    5  2442.8     17.5      -557.8         -12.5
  2     2  2145   10  2442.8     17.5      -297.8          -7.5
  2     3  2275   15  2442.8     17.5      -167.8          -2.5
  2     4  2470   20  2442.8     17.5        27.2           2.5
  2     5  2762   25  2442.8     17.5       319.2           7.5
  2     6  3120   30  2442.8     17.5       677.2          12.5
  3     1   780   10  1765.8     22.5      -985.8         -12.5
  3     2  1170   15  1765.8     22.5      -595.8          -7.5
  3     3  1365   20  1765.8     22.5      -400.8          -2.5
  3     4  2405   25  1765.8     22.5       639.2           2.5
  3     5  2405   30  1765.8     22.5       639.2           7.5
  3     6  2470   35  1765.8     22.5       704.2          12.5
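The within transformation in this table, and the slope of about 65 quoted on the slide, can be reproduced with the sketch below. It demeans y and x1 within each person and regresses one on the other without a constant. It illustrates the idea only and is not a substitute for xtreg, fe, which also reports standard errors and fit statistics.

```python
from collections import defaultdict

# Toy panel from the slide: (pid, wave, y, x1).
panel = [
    (1, 1, 2340, 0), (1, 2, 2405, 5), (1, 3, 2730, 10),
    (1, 4, 3250, 15), (1, 5, 3705, 20), (1, 6, 4030, 25),
    (2, 1, 1885, 5), (2, 2, 2145, 10), (2, 3, 2275, 15),
    (2, 4, 2470, 20), (2, 5, 2762, 25), (2, 6, 3120, 30),
    (3, 1, 780, 10), (3, 2, 1170, 15), (3, 3, 1365, 20),
    (3, 4, 2405, 25), (3, 5, 2405, 30), (3, 6, 2470, 35),
]

# Group the observations by person.
by_person = defaultdict(list)
for pid, wave, y, x1 in panel:
    by_person[pid].append((y, x1))

# Within transformation: subtract each person's own mean from y and from x1.
demeaned = []
for pid, obs in by_person.items():
    y_mean = sum(y for y, _ in obs) / len(obs)
    x_mean = sum(x for _, x in obs) / len(obs)
    demeaned.extend((y - y_mean, x - x_mean) for y, x in obs)

# OLS through the origin on the demeaned data gives the fixed effects slope.
fe_slope = sum(yd * xd for yd, xd in demeaned) / sum(xd * xd for _, xd in demeaned)

print(round(fe_slope, 2))
```

No intercept is needed because the demeaned variables have mean zero within every person by construction.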
Want to look at the effect of a non-time-varying x? Use $x_{it}$ and $\bar{x}_i$ in OLS: the effect of any unobserved characteristic otherwise transported in the effect of $x_{it}$ is shifted to the effect of $\bar{x}_i$. The coefficient on $x_{it}$ approximates the coefficient in the FE model, and the coefficient on $z_i$ gives you, approximately, the OLS estimate for non-time-varying variables. $z_i$: non-time-varying individual characteristics, for which you do not need to include group means.
pid  wave     y  x  z  x_bar
  1     1  2340  1  1   1.5
  1     2  2405  2  1   1.5
  1     3  2730  2  1   1.5
  1     4  3250  2  1   1.5
  1     5  3705  1  1   1.5
  1     6  4030  1  1   1.5
  2     1  1885  0  2   0.66
  2     2  2145  1  2   0.66
  2     3  2275  1  2   0.66
  2     4  2470  1  2   0.66
  2     5  2762  1  2   0.66
  2     6  3120  0  2   0.66
  3     1   780  1  2   0.33
  3     2  1170  1  2   0.33
  3     3  1365  0  2   0.33
  3     4  2405  0  2   0.33
  3     5  2405  0  2   0.33
  3     6  2470  0  2   0.33
Disadvantage: you can only control for unobserved heterogeneity associated with the observed time-varying variables $x_i$. Hint: create $\bar{x}_i$ yourself. There is typically no interest in the effect of $\bar{x}_i$, so no need to worry about its interpretation. Note that … is approximately equal to the effect in the pooled OLS.
Random effects estimator. Uses both within- and between-group variation, so it makes best use of the data and is efficient. It starts from the idea that using $\bar{x}_i$ is not the best we can do to capture within variation: the more imprecise the estimate of the person-level variation (as measured by the person's $\bar{x}_i$), the more we should draw on the information from other units ($\bar{x}$). Assumption required: that $u_i$ is uncorrelated with $x_i$. This is a rather heroic assumption (think of examples); we will see a test for it later. Note that the within and between effects are constrained to be identical (much more like OLS in this respect, so no causal interpretation!). E.g., when you include a location indicator in your model, you are saying that the effect on y of moving to a new town is the same as the effect on y of living in different towns. When you include a female dummy, you are saying that the effect of being female on y is the same as the effect on y of changing gender. The random effects model here is estimated by generalised least squares (GLS).
Estimating fixed effects in Stata. Annotations on the output: u and e are the two parts of the error term; peaks at age 48; R-square-like statistic. (Talk about xtmixed.)
Between regression: not much used, but useful for comparing coefficients with fixed effects. The coefficient on partner was negative and significant in the FE model. In FE, the partner coefficient really measures the events of gaining or losing a partner.
Random effects regression. The option theta gives a summary of the weights. θ tells you how good an approximation $\bar{x}_i$ is of the person-level effect, or how much of the within variation is used to determine the effect size: θ = 0 gives the OLS estimator and θ = 1 gives the FE estimator.
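To see why θ = 0 corresponds to pooled OLS and θ = 1 to fixed effects, the sketch below uses the standard textbook formula for θ in a balanced panel, θ = 1 - sqrt(σ²_e / (T·σ²_u + σ²_e)). The slides do not state this formula, so treat it as an assumption of the example. The function names and the variance values are hypothetical.

```python
import math


def re_theta(sigma_u2, sigma_e2, n_periods):
    """Random effects weight for a balanced panel (standard textbook formula)."""
    return 1 - math.sqrt(sigma_e2 / (n_periods * sigma_u2 + sigma_e2))


def quasi_demean(values, theta):
    """Subtract theta times the person mean from each of one person's observations."""
    person_mean = sum(values) / len(values)
    return [v - theta * person_mean for v in values]


# No person-level variance: theta = 0, data are left untouched (pooled OLS).
print(re_theta(0.0, 4.0, 4), quasi_demean([1.0, 2.0, 3.0], 0.0))

# Hypothetical variances giving an intermediate weight.
print(re_theta(3.0, 4.0, 4))

# theta = 1: full demeaning, i.e. the within (fixed effects) transformation.
print(quasi_demean([1.0, 2.0, 3.0], 1.0))
```

θ moves towards 1 as the person-level variance or the number of waves grows, so random effects estimates approach fixed effects estimates in long panels.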
And what about OLS? OLS simply treats within- and between-group variation as the same: it pools data across waves.
Test whether pooling data is valid. If the $u_i$ do not vary between individuals, they can be treated as part of α and OLS is fine. Breusch-Pagan Lagrange multiplier test: $H_0$: variance of $u_i$ = 0; $H_1$: variance of $u_i$ not equal to zero. If $H_0$ is not rejected, you can pool the data and use OLS. This is a post-estimation test, run after random effects.
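For intuition about what the test measures, here is a minimal sketch of the Breusch-Pagan LM statistic in its usual balanced-panel form, LM = nT / (2(T-1)) · [Σ_i (Σ_t e_it)² / Σ_i Σ_t e_it² - 1]², computed from pooled OLS residuals. The formula is the standard textbook one rather than something given on the slide, the residuals are hypothetical, and the function name is made up. In Stata the test is normally obtained with xttest0 after xtreg, re.

```python
def breusch_pagan_lm(residuals_by_person):
    """LM statistic from pooled OLS residuals; this sketch assumes a balanced panel."""
    lengths = {len(r) for r in residuals_by_person.values()}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes the same number of waves per person")
    n = len(residuals_by_person)
    t = lengths.pop()
    # Squared within-person residual totals: large when residuals cluster by person.
    sum_sq_person_totals = sum(sum(r) ** 2 for r in residuals_by_person.values())
    sum_sq = sum(e ** 2 for r in residuals_by_person.values() for e in r)
    return n * t / (2 * (t - 1)) * (sum_sq_person_totals / sum_sq - 1) ** 2


# Hypothetical residuals for two people over two waves.
lm_a = breusch_pagan_lm({1: [1.0, -1.0], 2: [1.0, -1.0]})
lm_b = breusch_pagan_lm({1: [1.0, 0.0], 2: [-1.0, 0.0]})
print(lm_a, lm_b)
```

Under $H_0$ the statistic is compared with a chi-squared distribution with one degree of freedom (5% critical value about 3.84). The statistic moves away from zero when the ratio of squared person totals to squared residuals departs from 1.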
Comparing models. Compare coefficients between models: they are reasonably similar, with differences in the partner and badhealth coefficients. The R-squareds are similar. The within and between estimators maximise the within and between R-squared respectively.