Nonparametric estimation of non-response distribution in the Israeli Social Survey. Yury Gubman, Dmitri Romanov. JSM 2009, Washington DC, 4/8/2009.
2 Outline
1. Missing data generating mechanisms
2. Sharp bounds for conditional mean
3. Tests for MCAR and MAR assumptions
4. Empirical results - Israeli Social Survey 2006
5. Conclusions
3 Missing data generating mechanisms
Missing Completely At Random (MCAR): the non-respondents' data are ignorable. With z = 1 for respondents and z = 0 for non-respondents:
(1) P(y | z = 1) = P(y | z = 0)
Missing At Random (MAR): conditionally on some set of covariates, the non-respondents' data are ignorable.
1. The non-respondents' data provide no additional information about the conditional distribution of y:
(2) P(y | X, z = 1) = P(y | X, z = 0)
2. Given a set of survey design covariates X, the probability that y is missing does not depend on y:
(3) P(z = 0 | X, y) = P(z = 0 | X)
The MAR assumption cannot be tested statistically using the survey data alone, because the non-respondents' data are not available. We overcome this difficulty by conditioning on fully known administrative covariates of census type.
4 Sharp bounds for conditional mean
Let z = 1 for interview respondents and 0 otherwise. Let w = 1 for item respondents and 0 otherwise. The notation also distinguishes the survey strata, a survey variable y, and a covariate x.
By the Law of Iterated Expectations: (4)
By Bayes' theorem: (5)
Using covariates from administrative sources of census type allows us to assume:
1. There is no item non-response on the covariates x.
2. P(x), the overall population distribution of x, is known.
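The expressions labelled (4) and (5) follow from two textbook identities. A sketch of their standard form in the notation of this slide is given below; it assumes conditioning on the covariate x alone (the strata index is left out) and is not necessarily the exact expression shown in the talk.

```latex
\begin{align}
E[y \mid x] &= E[y \mid x, z=1]\,P(z=1 \mid x) + E[y \mid x, z=0]\,P(z=0 \mid x), \\
E[y \mid x, z=1] &= E[y \mid x, z=1, w=1]\,P(w=1 \mid x, z=1)
                  + E[y \mid x, z=1, w=0]\,P(w=0 \mid x, z=1), \\
P(z=1 \mid x) &= \frac{P(x \mid z=1)\,P(z=1)}{P(x)}.
\end{align}
```

The terms involving z = 0 or w = 0 are the ones the survey cannot identify, while P(x) in the last line is the quantity assumed known from the administrative source.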
5 Sharp bounds for conditional mean
In addition: (6)
Combine (4), (5) and (6). The data reveal nothing at all about the conditional means of y for interview non-respondents (z = 0) and for item non-respondents (z = 1, w = 0). The lower and upper bounds are obtained by minimizing and maximizing, respectively, the resulting expression over all unobserved values. The minimum and maximum exist because all survey variables are bounded.
6 Sharp bounds for conditional mean
For the full survey data, the width of the interval between the bounds reflects both item and survey non-response.
7 Sharp bounds for conditional mean
For item non-response analysis, only the interview respondents' data are used. In this case the formula for the sharp bounds simplifies, and the width of the interval between the bounds reflects item non-response only. Nothing was assumed about the true missing data generating mechanism.
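The bounding idea can be illustrated with a minimal sketch of the standard worst-case bounds for a bounded variable within one covariate cell. The function name `worst_case_bounds` and the toy data are hypothetical, and the observation probability is estimated from the sample rather than through the known P(x), so this is a simplification and not the authors' formula.

```python
import numpy as np

def worst_case_bounds(y, observed, y_min, y_max):
    """Worst-case bounds on the mean of y within one covariate cell.

    y            : outcome values; entries where `observed` is False are ignored
    observed     : boolean mask, True where y was actually reported
    y_min, y_max : known logical range of the bounded survey variable
    """
    y = np.asarray(y, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    p_obs = observed.mean()  # sample share of units with y observed
    mean_obs = y[observed].mean() if observed.any() else 0.0
    # Unobserved values are set to the extremes of the range.
    lower = mean_obs * p_obs + y_min * (1.0 - p_obs)
    upper = mean_obs * p_obs + y_max * (1.0 - p_obs)
    return lower, upper

# Toy data (assumed): a 0/1 survey variable for five sampled units.
y = np.array([1.0, 0.0, 1.0, np.nan, np.nan])
z = np.array([1, 1, 1, 1, 0])  # interview response indicator
w = np.array([1, 1, 1, 0, 0])  # item response indicator

# Full survey data: y is observed only when z = 1 and w = 1.
full = worst_case_bounds(y, (z == 1) & (w == 1), 0.0, 1.0)

# Item non-response only: restrict to interview respondents.
resp = z == 1
item_only = worst_case_bounds(y[resp], w[resp] == 1, 0.0, 1.0)

print("full survey bounds:", full)
print("item-only bounds:  ", item_only)
```

In this form the width of each interval equals the range of y times the share of unobserved units, which is why the full-data interval is wider than the item-only one.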
8 Testing MCAR and MAR
Two statistics are defined from the estimated sharp bounds, each with an explicit expression. Both are asymptotically normal and do not depend on the sample size. Their standard deviations can be estimated using the bootstrap. A t-test for equal means of two populations with unknown and unequal variances is used to check the null hypothesis in the following cases.
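A minimal sketch of such a comparison follows. It assumes the statistic is the width of the worst-case bounds in a cell (range of y times the non-response rate), estimates its standard deviation by a simple bootstrap, and replaces the t reference distribution with a normal approximation. The names `bound_width`, `bootstrap_sd` and `compare_widths` are hypothetical, and the exact statistics used in the talk may differ.

```python
import math
import numpy as np

def bound_width(observed, y_min, y_max):
    """Width of the worst-case bounds in one cell: range times non-response rate."""
    observed = np.asarray(observed, dtype=bool)
    return (y_max - y_min) * (1.0 - observed.mean())

def bootstrap_sd(observed, y_min, y_max, n_boot=2000, seed=0):
    """Bootstrap standard deviation of the width (resampling units with replacement)."""
    rng = np.random.default_rng(seed)
    observed = np.asarray(observed, dtype=bool)
    widths = np.empty(n_boot)
    for b in range(n_boot):
        resample = rng.choice(observed, size=observed.size, replace=True)
        widths[b] = bound_width(resample, y_min, y_max)
    return widths.std(ddof=1)

def compare_widths(obs_i, obs_j, y_min, y_max, n_boot=2000, seed=0):
    """Unequal-variance comparison of two widths.

    Returns the test statistic and a two-sided p-value from the normal
    approximation. Assumes at least one bootstrap SD is non-zero.
    """
    w_i = bound_width(obs_i, y_min, y_max)
    w_j = bound_width(obs_j, y_min, y_max)
    s_i = bootstrap_sd(obs_i, y_min, y_max, n_boot, seed)
    s_j = bootstrap_sd(obs_j, y_min, y_max, n_boot, seed + 1)
    t_stat = (w_i - w_j) / math.sqrt(s_i**2 + s_j**2)
    p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
    return t_stat, p_value

# Toy cells (assumed): 10 percent versus 40 percent non-response.
cell_a = np.array([True] * 180 + [False] * 20)
cell_b = np.array([True] * 120 + [False] * 80)
t_stat, p_value = compare_widths(cell_a, cell_b, 0.0, 1.0)
print(t_stat, p_value)
```

A small p-value indicates that the two cells differ in their non-response rates, which is the kind of evidence the following slides use against MCAR and MAR.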
9 Testing MCAR
The null hypothesis H0 is formulated for overall non-response and, separately, for item non-response, and is checked for pairs of covariate values indexed by i and j. If H0 is rejected for some i, j, we conclude that the probability of being a non-respondent depends on x. In particular, the MCAR assumption is violated.
10 Testing MAR (1)
Consider a variable from the administrative database that is strongly correlated with a key survey variable y and/or with the survey topic. Such a variable may be treated as a survey variable, and it is known for all sampled units. Under MAR, the respondents' data are sufficient to estimate the conditional population distributions of all survey variables, and in particular of this administrative variable. The null hypothesis states that the conditional distribution of this variable is the same for respondents and non-respondents. If the null hypothesis is rejected, the survey non-response distribution depends on the survey topic and/or on a survey variable, which contradicts MAR.
11 Testing MAR (2)
Use the MAR definition. Let X be a set of survey design covariates that "controls" the bias in survey variables due to non-response. Consider an additional categorical covariate that is fully observed and orthogonal to X. Assuming MAR and conditional on X, the survey non-response rates should be independent of this covariate. H0 for testing the MAR assumption is formulated for overall non-response and, separately, for item non-response. Rejection of H0 means that, conditionally on the set of survey design covariates X, the MAR assumption is violated. If the additional covariate is strongly correlated with some survey variable, rejection of H0 means that non-response depends on the survey variable and/or the survey topic.
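The quantity this test examines can be tabulated with a short sketch: non-response rates by category of the additional covariate within each design cell X. The function name `nonresponse_rates`, the record layout and the toy data are assumptions for illustration only.

```python
from collections import defaultdict

def nonresponse_rates(records):
    """Non-response rate by design cell and covariate category.

    records : iterable of (x_cell, category, responded) tuples
    returns : {x_cell: {category: non-response rate}}
    """
    # counts[x_cell][category] = [number of non-respondents, number of units]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for x_cell, category, responded in records:
        cell = counts[x_cell][category]
        cell[0] += 0 if responded else 1
        cell[1] += 1
    return {x: {c: nr / n for c, (nr, n) in cats.items()}
            for x, cats in counts.items()}

# Toy records (assumed): design cell, covariate category, response flag.
records = [
    ("s1", "a", True), ("s1", "a", False),
    ("s1", "b", True), ("s1", "b", True),
    ("s2", "a", False),
]
rates = nonresponse_rates(records)
print(rates)
```

Under MAR given X, the rates within each design cell should agree across categories up to sampling error; a formal comparison would apply a test such as the one described on slide 8 to each pair of categories.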
12 Israeli Social Survey 2006
The Israeli Social Survey (ISS) has been conducted annually since 2002 on a sample of persons aged 20 and older. The main purpose of the ISS is to provide up-to-date information on the welfare of Israelis and on their living conditions. The ISS is the first survey conducted by ICBS using the Population Register as a sampling frame. The sample size in 2006 was 9,499 persons. Of these, 562 persons did not belong to the sampling frame (deceased or abroad for over a year), so the final sample included 8,937 persons. 1,648 persons did not respond to the survey (18.4 percent of the final sample).
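The sample counts on this slide can be checked with a few lines of arithmetic; the variable names are illustrative.

```python
initial_sample = 9499   # persons drawn from the Population Register
out_of_frame = 562      # deceased or abroad for over a year
non_respondents = 1648  # did not respond to the survey

final_sample = initial_sample - out_of_frame
nonresponse_rate = non_respondents / final_sample

print(final_sample)                       # 8937
print(round(100 * nonresponse_rate, 1))   # 18.4
```

For a survey variable rescaled to the range [0, 1], an interview non-response rate of this size alone would contribute about 0.18 to the width of worst-case bounds on the overall mean.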
13 Israeli Social Survey 2006
Four key ISS variables were chosen:
1. Worked last week (4 categories): no item non-response;
2. Optimism – general (3 categories): item non-response rate is 11.0 percent;
3. Gross salary from all places of work (10 categories): item non-response rate is 5.3 percent;
4. Degree of religiosity – Jews (5 categories): item non-response rate is 0.5 percent.
We use three administrative covariates that are highly correlated with the survey topic and with some important survey variables:
1. Reported income from work (Tax Authority);
2. Work status (Tax Authority);
3. Degree of religiosity of Jews (derived from the educational databases).
14 Empirical results – testing covariates' conditional distributions
Respondents' and non-respondents' distributions differ significantly.
15 Empirical results: testing MAR for interview non-response
H0 is rejected (p-value < 0.01) for the three covariates.
16 Empirical results: testing MAR for item non-response
H0 is rejected (p-value < 0.01) for the three covariates.
17 Conclusions
We propose nonparametric statistical tests for checking the validity of the MCAR and MAR assumptions, where the test statistics are based on the width of the interval between the estimated sharp bounds for the conditional mean.
Significant departures from the MAR assumption were found in the ISS 2006 data.
Non-response propensity varies significantly between population groups assumed to be homogeneous according to the survey design.
The ISS survey design can be improved using available administrative covariates, such as income, labor market status, and degree of religiosity of Jews.