Joint UNECE/Eurostat Meeting on Population and Housing Censuses (13-15 May 2008)
Sample results expected accuracy in the Italian Population and Housing Census
Giancarlo Carbonetti, Marco Fortini
Istat – Italian National Statistical Institute, General Censuses Directorate
May 13th 2008

Outline
- Introduction
- Some aspects related to the use of samples of households for long form enumeration
- Sampling strategies
- Simulation study
- Some results
- Conclusions

Introduction - 1
Main critical issue of the last Census
- Huge organizational (and economic) effort by the Municipal Census Offices:
  - sudden, time-concentrated increase in workload for the largest municipalities
  - massive network of enumerators and coordinators to be trained and managed
  - lack of adequately skilled resources, high turnover rates
Main objectives for the next Census
- to improve the efficiency of census operations
- to reduce the municipalities' workload
- to keep a high level of quality

Introduction - 2
Innovations proposed to reach the objectives
- the use of population registers
- mail-out of census forms
- mixed-mode data collection, mainly based on mail and web
Expected consequences of the innovations
- an increase in “back office” work
- a reduction in the number of enumerators (“front office” work)
How can response rates be increased?
- A proposal: a “short form” version of the questionnaire is being considered in order to reach high response rates.

Introduction - 3
Consequences of the use of the short form
- increased response rates
- response delays reduced as much as possible
This approach risks information loss!
How to preserve the richness of the census information
- by selecting a sample of households to which a “long form” version of the questionnaire is supplied
Strategy: the simultaneous use of short and long forms.

Some aspects related to the use of samples of households for long form enumeration - 1
Which type of information can be surveyed by means of a sample of long forms, and which must be collected on the whole population?
The overall set of census variables is partitioned into two subsets:
- the demographic variables (gender, date of birth, marital status, nationality, …)
- the remaining variables (educational level, occupational status, commuting)
The short form covers only the first set of variables, whereas the long form covers the whole set.

Some aspects related to the use of samples of households for long form enumeration - 2
What is the municipality population threshold below which the sampling strategy cannot be adopted?
An option under consideration is to sample in municipalities with more than 5,000 inhabitants:
- long forms will be submitted to a sample of households
- short forms will be administered to the remaining households
In municipalities with fewer than 5,000 inhabitants:
- long forms will be submitted to the whole population

Some aspects related to the use of samples of households for long form enumeration - 3
Which domains have to be considered to plan the sample and to produce accurate estimates?
New “census domains” have been defined
- an appropriate methodology was adopted to build census domains by aggregating the smallest census areas
- the new “areas” refer to the sub-municipal level
Accuracy of sampling estimates at different territorial levels
- similar precision is expected for estimates across areas
- higher precision is expected for larger territorial levels (from sub-municipal to nationwide)

Some aspects related to the use of samples of households for long form enumeration - 4
Which statistical methodology gives the most accurate estimates?
… in terms of …
- sampling design
- use of appropriate lists
- efficient estimation methods
- sampling error assessment
Answering this question is the aim of the study, some results of which are presented here.

Sampling strategies
Two different sampling designs have been tested
- Simple Random Sampling of HOUseholds (SRSHOU) from administrative registers managed by municipalities
- Area frame sampling based on a Simple Random Sampling of ENumeration Areas (SRSENA), which implies complete data collection for the households living in the selected enumeration areas (from the Digital Geocoded Database)
Different studies have been conducted
- to compare the two approaches (with a sampling ratio of about one third of the population considered)
- to evaluate, for SRSHOU, the improvement in the precision of the estimates as the sampling ratio increases (10%, 15%, 20%, 33%)
- to introduce some stratification of the units involved

Simulation study - 1
Main features of the sampling designs
- Domains: the “new areas”, i.e. sub-municipal districts
- Target variables: variables obtained by cross-classifying educational level, employment status and commuting with demographic variables
- Sampling units: households or enumeration areas
- Estimator: calibrated estimators, whose final weights are adjusted to make the sample more representative
The sampling strategies were compared through Monte Carlo sampling replications (carried out on 2001 Italian Census data) in order to assess the sampling error, measured by the coefficient of variation (CV) of the sampling estimates.

Simulation study - 2
Because of the strong differences among Italian municipalities, 40 of them, with different population sizes and from different regions of Italy, were considered.

Geographical area   Classes of population size (a)                         Total
                    10,000-20,000   20,000-100,000   more than 100,000
North                           4                6                   6        16
Center                          2                3                   3         8
South                           4                6                   6        16
Total                          10               15                  15        40

(a) Legal (official) population from the 2001 Population Census.

Simulation study - 3
Number of units involved in the simulation study

Units                 Sampled       Universe        %
Areas                     497       3,347 (*)   14.85%
Enumeration areas      30,890        382,534     8.08%
Households          2,243,511     21,810,676    10.29%
Individuals         5,537,582     56,594,021     9.78%

(*) Estimated number

[Figure: scatter plot of CV against p (estimates) for each census area. SRSHOU design (sampling ratio = 33%). City of Perugia.]

Distribution of median CV by class of p for the SRSHOU and SRSENA designs (both with sampling ratio = 33%). Comparison of 4 municipalities.

Classes of p    Milano (111 areas)   Bologna (32 areas)   Padova (18 areas)    Livorno (13 areas)
                SRSHOU    SRSENA     SRSHOU    SRSENA     SRSHOU    SRSENA     SRSHOU    SRSENA
< 0.05%          97.78     94.12      96.52     94.31      99.65     98.34     102.21    101.61
0.05%├0.1%       51.61     51.59      50.67     49.54      51.70     54.13      50.69     52.06
0.1%├0.25%       34.67     34.92      35.00     35.20      35.37     36.03      35.08     35.67
0.25%├0.5%       22.96     24.38      24.17     24.73      25.58     26.45      23.70     24.37
0.5%├1%          16.86     18.71      16.85     18.32      16.95     17.81      17.16     18.72
1%├2.5%          10.61     12.21      10.74     11.95      11.07     12.00      11.34     12.90
2.5%├5%           7.02      8.53       7.07      8.25       7.35      8.48       7.17      9.00
5%├10%            4.84      5.97       4.75      5.74       5.05      5.85       4.88      6.39
10%├15%           3.17      4.41       3.09      4.09       3.19      4.37       3.06      4.82
15%├20%           2.44      3.46       2.38      3.12       2.44      3.14       2.44      3.39
20%├30%           1.89      2.61       1.92      2.48       2.08      2.73       2.05      2.88
≥ 30%             1.35      1.78       1.32      1.60       1.39      1.72       1.40      2.00

Note: the differences between the two designs are due to the cluster effect.

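The cluster effect can be illustrated with a small simulation. The sketch below is not the authors' code: it builds a purely hypothetical population in which households in the same enumeration area share a local rate, then compares the CV of a simple random sample of households (SRSHOU-like) with that of a simple random sample of whole areas (SRSENA-like). The function names, the population sizes and the uniform local rates are assumptions chosen for illustration, and the synthetic clustering is much stronger than in the census data above.

```python
import random
import statistics


def build_areas(n_areas, households_per_area, rng):
    """Hypothetical enumeration areas: households in one area share a local rate."""
    areas = []
    for _ in range(n_areas):
        local_rate = rng.uniform(0.0, 0.4)  # assumed spread between areas
        areas.append([rng.random() < local_rate for _ in range(households_per_area)])
    return areas


def cv_percent(estimates):
    """Empirical coefficient of variation (%) over the replicated estimates."""
    return 100 * statistics.stdev(estimates) / statistics.fmean(estimates)


def compare_designs(areas, sampling_ratio, replications, rng):
    """Return (CV under household sampling, CV under whole-area sampling)."""
    households = [h for area in areas for h in area]
    n_households = round(len(households) * sampling_ratio)
    n_areas = round(len(areas) * sampling_ratio)
    srshou, srsena = [], []
    for _ in range(replications):
        # SRSHOU-like: simple random sample of households from the full list.
        sample = rng.sample(households, n_households)
        srshou.append(sum(sample) / len(sample))
        # SRSENA-like: simple random sample of areas, all their households taken.
        chosen = rng.sample(areas, n_areas)
        taken = [h for area in chosen for h in area]
        srsena.append(sum(taken) / len(taken))
    return cv_percent(srshou), cv_percent(srsena)


if __name__ == "__main__":
    rng = random.Random(1)
    areas = build_areas(200, 50, rng)
    cv_hou, cv_ena = compare_designs(areas, 1 / 3, 300, rng)
    print(f"CV household sampling: {cv_hou:.2f}%  CV area sampling: {cv_ena:.2f}%")
```

With equal sampling ratios both designs observe about the same number of households, but the area design observes far fewer independent units, which is why its CV is larger whenever households within an area resemble each other.
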
Loss of efficiency (in terms of CV, by class of p) of estimation with the SRSENA design with respect to the SRSHOU design (both with sampling ratio = 33%). Comparison of 4 municipalities.

Classes of p    Milano (111 areas)   Bologna (32 areas)   Padova (18 areas)   Livorno (13 areas)
< 0.05%                       3.65                 2.21                1.31                 0.60
0.05%├0.1%                    0.03                 1.13               -2.43                -1.37
0.1%├0.25%                   -0.25                -0.20               -0.66                -0.59
0.25%├0.5%                   -1.42                -0.56               -0.87                -0.68
0.5%├1%                      -1.85                -1.47               -0.87                -1.56
1%├2.5%                      -1.60                -1.22               -0.93                -1.56
2.5%├5%                      -1.51                -1.18               -1.13                -1.83
5%├10%                       -1.13                -0.99               -0.80                -1.51
10%├15%                      -1.24                -1.01               -1.18                -1.76
15%├20%                      -1.02                -0.73               -0.70                -0.95
20%├30%                      -0.72                -0.56               -0.65                -0.82
≥ 30%                        -0.43                -0.29               -0.33                -0.60

Values are computed as CV(SRSHOU, s.r. 33%) - CV(SRSENA, s.r. 33%); a negative value means that SRSENA has the larger CV.

Distribution of median CV by class of p. Comparison of 4 different sampling ratios with the SRSHOU design.

Classes of p    s.r. = 10%    s.r. = 15%    s.r. = 20%    s.r. = 33%
                 170 areas     140 areas     111 areas     204 areas
< 0.05%             220.51        157.20        142.00         98.21
0.05%├0.1%          111.48         87.22         74.20         51.14
0.1%├0.25%           75.57         59.83         49.97         34.76
0.25%├0.5%           50.70         39.92         33.97         23.44
0.5%├1%              35.54         28.10         23.74         16.56
1%├2.5%              23.62         18.56         15.33         10.68
2.5%├5%              15.50         12.29         10.09          7.04
5%├10%               10.46          8.26          6.93          4.82
10%├15%               7.06          5.40          4.40          3.13
15%├20%               5.57          4.27          3.54          2.42
20%├30%               4.50          3.48          2.84          1.93
≥ 30%                 3.20          2.42          1.94          1.34

Gain of efficiency (in terms of CV, by class of p) of estimation with the SRSHOU design when the sampling ratio (s.r.) is increased from 10% up to 33%.

Classes of p    s.r. from 10% to 15%   s.r. from 10% to 20%   s.r. from 10% to 33%
< 0.05%                        28.71                  35.60                  55.46
0.05%├0.1%                     21.76                  33.44                  54.13
0.1%├0.25%                     20.83                  33.88                  54.00
0.25%├0.5%                     21.26                  33.00                  53.77
0.5%├1%                        20.93                  33.20                  53.40
1%├2.5%                        21.42                  35.10                  54.78
2.5%├5%                        20.71                  34.90                  54.58
5%├10%                         21.03                  33.75                  53.92
10%├15%                        23.51                  37.68                  55.67
15%├20%                        23.34                  36.45                  56.55
20%├30%                        22.67                  36.89                  57.11
≥ 30%                          24.38                  39.38                  58.13

Gain (%) = [CV(SRSHOU, s.r. 10%) - CV(SRSHOU, s.r. N%)] x 100 / CV(SRSHOU, s.r. 10%)
- from 10% to 15%: gain between 21 and 23 percent
- from 10% to 20%: gain between 33 and 38 percent
- from 10% to 33%: gain between 53 and 58 percent

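As a rough plausibility check on these gains, the same gain formula can be applied to the textbook CV of a proportion under simple random sampling without replacement, which for a fixed population and a fixed p is proportional to sqrt((1 - f) / f), where f is the sampling ratio. This is an assumption introduced here for illustration: it ignores calibration and the fact that the simulated areas differ between sampling ratios, and the function names are hypothetical.

```python
import math


def relative_cv(f):
    """CV of an SRS proportion estimate, up to a factor that does not depend on f."""
    return math.sqrt((1 - f) / f)


def gain_percent(f_from, f_to):
    """Gain formula from the slide, applied to the theoretical CVs."""
    return 100 * (relative_cv(f_from) - relative_cv(f_to)) / relative_cv(f_from)


for f in (0.15, 0.20, 0.33):
    print(f"s.r. from 10% to {f:.0%}: theoretical gain {gain_percent(0.10, f):.1f}%")
```

Under these assumptions the theoretical gains are roughly 21, 33 and 53 percent, which sit at or just below the lower end of the ranges observed in the simulation.
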
Distribution of median CV for five classes of p and three classes of area (according to population size). Comparison of 4 different sampling ratios with the SRSHOU design.

Classes of p    Population by area    Sampling ratio
                (thousands)              10%      15%      20%      33%
0.1%├0.25%      < 10                   90.00    71.27    59.48    40.20
                10├12                  76.23    60.03    50.45    34.41
                ≥ 12                   66.65    53.04    43.51    30.58
0.5%├1%         < 10                   43.11    33.50    28.97    19.53
                10├12                  35.08    27.46    22.99    16.48
                ≥ 12                   31.25    24.95    20.97    14.85
2.5%├5%         < 10                   19.12    14.68    12.22     8.25
                10├12                  15.58    12.36     9.89     7.08
                ≥ 12                   14.00    10.98     9.06     6.35
10%├15%         < 10                    8.78     6.44     5.22     3.67
                10├12                   7.00     5.46     4.41     3.13
                ≥ 12                    6.29     4.79     3.89     2.83
20%├30%         < 10                    5.46     4.16     3.44     2.27
                10├12                   4.57     3.42     2.94     2.01
                ≥ 12                    4.05     3.20     2.59     1.77

[Figure: median CV for some classes of p and for three classes of area (according to population size). Comparison of 4 different sampling ratios (s.r.) with the SRSHOU design. Graph for areas with fewer than 10,000 inhabitants.]

[Figure: median CV for some classes of p and for three classes of area (according to population size). Comparison of 4 different sampling ratios (s.r.) with the SRSHOU design. Graph for areas with between 10,000 and 12,000 inhabitants.]
The gain of efficiency (in terms of CV) for census areas with between 10,000 and 12,000 inhabitants, with respect to census areas with fewer than 10,000, is about 14-20 percent. Similar results are obtained for all tested sampling ratios.

[Figure: median CV for some classes of p and for three classes of area (according to population size). Comparison of 4 different sampling ratios (s.r.) with the SRSHOU design. Graph for areas with more than 12,000 inhabitants.]
The gain of efficiency (in terms of CV) for census areas with more than 12,000 inhabitants, with respect to census areas with fewer than 10,000, is about 22-28 percent. As before, similar results are obtained for all tested sampling ratios.

Distribution by class of CV of the estimates for areas larger than 12,000 inhabitants. Comparison of percentage frequencies for 4 different sampling ratios with the SRSHOU design.

Classes of coefficient    Sampling ratio
of variation (%)             10%      15%      20%      33%
< 2%                        0.57     2.69     6.39    13.14
2%├5%                      13.04    17.53    18.40    23.64
5%├10%                     16.18    18.02    26.28    28.64
10%├20%                    29.14    30.16    23.54    16.20
20%├50%                    25.09    19.71    16.75    13.32
50%├100%                    9.32     7.21     5.69     3.44
100%├200%                   4.40     3.65     2.00     1.61
≥ 200%                      2.25     1.03     0.95        -

HA: high accuracy; MA: medium accuracy; LA: low accuracy

Distribution by class of CV of the estimates for areas larger than 12,000 inhabitants. Comparison of percentage frequencies for 4 different sampling ratios with the SRSHOU design - 2

Classes of CV (%)   Accuracy    Sampling ratio
                                   10%      15%      20%      33%
< 10%               HA           29.80    38.25    51.07    65.42
10%├50%             MA           54.23    49.87    40.29    29.52
≥ 50%               LA           15.97    11.89     8.64     5.05

HA: high accuracy; MA: medium accuracy; LA: low accuracy

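The three-class summary can be reproduced from the detailed distribution on the previous slide. The sketch below assumes that the HA, MA and LA labels map, in order, to the CV classes below 10%, from 10% to below 50%, and 50% or more; the function and variable names are hypothetical, and only the 10% sampling ratio column is used.

```python
def accuracy_class(cv_percent):
    """Accuracy label for a CV given in percent (assumed mapping of HA/MA/LA)."""
    if cv_percent < 10:
        return "HA"
    if cv_percent < 50:
        return "MA"
    return "LA"


# Detailed distribution for the 10% sampling ratio:
# (lower bound of the CV class in percent, percentage of estimates in the class).
DETAILED_SR10 = [
    (0, 0.57), (2, 13.04), (5, 16.18), (10, 29.14),
    (20, 25.09), (50, 9.32), (100, 4.40), (200, 2.25),
]


def collapse(detailed):
    """Sum the detailed class shares into the three accuracy classes."""
    totals = {"HA": 0.0, "MA": 0.0, "LA": 0.0}
    for lower_bound, share in detailed:
        totals[accuracy_class(lower_bound)] += share
    return totals


print({label: round(share, 2) for label, share in collapse(DETAILED_SR10).items()})
```

The collapsed shares agree with the 10% column of the summary table up to rounding of the published figures.
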
Estimates of p for a territory given by aggregation of areas.
- Generic sampled area a
- Territory R_S given by the aggregation of K sampled areas: percentage expected reduction of CV in R_S
- Territory R given by the aggregation of sampled and non-sampled areas: share of the sub-population of R eligible for drawing the long form (LF) sample; percentage expected reduction of CV in R

Conclusions - 1
As expected, the most accurate estimates were obtained for:
- simple random sampling of households from administrative registers
- the largest sampling ratio
Better efficiency of estimates for the largest areas (>12,000 inhabitants)
- this result suggests planning the sampling design around larger census areas (of about 15,000 people)
Estimates for large domains given by aggregation of areas show high accuracy
- accuracy increases with the number of areas aggregated in the domain
- where part of the large domain is completely enumerated, the estimates show a further increase in accuracy

Conclusions - 2
However, area frame sampling is only slightly less efficient than SRSHOU, so it could be adopted where reliable administrative registers are not available.
The sampling ratio will be chosen considering the trade-off between:
- the financial savings needed
- the accuracy required for different territorial domains
Further analyses will be conducted on small area estimation techniques to produce more accurate estimates for:
- the smallest territorial levels
- rare populations

Thank you for your attention and … have a good lunch!

Simulation study - 4
Cross-classification cells
- educational level, employment status, commuting and gender: 90 simple estimation cells
Calibration constraints
- defined by cross-classifying gender by age, and gender by marital status
Computational algorithm, implemented in SAS for each municipality and for each alternative sampling design:
- step 1) selection of a sample (of households or enumeration areas)
- step 2) computation of final weights
- step 3) estimation of the relative frequency p for each target cell
- step 4) iteration of steps 1), 2) and 3) for 1,000 sampling replications
- step 5) computation of the mean and standard error of the sampling distribution for each of the 90 frequency cells

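The five steps can be sketched in Python as follows. This is a minimal illustration, not the SAS implementation described above: the population is synthetic, there is a single target cell instead of 90, and calibration is reduced to its simplest case (post-stratification of the weights to known gender totals). All names and parameter values are assumptions made for the example.

```python
import random
import statistics


def make_population(n_households, rng):
    """Synthetic households: (gender of reference person, employed flag)."""
    population = []
    for _ in range(n_households):
        gender = rng.choice("MF")
        p_employed = 0.6 if gender == "M" else 0.4  # assumed rates
        population.append((gender, rng.random() < p_employed))
    return population


def calibrated_estimate(sample, pop_totals, base_weight):
    """Steps 2 and 3: adjust weights to known gender totals, then estimate p."""
    sample_totals = {"M": 0.0, "F": 0.0}
    for gender, _ in sample:
        sample_totals[gender] += base_weight
    weighted_count = 0.0
    for gender, employed in sample:
        final_weight = base_weight * pop_totals[gender] / sample_totals[gender]
        if employed:
            weighted_count += final_weight
    return weighted_count / sum(pop_totals.values())


def monte_carlo_cv(population, sampling_ratio, replications, rng):
    """Steps 1 to 5 for one target cell: returns (mean estimate, CV in percent)."""
    n = round(len(population) * sampling_ratio)
    pop_totals = {g: sum(1 for h in population if h[0] == g) for g in "MF"}
    estimates = []
    for _ in range(replications):                      # step 4
        sample = rng.sample(population, n)             # step 1
        estimates.append(
            calibrated_estimate(sample, pop_totals, len(population) / n)
        )
    mean = statistics.fmean(estimates)                 # step 5
    standard_error = statistics.stdev(estimates)
    return mean, 100 * standard_error / mean


if __name__ == "__main__":
    rng = random.Random(2008)
    population = make_population(3000, rng)
    for ratio in (0.10, 0.33):
        mean, cv = monte_carlo_cv(population, ratio, 200, rng)
        print(f"sampling ratio {ratio:.0%}: mean p = {mean:.3f}, CV = {cv:.2f}%")
```

The number of replications is kept small here so the example runs quickly; the study used 1,000 replications per municipality and design.
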
Evaluation criterion: the coefficient of variation
The coefficient of variation (CV) was adopted as the evaluation criterion for comparing the sampling strategies; it measures the accuracy of the sampling estimates.
Consequently, the maximum expected percentage error implied by the estimation method (with a probability of 0.95) can be computed: Δ% ≈ 1.96 · CV
The distribution of the empirical CVs for all 90 target cells was determined. After classifying the target cells by their value of p, the distribution of the CVs of the cells in the same class of p was studied.

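A short worked example of the relation Δ% ≈ 1.96 · CV, using two median CVs reported for the 33% sampling ratio (10.68% for the class 1%├2.5% of p and 3.13% for the class 10%├15%). The function names are hypothetical, and the interval helper simply applies the relative error to an estimated p.

```python
def max_expected_error(cv_percent, z=1.96):
    """Maximum expected percentage error at about 95% probability: z times the CV."""
    return z * cv_percent


def interval_for(p, cv_percent):
    """Approximate 95% interval for an estimated relative frequency p."""
    half_width = p * max_expected_error(cv_percent) / 100
    return p - half_width, p + half_width


for cv in (10.68, 3.13):
    print(f"CV = {cv}%: maximum expected error of about {max_expected_error(cv):.1f}%")

low, high = interval_for(0.02, 10.68)
print(f"p = 2% with CV = 10.68%: roughly {low:.4f} to {high:.4f}")
```

A CV of about 10% therefore corresponds to an estimate that may be off by about one fifth of its own value, which is why CVs below 10% are treated as high accuracy.
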
Estimates of p for a territory given by aggregation of areas. Case 1: aggregation of sampled areas.
- Estimate for the generic sampled area a
- Estimate for the territory R_S given by the aggregation of K sampled areas
Percentage expected reduction of CV
- for K > 5 → red% > 50%
- for K > 30 → red% > 80%
- for K > 100 → red% > 90%
[Figure: percentage expected reduction of CV against the number of areas K.]

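The formula behind these thresholds is not preserved in the transcript. A standard result that is consistent with them is the following, offered here as an assumption rather than as the authors' formula: if the K areas have equal size, are sampled independently, and have the same CV and a similar p, the variance of the pooled estimate is 1/K times the variance of a single-area estimate, so the CV shrinks by a factor 1/sqrt(K) and the percentage reduction is 100 · (1 - 1/sqrt(K)). The function name is hypothetical.

```python
import math


def expected_cv_reduction(k):
    """Percentage reduction of CV when k equal-sized areas with equal CV are pooled."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 100 * (1 - 1 / math.sqrt(k))


for k in (5, 30, 100):
    print(f"K = {k}: expected CV reduction of about {expected_cv_reduction(k):.1f}%")
```

Under these assumptions the reduction passes 50% at K = 4, 80% at K = 25 and 90% at K = 100, which agrees with the thresholds listed on the slide.
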
Estimates of p for a territory given by aggregation of areas. Case 2: aggregation of sampled and non-sampled areas.
- Territory R_S of sampled areas: long form to a sample of households.
- Territory R_NS of non-sampled areas: long form to all the households.
- Sub-population of R eligible for drawing the LF sample.
[Figure: percentage expected reduction of CV against the number of areas K, with curves for γ = 1, γ = 0.7, γ = 0.6 and γ = 0.5.]