1 Variance Estimation: Drawing Statistical Inferences from IPUMS-International Census Data Lara L. Cleveland IPUMS-International November 14, 2010 Havana, Cuba

2 Overview
Characteristics of Complex Samples
  - Public Use Census Data
  - IPUMS-International Census Samples
Adjusting for Sampling Error
  - Assessment Strategy
  - Results
  - Recommendations and Future Work
The IPUMS projects are funded by the National Science Foundation and the National Institutes of Health.

4 Public Use Census Microdata
Publicly available census microdata often derive from complex samples. However, social science researchers commonly apply methods designed for simple random samples.

5 Public Use Census Data: Complex Samples
Clustering
  - By household (sample households rather than individuals)
  - Some samples geographically clustered
  - Can result in underestimated standard errors
Differential weighting
  - Oversample select populations
  - Also leads to underestimated standard errors
Stratification
  - Explicitly by person or household characteristics
  - Implicitly by geographical area
  - Can result in overestimated standard errors
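
As a rough guide to why household clustering tends to produce underestimated standard errors when it is ignored, a standard textbook approximation (Kish's design effect, not taken from the slides) can be written as follows. The symbols are assumptions introduced here for illustration: \bar{m} is the average number of sampled persons per household and \rho is the intra-household correlation of the variable of interest.

```latex
\mathrm{deff} \approx 1 + (\bar{m} - 1)\,\rho,
\qquad
\mathrm{SE}_{\text{cluster}} \approx \mathrm{SE}_{\text{SRS}}\,\sqrt{\mathrm{deff}}
```

When members of a household resemble one another (\rho > 0) and more than one person per household is sampled (\bar{m} > 1), deff exceeds 1, so a simple-random-sample formula would understate the standard error.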

6 IPUMS-I Data Processing
Data received varies in quality, detail and extent of documentation
Three sampling processes
  - Country-produced public use sample
  - Sample drawn by partner country to IPUMS-I specifications
  - Full count data sampled by IPUMS-I

7 Samples Drawn by IPUMS-I
High density (typically 10% samples)
Household samples
  - Clustered by household
Systematic sample (every nth household)
  - Typically geographic sorting – presumed here
  - Implicit geographic stratification
Uniformly weighted (self-weighting)
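
The systematic design on this slide can be sketched in a few lines of Python. This is a minimal illustration, not IPUMS-I's actual sampling code: the function name `systematic_sample`, the random start, and the toy frame of 1,000 household IDs are all assumptions made for the example.

```python
import random

def systematic_sample(households, interval=10, seed=None):
    """Take every `interval`-th household after a random start.

    `households` is assumed to be sorted geographically already, which is
    what gives the sample its implicit geographic stratification.
    """
    rng = random.Random(seed)
    start = rng.randrange(interval)  # random start in 0 .. interval-1
    return households[start::interval]

# Hypothetical frame: 1,000 household IDs in geographic order.
frame = list(range(1000))
sample = systematic_sample(frame, interval=10, seed=1)

# Self-weighting: every sampled household stands for `interval` households.
weight = 10
```

Because every household has the same 1-in-10 chance of selection, a single uniform weight applies to all records, which is what "self-weighting" refers to.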

8 Variance Estimation: Data Quality Assessment/Improvement
As researchers and data users
  - Assess accuracy of the data
  - Calculate precise estimates
As data custodians and disseminators
  - Distribute quality data samples
  - Create tools to facilitate research

10 Assessment Strategy
Create or specify variables to account for sampling error for use in current statistical packages
  - Cluster (household identifier)
  - Strata (pseudo-strata)
Compare estimates from full count data to estimates from sample data using 3 methods:
  - Subsample Replicate
  - Taylor Series Linearization
  - Simple Random Sample (SRS)
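
To make the role of the cluster and strata variables concrete, here is a minimal sketch of a Taylor series (linearized) standard error for a mean under a self-weighting stratified cluster design. It uses the usual with-replacement formula and ignores finite population corrections; the function name `linearized_se_mean` and the toy data are hypothetical, and the statistical packages mentioned on the slide implement more general versions.

```python
import math
from collections import defaultdict

def linearized_se_mean(values, clusters, strata):
    """Linearized SE of a mean for equally weighted records.

    values, clusters, strata are parallel sequences (one entry per person).
    Cluster IDs are assumed to be unique across strata.
    """
    n = len(values)
    mean = sum(values) / n
    # Sum the linearized variable z = (y - mean) / n within each cluster.
    totals = defaultdict(float)
    stratum_of = {}
    for y, c, h in zip(values, clusters, strata):
        totals[c] += (y - mean) / n
        stratum_of[c] = h
    by_stratum = defaultdict(list)
    for c, t in totals.items():
        by_stratum[stratum_of[c]].append(t)
    # Between-cluster variance of the totals, summed over strata.
    variance = 0.0
    for zs in by_stratum.values():
        n_h = len(zs)
        if n_h < 2:
            continue  # a one-cluster stratum adds nothing in this sketch
        z_bar = sum(zs) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in zs)
    return mean, math.sqrt(variance)

# Toy data: two households whose members have identical values.
ages = [1, 1, 3, 3]
hh = ["a", "a", "b", "b"]
one_stratum = [0, 0, 0, 0]
mean_cl, se_cl = linearized_se_mean(ages, hh, one_stratum)          # household clusters
mean_srs, se_srs = linearized_se_mean(ages, range(4), one_stratum)  # each person a cluster
```

In the toy data the household-clustered standard error is larger than the one obtained by treating each person as an independent draw, which is the pattern the slides associate with ignoring clustering.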

11 Assessing Accuracy: Full Count Data
“True” or “Gold Standard” Estimates
  - Full count census data
  - Simulate sample design: 100 replicates of 10% each
  - Estimate the mean and standard error of the mean for several household and person-level variables
Recent census data from 4 countries: Bolivia 2001, Ghana 2000, Mongolia 2000, Rwanda 2002
Full count, clean, well formatted data requiring no special corrections
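
The replicate idea can be sketched as follows. The slides do not say how each 10% replicate was drawn, so this example assumes one household is picked at random from every consecutive block of 10 geographically sorted households; that choice, the function name `replicate_se`, and the toy data are all assumptions. The standard error is taken as the standard deviation of the replicate means.

```python
import random
import statistics

def replicate_se(household_values, n_replicates=100, block=10, seed=0):
    """Mean and SE of the mean from repeated 1-in-`block` samples.

    household_values: one value per household, in geographic order.
    A short final block still contributes one household (sketch only).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_replicates):
        picks = [
            rng.choice(household_values[i:i + block])
            for i in range(0, len(household_values), block)
        ]
        means.append(statistics.fmean(picks))
    # Spread of the replicate means estimates the sampling error.
    return statistics.fmean(means), statistics.stdev(means)

# Hypothetical full-count variable for 1,000 households.
full_count = [float(i % 7) for i in range(1000)]
rep_mean, rep_se = replicate_se(full_count)
```

The same function can be pointed at a 10% sample instead of the full count to produce 10% subsamples of the sample.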

12 Assessing Accuracy: Sample Data
Sub-sample Replicate
  - Mimic sample design – 100 10% subsamples
  - Labor and resource heavy
Taylor Series Linearization
  - Clustering: household identifier
  - Stratification: pseudo-strata variable
    - 10 adjacent households within geographic unit
    - Incomplete strata pooled with preceding strata
  - Available in most statistical packages
Simple Random Sample as control/comparison
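
The pseudo-strata rule above (10 adjacent households within a geographic unit, with an incomplete final group pooled into the preceding stratum) might be coded roughly like this. The function name `assign_pseudo_strata` is hypothetical, and treating a unit with fewer than 10 households as its own stratum is an assumption, since the slide does not cover that case.

```python
from itertools import groupby

def assign_pseudo_strata(geo_units, size=10):
    """Return one pseudo-stratum ID per sampled household.

    geo_units: geographic unit code of each household, in sample order
    (households of the same unit are assumed to be adjacent).
    """
    strata = []
    next_id = 0
    for _, run in groupby(geo_units):
        m = len(list(run))          # households in this geographic unit
        k = max(1, m // size)       # number of strata formed in the unit
        for j in range(m):
            # Leftover households beyond k full groups join the last stratum.
            strata.append(next_id + min(j // size, k - 1))
        next_id += k
    return strata

# Hypothetical sample: 25 households in unit "A", then 10 in unit "B".
units = ["A"] * 25 + ["B"] * 10
pseudo = assign_pseudo_strata(units)
```

In this toy case unit "A" yields one stratum of 10 households and one of 15 (the 5 leftover households are pooled with the second group), and unit "B" yields a single stratum of 10.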

14 Table Format
From Full Count Data – “Gold Standard”
  1) Full Count Mean
  2) S.E. of mean from Full Count Replicate
From Sample Data: Ratios of Standard Errors
  3) SE (Sub-sample Replicate) / SE (Full Count Replicate)
  4) SE (Sample Taylor Series) / SE (Full Count Replicate)
  5) SE (SRS) / SE (Full Count Replicate)
Ratios
  - ~1.0: Sample estimate resembles “true” value
  - >1.0: Sample estimate overestimates SE
  - <1.0: Sample estimate underestimates SE
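
The ratio columns can be read with a small helper like the one below. It is only an illustration: the function names, the 0.05 tolerance used to decide what counts as "about 1.0", and the example standard errors are assumptions, not values from the tables.

```python
def se_ratio(se_sample, se_full_count):
    """Ratio of a sample-based SE to the full count replicate SE."""
    return se_sample / se_full_count

def interpret(ratio, tolerance=0.05):
    """Classify a ratio using a hypothetical tolerance around 1.0."""
    if ratio > 1 + tolerance:
        return "sample estimate overestimates SE"
    if ratio < 1 - tolerance:
        return "sample estimate underestimates SE"
    return "sample estimate resembles full count value"

# Made-up standard errors, for illustration only.
example_ratio = se_ratio(0.012, 0.010)
example_label = interpret(example_ratio)
```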

15 Table 1. Rwanda 2002: Comparing Complete Count and Sample Standard Error Estimates
Columns: (1) full count parameter estimate; (2) SE from full count replicate; (3) to (5) ratio of SE estimates, 10% sample to full count replicate: (3) subsample replicate, (4) Taylor series with pseudo-strata (pseudo-strata and HH cluster for person rows), (5) simple random sample.
Selected characteristic | (1) | (2) | Ratios (3)-(5) as listed in source
Household
HH Size (mean) | 4.71 | 0.005 | 0.8, 0.9
Electric Light (%) | 4.18 | 0.034 | 0.9, 1.3
Toilet (%) | 0.38 | 0.013 | 0.9, 1.0
Radio (%) | 43.11 | 0.103 | 0.9, 1.0
Earth Floor (%) | 85.28 | 0.073 | 0.8, 0.9, 1.0
Home Ownership (%) | 86.41 | 0.056 | 1.1, 1.3
Non-relatives (mean) | 0.30 | 0.002 | 1.1, 1.0, 1.1
Person
Age (mean) | 20.77 | 0.015 | 0.9, 1.0, 1.1
Sex (%) | 46.81 | 0.045 | 0.9, 1.0, 1.1
Religion: Catholic (%) | 46.69 | 0.100 | 1.0, 1.0, 0.5
Religion: Protestant (%) | 26.16 | 0.077 | 1.1, 1.1, 0.6
Married (%) | 17.64 | 0.039 | 0.9, 1.0
Literate (%) | 39.75 | 0.060 | 0.9, 0.8
Employed (%) | 40.94 | 0.048 | 0.9, 1.0

16 Table 2. Mongolia 2000: Comparing Complete Count and Sample Standard Error Estimates
Columns: (1) full count parameter estimate; (2) SE from full count replicate; (3) to (5) ratio of SE estimates, 10% sample to full count replicate: (3) subsample replicate, (4) Taylor series with pseudo-strata (pseudo-strata and HH cluster for person rows), (5) simple random sample.
Selected characteristic | (1) | (2) | Ratios (3)-(5) as listed in source
Household
HH Size (mean) | 4.45 | 0.008 | 0.9, 1.0
Electric Light (%) | 67.53 | 0.098 | 1.1, 1.0, 1.8
Toilet (%) | 62.46 | 0.135 | 1.1, 1.2, 1.4
Kitchen (%) | 39.08 | 0.145 | 1.0, 1.3
Bathroom (%) | 21.74 | 0.096 | 1.0, 1.1, 1.5
Phone (%) | 17.01 | 0.136 | 1.0, 1.1
Non-relatives (mean) | 0.11 | 0.002 | 0.9, 1.0
Person
Age (mean) | 24.57 | 0.034 | 1.0
Sex (%) | 49.47 | 0.078 | 0.9, 1.0, 1.2
Ethnicity: Khalkh (%), Kazak (%) | 81.59, 4.28 | 0.111, 0.047 | 0.9, 1.0, 1.1, 0.6, 0.8
Married (%) | 32.33 | 0.081 | 0.9, 1.0, 1.1
Literate (%) | 81.56 | 0.071 | 1.1, 1.0
Employed (%) | 32.47 | 0.095 | 0.9

17 Table 3. Bolivia 2001: Comparing Complete Count and Sample Standard Error Estimates
Columns: (1) full count parameter estimate; (2) SE from full count replicate; (3) to (5) ratio of SE estimates, 10% sample to full count replicate: (3) subsample replicate, (4) Taylor series with pseudo-strata (pseudo-strata and cluster for person rows), (5) simple random sample.
Selected characteristic | (1) | (2) | Ratios (3)-(5) as listed in source
Household
HH Size (mean) | 3.93 | 0.0046 | 1.0, 1.1
Electric Light (%) | 60.51 | 0.0536 | 1.1, 1.2, 1.9
Toilet (%) | 59.48 | 0.0649 | 1.0, 1.1, 1.6
Kitchen (%) | 70.62 | 0.0882 | 0.9, 1.0, 1.1
Phone (%) | 21.33 | 0.0605 | 1.3, 1.1, 1.4
Radio (%) | 71.17 | 0.0819 | 0.9, 1.0, 1.1
Earth Floor (%) | 35.66 | 0.0519 | 1.2, 1.3, 1.9
Non-relatives (mean) | 0.19 | 0.0012 | 1.0, 1.1
Person
Age (mean) | 24.70 | 0.0004 | 1.0, 1.1, 1.0
Sex (%) | 49.84 | 0.0024 | 0.9, 1.1
Ethnicity: Quechua (%), Aymara (%) | 30.69, 25.19 | 0.0053, 0.0047 | 1.0, 0.8, 1.0, 0.9, 0.8
Married (%) | 26.09 | 0.0023 | 0.9, 1.0
Literate (%) | 74.99 | 0.0025 | 0.9
Employed (%) | 34.37 | 0.0022 | 1.1, 1.0

18 Table 4. Bolivia 2001: Decomposition of Clustering and Stratification Effects on Taylor Series Standard Error Estimates from the 10% Sample
Columns: mean; SE (full count replicate); then ratio columns in this order: adjusted for clustering and implicit stratification; effect of clustering (adjusted for strata only); effect of stratification (adjusted for cluster only); combined effect of clustering and stratification. Column groups on the slide: Taylor Series Linearization, SRS.
Person | Mean | SE (Full Count Replicate) | Ratios as listed in source
Age (mean) | 24.7 | 0.0004 | 1.1, 1.0, 1.1, 1.0
Sex (%) | 49.8 | 0.0024 | 0.9, 1.1, 0.9, 1.1
Ethnicity: Quechua (%), Aymara (%) | 30.7, 25.2 | 0.0053, 0.0047 | 1.0, 0.9, 0.6, 0.5, 1.4, 0.8
Married (%) | 26.1 | 0.0023 | 1.0
Literate (%) | 75.0 | 0.0025 | 0.9, 1.0, 0.9
Worked (%) | 34.4 | 0.0022 | 1.1, 1.0, 1.2, 1.0

19 Table 5. Ghana 2000: Comparing Complete Count and Sample Standard Error Estimates
Columns: (1) full count parameter estimate; (2) SE from full count replicate; (3) to (5) ratio of SE estimates, 10% sample to full count replicate: (3) subsample replicate, (4) Taylor series with pseudo-strata (pseudo-strata and HH cluster for person rows), (5) simple random sample.
Selected characteristic | (1) | (2) | Ratios (3)-(5) as listed in source
Household
HH Size (mean) | 4.99 | 0.005 | 1.1, 1.0
Electric Light (%) | 43.54 | 0.042 | 1.5, 1.8
Toilet (%) | 8.49 | 0.026 | 1.2, 1.5, 1.7
Kitchen (%) | 46.17 | 0.062 | 1.2
Bathroom (%) | 23.47 | 0.046 | 1.5, 1.4
Non-relatives (mean) | 0.14 | 0.001 | 0.9, 1.0
Person
Age (mean) | 23.90 | 0.013 | 1.0, 1.1, 1.0
Sex (%) | 49.48 | 0.035 | 1.0
Ethnicity: Akan (%), Mole-dagbani (%) | 45.28, 15.25 | 0.066, 0.051 | 0.9, 1.0, 1.0, 0.5
Married (%) | 29.28 | 0.029 | 1.2, 1.1
Literate (%) | 34.00 | 0.038 | 1.0, 1.1, 0.9
Employed (%) | 42.44 | 0.038 | 1.3, 1.1, 0.9

21 Recommendations: Clustering
Many research projects need not worry
  - Subpopulations that rely on only one person per HH (e.g., fertility, aging, some work-related studies)
Design research to select a single person from the household
Use household identifier in stat packages
Future: Variable that includes identifier for geographic clustering as needed
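
Selecting a single person from each household, as recommended above, could look like the sketch below. The function name `one_person_per_household` and the record layout are hypothetical. One caveat that goes beyond the slide: a random pick gives people in larger households a smaller chance of selection, so the sketch also returns the household size as a possible adjustment weight.

```python
import random
from collections import defaultdict

def one_person_per_household(persons, seed=None):
    """Pick one person at random from each household.

    persons: iterable of (household_id, person_id) pairs.
    Returns {household_id: (person_id, household_size)}.
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for hh_id, person_id in persons:
        members[hh_id].append(person_id)
    # household_size can serve as a within-household selection weight.
    return {hh: (rng.choice(ids), len(ids)) for hh, ids in members.items()}

# Hypothetical person records: household 1 has two members, household 2 has one.
records = [(1, "a"), (1, "b"), (2, "c")]
selected = one_person_per_household(records, seed=42)
```

With one person per household there is no within-household clustering left in the analysis file, which is why such designs are listed among the projects that "need not worry".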

22 Recommendations: Stratification
Most researchers need no modification
  - Stratification increases precision
  - Estimates are conservative
If concerned, use pseudo-strata
  - Investigations of weak relationships
  - For some sub-population studies
Future: Pseudo-strata variable to specify information about implicit stratification

23 Recommendations: Web Guidance

25 Current and future work
  - Determine optimal pseudo-strata size
  - Investigate Ghana data distribution
  - Seek more geographic detail in the data
  - Compare estimates to published population counts
  - Additional data quality tests

26 Thank you! Questions?
Lara L. Cleveland
IPUMS International
Minnesota Population Center
University of Minnesota
50 Willey Hall
225 – 19th Avenue South
Minneapolis, MN 55455

