Evaluating Methods of Standard Error Estimation for Use with the Current Population Survey’s Public Use Data The Hawaii Coverage For All Technical Workshop Honolulu, Hawaii February 7, 2003 Presented by: Michael Davern, Ph.D. University of Minnesota Division of Health Services Research and Policy School of Public Health Supported by a grant from The Robert Wood Johnson Foundation
This paper is a Work in Progress The paper is co-authored with: James Lepkowski, University of Michigan; Gestur Davidson, University of Minnesota/SHADAC; Arthur Jones Jr., US Census Bureau; Lynn A. Blewett, University of Minnesota/SHADAC Estimates have not cleared final Census Review –Estimates are therefore PRELIMINARY –We hope to present it at AAPOR in May of 2003
The Problem: CPS is a complex survey –Sample design information is necessary to estimate appropriate standard errors –Important components of the sampling design are not released to the public Public use data are widely used by policy-makers and academics –Significance tests in research are likely biased because the standard errors are misestimated –These significance tests provide important rules for “evidence” in the policy analysis and academic literature
The Result: Thus what constitutes “evidence” in policy analysis and academic journals, and the inferences drawn from that evidence, may not be valid In other words: What we know from research using Census Bureau public use data products may not be accurate enough to be useful In a quick search we found over 50 articles in the top social science journals that used Census Bureau public use data.
The Analysis: We identified four approaches to estimating the standard error on the public use data –The Simple Random Sample (SRS) approach – Generalized variance parameter (GVP) approach (Census Bureau’s Standard) –Robust variance estimation (aka sandwich estimator or Huber-White estimator) –Taylor Series with a stratum and cluster variable defined
The Data: The CPS uses a complex sampling design with the following features: –Country is divided into Primary Sampling Units (PSUs) A PSU is a county or group of contiguous counties “Self-representing” PSUs are Metro Areas that are selected with certainty Non-self-representing PSUs are sampled through a stratification process within each state –Within PSUs, groups of housing units are identified and called Ultimate Sampling Units (USUs)
The Data: –On average 4 housing units are selected from a USU using a systematic sampling method –Information is collected on everyone within a selected household –Due to the rotation schedule, about 45 percent of the households that were interviewed in the monthly CPS were interviewed in the previous year during that month.
The Variables and Standard Error Estimation We estimate state rates of health insurance coverage and poverty We also estimate state average income We estimate the standard errors for these rates/averages in the following manner: –SRS uses normalized weights and conventional calculations to determine standard errors –GVP approach uses the parameters in the Source and Accuracy Statement from the Census Bureau to correct for the complex sampling design (this is the technique used by the Census Bureau).
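The SRS and GVP calculations named on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the GVP form shown is the percentage formula as it commonly appears in CPS Source and Accuracy statements, and the `b` value used in the usage note is a made-up placeholder, not a published parameter.

```python
import math

def srs_se_proportion(flags, weights):
    """SRS-style standard error of a weighted proportion.

    Weights are rescaled to sum to the sample size n ("normalized"),
    then the conventional sqrt(p * (1 - p) / n) formula is applied.
    """
    n = len(flags)
    scale = n / sum(weights)
    norm_w = [w * scale for w in weights]
    p = sum(w * f for w, f in zip(norm_w, flags)) / n
    return p, math.sqrt(p * (1 - p) / n)

def gvp_se_percent(p_percent, base, b):
    """Generalized-variance standard error of a percentage.

    Assumed form: sqrt((b / base) * p * (100 - p)), where base is the
    weighted population count in the denominator and b is a parameter
    looked up in the Source and Accuracy statement.
    """
    return math.sqrt((b / base) * p_percent * (100.0 - p_percent))
```

For example, `gvp_se_percent(10.0, 1_000_000, 2500)` gives a standard error of about 1.5 percentage points for a 10 percent rate on a base of one million people, using the placeholder `b = 2500`.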
Standard Error Estimation –Robust standard errors use the person weights to account for the degree of heterogeneity in the probability of selection –Taylor Series on the Public Use file uses the ‘Lowest’ level of identifiable geography as the stratum variable and household as the cluster variable Lowest level of identifiable geography is either: –(1) largest 250 MSAs, –(2) Other counties with over 100,000 in population, –(3) non-MSA and non-identified county within a state
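A Taylor series (linearization) estimate with a stratum and a cluster variable can be sketched as below. This is a simplified illustration under stated assumptions: it treats clusters as sampled with replacement within strata, ignores finite population corrections, and skips strata that contain a single cluster. The function and argument names are hypothetical; production work would normally use an established survey package.

```python
import math
from collections import defaultdict

def taylor_se_mean(y, w, stratum, cluster):
    """Linearized standard error of a weighted mean (ratio estimator)."""
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    # Linearized score per person, summed to cluster totals within strata.
    totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, c in zip(y, w, stratum, cluster):
        totals[h][c] += wi * (yi - mean) / total_w

    variance = 0.0
    for clusters in totals.values():
        n_h = len(clusters)
        if n_h < 2:
            continue  # a single-cluster stratum contributes nothing here
        z = list(clusters.values())
        z_bar = sum(z) / n_h
        # Between-cluster variation of the linearized totals in this stratum.
        variance += n_h / (n_h - 1) * sum((zc - z_bar) ** 2 for zc in z)
    return mean, math.sqrt(variance)
```

On the public use file the `stratum` argument would hold the lowest identifiable geography and `cluster` the household identifier; on the internal file they would hold state and PSU. When everyone in a household shares the same value, clustering on household yields a larger standard error than treating each person as independent.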
The Standard Error “Standard” Ultimate Cluster Method is the current standard way to estimate standard errors for survey data –Taylor series combined with an identified ultimate cluster and stratum variable –The Ultimate cluster for the CPS is the PSU –We used the Census internal data that has the PSU identifiers In the Taylor Series the State is stratum and PSU is cluster (except DC)
These Results are Preliminary and Subject to Internal Census Bureau Review Please do not cite our work without permission
Findings Health Insurance Coverage on Average: –Robust is 8% larger than SRS –Taylor Series public use file is 54% larger than SRS –GVP is 17% smaller than SRS –Taylor Series on internal file is 138% larger than SRS
Findings Percent in Poverty on Average: –Robust is 7% larger than SRS –Taylor Series public use file is 77% larger than SRS –GVP is 81% larger than SRS –Taylor Series on internal file is 190% larger than SRS
Findings Individual (adult) Income on Average: –Robust is 6% larger than SRS –Taylor Series public use file is 7% larger than SRS –GVP is 154% larger than SRS –Taylor Series on internal file is 123% larger than SRS
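The "X% larger than SRS" figures in the findings can be turned into standard error ratios and implied variance ratios with simple arithmetic. The sketch below assumes that "X% larger" means (SE of the method / SRS SE - 1) × 100; the slides do not state the formula, and the function names are hypothetical.

```python
def se_ratio(pct_larger):
    """Ratio of a method's standard error to the SRS standard error.

    A negative input represents "X% smaller than SRS".
    """
    return 1.0 + pct_larger / 100.0

def variance_ratio(pct_larger):
    """Implied ratio of variances, the square of the SE ratio."""
    return se_ratio(pct_larger) ** 2

# Health insurance coverage figures from the findings slides.
for label, pct in [("Robust", 8), ("Taylor public use", 54),
                   ("GVP", -17), ("Taylor internal", 138)]:
    print(f"{label}: SE ratio {se_ratio(pct):.2f}, "
          f"variance ratio {variance_ratio(pct):.2f}")
```

Under this reading, a confidence interval built from the internal-file Taylor series standard error for health insurance coverage is about 2.38 times as wide as the SRS interval.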
Discussion GVPs are all over the map compared to the Standard Error “Standard” –GVP Std. Errors are too high for income, too low for poverty, and far too low for health insurance Robust Std. Error estimates are consistently too small –The main cause of standard error inflation is not differential probability of selection but rather intra-cluster correlation To the extent that households have a high intra-cluster correlation, the Taylor Series is better than the 3 other public use file estimates –Poverty and health insurance have high intra-household correlations, but individual income does not
Discussion Larger states are likely to have more PSUs in the Census internal file than are recognized in the public use file (where we only see their aggregation) By their very construction, the larger number of PSUs results in more “within-PSU” homogeneity being recognized: –States with more PSUs in the internal data have much higher Std. Errors (using the “Standard”) than are currently being estimated –Greater homogeneity within PSUs or households reduces the “effective” sample size (there is less ‘independent’ information than the full sample size would suggest) As expected, the consequences are greatest for the health insurance and poverty estimates.
Conclusion The Census Bureau is not going to release PSU identifiers to the public The data are widely used for important policy and academic research –Work done on the public use file has biased standard errors, so its inferences may not meet the statistical standard for evidence Therefore, I feel it is the responsibility of the Census Bureau to improve its GVPs or come up with a better substitute –What is currently offered is inadequate
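The link between within-cluster homogeneity and "effective" sample size can be illustrated with Kish's approximation, deff = 1 + (m - 1) × rho. This formula is not given in the slides; it assumes equal cluster sizes, and the numbers in the usage note are invented for illustration.

```python
def design_effect(avg_cluster_size, icc):
    """Kish's approximate design effect for equal-sized clusters.

    avg_cluster_size: average number of sampled persons per cluster (m)
    icc: intra-cluster correlation (rho) of the variable of interest
    """
    return 1.0 + (avg_cluster_size - 1.0) * icc

def effective_sample_size(n, avg_cluster_size, icc):
    """Sample size of an SRS carrying the same amount of information."""
    return n / design_effect(avg_cluster_size, icc)
```

With a hypothetical 60,000 persons in households of 3 and an intra-household correlation of 0.5, `effective_sample_size(60000, 3, 0.5)` returns 30,000: half of the nominal sample is "independent" information. With a correlation of 0, the effective and nominal sample sizes coincide.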
SHADAC Contact Information www.shadac.org 2221 University Avenue, Suite 345 Minneapolis, Minnesota 55414 (612) 624-4802 Principal Investigator: Lynn Blewett, Ph.D. Co-Principal Investigator: Kathleen Call, Ph.D. Center Director: Kelli Johnson, M.B.A. Senior Research Associate: Timothy Beebe, Ph.D. Research Associate: Michael Davern, Ph.D.