Presentation is loading. Please wait.

Presentation is loading. Please wait.

Practical Approaches to Design and Inference Through the Integration of Complex Survey Data and Non-Survey Information Sources John L. Eltinge, Scott.

Similar presentations


Presentation on theme: "Practical Approaches to Design and Inference Through the Integration of Complex Survey Data and Non-Survey Information Sources John L. Eltinge, Scott."— Presentation transcript:

1 Practical Approaches to Design and Inference Through the Integration of Complex Survey Data and Non-Survey Information Sources John L. Eltinge, Scott Fricker & Daniel Yang, BLS Rachel M. Harter & Jamie Ridenhour, RTI International Biometric Society – ENAR Meetings Session #51 “Statistical Challenges of Survey and Surveillance Data in the U.S. Government” March 16, 2015 Thank you. I’d especially like to thank the organizer and chair for proposing this session as a way to explore common ground between sample survey work and biostatistics. And I believe that one especially promising part of that common ground centers on the integration of complex survey data with other information sources. For this presentation, my co-authors and I will outline some general concepts and practical approaches that can be useful in the integration of these data sources.

2 Acknowledgements and Disclaimer
The authors thank Jennifer Edgar, David Friedman, Steve Henderson, Brett McBride, Bruce Meyer, Bill Mockovak, Jay Ryan, Adam Safir and Lucilla Tan for many helpful discussions of topics considered in this paper. The numerical results in Section IV are part of a larger study in Fricker, Yang and Eltinge (2015). The views expressed here are those of the authors and do not necessarily reflect the policies of the U.S. Bureau of Labor Statistics, the U.S. Food and Drug Administration nor RTI International.

3 Overview I. Introduction: Data from Surveys and Alternative Sources
II. Framework for Integration III. Design Example: Frame Enrichment from Alternative Data Sources IV. Collection Example: Administrative Data for Sample Units We’ll develop those ideas in four sections. We’ll begin with a brief review of some distinctions between surveys and other data sources. That will lead to a general framework for integration of these sources, based on customary total survey error models. And then we will illustrate these general ideas with two examples. The first example involves enrichment of frames based on auxiliary information. And the second example centers on the possible use of administrative records to reduce data collection burden for some sample units.

4 I. Data from Surveys and Alternative Sources
A. Prospective Data Sources for Government Agencies, Other Large-Scale Statistical Organizations 1. Traditional sample surveys (e.g., Fuller, 1999): a. High degree of design control, replicability b. Specification of - Target population(s), parameter(s) - Components of uncertainty considered in inference In considering the landscape of potential data sources, we’ll begin with traditional sample surveys. At least in principle, we expect to have a high degree of design control and replicability for our surveys. And in parallel with that, we generally expect to have a clear specification of the target populations and parameters of interest. And we expect the design to provide information on the applicable components of uncertainty that can allow us to carry out formal inference, like construction of confidence intervals.

5 I. Surveys and Alternative Sources (Continued)
2. Alternative Data Sources: “Big,” “Non-designed” or “Organic Data” (Groves, 2011, 2013; Couper, 2013): - Generated for non-statistical purposes - Limited (or no) “design control” - Often “tall and thin” = “variable poor” a. Specialized admin (taxes, regulation, benefits) Ex: Automobile titles (transactions & tax) b. Commercial transactions Ex: Subscription lists To some degree, those ideas carry over to work with broader classes of non-survey data. Bob Groves, Mick Couper and others have used the terms “non-designed data” or “organic data” to describe data sources that developed originally for non-statistical purposes. Other persons use broader terms like “big data.” And discussions of these data generally highlight the fact that with very few exceptions, statistical organizations have little or no design control over the format and statistical features of those data. In addition, these data are often “tall and thin” – in the sense that the source covers many population units, but provides only a few variables for those units There are quite a few examples of “organic data” that are receiving attention at present. Some involve very specialized administrative sources. For example, BLS uses automobile title data to capture some information on consumer expenditures related to vehicle purchases – an important but narrow slice of the overall picture of consumer expenditures and prices. And there are potential sources of information on commercial transactions, like grocery scanner data and subscription lists.

6 I. Surveys and Alternative Sources (Continued)
c. Internal corporate files (with informed consent) Ex: Employment, wage, benefit and price files d. Web-scraped data on product features, prices e. Social media Ex: Unemployment, job openings (Shapiro, 2014) f. Search engine results Ex: Disease outbreaks (Google flu) Ex: Demographics (Cressie et al., 2013) For BLS employment, wage and benefit surveys, internal corporate files provide additional prospective data sources. And one could consider more speculative sources, like web-scraped data on price and product features; social media traces and search engine results.

7 I. Surveys and Alternative Sources (Continued)
C. Two General Approaches to Integration 1. Use alternative source to supplement the survey - Enhance sample frames - Target subpopulations in sampling - Improve unit contact, other fieldwork - Direct replacement of burdensome, expensive or error-prone survey items - Improved auxiliary info for edit, imputation or weighting These alternative sources are potentially valuable, but by themselves they generally do not provide sufficient information for valid population inferences. However, if we integrate these alternative sources with a sample survey, we often can produce high-quality population inferences at reduced costs in ways that would not be possible with either of the sources by itself. For integration of these sources, there are two complementary approaches. First, one can use alternative sources to supplement a sample survey. For example, organic data can enhance sample frames and to support the efficient targeting of specified subpopulations in sampling, as we will see in our first detailed example in a moment. In other cases, the auxiliary information can improve unit contact efforts – for example, by providing applicable telephone numbers – or help with other fieldwork steps. Also, as we will see in our second detailed example, one may consider using the alternative source to replace survey items that are especially burdensome, expensive, or error-prone. And finally, auxiliary information may help in practical edit, imputation or weighting procedures.

8 I. Surveys and Alternative Sources (Continued)
2. Focus primary attention on the “organic” data source (some good statistical properties, with limitations - coverage/representativeness, definitional issues, measurement biases, aggregation effects) a. Specialized sample survey - adjust for limitations Ex: U.S. Current Employment Survey b. Need to develop broader classes of supplementary survey designs On the other hand, in some cases, we can focus primary attention on the “organic” data source, provided that source has some good statistical properties, but with important limitations, like coverage problems, definitional issues or measurement biases. In this second scenario, we then develop a specialized sample survey to collect enough data to allow rigorous adjustments for these limitations, and thus allow satisfactory population inference. That general idea has been used for a long time. A prominent example is the U.S. Current Employment Survey. And the expanded availability of organic data opens the door for much deeper development of classes of these supplementary designs, and I believe that this will be a very rich area for future statistical research. But in the interest of time, the remainder of this talk will focus on the first scenario, in which we use organic data to supplement a standard survey.

9 II. Framework for Integration
A. General Design Goal: Balance Multiple Dimensions of Quality, Cost & Risk B. Six Quality Dimensions (e.g., Brackstone, 1999): Qualitative (timeliness, relevance, comparability, coherence and accessibility) Quantitative (total survey error model components) With that background in mind, we can think about extensions of survey concepts to allow the integration of multiple data sources. Stated at a very broad level, survey design requires us to balance multiple dimensions of quality, cost and risk. The quality dimensions involve several qualitative characteristics – like timeliness, relevance, comparability, coherence and accessibility. And they also involve a quantitative dimension – generally labeled accuracy – which includes all of the usual components of a total survey error model.

10 II. Framework for Integration (Continued)
C. Total Survey Error: An Estimator-Focused Approach: (Estimator) – (True value) = (frame error) + (sampling error) + (nonresponse effects) + (measurement error) + (processing effects) Andersen et al. (1979), Groves (1989), Weisberg (2005), Biemer (2010), Lyberg (2012), Kenett and Shmueli (2014), many others In this slide, we have a relatively simple form of a total survey error model, in which the difference between an estimator and the true target parameter value is viewed as the sum of several distinct error terms like frame error, sampling error, nonresponse effects, measurement error and processing effects. There is a very rich literature that has explored these models in depth, and I’ve cited a few of the seminal works and review papers at the bottom of this slide.

11 II. Framework for Integration (Continued)
D. Integration of non-survey sources with survey data fits with extensions of TSE models to non-survey settings, e.g., Biemer (2014) Davern (2007, 2009, 2010) FCSM (1980, SPWP #6) Herzog, Winkler and Scheuren (2007) Iwig et al. (2013, Data Quality Assessment Tool) IAOS (2008) Conference Proceedings Jabine and Scheuren (1985) Jeskanen-Sundstrom (2007) Ord and Iglarsh (2007) Penneck (2007) Royce (2007) Winkler (2009) Zhang (2009, 2011, 2012) In selected areas – primarily related to administrative records - statisticians have explored extensions of total survey error models. And I’ve listed a few of the references here.

12 II. Framework for Integration (Continued)
E. General approach: Integrate organic data source(s) to reduce magnitudes of TSE component(s), costs F. Properties center on empirical results, but we often have limited information in initial exploration 1. Extend standard design ideas to obtain needed additional information quickly and at low cost. 2. Often must supplement with sensitivity analyses For the most part, that literature has been descriptive in nature – providing tools for evaluation of the quality of an organic source. But those same tools provide a useful framework for integration of the organic data with our sample survey. In particular, we want to integrate those organic sources in a way that reduces the magnitudes of one or more of our total survey error components, or reduces costs without inflating those error components. In implementing those ideas, we have a MAJOR CHALLENGE because at initial design stages we often have limited empirical information on the magnitudes of these error components and costs. Consequently, we need to extend standard design ideas in ways that allow us to obtain some error and cost information quickly and at relatively low cost. In addition, we often need to supplement that information with sensitivity analyses.

13 III. Example: Frame Enrichment from Alternative Data Sources
A. Frames: List of Prospective Sampling Units 1. Example: Survey of youths aged 11 to 16 to evaluate an anti-tobacco media campaign 2. Design: Address-based sampling (ABS) a. Frames use updates from the U.S. Postal Service Computerized Delivery Sequence (USPS-CDS) file b. Overviews: Iannacchione (2011), Link (2010) So those are the general ideas, and the remainder of this presentation will illustrate those ideas with two brief examples. The first example involves use of auxiliary data to enrich our frame – or the list of households from which we will select our sample units. This arose in a survey of youths age 11 to 16 to evaluate an anti-tobacco media campaign. This particular survey used address-based sampling - or ABS - in which frames that incorporate updates from the U.S. Postal Service’s computerized delivery sequence file.  Iannacchione (2011) and Link (2010) provide useful overviews of ABS.

14 III. Frame Enrichment (Continued)
3. Supplementary information from some vendors a. For field operations: Indicators for vacancy, educational housing (dormitories), seasonal housing, post office boxes, matching with telephone lists b. Subpopulation membership indicators: Basic demographics, presence of children in the household c. Geography: latitude and longitude (matching with specific blocks, other Census geographical groups, jurisdictions) ABS frames generally come from vendors that are licensed to use the U.S.P.S. file updates, and the variables that accompany the addresses, such as indicators for vacancy, educational housing (dormitories), seasonal housing, post office boxes, and the like.  These vendors support direct marketing, and their proprietary frames include numerous additional variables to help identify households with various characteristics, such as basic demographics or the presence of children in the household.  Some vendors also offer to match addresses to telephone lists.  Often the addresses have latitudes and longitudes appended so that the addresses can be mapped to specific blocks or other census geographies.

15 III. Frame Enrichment (Continued)
B. Some research on the auxiliary variables in address- based frames or on marketing database variables appended to the frames: Amaya et al. (2014) DiSogra et al. (2010) Dekker and Murphy (2014) Harter and McMichael (2013) Hubbard et al. (2014) McMichael et al. (2014) Ridenhour et al. (2014) Roth et al. (2013) Valliant et al. (2014) The literature contains several evaluations of how well the address frames cover the population of housing unit.  Relatively fewer researchers have examined the quality and characteristics of auxiliary variables.  Some examples include Dekker and Murphy (2014), who examined the drop point[1] variable from the U.S.P.S. data.  Amaya et al. (2014) also examined the drop point variable, along with the vacancy flag and the college flag.  Harter and McMichael (2013) studied the coverage of telephone numbers appended to an address frame.  DiSogra et al. (2010) compared vendor auxiliary demographic data (age and education of head of household, race/ethnicity, income, marital status, home ownership, number of adults, presence of children) with survey responses to evaluate quality.  West et al. (2014) compared auxiliary data from two vendors in their ability to predict response propensity.  Hubbard et al (2014) compared the auxiliary data from two vendors in predicting age eligibility and survey outcomes.  McMichael et al (2014) and Valliant et al (2014) demonstrate that imperfect auxiliary data have the potential to improve stratification and design efficiency.  As the precursor to this paper, Ridenhour et al. (2014) examined specifically age and race/ethnicity in stratifying the sample for the anti-tobacco media evaluation. [1] Drop points are central locations for mail delivery serving multiple housing units without specific unit designations.  Drop points may be apartment buildings, gated communities, dormitories, assisted living facilities, or other clustered types of housing.

16 III. Frame Enrichment (Continued)
C. Tobacco-Use Study: 1. Initial strata and primary sample units based on large geographical aggregates 2. Within selected block groups: Further stratification of households based on refined frame information? Subpop membership: race-ethnic classification, presence of youth, specific ages of youth Smoking-related: education, income, other SES

17 III. Frame Enrichment (Continued)
D. Question: Worthwhile to use additional variables in frame? 1. Evaluation: Cost-variance trade-offs a. Costs of acquiring more variables, linking to current frame b. Possible cost reductions - Facilitate unit contact - Screening for targeted subpopulations

18 III. Frame Enrichment (Continued)
2. Variances of key estimators a. Prevalence (and initiation) rates for usage of specific types of tobacco b. Coefficients of related logistic reg models 3. Important cautionary note: Under differential sampling rates facilitated by enriched frames, variance-cost trade-offs may not be well approximated by customary measures like raw sample counts, design effects

19 III. Frame Enrichment (Continued)
4. Further complications: a. Quality of additional frame variables? - Proportion missing? Informative missingness? - Out of date (e.g., household move, dissolution) - Misclassification b. Impact on sampling properties of enriched frame?

20 III. Frame Enrichment (Continued)
E. Evaluate Efficiency of Prospective Alternative Design that Uses More Frame Variables 1. Use previous survey information for: a. Subpop “hit rates,” means, variances b. Relative costs for screening, in-depth data collection c. Data quality 2. Sensitivity analyses often required

21 IV. Collection Example: Administrative Data for Sample Units
A. U.S. Consumer Expenditure Survey: Consent to Link with Administrative Data 1. Large-scale household survey that collects many items on consumer expenditures, as well as income and assets 2. Voluntary responses, so respondent cooperation is crucial 3. Respondent concerns potentially include both burden and privacy So the tobacco example shows one way in which alternative data sources can help at an early stage of the survey process – frame refinement and sample selection. Alternative sources can also help in data collection and estimation work, and we’ll illustrate that idea with an example from the U.S. Consumer Expenditure Survey. The Consumer Expenditure Survey is a large-scale household survey managed by the Bureau of Labor Statistics, with fieldwork performed by the Census Bureau. This survey collects data on a wide range of consumer expenditures, as well as information on family income and assets. Response to this survey is voluntary, so it is important to maintain a high level of respondent cooperation. In working to maintain cooperation, we are sensitive to potential respondent concerns related to both burden and privacy.

22 IV. Administrative Data for Sample Units (Continued)
4. From 2011 CE Research Section: “We’d like to produce additional statistical data, without taking up your time with more questions, by combining your survey answers with data from other government agencies. Do you have any objections?” 5. Previous studies: Davis, Elkin, McBride and To (2013) As part of the effort to understand respondent concerns in greater depth, the 2011 research section of the Consumer Expenditure Survey included the question (read quotation). Data from this question were previously studied by Davis et al.

23 IV. Administrative Data for Sample Units (Continued)
B. Unweighted Summary of Responses (Davis et al., Table 2A.1): RESPOBJ Count Percent Yes, object 942 18.89 No, do not object 3951 79.24 Do not Know/Refusal 93 1.87 The Davis et al. analysis included the unweighted summary of responses to this question. (Read numerical results from table.)

24 IV. Administrative Data for Sample Units (Continued)
C. Sensitivity analysis - Impact on estimator quality if: 1. Ask “do you object” question: a. If unit objects: Treat as nonrespondent for linkage to certain govt-related variables b. If unit does not object: Link unit to obtain those variables from government source 2. Emphasize: Approach (1) not currently carried out in the field – requires exploratory analysis and legal/regulatory clearance Due to the concern about respondent burden, we wanted to explore the extent to which prospective replacement of survey responses with government administrative data – for specified variables – might allow the production of estimates of acceptable quality. Specifically, the idea is that if a sample unit objected to linkage with government records, then we would treat that unit as a item respondent for a given government-related variable. Conversely, if the sample unit did not object, we would obtain those variables from that government source. In posing this question, I will emphasize that for the current study, we have not linked our sample units with any auxiliary government source. Before doing that, we would need to carry out extensive exploratory analyses and have appropriate legal and regulatory clearance. For the remainder of this discussion, I will present some elements of the sensitivity analysis.

25 IV. Administrative Data for Sample Units (Continued)
D. Steps in Sensitivity Analysis: Extending previous “consent to link” literature (privacy vs. burden) 1. Propensity models for P(Do not object) Predictor variables from: - Demographics, socioeconomic status - Behavioral variables (respondent’s effort in responding, stated concerns about privacy, being too busy, etc.) - Interactions among some predictors For this sensitivity analysis, our first step was to fit models for the probability that a given sample unit would not object to linkage with government data. For this model, we used a large group of predictor variables. Some were standard demographic and socioeconomic status variables like age, gender and educational attainment of the reference person and housing tenure. In addition, we used several behavioral variables based on interviewer observations of the respondent’s effort in responding, stated concerns about privacy or being too busy, and related characteristics. In addition, we included some two factor interactions between these predictors.

26 IV. Administrative Data for Sample Units (Continued)
2. Explore Estimators of Population Means for: FINCBTAX: Total amount of family income before taxes = “Income” ZPROPTAX: Property taxes EVEHPUR: Vehicle purchase cost (outlays for vehicle purchases including downpayment, principal and interest paid on loans, or if not financed, purchase amount) = “Vehicle cost” We then used the results from those propensity models to explore the properties of estimators of population means for three variables that might be obtained from government administrative record files. The first is family income before taxes. The second is property tax paid by the consumer unit. The third is vehicle purchase cost. For audience members who are interested in the details of the Consumer Expenditure Survey public use data files, these three variables are labeled FINCBTAX, ZPROPTAX and EVEHPUR, respectively.

27 IV. Administrative Data for Sample Units (Continued)
3. Three estimation approaches: a. Use all data (as in current production) b. Use only data from the “not object” units, but with no weighting adjustment c. Use only data from the “not object” units, and adjust customary weights with additional factor: 1/P(Do not object to linkage with government data) 4. Standard errors account for the sample design through balanced repeated replication For each of these variables, we compared three estimation methods. The first uses data from all current respondents. This is what we use in current production. The second uses data only from the units that stated that they would not object to linkage of their data with government records. For this second approach, we used the standard survey weights, without any adjustment for omission of the units that objected to linkage. The third approach uses the same basic data as in the second approach, but adjusts weights by a factor equal to the inverse of the estimated probability that a given unit would not object to linkage with government data. Note that the final two approaches use the implicit assumption that data obtained through a government file would be the same as data obtained through our Consumer Expenditure interviews. For all of these approaches, standard errors were computed through customary balanced repeated replication methods – with Fay factor adjustments - that account for the complex sample design.

28 Full Sample Analysis Variable N Mean SE Income 4893 50939.00 1227.51
Property tax 454.15 10.41 Vehicle cost 599.59 33.22 This table presents the results of the first analysis, using all of the original sample respondents for these variables.

29 For “Not Object” Units: No Adjustment
Variable N Mean SE Income 3951 Property tax 429.12 10.76 Vehicle cost 619.14 37.05 This table presents results of the second analysis. Note that the number of applicable units have dropped from 4893 to 3951.

30 For “Not Object” Units: Propensity Adjustment
Variable N Mean SE Income 3915 Property tax 434.74 11.39 Vehicle cost 607.80 36.63 This table presents results from the third analysis, which included the propensity-score adjustment for the omission of units that objected to linkage with government records. Note: P.S.: Propensity Scores = 1 - P(Object).

31 Full Sample vs. Unadjusted “Not Object”
Variable Point Est. Diff. Agree - Full Income 580.98 3.32 Property tax -25.02 6.08 -4.12 Vehicle cost 19.55 10.83 1.81 𝑆𝐸 𝐵𝑅𝑅 ( 𝜃 𝐴𝑔𝑟𝑒𝑒 − 𝜃 𝐹𝑢𝑙𝑙 ) 𝑡( 𝜃 𝐴𝑔𝑟𝑒𝑒 − 𝜃 𝐹𝑢𝑙𝑙 ) This fourth table presents the differences between the mean estimates under the first and second approaches, respectively. The second column presents the differences in the mean estimates, the third column gives the corresponding standard errors for these differences, and the final column presents the associated t statistics. Based on these t statistics, we have fairly strong indications of a difference between the first and second approaches for the income and property-tax variables. In other words, if we consider linkage in the form presented here, it would be important to carry out adjustments to account for the approximately 20% of the units that objected to linkage, and thus were omitted.

32 Full Sample vs. “Not Object” with Adjustment
Variable Point Est. Diff. Agree P.S. - Full Income 619.09 1.90 Property tax -19.41 6.48 -2.99 Vehicle cost 8.21 15.34 0.54 𝑆𝐸 𝐵𝑅𝑅 ( 𝜃 𝐴𝑔𝑟𝑒𝑒 𝑃.𝑆. − 𝜃 𝐹𝑢𝑙𝑙 ) 𝑡( 𝜃 𝐴𝑔𝑟𝑒𝑒 𝑃.𝑆. − 𝜃 𝐹𝑢𝑙𝑙 ) This final table compares the results of the first and third approaches. Again here, the second column presents the differences between the mean estimates, the third column gives the standard errors of those differences, and the final column presents the associated t statistics. It is especially interesting to note here that even after adjustment for the propensity to consent to link, we are still seeing fairly strong indications of bias for estimation of the property-tax variable, and marginal indications of bias for the family income variable. Consequently, we would need to study alternative approaches before considering this type of linkage in practice.

33 V. Closing Remarks A. Summary: Integration of Multiple Data Sources 1. Distinct cases: a. Survey dominant, but supplemented by “organic” source – tobacco & CE examples b. “Organic” source dominant, with survey to “fill in the gaps” 2. Practical issue: Design and analytic approaches based on limited information, sensitivity analyses

34 V. Closing Remarks (Continued)
B. Prospective Extensions 1. Empirical assessment of cost and risk components a. Initial exploratory analyses & pilot studies b. During production processes 2. Other components of total survey error

35 John L. Eltinge Associate Commissioner Office of Survey Methods Research


Download ppt "Practical Approaches to Design and Inference Through the Integration of Complex Survey Data and Non-Survey Information Sources John L. Eltinge, Scott."

Similar presentations


Ads by Google