Description, Characterization and Optimization of Drill-Down Methods for Outlier Detection and Treatment in Establishment Surveys J. L. Eltinge, U.S. Bureau of Labor Statistics ICES III Session #66 – June 21, 2007
2 Acknowledgements and Disclaimer: The author thanks Jean-Francois Beaumont, Terry Burdette, Pat Cantwell, Larry Ernst, Julie Gershunskaya, Pat Getz, Howard Hogan, Erin Huband, Larry Huff, John Kovar, Mary Mulry and Susana Rubin-Bleuer for many helpful discussions. This paper expands on many of the ideas developed by Pat Cantwell originally in Eltinge and Cantwell (2006). The views expressed in this paper are those of the author and do not necessarily represent the policies of the U.S. Bureau of Labor Statistics.
3 Overview: I. Drill-Down Procedures for Outlier Detection II. Available Information III. Costs of Drill-Down Procedures IV. Risks of Drill-Down Procedures V. Optimization of Drill-Down Procedures
4 I. Drill-Down Methods of Outlier Detection A. Outliers: Extreme Values 1. Usually (not always) large positive values 2. Review Article: Lee (1995) 3. Variant on Chambers (1986): a. Representative outliers b. Non-representative outliers c. Gross measurement error
5 B. Predominant Literature Focuses on: 1. Extreme values of: a. Unweighted individual observation b. Weighted individual observation 2. The impact of (1.a) and (1.b) on estimators at fairly high levels of aggregation a. Means, totals, other descriptive quantities b. Regression coefficients, other analytic parameters
6 C. Drill-Down Methods 1. (Implicit) assumptions in most outlier literature: a. Low or zero cost of data review, relative to other cost components b. Reference distribution(s) known or readily determined at relatively low cost 2. Issues: a. For many surveys, modal task is data review - Substantial overall expense b. Reference distributions neither obvious nor readily obtained (especially for establishment surveys)
7 3. Some agency programs use drill-down procedures: a. Begin data review with examination of relatively fine estimation cells b. Identify estimation cells with extreme initial point estimates c. Examine microdata in identified extreme cells d. Limited formal literature; exceptions: Luzi and Pallara (1999), DiZio et al. (2005)
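Steps (a) through (c) can be sketched roughly as follows. This is a minimal illustration, not the procedure of any particular agency program: the record layout `(cell, weight, value)`, the per-cell `reference` values (for example, a prior-period estimate), the MAD-based scaling, and the threshold of 3 are all assumptions introduced here.

```python
from statistics import median

def weighted_cell_means(records):
    """records: list of (cell, weight, value) -> {cell: weighted mean}."""
    totals = {}
    for cell, w, y in records:
        wy, ws = totals.get(cell, (0.0, 0.0))
        totals[cell] = (wy + w * y, ws + w)
    return {c: wy / ws for c, (wy, ws) in totals.items()}

def extreme_cells(estimates, reference, threshold=3.0):
    """Flag cells whose (estimate - reference) is extreme relative to the
    other cells, scaled by the median absolute deviation (MAD)."""
    diffs = {c: est - reference[c] for c, est in estimates.items()}
    med = median(diffs.values())
    mad = median(abs(d - med) for d in diffs.values())
    if mad == 0:
        # Degenerate case: any cell that departs from the median is flagged.
        return [c for c, d in diffs.items() if d != med]
    return [c for c, d in diffs.items() if abs(d - med) / mad > threshold]

def drill_down(records, reference, threshold=3.0):
    """Return the microdata (weight, value) only for the flagged cells."""
    estimates = weighted_cell_means(records)        # step (a)
    flagged = extreme_cells(estimates, reference, threshold)  # step (b)
    return {c: [(w, y) for cell, w, y in records if cell == c]
            for c in flagged}                       # step (c)

# Hypothetical data: five cells, two respondents each.
records = [
    ("A", 10, 100), ("A", 10, 104),
    ("B", 5, 98), ("B", 5, 100),
    ("C", 8, 101), ("C", 8, 950),
    ("D", 4, 97), ("D", 4, 101),
    ("E", 6, 100), ("E", 6, 106),
]
reference = {c: 100.0 for c in "ABCDE"}
review_queue = drill_down(records, reference)
```

With these made-up numbers only cell C is queued for microdata review, so a reviewer looks at two responses rather than ten.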
8 D. Questions: 1. Under what conditions are drill-down procedures preferable to standard methods of outlier detection and treatment, based on a balanced assessment of: a. Available information b. Costs c. Risks 2. Does the characterization in (1) shed any light on possible approaches to optimization of drill-down procedures?
9 II. Available Information A. Usual Outlier Framework: Reference Distributions 1. Internal reference distribution: a. Outliers defined with respect to quantiles or other functionals of the full set of sample responses b. Limitations: Small subpopulations; time constraints 2. External reference distributions: a. Observations from similar surveys in previous periods b. Related data from frame or other administrative records c. Limitation: Full comparability?
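As one illustration of an internal reference distribution, the sketch below flags responses that fall outside quartile-based fences computed from the sample itself. The fence multiplier `k`, the minimum sample size `min_n`, and the data are assumptions for illustration; the `None` return stands in for the small-subpopulation limitation noted in (1.b).

```python
from statistics import quantiles

def quantile_fence_outliers(values, k=3.0, min_n=8):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    Returns None when the sample is too small for its own quantiles
    to serve as a usable reference distribution.
    """
    if len(values) < min_n:
        return None
    q1, _, q3 = quantiles(values, n=4)  # sample quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical responses for one subpopulation.
responses = [10, 12, 11, 13, 12, 14, 11, 250]
flagged = quantile_fence_outliers(responses)
too_small = quantile_fence_outliers([10, 12, 250])
```

Here `flagged` contains only the value 250, while the three-observation call returns `None` because no internal reference is available at that sample size.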
10 B. Information for Drill-Down Procedures 1. Cell level: a. Generally an implicit prior distribution based on: - Historical and seasonal patterns for an individual cell and related cells; recent aggregate changes - Special information on, e.g., strikes, weather b. Consider formalization through modeling or a full Bayesian framework?
11 2. Microdata level a. Individual observations from current or previous waves of the survey b. Again here, could consider formalization through a Bayesian approach c. For many cells, sample sizes too small to make direct use of tails of empirical distribution alone 3. For both cell and microdata level reviews, the critical values (and corresponding tail probabilities) often remain implicit
12 III. Costs of Drill-Down Procedures A. Review of fewer units at the microdata level should reduce costs B. Quantification of (A) depends on fixed and variable components, e.g., 1. Fixed costs of training for specific industry 2. Incremental cost of review of - one additional cell - one additional response within cell C. Evaluations in (B) complicated by 1. Peak-load staffing constraints 2. Limitations on available accounting information 3. Non-monetary cost constraints, e.g., time
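The fixed and incremental components in (B) suggest a simple linear cost comparison. The sketch below is only an accounting illustration: the linear form and every dollar figure and count are hypothetical, and it ignores the complications listed in (C).

```python
def review_cost(fixed_training, n_cells, cost_per_cell, n_units, cost_per_unit):
    """Linear cost: fixed training plus incremental cell and unit reviews."""
    return fixed_training + n_cells * cost_per_cell + n_units * cost_per_unit

# Hypothetical scenario: 5000 responses spread over 400 cells.
# Full microdata review: every response is examined, no cell-level pass.
full_review = review_cost(2000, 0, 15, 5000, 4)

# Drill-down: all 400 cells screened, then 300 responses in flagged cells.
drill_down_review = review_cost(2000, 400, 15, 300, 4)

savings = full_review - drill_down_review
```

Under these assumed figures the drill-down route costs 9200 against 22000 for full review; the comparison reverses if the per-cell screening cost is high relative to the per-unit cost.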
13 IV. Risks of Drill-Down Procedures A. Context for Development and Evaluation: Six Cases (Eltinge and Cantwell, 2006) Case 1: Traditional randomization-based inference for aggregates of the finite population Cases 2 and 3: View finite population as a realization of a superpopulation model Predict function of or estimate
14 Cases 1-3 have dominated literature to date Primary results: Bias-variance trade-offs Reduction of overall mean squared error Explicitly or implicitly use some modeling conditions, e.g., Weibull or other distributional assumption Randomization performance still of interest
15 Cases 4 and 5: View true finite population values as sum: where represents long-term smooth trend and represents an irregular component, of true values, both generated by superpopulation models (cf. some discussion of outliers in time series, e.g., Galeano et al., 2006) Prediction for functions of (Case 4) or superpopulation parameter (Case 5) Detailed development depends heavily on model-identification issues, available auxiliary information
16 Case 6: Distinguish between central portion and fringes of population Multivariate normal example: Within central ellipsoid Conceptual links with topcoding, disclosure limitation, core CPI Need to explore: Interest only in central quantiles (Rao et al., 1990; Francisco and Fuller, 1991), or in the core subpopulation as such?
17 B. Cell-Level Risks of Type I and Type II Error 1. Distinguish between a. Primary estimands (examined directly in drill-down procedure) - Risk of implicit overfitting within the selected cell b. Secondary estimands (not examined directly, but important for some subsequent publications) - Risk of masking outliers in dimensions orthogonal to the primary estimand 2. Impact on MSE for resulting primary, secondary estimators
18 2. Unit-level deletion within the extreme cells approximates the survey-weighted influence functions for the cell-level estimand: cf. standard literature on survey-weighted influence functions for aggregate-level estimands (Smith, 1987; Zaslavsky et al., p. 861): where
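The link between unit-level deletion and influence can be illustrated for one specific estimand. The sketch below assumes the cell-level estimand is a survey-weighted mean, which is a choice made here for illustration and not necessarily the estimand in the slide's omitted formula. For that case the exact deletion effect is w_i (y_i - theta) / (W - w_i), and the linearized influence value is w_i (y_i - theta) / W, where W is the sum of the weights.

```python
def weighted_mean(weights, values):
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

def deletion_effects(weights, values):
    """Exact change in the weighted mean when each unit is deleted in turn.
    Expects lists with at least two units."""
    theta = weighted_mean(weights, values)
    effects = []
    for i in range(len(values)):
        w_rest = weights[:i] + weights[i + 1:]
        y_rest = values[:i] + values[i + 1:]
        effects.append(theta - weighted_mean(w_rest, y_rest))
    return effects

def linearized_influence(weights, values):
    """First-order influence values w_i * (y_i - theta) / W."""
    theta = weighted_mean(weights, values)
    total_w = sum(weights)
    return [w * (y - theta) / total_w for w, y in zip(weights, values)]

# Hypothetical microdata for one flagged cell.
weights = [8.0, 8.0, 8.0]
values = [101.0, 99.0, 950.0]
effects = deletion_effects(weights, values)
influence = linearized_influence(weights, values)
most_influential = max(range(len(effects)), key=lambda i: abs(effects[i]))
```

The two measures rank units identically in this example and differ only by the factor W / (W - w_i), which is close to 1 when no single unit carries a large share of the cell's weight.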
19 C. Evaluation and Reduction of Risks Not Fully Reflected in Mean Squared Error 1. Squared error loss may not fully reflect risk functions of program managers, other stakeholders 2. Alternative: Risks associated with low-probability event that published estimate differs markedly from: a. True value b. Predicted value based on auxiliary information 3. Consider application of other risk measures, e.g., false discovery rate in machine learning D. Operational Risk: Will a given procedure for outlier detection and treatment be carried out as specified?
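For item (C.3), one widely used way to control the false discovery rate is the Benjamini-Hochberg step-up procedure. The slides do not name a specific method, so this choice, the idea of attaching a p-value to each cell, and the numbers below are all assumptions for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at FDR level q.

    Step-up rule: find the largest rank k with p_(k) <= k * q / m and
    reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k = rank
    return sorted(order[:k])

# Hypothetical p-values, one per estimation cell, for "cell is extreme".
cell_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                 0.060, 0.074, 0.205, 0.212, 0.216]
flagged_cells = benjamini_hochberg(cell_p_values, q=0.05)
```

With these values only the first two cells are flagged, whereas an unadjusted 0.05 cutoff would flag five; the trade-off is fewer false alarms at the cost of some missed cells.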
20 VI. Closing Remarks A. Summary: Drill-Down Procedures 1. Contrast with standard approaches to outliers and influential observations 2. Requires consideration of a. Available information b. Costs c. Risks 3. Optimization approaches
21 B. Alternatives to Current Drill-Down Procedures 1. Apply adaptive sampling procedures (Seber and Thompson, 1996) to selection of some cells for additional drill-down review a. Condition: Network structure informative for presence of outliers b. May be of special interest for outliers arising from gross errors from a common data-collection or administrative-record source c. Extend inference to account for cells that are not examined in depth
22 2. Instead of cells defined a priori (e.g., by geography, industry and size class), consider cells generated through tree-based machine learning methods (e.g., Breiman et al., 1984) a. Resulting properties depend on specific pruning method used for the trees b. Standard cross-validation methods have some limitations for complex survey data c. Screening to identify problems masked by customary cell structure
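To give a sense of how data-driven cells might be generated, the sketch below finds a single regression-tree split on one covariate by minimizing the within-group sum of squared errors. It is a deliberately simplified, unweighted stand-in for a full tree method: it performs no pruning or cross-validation, ignores survey weights and design, and the covariate (employment size), responses, and `min_leaf` value are hypothetical.

```python
def sse(ys):
    """Sum of squared deviations from the group mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys, min_leaf=2):
    """Return (cost, threshold) for the best single split on x,
    or None if no split leaves at least min_leaf units on each side."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical covariate values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cost, threshold)
    return best

# Hypothetical establishments: employment size and reported value.
size = [1, 2, 3, 4, 50, 60, 70, 80]
reported = [10, 11, 9, 10, 100, 105, 98, 102]
cost, threshold = best_split(size, reported)
```

Applying such a split recursively would produce cells defined by the data instead of by a priori classifications; in this toy example the split falls between the small and large establishments.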