The Statistical Administrative Records System and Administrative Records Experiment 2000: System Design, Successes, and Challenges Dean H. Judson Planning, Research and Evaluation Division U.S. Census Bureau
Outline of Presentation General principles for using administrative records properly Overview of StARS/AREX history, goals and design Applications and evaluations: StARS 1999 and StARS 2000 versus Census 2000
General Principles for Using Administrative Records Properly
How Administrative Records Are Created and Used Policy changes which change the definition of events and objects Ontologies and thresholds for observation Data entry errors and coding schemes Data management issues Query structure and spurious structure Data collection
Some Important Principles
Database ≠ Population!
Database ≠ Truth!
The true Data exist in the real world, as does the true Population. But the database gives us information that points to the Truth, and points to the Population.
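[Figure: two diagrams, each comparing the population held in a database with its target population.]
Population in StARS Database versus Resident U.S. Population on April 1, 2000. Labeled regions: Deceased; Non-U.S. Residents; Accidental Duplication.
Population in Employee Database versus Current employees of Company X, October 1, 2001. Labeled regions: Terminated, not yet entered in database; Accidental Duplication; Oops! Accidentally included contractors!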
Ontologies and Data Quality
Data Quality: The function that maps from real world to database allows one to reconstruct the real world from the database values. Source: Wand and Wang, 1996:90
[Figure: four panels, each mapping real-world states to database states: Proper Representation, Incomplete Representation, Ambiguous Representation, Meaningless States.]
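The four panel labels can be made concrete with a small check. The following is a minimal sketch under a simplified reading of Wand and Wang (1996), not their formal definitions: the function name, the pair-based encoding of the mapping, and the state names in the usage note are all assumptions made for illustration.

```python
from collections import defaultdict


def classify_representation(real_states, db_states, pairs):
    """Classify a mapping between real-world states and database states.

    pairs is an iterable of (real_state, db_state) tuples, meaning that
    db_state is used to record real_state. (Hypothetical encoding.)
    """
    pairs = list(pairs)
    represented = {rw for rw, _ in pairs}
    meanings = defaultdict(set)  # db_state -> real-world states it stands for
    for rw, db in pairs:
        meanings[db].add(rw)

    problems = []
    # Incomplete: a real-world state has no database state at all.
    if set(real_states) - represented:
        problems.append("incomplete")
    # Ambiguous: one database state stands for several real-world states,
    # so the real world cannot be reconstructed from the database value.
    if any(len(rws) > 1 for rws in meanings.values()):
        problems.append("ambiguous")
    # Meaningless: a database state that corresponds to no real-world state.
    if set(db_states) - set(meanings):
        problems.append("meaningless")
    return problems or ["proper"]
```

For example, mapping real-world states 1 and 2 onto the same database state "a" is reported as ambiguous, while a database state that nothing maps to is reported as meaningless.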
Coverage versus Intensity/Content: How can we get the best of both?
[Figure: chart with axes Coverage of Target Population (Low to High) and Intensity/Content of Data Collection (Low to High). Plotted: Administrative Records/Data Warehouse; Careful, well-done sample survey.]
A Model for Borrowing Strength
[Figure: flow diagram with boxes Original DW Database (X); Representative Sample of X; Carefully Collected Data (Y); X; Ground Truth; Estimated Model: Y=f(X); Augmented DW Database, with X and estimated Ys.]
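The idea on this slide can be sketched in a few lines. This is only an illustration: it assumes a single numeric X, takes f to be a straight line fitted by ordinary least squares, and uses hypothetical function names. The slide itself does not say what form f takes.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def augment(database_x, sample_x, sample_y):
    """Estimate Y for every database record from a model fit on the sample.

    sample_x / sample_y: X and carefully collected Y for the representative
    sample. database_x: X for the full data warehouse.
    """
    intercept, slope = fit_line(sample_x, sample_y)
    return [(x, intercept + slope * x) for x in database_x]


# Usage: the sample supplies the relationship, the database supplies coverage.
augmented = augment(database_x=[0, 5, 10], sample_x=[1, 2, 3], sample_y=[2, 4, 6])
```

The design point is that the expensive measurement (Y) is collected only for the sample, while the model carries it to records that have X alone.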
Statistical Administrative Records System and Administrative Records Experiment
Background and History Statistical Administrative Records System –Six large Federal input files: IRS 1040, IRS 1099, Selective Service, Medicare, Indian Health Service, HUD-TRACS/MTCS –One lookup file: SSA/Census NUMIDENT AREX 2000 –Attempt to use StARS data to simulate administrative records census
What Was the Purpose of StARS 1999 and AREX 2000?
Test the feasibility of an administrative records census
–StARS: Nationwide
–AREX: two counties in Maryland, three in Colorado
MD: 1.4M persons in 558K households
CO: 1.2M persons in 459K households
Test two methods for conducting an administrative records census
–top-down method
–bottom-up method (match to address list, additional operations)
Can We Do This?
Title 13, U.S. Code (§6(a)-(c), abridged):
–The Secretary…may call upon any other department…of the Federal Government…for information pertinent to the work provided for in this title…To the maximum extent possible, the Secretary…shall use [such] information instead of conducting direct inquiries
Privacy Act, 1974 (Title 5 §552a, abridged):
–No agency shall disclose any record…unless…to the Bureau of the Census for purposes of planning or carrying out a census or survey or related [title 13] activity
–Each agency that maintains a system of records shall…publish in the Federal Register upon establishment…the existence and character of the system of records (Published StARS in FR, January 1999)
The Statistical Administrative Records System-1999
[Figure: processing flow diagram; record counts as labeled on the diagram.]
Input files: TY98 IRS 1040 119,946,193; TY98 IRS 1099 598,075,971; Medicare 56,837,022; Selective Service 13,176,234; HUD TRACS 3,342,234; Indian Health Service 3,106,821
Edited files: Edited IRS 1040 243,260,776; Edited IRS 1099; Edited Medicare; Edited Selective Service; Edited HUD TRACS; Edited Indian Health Service
NUMIDENT files: NUMIDENT 676,589,439; Census NUMIDENT 396,185,872; Person Characteristics File (PCF) 396,185,872
Address side: Address Processing 795,742,702; Hygiene & Unduplication 136,154,293; Geocoding 102,965,122 (75.6% Coded), 33,189,171 (24.4% Uncoded)
Person side: Person Processing 875,750,973; SSN Validation (PVS) 844,945,296 Valid (96.5%); Invalid SSNs 30,805,677 (3.5%); Unduplication 279,601,038; Remove Deceased/Create Composite Record 257,764,909
Extraction of AREX Test Site Records: 1,459,760 in Baltimore Site; 1,229,274 in Colorado Site
Models: Race Model; Gender Model; Mortality Model
Other labels: TIGER; Code 1; ABI ?; Research
Statistical Administrative Records System-2000 (DRAFT)
[Figure: processing flow diagram; record counts as labeled on the diagram.]
Input files: TY99 IRS IMF 124,729,862; TY99 IRS IRMF 583,642,950; Medicare 59,198,432; Selective Service 13,370,053; HUD TRACS 1,991,672; HUD MTCS 6,232,562; Indian Health Service 2,730,407
Edited files: Edited IRS IMF 253,825,653; Edited IRS IRMF 568,109,788; Edited Medicare 59,197,759; Edited SSS 14,538,895; Edited HUD TRACS 1,991,655; Edited MTCS 6,208,615; Edited IHS 2,728,548
NUMIDENT files: NUMIDENT 721,228,119; Census NUMIDENT 408,447,131; Person Characteristics File (PCF) 408,447,131
Address side: Address Processing 725,230,009; Hygiene & Unduplication 158,593,956; Geocoding 125,647,359
Person side: Person Processing 905,432,071; SSN Validation 895,196,891; Invalid SSNs 10,235,180; Unduplication 289,968,449; Remove Deceased/Create Composite Record 265,950,850
Models: Race Model; Gender Model; Mortality Model
Other labels: TIGER/MAF; Code 1; ABI ?
Administrative Records Experiment in 2000 (AREX 2000) Five selected sites in Maryland and Colorado –MD: Baltimore city, Baltimore county; –CO: El Paso county, Douglas county, Jefferson county Attempt to simulate an Administrative Records Census Not all aspects of an Administrative Records Census are simulated –Group Quarters survey –Coverage measurement survey Special operations not included in StARS –Request for physical address (PO boxes/Rural Routes) –Clerical hand geocoding –Field verification of addresses not matched to DMAF
AREX 2000 Evaluations
Process: Analyzing selected components of the AREX implementation processing
Outcomes: Block level analysis: Age/Race/Sex/Hispanicity comparisons to Census 2000
Household level analysis:
–Comparing household distributions for matched addresses
–Assessing the feasibility of using administrative records in lieu of a field interview to obtain data on nonresponding households
Available at www.census.gov/pred/www/rpts.html#AREX (Synthesis of results from the Administrative Records Experiment in 2000)
Characteristics of Files Included in the StARS System
IRS Individual Master 1040 File:
–Tax year data; April, 2000 refers to tax year 1999
–TY 99 file arrives October, 2000
–Business entities, estates, other institutions included
–~120 million return records/year; maximum of six person records per return
–Households below the filing threshold do not need to file
–Late filers differ systematically from early filers
–Tax Filing Unit ≠ Housing Unit: 10-20% of addresses are PO Boxes, business addresses, tax preparers (Czajka, 2000)
–TY95+: SSNs of dependents requested, recorded
–0.5% of primary filers', 1.6% of secondary filers', 3.4% of dependents' SSNs in error (Czajka, 1987)
–Age, race, sex, Hispanic origin microdata not available
Characteristics of Files Included in the StARS System, cont.
IRS Information Returns Master File:
–Tax year data; April, 2000 refers to tax year 1999
–TY 99 file arrives October, 2000
–Business entities, estates, other institutions included
–~700 million records/year
–Recipient address ≠ Housing Unit
–10-20% of addresses are PO Boxes, business addresses, tax preparers
–Extremely limited microdata content: Age, race, sex, Hispanic origin microdata not available; name information often truncated
–Possible source of information on undocumented persons
Characteristics of Files Included in the StARS System, cont. Selective Service File: –Requested 4/1/99(00) file cut date –~13 million records –Registration required in 1940, suspended in 1975, resumed in 1980 –Presumably, males 18-25 are required to inform SSS when they move –Females, non-immigrant aliens, hospitalized, incarcerated, and institutionalized males, and members of the armed forces are exempt –Limited microdata content: Race, Hispanic origin microdata not available –Address information may not be current
Characteristics of Files Included in the StARS System, cont.
Medicare Enrollment Database (EDB):
–Requested 4/1/99(00) file cut date; current and historical Medicare enrollment (Active and Inactive cases)
–~40 million records at any one point in time
–Recipient Address ≠ Housing Unit
Proxy recipients listed on the file (e.g., John Doe's benefits c/o Jane Doe; John Doe's benefits c/o nursing home)
–Used in population estimates system for 65+ household population estimates
–A small portion of records at any point in time are almost certainly for deceased persons (Kim and Sater, 2000)
–Coverage is high (93-102%) but not perfect and unevenly distributed geographically
Snowbird states appear to have lower ratios of Medicare to 65+ population than non-snowbird states (Kim and Sater, 2000)
Characteristics of Files Included in the StARS System, cont.
Indian Health Service patient file:
–Requested 4/1/99(00) file cut date
–~10 million patient/transaction records
–Transaction record ≠ person record
–Unduplication: about 10 million patient records, 2 million unduplicated SSNs
–Many missing SSNs (about 20%)
–Integral part of our race model
Characteristics of Files Included in the StARS System, cont. Housing and Urban Development Tenant Rental Assistance Certification System (HUD-TRACS/MTCS): –Requested 4/1/99(00) file cut date –HUD subsidy payments –TRACS 1999: ~ 3.3 million records –TRACS 2000: ~ 2 million records –Short form data for all members of household (Race/Hispanic only for head of household) –Address information may represent project or landlord address
Characteristics of Files Included in the StARS System, cont.
Census NUMIDENT File:
–~700 million transaction records, 400 million individual SSN records
–Post 1985: Enumeration at birth
–For each SSN: Date of birth, gender, race, place of birth
About 50-60 million persons on the file are deceased but not identified as such
No current residence information on the file
Taxpayer ID Numbers (TINs) not on the file
Demographic properties:
–About 35% of SSNs on file have alternate names (marriage, divorce, etc.)
–About 6% missing gender
–Race coding has changed (prior to 1980, 3 races: White, Black, Other); 20% either unknown or other
–About 25% of SSNs have transactions with different race codes
Creating Final StARS Database Select best address and demographics based on –geocodability –currency –quality Impute missing demographics (from NUMIDENT/PERSON CHARACTERISTICS FILE) Flag records for deceased people Final database is like the census
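The "select best address" step can be pictured as a ranking over the candidate addresses a person has across source files. The sketch below is hypothetical: the record fields, the numeric quality score, and the strict priority order (geocodability first, then currency, then quality) are assumptions chosen to mirror the order of the bullets, not the documented StARS decision logic.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CandidateAddress:
    source: str     # file the address came from, e.g. "Medicare"
    address: str
    geocoded: bool  # could the address be assigned to census geography?
    as_of: date     # date the source reported the address
    quality: int    # hypothetical source-quality score, higher is better


def select_best_address(candidates):
    """Pick one address per person: geocodable, then most current, then quality."""
    if not candidates:
        raise ValueError("person has no candidate addresses")
    return max(candidates, key=lambda c: (c.geocoded, c.as_of, c.quality))


# Usage with made-up records: the geocodable address wins even though
# the other candidate is more recent.
older = CandidateAddress("IRS 1040", "101 ELM ST", True, date(1998, 4, 15), 2)
newer = CandidateAddress("Medicare", "PO BOX 12", False, date(1998, 10, 14), 3)
best = select_best_address([older, newer])
```

A tuple key keeps the priority order explicit, so changing the rule (say, currency before geocodability) is a one-line edit.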
Address Processing Results (StARS 1999) Almost 800 million addresses at start About 6 percent identified as potential businesses 136 million address records after unduplication About 75 percent geocoded –85 percent geocoding rate for city-style addresses
Person Processing Results (StARS 1999) 875 million records at start 845 million have valid SSN record (96.5%) 280 million after unduplication by SSN 261 million after removal of known deceased 257 million after removal of known deceased and persons residing in outlying territories StARS 2000: 266 million after removal of known deceased before April 1, 2000 and persons residing in outlying territories
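The percentages on this slide can be checked against the exact record counts shown on the StARS 1999 flow diagram. The sketch below is plain arithmetic on those counts; the stage labels and the function name are paraphrased for illustration.

```python
# Exact counts as labeled on the StARS 1999 flow diagram.
STAGES = [
    ("Person processing (start)", 875_750_973),
    ("Valid SSN after validation", 844_945_296),
    ("After unduplication by SSN", 279_601_038),
    ("After remove deceased / composite record", 257_764_909),
]


def retention(stages):
    """Return (name, count, % of start, % of previous stage) for each stage."""
    start = stages[0][1]
    previous = start
    rows = []
    for name, count in stages:
        rows.append((name, count,
                     round(100 * count / start, 1),
                     round(100 * count / previous, 1)))
        previous = count
    return rows


for name, count, pct_start, pct_prev in retention(STAGES):
    print(f"{name:42s} {count:>12,d} {pct_start:6.1f}% {pct_prev:6.1f}%")
```

The second row reproduces the 96.5% valid-SSN rate quoted on the slide. The large drop at unduplication reflects many source records per SSN rather than a loss of persons.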
Additional Operations of AREX 2000 Clerical geocoding Request for physical address (for P.O. Boxes, Etc.) Match to Decennial Master Address File Field address verification
Major Analytic Issues with StARS Processing
Ontologies
–The way in which an administrative agency defines the world may not match the way the Census Bureau defines the world, e.g.,
–A delivery address suitable for receiving a payment check may not suffice for putting individuals at a street address
–Difficult to distinguish individual units within the Basic Street Address
–Race coding: Hispanic Origin is a separate race on NUMIDENT
–Transaction data ≠ person data
–How many names does a person have (and in what order)?
Proxies: IRS & Medicare records
JOHN WILSON
C/O MARY SMITH
1004 LAUREL LANE
ROCKMONT, MD 22345
The address is (presumably) for Mary Smith. John Wilson may or may not live there.
Major Analytic Issues with StARS Processing, cont.
Addresses that are difficult to place on the ground
–About 10% of addresses are rural style
–PO Boxes: 45% for IHS, 9.5% for Medicare, 7.5% for IRS 1040, 6.8% for SSS, 3.8% for IRS 1099, 0.4% for HUD-TRACS (Huang and Kim, 2000)
–1995 IRS/CPS match: 86.5% of tax return cases had the same address as residence address, 94% coded to same county (Sater, 1995)
John Smith
H&R BLOCK
P.O. BOX 12
GREENWAY, MD 29752
–Addresses with both business and residential components
Dean H. Judson
JUDSON OLD GROWTH LOGGING SERVICES
45850 BACKWOODS HIGHWAY
BOONDOCKS, OR 96432
Major Analytic Issues with StARS Processing, cont.
Unduplication and matching
–Addresses and personal characteristics are measured with substantial variation
It is often not obvious whether a particular pair of records represents a duplicate or not. Yet, with multiple files, unduplication decisions must be made.
–Address matching:
101 Elm Rd, # 197132
101 Elm St, apt 197701
Versus
101 Elm Rd, #197132
101 Elm St, apt 197132
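To see why these address pairs are hard, it helps to compare them field by field instead of as whole strings. The following is a minimal sketch, reading the slide's four addresses as two candidate pairs (an assumption). The regular expression only handles the simple "number name suffix, unit" shape used in the example, and the function names are hypothetical; it is not the StARS address-matching software.

```python
import re

# number, street name, street suffix, then a unit introduced by "#" or "apt".
_ADDRESS = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<name>.+?)\s+(?P<suffix>\w+)\s*,\s*"
    r"(?:#|apt\.?)\s*(?P<unit>\w+)\s*$",
    re.IGNORECASE,
)


def parse(address):
    """Split an address string into upper-cased components."""
    match = _ADDRESS.match(address)
    if match is None:
        raise ValueError(f"unrecognized address format: {address!r}")
    return {field: value.upper() for field, value in match.groupdict().items()}


def compare(a, b):
    """Report, per component, whether two addresses agree."""
    pa, pb = parse(a), parse(b)
    return {field: pa[field] == pb[field] for field in pa}


first_pair = compare("101 Elm Rd, # 197132", "101 Elm St, apt 197701")
second_pair = compare("101 Elm Rd, #197132", "101 Elm St, apt 197132")
```

Both pairs disagree on the street suffix, but only the second pair agrees on the unit number, which is the kind of evidence a matching rule has to weigh when deciding whether two records are duplicates.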
Major Analytic Issues with StARS Processing, cont.
Variations in data from different sources
–Of the 50% of SSNs found on multiple files:
about 1% have more than one gender recorded
about 32% have multiple addresses
about 2% have multiple races (Huang and Kim, 2000)
Imputation from the NUMIDENT
–Many files have limited microdata. For those that are found on the NUMIDENT, we can impute microdata from the approximately equivalent NUMIDENT fields.
Race Model (Bye, 1998, 1999)
Gender Model (Thompson, 1999)
Mortality Model (Falkenstein, Resnick, and Judson, 2000)
–StARS 2002 NUMIDENT Race Enhancement
Match NUMIDENT to Census 2000
Use Census 2000 race response to improve imputation model
Major Analytic Issues with StARS Processing, cont.
Changing information states
–Distinct problem from point in time data collection
–Information states change over time/over databases
Address information ages over time and varies over databases
Example: SAM SMITH has a rural route box address (BOX 2) in WESTPORT, VA 32784 (dated 10/14/98 from Medicare) and a MAIN STREET address in FAIRFIELD, VA 33412 (from TY97 IRS file, filed sometime in 1998)
Mortality information ages over time and varies over databases
One database provides information about the other, provided that matching can be performed
Data processing requires complex, and substantively important, decision logic at each step
Applications
SSN search and validation with GEOkey
–Earlier: 90% found in validation step, 5% in search step
–2001 Evaluation: 92% found in search (with GEOkey) alone
–Apparently, our computer search outperforms SSA manual system
CPS/NHIS/ACS to Census matching evaluations
–Compare different race responses
–Compare survey and Census coverage
–Compare variations in Poverty estimates
Evaluation of synthetic estimation methods (Popoff, Judson and Fadali, 2001)
Multiple-system Estimation for coverage evaluation
–Additional information to aid dual-system estimation (Asher and Fienberg, 2001)
–Erroneous enumerations (Biemer, Brown, Wiesen, and Judson, 2001)
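For readers unfamiliar with dual-system estimation, the textbook two-list (Lincoln-Petersen) estimator is sketched below. It assumes the two lists are independent, matching is error-free, and there are no erroneous enumerations; the cited papers address exactly the cases where those assumptions fail, so this is background only and not the method used in those evaluations. The function name and the numbers in the usage line are made up.

```python
def dual_system_estimate(n_list1, n_list2, n_matched):
    """Basic dual-system (Lincoln-Petersen) population estimate.

    n_list1:   persons counted on the first list (e.g. a census)
    n_list2:   persons counted on the second list (e.g. a coverage survey
               or an administrative records file)
    n_matched: persons found on both lists
    """
    if n_matched <= 0:
        raise ValueError("estimator is undefined without matched persons")
    return n_list1 * n_list2 / n_matched


# Usage: 100 on list one, 80 on list two, 40 on both.
estimate = dual_system_estimate(100, 80, 40)
```

The intuition is that the share of list two also found on list one (40 of 80) estimates list one's coverage, so 100 counted persons imply a population of about 200.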
Evaluations Numident/PCF 1998 versus 1998 National estimates (Miller, Judson and Sater, 2000) State level comparisons of StARS 2000 versus Census 2000 County StARS-synthetic methods versus county ratio estimates and Census 2000 Detailed comparison by (fully crossed) age, race, sex, and Hispanic origin counts versus Census 2000, at the county level AREX tract, block, household evaluations on February 19th
Numident/PCF 1998 versus 1998 National Estimates
State Level Comparisons of Census 2000 to StARS 2000
County StARS-synthetic Methods versus 1999 Estimates
County StARS-synthetic Methods versus 1999 Estimates versus Census 2000
[Figure: bar chart of % Hispanic (StARS 99 vs. 99 Estimates vs. Census 2000) for selected counties in Colorado where StARS and Estimates deviate by more than 4 percentage points. Series: StARS 99; Census 2000; 99 Estimates. Counties shown: Alamosa, Archuleta, Bent, Chaffee, Conejos, Costilla, Crowley, Fremont, Garfield, Huerfano, Kiowa, La Plata, Las Animas, Lincoln, Mineral, Morgan, Otero, Phillips, Pueblo, Saguache, San Juan.]
Counties in which StARS 99 is closer to Census 2000 are marked with a star.
Fully crossed age, race, sex, and Hispanic Origin array (ARSH array) For every county in the U.S., count the number of nondeceased persons by: –Single year of age (0,101+) –Race (four groups) –Sex (two groups) –Hispanic origin (Hispanic/non) –Potentially 102 x 4 x 2 x 2 = 1632 cells per county, 3141x1632 = 5,126,112 in the U.S. Error Measures: –Simple difference (C-S) –Algebraic percent error (S-C)/C
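The cell counts and the two error measures on this slide can be written out directly. In the sketch below C is the Census 2000 count and S is the StARS count for one county ARSH cell, as on the slide; returning None for cells where the census count is zero is an assumption, since the slide does not say how such cells are handled.

```python
AGE_GROUPS = 102       # single years 0-100 plus 101+
RACE_GROUPS = 4
SEX_GROUPS = 2
HISPANIC_GROUPS = 2    # Hispanic / non-Hispanic
COUNTIES = 3141

cells_per_county = AGE_GROUPS * RACE_GROUPS * SEX_GROUPS * HISPANIC_GROUPS
total_cells = COUNTIES * cells_per_county


def simple_difference(census, stars):
    """Simple difference, C - S."""
    return census - stars


def algebraic_percent_error(census, stars):
    """Algebraic percent error, (S - C) / C, as a fraction.

    Multiply by 100 to express it as a percentage. Undefined when the
    census cell is empty, so None is returned in that case.
    """
    if census == 0:
        return None
    return (stars - census) / census


print(cells_per_county, total_cells)       # 1632 5126112
print(simple_difference(100, 90))          # 10
print(algebraic_percent_error(100, 90))    # -0.1
```

Note the opposite sign conventions: a StARS undercount gives a positive simple difference but a negative algebraic percent error.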
Note: Each data point is a single county's ARSH cell.
Age/Sex distributions, selected counties in Texas
[Figure: four panels: Anderson County (N of Houston); Andrews County (Far west, NM border); Atascosa County (Southern part of state); Brazos County (W of Houston).]
Concluding Thoughts
Historians of science will say that there was an explosion of research into Administrative Records and Data Warehousing in the late 20th/early 21st century
Using these databases in a statistically principled way requires a new statistical paradigm:
–Not survey sampling per se
–Not econometric modeling per se
–Not coverage measurement per se
–Something new
These databases share some data quality issues with the usual survey or census data, but many of their issues are different
We are attacking these issues with real Census applications
For Further Reading
Alvey, W., and Scheuren, F. (1982). Background for an Administrative Records Census. Proceedings of the Social Statistics Section. Alexandria, VA: American Statistical Association.
Asher, J., and Fienberg, S. (2001). Statistical Variations on an Administrative Records Census. Proceedings of the Social Statistics Section. Alexandria, VA: American Statistical Association.
Biemer, P., Brown, G., Wiesen, C., and Judson, D.H. (2001). Triple system estimation in the presence of erroneous enumerations. Proceedings of the Social Statistics Section. Alexandria, VA: American Statistical Association. Under review at the Journal of Official Statistics.
Bye, B. (1997). Administrative Record Census for 2010 Design Proposal, Final Report. Rockville, MD: Westat, Inc.
Bye, B. (1998). Race and ethnicity modeling with SSA Numident Data: Interim report: File development and tabulations. Unpublished document available from the U.S. Bureau of the Census.
Bryant, C. (1995). Comparing the LUCA address list to local records. Paper presented at the 1995 State Data Center Meeting, San Francisco, CA, April 4, 1995.
Czajka, J., Moreno, L., and Schirm, A.L. (1997). On the Feasibility of Using Internal Revenue Service Records to Count the U.S. Population. Washington, DC: Mathematica Policy Research, Inc.
Czajka, J. (1999). Can we count on administrative records in future U.S. Censuses? Presentation at the Bureau of the Census, December 15, 1999.
Falkenstein, Matthew, Resnick, Dean R., and Judson, Dean H. (2000). The Mortality Module of the Statistical Administrative Records System. Administrative Records Memorandum Series, U.S. Census Bureau.
Farber, Jim, and Shaw, Kevin M. (2002). Dual System Estimates of Housing Units Based on Administrative Records. To appear in the 2002 Proceedings of the American Statistical Association, Government Statistics Section [CD-ROM], Alexandria, VA: American Statistical Association.
Heimovitz, Harley K. (2002). Administrative Records Experiment 2000: Outcomes. To appear in the 2002 Proceedings of the American Statistical Association, Government Statistics Section [CD-ROM], Alexandria, VA: American Statistical Association.
Huang, E., and Kim, J. (2000). One Percent Sample Study Report (SRD-DRAFT). Unpublished document available from the U.S. Bureau of the Census, February 10, 2000.
For Further Reading
Judson, D.H., and Popoff, C.L. (2000). Research Use of Administrative Records. University of Nevada: Nevada State Demographer's Office.
Judson, D.H. (2000). The Statistical Administrative Records System: System Design, Successes, and Challenges. Paper presented at the 2000 Data Quality Workshop, Morristown, NJ, Nov 30-Dec 1.
Judson, D.H., Popoff, Carole L., and Batutis, Michael (2001). An Evaluation of the Accuracy of U.S. Census Bureau County Population Estimation Methods. Statistics in Transition, 5:185-215.
Judson, D.H. (2001). A Partial Order Approach to Record Linkage. Paper presented at the Federal Committee on Statistical Methodology, Washington, DC, November 14, 2001.
Judson, D.H. (2002). Adventures in Bayesian Record Linkage. Paper presented at the Classification Society of North America, June 11, 2002.
Judson, Dean H. (2002). Merging Administrative Records Databases in the Absence of a Register: Data Quality Concerns and Outcomes of an Experiment in Administrative Records Use. Paper presented at the UNECE-EUROSTAT work session on registers and administrative records in social and demographic statistics, Geneva, Switzerland, 9-11 December 2002.
Kim, M. O., and Sater, D. (2000). Defining the Medicare Data Universe for the U.S. Census Bureau's Population Estimates Program. Paper presented at the Southern Demographic Association meetings, New Orleans, LA, August 29, 2000.
Leggieri, Charlene, and Prevost, Ron (1999). Expansion Of Administrative Records Uses At The Census Bureau: A Long-Range Research Plan. Paper presented at the November 1999 Meeting of the Federal Committee on Statistical Methodology, Washington D.C.
Miller, E., Judson, D.H., and Sater, D. (2000). The 100% Census NUMIDENT: Demographic Analysis of Modeled Race and Hispanic Origin Estimates Based Exclusively on Administrative Records Data. Paper presented at the Southern Demographic Association meetings, New Orleans, LA, August 29, 2000.
Popoff, C.L., Judson, D.H., and Fadali, Betsy (2001). Measuring the Number of People Without Health Insurance: A Test of a Synthetic Estimates Approach for Small Area Estimates using SIPP Microdata. Paper presented at the Federal Committee on Statistical Methodology, Washington, DC, November 14, 2001.
For Further Reading
Sailer, P., Weber, M., and Yau, E. (1993). How Well Can IRS Count the Population? 1993 Proceedings of the Survey Research Methods Section. Alexandria, VA: American Statistical Association.
Sater, D. (1995). Differences in Location of Households and Tax Filing Units. Paper presented at the 1995 meeting of the Population Association of America, San Francisco, CA, April 6, 1995.
Stuart, E., and Zaslavsky, A.M. (2002). Using administrative records to predict census day residency. In Constantine Gatsonis, Robert E. Kass, Alicia Carriquiry, Andrew Gelman, David Higdon, Donna K. Pauler, Isabella Verdinelli (Eds.), Case Studies in Bayesian Statistics Volume VI. New York, NY: Springer.
Thompson, Herbert (1999). The Development of a Gender Model with SSA Numident Data. Administrative Records Research Memorandum Series #32, U.S. Census Bureau.
Wand, Y., and Wang, R. Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39: 86-95.
Zanutto, Elaine, and Zaslavsky, Alan M. (2001). Using Administrative Records to Impute for Nonresponse. In R. Groves, R.J.A. Little, and J. Eltinge (Eds.), Survey Nonresponse. New York: John Wiley.
Glossary of Terms
Administrative records: Data collected wherein the primary purpose is to administer a regulation or record a transaction rather than data collection per se.
Administrative Records Census: A Census of Population and Housing in which a predominant component of the census-taking is performed by using administrative records databases. In practice, field operations (for example, for coverage measurement or for Group Quarters enumeration) often coincide.
AREX2000: Administrative Records Experiment in 2000, an experimental attempt to simulate an Administrative Records Census in two sites in the U.S.
Basic Street Address: The primary street number and street name, omitting apartment numbers or other within-structure identifiers.
CPS: Current Population Survey, an ongoing survey administered by the U.S. Census Bureau.
Data Quality: The ability to construct a mapping from the ontological representation of a data item in a database to its appropriate ontological representation in the real world.
Master Address File (MAF): A file of addresses maintained by the U.S. Census Bureau for the purpose of taking its decennial census, and acting as a frame for ongoing sample surveys. The Decennial Master Address File is referred to as the DMAF.
Master Housing File: A file of addresses developed by the Statistical Administrative Records System.
Microdata: Data on individual person or housing characteristics, e.g., race, sex, age, street address, zip code.
Ontology: The study of what is, that is, the categories by which we understand the world.
StARS: Statistical Administrative Records System, an experimental database that combines information from several major Federal databases into one database that can be used for census-taking purposes.