# Estimating Identification Risks for Microdata

Jerome P. Reiter, Institute of Statistics and Decision Sciences, Duke University, Durham NC, USA


Setting for problem: An agency seeks to release microdata. There is a risk of re-identification from matching the released records to external databases. Statistical disclosure limitation (SDL) is applied to the data before release.

Measures of identification disclosure risk. Number of population uniques: does not incorporate intruders' knowledge; may not be useful for continuous data; hard to gauge effects of SDL procedures; hard to estimate accurately. Probability-based methods (direct matching using external databases; indirect matching using an existing data set): require assumptions about intruder behavior; may be costly to obtain external databases.

Notation for methods Actual record j : Released record j : Available data: Unavailable + perturbed data combined:

Probability of identification: Let J = j when record j in Z matches the target record t, and let J = r + 1 when the target is not in Z.

Calculating CASE 1: Target assumed to be in Z. Units whose values do not match the target's values have zero probability. For matches, the probability equals 1/n_t, where n_t is the number of matches in Z. The probability equals zero for j = r + 1.
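The Case 1 rule can be illustrated with a minimal Python sketch. The function name `case1_probabilities` and the representation of records as tuples of available values are assumptions made for illustration; they do not come from the slides.

```python
def case1_probabilities(released, target):
    """Case 1 sketch: the target is assumed to be in Z.

    `released` is a list of tuples holding the available values of each
    released record (hypothetical layout); `target` is a tuple of the
    target's values. Returns r + 1 probabilities: one per released record,
    plus a final entry for J = r + 1 (target not in Z).
    """
    matches = [record == target for record in released]
    n_t = sum(matches)  # number of released records matching the target
    if n_t == 0:
        raise ValueError("no released record matches the target")
    probs = [1.0 / n_t if is_match else 0.0 for is_match in matches]
    probs.append(0.0)  # J = r + 1 has probability zero in Case 1
    return probs


released = [("30-34", "F"), ("30-34", "F"), ("40-44", "M")]
print(case1_probabilities(released, ("30-34", "F")))
```

With two of the three released records matching, each match receives probability 1/2 and the "not in Z" outcome receives zero.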

Calculating CASE 2: Target not assumed to be in Z. Units whose values do not match the target's values have zero probability. For matches, the probability is 1/N_t, where N_t is the number of matches in the population. For j = r + 1, the probability is (N_t - n_t) / N_t.
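A similar sketch covers Case 2. Here the population count N_t is passed in as a known number, which is an assumption: in practice it would usually have to be estimated. The function name `case2_probabilities` and the tuple layout are likewise hypothetical.

```python
def case2_probabilities(released, target, N_t):
    """Case 2 sketch: the target is not assumed to be in Z.

    `N_t` is the number of population units matching the target, treated
    here as known (an assumption). Returns r + 1 probabilities, the last
    one being Pr(J = r + 1), i.e. the target is not in Z.
    """
    matches = [record == target for record in released]
    n_t = sum(matches)  # matches among the released records
    if N_t < max(n_t, 1):
        raise ValueError("N_t must be at least 1 and at least n_t")
    probs = [1.0 / N_t if is_match else 0.0 for is_match in matches]
    probs.append((N_t - n_t) / N_t)  # probability that the target is not in Z
    return probs


released = [("30-34", "F"), ("30-34", "F"), ("40-44", "M")]
print(case2_probabilities(released, ("30-34", "F"), N_t=4))
```

The n_t matching records contribute n_t / N_t in total, so together with (N_t - n_t) / N_t the probabilities sum to one.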

Splitting

Calculating Data swapping: Repeatedly simulate swapping mechanism using Z. Estimate probabilities for combinations of original + swapped values.
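The simulation idea can be sketched in Python as follows. The swapping mechanism used here (pick a fixed fraction of positions and shuffle their values among themselves) is an assumed stand-in for whatever mechanism the agency actually uses, and the names `swap_values` and `estimate_swap_probabilities` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def swap_values(values, rate, rng):
    """Assumed swap: choose a fraction `rate` of positions at random and
    shuffle the values at those positions among themselves."""
    values = list(values)
    k = int(round(rate * len(values)))
    positions = rng.sample(range(len(values)), k)
    chosen = [values[i] for i in positions]
    rng.shuffle(chosen)
    for i, v in zip(positions, chosen):
        values[i] = v
    return values


def estimate_swap_probabilities(z_values, rate, n_sims=1000, seed=0):
    """Estimate Pr(value after swap | value before swap) by repeatedly
    applying the swap to the released values Z."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    for _ in range(n_sims):
        swapped = swap_values(z_values, rate, rng)
        for before, after in zip(z_values, swapped):
            counts[before][after] += 1
    return {
        before: {after: c / sum(row.values()) for after, c in row.items()}
        for before, row in counts.items()
    }


race = ["W"] * 6 + ["B"] * 3 + ["O"]
print(estimate_swap_probabilities(race, rate=0.3, n_sims=2000))
```

Each row of the result is an estimated conditional distribution of the post-swap value given the pre-swap value; more simulations give more stable estimates.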

Calculating Noise addition: Assume variable k is perturbed using Gaussian noise with mean zero and known variance σ².
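Under this assumption, the released value of variable k is normally distributed around the original value with standard deviation σ, so a Gaussian density can score how plausible each released value is for a given target. The sketch below normalizes those densities across records, which additionally assumes the target is in the released file and that all records are equally likely a priori; the function names are hypothetical.

```python
import math


def gaussian_density(z, t, sigma):
    """Density of released value z when the true value is t and the noise
    is N(0, sigma^2)."""
    return math.exp(-0.5 * ((z - t) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def noise_match_weights(released_values, target_value, sigma):
    """Normalized Gaussian weights for each released value of variable k.

    Normalizing over the released records is an illustrative assumption
    (target in the file, equal prior weight on every record).
    """
    densities = [gaussian_density(z, target_value, sigma) for z in released_values]
    total = sum(densities)
    if total == 0.0:
        raise ValueError("all densities underflowed to zero")
    return [d / total for d in densities]


weights = noise_match_weights([1000.0, 1290.0, 710.0, 5000.0], 1000.0, sigma=290.0)
print(weights)
```

Released values close to the target's value receive most of the weight, while values many standard deviations away receive essentially none.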

Calculating First distribution is for SDL methods. Second distribution is best model for predicting unavailable variables given what is known.

Calculating when values in U are not perturbed. Intruders may act this way to avoid computations. It is prudent to evaluate risk assuming they do.

Calculating Assume independence to obtain: where

Simulations: 51,016 heads of household from the 2000 CPS. Potentially available variables: Age, Sex, Race, Marital Status, Property Tax. Unavailable variables: Education, Income, Social Security, Child Support Payments.

Simulations: SDL Procedures. Age: group in five-year intervals. Race and Marital Status: swap randomly 30% of values for each variable. Property taxes: for positive taxes, add noise from N(0, 290²); constrain values to be positive; do not alter 0s. Other variables: leave at original values.
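The procedures listed above might be applied to a file roughly as in the sketch below. Several details are assumptions not stated on the slide: the record layout (dicts with `age`, `race`, `marital`, `property_tax` keys), the interval boundaries (0-4, 5-9, ...), the swap mechanism (shuffle values among a random 30% of positions), and enforcing positivity by redrawing the noise.

```python
import random


def group_age(age):
    """Recode age into a five-year interval label (boundaries assumed)."""
    low = (age // 5) * 5
    return f"{low}-{low + 4}"


def swap_fraction(values, rate, rng):
    """Assumed swap: shuffle the values at a random fraction of positions."""
    values = list(values)
    positions = rng.sample(range(len(values)), int(round(rate * len(values))))
    chosen = [values[i] for i in positions]
    rng.shuffle(chosen)
    for i, v in zip(positions, chosen):
        values[i] = v
    return values


def perturb_tax(tax, sd, rng):
    """Add N(0, sd^2) noise to positive taxes; zeros are left unaltered.
    Positivity is enforced by redrawing, which is an assumption."""
    if tax <= 0:
        return tax
    while True:
        noisy = tax + rng.gauss(0.0, sd)
        if noisy > 0:
            return noisy


def apply_sdl(records, seed=0):
    """Return a perturbed copy of `records`; other fields are unchanged."""
    rng = random.Random(seed)
    race = swap_fraction([r["race"] for r in records], 0.30, rng)
    marital = swap_fraction([r["marital"] for r in records], 0.30, rng)
    released = []
    for r, new_race, new_marital in zip(records, race, marital):
        out = dict(r)
        out["age"] = group_age(r["age"])
        out["race"] = new_race
        out["marital"] = new_marital
        out["property_tax"] = perturb_tax(r["property_tax"], 290.0, rng)
        released.append(out)
    return released
```

Race and marital status are swapped independently here, matching "for each variable" on the slide; whether the original study selected the same or different records for the two swaps is not stated.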

Simulations: Targets. Everyman: has values near the median for all variables. Unique: sample unique on the combination of age, sex, race, and marital status. Big I: highest income in the data set. Big P: highest property tax in the data set.
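A target such as Unique can be located by counting how often each combination of key variables occurs in the sample. The sketch below assumes records are dicts keyed by variable name; the function name `sample_uniques` and the example values are hypothetical.

```python
from collections import Counter


def sample_uniques(records, keys):
    """Return indices of records whose combination of `keys` values
    appears exactly once in the sample."""
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    return [i for i, combo in enumerate(combos) if counts[combo] == 1]


records = [
    {"age": 34, "sex": "F", "race": "W", "marital": "M"},
    {"age": 34, "sex": "F", "race": "W", "marital": "M"},
    {"age": 71, "sex": "M", "race": "O", "marital": "S"},
]
print(sample_uniques(records, ["age", "sex", "race", "marital"]))
```

A sample unique need not be unique in the population, which is one reason uniqueness counts alone can misstate the risk.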

Simulations: Summary of results. Swaps are needed to protect Unique. Age recoding plus swaps gives good protection. Knowing property taxes greatly increases the probabilities of identification. Adding noise to positive tax values is not sufficient. (Top-coding helps.)
