1 WP 10 On Risk Definitions and a Neighbourhood Regression Model for Sample Disclosure Risk Estimation Natalie Shlomo Hebrew University Southampton University.

1 WP 10: On Risk Definitions and a Neighbourhood Regression Model for Sample Disclosure Risk Estimation
Natalie Shlomo, Hebrew University and Southampton University
Yosi Rinott, Hebrew University

2 Disclosure Risk Measures
Notation:
- Sample (size n):
- Population (size N):
- Tables with K cells: m-way table
Risk Measures: = expected number of correct matches of sample uniques
Estimates:
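The formulas on this slide did not survive in the transcript. A hedged reconstruction, assuming the notation that is standard in this literature, is sketched here. The symbols $f_k$, $F_k$, $\tau_1$ and $\tau_2$ are assumptions and are not taken from the slide itself; only the description "expected number of correct matches of sample uniques" is.

```latex
% Assumed notation: f_k = sample count, F_k = population count in cell k
\sum_{k=1}^{K} f_k = n, \qquad \sum_{k=1}^{K} F_k = N

% Assumed risk measures
\tau_1 = \sum_{k} I(f_k = 1,\ F_k = 1)
  \quad \text{(number of sample uniques that are population uniques)}
\tau_2 = \sum_{k} I(f_k = 1)\,\frac{1}{F_k}
  \quad \text{(expected number of correct matches of sample uniques)}

% Assumed form of the estimates
\hat{\tau}_1 = \sum_{k} I(f_k = 1)\,\hat{P}(F_k = 1 \mid f_k = 1), \qquad
\hat{\tau}_2 = \sum_{k} I(f_k = 1)\,\hat{E}\!\left[\frac{1}{F_k} \,\middle|\, f_k = 1\right]
```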

3 On Definitions of Disclosure Risk
- In the statistics literature, we present examples of risk measures, but we lack formal definitions of when a file is safe.
- In the computer science literature, there is a formal definition of disclosure risk (e.g., Dinur, Dwork, Nissim (2004-5), Adam and Wortmann (1989), who write “it may be argued that elimination of disclosure is possible only by elimination of statistics”).
- In some of the CS literature, any data must be released with noise. The noise must be small enough that legitimate information on large subsets of the data remains useful, and large enough that information on small subsets or individuals is too noisy to be useful (regardless of whether it is obtained by direct queries, differencing, etc.).

4 On Definitions of Disclosure Risk
- The worst-case scenario of the CS approach (for example, that the intruder has all information on everyone in the data set except the individual being snooped) simplifies definitions: there is no need to consider other, more realistic but more complicated scenarios.
- But would statistics bureaus and statisticians agree to adding noise to any data?
- Other approaches, like query restriction or query auditing, do not lead to formal definitions.

5 Definition of Disclosure Risk
- Numerical database; a query is a sum over a subset of its entries.
- The query is perturbed by adding noise of a given magnitude.
- Proven: almost all entries can be reconstructed if the noise is below a threshold order, and none of them can be reconstructed if it is above that threshold.
- Adding noise of that order hides information on individuals and small groups, but yields meaningful information about sums of O(n) units, for which noise of that order is natural.
- The work was further expanded to lessen the magnitude of the noise by limiting the number of queries.
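The trade-off described on this slide can be illustrated with a small simulation. This is only a sketch under stated assumptions: the database is taken to be binary, the noise is Gaussian, and its standard deviation is set to sqrt(n). None of these specifics are preserved in the transcript; sqrt(n) is used because it is the natural sampling-noise order for a sum of n units.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
db = rng.integers(0, 2, size=n)   # binary database (assumption)
noise_sd = np.sqrt(n)             # assumed noise order: sqrt(n) = 100


def mean_relative_error(subset, reps=200):
    """Average |noisy answer - true sum| / max(true sum, 1) over repeated queries."""
    true_sum = db[subset].sum()
    noisy = true_sum + rng.normal(0.0, noise_sd, size=reps)
    return np.abs(noisy - true_sum).mean() / max(true_sum, 1)


large_subset = rng.choice(n, size=n // 2, replace=False)  # O(n) units
small_subset = rng.choice(n, size=5, replace=False)       # a small group

large_err = mean_relative_error(large_subset)
small_err = mean_relative_error(small_subset)
print(f"relative error, large subset: {large_err:.3f}")
print(f"relative error, small subset: {small_err:.3f}")
```

Under these assumptions the noisy answer for the large subset stays within a few percent of the truth, while the answer for the five-unit subset is dominated by noise.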

6 Definition of Disclosure Risk
Collaboration between the CS and statistical communities, where:
1. In the statistical community, there is a need for more formal and clear definitions of disclosure risk.
2. In the CS community, there is a need for statistical methods to preserve the utility of the data:
 - allow sufficient statistics to be released without perturbation
 - methods for adding correlated noise
 - sub-sampling and other methods for data masking
Can the formal notions from CS and the practical approach of statisticians lead to a compromise that will allow us to set practical but well-defined standards for disclosure risk?

7 Probabilistic Models
- Focus on sample microdata and not the whole population (sampling provides a priori protection against disclosure).
- Standard (natural) assumptions: independent Bernoulli or Poisson sampling.
- In particular, the size-biased Poisson distribution.
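The model formulas on this slide were lost in the transcript. As a hedged sketch, assume the usual setup in this literature: population counts F_k ~ Poisson(lam_k) independently, and Bernoulli sampling with inclusion probability pi. Then, given a sample unique (f_k = 1), F_k is distributed as 1 + Poisson(lam_k * (1 - pi)), which is the size-biased Poisson distribution. The function names below are hypothetical.

```python
import math


def prob_population_unique(lam, pi):
    """P(F_k = 1 | f_k = 1) when F_k - 1 | f_k = 1 ~ Poisson(lam * (1 - pi))."""
    return math.exp(-lam * (1.0 - pi))


def expected_inverse_population_count(lam, pi):
    """E[1 / F_k | f_k = 1], i.e. E[1 / (1 + X)] for X ~ Poisson(mu), mu = lam * (1 - pi)."""
    mu = lam * (1.0 - pi)
    if mu == 0.0:
        return 1.0  # F_k = 1 with certainty
    # (1 - exp(-mu)) / mu, written with expm1 for numerical stability
    return -math.expm1(-mu) / mu


# Example: a cell with assumed population mean 10 and a 2% sampling fraction
p_unique = prob_population_unique(10.0, 0.02)
match_prob = expected_inverse_population_count(10.0, 0.02)
print(p_unique, match_prob)
```

Summing the first quantity over all sample uniques gives an estimate of the number of sample uniques that are population uniques; summing the second gives an estimate of the expected number of correct matches.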

8 Probabilistic Models
- Add an iid distribution on the model parameters.
- In one limit of its parameters we obtain the Mu-Argus assumption.
- In the other limit we obtain the Poisson model above.

9 Mu-Argus Model (Benedetti, Capobianchi, Franconi (1998))
- The model uses the sampling weight of individual i, obtained from the design or from post-stratification.
- Some of the estimated quantities are underestimated, so the risk is underestimated.
- Monotonicity: if we replace the estimated parameter by some other value, risk estimates increase to the correct level, but how should that value be estimated?

10 Poisson Log-linear Models (Skinner, Holmes (1998), Elamir, Skinner (2005), Skinner, Shlomo (2005))
Monotonicity in the size of the model (number of parameters):
- Saturated (“big” model): data overfitted, risk underestimated.
- Independence (“small” model): data underfitted, risk overestimated.
- Intermediate models with conditional independence involve smaller products of marginal proportions, so we expect monotonicity across the models. As with the choice of the parameter in the Mu-Argus model, there will be a model that gives a good risk estimate.

11 Neighborhood of a Log-linear Model
- Log-linear models take into account a neighborhood of cells to make inference on each cell when determining the risk.
- For example, the independence neighborhood of cell k=(i,j): the estimate is the product of marginal proportions obtained by fixing one attribute at a time. Thus, if one attribute is income group, inference on the very rich involves information on the very poor, provided there is another attribute in common, such as marital status.
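The independence neighborhood can be made concrete with a short sketch for a two-way table. The fitted value for cell (i, j) uses every cell in row i and every cell in column j, which is why inference on one income group borrows from all others. The risk step assumes the Poisson model with a common sampling fraction pi (0 < pi < 1); the function names and the toy table are hypothetical.

```python
import numpy as np


def independence_fit(table):
    """Fitted sample counts under independence: n * (row proportion) * (column proportion)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    row_totals = table.sum(axis=1, keepdims=True)   # shape (rows, 1)
    col_totals = table.sum(axis=0, keepdims=True)   # shape (1, cols)
    return row_totals @ col_totals / n


def poisson_risk_estimates(table, fitted, pi):
    """Risk estimates summed over sample uniques, assuming F_k ~ Poisson(fitted_k / pi)."""
    uniques = np.asarray(table) == 1
    mu = fitted[uniques] * (1.0 - pi) / pi          # mean of unsampled units per unique cell
    tau1_hat = np.exp(-mu).sum()                    # estimated population uniques
    tau2_hat = (-np.expm1(-mu) / mu).sum()          # estimated correct matches
    return tau1_hat, tau2_hat


sample = np.array([[5, 1],
                   [0, 4]])
fitted = independence_fit(sample)
tau1_hat, tau2_hat = poisson_risk_estimates(sample, fitted, pi=0.5)
print(fitted)
print(tau1_hat, tau2_hat)
```

In this toy table the single sample unique sits in cell (0, 1); its fitted value comes entirely from the margins of row 0 and column 1, not from its own count.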

12 Discussion of Neighborhoods
- How likely is a sample unique to be a population unique? If a sample unique has mostly small or empty neighboring cells, it is more likely to be a population unique.
- Argus is based on weights, with no learning from other cells.
- The log-linear Poisson model takes neighborhoods into account, reduces the number of parameters, and also reduces their standard deviation and hence that of the risk measures (provided that the model is valid).
- Are there other types of neighborhoods that may be more natural? We focus on ORDINAL variables.

13 Proposed Neighborhoods
- Local smoothers for large sparse (ordinal) tables, e.g. Bishop, Fienberg, Holland (1975), Simonoff (1998).
- Use local neighborhoods to fit a simple smooth function or to obtain smooth estimates.
- Construct the neighborhood of a cell k by varying the coordinates of ordinal attributes and fixing non-ordinal attributes.
- Neighborhood of cell k: the cells at distance c from cell k.
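The construction of a neighborhood at distance c can be sketched as follows. The slide's exact distance definition was lost, so this sketch assumes the distance is the sum of absolute coordinate differences over the ordinal attributes; that choice, and the function names, are assumptions rather than the authors' definition.

```python
from itertools import product

import numpy as np


def neighbourhood(cell, shape, ordinal_axes, c):
    """Cells at assumed L1 distance exactly c from `cell`, varying only ordinal axes."""
    offsets_per_axis = [
        range(-c, c + 1) if axis in ordinal_axes else (0,)  # non-ordinal axes stay fixed
        for axis in range(len(shape))
    ]
    cells = []
    for offset in product(*offsets_per_axis):
        if sum(abs(o) for o in offset) != c:
            continue
        neighbour = tuple(k + o for k, o in zip(cell, offset))
        # Keep only cells that lie inside the table
        if all(0 <= x < size for x, size in zip(neighbour, shape)):
            cells.append(neighbour)
    return cells


def neighbourhood_total(counts, cell, ordinal_axes, c):
    """Sum of sample counts over the neighbourhood; one possible regressor for the cell."""
    return sum(counts[idx] for idx in neighbourhood(cell, counts.shape, ordinal_axes, c))


# Toy table: axis 0 is non-ordinal (e.g. sex), axes 1 and 2 are ordinal
counts = np.arange(50).reshape(2, 5, 5)
near = neighbourhood((1, 2, 2), counts.shape, ordinal_axes={1, 2}, c=1)
total = neighbourhood_total(counts, (1, 2, 2), ordinal_axes={1, 2}, c=1)
print(near, total)
```

Totals (or means) of counts at each distance c can then serve as regressors for the cell, and a cell whose neighborhoods are all empty is a candidate structural zero.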

14 Proposed Neighborhoods
- Regressors are defined for each cell k from its neighborhoods.
- A cell is defined as a structural zero if all of its neighborhoods that are used in the regression contain only empty cells.

15 Example
- Population from the 1995 Israeli Census file, Age>15: N=746,949, n=14,939, and K=337,920.
- Key: Sex (2), Age groups (16), Groups of years of study (10), Number of years in Israel (11), Income groups (12), Number of persons in household (8).
- Sex is not ordinal and is fixed.
- Weights for Argus are obtained by post-stratification on weighting classes: sex, age and geographical location.

16 Example
Model                                        Risk measure 1   Risk measure 2
True Values                                           430.0          1,125.8
Argus                                                 114.5            456.0
Log-linear model: Independence                        773.8          1,774.1
Log-linear model: 2-way Interactions                  470.0          1,178.1
Neighborhood Method                                   786.8          2,146.9
Neighborhood Method w/out structural zeros            385.4          1,674.1
Neighborhood Method                                   723.3          2,099.6
Neighborhood Method w/out structural zeros            344.8          1,624.2

17 Results of Example
- The independence log-linear model and the neighborhood method overestimate the two risk measures.
- The Argus model underestimates them.
- The all 2-way interaction log-linear Poisson model has the best estimates.
- Taking into account the structural zeros in the neighborhoods yields more reasonable estimates.

18 Conclusions
- Need to refine the neighborhood approach, define the model better and develop MLE theory.
- We expect the new model to work well in multi-way tables when simple log-linear models are not valid.
- Incorporate the approach into a more general regression model, the Negative Binomial Regression, which subsumes both the Poisson Risk Model and the Argus Model.

