
1 Differential Privacy in US Census. CompSci 590.03, Lecture 17. Instructor: Ashwin Machanavajjhala

2 Announcements: No class on Wednesday, Oct 31. Guest lecture by Prof. Jerry Reiter (Duke Statistics) on Friday, Nov 2: “Privacy in the U.S. Census and Synthetic Data Generation”.

3 Outline (continuation of last class): Relaxing differential privacy for utility – E-privacy [M et al VLDB ’09]; Application of differential privacy in the US Census [M et al ICDE ’08].

4 E-PRIVACY

5 Defining Privacy: “... nothing about an individual should be learnable from the database that cannot be learned without access to the database.” (T. Dalenius, 1977). Problem with this approach: the analyst knows Bob has green hair. The analyst learns from the published data that people with green hair have a 99% probability of cancer. Therefore the analyst knows Bob has a high risk of cancer, even if Bob is not in the published data.

6 Defining Privacy: Therefore the analyst knows Bob has a high risk of cancer, even if Bob is not in the published data. This should not be considered a privacy breach – such correlations are exactly what we want the analyst to learn.

7 Counterfactual Approach: Consider 2 distributions: Pr[Bob has cancer | adversary’s prior + output of mechanism on D] – “what the adversary learns about Bob after seeing the published information” – and Pr[Bob has cancer | adversary’s prior + output of mechanism on D_-Bob], where D_-Bob = D – {Bob’s record} – “what the adversary would have learned about Bob even in the hypothetical case where Bob was not in the data”. Must be careful: when removing Bob, you may also need to remove other tuples correlated with Bob.

8 Counterfactual Privacy: Consider a set of data evolution scenarios (adversaries) {θ}. For every property s_Bob about Bob, and every output w of the mechanism: |log P(s_Bob | θ, M(D) = w) − log P(s_Bob | θ, M(D_-Bob) = w)| ≤ ε. When {θ} is the set of all product distributions that are independent across individuals, and D_-Bob = D – {Bob’s record}, a mechanism satisfies the above definition if and only if it satisfies differential privacy.
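As a concrete illustration of the per-output bound that drives this equivalence, here is a minimal sketch (my example, not from the lecture) using the Laplace mechanism on a count query, a standard ε-differentially private mechanism; the counts 42 and 41 and ε = 0.5 are arbitrary choices:

```python
# Sketch: for Laplace noise with scale 1/eps on a count query, the log of the
# output-density ratio between D and D_-Bob (counts differing by one record)
# never exceeds eps.  Under product-distribution priors, this likelihood bound
# is what bounds the adversary's posterior shift about Bob.
import math

def laplace_density(x, mu, scale):
    return math.exp(-abs(x - mu) / scale) / (2 * scale)

eps = 0.5
count_D, count_D_minus_bob = 42, 41             # true counts differ by Bob's record
for w in (40.0, 41.3, 42.7, 45.0):              # a few possible noisy outputs
    gap = abs(math.log(laplace_density(w, count_D, 1 / eps)) -
              math.log(laplace_density(w, count_D_minus_bob, 1 / eps)))
    print(w, round(gap, 3), gap <= eps + 1e-9)  # the gap never exceeds eps
```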

9 Counterfactual Privacy: Consider a set of data evolution scenarios (adversaries) {θ}. For every property s_Bob about Bob, and every output w of the mechanism: |log P(s_Bob | θ, M(D) = w) − log P(s_Bob | θ, M(D_-Bob) = w)| ≤ ε. What about other sets of prior distributions {θ}? Open question: if {θ} contains correlations, then the definition of D_-Bob itself is not very clear (as discussed in the previous class for count constraints and social networks).

10 Certain vs Uncertain Adversaries: Suppose an adversary has an uncertain prior. Consider a two-sided coin. A certain adversary knows the bias of the coin = p (for some p): it exactly knows the bias, and knows every coin flip is a random draw that is H with probability p. An uncertain adversary may think the coin’s bias is in [p − δ, p + δ]: it does not exactly know the bias, and assumes the coin’s bias θ is drawn from some probability distribution π; given θ, every coin flip is a random draw that is H with probability θ.

11 Learning: In machine learning/statistics, you want to use the observed data to learn something about the population. E.g., given 10 flips of a coin, what is the bias of the coin? Assume your population is drawn from some prior distribution θ. We don’t know θ, but may know that some θ’s are more likely than others (i.e., π, a probability distribution over θ’s). We want to learn the best θ that explains the observations. If you are certain about θ in the first place, there is no need for statistics/machine learning.
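The coin example can be made concrete with a short sketch (mine, not the lecture’s); the Beta(1, 1) prior and the particular flips are illustrative assumptions:

```python
# A certain adversary fixes the coin's bias and learns nothing from the flips;
# an uncertain adversary puts a Beta prior over the bias and updates it.
def certain_adversary_pred(p, flips):
    # Already knows the bias: the next flip is H with probability p regardless
    # of what was observed.
    return p

def uncertain_adversary_pred(a0, b0, flips):
    # Beta(a0, b0) prior over the bias; posterior after the flips is
    # Beta(a0 + #H, b0 + #T), and the predictive probability of H is its mean.
    heads = sum(flips)
    tails = len(flips) - heads
    return (a0 + heads) / (a0 + b0 + heads + tails)

flips = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]          # 10 observed flips (1 = heads)
print(certain_adversary_pred(0.5, flips))       # stays 0.5
print(uncertain_adversary_pred(1, 1, flips))    # moves toward the data: 0.75
```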

12 Uncertain Priors and Learning: In many privacy scenarios, the statistician (advertiser, epidemiologist, machine-learning practitioner, …) is the adversary. But the statistician does not have a certain prior (otherwise there would be no learning). Maybe we can model a class of “realistic” adversaries using uncertain priors.

13 E-Privacy: Counterfactual privacy with realistic adversaries. Consider a set of data evolution scenarios (uncertain adversaries) {π}. For every property s_Bob about Bob, and every output w of the mechanism: |log P(s_Bob | π, M(D) = w) − log P(s_Bob | π, M(D_-Bob) = w)| ≤ ε, where P(s_Bob | π, M(D) = w) = ∫_θ P(s_Bob | θ, M(D) = w) P(θ | π) dθ.
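The integral over θ can be approximated numerically; here is a minimal Monte Carlo sketch (my illustration). The Beta(2, 2) prior and the toy conditional, which simply equates the adversary’s belief about s_Bob with θ, are assumptions chosen for brevity:

```python
# Monte Carlo evaluation of the E-privacy posterior
#   P(s_Bob | pi, M(D) = w) = integral over theta of
#       P(s_Bob | theta, M(D) = w) * P(theta | pi) dtheta,
# for some adversary-specific conditional post_given_theta.
import random

def e_privacy_posterior(post_given_theta, sample_theta, n_samples=100_000):
    # Average the theta-specific posterior over draws of theta from pi.
    return sum(post_given_theta(sample_theta()) for _ in range(n_samples)) / n_samples

# Example: pi = Beta(2, 2) over a coin bias, and a toy conditional that says
# the adversary's belief about s_Bob equals theta itself.
post = e_privacy_posterior(post_given_theta=lambda t: t,
                           sample_theta=lambda: random.betavariate(2, 2))
print(post)   # close to 0.5, the mean of Beta(2, 2)
```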

14 Realistic Adversaries: Suppose your domain has 3 values: green (g), red (r), and blue (b). Suppose individuals are assumed to be drawn from some common distribution θ = {p_g, p_r, p_b}. (Figure: the probability simplex with corners {p_g = 1, p_r = 0, p_b = 0}, {p_g = 0, p_r = 1, p_b = 0}, and {p_g = 0, p_r = 0, p_b = 1}.)

15 Modeling Realistic Adversaries: E.g., use Dirichlet priors D(α_g, α_r, α_b) to model uncertainty. Maximum probability is given to {p*_g, p*_r, p*_b}, where p*_g = α_g / (α_g + α_r + α_b). (Figure: density plots of example priors D(6,2,2), D(3,7,5), D(6,2,6), and D(2,3,4); e.g., D(3,7,5) concentrates around {p*_g = 0.2, p*_r = 0.47, p*_b = 0.33} and D(6,2,6) around {p*_g = 0.43, p*_r = 0.14, p*_b = 0.43}.)

16 E.g., Dirichlet Prior D(α_g, α_r, α_b): Call α = α_g + α_r + α_b the stubbornness of the prior. As α increases, more probability mass is concentrated around {p*_g, p*_r, p*_b}. When α → ∞, {p*_g, p*_r, p*_b} has probability 1 and we recover the independence assumption. (Figure: example priors D(2,3,4) and D(6,2,6) with their stubbornness values.)
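A small sketch (my illustration) of the stubbornness effect, using the standard Gamma-based Dirichlet sampler; the shape (0.2, 0.3, 0.5) and the α values are arbitrary choices:

```python
# Holding the shape p* fixed and scaling the total mass alpha up, Dirichlet
# draws concentrate around p*, approaching the independence assumption.
import random

def dirichlet_sample(alphas):
    # Standard Gamma-based Dirichlet sampler.
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

shape = [0.2, 0.3, 0.5]                    # fixed p*
for alpha in (10, 100, 10_000):            # increasing stubbornness
    draws = [dirichlet_sample([alpha * p for p in shape]) for _ in range(1000)]
    spread = max(abs(d[0] - shape[0]) for d in draws)
    print(alpha, round(spread, 3))         # the spread shrinks as alpha grows
```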

17 Better Utility: Suppose we consider a class of uncertain adversaries characterized by Dirichlet distributions with known stubbornness α but unknown shape (α_g, α_r, α_b). Algorithm: generalization, where the size of each group is > α/(ε − 1); if the size of a group is δ·(α/(ε − 1)), then the most frequent sensitive value appears in at most a 1 − 1/(ε + δ) fraction of the tuples in the group. Hence, we now also have a sliding scale to assess the power of adversaries (based on stubbornness): larger α implies more coarsening, which implies less utility.

18 Summary: Lots of open questions. What is the relationship between counterfactual privacy and Pufferfish? In what other ways can causality theory be used, e.g., for defining correlations? What are other interesting ways to instantiate E-privacy, and what are efficient algorithms for E-privacy?

19 DIFFERENTIAL PRIVACY IN US CENSUS

20 OnTheMap: A Census application that plots commuting patterns of workers. (Figure: map showing a workplace, which is public, and residences, which are sensitive.) http://onthemap.ces.census.gov/

21 OnTheMap: A Census application that plots commuting patterns of workers. Origins and destinations are Census blocks; the residence (origin) is sensitive and the workplace (destination) is a quasi-identifier.

Worker ID | Origin  | Destination
1223      | MD11511 | DC22122
1332      | MD2123  | DC22122
1432      | VA11211 | DC22122
2345      | PA12121 | DC24132
1432      | PA11122 | DC24132
1665      | MD1121  | DC24132
1244      |         | DC22122

22 Why publish commute patterns? To compute Quarterly Workforce Indicators: total employment, average earnings, new hires and separations, unemployment statistics. E.g., the state of Missouri used this data to formulate a method allowing QWI to suggest industrial sectors where transitional training might be most effective, to proactively reduce time spent on unemployment insurance.

23 A Synthetic Data Generator (Dirichlet resampling). Step 1: Noise addition (for each destination). Take the multiset of origins for workers commuting to Washington DC, D = (7, 5, 4) over the blocks (Washington DC, Somerset, Fuller), add noise in the form of fake workers, A = (2, 3, 3), and obtain the noise-infused data D + A = (9, 8, 7). Noise added to an origin with at least 1 worker is > 0.

24 A Synthetic Data Generator (Dirichlet resampling). Step 2: Dirichlet resampling (for each destination). Starting from D + A = (9, 8, 7), repeatedly draw a point at random (e.g., drawing from the second block gives (9, 7, 7)) and replace it with two of the same kind (giving (9, 9, 7)); each drawn point is added to S, the synthetic data. Note that if the frequency of block b in D + A is 0, then the frequency of b in S is 0, i.e., block b is ignored by the algorithm.
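A minimal sketch of this two-step synthesizer as described above (my reading of the slides); the block names and counts mirror the example, and m, the number of synthetic workers to draw, is a free parameter:

```python
# Step 1: add fake workers per origin block.  Step 2: Polya-urn resampling --
# draw a block proportional to its current count, emit it as a synthetic
# worker, and put the drawn point back together with another of the same kind.
import random

def synthesize(true_counts, noise_counts, m):
    urn = {b: true_counts.get(b, 0) + noise_counts.get(b, 0)
           for b in set(true_counts) | set(noise_counts)}
    synthetic = []
    for _ in range(m):
        blocks, weights = zip(*urn.items())
        b = random.choices(blocks, weights=weights)[0]
        synthetic.append(b)
        urn[b] += 1          # net effect of "replace two of the same kind"
    return synthetic         # blocks with count 0 in D+A can never appear here

D = {"Washington DC": 7, "Somerset": 5, "Fuller": 4}   # original origins
A = {"Washington DC": 2, "Somerset": 3, "Fuller": 3}   # added noise
print(synthesize(D, A, m=20))
```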

25 How should we add noise? Intuitively, more noise yields more privacy. How much noise should we add? To which blocks should we add noise? Currently this is poorly understood: the total amount of noise added is a state secret, and only 3-4 people in the US know this value in the current implementation of OnTheMap.

26 How should we add noise? 1. How much noise should we add? 2. To which blocks should we add noise?

27 Privacy of Synthetic Data. Theorem 1: The Dirichlet resampling algorithm preserves privacy if and only if, for every destination d, the noise added to each block is at least m(d) / (ε − 1), where m(d) is the size of the synthetic population for destination d and ε is the privacy parameter.

28 1. How much noise should we add? (Figure: noise required per block under differential privacy, for 1 million original and synthetic workers, plotted against the privacy parameter; larger ε means lesser privacy.) We would have to add noise to every block on the map, and there are 8 million Census blocks on the map: 1 million original workers and 16 billion fake workers! 2. To which blocks should we add noise?

29 Intuition behind Theorem 1. Consider two possible inputs D1 and D2, where blue and red are two different origin blocks. The adversary knows that individual 1 is either blue or red, and that individuals 2..n are blue.

30 Intuition behind Theorem 1 (continued). Noise addition is applied to the two possible inputs D1 and D2, where blue and red are two different origin blocks.

31 Intuition behind Theorem 1 (continued). Dirichlet resampling is run on the noise-infused inputs D1 and D2. For every output O: Pr[D1 → O] = 1/10 · 2/11 · 3/12 · 4/13 · 5/14 · 6/15 and Pr[D2 → O] = 2/10 · 3/11 · 4/12 · 5/13 · 6/14 · 7/15, so Pr[D2 → O] / Pr[D1 → O] = 7.
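This ratio can be checked directly; a short sketch (my verification of the slide’s numbers), where a is the count the contested block starts with and k is the number of synthetic workers drawn from it:

```python
# The telescoping product of urn draw probabilities gives a worst-case
# likelihood ratio of (a + k) / a for the contested block.
from fractions import Fraction

def output_prob(start_count, total_start, k):
    # Probability of drawing the contested block k times in a row, when its
    # count starts at start_count out of total_start points in the urn.
    p = Fraction(1)
    for i in range(k):
        p *= Fraction(start_count + i, total_start + i)
    return p

# D1: the block holds noise only (count 1); D2: noise plus individual 1 (count 2).
p1 = output_prob(1, 10, 6)   # 1/10 * 2/11 * ... * 6/15
p2 = output_prob(2, 10, 6)   # 2/10 * 3/11 * ... * 7/15
print(p2 / p1)               # Fraction(7, 1): the ratio (a + k) / a with a = 1, k = 6
```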

32 Intuition behind Theorem 1 (continued). For every such output O, the adversary infers that it is very likely that individual 1 is red, unless the noise added is very large.

33 Privacy Analysis: Summary. Chose differential privacy: it guards against powerful adversaries and measures privacy as a distance between prior and posterior. Derived necessary and sufficient conditions under which OnTheMap preserves privacy. The above conditions make the data published by OnTheMap useless.

34 But the breach occurs with very low probability. For the noise-infused inputs D1 and D2, an output O that lets the adversary distinguish them has probability ≈ 10^-4.

35 Negligible function. Definition: f(x) is negligible if it goes to 0 faster than the inverse of any polynomial; e.g., 2^-x and e^-x are negligible functions.
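A quick numeric illustration (mine) of what “faster than the inverse of any polynomial” means, comparing 2^-x against the fixed polynomial inverse 1/x^10:

```python
# 2**-x eventually drops below 1/x**10 (and below 1/x**k for any fixed k).
for x in (10, 50, 100, 200):
    print(x, 2.0 ** -x < 1.0 / x ** 10)   # False, False, True, True
```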

36 (ε,δ)-Indistinguishability. For every pair of inputs D1 and D2 that differ in one value, and for any subset of outputs T: Pr[D1 → T] ≤ e^ε · Pr[D2 → T] + δ(|D2|). If T occurs with negligible probability, the adversary is allowed to distinguish between D1 and D2 by a factor greater than e^ε using the outputs in T.

37 Conditions for (ε,δ)-Indistinguishability. Theorem 2: The Dirichlet resampling algorithm preserves (ε,δ)-indistinguishability if, for every destination d, the noise added to each block is at least log n(d), where n(d) is the number of workers commuting to d and m(d) ≤ n(d).

38 Probabilistic Differential Privacy. (ε,δ)-Indistinguishability is an asymptotic measure, and may not guarantee privacy when the number of workers at a destination is small. Definition (Disclosure set Disc(D, ε)): the set of output tables that breach ε-differential privacy for D and some other table D’ that differs from D in one value.

39 Probabilistic Differential Privacy. For every pair of inputs D1 and D2 that differ in one value, and with probability at least 1 − δ over the outputs O: Pr[D1 → O] ≤ e^ε · Pr[D2 → O]. In other words, the adversary may distinguish between D1 and D2 based on a set of unlikely outputs with probability at most δ.
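A minimal Monte Carlo sketch (my illustration, not the paper’s analysis) of what δ measures for the urn example above: how often a run of the synthesizer on D1 produces an output whose likelihood ratio against D2 exceeds the allowed bound. The counts, m = 6, and the ratio bound (playing the role of e^ε) are arbitrary choices:

```python
import random
from fractions import Fraction

def run_and_ratio(counts1, counts2, m):
    # Sample one synthetic output from the urn on counts1 and return the
    # likelihood ratio Pr[D1 -> O] / Pr[D2 -> O] for that output.
    urn1, urn2 = dict(counts1), dict(counts2)
    t1, t2 = sum(urn1.values()), sum(urn2.values())
    ratio = Fraction(1)
    for _ in range(m):
        blocks, weights = zip(*urn1.items())
        b = random.choices(blocks, weights=weights)[0]
        ratio *= Fraction(urn1[b], t1) / Fraction(urn2[b], t2)
        urn1[b] += 1; urn2[b] += 1; t1 += 1; t2 += 1
    return ratio

def estimate_delta(counts1, counts2, m, ratio_bound, trials=20_000):
    bad = 0
    for _ in range(trials):
        r = run_and_ratio(counts1, counts2, m)
        if r > ratio_bound or 1 / r > ratio_bound:
            bad += 1
    return bad / trials   # fraction of runs whose output breaches the bound

# D1 and D2 differ in one worker's origin block ("red" vs "blue").
print(estimate_delta({"blue": 9, "red": 1}, {"blue": 8, "red": 2},
                     m=6, ratio_bound=5))   # small: only red-heavy outputs breach it
```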

40 1. How much noise should we add? (Figure: noise required per block for 1 million original and synthetic workers, comparing differential privacy with probabilistic differential privacy at δ = 10^-5, as the privacy parameter grows, i.e., as privacy lessens.)

41 Prob. Differential Privacy: Summary. Ignoring privacy breaches that occur due to low-probability outputs drastically reduces the noise. Two ways to bound low-probability outputs: (ε,δ)-indistinguishability with negligible functions, and (ε,δ)-probabilistic differential privacy with disclosure sets. The noise required for privacy is ≥ log n(d) per block, and there is an efficient algorithm to calculate the noise per block (see paper). Does probabilistic differential privacy allow useful information to be published?

42 1. How much noise should we add? (Figure: noise required per block for 1 million original and synthetic workers, under differential privacy vs probabilistic differential privacy with δ = 10^-5.) 2. To which blocks should we add noise? Why not add noise to every block?

43 Why not add noise to every block? With the noise required per block under probabilistic differential privacy and about 8 million blocks on the map, the total noise added is about 6 million fake workers. This causes non-trivial spurious commute patterns: roughly 1 million fake workers come from the West Coast (out of a total of 7 million points in D + A), so 1/7 of the synthetic data have residences on the West Coast and work in Washington DC.

44 2. To which blocks should we add noise? Adding noise to all blocks creates spurious commute patterns. Why not add noise only to blocks that appear in the original data?

45 Theorem 3: Adding noise only to blocks that appear in the data breaches privacy. If a block b does not appear in the original data and no noise is added to b, then b cannot appear in the synthetic data.

46 Theorem 3 (continued): Worker W comes from Somerset or Fayette, and no one else comes from there, so the possible inputs are (Somerset 1, Fayette 0) and (Somerset 0, Fayette 1). If the synthetic data S contains a synthetic worker from Somerset, then W comes from Somerset!

47 Ignoring outliers degrades utility. (Figure: map of origin blocks; each of the highlighted points is an outlier, yet together they contribute about half of the workers.)

48 Our solution to “Where to add noise?” Step 1: Coarsen the domain, based on an existing public dataset (the Census Transportation Planning Package, CTPP).

49 Our solution to “Where to add noise?” Step 1: Coarsen the domain. Step 2: Probabilistically drop blocks with 0 support: pick a function f: {b_1, …, b_k} → (0,1] (based on external data), and for every block b with 0 support, ignore b with probability f(b). Theorem 4: The parameter ε increases by max_b ( max( 2 · (noise per block), f(b) ) ).
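A minimal sketch of this pruning step as stated above (my reading of the slide); the toy f below is a placeholder, and in practice f would be derived from external data such as CTPP:

```python
# Blocks with positive support always receive noise; blocks with zero support
# are ignored with probability f(b) and receive noise otherwise.
import random

def blocks_to_noise(support, f):
    # support: block -> count in the original data; f: block -> (0, 1]
    kept = []
    for b, count in support.items():
        if count > 0 or random.random() >= f(b):   # drop zero-support b w.p. f(b)
            kept.append(b)
    return kept   # only these blocks receive noise (and can appear in S)

support = {"MN100": 12, "MN101": 0, "MN102": 0}
print(blocks_to_noise(support, f=lambda b: 0.9))   # MN100 is always kept
```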

50 Utility of the provably private algorithm. Experimental setup: OTM, the currently published OnTheMap data, is used as the original data, with all destinations in Minnesota and 120,690 origins per destination, chosen by pruning out blocks that are > 100 miles from the destination. ε = 100, δ = 10^-5. Additional leakage due to probabilistic pruning = 4 (min f(b) = 0.0378). Utility is measured by the average commute distance for each destination block.
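The utility metric itself is easy to sketch (illustrative code, with made-up block coordinates and a Euclidean distance standing in for road or great-circle distance):

```python
# Average commute distance for a given destination block, computed over a list
# of (origin_block, destination_block) worker records; compare the value on the
# original data with the value on the synthetic data.
from math import dist

def avg_commute(workers, coords, destination):
    ds = [dist(coords[o], coords[destination])
          for o, d in workers if d == destination]
    return sum(ds) / len(ds) if ds else 0.0

coords = {"MN1": (0.0, 0.0), "MN2": (3.0, 4.0), "MN3": (6.0, 8.0)}   # toy blocks
original = [("MN2", "MN1"), ("MN3", "MN1")]
synthetic = [("MN2", "MN1"), ("MN2", "MN1")]
print(avg_commute(original, coords, "MN1"),     # 7.5
      avg_commute(synthetic, coords, "MN1"))    # 5.0
```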

51 Utility of the provably private algorithm. Utility is measured by the average commute distance for each destination block. Short commutes have low error in both sparse and dense regions.

52 Utility of the provably private algorithm. Long commutes in sparse regions are overestimated.

53 OnTheMap: Summary. OnTheMap is a real Census application in which synthetically generated data are published for economic research; currently, its privacy implications are poorly understood and the parameters of the algorithm are a state secret. This is the first formal privacy analysis of the application: we analyzed the privacy of OnTheMap using variants of differential privacy and gave the first solutions that publish useful information despite sparse data.

54 Next Class: No class on Wednesday, Oct 31. Guest lecture by Prof. Jerry Reiter (Duke Statistics) on Friday, Nov 2: “Privacy in the U.S. Census and Synthetic Data Generation”.

55 References:
[M et al ICDE ’08] A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke, L. Vilhuber. “Privacy: Theory Meets Practice on the Map.” ICDE 2008.
[M et al VLDB ’09] A. Machanavajjhala, J. Gehrke, M. Götz. “Data Publishing against Realistic Adversaries.” PVLDB 2(1), 2009.

