
1 Differential Privacy in US Census. CompSci 590.03, Lecture 17. Instructor: Ashwin Machanavajjhala

2 Announcements: No class on Wednesday, Oct 31. Guest lecture by Prof. Jerry Reiter (Duke Statistics) on Friday, Nov 2: “Privacy in the U.S. Census and Synthetic Data Generation”.

3 Outline (continuation of last class): Relaxing differential privacy for utility – E-privacy [M et al VLDB ’09]; Application of differential privacy in the US Census [M et al ICDE ’08].

4 E-PRIVACY

5 Defining Privacy: “... nothing about an individual should be learnable from the database that cannot be learned without access to the database.” (T. Dalenius, 1977). Problem with this approach: the analyst knows Bob has green hair. The analyst learns from the published data that people with green hair have a 99% probability of cancer. Therefore the analyst knows Bob has a high risk of cancer, even if Bob is not in the published data.

6 Defining Privacy: Therefore the analyst knows Bob has a high risk of cancer, even if Bob is not in the published data. This should not be considered a privacy breach – such correlations are exactly what we want the analyst to learn.

7 Counterfactual Approach: Consider 2 distributions: Pr[Bob has cancer | adversary’s prior + output of mechanism on D] – “what the adversary learns about Bob after seeing the published information” – and Pr[Bob has cancer | adversary’s prior + output of mechanism on D_-Bob], where D_-Bob = D – {Bob’s record} – “what the adversary would have learned about Bob even in the hypothetical case where Bob was not in the data”. Must be careful: when removing Bob, you may also need to remove other tuples correlated with Bob.

8 Counterfactual Privacy: Consider a set of data evolution scenarios (adversaries) {θ}. For every property s_Bob about Bob, and every output w of the mechanism: |log P(s_Bob | θ, M(D) = w) − log P(s_Bob | θ, M(D_-Bob) = w)| ≤ ε. When {θ} is the set of all product distributions that are independent across individuals, and D_-Bob = D – {Bob’s record}, a mechanism satisfies the above definition if and only if it satisfies differential privacy.
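As a concrete illustration of the per-output bound that drives this equivalence, here is a minimal sketch (my example, not from the lecture) using the Laplace mechanism on a count query, a standard ε-differentially private mechanism; the counts 42 and 41 and ε = 0.5 are arbitrary choices:

```python
# Sketch: for Laplace noise with scale 1/eps on a count query, the log of the
# output-density ratio between D and D_-Bob (counts differing by one record)
# never exceeds eps.  Under product-distribution priors, this likelihood bound
# is what bounds the adversary's posterior shift about Bob.
import math

def laplace_density(x, mu, scale):
    return math.exp(-abs(x - mu) / scale) / (2 * scale)

eps = 0.5
count_D, count_D_minus_bob = 42, 41             # true counts differ by Bob's record
for w in (40.0, 41.3, 42.7, 45.0):              # a few possible noisy outputs
    gap = abs(math.log(laplace_density(w, count_D, 1 / eps)) -
              math.log(laplace_density(w, count_D_minus_bob, 1 / eps)))
    print(w, round(gap, 3), gap <= eps + 1e-9)  # the gap never exceeds eps
```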

9 Counterfactual Privacy: Consider a set of data evolution scenarios (adversaries) {θ}. For every property s_Bob about Bob, and every output w of the mechanism: |log P(s_Bob | θ, M(D) = w) − log P(s_Bob | θ, M(D_-Bob) = w)| ≤ ε. What about other sets of prior distributions {θ}? Open question: if {θ} contains correlations, then the definition of D_-Bob itself is not very clear (as discussed in the previous class for count constraints and social networks).

10 Certain vs Uncertain Adversaries: Suppose an adversary has an uncertain prior. Consider a two-sided coin. A certain adversary knows the bias of the coin = p (for some p): it exactly knows the bias, and knows every coin flip is a random draw that is H with probability p. An uncertain adversary may think the coin’s bias is in [p − δ, p + δ]: it does not exactly know the bias, and assumes the coin’s bias θ is drawn from some probability distribution π; given θ, every coin flip is a random draw that is H with probability θ.

11 Learning: In machine learning/statistics, you want to use the observed data to learn something about the population. E.g., given 10 flips of a coin, what is the bias of the coin? Assume your population is drawn from some prior distribution θ. We don’t know θ, but may know that some θ’s are more likely than others (i.e., π, a probability distribution over θ’s). We want to learn the best θ that explains the observations. If you are certain about θ in the first place, there is no need for statistics/machine learning.
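The coin example can be made concrete with a short sketch (mine, not the lecture’s); the Beta(1, 1) prior and the particular flips are illustrative assumptions:

```python
# A certain adversary fixes the coin's bias and learns nothing from the flips;
# an uncertain adversary puts a Beta prior over the bias and updates it.
def certain_adversary_pred(p, flips):
    # Already knows the bias: the next flip is H with probability p regardless
    # of what was observed.
    return p

def uncertain_adversary_pred(a0, b0, flips):
    # Beta(a0, b0) prior over the bias; posterior after the flips is
    # Beta(a0 + #H, b0 + #T), and the predictive probability of H is its mean.
    heads = sum(flips)
    tails = len(flips) - heads
    return (a0 + heads) / (a0 + b0 + heads + tails)

flips = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]          # 10 observed flips (1 = heads)
print(certain_adversary_pred(0.5, flips))       # stays 0.5
print(uncertain_adversary_pred(1, 1, flips))    # moves toward the data: 0.75
```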

12 Uncertain Priors and Learning: In many privacy scenarios, the statistician (advertiser, epidemiologist, machine-learning practitioner, …) is the adversary. But the statistician does not have a certain prior (otherwise there would be no learning). Maybe we can model a class of “realistic” adversaries using uncertain priors.

13 E-Privacy: Counterfactual privacy with realistic adversaries. Consider a set of data evolution scenarios (uncertain adversaries) {π}. For every property s_Bob about Bob, and every output w of the mechanism: |log P(s_Bob | π, M(D) = w) − log P(s_Bob | π, M(D_-Bob) = w)| ≤ ε, where P(s_Bob | π, M(D) = w) = ∫_θ P(s_Bob | θ, M(D) = w) P(θ | π) dθ.
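The integral over θ can be approximated numerically; here is a minimal Monte Carlo sketch (my illustration). The Beta(2, 2) prior and the toy conditional, which simply equates the adversary’s belief about s_Bob with θ, are assumptions chosen for brevity:

```python
# Monte Carlo evaluation of the E-privacy posterior
#   P(s_Bob | pi, M(D) = w) = integral over theta of
#       P(s_Bob | theta, M(D) = w) * P(theta | pi) dtheta,
# for some adversary-specific conditional post_given_theta.
import random

def e_privacy_posterior(post_given_theta, sample_theta, n_samples=100_000):
    # Average the theta-specific posterior over draws of theta from pi.
    return sum(post_given_theta(sample_theta()) for _ in range(n_samples)) / n_samples

# Example: pi = Beta(2, 2) over a coin bias, and a toy conditional that says
# the adversary's belief about s_Bob equals theta itself.
post = e_privacy_posterior(post_given_theta=lambda t: t,
                           sample_theta=lambda: random.betavariate(2, 2))
print(post)   # close to 0.5, the mean of Beta(2, 2)
```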

14 Realistic Adversaries: Suppose your domain has 3 values: green (g), red (r), and blue (b). Suppose individuals are assumed to be drawn from some common distribution θ = {p_g, p_r, p_b}. (Figure: the probability simplex with corners {p_g = 1, p_r = 0, p_b = 0}, {p_g = 0, p_r = 1, p_b = 0}, and {p_g = 0, p_r = 0, p_b = 1}.)

15 Modeling Realistic Adversaries: E.g., use Dirichlet priors D(α_g, α_r, α_b) to model uncertainty. Maximum probability is given to {p*_g, p*_r, p*_b}, where p*_g = α_g / (α_g + α_r + α_b). (Figure: density plots of example priors D(6,2,2), D(3,7,5), D(6,2,6), and D(2,3,4); e.g., D(3,7,5) concentrates around {p*_g = 0.2, p*_r = 0.47, p*_b = 0.33} and D(6,2,6) around {p*_g = 0.43, p*_r = 0.14, p*_b = 0.43}.)

16 E.g., Dirichlet Prior D(α_g, α_r, α_b): Call α = α_g + α_r + α_b the stubbornness of the prior. As α increases, more probability mass is concentrated around {p*_g, p*_r, p*_b}. When α → ∞, {p*_g, p*_r, p*_b} has probability 1 and we recover the independence assumption. (Figure: example priors D(2,3,4) and D(6,2,6) with their stubbornness values.)
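A small sketch (my illustration) of the stubbornness effect, using the standard Gamma-based Dirichlet sampler; the shape (0.2, 0.3, 0.5) and the α values are arbitrary choices:

```python
# Holding the shape p* fixed and scaling the total mass alpha up, Dirichlet
# draws concentrate around p*, approaching the independence assumption.
import random

def dirichlet_sample(alphas):
    # Standard Gamma-based Dirichlet sampler.
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

shape = [0.2, 0.3, 0.5]                    # fixed p*
for alpha in (10, 100, 10_000):            # increasing stubbornness
    draws = [dirichlet_sample([alpha * p for p in shape]) for _ in range(1000)]
    spread = max(abs(d[0] - shape[0]) for d in draws)
    print(alpha, round(spread, 3))         # the spread shrinks as alpha grows
```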

17 Better Utility: Suppose we consider a class of uncertain adversaries characterized by Dirichlet distributions with known stubbornness α but unknown shape (α_g, α_r, α_b). Algorithm: generalization, where the size of each group is > α/(ε − 1); if the size of a group is δ·(α/(ε − 1)), then the most frequent sensitive value appears in at most a 1 − 1/(ε + δ) fraction of the tuples in the group. Hence, we now also have a sliding scale to assess the power of adversaries (based on stubbornness): larger α implies more coarsening, which implies less utility.

18 Summary: Lots of open questions. What is the relationship between counterfactual privacy and Pufferfish? In what other ways can causality theory be used, e.g., for defining correlations? What are other interesting ways to instantiate E-privacy, and what are efficient algorithms for E-privacy?

19 DIFFERENTIAL PRIVACY IN US CENSUS

20 OnTheMap: A Census application that plots commuting patterns of workers. (Figure: map showing a workplace, which is public, and residences, which are sensitive.) http://onthemap.ces.census.gov/

21 OnTheMap: A Census application that plots commuting patterns of workers. Origins and destinations are Census blocks; the residence (origin) is sensitive and the workplace (destination) is a quasi-identifier.

Worker ID | Origin  | Destination
1223      | MD11511 | DC22122
1332      | MD2123  | DC22122
1432      | VA11211 | DC22122
2345      | PA12121 | DC24132
1432      | PA11122 | DC24132
1665      | MD1121  | DC24132
1244      |         | DC22122

22 Why publish commute patterns? To compute Quarterly Workforce Indicators: total employment, average earnings, new hires and separations, unemployment statistics. E.g., the state of Missouri used this data to formulate a method allowing QWI to suggest industrial sectors where transitional training might be most effective, to proactively reduce time spent on unemployment insurance.

23 A Synthetic Data Generator (Dirichlet resampling). Step 1: Noise addition (for each destination). Take the multiset of origins for workers commuting to Washington DC, D = (7, 5, 4) over the blocks (Washington DC, Somerset, Fuller), add noise in the form of fake workers, A = (2, 3, 3), and obtain the noise-infused data D + A = (9, 8, 7). Noise added to an origin with at least 1 worker is > 0.

24 A Synthetic Data Generator (Dirichlet resampling). Step 2: Dirichlet resampling (for each destination). Starting from D + A = (9, 8, 7), repeatedly draw a point at random (e.g., drawing from the second block gives (9, 7, 7)) and replace it with two of the same kind (giving (9, 9, 7)); each drawn point is added to S, the synthetic data. Note that if the frequency of block b in D + A is 0, then the frequency of b in S is 0, i.e., block b is ignored by the algorithm.
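A minimal sketch of this two-step synthesizer as described above (my reading of the slides); the block names and counts mirror the example, and m, the number of synthetic workers to draw, is a free parameter:

```python
# Step 1: add fake workers per origin block.  Step 2: Polya-urn resampling --
# draw a block proportional to its current count, emit it as a synthetic
# worker, and put the drawn point back together with another of the same kind.
import random

def synthesize(true_counts, noise_counts, m):
    urn = {b: true_counts.get(b, 0) + noise_counts.get(b, 0)
           for b in set(true_counts) | set(noise_counts)}
    synthetic = []
    for _ in range(m):
        blocks, weights = zip(*urn.items())
        b = random.choices(blocks, weights=weights)[0]
        synthetic.append(b)
        urn[b] += 1          # net effect of "replace two of the same kind"
    return synthetic         # blocks with count 0 in D+A can never appear here

D = {"Washington DC": 7, "Somerset": 5, "Fuller": 4}   # original origins
A = {"Washington DC": 2, "Somerset": 3, "Fuller": 3}   # added noise
print(synthesize(D, A, m=20))
```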

25 How should we add noise? Intuitively, more noise yields more privacy. How much noise should we add? To which blocks should we add noise? Currently this is poorly understood: the total amount of noise added is a state secret, and only 3-4 people in the US know this value in the current implementation of OnTheMap.

26 How should we add noise? 1. How much noise should we add? 2. To which blocks should we add noise?

27 Privacy of Synthetic Data. Theorem 1: The Dirichlet resampling algorithm preserves privacy if and only if, for every destination d, the noise added to each block is at least m(d) / (ε − 1), where m(d) is the size of the synthetic population for destination d and ε is the privacy parameter.

28 1. How much noise should we add? (Figure: noise required per block under differential privacy, for 1 million original and synthetic workers, plotted against the privacy parameter; larger ε means lesser privacy.) We would have to add noise to every block on the map, and there are 8 million Census blocks on the map: 1 million original workers and 16 billion fake workers! 2. To which blocks should we add noise?

29 Intuition behind Theorem 1. Consider two possible inputs D1 and D2, where blue and red are two different origin blocks. The adversary knows that individual 1 is either blue or red, and that individuals 2..n are blue.

30 Intuition behind Theorem 1 (continued). Noise addition is applied to the two possible inputs D1 and D2, where blue and red are two different origin blocks.

31 Intuition behind Theorem 1 (continued). Dirichlet resampling is run on the noise-infused inputs D1 and D2. For every output O: Pr[D1 → O] = 1/10 · 2/11 · 3/12 · 4/13 · 5/14 · 6/15 and Pr[D2 → O] = 2/10 · 3/11 · 4/12 · 5/13 · 6/14 · 7/15, so Pr[D2 → O] / Pr[D1 → O] = 7.
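This ratio can be checked directly; a short sketch (my verification of the slide’s numbers), where a is the count the contested block starts with and k is the number of synthetic workers drawn from it:

```python
# The telescoping product of urn draw probabilities gives a worst-case
# likelihood ratio of (a + k) / a for the contested block.
from fractions import Fraction

def output_prob(start_count, total_start, k):
    # Probability of drawing the contested block k times in a row, when its
    # count starts at start_count out of total_start points in the urn.
    p = Fraction(1)
    for i in range(k):
        p *= Fraction(start_count + i, total_start + i)
    return p

# D1: the block holds noise only (count 1); D2: noise plus individual 1 (count 2).
p1 = output_prob(1, 10, 6)   # 1/10 * 2/11 * ... * 6/15
p2 = output_prob(2, 10, 6)   # 2/10 * 3/11 * ... * 7/15
print(p2 / p1)               # Fraction(7, 1): the ratio (a + k) / a with a = 1, k = 6
```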

32 Intuition behind Theorem 1 (continued). For every such output O, the adversary infers that it is very likely that individual 1 is red, unless the noise added is very large.

33 Privacy Analysis: Summary. Chose differential privacy: it guards against powerful adversaries and measures privacy as a distance between prior and posterior. Derived necessary and sufficient conditions under which OnTheMap preserves privacy. The above conditions make the data published by OnTheMap useless.

34 But the breach occurs with very low probability. For the noise-infused inputs D1 and D2, an output O that lets the adversary distinguish them has probability ≈ 10^-4.

35 Negligible function. Definition: f(x) is negligible if it goes to 0 faster than the inverse of any polynomial; e.g., 2^-x and e^-x are negligible functions.
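A quick numeric illustration (mine) of what “faster than the inverse of any polynomial” means, comparing 2^-x against the fixed polynomial inverse 1/x^10:

```python
# 2**-x eventually drops below 1/x**10 (and below 1/x**k for any fixed k).
for x in (10, 50, 100, 200):
    print(x, 2.0 ** -x < 1.0 / x ** 10)   # False, False, True, True
```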

36 (ε,δ)-Indistinguishability. For every pair of inputs D1 and D2 that differ in one value, and for any subset of outputs T: Pr[D1 → T] ≤ e^ε · Pr[D2 → T] + δ(|D2|). If T occurs with negligible probability, the adversary is allowed to distinguish between D1 and D2 by a factor greater than e^ε using the outputs in T.

37 Conditions for (ε,δ)-Indistinguishability. Theorem 2: The Dirichlet resampling algorithm preserves (ε,δ)-indistinguishability if, for every destination d, the noise added to each block is at least log n(d), where n(d) is the number of workers commuting to d and m(d) ≤ n(d).

38 Probabilistic Differential Privacy. (ε,δ)-Indistinguishability is an asymptotic measure, and may not guarantee privacy when the number of workers at a destination is small. Definition (Disclosure set Disc(D, ε)): the set of output tables that breach ε-differential privacy for D and some other table D’ that differs from D in one value.

39 Probabilistic Differential Privacy. For every pair of inputs D1 and D2 that differ in one value, and with probability at least 1 − δ over the outputs O: Pr[D1 → O] ≤ e^ε · Pr[D2 → O]. In other words, the adversary may distinguish between D1 and D2 based on a set of unlikely outputs with probability at most δ.
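A minimal Monte Carlo sketch (my illustration, not the paper’s analysis) of what δ measures for the urn example above: how often a run of the synthesizer on D1 produces an output whose likelihood ratio against D2 exceeds the allowed bound. The counts, m = 6, and the ratio bound (playing the role of e^ε) are arbitrary choices:

```python
import random
from fractions import Fraction

def run_and_ratio(counts1, counts2, m):
    # Sample one synthetic output from the urn on counts1 and return the
    # likelihood ratio Pr[D1 -> O] / Pr[D2 -> O] for that output.
    urn1, urn2 = dict(counts1), dict(counts2)
    t1, t2 = sum(urn1.values()), sum(urn2.values())
    ratio = Fraction(1)
    for _ in range(m):
        blocks, weights = zip(*urn1.items())
        b = random.choices(blocks, weights=weights)[0]
        ratio *= Fraction(urn1[b], t1) / Fraction(urn2[b], t2)
        urn1[b] += 1; urn2[b] += 1; t1 += 1; t2 += 1
    return ratio

def estimate_delta(counts1, counts2, m, ratio_bound, trials=20_000):
    bad = 0
    for _ in range(trials):
        r = run_and_ratio(counts1, counts2, m)
        if r > ratio_bound or 1 / r > ratio_bound:
            bad += 1
    return bad / trials   # fraction of runs whose output breaches the bound

# D1 and D2 differ in one worker's origin block ("red" vs "blue").
print(estimate_delta({"blue": 9, "red": 1}, {"blue": 8, "red": 2},
                     m=6, ratio_bound=5))   # small: only red-heavy outputs breach it
```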

40 1. How much noise should we add? (Figure: noise required per block for 1 million original and synthetic workers, comparing differential privacy with probabilistic differential privacy at δ = 10^-5, as the privacy parameter grows, i.e., as privacy lessens.)

41 Prob. Differential Privacy: Summary. Ignoring privacy breaches that occur due to low-probability outputs drastically reduces the noise. Two ways to bound low-probability outputs: (ε,δ)-indistinguishability with negligible functions, and (ε,δ)-probabilistic differential privacy with disclosure sets. The noise required for privacy is ≥ log n(d) per block, and there is an efficient algorithm to calculate the noise per block (see paper). Does probabilistic differential privacy allow useful information to be published?

42 1. How much noise should we add? (Figure: noise required per block for 1 million original and synthetic workers, under differential privacy vs probabilistic differential privacy with δ = 10^-5.) 2. To which blocks should we add noise? Why not add noise to every block?

43 Why not add noise to every block? With the noise required per block under probabilistic differential privacy and about 8 million blocks on the map, the total noise added is about 6 million fake workers. This causes non-trivial spurious commute patterns: roughly 1 million fake workers come from the West Coast (out of a total of 7 million points in D + A), so 1/7 of the synthetic data have residences on the West Coast and work in Washington DC.

44 2. To which blocks should we add noise? Adding noise to all blocks creates spurious commute patterns. Why not add noise only to blocks that appear in the original data?

45 Theorem 3: Adding noise only to blocks that appear in the data breaches privacy. If a block b does not appear in the original data and no noise is added to b, then b cannot appear in the synthetic data.

46 Theorem 3 (continued): Worker W comes from Somerset or Fayette, and no one else comes from there, so the possible inputs are (Somerset 1, Fayette 0) and (Somerset 0, Fayette 1). If the synthetic data S contains a synthetic worker from Somerset, then W comes from Somerset!

47 Ignoring outliers degrades utility. (Figure: map of origin blocks; each of the highlighted points is an outlier, yet together they contribute about half of the workers.)

48 Our solution to “Where to add noise?” Step 1: Coarsen the domain, based on an existing public dataset (the Census Transportation Planning Package, CTPP).

49 Our solution to “Where to add noise?” Step 1: Coarsen the domain. Step 2: Probabilistically drop blocks with 0 support: pick a function f: {b_1, …, b_k} → (0,1] (based on external data), and for every block b with 0 support, ignore b with probability f(b). Theorem 4: The parameter ε increases by max_b ( max( 2 · (noise per block), f(b) ) ).
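A minimal sketch of this pruning step as stated above (my reading of the slide); the toy f below is a placeholder, and in practice f would be derived from external data such as CTPP:

```python
# Blocks with positive support always receive noise; blocks with zero support
# are ignored with probability f(b) and receive noise otherwise.
import random

def blocks_to_noise(support, f):
    # support: block -> count in the original data; f: block -> (0, 1]
    kept = []
    for b, count in support.items():
        if count > 0 or random.random() >= f(b):   # drop zero-support b w.p. f(b)
            kept.append(b)
    return kept   # only these blocks receive noise (and can appear in S)

support = {"MN100": 12, "MN101": 0, "MN102": 0}
print(blocks_to_noise(support, f=lambda b: 0.9))   # MN100 is always kept
```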

50 Utility of the provably private algorithm. Experimental setup: OTM, the currently published OnTheMap data, is used as the original data, with all destinations in Minnesota and 120,690 origins per destination, chosen by pruning out blocks that are > 100 miles from the destination. ε = 100, δ = 10^-5. Additional leakage due to probabilistic pruning = 4 (min f(b) = 0.0378). Utility is measured by the average commute distance for each destination block.
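The utility metric itself is easy to sketch (illustrative code, with made-up block coordinates and a Euclidean distance standing in for road or great-circle distance):

```python
# Average commute distance for a given destination block, computed over a list
# of (origin_block, destination_block) worker records; compare the value on the
# original data with the value on the synthetic data.
from math import dist

def avg_commute(workers, coords, destination):
    ds = [dist(coords[o], coords[destination])
          for o, d in workers if d == destination]
    return sum(ds) / len(ds) if ds else 0.0

coords = {"MN1": (0.0, 0.0), "MN2": (3.0, 4.0), "MN3": (6.0, 8.0)}   # toy blocks
original = [("MN2", "MN1"), ("MN3", "MN1")]
synthetic = [("MN2", "MN1"), ("MN2", "MN1")]
print(avg_commute(original, coords, "MN1"),     # 7.5
      avg_commute(synthetic, coords, "MN1"))    # 5.0
```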

51 Utility of the provably private algorithm. Utility is measured by the average commute distance for each destination block. Short commutes have low error in both sparse and dense regions.

52 Utility of the provably private algorithm. Long commutes in sparse regions are overestimated.

53 OnTheMap: Summary. OnTheMap is a real Census application in which synthetically generated data are published for economic research; currently, its privacy implications are poorly understood and the parameters of the algorithm are a state secret. This is the first formal privacy analysis of the application: we analyzed the privacy of OnTheMap using variants of differential privacy and gave the first solutions that publish useful information despite sparse data.

54 Next Class: No class on Wednesday, Oct 31. Guest lecture by Prof. Jerry Reiter (Duke Statistics) on Friday, Nov 2: “Privacy in the U.S. Census and Synthetic Data Generation”.

55 References:
[M et al ICDE ’08] A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke, L. Vilhuber. “Privacy: Theory Meets Practice on the Map.” ICDE 2008.
[M et al VLDB ’09] A. Machanavajjhala, J. Gehrke, M. Götz. “Data Publishing against Realistic Adversaries.” PVLDB 2(1), 2009.

