Privacy-Preserving Data Publishing Donghui Zhang Northeastern University Acknowledgement: some slides come from Yufei Tao and Dimitris Sacharidis.

1 Privacy-Preserving Data Publishing
Donghui Zhang, Northeastern University
Acknowledgement: some slides come from Yufei Tao and Dimitris Sacharidis.

2 motivation
several agencies, institutions, bureaus, organizations make (sensitive) data involving people publicly available
–termed microdata (vs. aggregated macrodata), used for analysis
–often required and imposed by law
to protect privacy, microdata are sanitized
–explicit identifiers (SSN, name, phone #) are removed
is this sufficient for preserving privacy? no! susceptible to link attacks
–publicly available databases (voter lists, city directories) can reveal the “hidden” identity

3 link attack example
[Sweeney01] managed to re-identify the medical record of the governor of Massachusetts
–MA collects and publishes sanitized medical data for state employees (microdata): left circle
–voter registration list of MA (publicly available data): right circle
looking for the governor’s record, join the tables:
–6 people had his birth date
–3 were men
–1 in his zipcode
according to the 1990 US census data
–87% of the population are unique based on (zipcode, gender, dob)

4 Microdata
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis

5 Inference Attack
Published table (Age and Zipcode are the quasi-identifier (QI) attributes):
Age  Zipcode  Disease
21   12000    dyspepsia
22   14000    bronchitis
24   18000    flu
23   25000    gastritis
41   20000    flu
36   27000    gastritis
37   33000    dyspepsia
40   35000    flu
43   26000    gastritis
52   33000    dyspepsia
56   34000    gastritis
An adversary knows:
Name  Age  Zipcode
Bob   21   12000
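The attack on this slide amounts to a join on the QI attributes. A minimal sketch follows, using the slide's data; the function name `link` and the tuple layout are illustrative assumptions, not part of the original slides.

```python
# Published table from the slide: names removed, QI values left exact.
published = [
    (21, 12000, "dyspepsia"), (22, 14000, "bronchitis"), (24, 18000, "flu"),
    (23, 25000, "gastritis"), (41, 20000, "flu"), (36, 27000, "gastritis"),
    (37, 33000, "dyspepsia"), (40, 35000, "flu"), (43, 26000, "gastritis"),
    (52, 33000, "dyspepsia"), (56, 34000, "gastritis"),
]

# External knowledge held by the adversary (for example from a voter list).
external = [("Bob", 21, 12000)]


def link(external_rows, published_rows):
    """Join the two tables on the quasi-identifier (age, zipcode)."""
    matches = {}
    for name, age, zipcode in external_rows:
        matches[name] = [
            disease
            for a, z, disease in published_rows
            if (a, z) == (age, zipcode)
        ]
    return matches


result = link(external, published)
print(result)
```

Because Bob's (Age, Zipcode) pair is unique in the published table, the join returns exactly one candidate disease for him.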

6 k-anonymity [Samarati and Sweeney02]
Transform the QI values into less specific forms (generalize).
Before generalization:
Age  Zipcode  Disease
21   12000    dyspepsia
22   14000    bronchitis
24   18000    flu
23   25000    gastritis
41   20000    flu
36   27000    gastritis
37   33000    dyspepsia
40   35000    flu
43   26000    gastritis
52   33000    dyspepsia
56   34000    gastritis
After generalization:
Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia
[21, 22]  [12k, 14k]  bronchitis
[23, 24]  [18k, 25k]  flu
[23, 24]  [18k, 25k]  gastritis
[36, 41]  [20k, 27k]  flu
[36, 41]  [20k, 27k]  gastritis
[37, 43]  [26k, 35k]  dyspepsia
[37, 43]  [26k, 35k]  flu
[37, 43]  [26k, 35k]  gastritis
[52, 56]  [33k, 34k]  dyspepsia
[52, 56]  [33k, 34k]  gastritis

7 Generalization
Transform each QI value into a less specific form.
An adversary knows:
Name  Age  Zipcode
Bob   21   12000
A generalized table:
Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia
[21, 22]  [12k, 14k]  bronchitis
[23, 24]  [18k, 25k]  flu
[23, 24]  [18k, 25k]  gastritis
[36, 41]  [20k, 27k]  flu
[36, 41]  [20k, 27k]  gastritis
[37, 43]  [26k, 35k]  dyspepsia
[37, 43]  [26k, 35k]  flu
[37, 43]  [26k, 35k]  gastritis
[52, 56]  [33k, 34k]  dyspepsia
[52, 56]  [33k, 34k]  gastritis
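One way to check which k a generalized table satisfies is to count the rows in each distinct QI combination and take the minimum. The sketch below uses the generalized table from this slide; the helper name `anonymity_level` is an assumption made for illustration.

```python
from collections import Counter

# Generalized table from the slide: (age range, zipcode range, disease).
generalized = [
    ("[21, 22]", "[12k, 14k]", "dyspepsia"),
    ("[21, 22]", "[12k, 14k]", "bronchitis"),
    ("[23, 24]", "[18k, 25k]", "flu"),
    ("[23, 24]", "[18k, 25k]", "gastritis"),
    ("[36, 41]", "[20k, 27k]", "flu"),
    ("[36, 41]", "[20k, 27k]", "gastritis"),
    ("[37, 43]", "[26k, 35k]", "dyspepsia"),
    ("[37, 43]", "[26k, 35k]", "flu"),
    ("[37, 43]", "[26k, 35k]", "gastritis"),
    ("[52, 56]", "[33k, 34k]", "dyspepsia"),
    ("[52, 56]", "[33k, 34k]", "gastritis"),
]


def anonymity_level(rows):
    """Size of the smallest QI group, i.e. the largest k the table satisfies."""
    group_sizes = Counter((age, zipcode) for age, zipcode, _ in rows)
    return min(group_sizes.values())


k = anonymity_level(generalized)
print(k)
```

Every QI group here holds at least two rows, so an adversary who knows Bob's age and zipcode is left with two candidate records instead of one.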

8 Graphically… [Figure: Age vs. Zipcode scatter plot of the microdata; the points for Bob and Alice are labeled]

9 Why not… [Figure: Age vs. Zipcode scatter plot of the microdata]
How many people with age in [30, 50] contracted flu?

10 k-anonymity
How many people with age in [30, 50] contracted flu?
Generalization with high utility (answers queries more accurately: 2):
Age       Zipcode     Disease
[21, 22]  [12k, 14k]  dyspepsia
[21, 22]  [12k, 14k]  bronchitis
[23, 24]  [18k, 25k]  flu
[23, 24]  [18k, 25k]  gastritis
[36, 41]  [20k, 27k]  flu
[36, 41]  [20k, 27k]  gastritis
[37, 43]  [26k, 35k]  dyspepsia
[37, 43]  [26k, 35k]  flu
[37, 43]  [26k, 35k]  gastritis
[52, 56]  [33k, 34k]  dyspepsia
[52, 56]  [33k, 34k]  gastritis
Generalization with low utility (answers less accurately: [0..3]):
Age       Zipcode     Disease
[21, 56]  [12k, 35k]  dyspepsia
[21, 56]  [12k, 35k]  bronchitis
[21, 56]  [12k, 35k]  flu
[21, 56]  [12k, 35k]  gastritis
[21, 56]  [12k, 35k]  flu
[21, 56]  [12k, 35k]  gastritis
[21, 56]  [12k, 35k]  dyspepsia
[21, 56]  [12k, 35k]  flu
[21, 56]  [12k, 35k]  gastritis
[21, 56]  [12k, 35k]  dyspepsia
[21, 56]  [12k, 35k]  gastritis
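The two answers on this slide can be reproduced by bounding the count: a record surely matches if its age interval lies inside the query range, and possibly matches if the interval merely overlaps it. This is a sketch under that interpretation; `count_bounds` is a hypothetical helper, and only the Age and Disease columns are kept.

```python
DISEASES = [
    "dyspepsia", "bronchitis", "flu", "gastritis", "flu", "gastritis",
    "dyspepsia", "flu", "gastritis", "dyspepsia", "gastritis",
]
FINE_AGES = [
    (21, 22), (21, 22), (23, 24), (23, 24), (36, 41), (36, 41),
    (37, 43), (37, 43), (37, 43), (52, 56), (52, 56),
]

fine = list(zip(FINE_AGES, DISEASES))          # high-utility table
coarse = [((21, 56), d) for d in DISEASES]     # low-utility table


def count_bounds(rows, lo, hi, disease):
    """Lower and upper bound on the records with age in [lo, hi] and the disease."""
    lower = upper = 0
    for (a_lo, a_hi), d in rows:
        if d != disease:
            continue
        if lo <= a_lo and a_hi <= hi:      # interval fully inside the query range
            lower += 1
            upper += 1
        elif a_lo <= hi and lo <= a_hi:    # interval only overlaps the query range
            upper += 1
    return lower, upper


fine_answer = count_bounds(fine, 30, 50, "flu")
coarse_answer = count_bounds(coarse, 30, 50, "flu")
print(fine_answer, coarse_answer)
```

The tighter rectangles pin the answer to exactly 2, while the single large rectangle only says the answer lies somewhere between 0 and 3.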

11 k-anonymity with utility
Among all generalizations that enforce k-anonymity, we should maximize utility by minimizing the “rectangle” sizes!
Several measures exist, e.g. minimizing the maximum perimeter of the rectangles.

12 Mondrian [LDR06] Recursive half-plane partitioning, alternating dimensions. let k=2
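A simplified sketch of the recursive partitioning idea is given below. It is not the algorithm from [LDR06] verbatim: it assumes two numeric QI attributes, cuts at the median value, alternates dimensions as the slide describes, and falls back to the other dimension when a cut would leave fewer than k points on one side. The names `mondrian` and `mbr` are illustrative.

```python
def mondrian(points, k, dim=0):
    """Split points recursively at the median, alternating dimensions.

    A cut is kept only if both sides retain at least k points. If neither
    dimension allows such a cut, the points form one QI group.
    """
    for d in (dim, 1 - dim):
        values = sorted(p[d] for p in points)
        median = values[(len(values) - 1) // 2]
        left = [p for p in points if p[d] <= median]
        right = [p for p in points if p[d] > median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k, 1 - d) + mondrian(right, k, 1 - d)
    return [points]


def mbr(group):
    """Bounding rectangle of a group: ((min_age, max_age), (min_zip, max_zip))."""
    ages = [p[0] for p in group]
    zips = [p[1] for p in group]
    return (min(ages), max(ages)), (min(zips), max(zips))


# (age, zipcode) pairs from the microdata slide.
points = [
    (21, 12000), (22, 14000), (24, 18000), (23, 25000), (41, 20000),
    (36, 27000), (37, 33000), (40, 35000), (43, 26000), (52, 33000),
    (56, 34000),
]

groups = mondrian(points, k=2)
for group in groups:
    print(mbr(group), len(group))
```

Each printed rectangle is the generalized (Age, Zipcode) range that would be published for the records in that group.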

13 Mondrian [LDR06] Unbounded approximation ratio! let k=4

14 Our contributions [DXT+07]
Proved that finding the optimal partitioning is NP-hard.
Proved that finding a partitioning with approximation ratio less than 1.25 is also NP-hard.
Provided three algorithms with tradeoffs in complexity and approximation ratio.

15 Divide-And-Group (DAG)
1. Divide the space into square cells with proper size.
2. Find a set of non-overlapping tiles of 2 x 2 cells to cover the points, such that each tile covers at least k points.
3. Assign the rest of the (uncovered) points to the nearest tile.

16 Min-MBR-Group (MMG)
1. For each point p, find the smallest MBR which covers at least k points including p.
2. Find a set of non-overlapping MBRs from the result of the previous step.
3. Assign the points to the nearest MBR.

17 Nearest-Neighbor-Group (NNG)
1. For each point p, find the MBR which covers p and its k-1 nearest neighbors.
2. Find a set of non-overlapping MBRs from the result of the previous step.
3. Assign the points to the nearest MBR.
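The first step of NNG can be sketched as follows. This covers only the candidate-MBR computation, not the selection of non-overlapping MBRs or the final assignment. It assumes 2-D points, Euclidean distance, and brute-force neighbor search; the sample points and the name `knn_mbr` are made up for illustration.

```python
import math


def knn_mbr(points, p, k):
    """MBR (min_x, min_y, max_x, max_y) of p and its k-1 nearest neighbors."""
    # p itself is at distance 0, so it is always among the first k points.
    nearest = sorted(points, key=lambda q: math.dist(p, q))[:k]
    xs = [q[0] for q in nearest]
    ys = [q[1] for q in nearest]
    return (min(xs), min(ys), max(xs), max(ys))


points = [(0, 0), (1, 0), (0, 2), (10, 10), (11, 10)]
candidates = {p: knn_mbr(points, p, k=2) for p in points}
for p, box in candidates.items():
    print(p, box)
```

Sorting all points for every p costs more than an index-based neighbor search would; it is used here only to keep the sketch short.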

18 Analysis
Algorithm  Complexity          Approximation ratio
DAG        O(3^d d n log^2 n)  8d
MMG        O(d n^(2d+1))       2d+1
NNG        O(d n^2)            6d

19 Drawback of k-anonymity
In a QI group, if many records have the same sensitive attribute value...
(Age, Sex, Zipcode: quasi-identifier (QI) attributes; Disease: sensitive attribute)
Age       Sex  Zipcode         Disease
[21, 40]  M    [10001, 60000]  pneumonia
[30, 60]  M    [10001, 60000]  dyspepsia
[30, 60]  M    [10001, 60000]  dyspepsia
[21, 40]  M    [10001, 60000]  pneumonia
[61, 65]  F    [10001, 60000]  flu
[63, 70]  F    [10001, 60000]  gastritis
[61, 65]  F    [10001, 60000]  flu
[63, 70]  F    [10001, 60000]  bronchitis
If Bob is in the [21, 40] group, he must have pneumonia.

20 l-diversity [ICDE06]
A QI-group with m tuples is l-diverse, iff each sensitive value appears no more than m / l times in the QI-group.
A table is l-diverse, iff all of its QI-groups are l-diverse.
The table below has 2 QI-groups and is 2-diverse.
(Age, Sex, Zipcode: quasi-identifier (QI) attributes; Disease: sensitive attribute)
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
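The definition on this slide translates directly into a frequency check per QI group. The sketch below follows that definition (each value appears at most m / l times in a group of m tuples); the function name `is_l_diverse` and the row layout are assumptions for illustration.

```python
from collections import Counter, defaultdict

# (QI group key, sensitive value) pairs from the table on this slide.
rows = [
    (("[21, 60]", "M"), "pneumonia"),
    (("[21, 60]", "M"), "dyspepsia"),
    (("[21, 60]", "M"), "dyspepsia"),
    (("[21, 60]", "M"), "pneumonia"),
    (("[61, 70]", "F"), "flu"),
    (("[61, 70]", "F"), "gastritis"),
    (("[61, 70]", "F"), "flu"),
    (("[61, 70]", "F"), "bronchitis"),
]


def is_l_diverse(table, l):
    """True if, in every QI group of m tuples, no value appears more than m / l times."""
    groups = defaultdict(list)
    for qi, sensitive in table:
        groups[qi].append(sensitive)
    return all(
        max(Counter(values).values()) * l <= len(values)
        for values in groups.values()
    )


two_diverse = is_l_diverse(rows, 2)
three_diverse = is_l_diverse(rows, 3)
print(two_diverse, three_diverse)
```

The comparison is written as `count * l <= m` to avoid fractional division.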

21 What l-diversity guarantees
From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l.
An adversary knows:
Name  Age  Sex  Zipcode
Bob   23   M    11000
A 2-diverse generalized table:
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006

22 Problem with multi-publishing
A hospital keeps track of the medical records collected in the last three months.
The microdata table T(1) and its generalization T*(1), published in Apr. 2007.
Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis
2-diverse Generalization T*(1):
G. ID  Age       Zipcode     Disease
1      [21, 22]  [12k, 14k]  dyspepsia
1      [21, 22]  [12k, 14k]  bronchitis
2      [23, 24]  [18k, 25k]  flu
2      [23, 24]  [18k, 25k]  gastritis
3      [36, 41]  [20k, 27k]  flu
3      [36, 41]  [20k, 27k]  gastritis
4      [37, 43]  [26k, 35k]  dyspepsia
4      [37, 43]  [26k, 35k]  flu
4      [37, 43]  [26k, 35k]  gastritis
5      [52, 56]  [33k, 34k]  dyspepsia
5      [52, 56]  [33k, 34k]  gastritis

23 Problem with multi-publishing
Bob was hospitalized in Mar. 2007. The adversary knows:
Name  Age  Zipcode
Bob   21   12000
2-diverse Generalization T*(1):
G. ID  Age       Zipcode     Disease
1      [21, 22]  [12k, 14k]  dyspepsia
1      [21, 22]  [12k, 14k]  bronchitis
2      [23, 24]  [18k, 25k]  flu
2      [23, 24]  [18k, 25k]  gastritis
3      [36, 41]  [20k, 27k]  flu
3      [36, 41]  [20k, 27k]  gastritis
4      [37, 43]  [26k, 35k]  dyspepsia
4      [37, 43]  [26k, 35k]  flu
4      [37, 43]  [26k, 35k]  gastritis
5      [52, 56]  [33k, 34k]  dyspepsia
5      [52, 56]  [33k, 34k]  gastritis

24 Problem with multi-publishing
One month later, in May 2007
Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis

25 Problem with multi-publishing
One month later, in May 2007
Some obsolete tuples are deleted from the microdata.
Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis

26 Problem with multi-publishing
Bob’s tuple stays.
Microdata T(1):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Gary   41   20000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Steve  56   34000    gastritis

27 Problem with multi-publishing
Some new records are inserted.
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Gary   41   20000    flu
Mary   46   30000    gastritis
Ray    54   31000    dyspepsia
Steve  56   34000    gastritis
Tom    60   44000    gastritis
Vince  65   36000    flu

28 Problem with multi-publishing
The hospital published T*(2).
Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Gary   41   20000    flu
Mary   46   30000    gastritis
Ray    54   31000    dyspepsia
Steve  56   34000    gastritis
Tom    60   44000    gastritis
Vince  65   36000    flu
2-diverse Generalization T*(2):
G. ID  Age       Zipcode     Disease
1      [21, 23]  [12k, 25k]  dyspepsia
1      [21, 23]  [12k, 25k]  gastritis
2      [25, 43]  [21k, 33k]  flu
2      [25, 43]  [21k, 33k]  dyspepsia
2      [25, 43]  [21k, 33k]  gastritis
3      [41, 46]  [20k, 30k]  flu
3      [41, 46]  [20k, 30k]  gastritis
4      [54, 56]  [31k, 34k]  dyspepsia
4      [54, 56]  [31k, 34k]  gastritis
5      [60, 65]  [36k, 44k]  gastritis
5      [60, 65]  [36k, 44k]  flu

29 Problem with multi-publishing
Consider the previous adversary.
Name  Age  Zipcode
Bob   21   12000
2-diverse Generalization T*(2):
G. ID  Age       Zipcode     Disease
1      [21, 23]  [12k, 25k]  dyspepsia
1      [21, 23]  [12k, 25k]  gastritis
2      [25, 43]  [21k, 33k]  flu
2      [25, 43]  [21k, 33k]  dyspepsia
2      [25, 43]  [21k, 33k]  gastritis
3      [41, 46]  [20k, 30k]  flu
3      [41, 46]  [20k, 30k]  gastritis
4      [54, 56]  [31k, 34k]  dyspepsia
4      [54, 56]  [31k, 34k]  gastritis
5      [60, 65]  [36k, 44k]  gastritis
5      [60, 65]  [36k, 44k]  flu

30 Problem with multi-publishing
The adversary knows:
Name  Age  Zipcode
Bob   21   12000
What the adversary learns from T*(1):
G. ID  Age       Zipcode     Disease
1      [21, 22]  [12k, 14k]  dyspepsia
1      [21, 22]  [12k, 14k]  bronchitis
…
What the adversary learns from T*(2):
G. ID  Age       Zipcode     Disease
1      [21, 23]  [12k, 25k]  dyspepsia
1      [21, 23]  [12k, 25k]  gastritis
…
So Bob must have contracted dyspepsia! A new generalization principle is needed.
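The inference on this slide is a set intersection across the two releases. In the sketch below, only the QI groups near Bob's values are included, zipcode ranges such as 12k are written out as 12000, and the adversary is assumed to know that Bob appears in both releases. The helper name `candidate_diseases` is illustrative.

```python
# Rows are ((age_lo, age_hi), (zip_lo, zip_hi), disease).
t1_star = [
    ((21, 22), (12000, 14000), "dyspepsia"),
    ((21, 22), (12000, 14000), "bronchitis"),
    ((23, 24), (18000, 25000), "flu"),
    ((23, 24), (18000, 25000), "gastritis"),
]
t2_star = [
    ((21, 23), (12000, 25000), "dyspepsia"),
    ((21, 23), (12000, 25000), "gastritis"),
    ((25, 43), (21000, 33000), "flu"),
    ((25, 43), (21000, 33000), "dyspepsia"),
    ((25, 43), (21000, 33000), "gastritis"),
]


def candidate_diseases(table, age, zipcode):
    """Diseases in every QI group whose ranges cover the given QI values."""
    return {
        disease
        for (a_lo, a_hi), (z_lo, z_hi), disease in table
        if a_lo <= age <= a_hi and z_lo <= zipcode <= z_hi
    }


bob = (21, 12000)
from_t1 = candidate_diseases(t1_star, *bob)
from_t2 = candidate_diseases(t2_star, *bob)
inferred = from_t1 & from_t2
print(from_t1, from_t2, inferred)
```

Each release alone leaves two candidates, but only dyspepsia is consistent with both.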

31 m-invariance [SIGMOD07]
A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if
–T*(1), …, T*(n) are m-unique, and
–each individual has the same signature in every generalized table in which s/he is involved.
Explanation
–m-unique: every QI group contains at least m tuples, all with different sensitive values
–signature: the set of sensitive values in the individual’s QI group.

32 m-unique
A generalized table T*(j) is m-unique, if and only if
–each QI-group in T*(j) contains at least m tuples
–all tuples in the same QI-group have different sensitive values.
A 2-unique generalized table:
G. ID  Age       Zipcode     Disease
1      [21, 22]  [12k, 14k]  dyspepsia
1      [21, 22]  [12k, 14k]  bronchitis
2      [23, 24]  [18k, 25k]  flu
2      [23, 24]  [18k, 25k]  gastritis
3      [36, 41]  [20k, 27k]  flu
3      [36, 41]  [20k, 27k]  gastritis
4      [37, 43]  [26k, 35k]  dyspepsia
4      [37, 43]  [26k, 35k]  flu
4      [37, 43]  [26k, 35k]  gastritis
5      [52, 56]  [33k, 34k]  dyspepsia
5      [52, 56]  [33k, 34k]  gastritis
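The two conditions of m-uniqueness can be checked per QI group as sketched below. Only the group ID and sensitive value of each row are kept, and the name `is_m_unique` is an assumption for illustration.

```python
from collections import defaultdict

# (group ID, disease) pairs from the 2-unique table on this slide.
t1_star = [
    (1, "dyspepsia"), (1, "bronchitis"),
    (2, "flu"), (2, "gastritis"),
    (3, "flu"), (3, "gastritis"),
    (4, "dyspepsia"), (4, "flu"), (4, "gastritis"),
    (5, "dyspepsia"), (5, "gastritis"),
]


def is_m_unique(table, m):
    """True if every QI group has at least m tuples and no repeated sensitive value."""
    groups = defaultdict(list)
    for gid, disease in table:
        groups[gid].append(disease)
    return all(
        len(values) >= m and len(set(values)) == len(values)
        for values in groups.values()
    )


unique_2 = is_m_unique(t1_star, 2)
unique_3 = is_m_unique(t1_star, 3)
print(unique_2, unique_3)
```

The table is 2-unique but not 3-unique, because four of its five groups contain only two tuples.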

33 Signature
The signature of Bob in T*(1) is {dyspepsia, bronchitis}
The signature of Jane in T*(1) is {dyspepsia, flu, gastritis}
T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
…
Jane   4     [37, 43]  [26k, 35k]  dyspepsia
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis
…

34 The m-invariance principle Lemma: if a sequence of generalized tables {T*(1), …, T*(n)} is m-invariant, then for any individual o involved in any of these tables, we have risk(o) <= 1/m

35 The m-invariance principle
Lemma: let {T*(1), …, T*(n-1)} be m-invariant. {T*(1), …, T*(n-1), T*(n)} is also m-invariant, if and only if {T*(n-1), T*(n)} is m-invariant.
Only T*(n-1) is needed for the generation of T*(n); T*(1), T*(2), …, T*(n-2) can be discarded.

36 Solution idea
Goal: Given T(n) and T*(n-1), create T*(n) such that {T*(n-1), T*(n)} is m-invariant.
Idea: create counterfeits.
Optimization goal: to impose as little generalization as possible.

37 Microdata T(2):
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
David  23   25000    gastritis
Emily  25   21000    flu
Jane   37   33000    dyspepsia
Linda  43   26000    gastritis
Gary   41   20000    flu
Mary   46   30000    gastritis
Ray    54   31000    dyspepsia
Steve  56   34000    gastritis
Tom    60   44000    gastritis
Vince  65   36000    flu
Counterfeited generalization T*(2):
Name   Group-ID  Age       Zipcode     Disease
Bob    1         [21, 22]  [12k, 14k]  dyspepsia
c1     1         [21, 22]  [12k, 14k]  bronchitis
David  2         [23, 25]  [21k, 25k]  gastritis
Emily  2         [23, 25]  [21k, 25k]  flu
Jane   3         [37, 43]  [26k, 33k]  dyspepsia
c2     3         [37, 43]  [26k, 33k]  flu
Linda  3         [37, 43]  [26k, 33k]  gastritis
Gary   4         [41, 46]  [20k, 30k]  flu
Mary   4         [41, 46]  [20k, 30k]  gastritis
Ray    5         [54, 56]  [31k, 34k]  dyspepsia
Steve  5         [54, 56]  [31k, 34k]  gastritis
Tom    6         [60, 65]  [36k, 44k]  gastritis
Vince  6         [60, 65]  [36k, 44k]  flu
The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1

38 The adversary knows:
Name  Age  Zipcode
Bob   21   12000
Generalization T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]  flu
Helen  3     [36, 41]  [20k, 27k]  gastritis
Jane   4     [37, 43]  [26k, 35k]  dyspepsia
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis
Paul   5     [52, 56]  [33k, 34k]  dyspepsia
Steve  5     [52, 56]  [33k, 34k]  gastritis
Counterfeited Generalization T*(2):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
c1     1     [21, 22]  [12k, 14k]  bronchitis
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]  dyspepsia
c2     3     [37, 43]  [26k, 33k]  flu
Linda  3     [37, 43]  [26k, 33k]  gastritis
Gary   4     [41, 46]  [20k, 30k]  flu
Mary   4     [41, 46]  [20k, 30k]  gastritis
Ray    5     [54, 56]  [31k, 34k]  dyspepsia
Steve  5     [54, 56]  [31k, 34k]  gastritis
Tom    6     [60, 65]  [36k, 44k]  gastritis
Vince  6     [60, 65]  [36k, 44k]  flu
The auxiliary relation R(2) for T*(2):
Group-ID  Count
1         1
3         1

39 A sequence of generalized tables T*(1), …, T*(n) is m-invariant, if and only if
–T*(1), …, T*(n) are m-unique, and
–each individual has the same signature in every generalized table s/he is involved.
Generalization T*(1):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
Alice  1     [21, 22]  [12k, 14k]  bronchitis
Andy   2     [23, 24]  [18k, 25k]  flu
David  2     [23, 24]  [18k, 25k]  gastritis
Gary   3     [36, 41]  [20k, 27k]  flu
Helen  3     [36, 41]  [20k, 27k]  gastritis
Jane   4     [37, 43]  [26k, 35k]  dyspepsia
Ken    4     [37, 43]  [26k, 35k]  flu
Linda  4     [37, 43]  [26k, 35k]  gastritis
Paul   5     [52, 56]  [33k, 34k]  dyspepsia
Steve  5     [52, 56]  [33k, 34k]  gastritis
Generalization T*(2):
Name   G.ID  Age       Zipcode     Disease
Bob    1     [21, 22]  [12k, 14k]  dyspepsia
c1     1     [21, 22]  [12k, 14k]  bronchitis
David  2     [23, 25]  [21k, 25k]  gastritis
Emily  2     [23, 25]  [21k, 25k]  flu
Jane   3     [37, 43]  [26k, 33k]  dyspepsia
c2     3     [37, 43]  [26k, 33k]  flu
Linda  3     [37, 43]  [26k, 33k]  gastritis
Gary   4     [41, 46]  [20k, 30k]  flu
Mary   4     [41, 46]  [20k, 30k]  gastritis
Ray    5     [54, 56]  [31k, 34k]  dyspepsia
Steve  5     [54, 56]  [31k, 34k]  gastritis
Tom    6     [60, 65]  [36k, 44k]  gastritis
Vince  6     [60, 65]  [36k, 44k]  flu
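The signature condition of m-invariance can be checked for a pair of consecutive releases as sketched below. This covers only the second condition, since m-uniqueness is a separate per-table check. The rows are reduced to (name, group ID, disease) from the publisher's view, `t2_plain` is the T*(2) grouping without counterfeits from the earlier multi-publishing slides, and the function names are illustrative.

```python
from collections import defaultdict


def signatures(table):
    """Map each individual to the set of sensitive values in their QI group."""
    groups = defaultdict(set)
    for _, gid, disease in table:
        groups[gid].add(disease)
    return {name: frozenset(groups[gid]) for name, gid, _ in table}


def same_signatures(prev, curr):
    """True if everyone present in both tables keeps the same signature."""
    s_prev, s_curr = signatures(prev), signatures(curr)
    return all(s_prev[n] == s_curr[n] for n in s_prev.keys() & s_curr.keys())


t1 = [
    ("Bob", 1, "dyspepsia"), ("Alice", 1, "bronchitis"),
    ("Andy", 2, "flu"), ("David", 2, "gastritis"),
    ("Gary", 3, "flu"), ("Helen", 3, "gastritis"),
    ("Jane", 4, "dyspepsia"), ("Ken", 4, "flu"), ("Linda", 4, "gastritis"),
    ("Paul", 5, "dyspepsia"), ("Steve", 5, "gastritis"),
]
# T*(2) with counterfeits c1 and c2.
t2_counterfeit = [
    ("Bob", 1, "dyspepsia"), ("c1", 1, "bronchitis"),
    ("David", 2, "gastritis"), ("Emily", 2, "flu"),
    ("Jane", 3, "dyspepsia"), ("c2", 3, "flu"), ("Linda", 3, "gastritis"),
    ("Gary", 4, "flu"), ("Mary", 4, "gastritis"),
    ("Ray", 5, "dyspepsia"), ("Steve", 5, "gastritis"),
    ("Tom", 6, "gastritis"), ("Vince", 6, "flu"),
]
# T*(2) without counterfeits.
t2_plain = [
    ("Bob", 1, "dyspepsia"), ("David", 1, "gastritis"),
    ("Emily", 2, "flu"), ("Jane", 2, "dyspepsia"), ("Linda", 2, "gastritis"),
    ("Gary", 3, "flu"), ("Mary", 3, "gastritis"),
    ("Ray", 4, "dyspepsia"), ("Steve", 4, "gastritis"),
    ("Tom", 5, "gastritis"), ("Vince", 5, "flu"),
]

with_counterfeits = same_signatures(t1, t2_counterfeit)
without_counterfeits = same_signatures(t1, t2_plain)
print(with_counterfeits, without_counterfeits)
```

With the counterfeits in place, Bob's signature stays {dyspepsia, bronchitis} across both releases. Without them it changes to {dyspepsia, gastritis}, which is what made the earlier intersection attack possible.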


42 In case of corruption…
If an adversary knows from Alice that she has bronchitis, he can conclude that Bob has dyspepsia.
Microdata:
Name   Age  Zipcode  Disease
Bob    21   12000    dyspepsia
Alice  22   14000    bronchitis
Andy   24   18000    flu
David  23   25000    gastritis
Gary   41   20000    flu
Helen  36   27000    gastritis
Jane   37   33000    dyspepsia
Ken    40   35000    flu
Linda  43   26000    gastritis
Paul   52   33000    dyspepsia
Steve  56   34000    gastritis
2-diverse Generalization:
G. ID  Age       Zipcode     Disease
1      [21, 22]  [12k, 14k]  dyspepsia
1      [21, 22]  [12k, 14k]  bronchitis
2      [23, 24]  [18k, 25k]  flu
2      [23, 24]  [18k, 25k]  gastritis
3      [36, 41]  [20k, 27k]  flu
3      [36, 41]  [20k, 27k]  gastritis
4      [37, 43]  [26k, 35k]  dyspepsia
4      [37, 43]  [26k, 35k]  flu
4      [37, 43]  [26k, 35k]  gastritis
5      [52, 56]  [33k, 34k]  dyspepsia
5      [52, 56]  [33k, 34k]  gastritis
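The corruption scenario can be sketched as removing known values from a QI group's multiset of sensitive values. The sketch assumes the adversary knows that Bob and Alice both fall in group 1; the function name `remaining_values` is hypothetical.

```python
from collections import Counter

# Sensitive values published for QI group 1 of the 2-diverse generalization.
group_1 = ["dyspepsia", "bronchitis"]

# Background knowledge obtained by corrupting another member of the group.
known = {"Alice": "bronchitis"}


def remaining_values(group_values, known_members):
    """Sensitive values still unexplained after removing the known members' values."""
    remaining = Counter(group_values)
    for value in known_members.values():
        remaining[value] -= 1
    return +remaining  # unary plus drops entries whose count fell to zero


left_for_bob = remaining_values(group_1, known)
print(dict(left_for_bob))
```

With Alice's value accounted for, the only value left in the group is dyspepsia, so 2-diversity no longer protects Bob.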

43 Anti-corruption publishing [ICDE08]
We formalized anti-corruption publishing by modeling the degree of privacy preservation as a function of an adversary’s background knowledge.
We proposed a solution that integrates generalization with
–perturbation: switch selected records’ sensitive information.
–stratified sampling: sample some records from each QI group.

44 Summary
Introduced the problem of privacy-preserving publishing.
Two principles:
–k-anonymity
–l-diversity
Two extensions:
–multi-publishing
–corruption

