Achieving Anonymity via Clustering. G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, A. Zhu. Dilys Thomas, PODS 2006.


1 Achieving Anonymity via Clustering
   G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, A. Zhu
   Dilys Thomas, PODS 2006

2 Talk outline
   - k-Anonymity model
   - Achieving Anonymity via Clustering
   - r-Gather clustering
   - Cellular clustering
   - Future Work

3 Medical Records
   Identifying: SSN, Name. Sensitive: Disease.

   SSN  Name   DOB       Race   Zip code  Disease
   614  Sara   03/04/76  Cauc   94305     Flu
   615  Joan   07/11/80  Cauc   94307     Cold
   629  Kelly  05/09/55  Cauc   94301     Diabetes
   710  Mike   11/23/62  Afr-A  94305     Flu
   840  Carl   11/23/62  Afr-A  94059     Arthritis
   780  Joe    01/07/50  Hisp   94042     Heart problem
   619  Rob    04/08/43  Hisp   94042     Arthritis

4 De-identified Medical Records
   Sensitive: Disease.

   Age       Race   Zip code  Disease
   03/04/76  Cauc   94305     Flu
   07/11/80  Cauc   94307     Cold
   05/09/55  Cauc   94301     Diabetes
   11/23/62  Afr-A  94305     Flu
   11/23/62  Afr-A  94059     Arthritis
   01/07/50  Hisp   94042     Heart problem
   04/08/43  Hisp   94042     Arthritis

5 k-Anonymity model
   Sensitive: Disease.

   DOB       Race   Zip code  Disease
   03/04/76  Cauc   94305     Flu
   07/11/80  Cauc   94307     Cold
   05/09/55  Cauc   94301     Diabetes
   12/30/72  Afr-A  94305     Flu
   11/23/62  Afr-A  94059     Arthritis
   01/07/50  Hisp   94042     Heart problem
   04/08/43  Hisp   94042     Arthritis

   Quasi-identifiers: approximate foreign keys
   Uniquely identify you!

6 k-Anonymity Model [Swe00]
   - Suppress some entries of quasi-identifiers
     - each modified row becomes identical to at least k-1 other rows with respect to quasi-identifiers
   - Individual records hidden in a crowd of size k

7 2-Anonymized Table

   DOB       Race   Zip code  Disease
   *         Cauc   *         Flu
   *         Cauc   *         Cold
   *         Cauc   *         Diabetes
   11/23/62  Afr-A  *         Flu
   11/23/62  Afr-A  *         Arthritis
   *         Hisp   94042     Heart problem
   *         Hisp   94042     Arthritis
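
The k-anonymity condition can be checked mechanically on the 2-anonymized table above. The following is a minimal sketch, not code from the talk: the function name `is_k_anonymous` and the dictionary field names are assumptions made for illustration, and a suppressed entry `*` is simply treated as an ordinary value.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k rows."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(n >= k for n in counts.values())

# Quasi-identifier columns of the table (field names are illustrative).
QI = ("dob", "race", "zip")

# Rows of the 2-anonymized table; "*" marks a suppressed entry.
anonymized = [
    {"dob": "*",        "race": "Cauc",  "zip": "*",     "disease": "Flu"},
    {"dob": "*",        "race": "Cauc",  "zip": "*",     "disease": "Cold"},
    {"dob": "*",        "race": "Cauc",  "zip": "*",     "disease": "Diabetes"},
    {"dob": "11/23/62", "race": "Afr-A", "zip": "*",     "disease": "Flu"},
    {"dob": "11/23/62", "race": "Afr-A", "zip": "*",     "disease": "Arthritis"},
    {"dob": "*",        "race": "Hisp",  "zip": "94042", "disease": "Heart problem"},
    {"dob": "*",        "race": "Hisp",  "zip": "94042", "disease": "Arthritis"},
]

print(is_k_anonymous(anonymized, QI, 2))  # True: groups of size 3, 2 and 2
print(is_k_anonymous(anonymized, QI, 3))  # False: two groups have only 2 rows
```

The sensitive column (Disease) is deliberately left out of the grouping, since k-anonymity constrains only the quasi-identifiers.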

8 k-Anonymity Optimization
   - Minimize the number of generalizations/suppressions to achieve k-Anonymity
   - NP-hard to find the minimum number of suppressions/generalizations [MW04]
   - O(k) approximation for k-anonymity [AFK+05]
   - Ω(k) lower bound on approximation ratio with graph assumption

9 Talk outline
   - k-Anonymity model
   - Achieving Anonymity via Clustering
   - r-Gather clustering
   - Cellular Clustering
   - Future Work

10 Original Table

           Age  Salary
   Amy     25   50
   Brian   27   60
   Carol   29   100
   David   35   110
   Evelyn  39   120

11 2-Anonymity with Suppression

           Age  Salary
   Amy     *    *
   Brian   *    *
   Carol   *    *
   David   *    *
   Evelyn  *    *

   All attributes suppressed

12 Original Table

           Age  Salary
   Amy     25   50
   Brian   27   60
   Carol   29   100
   David   35   110
   Evelyn  39   120

13 2-Anonymity with Generalization

           Age    Salary
   Amy     20-30  50-100
   Brian   20-30  50-100
   Carol   20-30  50-100
   David   30-40  100-150
   Evelyn  30-40  100-150

   Generalization allows pre-specified ranges

14 Original Table

           Age  Salary
   Amy     25   50
   Brian   27   60
   Carol   29   100
   David   35   110
   Evelyn  39   120

15 2-Anonymity with Clustering

           Age      Salary
   Amy     [25-29]  [50-100]
   Brian   [25-29]  [50-100]
   Carol   [25-29]  [50-100]
   David   [35-39]  [110-120]
   Evelyn  [35-39]  [110-120]

   Cluster centers published:
   - Amy, Brian, Carol: Age 27 = (25+27+29)/3, Salary 70 = (50+60+100)/3
   - David, Evelyn: Age 37 = (35+39)/2, Salary 115 = (110+120)/2
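
The published cluster centers are coordinate-wise means of the cluster members. A short sketch reproducing the arithmetic on the slide; the helper name `cluster_center` is an assumption made for illustration.

```python
def cluster_center(points):
    """Coordinate-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# (age, salary) pairs from the original table, grouped as on the slide.
cluster_1 = [(25, 50), (27, 60), (29, 100)]   # Amy, Brian, Carol
cluster_2 = [(35, 110), (39, 120)]            # David, Evelyn

print(cluster_center(cluster_1))  # (27.0, 70.0)
print(cluster_center(cluster_2))  # (37.0, 115.0)
```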

16 Advantages of Clustering
   - Clustering introduces less distortion than suppressions/generalizations
   - Clustering allows constant-factor approximation algorithms

17 Quasi-Identifiers form a Metric Space
   - Convert quasi-identifiers into points in a metric space
   - Distance function, D, on points
     - D(X,X) = 0 (Reflexive)
     - D(X,Y) = D(Y,X) (Symmetric)
     - D(X,Z) <= D(X,Y) + D(Y,Z) (Triangle Inequality)

18 Metric Space
   - Converting (gender, zip code, DOB) into points in a metric space is not easy.
   - Define a distance function on each attribute. E.g. on Zip code:
     - D(Zip1, Zip2) = physical distance between locations Zip1 and Zip2.
   - Weight the attributes; the weighted sum of attribute distances gives a metric.
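
The weighted-sum construction can be sketched as follows. Everything concrete here is an assumption made for illustration and does not come from the talk: the two attributes (age and gender), their per-attribute distances, and the weights. The sketch then spot-checks the three properties listed for D on a few sample records.

```python
from itertools import product

def age_distance(a, b):
    """Assumed per-attribute distance on ages: absolute difference."""
    return abs(a - b)

def discrete_distance(a, b):
    """Assumed distance on a categorical attribute: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1

# One distance function and one non-negative weight per attribute (illustrative values).
DISTANCES = (age_distance, discrete_distance)
WEIGHTS = (1.0, 10.0)

def D(x, y):
    """Weighted sum of the per-attribute distances between records x and y."""
    return sum(w * d(a, b) for a, b, d, w in zip(x, y, DISTANCES, WEIGHTS))

def check_metric_axioms(points, dist):
    """Spot-check D(X,X)=0, symmetry and the triangle inequality on sample points."""
    for x, y, z in product(points, repeat=3):
        if dist(x, x) != 0 or dist(x, y) != dist(y, x):
            return False
        if dist(x, z) > dist(x, y) + dist(y, z):
            return False
    return True

records = [(25, "F"), (27, "F"), (35, "M")]   # hypothetical (age, gender) records
print(D(records[0], records[1]))              # 2.0
print(D(records[0], records[2]))              # 20.0
print(check_metric_axioms(records, D))        # True
```

A passing spot-check on a sample is not a proof; the properties hold in general here because each per-attribute distance satisfies them and the weights are non-negative.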

19 Clustering for Anonymity
   - Cluster quasi-identifiers so that each cluster has at least r members for anonymity.
   - Publish cluster centers for anonymity, with the number of points and the radius
   - Tight clusters => usefulness of data for mining
   - Large number of points per cluster => anonymity

20 Quasi-identifiers: Metric Space
   - Assume further that the distance metric has already been defined on quasi-identifiers

21 Talk outline
   - k-Anonymity model
   - Achieving Anonymity via Clustering
   - r-Gather clustering
   - Cellular Clustering
   - Future Work

22 r-Gather Clustering
   Example clusters (figure):
   - 10 points, radius 5
   - 20 points, radius 10
   - 50 points, radius 20
   Minimize the maximum radius: 20
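
The r-gather objective can be evaluated for a given clustering as sketched below. This is an illustrative evaluation of a fixed clustering, not the approximation algorithm from the talk; the function name `r_gather_cost`, the Euclidean distance and the sample points are all assumptions.

```python
import math

def r_gather_cost(clusters, r):
    """Maximum cluster radius for a clustering given as {center: [member points]}.

    Raises ValueError if some cluster has fewer than r members, since such a
    clustering is not a feasible r-gather solution.
    """
    max_radius = 0.0
    for center, members in clusters.items():
        if len(members) < r:
            raise ValueError(f"cluster at {center} has fewer than {r} members")
        radius = max(math.dist(center, p) for p in members)
        max_radius = max(max_radius, radius)
    return max_radius

# Hypothetical 2-D points; each center is itself one of the data points.
clusters = {
    (0, 0): [(0, 0), (3, 4)],
    (10, 0): [(10, 0), (10, 2), (12, 0)],
}
print(r_gather_cost(clusters, r=2))  # 5.0
```

With r = 3 the same clustering is rejected, because the first cluster has only two members.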

23 Results
   - 2-approximation for minimizing the maximum radius under the cluster size constraint
   - Matching lower bound of 2 for maximum radius minimization

24 r-Gather Clustering
   [Figure; surviving label: 2d]

25 Lower Bound: Reduction from 3-SAT
   [Figure: points X1^T, X1^F, X2^T, X2^F for the variables, groups of r-2 points, and a point C1 for an example clause on X1 and X2]
   - r-gather with radius 1 iff formula satisfiable
   - Else radius >= 2

26 Talk outline
   - k-Anonymity model
   - Achieving Anonymity via Clustering
   - r-Gather clustering
   - Cellular Clustering
   - Future Work

27 Cellular Clustering
   Example clusters (figure):
   - 10 points, radius 5
   - 20 points, radius 10
   - 50 points, radius 20

28 Cellular Clustering Metric
   Example clusters (figure):
   - 10 points, radius 5
   - 20 points, radius 10
   - 50 points, radius 20
   Cellular Clustering Metric: 10*5 + 20*10 + 50*20 = 50 + 200 + 1000 = 1250
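
The metric is the sum, over clusters, of the number of points times the cluster radius. A minimal sketch reproducing the figure from the slide and contrasting it with the r-gather objective on the same clusters; the function name `cellular_cost` is an assumption made for illustration.

```python
def cellular_cost(clusters):
    """Sum over clusters of (number of points) * (radius)."""
    return sum(size * radius for size, radius in clusters)

# (number of points, radius) for the three clusters on the slide.
clusters = [(10, 5), (20, 10), (50, 20)]

print(cellular_cost(clusters))                # 1250
print(max(radius for _, radius in clusters))  # 20, the r-gather objective
```

Unlike the maximum radius, this cost charges every point for the radius of its own cluster, so one large cluster does not hide the quality of the others.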

29 Cellular Clustering
   - Primal-dual 4-approximation algorithm for cellular clustering
   - Constant-factor approximation to minimum cluster size
     - Each cluster has at least r points

30 Cellular Clustering: Linear Program
   Minimize $\sum_c \left( \sum_i x_{ic} d_c + f_c y_c \right)$ (sum of cellular cost and facility cost)
   Subject to:
   - $\sum_c x_{ic} \ge 1$: each point belongs to a cluster
   - $x_{ic} \le y_c$: a cluster must be opened for a point to belong to it
   - $0 \le x_{ic} \le 1$: points belong to clusters positively
   - $0 \le y_c \le 1$: clusters are opened positively

31 Dual Program
   Maximize $\sum_i \alpha_i$
   Subject to:
   - $\sum_i \beta_{ic} \le f_c$ (1)
   - $\alpha_i - \beta_{ic} \le d_c$ (2)
   - $\alpha_i \ge 0$
   - $\beta_{ic} \ge 0$
   Overview of algorithm: first grow $\alpha_i$, keeping $\beta_{ic} = 0$, till (2) becomes tight; then grow $\beta_{ic}$ at the same rate till (1) becomes tight
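
The relationship between the linear program and its dual can be checked numerically on a toy instance. The sketch below is not the primal-dual algorithm from the talk. It only verifies that a hand-picked primal solution and a hand-picked dual solution are feasible and compares their values. The dual variables are written `alpha` and `beta` (an assumed naming, since the symbols did not survive in the transcript), and the instance itself (three points, two candidate clusters, the costs `d` and `f`) is hypothetical.

```python
from fractions import Fraction as F

points = ["p1", "p2", "p3"]
clusters = ["A", "B"]
d = {"A": F(1), "B": F(2)}   # hypothetical per-point cost d_c of each cluster
f = {"A": F(2), "B": F(1)}   # hypothetical facility cost f_c of each cluster

def primal_feasible(x, y):
    """Every point is covered, and 0 <= x_ic <= y_c <= 1."""
    covered = all(sum(x[i, c] for c in clusters) >= 1 for i in points)
    bounded = all(0 <= x[i, c] <= y[c] <= 1 for i in points for c in clusters)
    return covered and bounded

def primal_value(x, y):
    return sum(sum(x[i, c] * d[c] for i in points) + f[c] * y[c] for c in clusters)

def dual_feasible(alpha, beta):
    """Constraints (1) and (2) plus non-negativity."""
    c1 = all(sum(beta[i, c] for i in points) <= f[c] for c in clusters)
    c2 = all(alpha[i] - beta[i, c] <= d[c] for i in points for c in clusters)
    nonneg = all(v >= 0 for v in alpha.values()) and all(v >= 0 for v in beta.values())
    return c1 and c2 and nonneg

def dual_value(alpha):
    return sum(alpha.values())

# Primal: open cluster A only and assign every point to it.
y = {"A": F(1), "B": F(0)}
x = {(i, c): y[c] for i in points for c in clusters}

# Dual: each point pays 5/3, of which 2/3 goes toward opening A.
alpha = {i: F(5, 3) for i in points}
beta = {(i, c): (F(2, 3) if c == "A" else F(0)) for i in points for c in clusters}

print(primal_feasible(x, y), primal_value(x, y))      # True 5
print(dual_feasible(alpha, beta), dual_value(alpha))  # True 5
```

Because any feasible dual value is a lower bound on any feasible primal value, the two matching values show that both solutions are optimal for this toy instance. Exact fractions are used so the comparison is not affected by floating-point rounding.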

32 Future Work
   - Improve approximation ratio for Cellular Clustering
   - Improve running time. Presently r-gather is O(n^2), while cellular clustering is a linear program over n^2 variables.
     - Linear or even sub-linear time algorithms
   - Weaker guarantees on anonymity, e.g. at least k/2 points per cluster instead of k.

33 THANK YOU! QUESTIONS?

