1 Towards Achieving Anonymity (An Zhu)

2 Introduction
- Collect and analyze personal data
  - Infer trends and patterns
- Making the personal data “public”
  - Joining multiple sources
  - Third-party involvement
  - Privacy concerns
- Q: How to share such data?

3 Example: Medical Records

  SSN   Name   Age   Race    Zipcode   Disease
  614   Sara   31    Cauc    94305     Flu
  615   Joan   34    Cauc    94307     Cold
  629   Kelly  27    Cauc    94301     Diabetes
  710   Mike   41    Afr-A   94305     Flu
  840   Carl   41    Afr-A   94059     Arthritis
  780   Joe    65    Hisp    94042     Heart problem
  616   Rob    46    Hisp    94042     Arthritis

  (SSN and Name are identifiers; Disease is the sensitive attribute.)

4 De-identified Records

  Age   Race    Zipcode   Disease (sensitive)
  31    Cauc    94305     Flu
  34    Cauc    94307     Cold
  27    Cauc    94301     Diabetes
  41    Afr-A   94305     Flu
  41    Afr-A   94059     Arthritis
  65    Hisp    94042     Heart problem
  46    Hisp    94042     Arthritis

5 Not Sufficient! [Sweeney 00]

  Age   Race    Zipcode   Disease (sensitive)
  31    Cauc    94305     Flu
  34    Cauc    94307     Cold
  27    Cauc    94301     Diabetes
  41    Afr-A   94305     Flu
  41    Afr-A   94059     Arthritis
  65    Hisp    94042     Heart problem
  46    Hisp    94042     Arthritis

  Joined with a public database, these attributes become unique identifiers!

6 Not Sufficient! [Sweeney 00]

  Quasi-identifiers: Age, Race, Zipcode; sensitive attribute: Disease.

  Age   Race    Zipcode   Disease (sensitive)
  31    Cauc    94305     Flu
  34    Cauc    94307     Cold
  27    Cauc    94301     Diabetes
  41    Afr-A   94305     Flu
  41    Afr-A   94059     Arthritis
  65    Hisp    94042     Heart problem
  46    Hisp    94042     Arthritis

  Joined with a public database, the quasi-identifiers act as unique identifiers!

7 Anonymize the Quasi-Identifiers!

  Age   Race   Zipcode   Disease (sensitive)
  *     *      *         Flu
  *     *      *         Cold
  *     *      *         Diabetes
  *     *      *         Flu
  *     *      *         Arthritis
  *     *      *         Heart problem
  *     *      *         Arthritis

  With the quasi-identifiers suppressed, a public database no longer yields unique identifiers.

8 Q: How to share such data?
- Anonymize the quasi-identifiers
  - Suppress information
    - Privacy guarantee: anonymity
    - Quality: the amount of suppressed information
  - Clustering
    - Privacy guarantee: cluster size
    - Quality: various clustering measures

9 Q: How to share such data?
- Anonymize the quasi-identifiers
  - Suppress information
    - Privacy guarantee: anonymity
    - Quality: the amount of suppressed information
  - Clustering
    - Privacy guarantee: cluster size
    - Quality: various clustering measures

10 k-anonymized Table [Samarati 01]

  Age   Race    Zipcode   Disease (sensitive)
  31    Cauc    94305     Flu
  34    Cauc    94307     Cold
  27    Cauc    94301     Diabetes
  41    Afr-A   94305     Flu
  41    Afr-A   94059     Arthritis
  65    Hisp    94042     Heart problem
  46    Hisp    94042     Arthritis

11 k-anonymized Table [Samarati 01]
  Each row is identical to at least k-1 other rows:

  Age   Race    Zipcode   Disease (sensitive)
  *     Cauc    *         Flu
  *     Cauc    *         Cold
  *     Cauc    *         Diabetes
  41    Afr-A   *         Flu
  41    Afr-A   *         Arthritis
  *     Hisp    94042     Heart problem
  *     Hisp    94042     Arthritis

12 Definition: k-anonymity
- Input: a table consisting of n rows, each with m attributes (the quasi-identifiers)
- Output: suppress some entries so that each row is identical to at least k-1 other rows
- Objective: minimize the number of suppressed entries
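A minimal Python sketch of this definition (not from the talk; it assumes each row is a list of quasi-identifier strings and a suppressed entry is written "*"):

    from collections import Counter

    def suppression_cost(table):
        # objective: the number of suppressed ("*") entries
        return sum(entry == "*" for row in table for entry in row)

    def is_k_anonymous(table, k):
        # each row must be identical to at least k-1 other rows,
        # i.e. every distinct row pattern must occur at least k times
        counts = Counter(tuple(row) for row in table)
        return all(c >= k for c in counts.values())

    # the 2-anonymized table from slide 11 (quasi-identifier columns only)
    table = [
        ["*", "Cauc", "*"], ["*", "Cauc", "*"], ["*", "Cauc", "*"],
        ["41", "Afr-A", "*"], ["41", "Afr-A", "*"],
        ["*", "Hisp", "94042"], ["*", "Hisp", "94042"],
    ]
    print(is_k_anonymous(table, 2), suppression_cost(table))   # True 10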

13 Past Work and New Results
- [MW 04]
  - NP-hardness for a large alphabet
  - O(k log k)-approximation
- [AFKMPTZ 05]
  - NP-hardness even for a ternary alphabet
  - O(k)-approximation
  - 1.5-approximation for 2-anonymity
  - 2-approximation for 3-anonymity

14 Past Work and New Results
- [MW 04]
  - NP-hardness for a large alphabet
  - O(k log k)-approximation
- [AFKMPTZ 05]
  - NP-hardness even for a ternary alphabet
  - O(k)-approximation
  - 1.5-approximation for 2-anonymity
  - 2-approximation for 3-anonymity

15 Graph Representation
  Rows:  A: 001000   B: 100101   C: 010101
         D: 001000   E: 110111   F: 011011
  One vertex per row; w(e) = Hamming distance between the two rows.
  (Figure: the graph on A-F with edge weights.)
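A small sketch (mine, not the talk's) of this construction in Python, with one vertex per row and edge weights given by the Hamming distance:

    from itertools import combinations

    rows = {  # the six rows from the slide
        "A": "001000", "B": "100101", "C": "010101",
        "D": "001000", "E": "110111", "F": "011011",
    }

    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))

    weight = {frozenset(e): hamming(rows[e[0]], rows[e[1]])
              for e in combinations(sorted(rows), 2)}
    print(weight[frozenset("AD")])   # 0: A and D are identical rows
    print(weight[frozenset("BC")])   # 2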

16 Edge Selection I (k = 3)
  Each node selects its lightest-weight incident edge.
  (Figure: the rows A-F with each vertex's selected edge; A-D has weight 0.)

17 Edge Selection II (k = 3)
  For components with fewer than k vertices, add more edges.
  (Figure: the graph after the additional edges are selected.)
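A rough sketch of the two selection steps (my reading of these slides, not the authors' pseudocode; it reuses rows and hamming from the sketch above, assumes at least k rows in total, and for step II simply adds the lightest edge leaving each small component):

    def select_edges(nodes, weight, k):
        parent = {v: v for v in nodes}          # union-find over components
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        chosen = set()
        def add(u, v):
            chosen.add(frozenset((u, v)))
            parent[find(u)] = find(v)

        for v in nodes:                          # I: every vertex picks its lightest edge
            add(v, min((w for w in nodes if w != v), key=lambda w: weight(v, w)))

        while True:                              # II: grow components with < k vertices
            comps = {}
            for v in nodes:
                comps.setdefault(find(v), []).append(v)
            small = [c for c in comps.values() if len(c) < k]
            if not small:
                return chosen
            inside = set(small[0])
            u, w = min(((u, w) for u in inside for w in nodes if w not in inside),
                       key=lambda e: weight(*e))
            add(u, w)

    forest = select_edges(sorted(rows), lambda u, v: hamming(rows[u], rows[v]), k=3)
    # for these rows step I already yields two components of size 3
    print(sorted(tuple(sorted(e)) for e in forest))
    # [('A', 'D'), ('A', 'F'), ('B', 'C'), ('B', 'E')]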

18 Lemma
- The total weight of the selected edges is no more than OPT
  - In the optimal solution, each vertex pays at least the weight of its (k-1)st lightest incident edge
  - The selection is a forest: at most one selected edge is charged to each vertex
  - By construction, each selected edge weighs no more than the (k-1)st lightest incident edge of the vertex that chose it

19 Grouping
- Ideally, each connected component forms a group
- Anonymize the vertices within a group
- Total cost of a group is at most (total edge weight) x (number of nodes)
  - In the example: (2+2+3+3) x 6
- Small groups, of size O(k), keep this blowup at O(k)

20 Dividing a Component
- Root the tree arbitrarily
- Divide whenever a sub-tree and the rest both have >= k vertices
- Aim: all remaining sub-trees have < k vertices
  (Figure: a rooted tree with sub-trees labeled >= k and < k.)

21 Dividing a Component
- Root the tree arbitrarily
- Divide whenever a sub-tree and the rest both have >= k vertices
- Rotate (re-root) the tree if necessary
  (Figure: the same tree after rotation.)

22 Dividing a Component
- Root the tree arbitrarily
- Divide whenever a sub-tree and the rest both have >= k vertices
- Termination condition: every resulting group has size at most max(2k-1, 3k-5)

23 An Example
  Rows:  A: 001000   B: 100101   C: 010101
         D: 001000   E: 110111   F: 011011
  (Figure: the selected tree on A-F, edge weights 0, 2, 2, 3, 3.)

24 An Example
  Rows:  A: 001000   B: 100101   C: 010101
         D: 001000   E: 110111   F: 011011
  (Figure: the tree rooted and redrawn.)

25 An Example
  Anonymized rows:  A: 0*10**   B: **01*1   C: **01*1
                    D: 0*10**   E: **01*1   F: 0*10**
  Estimated cost: 4*3 + 3*3 = 21 (edge weight x group size, per group)
  Optimal cost:   3*3 + 3*3 = 18

26 Past Work and New Results
- [MW 04]
  - NP-hardness for a large alphabet
  - O(k log k)-approximation
- [AFKMPTZ 05]
  - NP-hardness even for a ternary alphabet
  - O(k)-approximation
  - 1.5-approximation for 2-anonymity
  - 2-approximation for 3-anonymity

27 1.5-approximation
  Rows:  A: 001000   B: 000000   C: 111111
         D: 001000   E: 110111   F: 110111
  w(e) = Hamming distance between the two rows.
  (Figure: the graph on A-F with edge weights.)

28 Minimum {1,2}-matching
  Each vertex is matched to 1 or 2 other vertices.
  (Figure: a minimum-cost matching on the same rows; A-D and E-F are matched at cost 0, B and C join at cost 1 each.)

29 Properties
- Each component has at most 3 nodes
  - Components with more than 3 nodes are not optimal (they could be split), and higher degrees are not possible (every degree is <= 2)

30
- Cost <= 2 OPT (for a binary alphabet: <= 1.5 OPT)
- Qualities: a for a matched pair; p, q, r with r >= p, q for a matched triple
  - Pair: OPT pays 2a, we pay 2a
  - Triple: OPT pays p+q+r, we pay <= 3(p+q) <= 2(p+q+r)

31 Past Work and New Results
- [MW 04]
  - NP-hardness for a large alphabet
  - O(k log k)-approximation
- [AFKMPTZ 05]
  - NP-hardness even for a ternary alphabet
  - O(k)-approximation
  - 1.5-approximation for 2-anonymity
  - 2-approximation for 3-anonymity

32 Open Problems
- Can we improve O(k)?
  - Omega(k) lower bound for the graph representation

33 Open Problems
- Can we improve O(k)?
  - Omega(k) lower bound for the graph representation
  1111111100000000000000000000000000000000
  0000000011111111000000000000000000000000
  0000000000000000111111110000000000000000
  0000000000000000000000001111111100000000
  0000000000000000000000000000000011111111
  k = 5, d = 16, c = k * d / 2

34 Open Problems
- Can we improve O(k)?
  - Omega(k) lower bound for the graph representation
  1111111100000000000000000000000000000000
  0000000011111111000000000000000000000000
  0000000000000000111111110000000000000000
  0000000000000000000000001111111100000000
  0000000000000000000000000000000011111111
  k = 5, d = 16, c = k * d / 2

35 Open Problems
- Can we improve O(k)?
  - Omega(k) lower bound for the graph representation
  10101010101010101010101010101010
  11001100110011001100110011001100
  11110000111100001111000011110000
  11111111000000001111111100000000
  11111111111111110000000000000000
  k = 5, d = 16, c = 2 * d

36 Open Problems
- Can we improve O(k)?
  - Omega(k) lower bound for the graph representation
  10101010101010101010101010101010
  11001100110011001100110011001100
  11110000111100001111000011110000
  11111111000000001111111100000000
  11111111111111110000000000000000
  k = 5, d = 16, c = 2 * d

37 Q: How to share such data?
- Anonymize the quasi-identifiers
  - Suppress information
    - Privacy guarantee: anonymity
    - Quality: the amount of suppressed information
  - Clustering
    - Privacy guarantee: cluster size
    - Quality: various clustering measures

38 Clustering Approach [AFKKPTZ 06]

  Age   Race    Zipcode   Disease (sensitive)
  31    Cauc    94305     Flu
  34    Cauc    94307     Cold
  27    Cauc    94301     Diabetes
  41    Afr-A   94305     Flu
  41    Afr-A   94059     Arthritis
  65    Hisp    94042     Heart problem
  46    Hisp    94042     Arthritis

39 Transfers into a Metric…
  (The table from slide 38, now viewed as points in a metric space over the quasi-identifiers.)

40 Clusters and Centers
  (Figure: the points from slide 38 grouped into clusters, each with a chosen center.)

41 Clusters and Centers

  Age   Race    Zipcode   Disease (sensitive)
  31    Cauc    94305     Flu
                          Cold
                          Diabetes
                          Flu
  41    Afr-A   94059     Arthritis
                          Heart problem
  46    Hisp    94042     Arthritis

  (Each cluster is published via its center's quasi-identifier values.)

42 Measure
- How good are the clusters? “Tight” clusters are better
- Minimize the max radius: Gather-k
- Minimize the distortion error: Cellular-k
  - Sum of (radius x num_nodes) over all clusters
  (Figure: an example clustering with its Gather-k and Cellular-k costs.)

43 Measure
- How good are the clusters? “Tight” clusters are better
- Minimize the max radius: Gather-k
- Minimize the distortion error: Cellular-k
  - Sum of (radius x num_nodes) over all clusters
- Handle outliers
- Constant-factor approximations!
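A short sketch (mine, not the paper's notation) of the two objectives, assuming a clustering is given as a list of (center, members) pairs and dist is the metric:

    def radius(center, members, dist):
        return max(dist(center, p) for p in members)

    def gather_k_cost(clusters, dist):
        # Gather-k: the maximum cluster radius
        return max(radius(c, pts, dist) for c, pts in clusters)

    def cellular_k_cost(clusters, dist):
        # Cellular-k: total distortion, sum of radius * cluster size
        return sum(radius(c, pts, dist) * len(pts) for c, pts in clusters)

    # toy example with points on a line
    d = lambda x, y: abs(x - y)
    clusters = [(0, [0, 1, 2]), (10, [9, 10, 12])]
    print(gather_k_cost(clusters, d))    # 2
    print(cellular_k_cost(clusters, d))  # 2*3 + 2*3 = 12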

44 Comparison (k = 5)
  Rows:  R1: 0111   R2: 1011   R3: 1101   R4: 1110   R5: 1111
- 5-anonymity: suppress all entries; more distortion
- Clustering: can pick R5 as the center; less distortion
- Distortion is directly related to the pairwise distances
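A tiny worked check of this comparison (mine, using Hamming distance on the rows above):

    R = ["0111", "1011", "1101", "1110", "1111"]   # R1 .. R5

    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))

    # 5-anonymity: all five rows must become identical, so every non-constant
    # column is suppressed in every row; here that is all 4 columns, 5*4 = 20 stars
    stars = sum(len({r[i] for r in R}) > 1 for i in range(4)) * len(R)

    # Clustering: pick R5 = "1111" as the center; every other row is at distance 1
    radius = max(hamming("1111", r) for r in R)
    print(stars, radius)   # 20 1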

45 Results [AFKKPTZ 06]
- Gather-k
  - Tight 2-approximation
  - Extension to outliers: 4-approximation
- Cellular-k
  - Primal-dual constant approximation
  - Extensions as well

46 Results [AFKKPTZ 06]
- Gather-k
  - Tight 2-approximation
  - Extension to outliers: 4-approximation
- Cellular-k
  - Primal-dual constant approximation
  - Extensions as well

47 2-approximation
- Assume an optimal radius value R
- Make sure each node has at least k-1 neighbors within distance 2R
  (Figure: a node A with circles of radius R and 2R.)

48 2-approximation
- Assume an optimal radius value R
- Make sure each node has at least k-1 neighbors within distance 2R
- Pick an arbitrary node as a center and remove all remaining nodes within distance 2R; repeat until all nodes are gone
- Make sure we can reassign the nodes to the selected centers
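A hedged Python sketch of the selection phase for one guessed value of R (my reading of the slide, not the paper's pseudocode; the final reassignment via a degree-constrained matching is only noted in a comment):

    def select_centers(points, dist, k, R):
        # every node needs at least k-1 other nodes within 2R
        # (equivalently, at least k points within 2R counting itself)
        for p in points:
            if sum(dist(p, q) <= 2 * R for q in points) < k:
                return None
        centers, remaining = [], list(points)
        while remaining:
            c = remaining[0]                       # arbitrary remaining node becomes a center
            centers.append(c)
            remaining = [q for q in remaining if dist(c, q) > 2 * R]
        # remaining step (not shown): reassign every point to some center within
        # distance 2R so that each center receives at least k points
        return centers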

49 Example: k = 5 (figure)

50 Optimal Solution (figure: the two optimal clusters, radius R)

51 Center Selection (figure)

52 (figure: node 1 is picked as a center)

53 (figure: the ball of radius 2R around center 1)

54 Center Selection (figure: nodes within 2R of center 1 are removed)

55 Center Selection (figure: node 2 is picked as the next center)

56 Center Selection (figure: the ball of radius 2R around center 2)

57 Reassignment (figure: centers 1 and 2 with the remaining nodes)

58 Degree Constrained Matching (figure: each center must be matched to at least k-1 nodes, every other node to exactly one center)

59 Actual Clustering (figure: the clusters around centers 1 and 2)

60 Optimal Clustering (figure: the optimal clusters, for comparison)

61 Our Guarantees
- Return clusters of radius no more than 2R
- If R is guessed correctly, then reassignment is possible
  - Each cluster has at least k nodes
- A binary search on the value of R suffices

62 Binary Search on R
- Assume an optimal radius value R
- Make sure each node has at least k-1 neighbors within distance 2R
- Pick an arbitrary node as a center and remove all remaining nodes within distance 2R; repeat until all nodes are gone
- Make sure we can reassign the nodes to the selected centers

63 Binary Search on R
- Assume an optimal radius value R
- Make sure each node has at least k-1 neighbors within distance 2R
  - Not necessary, but useful for quick pruning
- Pick an arbitrary node as a center and remove all remaining nodes within distance 2R; repeat until all nodes are gone
- Make sure we can reassign the nodes to the selected centers

64 Binary Search on R
- Assume an optimal radius value R
- Make sure each node has at least k-1 neighbors within distance 2R
  - Not necessary, but useful for quick pruning
- Pick an arbitrary node as a center and remove all remaining nodes within distance 2R; repeat until all nodes are gone
- Make sure we can reassign the nodes to the selected centers
  - If successful, R could be smaller
  - Otherwise, R should be larger
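A short sketch of the search itself (mine; it reuses select_centers from the sketch above as the feasibility test, whereas the full algorithm also requires the reassignment to succeed). Since the optimal radius is realized by some pairwise distance, we can binary-search over the sorted distances:

    from itertools import combinations

    def smallest_feasible_R(points, dist, k):
        candidates = sorted({dist(p, q) for p, q in combinations(points, 2)})
        lo, hi, best = 0, len(candidates) - 1, None
        while lo <= hi:
            mid = (lo + hi) // 2
            if select_centers(points, dist, k, candidates[mid]) is not None:
                best, hi = candidates[mid], mid - 1   # feasible: try a smaller R
            else:
                lo = mid + 1                          # infeasible: R must be larger
        return best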

65 Results [AFKKPTZ 06]
- Gather-k
  - Tight 2-approximation
  - Extension to outliers: 4-approximation
- Cellular-k
  - Primal-dual constant approximation
  - Extensions

66 Ignore the Cluster Size Constraint
- Similar to Facility Location
  - Sum of (radius x num_nodes) vs. sum of individual distances to the center
- Caveat
  - Assigning one distant node to an existing cluster increases the cost in proportion to the number of nodes in that cluster
  - Each cluster is a (center, radius) pair

67 Intermediate Step I
- Primal-dual constant approximation for sum of (radius x num_nodes)
  - No cluster size constraint
  - Arbitrary cluster setup cost
- What we want: sum of (radius x num_nodes)
  - Cluster size constraint
  - No cluster setup cost

68 Enforce Cluster Size
- Introduce an extra cluster setup cost
- The setup cost pays for k nodes to join a particular cluster, i.e., c_setup = k * r
- This at most doubles the cost of any size-constrained cluster solution
  - Each cluster's total cost is already at least k * r

69 Intermediate Step II
- Shared solution!
  - For each cluster with fewer than k nodes, additional nodes can join the cluster
  - At no additional cost: this is paid for by the cluster setup cost
  - Now nodes could be shared among multiple clusters
- Key: convert a “shared” solution into a disjoint solution

70 Attached Separation
- Process clusters starting from the smallest radius
- "Open" a cluster as long as there are enough nodes left
- The leftover points of unopened clusters "attach" to an intersecting smaller-radius (open) cluster
  (Figure: an open cluster with attached clusters around it.)

71 Regroup (k = 5)
- An open cluster has >= k nodes
- An attached cluster has < k nodes
- Group clusters together to create bigger ones
- Choose the "fat" cluster's center as the new center
  (Figure: attached clusters of sizes 3, 2, 4 around an open cluster of size 6.)

72 What About the Cluster Cost?
- These clusters intersect with the open cluster

73 What About the Cluster Cost?
- These clusters intersect with the open cluster
- The routing cost is only a constant blowup w.r.t. the fat cluster's radius

74 What About the Cluster Cost?
- These clusters intersect with the open cluster
- The routing cost is only a constant blowup w.r.t. the fat cluster's radius
- Need to make sure the merged cluster is of reasonable size

75 Recap
- Anonymize the quasi-identifiers
  - Suppress information
    - Privacy guarantee: anonymity
    - Quality: the amount of suppressed information
  - Clustering
    - Privacy guarantee: cluster size
    - Quality: various clustering measures

76 Thanks!

