1 Towards Achieving Anonymity (An Zhu)

2 Introduction
- Collect and analyze personal data
  - Infer trends and patterns
- Making the personal data “public”
  - Joining multiple sources
  - Third-party involvement
  - Privacy concerns
- Q: How to share such data?

3 Example: Medical Records

  SSN   Name   Age   Race    Zipcode   Disease
  614   Sara   31    Cauc    94305     Flu
  615   Joan   34    Cauc    94307     Cold
  629   Kelly  27    Cauc    94301     Diabetes
  710   Mike   41    Afr-A   94305     Flu
  840   Carl   41    Afr-A   94059     Arthritis
  780   Joe    65    Hisp    94042     Heart problem
  616   Rob    46    Hisp    94042     Arthritis

  (SSN and Name are identifiers; Disease is the sensitive attribute.)

4 De-identified Records

  Age   Race    Zipcode   Disease (sensitive)
  31    Cauc    94305     Flu
  34    Cauc    94307     Cold
  27    Cauc    94301     Diabetes
  41    Afr-A   94305     Flu
  41    Afr-A   94059     Arthritis
  65    Hisp    94042     Heart problem
  46    Hisp    94042     Arthritis

5 Not Sufficient! [Sweeney 00]

  Age   Race    Zipcode   Disease (sensitive)
  31    Cauc    94305     Flu
  34    Cauc    94307     Cold
  27    Cauc    94301     Diabetes
  41    Afr-A   94305     Flu
  41    Afr-A   94059     Arthritis
  65    Hisp    94042     Heart problem
  46    Hisp    94042     Arthritis

  Joined with a public database, these attributes become unique identifiers!

6 Not Sufficient! [Sweeney 00]

  Quasi-identifiers: Age, Race, Zipcode; sensitive attribute: Disease.

  Age   Race    Zipcode   Disease (sensitive)
  31    Cauc    94305     Flu
  34    Cauc    94307     Cold
  27    Cauc    94301     Diabetes
  41    Afr-A   94305     Flu
  41    Afr-A   94059     Arthritis
  65    Hisp    94042     Heart problem
  46    Hisp    94042     Arthritis

  Joined with a public database, the quasi-identifiers act as unique identifiers!

7 Anonymize the Quasi-Identifiers!

  Age   Race   Zipcode   Disease (sensitive)
  *     *      *         Flu
  *     *      *         Cold
  *     *      *         Diabetes
  *     *      *         Flu
  *     *      *         Arthritis
  *     *      *         Heart problem
  *     *      *         Arthritis

  With the quasi-identifiers suppressed, a public database no longer yields unique identifiers.

8 Q: How to share such data?
- Anonymize the quasi-identifiers
  - Suppress information
    - Privacy guarantee: anonymity
    - Quality: the amount of suppressed information
  - Clustering
    - Privacy guarantee: cluster size
    - Quality: various clustering measures

9 Q: How to share such data?
- Anonymize the quasi-identifiers
  - Suppress information
    - Privacy guarantee: anonymity
    - Quality: the amount of suppressed information
  - Clustering
    - Privacy guarantee: cluster size
    - Quality: various clustering measures

10 k-anonymized Table [Samarati 01]

  Age   Race    Zipcode   Disease (sensitive)
  31    Cauc    94305     Flu
  34    Cauc    94307     Cold
  27    Cauc    94301     Diabetes
  41    Afr-A   94305     Flu
  41    Afr-A   94059     Arthritis
  65    Hisp    94042     Heart problem
  46    Hisp    94042     Arthritis

11 k-anonymized Table [Samarati 01]
  Each row is identical to at least k-1 other rows:

  Age   Race    Zipcode   Disease (sensitive)
  *     Cauc    *         Flu
  *     Cauc    *         Cold
  *     Cauc    *         Diabetes
  41    Afr-A   *         Flu
  41    Afr-A   *         Arthritis
  *     Hisp    94042     Heart problem
  *     Hisp    94042     Arthritis

12 Definition: k-anonymity
- Input: a table consisting of n rows, each with m attributes (the quasi-identifiers)
- Output: suppress some entries so that each row is identical to at least k-1 other rows
- Objective: minimize the number of suppressed entries
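A minimal Python sketch of this definition (not from the talk; it assumes each row is a list of quasi-identifier strings and a suppressed entry is written "*"):

    from collections import Counter

    def suppression_cost(table):
        # objective: the number of suppressed ("*") entries
        return sum(entry == "*" for row in table for entry in row)

    def is_k_anonymous(table, k):
        # each row must be identical to at least k-1 other rows,
        # i.e. every distinct row pattern must occur at least k times
        counts = Counter(tuple(row) for row in table)
        return all(c >= k for c in counts.values())

    # the 2-anonymized table from slide 11 (quasi-identifier columns only)
    table = [
        ["*", "Cauc", "*"], ["*", "Cauc", "*"], ["*", "Cauc", "*"],
        ["41", "Afr-A", "*"], ["41", "Afr-A", "*"],
        ["*", "Hisp", "94042"], ["*", "Hisp", "94042"],
    ]
    print(is_k_anonymous(table, 2), suppression_cost(table))   # True 10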

13 Past Work and New Results
- [MW 04]
  - NP-hardness for a large alphabet
  - O(k log k)-approximation
- [AFKMPTZ 05]
  - NP-hardness even for a ternary alphabet
  - O(k)-approximation
  - 1.5-approximation for 2-anonymity
  - 2-approximation for 3-anonymity

14 Past Work and New Results
- [MW 04]
  - NP-hardness for a large alphabet
  - O(k log k)-approximation
- [AFKMPTZ 05]
  - NP-hardness even for a ternary alphabet
  - O(k)-approximation
  - 1.5-approximation for 2-anonymity
  - 2-approximation for 3-anonymity

15 Graph Representation
  Rows:  A: 001000   B: 100101   C: 010101
         D: 001000   E: 110111   F: 011011
  One vertex per row; w(e) = Hamming distance between the two rows.
  (Figure: the graph on A-F with edge weights.)
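A small sketch (mine, not the talk's) of this construction in Python, with one vertex per row and edge weights given by the Hamming distance:

    from itertools import combinations

    rows = {  # the six rows from the slide
        "A": "001000", "B": "100101", "C": "010101",
        "D": "001000", "E": "110111", "F": "011011",
    }

    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))

    weight = {frozenset(e): hamming(rows[e[0]], rows[e[1]])
              for e in combinations(sorted(rows), 2)}
    print(weight[frozenset("AD")])   # 0: A and D are identical rows
    print(weight[frozenset("BC")])   # 2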

16 Edge Selection I (k = 3)
  Each node selects its lightest-weight incident edge.
  (Figure: the rows A-F with each vertex's selected edge; A-D has weight 0.)

17 Edge Selection II (k = 3)
  For components with fewer than k vertices, add more edges.
  (Figure: the graph after the additional edges are selected.)
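A rough sketch of the two selection steps (my reading of these slides, not the authors' pseudocode; it reuses rows and hamming from the sketch above, assumes at least k rows in total, and for step II simply adds the lightest edge leaving each small component):

    def select_edges(nodes, weight, k):
        parent = {v: v for v in nodes}          # union-find over components
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        chosen = set()
        def add(u, v):
            chosen.add(frozenset((u, v)))
            parent[find(u)] = find(v)

        for v in nodes:                          # I: every vertex picks its lightest edge
            add(v, min((w for w in nodes if w != v), key=lambda w: weight(v, w)))

        while True:                              # II: grow components with < k vertices
            comps = {}
            for v in nodes:
                comps.setdefault(find(v), []).append(v)
            small = [c for c in comps.values() if len(c) < k]
            if not small:
                return chosen
            inside = set(small[0])
            u, w = min(((u, w) for u in inside for w in nodes if w not in inside),
                       key=lambda e: weight(*e))
            add(u, w)

    forest = select_edges(sorted(rows), lambda u, v: hamming(rows[u], rows[v]), k=3)
    # for these rows step I already yields two components of size 3
    print(sorted(tuple(sorted(e)) for e in forest))
    # [('A', 'D'), ('A', 'F'), ('B', 'C'), ('B', 'E')]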

18 Lemma
- The total weight of the selected edges is no more than OPT
  - In the optimal solution, each vertex pays at least the weight of its (k-1)st lightest incident edge
  - The selection is a forest: at most one selected edge is charged to each vertex
  - By construction, each selected edge weighs no more than the (k-1)st lightest incident edge of the vertex that chose it

19 Grouping
- Ideally, each connected component forms a group
- Anonymize the vertices within a group
- Total cost of a group is at most (total edge weight) x (number of nodes)
  - In the example: (2+2+3+3) x 6
- Small groups, of size O(k), keep this blowup at O(k)

20 Dividing a Component
- Root the tree arbitrarily
- Divide whenever a sub-tree and the rest both have >= k vertices
- Aim: all remaining sub-trees have < k vertices
  (Figure: a rooted tree with sub-trees labeled >= k and < k.)

21 Dividing a Component
- Root the tree arbitrarily
- Divide whenever a sub-tree and the rest both have >= k vertices
- Rotate (re-root) the tree if necessary
  (Figure: the same tree after rotation.)

22 Dividing a Component
- Root the tree arbitrarily
- Divide whenever a sub-tree and the rest both have >= k vertices
- Termination condition: every resulting group has size at most max(2k-1, 3k-5)

23 An Example
  Rows:  A: 001000   B: 100101   C: 010101
         D: 001000   E: 110111   F: 011011
  (Figure: the selected tree on A-F, edge weights 0, 2, 2, 3, 3.)

24 An Example
  Rows:  A: 001000   B: 100101   C: 010101
         D: 001000   E: 110111   F: 011011
  (Figure: the tree rooted and redrawn.)

25 An Example
  Anonymized rows:  A: 0*10**   B: **01*1   C: **01*1
                    D: 0*10**   E: **01*1   F: 0*10**
  Estimated cost: 4*3 + 3*3 = 21 (edge weight x group size, per group)
  Optimal cost:   3*3 + 3*3 = 18

26 Past Work and New Results
- [MW 04]
  - NP-hardness for a large alphabet
  - O(k log k)-approximation
- [AFKMPTZ 05]
  - NP-hardness even for a ternary alphabet
  - O(k)-approximation
  - 1.5-approximation for 2-anonymity
  - 2-approximation for 3-anonymity

27 1.5-approximation
  Rows:  A: 001000   B: 000000   C: 111111
         D: 001000   E: 110111   F: 110111
  w(e) = Hamming distance between the two rows.
  (Figure: the graph on A-F with edge weights.)

28 Minimum {1,2}-matching
  Each vertex is matched to 1 or 2 other vertices.
  (Figure: a minimum-cost matching on the same rows; A-D and E-F are matched at cost 0, B and C join at cost 1 each.)

29 Properties
- Each component has at most 3 nodes
  - Components with more than 3 nodes are not optimal (they could be split), and higher degrees are not possible (every degree is <= 2)

30
- Cost <= 2 OPT (for a binary alphabet: <= 1.5 OPT)
- Qualities: a for a matched pair; p, q, r with r >= p, q for a matched triple
  - Pair: OPT pays 2a, we pay 2a
  - Triple: OPT pays p+q+r, we pay <= 3(p+q) <= 2(p+q+r)

31 Past Work and New Results
- [MW 04]
  - NP-hardness for a large alphabet
  - O(k log k)-approximation
- [AFKMPTZ 05]
  - NP-hardness even for a ternary alphabet
  - O(k)-approximation
  - 1.5-approximation for 2-anonymity
  - 2-approximation for 3-anonymity

32 Open Problems
- Can we improve O(k)?
  - Omega(k) lower bound for the graph representation

33 Open Problems
- Can we improve O(k)?
  - Omega(k) lower bound for the graph representation
  1111111100000000000000000000000000000000
  0000000011111111000000000000000000000000
  0000000000000000111111110000000000000000
  0000000000000000000000001111111100000000
  0000000000000000000000000000000011111111
  k = 5, d = 16, c = k * d / 2

34 Open Problems
- Can we improve O(k)?
  - Omega(k) lower bound for the graph representation
  1111111100000000000000000000000000000000
  0000000011111111000000000000000000000000
  0000000000000000111111110000000000000000
  0000000000000000000000001111111100000000
  0000000000000000000000000000000011111111
  k = 5, d = 16, c = k * d / 2

35 Open Problems
- Can we improve O(k)?
  - Omega(k) lower bound for the graph representation
  10101010101010101010101010101010
  11001100110011001100110011001100
  11110000111100001111000011110000
  11111111000000001111111100000000
  11111111111111110000000000000000
  k = 5, d = 16, c = 2 * d

36 Open Problems
- Can we improve O(k)?
  - Omega(k) lower bound for the graph representation
  10101010101010101010101010101010
  11001100110011001100110011001100
  11110000111100001111000011110000
  11111111000000001111111100000000
  11111111111111110000000000000000
  k = 5, d = 16, c = 2 * d

37 Q: How to share such data?
- Anonymize the quasi-identifiers
  - Suppress information
    - Privacy guarantee: anonymity
    - Quality: the amount of suppressed information
  - Clustering
    - Privacy guarantee: cluster size
    - Quality: various clustering measures

38 Clustering Approach [AFKKPTZ 06]

  Age   Race    Zipcode   Disease (sensitive)
  31    Cauc    94305     Flu
  34    Cauc    94307     Cold
  27    Cauc    94301     Diabetes
  41    Afr-A   94305     Flu
  41    Afr-A   94059     Arthritis
  65    Hisp    94042     Heart problem
  46    Hisp    94042     Arthritis

39 Transfers into a Metric…
  (The table from slide 38, now viewed as points in a metric space over the quasi-identifiers.)

40 Clusters and Centers
  (Figure: the points from slide 38 grouped into clusters, each with a chosen center.)

41 Clusters and Centers

  Age   Race    Zipcode   Disease (sensitive)
  31    Cauc    94305     Flu
                          Cold
                          Diabetes
                          Flu
  41    Afr-A   94059     Arthritis
                          Heart problem
  46    Hisp    94042     Arthritis

  (Each cluster is published via its center's quasi-identifier values.)

42 Measure
- How good are the clusters? “Tight” clusters are better
- Minimize the max radius: Gather-k
- Minimize the distortion error: Cellular-k
  - Sum of (radius x num_nodes) over all clusters
  (Figure: an example clustering with its Gather-k and Cellular-k costs.)

43 Measure
- How good are the clusters? “Tight” clusters are better
- Minimize the max radius: Gather-k
- Minimize the distortion error: Cellular-k
  - Sum of (radius x num_nodes) over all clusters
- Handle outliers
- Constant-factor approximations!
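A short sketch (mine, not the paper's notation) of the two objectives, assuming a clustering is given as a list of (center, members) pairs and dist is the metric:

    def radius(center, members, dist):
        return max(dist(center, p) for p in members)

    def gather_k_cost(clusters, dist):
        # Gather-k: the maximum cluster radius
        return max(radius(c, pts, dist) for c, pts in clusters)

    def cellular_k_cost(clusters, dist):
        # Cellular-k: total distortion, sum of radius * cluster size
        return sum(radius(c, pts, dist) * len(pts) for c, pts in clusters)

    # toy example with points on a line
    d = lambda x, y: abs(x - y)
    clusters = [(0, [0, 1, 2]), (10, [9, 10, 12])]
    print(gather_k_cost(clusters, d))    # 2
    print(cellular_k_cost(clusters, d))  # 2*3 + 2*3 = 12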

44 Comparison (k = 5)
  Rows:  R1: 0111   R2: 1011   R3: 1101   R4: 1110   R5: 1111
- 5-anonymity: suppress all entries; more distortion
- Clustering: can pick R5 as the center; less distortion
- Distortion is directly related to the pairwise distances
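A tiny worked check of this comparison (mine, using Hamming distance on the rows above):

    R = ["0111", "1011", "1101", "1110", "1111"]   # R1 .. R5

    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))

    # 5-anonymity: all five rows must become identical, so every non-constant
    # column is suppressed in every row; here that is all 4 columns, 5*4 = 20 stars
    stars = sum(len({r[i] for r in R}) > 1 for i in range(4)) * len(R)

    # Clustering: pick R5 = "1111" as the center; every other row is at distance 1
    radius = max(hamming("1111", r) for r in R)
    print(stars, radius)   # 20 1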

45 Results [AFKKPTZ 06]
- Gather-k
  - Tight 2-approximation
  - Extension to outliers: 4-approximation
- Cellular-k
  - Primal-dual constant approximation
  - Extensions as well

46 Results [AFKKPTZ 06]
- Gather-k
  - Tight 2-approximation
  - Extension to outliers: 4-approximation
- Cellular-k
  - Primal-dual constant approximation
  - Extensions as well

47 2-approximation
- Assume an optimal radius value R
- Make sure each node has at least k-1 neighbors within distance 2R
  (Figure: a node A with circles of radius R and 2R.)

48 2-approximation
- Assume an optimal radius value R
- Make sure each node has at least k-1 neighbors within distance 2R
- Pick an arbitrary node as a center and remove all remaining nodes within distance 2R; repeat until all nodes are gone
- Make sure we can reassign the nodes to the selected centers
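A hedged Python sketch of the selection phase for one guessed value of R (my reading of the slide, not the paper's pseudocode; the final reassignment via a degree-constrained matching is only noted in a comment):

    def select_centers(points, dist, k, R):
        # every node needs at least k-1 other nodes within 2R
        # (equivalently, at least k points within 2R counting itself)
        for p in points:
            if sum(dist(p, q) <= 2 * R for q in points) < k:
                return None
        centers, remaining = [], list(points)
        while remaining:
            c = remaining[0]                       # arbitrary remaining node becomes a center
            centers.append(c)
            remaining = [q for q in remaining if dist(c, q) > 2 * R]
        # remaining step (not shown): reassign every point to some center within
        # distance 2R so that each center receives at least k points
        return centers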

49 Example: k = 5 (figure)

50 Optimal Solution (figure: the two optimal clusters, radius R)

51 Center Selection (figure)

52 (figure: node 1 is picked as a center)

53 (figure: the ball of radius 2R around center 1)

54 Center Selection (figure: nodes within 2R of center 1 are removed)

55 Center Selection (figure: node 2 is picked as the next center)

56 Center Selection (figure: the ball of radius 2R around center 2)

57 Reassignment (figure: centers 1 and 2 with the remaining nodes)

58 Degree Constrained Matching (figure: each center must be matched to at least k-1 nodes, every other node to exactly one center)

59 Actual Clustering (figure: the clusters around centers 1 and 2)

60 Optimal Clustering (figure: the optimal clusters, for comparison)

61 Our Guarantees
- Return clusters of radius no more than 2R
- If R is guessed correctly, then reassignment is possible
  - Each cluster has at least k nodes
- A binary search on the value of R suffices

62 Binary Search on R
- Assume an optimal radius value R
- Make sure each node has at least k-1 neighbors within distance 2R
- Pick an arbitrary node as a center and remove all remaining nodes within distance 2R; repeat until all nodes are gone
- Make sure we can reassign the nodes to the selected centers

63 Binary Search on R
- Assume an optimal radius value R
- Make sure each node has at least k-1 neighbors within distance 2R
  - Not necessary, but useful for quick pruning
- Pick an arbitrary node as a center and remove all remaining nodes within distance 2R; repeat until all nodes are gone
- Make sure we can reassign the nodes to the selected centers

64 Binary Search on R
- Assume an optimal radius value R
- Make sure each node has at least k-1 neighbors within distance 2R
  - Not necessary, but useful for quick pruning
- Pick an arbitrary node as a center and remove all remaining nodes within distance 2R; repeat until all nodes are gone
- Make sure we can reassign the nodes to the selected centers
  - If successful, R could be smaller
  - Otherwise, R should be larger
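A short sketch of the search itself (mine; it reuses select_centers from the sketch above as the feasibility test, whereas the full algorithm also requires the reassignment to succeed). Since the optimal radius is realized by some pairwise distance, we can binary-search over the sorted distances:

    from itertools import combinations

    def smallest_feasible_R(points, dist, k):
        candidates = sorted({dist(p, q) for p, q in combinations(points, 2)})
        lo, hi, best = 0, len(candidates) - 1, None
        while lo <= hi:
            mid = (lo + hi) // 2
            if select_centers(points, dist, k, candidates[mid]) is not None:
                best, hi = candidates[mid], mid - 1   # feasible: try a smaller R
            else:
                lo = mid + 1                          # infeasible: R must be larger
        return best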

65 Results [AFKKPTZ 06]
- Gather-k
  - Tight 2-approximation
  - Extension to outliers: 4-approximation
- Cellular-k
  - Primal-dual constant approximation
  - Extensions

66 Ignore the Cluster Size Constraint
- Similar to Facility Location
  - Sum of (radius x num_nodes) vs. sum of individual distances to the center
- Caveat
  - Assigning one distant node to an existing cluster increases the cost in proportion to the number of nodes in that cluster
  - Each cluster is a (center, radius) pair

67 Intermediate Step I
- Primal-dual constant approximation for sum of (radius x num_nodes)
  - No cluster size constraint
  - Arbitrary cluster setup cost
- What we want: sum of (radius x num_nodes)
  - Cluster size constraint
  - No cluster setup cost

68 Enforce Cluster Size
- Introduce an extra cluster setup cost
- The setup cost pays for k nodes to join a particular cluster, i.e., c_setup = k * r
- This at most doubles the cost of any size-constrained cluster solution
  - Each cluster's total cost is already at least k * r

69 Intermediate Step II
- Shared solution!
  - For each cluster with fewer than k nodes, additional nodes can join the cluster
  - At no additional cost: this is paid for by the cluster setup cost
  - Now nodes could be shared among multiple clusters
- Key: convert a “shared” solution into a disjoint solution

70 Attached Separation
- Process clusters starting from the smallest radius
- "Open" a cluster as long as there are enough nodes left
- The leftover points of unopened clusters "attach" to an intersecting smaller-radius (open) cluster
  (Figure: an open cluster with attached clusters around it.)

71 Regroup (k = 5)
- An open cluster has >= k nodes
- An attached cluster has < k nodes
- Group clusters together to create bigger ones
- Choose the "fat" cluster's center as the new center
  (Figure: attached clusters of sizes 3, 2, 4 around an open cluster of size 6.)

72 What About the Cluster Cost?
- These clusters intersect with the open cluster

73 What About the Cluster Cost?
- These clusters intersect with the open cluster
- The routing cost is only a constant blowup w.r.t. the fat cluster's radius

74 What About the Cluster Cost?
- These clusters intersect with the open cluster
- The routing cost is only a constant blowup w.r.t. the fat cluster's radius
- Need to make sure the merged cluster is of reasonable size

75 Recap
- Anonymize the quasi-identifiers
  - Suppress information
    - Privacy guarantee: anonymity
    - Quality: the amount of suppressed information
  - Clustering
    - Privacy guarantee: cluster size
    - Quality: various clustering measures

76 Thanks!

