
1 Anonymizing Tables for Privacy Protection
Gagan Aggarwal, Tomás Feder, Krishnaram Kenthapadi, Rajeev Motwani, Rina Panigrahy, Dilys Thomas, An Zhu
http://theory.stanford.edu/~rajeev/privacy.html

2 An example: Medical Records
Krishnaram Kenthapadi, PORTIA Workshop, 8 July 2004

Identifying attributes: SSN, Name. Sensitive attribute: Disease.

SSN   Name    Age   Race    Zipcode   Disease
614   Sara    31    Cauc    94305     Flu
615   Joan    34    Cauc    94307     Cold
629   Kelly   27    Cauc    94301     Diabetes
710   Mike    41    Afr-A   94305     Flu
840   Carl    41    Afr-A   94059     Arthritis
780   Joe     65    Hisp    94042     Heart problem
614   Rob     46    Hisp    94042     Arthritis

3 Medical Records: De-identify & Release

Sensitive attribute: Disease.

Age   Race    Zipcode   Disease
31    Cauc    94305     Flu
34    Cauc    94307     Cold
27    Cauc    94301     Diabetes
41    Afr-A   94305     Flu
41    Afr-A   94059     Arthritis
65    Hisp    94042     Heart problem
46    Hisp    94042     Arthritis

4 Not sufficient! [Swe02, SS98]

[Figure: a public database linked to the released table. Label: "Uniquely identify you!"]

Age   Race    Zipcode   Disease (sensitive)
31    Cauc    94305     Flu
34    Cauc    94307     Cold
27    Cauc    94301     Diabetes
41    Afr-A   94305     Flu
41    Afr-A   94059     Arthritis
65    Hisp    94042     Heart problem
46    Hisp    94042     Arthritis

• Quasi-identifiers: reveal less information
• k-anonymity model

5 k-anonymity – Problem Definition
• Input: A database consisting of n rows, each with m attributes drawn from a finite alphabet.
• Goal: Suppress some entries in the table so that each modified row becomes identical to at least k-1 other rows.
• The more entries are suppressed, the lower the utility of the modified table.
• Objective: Minimize the number of suppressed entries.
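
The definition above can be checked mechanically. The following is a minimal sketch, assuming rows are tuples and a suppressed entry is written as "*"; the names `is_k_anonymous` and `suppression_cost` are hypothetical helpers introduced here, not part of the talk.

```python
from collections import Counter

STAR = "*"  # assumed marker for a suppressed entry


def is_k_anonymous(rows, k):
    """Return True if every row is identical to at least k-1 other rows."""
    counts = Counter(tuple(r) for r in rows)
    return all(c >= k for c in counts.values())


def suppression_cost(rows):
    """Count the suppressed entries; this is the objective to minimize."""
    return sum(cell == STAR for r in rows for cell in r)


# Two rows from the medical-records example (Age, Race, Zipcode).
raw = [(41, "Afr-A", "94305"), (41, "Afr-A", "94059")]
# Suppressing the Zipcode in both rows makes them identical.
masked = [(41, "Afr-A", STAR), (41, "Afr-A", STAR)]

print(is_k_anonymous(raw, 2), is_k_anonymous(masked, 2), suppression_cost(masked))
```

The checker only verifies a given table; finding the suppression of minimum cost is the hard part of the problem.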

6 Medical Records: 2-anonymized table

Age   Race    Zipcode   Disease
*     Cauc    *         Flu
*     Cauc    *         Cold
*     Cauc    *         Diabetes
41    Afr-A   *         Flu
41    Afr-A   *         Arthritis
*     Hisp    94042     Heart problem
*     Hisp    94042     Arthritis

Suppress entries. Cost = 10
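
The cost of 10 can be reproduced from the original records. The sketch below assumes the rows are grouped as on the slide (three Cauc rows, two Afr-A rows, two Hisp rows) and suppresses every quasi-identifier column on which the rows of a group disagree. The helper names `suppress_group` and `anonymize` are hypothetical, and the sensitive Disease column is left out because it is never suppressed.

```python
STAR = "*"  # assumed marker for a suppressed entry


def suppress_group(rows):
    """Suppress every column on which the rows of one group disagree."""
    agree = [len(set(col)) == 1 for col in zip(*rows)]
    return [tuple(v if ok else STAR for v, ok in zip(r, agree)) for r in rows]


def anonymize(rows, groups):
    """Apply suppress_group to each group of row indices."""
    out = [None] * len(rows)
    for g in groups:
        for i, masked in zip(g, suppress_group([rows[i] for i in g])):
            out[i] = masked
    return out


# Quasi-identifier columns (Age, Race, Zipcode) of the example table.
records = [
    (31, "Cauc", "94305"),
    (34, "Cauc", "94307"),
    (27, "Cauc", "94301"),
    (41, "Afr-A", "94305"),
    (41, "Afr-A", "94059"),
    (65, "Hisp", "94042"),
    (46, "Hisp", "94042"),
]
groups = [[0, 1, 2], [3, 4], [5, 6]]  # the grouping used on the slide

anonymized = anonymize(records, groups)
cost = sum(cell == STAR for row in anonymized for cell in row)
print(cost)  # 10
```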

7 k-anonymity – Results
• [MW04]:
  - NP-hardness for a linear size alphabet
  - O(k log k)-approximation algorithm
• This work:
  - NP-hardness (even for a ternary alphabet)
  - O(k)-approximation for k-anonymity
  - 1.5-approximation for 2-anonymity
  - 2-approximation for 3-anonymity

8 O(k)-approximation algorithm (for k = 3)
• Create a complete graph such that:
  - Each row vector in the table is a vertex.
  - The weight of an edge is the number of attributes on which the two rows differ (Hamming distance).

Row   Age   Race    Zipcode
r1    31    Cauc    94305
r2    34    Cauc    94307
r3    41    Afr-A   94305
r4    41    Afr-A   94059

Edge weights: (r1,r2) = 2, (r1,r3) = 2, (r3,r4) = 1, (r1,r4) = 3, (r2,r3) = 3, (r2,r4) = 3
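
As an illustration of the graph construction, the sketch below computes the Hamming distance for every pair of the four rows on the slide. The function names `hamming` and `distance_graph` are assumptions of this sketch, and rows are indexed from 0.

```python
from itertools import combinations


def hamming(a, b):
    """Number of attributes on which two rows differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_graph(rows):
    """Complete graph as a dict mapping each index pair (i, j) to its weight."""
    return {
        (i, j): hamming(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }


rows = [
    (31, "Cauc", "94305"),
    (34, "Cauc", "94307"),
    (41, "Afr-A", "94305"),
    (41, "Afr-A", "94059"),
]
edges = distance_graph(rows)
print(sorted(edges.values()))  # [1, 2, 2, 3, 3, 3]
```

The six weights match the ones drawn on the slide: one edge of weight 1, two of weight 2, and three of weight 3.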

9 O(k)-approximation algorithm (for k = 3)
• We create a forest as follows:
  - Each node picks its nearest neighbor and connects to it.
  - If the resulting graph has a component with only two nodes, connect this component to the second nearest neighbor of one of the two nodes.
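
The two steps can be sketched as follows for k = 3. This is an illustrative sketch rather than the authors' implementation: ties between equally near neighbors are broken by the lower row index, and for a two-node component the cheaper of the two possible second-nearest-neighbor edges is chosen. Both choices, and the name `build_forest`, are assumptions; the slide only says "one of the two nodes".

```python
def hamming(a, b):
    """Number of attributes on which two rows differ."""
    return sum(x != y for x, y in zip(a, b))


def build_forest(rows):
    """Return (edges, components) of the forest for k = 3."""
    n = len(rows)
    if n < 3:
        raise ValueError("need at least 3 rows for k = 3")
    parent = list(range(n))  # union-find over row indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def neighbours(u):
        # Other nodes, nearest first; ties broken by index (assumption).
        others = (v for v in range(n) if v != u)
        return sorted(others, key=lambda v: (hamming(rows[u], rows[v]), v))

    edges = set()

    def link(u, v):
        ru, rv = find(u), find(v)
        if ru != rv:  # skip the duplicate edge of a mutual pair
            parent[ru] = rv
            edges.add((min(u, v), max(u, v)))

    # Step 1: every node connects to its nearest neighbor.
    for u in range(n):
        link(u, neighbours(u)[0])

    # Step 2: attach each two-node component to a second nearest neighbor.
    for u in range(n):
        members = [v for v in range(n) if find(v) == find(u)]
        if len(members) == 2:
            cands = [
                (x, next(v for v in neighbours(x) if v not in members))
                for x in members
            ]
            x, y = min(cands, key=lambda e: hamming(rows[e[0]], rows[e[1]]))
            link(x, y)

    comps = {}
    for v in range(n):
        comps.setdefault(find(v), []).append(v)
    return sorted(edges), sorted(comps.values())


rows = [
    (31, "Cauc", "94305"), (34, "Cauc", "94307"), (27, "Cauc", "94301"),
    (41, "Afr-A", "94305"), (41, "Afr-A", "94059"),
    (65, "Hisp", "94042"), (46, "Hisp", "94042"),
]
edges, comps = build_forest(rows)
print(edges, comps)
```

On the seven medical records, step 1 leaves two components of size two, and step 2 merges both into the component of the first row, so the forest ends up as a single tree with seven nodes.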

10 An example graph
[Figure: weighted example graph. Legend: nearest-neighbor edges; other edges.]

11 The forest obtained
[Figure: the forest formed by the selected edges of the example graph.]

12 O(k)-approximation algorithm (for k = 3)
• The forest has:
  - Components of size at least 3.
  - A total edge cost no more than the cost of the optimal solution.
• In the optimal solution, each node has at least as many *s as its Hamming distance to its second nearest neighbor.
• Each node receives at most as many *s as the cost of the tree containing it.
• If any component has size greater than 5, break it into components of size at least 3 (at least k in general).

13 The final partition
[Figure: the forest of the example graph after splitting large components.]

14 Analysis of the algorithm
• Cluster the row vectors according to this partition.
• Cost incurred ≤ (size of largest part) × OPT ≤ 5 × OPT.
• For general k, the cost of this solution is within a factor of max{3k-5, 2k-1} of the cost of the optimal solution.
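
The bound for k = 3 follows by chaining the facts from the earlier slides. The derivation below is a sketch under assumed notation that does not appear in the talk: S_1, ..., S_t are the parts of the final partition, T_i is the set of forest edges connecting S_i (assumed edge-disjoint across parts), w(.) is total Hamming weight, and F is the forest.

```latex
\begin{align*}
\text{cost}
  &\le \sum_{i=1}^{t} |S_i| \, w(T_i)
     && \text{each row in } S_i \text{ gets at most } w(T_i) \text{ stars} \\
  &\le \Bigl( \max_{i} |S_i| \Bigr) \sum_{i=1}^{t} w(T_i) \\
  &\le 5 \, w(F)
     && \text{every part has size at most } 5 \\
  &\le 5 \cdot \mathrm{OPT}
     && \text{forest cost is at most } \mathrm{OPT}.
\end{align*}
```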

15 Better than O(k)-approximation?
• Not possible using only the graph representation:
  - It loses information about the structure of the problem.
• There exist two instances with:
  - The same underlying graph
  - k-anonymity costs differing by a factor of Ω(k)

16 Open problems
• Lower bounds on the approximation factor (without assuming the graph representation)
• Extend the k-anonymity model to account for changes in the database:
  - Handle inserts, deletes and updates

