Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong.


1 Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong

2 Centralized publication  Assume that a hospital wants to publish the following table, called the microdata.  The publication must preserve the privacy of patients.  Prevent an adversary from knowing who-contracted-what. Microdata

3 Centralized publication (cont.)  A simple solution: remove the column ‘Name’ before publishing.  This does not work, as the next slide shows.

4 Linking attacks  An adversary joins the published table with a voter registration list on their shared quasi-identifier (QI) attributes.
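A linking attack can be sketched as a join on the QI attributes. The minimal Python sketch below assumes the published table is the male part of the deck's microdata with the Name column removed, and uses the matching rows of the voter registration list shown later in the deck; the function name `linking_attack` is hypothetical.

```python
# Assumed data: rows taken from the microdata and voter list shown later in the deck.
published = [  # (Age, Sex, Zipcode, Disease), Name column already removed
    (4, "M", 12000, "gastric ulcer"),
    (5, "M", 14000, "dyspepsia"),
    (6, "M", 18000, "pneumonia"),
    (9, "M", 19000, "bronchitis"),
]
voters = [  # (Name, Age, Sex, Zipcode)
    ("Andy", 4, "M", 12000),
    ("Bill", 5, "M", 14000),
    ("Ken", 6, "M", 18000),
    ("Nash", 9, "M", 19000),
    ("Mike", 7, "M", 17000),
]

def linking_attack(published, voters):
    """Join the two tables on the QI attributes (Age, Sex, Zipcode)."""
    names_by_qi = {}
    for name, age, sex, zipcode in voters:
        names_by_qi.setdefault((age, sex, zipcode), []).append(name)
    revealed = {}
    for age, sex, zipcode, disease in published:
        names = names_by_qi.get((age, sex, zipcode), [])
        if len(names) == 1:  # a unique match re-identifies the patient
            revealed[names[0]] = disease
    return revealed

revealed = linking_attack(published, voters)
print(revealed)
```

Mike appears in the voter list but not in the published table, so he is not linked to any disease.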

5 These are real threats  Fact: 87% of Americans can be uniquely identified by {Zipcode, gender, date-of-birth}.  A famous experiment by Sweeney [International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002] found the medical record of an ex-governor of Massachusetts.

6 Objectives  Publish a distorted version of the dataset so that  [Privacy] the privacy of all individuals is “adequately” protected;  [Utility] the dataset is useful for analyzing the characteristics of the microdata.  Paradox: Privacy protection ↑, utility ↓.

7 Issues  Privacy principle  What is adequate privacy protection?  Distortion approach  How to achieve the privacy principle?  The literature has discussed other issues as well.  Complexities, improving the utility of the published data, etc.

8 Principle 1: k-anonymity  A 2-anonymous generalization with 4 QI groups, shown with its QI attributes and sensitive attribute next to a voter registration list. [Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002]
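The k-anonymity condition can be checked by counting the tuples in each QI group. The sketch below assumes the generalized table that appears on later slides of this deck (4 QI groups); the function name is hypothetical.

```python
# Assumed data: QI group (Age, Sex, Zipcode) -> diseases of the tuples in that group.
groups = {
    ("[1, 5]", "M", "[10001, 15000]"): ["gastric ulcer", "dyspepsia"],
    ("[6, 10]", "M", "[15001, 20000]"): ["pneumonia", "bronchitis"],
    ("[11, 20]", "F", "[20001, 25000]"): ["flu", "pneumonia"],
    ("[21, 60]", "F", "[30001, 60000]"): ["gastritis", "gastritis", "flu", "flu"],
}

def is_k_anonymous(groups, k):
    """True if every QI group contains at least k tuples."""
    return all(len(diseases) >= k for diseases in groups.values())

print(is_k_anonymous(groups, 2))
print(is_k_anonymous(groups, 3))
```

The table is 2-anonymous but not 3-anonymous, because three of the four groups hold only two tuples.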

9 Defects of k-anonymity  What is the disease of Joe? No “diversity” in this QI group. A voter registration list

10 Principle 2: l-diversity  Each QI group should have at least l “well-represented” sensitive values.  Different ways to interpret “well-represented”. [Machanavajjhala et al., ICDE, 2006]

11 Naive interpretation  Each QI-group has l different sensitive values.
A 2-diverse table
Age | Sex | Zipcode | Disease
[1, 5] | M | [10001, 15000] | gastric ulcer
[1, 5] | M | [10001, 15000] | dyspepsia
[6, 10] | M | [15001, 20000] | pneumonia
[6, 10] | M | [15001, 20000] | bronchitis
[11, 20] | F | [20001, 25000] | flu
[11, 20] | F | [20001, 25000] | pneumonia
[21, 60] | F | [30001, 60000] | gastritis
[21, 60] | F | [30001, 60000] | gastritis
[21, 60] | F | [30001, 60000] | flu
[21, 60] | F | [30001, 60000] | flu
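Under the naive interpretation, a table is l-diverse when every QI group holds at least l distinct sensitive values. A minimal sketch on the 2-diverse table of this slide (the function name is hypothetical):

```python
# QI group (Age, Sex, Zipcode) -> diseases of the tuples in that group.
groups = {
    ("[1, 5]", "M", "[10001, 15000]"): ["gastric ulcer", "dyspepsia"],
    ("[6, 10]", "M", "[15001, 20000]"): ["pneumonia", "bronchitis"],
    ("[11, 20]", "F", "[20001, 25000]"): ["flu", "pneumonia"],
    ("[21, 60]", "F", "[30001, 60000]"): ["gastritis", "gastritis", "flu", "flu"],
}

def is_naively_l_diverse(groups, l):
    """True if every QI group has at least l different sensitive values."""
    return all(len(set(diseases)) >= l for diseases in groups.values())

print(is_naively_l_diverse(groups, 2))
print(is_naively_l_diverse(groups, 3))
```

Note that the last group has four tuples but only two distinct diseases, so group size alone does not determine diversity.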

12 Defects of the naive interpretation  Assume that Joe is identified in the QI group. What is the probability that he contracted HIV?  Implication: The most frequent sensitive value in a QI group cannot be too frequent.  But accomplishing only this is still vulnerable to attacks with background knowledge. A QI group with 100 tuples 98 tuples

13 Background knowledge attack  Let Joe be an individual in the QI group having HIV.  A friend of Joe has the background knowledge: “Joe does not have pneumonia”.  How likely would this friend assume that Joe had HIV? A QI group with 100 tuples 50 tuples 49 tuples

14 Controlling also the 2nd most frequent value  Even if an adversary can eliminate pneumonia, s/he can only assume that Joe has HIV with 40 / 70 probability. A QI group with 100 tuples 40 tuples 30 tuples
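The three QI groups of slides 12 to 14 can be compared with a small calculation. The sketch assumes that the tuple counts shown in the figures belong to HIV (98, 50, 40) and pneumonia (49, 30), and that the remaining tuples carry other values; the name `posterior` is hypothetical.

```python
from fractions import Fraction

def posterior(group, target, ruled_out=()):
    """Probability of `target` after the values in `ruled_out` are eliminated."""
    remaining = {v: n for v, n in group.items() if v not in ruled_out}
    return Fraction(remaining[target], sum(remaining.values()))

# Assumed composition of the 100-tuple groups on slides 12, 13 and 14.
slide12 = {"HIV": 98, "other": 2}
slide13 = {"HIV": 50, "pneumonia": 49, "other": 1}
slide14 = {"HIV": 40, "pneumonia": 30, "other": 30}

p12 = posterior(slide12, "HIV")                 # no background knowledge
p13 = posterior(slide13, "HIV", {"pneumonia"})  # friend rules out pneumonia
p14 = posterior(slide14, "HIV", {"pneumonia"})  # same attack, flatter group
print(p12, p13, p14)
```

Under these assumptions the values are 98/100, 50/51 and 40/70; only the last group keeps the adversary's confidence moderate once pneumonia is ruled out.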

15 An example of 4-diversity A QI group The most frequent value The 2nd most frequent value The 3rd most frequent value The 4th most frequent value The other values

16 An example of 4-diversity (cont.) A QI group The most frequent value The other values Same cardinality

17 An example of 4-diversity (cont.)  Assume that Joe is a person in the QI group.  Property: If an adversary can eliminate at most 3 diseases, s/he can correctly guess the disease of Joe with at most 50% probability. A QI group HIV pneumonia bronchitis cancer The other values

18 l-diversity  Consider a QI group.  m is the number of distinct sensitive values in the group.  r_1 is the number of tuples having the most frequent sensitive value.  r_2 is the number of tuples having the 2nd most frequent sensitive value.  …  r_m is the number of tuples having the m-th most frequent sensitive value.  Then, r_1 ≤ c (r_l + … + r_m), where c is a constant.  If an adversary can eliminate only l – 1 sensitive values, s/he can infer the disease of a person with probability at most 1 / (c + 1).  More precisely, this is called (c, l)-diversity.
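The inequality on this slide can be tested mechanically. The sketch below assumes the comparison operator lost in the transcript is ≤, i.e. r_1 ≤ c (r_l + … + r_m), and it checks only that inequality; the example groups and the function name are hypothetical.

```python
from collections import Counter

def satisfies_c_l_diversity(sensitive_values, c, l):
    """Check r_1 <= c * (r_l + ... + r_m) for one QI group (operator assumed)."""
    r = sorted(Counter(sensitive_values).values(), reverse=True)
    # r[0] is r_1; r[l - 1:] is r_l, ..., r_m (empty when the group has < l values).
    return r[0] <= c * sum(r[l - 1:])

# Hypothetical 100-tuple groups.
skewed = ["HIV"] * 98 + ["flu"] * 2
flatter = ["HIV"] * 40 + ["pneumonia"] * 30 + ["flu"] * 30

print(satisfies_c_l_diversity(skewed, c=1, l=2))
print(satisfies_c_l_diversity(flatter, c=1, l=2))
print(satisfies_c_l_diversity(flatter, c=1, l=3))
```

A group with fewer than l distinct values fails automatically, because the right-hand side of the inequality is then zero.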

19 Defects of l-diversity  Andy does not want anyone to know that he had a stomach problem.  Sarah does not mind at all if others find out that she had flu.
A voter registration list
Name | Age | Sex | Zipcode
Andy | 4 | M | 12000
Bill | 5 | M | 14000
Ken | 6 | M | 18000
Nash | 9 | M | 19000
Mike | 7 | M | 17000
Alice | 12 | F | 22000
Betty | 19 | F | 24000
Linda | 21 | F | 33000
Jane | 25 | F | 34000
Sarah | 28 | F | 37000
Mary | 56 | F | 58000
A 2-diverse table
Age | Sex | Zipcode | Disease
[1, 5] | M | [10001, 15000] | gastric ulcer
[1, 5] | M | [10001, 15000] | dyspepsia
[6, 10] | M | [15001, 20000] | pneumonia
[6, 10] | M | [15001, 20000] | bronchitis
[11, 20] | F | [20001, 25000] | flu
[11, 20] | F | [20001, 25000] | pneumonia
[21, 60] | F | [30001, 60000] | gastritis
[21, 60] | F | [30001, 60000] | gastritis
[21, 60] | F | [30001, 60000] | flu
[21, 60] | F | [30001, 60000] | flu

20 Defects of l-diversity (cont.)  Does not work if an individual can have multiple tuples in the microdata.
Microdata
Name | Age | Sex | Zipcode | Disease
Andy | 4 | M | 12000 | gastric ulcer
Andy | 4 | M | 12000 | dyspepsia
Ken | 6 | M | 18000 | pneumonia
Nash | 9 | M | 19000 | bronchitis
Alice | 12 | F | 22000 | flu
Betty | 19 | F | 24000 | pneumonia
Linda | 21 | F | 33000 | gastritis
Jane | 25 | F | 34000 | gastritis
Sarah | 28 | F | 37000 | flu
Mary | 56 | F | 58000 | flu

21 Defects of l-diversity (cont.)
A voter registration list
Name | Age | Sex | Zipcode
Andy | 4 | M | 12000
Ken | 6 | M | 18000
Nash | 9 | M | 19000
Mike | 7 | M | 17000
Alice | 12 | F | 22000
Betty | 19 | F | 24000
Linda | 21 | F | 33000
Jane | 25 | F | 34000
Sarah | 28 | F | 37000
Mary | 56 | F | 58000
A 2-diverse table
Age | Sex | Zipcode | Disease
4 | M | 12000 | gastric ulcer
4 | M | 12000 | dyspepsia
[6, 10] | M | [15001, 20000] | pneumonia
[6, 10] | M | [15001, 20000] | bronchitis
[11, 20] | F | [20001, 25000] | flu
[11, 20] | F | [20001, 25000] | pneumonia
[21, 60] | F | [30001, 60000] | gastritis
[21, 60] | F | [30001, 60000] | gastritis
[21, 60] | F | [30001, 60000] | flu
[21, 60] | F | [30001, 60000] | flu

22 Principle 3: Personalized anonymity  Key ideas: Guarding node + sensitive attribute (SA) generalization  Assume a publicly-known hierarchy on the sensitive attribute. [Xiao and Tao, SIGMOD, 2006]

23 Guarding node  Andy does not want anyone to know that he had a stomach problem.  He can specify “stomach disease” as the guarding node for his tuple.  Protect Andy from being conjectured to have any disease in the subtree of the guarding node.
Name | Age | Sex | Zipcode | Disease | Guarding node
Andy | 4 | M | 12000 | gastric ulcer | stomach disease

24 Guarding node (cont.)  Sarah is willing to disclose her exact symptom.  She can specify Ø as the guarding node for her tuple.
Name | Age | Sex | Zipcode | Disease | Guarding node
Sarah | 28 | F | 37000 | flu | Ø

25 Guarding node (cont.)  Bill does not have any special preference.  He sets the guarding node of his tuple to be the same as his sensitive value.
Name | Age | Sex | Zipcode | Disease | Guarding node
Bill | 5 | M | 14000 | dyspepsia | dyspepsia

26 A personalized approach NameAgeSexZipcodeDiseaseguarding node Andy4M12000gastric ulcerstomach disease Bill5M14000dyspepsia Ken6M18000pneumoniarespiratory infection Nash9M19000bronchitis Alice12F22000flu Betty19F24000pneumonia Linda21F33000gastritis Jane25F34000gastritis Ø Sarah28F37000flu Ø Mary56F58000flu

27 Personalized anonymity  No adversary should be able to breach the privacy requirement of any guarding node with a probability above p_breach.  If p_breach = 0.3, then no adversary can have more than 30% probability to find out that:  Andy had a stomach disease  Bill had dyspepsia ……
Name | Age | Sex | Zipcode | Disease | Guarding node
Andy | 4 | M | 12000 | gastric ulcer | stomach disease
Bill | 5 | M | 14000 | dyspepsia | dyspepsia
Ken | 6 | M | 18000 | pneumonia | respiratory infection
Nash | 9 | M | 19000 | bronchitis |
Alice | 12 | F | 22000 | flu |
Betty | 19 | F | 24000 | pneumonia |
Linda | 21 | F | 33000 | gastritis |
Jane | 25 | F | 34000 | gastritis | Ø
Sarah | 28 | F | 37000 | flu | Ø
Mary | 56 | F | 58000 | flu |

28 Why SA generalization?  How many female patients are there with age above 30?  Estimate from the generalized table: 4 ∙ (60 – 30 + 1) / (60 – 21 + 1) = 3.1 ≈ 3  Real answer: 1
Pure QI generalization
Age | Sex | Zipcode | Disease
[1, 5] | M | [10001, 15000] | gastric ulcer
[1, 5] | M | [10001, 15000] | dyspepsia
[6, 10] | M | [15001, 20000] | pneumonia
[6, 10] | M | [15001, 20000] | bronchitis
[11, 20] | F | [20001, 25000] | flu
[11, 20] | F | [20001, 25000] | pneumonia
[21, 60] | F | [30001, 60000] | gastritis
[21, 60] | F | [30001, 60000] | gastritis
[21, 60] | F | [30001, 60000] | flu
[21, 60] | F | [30001, 60000] | flu
Microdata
Name | Age | Sex | Zipcode | Disease
Andy | 4 | M | 12000 | gastric ulcer
Bill | 5 | M | 14000 | dyspepsia
Ken | 6 | M | 18000 | pneumonia
Nash | 9 | M | 19000 | bronchitis
Alice | 12 | F | 22000 | flu
Betty | 19 | F | 24000 | pneumonia
Linda | 21 | F | 33000 | gastritis
Jane | 25 | F | 34000 | gastritis
Sarah | 28 | F | 37000 | flu
Mary | 56 | F | 58000 | flu
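The estimate on this slide follows from assuming that ages are spread uniformly over the generalized interval. A minimal sketch of that arithmetic, using the slide's own counting of the query range as 30 to 60 inclusive; the function name is hypothetical.

```python
def estimate_count(group_size, lo, hi, q_lo, q_hi):
    """Expected number of tuples in [q_lo, q_hi], assuming ages are uniform on [lo, hi]."""
    overlap = max(0, min(hi, q_hi) - max(lo, q_lo) + 1)
    return group_size * overlap / (hi - lo + 1)

# The 4 female tuples generalized to Age [21, 60]; query range counted as 30..60.
estimated = estimate_count(4, 21, 60, 30, 60)

# Exact answer from the microdata: ages of the female patients.
female_ages = [12, 19, 21, 25, 28, 56]
exact = sum(1 for age in female_ages if age > 30)

print(estimated, exact)
```

The estimate is about 3 while only one female patient (Mary, aged 56) is actually older than 30, which is the inaccuracy the slide attributes to pure QI generalization.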

29 SA generalization (cont.)
With SA generalization
Age | Sex | Zipcode | Disease
[1, 5] | M | [10001, 15000] | gastric ulcer
[1, 5] | M | [10001, 15000] | dyspepsia
[6, 10] | M | [15001, 20000] | pneumonia
[6, 10] | M | [15001, 20000] | bronchitis
[11, 20] | F | [20001, 25000] | flu
[11, 20] | F | [20001, 25000] | pneumonia
[21, 30] | F | [30001, 40000] | gastritis
[21, 30] | F | [30001, 40000] | gastritis
[21, 30] | F | [30001, 40000] | flu
56 | F | 58000 | respiratory infection
Pure QI generalization
Age | Sex | Zipcode | Disease
[1, 5] | M | [10001, 15000] | gastric ulcer
[1, 5] | M | [10001, 15000] | dyspepsia
[6, 10] | M | [15001, 20000] | pneumonia
[6, 10] | M | [15001, 20000] | bronchitis
[11, 20] | F | [20001, 25000] | flu
[11, 20] | F | [20001, 25000] | pneumonia
[21, 60] | F | [30001, 60000] | gastritis
[21, 60] | F | [30001, 60000] | gastritis
[21, 60] | F | [30001, 60000] | flu
[21, 60] | F | [30001, 60000] | flu

30 Evaluation of disclosure risk  What is the probability that the adversary can find out that “Andy had a stomach disease”?
A voter registration list
Name | Age | Sex | Zipcode
Andy | 4 | M | 12000
Bill | 5 | M | 14000
Ken | 6 | M | 18000
Nash | 9 | M | 19000
Mike | 7 | M | 17000
Alice | 12 | F | 22000
Betty | 19 | F | 24000
Linda | 21 | F | 33000
Jane | 25 | F | 34000
Sarah | 28 | F | 37000
Mary | 56 | F | 58000
The published data
Age | Sex | Zipcode | Disease
[1, 10] | M | [10001, 20000] | gastric ulcer
[1, 10] | M | [10001, 20000] | dyspepsia
[1, 10] | M | [10001, 20000] | pneumonia
[1, 10] | M | [10001, 20000] | bronchitis
[11, 20] | F | [20001, 25000] | flu
[11, 20] | F | [20001, 25000] | pneumonia
21 | F | 33000 | stomach disease
25 | F | 34000 | gastritis
28 | F | 37000 | flu
56 | F | 58000 | respiratory infection

31 Combinatorial reconstruction (cont.)  Can each individual appear more than once?  No = the primary case  Yes = the non-primary case  Some possible reconstructions: Andy Bill Ken Nash Mike gastric ulcer dyspepsia pneumonia bronchitis The primary case Andy Bill Ken Nash Mike gastric ulcer dyspepsia pneumonia bronchitis The non-primary case

32 Combinatorial reconstruction (cont.)  Can each individual appear more than once?  No = the primary case  Yes = the non-primary case  Some possible reconstructions: Andy Bill Ken Nash Mike gastric ulcer dyspepsia pneumonia bronchitis The primary case Andy Bill Ken Nash Mike gastric ulcer dyspepsia pneumonia bronchitis The non-primary case

33 Breach probability (primary)  There are 120 possible reconstructions in total.  If Andy is associated with a stomach disease in n_b reconstructions, the probability that the adversary associates Andy with some stomach problem is n_b / 120.  Andy is associated with  gastric ulcer in 24 reconstructions  dyspepsia in 24 reconstructions  gastritis in 0 reconstructions  n_b = 48  The breach probability for Andy’s tuple is 48 / 120 = 2 / 5. Andy Bill Ken Nash Mike gastric ulcer dyspepsia pneumonia bronchitis

34 Breach probability (non-primary)  There are 625 possible reconstructions in total.  Andy is associated with gastric ulcer or dyspepsia or gastritis in 225 reconstructions.  n_b = 225  The breach probability for Andy’s tuple is 225 / 625 = 9 / 25. Andy Bill Ken Nash Mike gastric ulcer dyspepsia pneumonia bronchitis
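Both breach probabilities can be reproduced by brute-force enumeration. The sketch below models a reconstruction as an assignment of an owner to each of the four published tuples, with distinct owners in the primary case and arbitrary owners in the non-primary case; this modelling and the function name are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import permutations, product

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}  # subtree of Andy's guarding node

def breach_stats(reconstructions, victim="Andy"):
    """Count reconstructions in which `victim` owns at least one stomach-disease tuple."""
    total = breaches = 0
    for owners in reconstructions:
        total += 1
        if any(o == victim and d in stomach for o, d in zip(owners, diseases)):
            breaches += 1
    return breaches, total, Fraction(breaches, total)

# Primary case: each individual owns at most one tuple.
primary = breach_stats(permutations(people, len(diseases)))
# Non-primary case: an individual may own several tuples.
non_primary = breach_stats(product(people, repeat=len(diseases)))

print(primary)
print(non_primary)
```

The enumeration yields 48 of 120 reconstructions (2/5) in the primary case and 225 of 625 (9/25) in the non-primary case, matching the slides.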

35 A defect of personalized anonymity  Does not guard against background knowledge.  Recall that l-diversity can achieve this purpose.  But it seems possible to adapt the personalized approach to tackle background knowledge.  Future work?

36 Other privacy principles  k-gather.  Due to [Aggarwal et al., PODS, 2006]  Suffers from the problems of k-anonymity.  (α, k)-anonymity.  Due to [Wong et al., KDD, 2006]  t-closeness.  Recently proposed by [Li and Li, ICDE, 2007]

37 Issues  Privacy principle  What is adequate privacy protection?  Distortion approach  How to achieve the privacy principle?

38 Three approaches  Suppression  We do not discuss it because the utility of the resulting table is low; it can be regarded as a special case of generalization.  Generalization  Due to [Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002]  Anatomy (also called “bucketization”)  Due to [Xiao and Tao, VLDB, 2006]  Each of the above approaches can be integrated with all the privacy principles discussed earlier.

39 A multidimensional view of generalization

40 Taxonomy of generalization  Local recoding  (Generalized) rectangles may overlap.  Suppression is a special case of local recoding.  Global recoding  All rectangles are disjoint. [LeFevre et al. SIGMOD, 2005]

41 Taxonomy of generalization (cont.)  Global recoding can be further divided.  Single-dimension recoding  Rectangles form a grid.  Multi-dimension recoding  The opposite of single-dimension recoding.

42 Taxonomy of generalization (cont.)  Single-dimension recoding can be further divided.  Full-domain recoding  Full-subtree recoding  Both assume a hierarchy on each QI attribute.  Example: A hierarchy on Age

43 Taxonomy of generalization (cont.)  Full-domain recoding  All age values must be generalized to the same level of the hierarchy.

44 Taxonomy of generalization (cont.)  Full-subtree recoding  The subtrees of all generalized values must be disjoint.  Permissible generalization: [1, 30], [31, 40], [41, 50], [51, 60], [61, 90].  Illegal generalization: [1, 10], [1, 30], [31, 60], [61, 90].
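The disjointness requirement of full-subtree recoding can be checked directly on the two examples from this slide. The sketch tests only that the generalized intervals do not overlap; it does not verify that each interval is a node of the Age hierarchy, which is not reproduced in this transcript. The function name is hypothetical.

```python
def subtrees_disjoint(intervals):
    """True if no two closed integer intervals (lo, hi) overlap."""
    ordered = sorted(intervals)
    return all(prev_hi < lo for (_, prev_hi), (lo, _) in zip(ordered, ordered[1:]))

permissible = [(1, 30), (31, 40), (41, 50), (51, 60), (61, 90)]
illegal = [(1, 10), (1, 30), (31, 60), (61, 90)]

print(subtrees_disjoint(permissible))
print(subtrees_disjoint(illegal))
```

The second generalization fails because [1, 10] lies inside [1, 30], so an age such as 5 could be published under two different generalized values.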

45 Why all these generalization types?  Reason 1: If a dataset is generalized in a more restricted manner, less preprocessing is required before it can be analyzed by a standard statistical tool (such as SAS).

46 Why all these generalization types?  Reason 2: More restrictive generalization is usually faster to compute and easier to analyze.

47 Why all these generalization types?  Reason 3: Less restrictive generalization promises more accurate data analysis, provided that a sophisticated analytical method is used.

48 Generalization algorithms  Operate on a quality metric. Examples:  The generalization level (for full-domain recoding)  Total rectangle size (for local recoding)  …  Mostly heuristics-based.  Finding the optimal generalization is often NP-hard.

49 Defect of generalization  Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]  Estimated answer: 2p, where p is the probability that each of the two pneumonia tuples satisfies the query conditions on Age and Zipcode.
Age | Sex | Zipcode | Disease
[21, 60] | M | [10001, 60000] | pneumonia
[21, 60] | M | [10001, 60000] | dyspepsia
[21, 60] | M | [10001, 60000] | dyspepsia
[21, 60] | M | [10001, 60000] | pneumonia
[61, 70] | F | [10001, 60000] | flu
[61, 70] | F | [10001, 60000] | gastritis
[61, 70] | F | [10001, 60000] | flu
[61, 70] | F | [10001, 60000] | bronchitis

50 Defect of generalization (cont.)  Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]  p = Area(R_1 ∩ Q) / Area(R_1) = 0.05, where R_1 is the rectangle of the first QI group and Q is the query rectangle.  Estimated answer for Query A: 2p = 0.1
Age | Sex | Zipcode | Disease
[21, 60] | M | [10001, 60000] | pneumonia
[21, 60] | M | [10001, 60000] | pneumonia

51 Defect of generalization (cont.)  Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]  Estimated answer = 0.1  The exact answer = 1
Name | Age | Sex | Zipcode | Disease
Bob | 23 | M | 11000 | pneumonia
Ken | 27 | M | 13000 | dyspepsia
Peter | 35 | M | 59000 | dyspepsia
Sam | 59 | M | 12000 | pneumonia
Jane | 61 | F | 54000 | flu
Linda | 65 | F | 25000 | gastritis
Alice | 65 | F | 25000 | flu
Mandy | 70 | F | 30000 | bronchitis
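The estimate of 0.1 and the exact answer of 1 can both be recomputed. The sketch assumes that the area of a rectangle is measured by counting integer Age and Zipcode values, which reproduces p = 0.05; the function name is hypothetical.

```python
def overlap_fraction(rect, query):
    """Fraction of the generalized rectangle covered by the query (integer domains assumed)."""
    fraction = 1.0
    for (lo, hi), (q_lo, q_hi) in zip(rect, query):
        common = max(0, min(hi, q_hi) - max(lo, q_lo) + 1)
        fraction *= common / (hi - lo + 1)
    return fraction

r1 = [(21, 60), (10001, 60000)]      # Age and Zipcode ranges of the first QI group
query = [(0, 30), (10001, 20000)]    # Age and Zipcode conditions of Query A

p = overlap_fraction(r1, query)      # (10 / 40) * (10000 / 50000)
estimated = 2 * p                    # two pneumonia tuples in the group

microdata = [
    (23, 11000, "pneumonia"), (27, 13000, "dyspepsia"),
    (35, 59000, "dyspepsia"), (59, 12000, "pneumonia"),
    (61, 54000, "flu"), (65, 25000, "gastritis"),
    (65, 25000, "flu"), (70, 30000, "bronchitis"),
]
exact = sum(1 for age, zipcode, disease in microdata
            if disease == "pneumonia" and 0 <= age <= 30 and 10001 <= zipcode <= 20000)

print(p, estimated, exact)
```

The estimate is a tenth of the true count because the uniform assumption ignores where the tuples actually sit inside the rectangle.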

52 Defect of generalization (cont.)  Cause of inaccuracy: QI distribution inside each QI group is lost!
Age | Sex | Zipcode | Disease
[21, 60] | M | [10001, 60000] | pneumonia
[21, 60] | M | [10001, 60000] | pneumonia

53 Anatomy  Releases a quasi-identifier table (QIT) and a sensitive table (ST).
Microdata
Age | Sex | Zipcode | Disease
23 | M | 11000 | pneumonia
27 | M | 13000 | dyspepsia
35 | M | 59000 | dyspepsia
59 | M | 12000 | pneumonia
61 | F | 54000 | flu
65 | F | 25000 | gastritis
65 | F | 25000 | flu
70 | F | 30000 | bronchitis
Quasi-identifier table (QIT)
Age | Sex | Zipcode | Group-ID
23 | M | 11000 | 1
27 | M | 13000 | 1
35 | M | 59000 | 1
59 | M | 12000 | 1
61 | F | 54000 | 2
65 | F | 25000 | 2
65 | F | 25000 | 2
70 | F | 30000 | 2
Sensitive table (ST)
Group-ID | Disease | Count
1 | dyspepsia | 2
1 | pneumonia | 2
2 | bronchitis | 1
2 | flu | 2
2 | gastritis | 1

54 Anatomy (cont.) 1. Decide an l-diverse partition of the tuples.
A 2-diverse partition
Age | Sex | Zipcode | Disease | QI group
23 | M | 11000 | pneumonia | 1
27 | M | 13000 | dyspepsia | 1
35 | M | 59000 | dyspepsia | 1
59 | M | 12000 | pneumonia | 1
61 | F | 54000 | flu | 2
65 | F | 25000 | gastritis | 2
65 | F | 25000 | flu | 2
70 | F | 30000 | bronchitis | 2

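55 Anatomy (cont.) 2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition.
Age | Sex | Zipcode | Disease | Group
23 | M | 11000 | pneumonia | group 1
27 | M | 13000 | dyspepsia | group 1
35 | M | 59000 | dyspepsia | group 1
59 | M | 12000 | pneumonia | group 1
61 | F | 54000 | flu | group 2
65 | F | 25000 | gastritis | group 2
65 | F | 25000 | flu | group 2
70 | F | 30000 | bronchitis | group 2
The Age, Sex and Zipcode columns form the quasi-identifier table (QIT); the Disease column forms the sensitive table (ST).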

56 Anatomy (cont.) 2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the decided partition.
Quasi-identifier table (QIT)
Age | Sex | Zipcode | Group-ID
23 | M | 11000 | 1
27 | M | 13000 | 1
35 | M | 59000 | 1
59 | M | 12000 | 1
61 | F | 54000 | 2
65 | F | 25000 | 2
65 | F | 25000 | 2
70 | F | 30000 | 2
Sensitive table (ST)
Group-ID | Disease
1 | pneumonia
1 | dyspepsia
1 | dyspepsia
1 | pneumonia
2 | flu
2 | gastritis
2 | flu
2 | bronchitis
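The two anatomy steps can be sketched in a few lines. The code takes the deck's microdata and its 2-diverse partition as given (it does not compute the partition) and produces the QIT and the aggregated ST with per-group counts; the function name `anatomize` is hypothetical.

```python
from collections import Counter

microdata = [  # (Age, Sex, Zipcode, Disease)
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis"),
]
partition = [[0, 1, 2, 3], [4, 5, 6, 7]]  # row indices of QI group 1 and QI group 2

def anatomize(microdata, partition):
    """Split the microdata into a QIT and an ST that share only the Group-ID."""
    qit, st = [], []
    for group_id, members in enumerate(partition, start=1):
        for i in members:
            age, sex, zipcode, _ = microdata[i]
            qit.append((age, sex, zipcode, group_id))
        counts = Counter(microdata[i][3] for i in members)
        st.extend((group_id, disease, n) for disease, n in sorted(counts.items()))
    return qit, st

qit, st = anatomize(microdata, partition)
for row in st:
    print(row)
```

The QI values are released exactly; the link between a QIT row and a disease exists only at the level of the group.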

57 Privacy preservation  Given a pair of QIT and ST generated from an l-diverse partition, an adversary can infer the sensitive value of each individual with confidence at most 1 / l.
Name | Age | Sex | Zipcode
Bob | 23 | M | 11000
Quasi-identifier table (QIT)
Age | Sex | Zipcode | Group-ID
23 | M | 11000 | 1
27 | M | 13000 | 1
35 | M | 59000 | 1
59 | M | 12000 | 1
61 | F | 54000 | 2
65 | F | 25000 | 2
65 | F | 25000 | 2
70 | F | 30000 | 2
Sensitive table (ST)
Group-ID | Disease | Count
1 | dyspepsia | 2
1 | pneumonia | 2
2 | bronchitis | 1
2 | flu | 2
2 | gastritis | 1

58 Accuracy of data analysis  Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]
Quasi-identifier table (QIT)
Age | Sex | Zipcode | Group-ID
23 | M | 11000 | 1
27 | M | 13000 | 1
35 | M | 59000 | 1
59 | M | 12000 | 1
61 | F | 54000 | 2
65 | F | 25000 | 2
65 | F | 25000 | 2
70 | F | 30000 | 2
Sensitive table (ST)
Group-ID | Disease | Count
1 | dyspepsia | 2
1 | pneumonia | 2
2 | bronchitis | 1
2 | flu | 2
2 | gastritis | 1

59 Accuracy of data analysis  Query A: SELECT COUNT(*) FROM Unknown-Microdata WHERE Disease = 'pneumonia' AND Age in [0, 30] AND Zipcode in [10001, 20000]  2 patients contracted pneumonia  2 out of 4 patients satisfy the query conditions on Age and Zipcode  Estimated answer = 2 * 2 / 4 = 1.
Tuple | Age | Sex | Zipcode | Group-ID
t1 | 23 | M | 11000 | 1
t2 | 27 | M | 13000 | 1
t3 | 35 | M | 59000 | 1
t4 | 59 | M | 12000 | 1
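The estimate on this slide can be computed from the released QIT and ST alone. The sketch below multiplies, for each group holding the queried disease, the disease count by the fraction of that group's QIT rows that satisfy the Age and Zipcode conditions; the function name is hypothetical.

```python
qit = [  # (Age, Sex, Zipcode, Group-ID)
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [  # (Group-ID, Disease, Count)
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a disease with range conditions on Age and Zipcode."""
    total = 0.0
    for group_id, d, count in st:
        if d != disease:
            continue
        group = [row for row in qit if row[3] == group_id]
        hits = sum(1 for age, _, zipcode, _ in group
                   if age_range[0] <= age <= age_range[1]
                   and zip_range[0] <= zipcode <= zip_range[1])
        total += count * hits / len(group)
    return total

answer = estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000))
print(answer)
```

For Query A this gives 2 * 2 / 4 = 1, because the exact QI values in the QIT are available to the analyst.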

60 A defect of anatomy  Existence breach: Does an individual exist in the microdata?

61 Future work  Re-publication  Tackle stronger background knowledge  Recent work [Martin et al., ICDE, 2007]  Improving utility  Pioneering work [Kifer and Gehrke, SIGMOD, 2006]  Adaptation to specific (non-trivial) applications  Location privacy  Pioneering work [Mokbel et al., VLDB, 2006]

