Published by Delphia Byrd. Modified about 1 year ago.

1
Personalized Privacy Preservation
Xiaokui Xiao, Yufei Tao
City University of Hong Kong

2
Privacy preserving data publishing
Purposes:
– Allow researchers to effectively study the correlation between various attributes
– Protect the privacy of every patient
Microdata:
Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Bill   5    M    14000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu

3
A naïve solution
Publish the microdata with the Name column removed.
It does not work. See next.
Microdata:
Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Bill   5    M    14000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu
Published table:
Age  Sex  Zipcode  Disease
4    M    12000    gastric ulcer
5    M    14000    dyspepsia
6    M    18000    pneumonia
9    M    19000    bronchitis
12   F    22000    flu
19   F    24000    pneumonia
21   F    33000    gastritis
25   F    34000    gastritis
28   F    37000    flu
56   F    58000    flu

4
Inference attack
An adversary joins the published table with an external database on the quasi-identifier (QI) attributes: Age, Sex, Zipcode.
Published table:
Age  Sex  Zipcode  Disease
4    M    12000    gastric ulcer
5    M    14000    dyspepsia
6    M    18000    pneumonia
9    M    19000    bronchitis
12   F    22000    flu
19   F    24000    pneumonia
21   F    33000    gastritis
25   F    34000    gastritis
28   F    37000    flu
56   F    58000    flu
An external database (a voter registration list):
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
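The linking step of this attack can be sketched in Python as a join on the QI attributes. This is an illustrative sketch only, not code from the presentation; the function name `link` and the tuple layout are assumptions made for the example.

```python
# Published table with names removed: (age, sex, zipcode, disease)
published = [
    (4, "M", 12000, "gastric ulcer"), (5, "M", 14000, "dyspepsia"),
    (6, "M", 18000, "pneumonia"), (9, "M", 19000, "bronchitis"),
    (12, "F", 22000, "flu"), (19, "F", 24000, "pneumonia"),
    (21, "F", 33000, "gastritis"), (25, "F", 34000, "gastritis"),
    (28, "F", 37000, "flu"), (56, "F", 58000, "flu"),
]

# External voter registration list: (name, age, sex, zipcode)
voters = [
    ("Andy", 4, "M", 12000), ("Bill", 5, "M", 14000),
    ("Ken", 6, "M", 18000), ("Nash", 9, "M", 19000),
    ("Mike", 7, "M", 17000), ("Alice", 12, "F", 22000),
    ("Betty", 19, "F", 24000), ("Linda", 21, "F", 33000),
    ("Jane", 25, "F", 34000), ("Sarah", 28, "F", 37000),
    ("Mary", 56, "F", 58000),
]


def link(published, voters):
    """Join both tables on the QI attributes (age, sex, zipcode)."""
    by_qi = {}
    for age, sex, zipcode, disease in published:
        by_qi.setdefault((age, sex, zipcode), []).append(disease)
    # Each voter maps to the diseases of all published rows sharing their QI values.
    return {name: by_qi.get((age, sex, zipcode), [])
            for name, age, sex, zipcode in voters}


linked = link(published, voters)
print(linked["Andy"])  # the single disease matching Andy's QI values
print(linked["Mike"])  # empty: Mike is not in the microdata
```

Because every QI combination in the published table is unique, each patient's disease is recovered exactly.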

5
Generalization
Transform each QI value into a less specific form (at the cost of information loss)
A generalized table:
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu
An external database:
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

6
k-anonymity
The following table is 2-anonymous
Quasi-identifier (QI) attributes: Age, Sex, Zipcode
Sensitive attribute: Disease
5 QI groups:
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu
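As a rough illustration, the k-anonymity level of this table can be checked by grouping rows on their QI values and taking the smallest group size. The helper name `anonymity_level` is hypothetical and the intervals are kept as plain strings; this is a sketch, not the authors' implementation.

```python
from collections import Counter

# The generalized table from this slide: (age, sex, zipcode, disease)
generalized = [
    ("[1, 5]", "M", "[10001, 15000]", "gastric ulcer"),
    ("[1, 5]", "M", "[10001, 15000]", "dyspepsia"),
    ("[6, 10]", "M", "[15001, 20000]", "pneumonia"),
    ("[6, 10]", "M", "[15001, 20000]", "bronchitis"),
    ("[11, 20]", "F", "[20001, 25000]", "flu"),
    ("[11, 20]", "F", "[20001, 25000]", "pneumonia"),
    ("[21, 25]", "F", "[30001, 35000]", "gastritis"),
    ("[21, 25]", "F", "[30001, 35000]", "gastritis"),
    ("[26, 60]", "F", "[35001, 60000]", "flu"),
    ("[26, 60]", "F", "[35001, 60000]", "flu"),
]


def anonymity_level(rows):
    """Return the size of the smallest QI group, i.e. the largest k satisfied."""
    group_sizes = Counter(row[:3] for row in rows)  # first three fields are the QI values
    return min(group_sizes.values())


k = anonymity_level(generalized)
num_groups = len({row[:3] for row in generalized})
print(k, num_groups)
```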

7
Drawback of k-anonymity
What is the disease of Linda?
Both tuples in Linda's QI group ([21, 25], F, [30001, 35000]) have gastritis, so the adversary can infer her disease with certainty.
A 2-anonymous table:
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu
An external database:
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

8
A better criterion: l-diversity
Each QI-group
– has at least l different sensitive values
– even the most frequent sensitive value does not have a lot of tuples
A 2-diverse table:
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
An external database:
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Alice  12   F    22000
Mike   7    M    17000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
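A minimal sketch of the first condition only (at least l different sensitive values per QI group), comparing the earlier 2-anonymous table with the 2-diverse table on this slide. The helper names `expand` and `diversity_level` are assumptions for illustration; the frequency condition in the second bullet is not checked here.

```python
from collections import defaultdict


def expand(groups):
    """Turn (QI values, diseases) pairs into one row per disease."""
    return [qi + (disease,) for qi, diseases in groups for disease in diseases]


# QI groups shared by both tables
common = [
    (("[1, 5]", "M", "[10001, 15000]"), ["gastric ulcer", "dyspepsia"]),
    (("[6, 10]", "M", "[15001, 20000]"), ["pneumonia", "bronchitis"]),
    (("[11, 20]", "F", "[20001, 25000]"), ["flu", "pneumonia"]),
]
two_anonymous = expand(common + [
    (("[21, 25]", "F", "[30001, 35000]"), ["gastritis", "gastritis"]),
    (("[26, 60]", "F", "[35001, 60000]"), ["flu", "flu"]),
])
two_diverse = expand(common + [
    (("[21, 60]", "F", "[30001, 60000]"),
     ["gastritis", "gastritis", "flu", "flu"]),
])


def diversity_level(rows):
    """Return the smallest number of distinct sensitive values in any QI group."""
    values = defaultdict(set)
    for *qi, disease in rows:
        values[tuple(qi)].add(disease)
    return min(len(v) for v in values.values())


print(diversity_level(two_anonymous))  # groups with a single disease leak it
print(diversity_level(two_diverse))
```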

9
Motivation 1: Personalization
Andy does not want anyone to know that he had a stomach problem
Sarah does not mind at all if others find out that she had flu
A 2-diverse table:
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
An external database:
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

10
Motivation 2: Non-primary case
Microdata:
Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Andy   4    M    12000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu

11
Motivation 2: Non-primary case (cont.)
A 2-diverse table:
Age       Sex  Zipcode         Disease
4         M    12000           gastric ulcer
4         M    12000           dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
An external database:
Name   Age  Sex  Zipcode
Andy   4    M    12000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

12
Motivation 3: SA generalization
How many female patients are there with age above 30?
Estimated from the generalized table: 4 ∙ (60 – 30) / (60 – 20) = 3
Real answer: 1
A generalized table:
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
An external database:
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
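The slide's estimate of 3 against a real answer of 1 can be reproduced with a short sketch. It assumes the analyst treats ages as uniformly distributed over the integer values of each generalized interval, so [21, 60] covers 40 ages of which 30 exceed 30. The function name `estimate_count` is hypothetical.

```python
def estimate_count(group_size, lo, hi, threshold):
    """Estimate how many tuples of a QI group with integer ages in [lo, hi]
    have age > threshold, assuming ages are uniform over the interval."""
    width = hi - lo + 1                           # number of integer ages covered
    above = max(0, hi - max(threshold, lo - 1))   # ages in the interval above threshold
    return group_size * above / width


# Female QI groups in the generalized table: (group size, lowest age, highest age)
female_groups = [(2, 11, 20), (4, 21, 60)]
estimated = sum(estimate_count(n, lo, hi, 30) for n, lo, hi in female_groups)

# Ages of the female patients in the original microdata
female_ages = [12, 19, 21, 25, 28, 56]
actual = sum(age > 30 for age in female_ages)

print(estimated, actual)
```

The gap arises because Mary (age 56) forces a wide interval onto three patients who are all under 30.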

13
Motivation 3: SA generalization (cont.)
Generalization of the sensitive attribute is beneficial in this case
A better generalized table:
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  flu
56        F    58000           respiratory infection
An external database:
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

14
Personalized anonymity
We propose
– a mechanism to capture personalized privacy requirements
– criteria for measuring the degree of security provided by a generalized table
– an algorithm for generating publishable tables

15
Guarding node
Andy does not want anyone to know that he had a stomach problem
He can specify “stomach disease” as the guarding node for his tuple
The data publisher should prevent an adversary from associating Andy with “stomach disease”
Name  Age  Sex  Zipcode  Disease        Guarding node
Andy  4    M    12000    gastric ulcer  stomach disease

16
Guarding node
Sarah is willing to disclose her exact symptom
She can specify Ø as the guarding node for her tuple
Name   Age  Sex  Zipcode  Disease  Guarding node
Sarah  28   F    37000    flu      Ø

17
Guarding node
Bill does not have any special preference
He can specify the guarding node for his tuple to be the same as his sensitive value
Name  Age  Sex  Zipcode  Disease    Guarding node
Bill  5    M    14000    dyspepsia  dyspepsia
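One way to picture guarding nodes is as nodes in a taxonomy of the sensitive attribute: a breach occurs when the adversary infers a value that lies in the subtree of the guarding node. The sketch below is illustrative only. The taxonomy in `PARENT` (including the root name "any illness" and the placement of bronchitis under "respiratory infection") is an assumption, as are the function names; Ø is modelled as `None`.

```python
# Assumed taxonomy, child -> parent. Only the category names
# "stomach disease" and "respiratory infection" appear in the slides.
PARENT = {
    "gastric ulcer": "stomach disease",
    "dyspepsia": "stomach disease",
    "gastritis": "stomach disease",
    "flu": "respiratory infection",
    "pneumonia": "respiratory infection",
    "bronchitis": "respiratory infection",
    "stomach disease": "any illness",
    "respiratory infection": "any illness",
}


def in_subtree(value, node):
    """True if value equals node or is a descendant of node in the taxonomy."""
    while value is not None:
        if value == node:
            return True
        value = PARENT.get(value)  # climb one level; None once past the root
    return False


def is_breach(guarding_node, inferred_value):
    """True if inferring inferred_value violates the guarding node.
    A guarding node of None stands for Ø (nothing needs protection)."""
    if guarding_node is None:
        return False
    return in_subtree(inferred_value, guarding_node)


print(is_breach("stomach disease", "dyspepsia"))  # Andy: any stomach disease counts
print(is_breach(None, "flu"))                     # Sarah: Ø, never a breach
print(is_breach("dyspepsia", "gastritis"))        # Bill: only his exact value counts
```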

18
A personalized approach
Name   Age  Sex  Zipcode  Disease        Guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Bill   5    M    14000    dyspepsia      dyspepsia
Ken    6    M    18000    pneumonia      respiratory infection
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu

19
Personalized anonymity
A table satisfies personalized anonymity with a parameter p_breach
– iff no adversary can breach the privacy requirement of any tuple with a probability above p_breach
If p_breach = 0.3, then no adversary should have more than a 30% probability of finding out that:
– Andy had a stomach disease
– Bill had dyspepsia
– etc.
Name   Age  Sex  Zipcode  Disease        Guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Bill   5    M    14000    dyspepsia      dyspepsia
Ken    6    M    18000    pneumonia      respiratory infection
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu

20
Personalized anonymity
Personalized anonymity with respect to a predefined parameter p_breach
– an adversary can breach the privacy requirement of any tuple with a probability of at most p_breach
Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection
We need a method for calculating the breach probabilities
What is the probability that Andy had some stomach problem?

21
Combinatorial reconstruction
Assumptions
– the adversary has no prior knowledge about each individual
– every individual involved in the microdata also appears in the external database

22
Combinatorial reconstruction
Andy does not want anyone to know that he had some stomach problem
What is the probability that the adversary can find out that “Andy had a stomach disease”?
An external database:
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
The published table:
Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection

23
Combinatorial reconstruction (cont.)
Can each individual appear more than once?
– No = the primary case
– Yes = the non-primary case
Some possible reconstructions (figure): one diagram for the primary case and one for the non-primary case, each linking the individuals Andy, Bill, Ken, Nash, Mike to the sensitive values gastric ulcer, dyspepsia, pneumonia, bronchitis

25
Breach probability (primary)
There are 120 possible reconstructions in total
If Andy is associated with a stomach disease in n_b reconstructions, the probability that the adversary associates Andy with some stomach problem is n_b / 120
Andy is associated with
– gastric ulcer in 24 reconstructions
– dyspepsia in 24 reconstructions
– gastritis in 0 reconstructions
n_b = 48
The breach probability for Andy’s tuple is 48 / 120 = 2 / 5
(Figure: Andy, Bill, Ken, Nash, Mike linked to gastric ulcer, dyspepsia, pneumonia, bronchitis)
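The counts on this slide can be checked by brute-force enumeration. The sketch below assumes a reconstruction assigns each of the four tuples in the QI group to a distinct one of the five candidate individuals (the primary case), which gives 5 ∙ 4 ∙ 3 ∙ 2 = 120 assignments. Variable names are illustrative, not taken from the paper.

```python
from fractions import Fraction
from itertools import permutations

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]   # candidates from the external database
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]  # tuples in the QI group
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}  # values under Andy's guarding node

# Primary case: each individual owns at most one tuple, so the owners are distinct.
reconstructions = list(permutations(people, len(diseases)))
total = len(reconstructions)

# Count reconstructions in which Andy owns a tuple carrying a stomach disease.
n_b = sum(
    any(owner == "Andy" and disease in stomach
        for owner, disease in zip(owners, diseases))
    for owners in reconstructions
)

breach_probability = Fraction(n_b, total)
print(total, n_b, breach_probability)
```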

26
Breach probability (non-primary)
There are 625 possible reconstructions in total
Andy is associated with gastric ulcer or dyspepsia or gastritis in 225 reconstructions
n_b = 225
The breach probability for Andy’s tuple is 225 / 625 = 9 / 25
(Figure: Andy, Bill, Ken, Nash, Mike linked to gastric ulcer, dyspepsia, pneumonia, bronchitis)
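The non-primary counts can be checked the same way. This sketch assumes each of the four tuples may be assigned independently to any of the five candidate individuals, so one person can own several tuples and there are 5^4 = 625 assignments. Variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]   # candidates from the external database
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]  # tuples in the QI group
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}  # values under Andy's guarding node

# Non-primary case: owners may repeat, so every tuple picks an owner independently.
reconstructions = list(product(people, repeat=len(diseases)))
total = len(reconstructions)

# Count reconstructions in which Andy owns at least one stomach-disease tuple.
n_b = sum(
    any(owner == "Andy" and disease in stomach
        for owner, disease in zip(owners, diseases))
    for owners in reconstructions
)

breach_probability = Fraction(n_b, total)
print(total, n_b, breach_probability)
```

Equivalently, Andy avoids both stomach-disease tuples in 4 ∙ 4 ∙ 5 ∙ 5 = 400 assignments, leaving 625 – 400 = 225.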

27
Breach probability: Formal results
An external database:
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
The published table:
Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection

29
More in our paper
An algorithm for computing generalized tables that
– satisfy personalized anonymity with a predefined p_breach
– reduce information loss by employing generalization on both the QI attributes and the sensitive attribute

30
Experiment settings 1
Goal: To show that k-anonymity and l-diversity do not always provide sufficient privacy protection
Real dataset: Pri-leaf, Nonpri-leaf, Pri-mixed, Nonpri-mixed
Cardinality = 100k
Attributes: Age, Education, Gender, Marital-status, Occupation, Income

31
Degree of privacy protection (Pri-leaf)
p_breach = 0.25 (k = 4, l = 4)

32
Degree of privacy protection (Nonpri-leaf)
p_breach = 0.25 (k = 4, l = 4)

33
Degree of privacy protection (Pri-mixed)
p_breach = 0.25 (k = 4, l = 4)

34
Degree of privacy protection (Nonpri-mixed)
p_breach = 0.25 (k = 4, l = 4)

35
Experiment settings 2
Goal: To show that applying generalization on both the QI attributes and the sensitive attribute will lead to more effective data analysis

36
Accuracy of analysis (no personalization)

37
Accuracy of analysis (with personalization)

38
Conclusions
– k-anonymity and l-diversity are not sufficient for the non-primary case
– Guarding nodes allow individuals to describe their privacy requirements better
– Generalization on the sensitive attribute is beneficial

39
Thank you! Datasets and implementation are available for download at
