1 Beyond k-Anonymity Arik Friedman November 2008 Seminar in Databases (236826)

2 Outline
- Recap – privacy and k-anonymity
- ℓ-diversity (beyond k-anonymity)
- t-closeness (beyond k-anonymity and ℓ-diversity)
- Privacy?

3 Recap - k-Anonymity
Using medical data without disclosing patients' identity. The problem: an attacker can cross (link) the released data with external data.
[Figure: linking attack between two datasets]
- Medical data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge – plus the quasi-identifier {Zip, Birthdate, Gender}
- Voter list: Name, Address, Date registered, Party affiliation, Date last voted – plus the same quasi-identifier {Zip, Birthdate, Gender}

4 k-Anonymity – Formal Definition
- RT – released table
- (A1, A2, …, An) – its attributes
- QI_RT – quasi-identifier of RT
- RT[QI_RT] – projection of RT on QI_RT
RT satisfies k-anonymity iff each sequence of values in RT[QI_RT] appears at least k times in RT[QI_RT].
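A minimal sketch (not from the original slides) of how this definition can be checked programmatically; the record layout and attribute names are illustrative assumptions:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifier, k):
    """Check whether every combination of quasi-identifier values
    appears at least k times in the released table.

    rows             -- list of dicts, one per record
    quasi_identifier -- list of attribute names forming QI_RT
    k                -- required minimum group size
    """
    counts = Counter(tuple(row[attr] for attr in quasi_identifier) for row in rows)
    return all(count >= k for count in counts.values())

# Example: a q*-block like the first one on slide 8, grouped by (ZIP, Age).
released = [
    {"ZIP": "130**", "Age": "<30", "Condition": "Heart Disease"},
    {"ZIP": "130**", "Age": "<30", "Condition": "Heart Disease"},
    {"ZIP": "130**", "Age": "<30", "Condition": "Viral Infection"},
    {"ZIP": "130**", "Age": "<30", "Condition": "Viral Infection"},
]
print(is_k_anonymous(released, ["ZIP", "Age"], k=4))  # True
```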

5 Example – original data
#   ZIP    Age  Nationality  Condition (sensitive)
1   13053  28   Russian      Heart Disease
2   13068  29   American     Heart Disease
3   13068  21   Japanese     Viral Infection
4   13053  23   American     Viral Infection
5   14853  50   Indian       Cancer
6   14853  55   Russian      Heart Disease
7   14850  47   American     Viral Infection
8   14850  49   American     Viral Infection
9   13053  31   American     Cancer
10  13053  37   Indian       Cancer
11  13068  36   Japanese     Cancer
12  13068  35   American     Cancer

6 Example - 4-anonymized Table (Nationality suppressed)
#   ZIP    Age  Nationality  Condition (sensitive)
1   13053  28   *  Heart Disease
2   13068  29   *  Heart Disease
3   13068  21   *  Viral Infection
4   13053  23   *  Viral Infection
5   14853  50   *  Cancer
6   14853  55   *  Heart Disease
7   14850  47   *  Viral Infection
8   14850  49   *  Viral Infection
9   13053  31   *  Cancer
10  13053  37   *  Cancer
11  13068  36   *  Cancer
12  13068  35   *  Cancer

7 Example - 4-anonymized Table (Age generalized)
#   ZIP    Age  Nationality  Condition (sensitive)
1   13053  <30  *  Heart Disease
2   13068  <30  *  Heart Disease
3   13068  <30  *  Viral Infection
4   13053  <30  *  Viral Infection
5   14853  ≥40  *  Cancer
6   14853  ≥40  *  Heart Disease
7   14850  ≥40  *  Viral Infection
8   14850  ≥40  *  Viral Infection
9   13053  3*   *  Cancer
10  13053  3*   *  Cancer
11  13068  3*   *  Cancer
12  13068  3*   *  Cancer

8 Example - 4-anonymized Table (ZIP generalized)
#   ZIP    Age  Nationality  Condition (sensitive)
1   130**  <30  *  Heart Disease
2   130**  <30  *  Heart Disease
3   130**  <30  *  Viral Infection
4   130**  <30  *  Viral Infection
5   1485*  ≥40  *  Cancer
6   1485*  ≥40  *  Heart Disease
7   1485*  ≥40  *  Viral Infection
8   1485*  ≥40  *  Viral Infection
9   130**  3*   *  Cancer
10  130**  3*   *  Cancer
11  130**  3*   *  Cancer
12  130**  3*   *  Cancer

9 Example - 4-anonymized Table (the same table as on slide 8)
We have 4-anonymity! We have privacy!

10 Example - 4-anonymized Table (the same table as on slide 8)
Suppose the attacker knows the non-sensitive attributes of:
Name   Zip    Age  Nationality
Umeko  13068  21   Japanese
Bob    13053  31   American
…and the fact that Japanese have a very low incidence of heart disease.

11 Example - 4-anonymized Table (the same table as on slide 8)
With that knowledge: Umeko falls in the first q*-block (130**, <30), where the only conditions are heart disease and viral infection, and Japanese rarely have heart disease – so Umeko has a viral infection! Bob falls in the last q*-block (130**, 3*), where every record has cancer – so Bob has cancer!

12 k-Anonymity Drawbacks
- Basic reasons for the leak:
  - Sensitive attributes lack diversity in values → homogeneity attack
  - The attacker has additional background knowledge → background knowledge attack
- Hence a new notion has been proposed in addition to k-anonymity: ℓ-diversity

13 Adversary's background knowledge
- Has access to the published table T* and knows that it is a generalization of some base table T
- Instance-level background knowledge:
  - Some individuals are present in the table
  - Knowledge about sensitive attributes of specific individuals
- Demographic background knowledge:
  - Partial knowledge about the distribution of sensitive and non-sensitive attributes in the population
- Diversity in the sensitive attribute values should mitigate both!

14 Some notation…
- T = {t1, t2, …, tn}: a table with attributes A1, A2, …, Am, a subset of some larger population Ω
- t[C] = (t[C1], t[C2], …, t[Cp]): projection of t onto a set of attributes C ⊆ A
- S ⊆ A – sensitive attributes
- QI ⊆ A – quasi-identifier attributes
- T*: the anonymized table
- q*-block: the set of records that were generalized to the same value q* in T*

15 Bayes Optimal Privacy
- An ideal notion of privacy: models background knowledge as a probability distribution over attributes and uses Bayesian inference techniques
- Simplifying assumptions:
  - A single, multi-dimensional quasi-identifier attribute Q
  - A single sensitive attribute S
  - T is a simple random sample from the population Ω
  - The adversary, Alice, knows the complete joint distribution f of Q and S (worst-case assumption)

16 Bayes Optimal Privacy
- Assume Bob appears in the generalized table T*.
- Alice's prior belief about Bob's sensitive attribute:
  α(q,s) = P_f( t[S] = s | t[Q] = q )
- After seeing T*, Alice's belief changes to its posterior value (the observed belief):
  β(q,s,T*) = P_f( t[S] = s | t[Q] = q ∧ ∃ t* ∈ T*, t* generalizes t )
- We wouldn't want Alice to learn "much": β(q,s,T*) should not be much larger than α(q,s).

17 Bayes Optimal Privacy - Example
- Bob, Alice's neighbor, is a 62-year-old state employee.
- Alice's prior belief: 10% of men over 60 have cancer:
  α(age ≥ 60 ∧ ZIPcode = 02138, cancer) = α(age ≥ 60, cancer) = 0.1
- In the k-anonymized GIC data T*, the following rows could relate to Bob:
  Age  Zipcode  Diagnosis
  ≥60  02138    Cancer
  ≥60  02138    Cancer
  ≥60  02138    Healthy
  ≥60  02138    Pneumonia
- Alice's belief changes to its posterior value:
  β(age ≥ 60 ∧ ZIPcode = 02138, cancer, T*) = 0.5

18 Bayes Optimal Privacy
Theorem 3.1 gives a closed form for the observed belief:
  β(q,s,T*) = [ n(q*,s) · f(s|q) / f(s|q*) ]  /  [ Σ_{s' ∈ S} n(q*,s') · f(s'|q) / f(s'|q*) ]
where n(q*,s') is the number of tuples in T* with t*[Q] = q* and t*[S] = s', and f(·|q*) is the background distribution f conditioned on the quasi-identifier being generalized to q*.
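A small sketch of this closed-form computation, assuming the adversary's background distribution is supplied as the conditional probabilities f(s|q) and f(s|q*); all function and variable names are illustrative, not from the paper:

```python
def observed_belief(s, counts, f_given_q, f_given_qstar):
    """Posterior belief beta(q, s, T*) for one q*-block.

    counts        -- dict: sensitive value s' -> n(q*, s'), tuple counts in the block
    f_given_q     -- dict: s' -> f(s' | q), background distribution for Bob's exact QI q
    f_given_qstar -- dict: s' -> f(s' | q*), same distribution conditioned on the generalized QI q*
    """
    weight = lambda sp: counts[sp] * f_given_q[sp] / f_given_qstar[sp]
    return weight(s) / sum(weight(sp) for sp in counts)

# Slide 17: four tuples in Bob's q*-block, two of them with Cancer.  If f(.|q)
# happens to equal f(.|q*) (a simplifying, illustrative assumption; the
# probabilities below are made up), the weights reduce to the raw counts and
# the posterior is just the in-block frequency, 2/4 = 0.5.
counts = {"Cancer": 2, "Healthy": 1, "Pneumonia": 1}
f = {"Cancer": 0.1, "Healthy": 0.8, "Pneumonia": 0.1}
print(observed_belief("Cancer", counts, f_given_q=f, f_given_qstar=f))  # 0.5
```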

19 Privacy principles
- Positive disclosure: the adversary can correctly identify the value of a sensitive attribute: ∃ q,s such that β(q,s,T*) > 1 − δ for a given δ.
- Negative disclosure: the adversary can correctly eliminate a possible value of a sensitive attribute: β(q,s,T*) < ε for a given ε, while ∃ t ∈ T such that t[Q] = q but t[S] ≠ s.

20 Privacy principles
- Note that not all positive and negative disclosures are bad: if Alice already knew Bob has cancer, there is not much one can do!
- Uninformative principle: there should not be a large difference between the prior and posterior beliefs.

21 Bayes Optimal Privacy
- Limitations in practice:
  - Insufficient knowledge: the data publisher is unlikely to know f
  - The publisher does not know how much the adversary actually knows
    - The adversary may have instance-level knowledge
    - There is no way to model non-probabilistic knowledge
  - Multiple adversaries may have different levels of knowledge
- Hence a practical definition is needed.

22 ℓ-diversity principle
- Revisit the closed form for the observed belief β(q,s,T*) from Theorem 3.1.
- Positive disclosure of a value s can occur when the contribution of every other value s' ≠ s to the denominator is negligible, i.e., when for each s' ≠ s either n(q*,s') = 0 or f(s'|q) ≈ 0.

23 ℓ-diversity principle
- This can occur due to a combination of:
  - Lack of diversity (n(q*,s') = 0 for the other values) – mitigated by requiring "well-represented" sensitive values in each q*-block
  - Strong background knowledge (f(s'|q) ≈ 0) – with ℓ well-represented values, at least ℓ−1 damaging pieces of background knowledge are required to succeed

24 ℓ-diversity principle
- A q*-block is ℓ-diverse if it contains at least ℓ "well-represented" values for the sensitive attribute S. A table is ℓ-diverse if every q*-block is ℓ-diverse.
- Example – distinct ℓ-diversity: there are at least ℓ distinct values for the sensitive attribute in each q*-block.
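A minimal sketch of the simplest instantiation, distinct ℓ-diversity; the record layout and names are illustrative assumptions:

```python
from collections import defaultdict

def is_distinct_l_diverse(rows, quasi_identifier, sensitive, l):
    """Check distinct l-diversity: every q*-block (records sharing the same
    generalized quasi-identifier values) must contain at least l distinct
    values of the sensitive attribute."""
    blocks = defaultdict(set)
    for row in rows:
        blocks[tuple(row[a] for a in quasi_identifier)].add(row[sensitive])
    return all(len(values) >= l for values in blocks.values())
```

On the table of the next slide (three q*-blocks, each containing heart disease, viral infection, and cancer) this check would pass for ℓ = 3.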

25 Example – 3-distinct-diverse Table
#   ZIP    Age  Nationality  Condition (sensitive)
1   1305*  ≤40  *  Heart Disease
2   1305*  ≤40  *  Viral Infection
3   1305*  ≤40  *  Cancer
4   1305*  ≤40  *  Cancer
5   1485*  ≥40  *  Cancer
6   1485*  ≥40  *  Heart Disease
7   1485*  ≥40  *  Viral Infection
8   1485*  ≥40  *  Viral Infection
9   1306*  ≤40  *  Heart Disease
10  1306*  ≤40  *  Viral Infection
11  1306*  ≤40  *  Cancer
12  1306*  ≤40  *  Cancer
We have 3-distinct diversity! We have privacy!

26 Example - 3-distinct-diverse Table
#   ZIP    Age  Nat.  Condition (sensitive)
1   130**  <30  *  Heart Disease
2   130**  <30  *  Heart Disease
3   130**  <30  *  Viral Infection
4   130**  <30  *  Viral Infection
5   130**  <30  *  Viral Infection
6   130**  <30  *  Viral Infection
7   130**  <30  *  Viral Infection
8   130**  <30  *  Viral Infection
9   130**  <30  *  Viral Infection
10  130**  <30  *  Viral Infection
11  130**  <30  *  Viral Infection
12  130**  <30  *  Cancer
Suppose the attacker knows the non-sensitive attributes of Umeko (Zip 13068, Age 21, Japanese) and the fact that Japanese have a very low incidence of heart disease. It is still very likely that Umeko has a viral infection!

27 Entropy ℓ-diversity
A table is entropy ℓ-diverse if for every q*-block
  −Σ_{s ∈ S} p(q*,s) · log p(q*,s) ≥ log(ℓ)
where p(q*,s) is the fraction of tuples in the q*-block with sensitive value s.
Not feasible when one value is very common.
Example with 2 sensitive attribute values:
p(S1)  p(S2)  Entropy (log10)  Largest ℓ satisfied
1      0      0                1
0.9    0.1    0.14             1.38
0.8    0.2    0.22             1.65
0.7    0.3    0.27             1.84
0.6    0.4    0.29             1.96
0.5    0.5    0.30             2
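Since the condition can be rewritten as exp(entropy of the block) ≥ ℓ (with matching log bases), a q*-block's largest satisfied ℓ is simply the exponential of its entropy. A small sketch, with illustrative names:

```python
import math
from collections import Counter

def entropy_l(block_sensitive_values):
    """Return the largest l for which a q*-block is entropy l-diverse,
    i.e. exp of the entropy of its sensitive-value distribution."""
    counts = Counter(block_sensitive_values)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)

def is_entropy_l_diverse(block_sensitive_values, l):
    return entropy_l(block_sensitive_values) >= l

# Two sensitive values split 0.8 / 0.2, as in the table above:
print(round(entropy_l(["a"] * 8 + ["b"] * 2), 2))  # 1.65
```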

28 Recursive (c,ℓ)-diversity
- None of the sensitive values should occur too frequently.
- Let r_i be the count of the i-th most frequent sensitive value in a q*-block.
- Given a constant c, recursive (c,ℓ)-diversity is satisfied if r1 < c(r_ℓ + r_{ℓ+1} + … + r_m).
- For example, with m = 3 sensitive values:
  - (2,2)-diversity: r1 < 2(r2 + r3)
  - (2,3)-diversity: r1 < 2·r3. Equivalently: even if we eliminate one sensitive value, we still have (2,2)-diversity.
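A small sketch of the recursive (c,ℓ)-diversity test on one q*-block (names are illustrative):

```python
from collections import Counter

def is_recursive_cl_diverse(block_sensitive_values, c, l):
    """Recursive (c, l)-diversity: the most frequent sensitive value must occur
    fewer than c times the combined count of the l-th, (l+1)-th, ... most
    frequent values, i.e. r1 < c * (r_l + r_{l+1} + ... + r_m)."""
    freqs = sorted(Counter(block_sensitive_values).values(), reverse=True)
    if len(freqs) < l:
        return False  # fewer than l distinct values cannot satisfy the condition
    return freqs[0] < c * sum(freqs[l - 1:])

# m = 3 values with counts r1=5, r2=3, r3=2:
values = ["x"] * 5 + ["y"] * 3 + ["z"] * 2
print(is_recursive_cl_diverse(values, c=2, l=2))  # True:  5 < 2 * (3 + 2)
print(is_recursive_cl_diverse(values, c=2, l=3))  # False: 5 >= 2 * 2
```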

29 An algorithm for ℓ-diversity?
- Monotonicity property: if T* preserves privacy, then so does every generalization of it.
- Satisfied by k-anonymity.
- Most k-anonymization algorithms work for any privacy measure that satisfies monotonicity, so we can reuse previous algorithms directly.
- Bayes optimal privacy is not monotonic.
- The ℓ-diversity variants are monotonic!

30 Mondrian
  Mondrian(partition)
    if (no allowable multidimensional cut for partition)
      return φ : partition → summary
    else
      dim ← choose_dimension()
      fs ← frequency_set(partition, dim)
      splitVal ← find_median(fs)
      lhs ← {t ∈ partition : t.dim ≤ splitVal}
      rhs ← {t ∈ partition : t.dim > splitVal}
      return Mondrian(rhs) ∪ Mondrian(lhs)
[Figure: example Mondrian partitioning of the Age/Weight plane into rectangular q*-blocks]
Example: Mondrian with entropy ℓ-diversity, ℓ = 1.89 (for two sensitive values this is equivalent to limiting the prevalence of either value to at most 2/3, and also equivalent to recursive (2,2)-diversity).
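Below is a hedged, runnable Python sketch of the recursion above, simplified to numeric quasi-identifiers with a pluggable "allowable" predicate (e.g. a k-anonymity or ℓ-diversity check on both halves). It is not the authors' implementation, and all names are illustrative:

```python
def mondrian(partition, qi_dims, is_allowable):
    """Greedy top-down Mondrian: recursively split on the median of a
    quasi-identifier dimension while both halves remain allowable.

    partition    -- list of dicts (records)
    qi_dims      -- numeric quasi-identifier attribute names
    is_allowable -- predicate on a partition (e.g. lambda p: len(p) >= k)
    """
    # Try dimensions in order of decreasing value range (a common heuristic).
    ranges = {d: max(r[d] for r in partition) - min(r[d] for r in partition) for d in qi_dims}
    for dim in sorted(qi_dims, key=lambda d: ranges[d], reverse=True):
        values = sorted(r[dim] for r in partition)
        split = values[len(values) // 2]              # median split value
        lhs = [r for r in partition if r[dim] <= split]
        rhs = [r for r in partition if r[dim] > split]
        if lhs and rhs and is_allowable(lhs) and is_allowable(rhs):
            return mondrian(lhs, qi_dims, is_allowable) + mondrian(rhs, qi_dims, is_allowable)
    # No allowable cut: summarize the partition into one generalized q*-block.
    summary = {d: (min(r[d] for r in partition), max(r[d] for r in partition)) for d in qi_dims}
    return [(summary, partition)]

# Usage: 2-anonymous partitioning on Age and Weight.
records = [{"Age": a, "Weight": w} for a, w in [(25, 60), (27, 62), (40, 80), (43, 78), (55, 70), (58, 72)]]
for block_summary, block in mondrian(records, ["Age", "Weight"], lambda p: len(p) >= 2):
    print(block_summary, len(block))
```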

31 Experiments
- Used Incognito (a popular generalization algorithm)
- Adult dataset (census data) from the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets/Adult)
- The reported results refer to one chosen sensitive attribute of this dataset.

32 Experiments - Utility
- Intuitively: the "usefulness" of the ℓ-diverse and k-anonymized tables. Used k, ℓ = 2, 4, 6, 8.
- Measured the number of generalization steps performed vs. k and ℓ.
- Measured the average size of the q*-blocks generated (similar to C_AVG) vs. k and ℓ (see the sketch below).
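As a sketch of the second metric: if C_AVG is taken to be the average q*-block size normalized by k (an assumption about the metric's exact definition, not stated on the slide), it can be computed as follows; names are illustrative:

```python
from collections import Counter

def c_avg(rows, quasi_identifier, k):
    """Normalized average equivalence-class size:
    (number of records / number of q*-blocks) / k.
    A value near 1 means blocks are about as small as k allows; larger values
    mean more records were lumped together, i.e. more information was lost."""
    blocks = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return (len(rows) / len(blocks)) / k
```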

33 Example – 3-diverse Table (the same table as on slide 25)
We have 3-diversity! We have privacy!

34 Similarity attack
Bob: Zip 47678, Age 27.
A 3-diverse patient table:
Zipcode  Age  Salary  Disease
476**    2*   20K     Gastric Ulcer
476**    2*   30K     Gastritis
476**    2*   40K     Stomach Cancer
4790*    ≥40  50K     Gastritis
4790*    ≥40  100K    Flu
4790*    ≥40  70K     Bronchitis
476**    3*   60K     Bronchitis
476**    3*   80K     Pneumonia
476**    3*   90K     Stomach Cancer
Conclusion:
1. Bob's salary is in [20K, 40K], which is relatively low.
2. Bob has some stomach-related disease.
ℓ-diversity does not consider the semantic meanings of sensitive values; it is insufficient to prevent attribute disclosure.

35 Skewness attack
Two sensitive values in the population Ω: Cancer (1%) and Healthy (99%) (entropy: 1.0576).
#   Age  Condition (sensitive)
1   <30  Cancer
2   <30  Cancer
3   <30  Healthy
4   <30  Healthy
    (entropy: 2)
5   3*   Cancer
6   3*   Healthy
7   3*   Healthy
8   3*   Healthy
9   3*   Healthy
    (entropy: 1.65)
10  ≥30  Healthy
11  ≥30  Cancer
12  ≥30  Cancer
13  ≥30  Cancer
14  ≥30  Cancer
    (entropy: 1.65)
The q*-blocks are (nearly) equivalent in terms of ℓ-diversity, but very different semantically – the attacker learned a lot!
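The numbers annotated on this slide appear to be "effective ℓ" values, i.e. e raised to the entropy of each distribution (the largest ℓ for which entropy ℓ-diversity holds). A small sketch that reproduces them:

```python
import math

def effective_l(probabilities):
    """e raised to the entropy of a distribution: the largest l for which
    entropy l-diversity holds for that distribution."""
    return math.exp(-sum(p * math.log(p) for p in probabilities if p > 0))

print(round(effective_l([0.5, 0.5]), 4))    # 2.0     -- first q*-block: 2 Cancer, 2 Healthy
print(round(effective_l([0.2, 0.8]), 4))    # 1.6494  -- blocks with a 1-out-of-5 minority value
print(round(effective_l([0.01, 0.99]), 4))  # 1.0576  -- overall population: 1% Cancer
```

The point of the attack: a block that is 20% Cancer and one that is 80% Cancer look equally "diverse" by this measure, yet both move an adversary's belief far away from the 1% base rate.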

36 t-Closeness: the main idea
- Rationale: start from the adversary's external knowledge (belief B0). Publishing even a completely generalized table (every quasi-identifier value suppressed to *, only the Disease column visible: Flu, Heart Disease, Cancer, …, Gastritis) reveals the overall distribution Q of the sensitive values, moving the belief to B1.

37 t-Closeness: the main idea
- Rationale (continued): a released table with partially generalized quasi-identifiers (rows such as Age 2*, Zipcode 479**, …, Male, Flu; …; Age ≥50, Zipcode 4766*, …, *, Gastritis) additionally reveals the distribution P_i of sensitive values in each equivalence class, moving the belief from B1 to B2.

38 t-Closeness: the main idea
- Observations:
  - Q should be treated as public.
  - The knowledge gain has two parts: about the whole population (from B0 to B1) and about specific individuals (from B1 to B2).
  - We bound the knowledge gain between B1 and B2 instead.
- Principle: the distance between Q and each P_i should be bounded by a threshold t.

39 t-closeness
An equivalence class has t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table has t-closeness if all of its equivalence classes have t-closeness.
The distance measure used is the Earth Mover's Distance (EMD). t-closeness maintains monotonicity!
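For a numeric sensitive attribute taking m ordered values, the EMD with ground distance |i − j| / (m − 1) reduces to a normalized sum of absolute cumulative differences. A hedged sketch (names are illustrative), which reproduces the 0.167 value claimed for Salary on the next slide for its first equivalence class:

```python
def emd_ordered(p, q):
    """Earth Mover's Distance between two distributions over the same m
    ordered values, with ground distance |i - j| / (m - 1).  Equals the
    normalized sum of absolute cumulative differences."""
    m = len(p)
    total, cumulative = 0.0, 0.0
    for pi, qi in zip(p, q):
        cumulative += pi - qi
        total += abs(cumulative)
    return total / (m - 1)

# Salary example from the next slide: the whole table holds the nine values
# 3K..11K with equal weight; the first equivalence class holds {3K, 5K, 9K}.
whole = [1 / 9] * 9
block = [1 / 3, 0, 1 / 3, 0, 0, 0, 1 / 3, 0, 0]
print(round(emd_ordered(block, whole), 3))  # 0.167
```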

40 Example – t-closeness
#  ZIP    Age  Salary  Condition (sensitive)
1  4767*  ≤40  3K      Gastric ulcer
2  4767*  ≤40  5K      Stomach cancer
3  4767*  ≤40  9K      Pneumonia
4  4790*  ≥40  6K      Gastritis
5  4790*  ≥40  11K     Flu
6  4790*  ≥40  8K      Bronchitis
7  4760*  ≤40  4K      Gastritis
8  4760*  ≤40  7K      Bronchitis
9  4760*  ≤40  10K     Stomach cancer
We have 0.167-closeness w.r.t. Salary and 0.278-closeness w.r.t. Disease! We have privacy!

41 Netflix privacy breach (Robust De-anonymization of Large Sparse Datasets, Narayanan and Shmatikov, 2008)
- Released for the Netflix Prize contest:
  - 17,770 movie titles
  - 480,189 users with random customer IDs
  - Ratings: 1-5
  - For each movie we have the ratings: (MovieID, CustomerID, Rating, Date)
- Re-arranged by CustomerID:
  Movie              CustomerID  Rank  Date
  The Godfather      17236       4     20.5
  Quantum of Solace  17236       2     20.11
  Hamlet             17236       5     14.10
  The Scorpion King  17236       1     12.8
  The profit         17236       5     11.8

42 Netflix privacy breach (Robust De-anonymization of Large Sparse Datasets, Narayanan and Shmatikov, 2008)
The per-customer Netflix rating history above can be linked, e.g., with IMDB data, to re-identify individuals!
[Figure: the Netflix rating table from the previous slide alongside IMDB data attributed to a user named James Hitchcock.]
(This example is made up. Possibly, James Hitchcock has nothing to do with Netflix.)

43 Epilogue
"You have zero privacy anyway. Get over it." – Scott McNealy (Sun CEO, January 1999)

44 HIPAA excerpt – Health Insurance Portability and Accountability Act of 1996

45 Thank you!

46 Bibliography
- "Mondrian Multidimensional k-Anonymity", K. LeFevre, D. J. DeWitt, R. Ramakrishnan, 2006
- "ℓ-Diversity: Privacy Beyond k-Anonymity", A. Machanavajjhala, Johannes Gehrke, Daniel Kifer, 2006
- "t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity", Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian, 2006
- Presentations:
  - "Privacy In Databases", B. Aditya Prakash
  - "K-Anonymity and Other Cluster-Based Methods", Ge. Ruan

