Data Anonymization – Introduction and k-anonymity Li Xiong CS573 Data Privacy and Security
Outline Problem definition K-Anonymity Property Disclosure Control Methods Complexity Attacks against k-Anonymity
Defining Privacy in DB Publishing
  NO FOUL PLAY
  Traditional information security: protecting information and information systems from unauthorized access and use.
  Privacy-preserving data publishing: protecting private data while publishing useful information.
  Figure label: Modify Data
Problem: Disclosure Control
  Disclosure control is the discipline concerned with modifying data that contain confidential information about individual entities (persons, households, businesses, etc.) in order to prevent third parties working with these data from recognizing individuals in the data.
  Also known as privacy-preserving data publishing, anonymization, or de-identification.
  Types of disclosure:
    Identity disclosure: identification of an entity (person, institution).
    Attribute disclosure: the intruder finds out something new about the target person.
    Disclosure: identity disclosure, attribute disclosure, or both.
Microdata and External Information
  Microdata: a series of records, each containing information on an individual unit such as a person, a firm, or an institution (in contrast to computed tables).
  Masked microdata: microdata from which names and other identifying information have been removed or modified.
  External information: any information known to a presumptive intruder about some individuals in the initial microdata.
Disclosure Risk and Information Loss
  Disclosure risk: the risk that a given form of disclosure will occur if the masked microdata is released.
  Information loss: the amount of information that exists in the initial microdata but, because of the disclosure control methods applied, is no longer present in the masked microdata.
Disclosure Control Problem
  Figure: Individuals submit data; the Data Owner collects the Data, applies the Masking Process, and releases the Masked Data; the Researcher and the Intruder receive it, and the Intruder also holds External Data.
  Confidentiality of individuals: disclosure risk / anonymity properties.
  Preserve data utility: information loss.
  Researcher: uses masked data for statistical analysis.
  Intruder: uses masked data and external data to disclose confidential information.
Types of Disclosure
  Initial microdata (data owner):
    Name | SSN | Age | Zip | Diagnosis | Income
    Alice |  |  |  | AIDS | 17,000
    Bob |  |  |  | AIDS | 68,000
    Charley |  |  |  | Asthma | 80,000
    Dave |  |  |  | Asthma | 55,000
    Eva |  |  |  | Diabetes | 23,000
  Masked microdata (released):
    Age | Zip | Diagnosis | Income
     |  | AIDS | 17,000
     |  | AIDS | 68,000
     |  | Asthma | 80,000
     |  | Asthma | 55,000
     |  | Diabetes | 23,000
  External information (intruder):
    Name | SSN | Age | Zip
    Alice |  |  |
    Charley |  |  |
    Dave |  |  |
  Identity disclosure: Charley is the third record.
  Attribute disclosure: Alice has AIDS.
Disclosure Control for Tables vs. Microdata
  Microdata
  Precomputed statistics tables
Disclosure Control For Microdata
Disclosure Control for Tables
Outline Problem definition K-Anonymity Property Disclosure Control Methods Complexity and algorithms Attacks against k-Anonymity
History
  The term k-anonymity was introduced in 1998 by Samarati and Sweeney.
  Important papers:
    Sweeney L. (2002), k-Anonymity: A Model for Protecting Privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 10, No. 5.
    Sweeney L. (2002), Achieving k-Anonymity Privacy Protection Using Generalization and Suppression, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 10, No. 5.
    Samarati P. (2001), Protecting Respondents' Identities in Microdata Release, IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 6.
More recently
  Many new research papers in the last 4-5 years
  Theoretical results
  Many algorithms achieving k-anonymity
  Many improved principles and algorithms
Terminology (Sweeney)
  Data or microdata: person-specific information that is conceptually organized as a table of rows and columns.
  Tuple, row, or record: a row of the microdata.
  Attribute or field: a column of the microdata.
  Inference: belief in a new fact based on some other information.
  Disclosure: explicit and inferable information about a person.
Attribute Classification
  I_1, I_2, ..., I_m: identifier attributes
    Ex: Name and SSN
    Found in the initial microdata (IM) only
    Information that leads to a specific entity
  K_1, K_2, ..., K_p: key attributes (quasi-identifiers)
    Ex: Zip Code and Age
    Found in IM and in the masked microdata (MM)
    May be known by an intruder
  S_1, S_2, ..., S_q: confidential attributes
    Ex: Principal Diagnosis and Annual Income
    Found in IM and MM
    Assumed to be unknown to an intruder
Attribute Types
  Identifier, key (quasi-identifier), and confidential attributes
    RecID | Name | SSN | Age | State | Diagnosis | Income | Billing
    1 | John Wayne |  |  | MI | AIDS | 45,500 | 1,200
    2 | Mary Gore |  |  | MI | Asthma | 37,900 | 2,500
    3 | John Banks |  |  | MI | AIDS | 67,000 | 3,000
    4 | Jesse Casey |  |  | MI | Asthma | 21,000 | 1,000
    5 | Jack Stone |  |  | MI | Asthma | 90,… |
    6 | Mike Kopi |  |  | MI | Diabetes | 48,… |
    7 | Angela Simms |  |  | IN | Diabetes | 49,000 | 1,200
    8 | Nike Wood |  |  | MI | AIDS | 66,000 | 2,200
    9 | Mikhail Aaron |  |  | MI | AIDS | 69,000 | 4,200
    10 | Sam Pall |  |  | MI | Tuberculosis | 34,000 | 3,100
Motivating Example
  Figure label: Modify Data
  Column groups: Non-Sensitive Data, Sensitive Data
    # | Zip | Age | Nationality | Name | Condition
     |  |  | Brazilian | Ronaldo | Heart Disease
     |  |  | US | Bob | Heart Disease
     |  |  | Indian | Kumar | Cancer
     |  |  | Japanese | Umeko | Cancer
Motivating Example (continued)
  Figure labels: Modify Data; The Optimization Problem
  Published data: Alice publishes the data without the Name column.
    # | Zip | Age | Nationality | Condition
     |  |  | Brazilian | Heart Disease
     |  |  | US | Heart Disease
     |  |  | Indian | Cancer
     |  |  | Japanese | Cancer
  Attacker's knowledge: voter registration list
    # | Name | Zip | Age | Nationality
    1 | John |  |  | US
    2 | Paul |  |  | US
    3 | Bob |  |  | US
    4 | Chris |  |  | US
  Data leak!
Data Re-identification
  Figure labels: Disease, Birthdate, Sex, Zip, Name
  87% of the U.S. population can be uniquely identified by the combination of birth date, sex, and ZIP code.
Source of the Problem
  Even if we do not publish the individuals' names:
    Some fields may uniquely identify an individual.
    The attacker can use them to join with other sources and identify the individuals.
  Such a set of fields is a quasi-identifier.
  Table columns: #, Zip, Age, Nationality (non-sensitive data); Condition (sensitive data).
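The join an attacker performs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Zip and Age values are hypothetical (the slide's values are not reproduced here), and `link` is a name invented for this example.

```python
# Hypothetical published table (names removed) and hypothetical voter list.
published = [
    {"zip": "13053", "age": 28, "nationality": "Brazilian", "condition": "Heart Disease"},
    {"zip": "13067", "age": 29, "nationality": "US", "condition": "Heart Disease"},
    {"zip": "13053", "age": 37, "nationality": "Indian", "condition": "Cancer"},
    {"zip": "13067", "age": 36, "nationality": "Japanese", "condition": "Cancer"},
]
voter_list = [
    {"name": "John", "zip": "13053", "age": 45, "nationality": "US"},
    {"name": "Paul", "zip": "13067", "age": 52, "nationality": "US"},
    {"name": "Bob", "zip": "13067", "age": 29, "nationality": "US"},
    {"name": "Chris", "zip": "13053", "age": 23, "nationality": "US"},
]
QID = ("zip", "age", "nationality")  # quasi-identifier shared by both tables


def link(published, external, qid):
    """Return (name, condition) for each external person matching exactly one published record."""
    hits = []
    for person in external:
        key = tuple(person[a] for a in qid)
        matches = [r for r in published if tuple(r[a] for a in qid) == key]
        if len(matches) == 1:  # a unique match re-identifies the record
            hits.append((person["name"], matches[0]["condition"]))
    return hits


leaks = link(published, voter_list, QID)
```

With these made-up values, only Bob matches a single published record, so his condition is disclosed even though no name was published.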
K-Anonymity Definition
  A masked microdata (MM) satisfies the k-anonymity property with respect to a quasi-identifier set (QID) if every count in the frequency set of MM with respect to QID is greater than or equal to k.
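A minimal sketch of this definition in Python, assuming the masked microdata is a list of dictionaries. The function names and the sample records are hypothetical and serve only to illustrate the frequency-set test.

```python
from collections import Counter


def frequency_set(records, qid):
    """Count how many records share each combination of QID values."""
    return Counter(tuple(r[a] for a in qid) for r in records)


def is_k_anonymous(records, qid, k):
    """True if every count in the frequency set is at least k."""
    return all(count >= k for count in frequency_set(records, qid).values())


# Hypothetical masked microdata.
mm = [
    {"age": "20-29", "zip": "4107*", "illness": "Asthma"},
    {"age": "20-29", "zip": "4107*", "illness": "AIDS"},
    {"age": "30-39", "zip": "4109*", "illness": "Diabetes"},
    {"age": "30-39", "zip": "4109*", "illness": "Asthma"},
    {"age": "30-39", "zip": "4109*", "illness": "AIDS"},
]
QID = ("age", "zip")

freq = frequency_set(mm, QID)
```

Here the smallest count in the frequency set is 2, so this sample is 2-anonymous but not 3-anonymous with respect to the chosen QID.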
K-Anonymity Example
    RecID | Age | Zip | Sex | Illness
     |  |  | Male | AIDS
     |  |  | Female | Asthma
     |  |  | Female | AIDS
     |  |  | Male | Asthma
     |  |  | Male | Asthma
     |  |  | Male | Diabetes
  QID = { Age, Zip, Sex }
  SELECT COUNT(*) FROM Patient GROUP BY Sex, Zip, Age;
  If the result includes a group with a count less than k, the relation Patient does not have the k-anonymity property with respect to QID.
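The query can be tried with Python's built-in sqlite3 module. The sketch below is an assumption-laden illustration: the Age and Zip values are hypothetical (only Sex and Illness come from the slide), and a HAVING clause is added so that only violating groups are returned.

```python
import sqlite3

# Sex and Illness follow the slide; Age and Zip are hypothetical.
rows = [
    (1, 50, "41076", "Male", "AIDS"),
    (2, 30, "41076", "Female", "Asthma"),
    (3, 30, "41076", "Female", "AIDS"),
    (4, 20, "41076", "Male", "Asthma"),
    (5, 20, "41076", "Male", "Asthma"),
    (6, 50, "41076", "Male", "Diabetes"),
]


def violating_groups(rows, k):
    """Return the QID groups of Patient whose count is below k."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Patient (RecID INTEGER, Age INTEGER, Zip TEXT, Sex TEXT, Illness TEXT)"
    )
    conn.executemany("INSERT INTO Patient VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT Sex, Zip, Age, COUNT(*) FROM Patient "
        "GROUP BY Sex, Zip, Age HAVING COUNT(*) < ? "
        "ORDER BY Sex, Zip, Age",
        (k,),
    ).fetchall()
    conn.close()
    return result


ok_for_2 = violating_groups(rows, 2)
bad_for_3 = violating_groups(rows, 3)
```

An empty result means the relation is k-anonymous for that k. With these sample values every group has exactly two records, so the table passes for k = 2 and fails for k = 3.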
Outline Problem definition K-Anonymity Property Disclosure Control Methods Attacks against k-Anonymity
Disclosure Control Techniques
  Remove identifiers
  Generalization
  Suppression
  Sampling
  Microaggregation
  Perturbation / randomization
  Rounding
  Data swapping
  Etc.
Disclosure Control Techniques
  Different disclosure control techniques are applied to the following initial microdata:
    RecID | Name | SSN | Age | State | Diagnosis | Income | Billing
    1 | John Wayne |  |  | MI | AIDS | 45,500 | 1,200
    2 | Mary Gore |  |  | MI | Asthma | 37,900 | 2,500
    3 | John Banks |  |  | MI | AIDS | 67,000 | 3,000
    4 | Jesse Casey |  |  | MI | Asthma | 21,000 | 1,000
    5 | Jack Stone |  |  | MI | Asthma | 90,… |
    6 | Mike Kopi |  |  | MI | Diabetes | 48,… |
    7 | Angela Simms |  |  | IN | Diabetes | 49,000 | 1,200
    8 | Nike Wood |  |  | MI | AIDS | 66,000 | 2,200
    9 | Mikhail Aaron |  |  | MI | AIDS | 69,000 | 4,200
    10 | Sam Pall |  |  | MI | Tuberculosis | 34,000 | 3,100
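Remove Identifiers
  Identifiers such as Name and SSN are removed.
    RecID | Age | State | Diagnosis | Income | Billing
    1 | 44 | MI | AIDS | 45,500 | 1,200
    2 |  | MI | Asthma | 37,900 | 2,500
    3 |  | MI | AIDS | 67,000 | 3,000
    4 |  | MI | Asthma | 21,000 | 1,000
    5 |  | MI | Asthma | 90,… |
    6 |  | MI | Diabetes | 48,… |
    7 |  | IN | Diabetes | 49,000 | 1,200
    8 |  | MI | AIDS | 66,000 | 2,200
    9 |  | MI | AIDS | 69,000 | 4,200
    10 |  | MI | Tuberculosis | 34,000 | 3,100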
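As a small sketch, removing identifiers amounts to dropping the identifier columns from every record. The SSN strings and the second record's age below are placeholders invented for the example, and `remove_identifiers` is a hypothetical helper name.

```python
IDENTIFIERS = {"Name", "SSN"}  # identifier attributes to drop


def remove_identifiers(records, identifiers=IDENTIFIERS):
    """Return copies of the records without the identifier attributes."""
    return [{k: v for k, v in r.items() if k not in identifiers} for r in records]


# SSN values and the second age are placeholders, not data from the slide.
initial = [
    {"RecID": 1, "Name": "John Wayne", "SSN": "000-00-0001", "Age": 44,
     "State": "MI", "Diagnosis": "AIDS", "Income": 45500, "Billing": 1200},
    {"RecID": 7, "Name": "Angela Simms", "SSN": "000-00-0007", "Age": 30,
     "State": "IN", "Diagnosis": "Diabetes", "Income": 49000, "Billing": 1200},
]

masked = remove_identifiers(initial)
```

The key and confidential attributes are left untouched, which is why this step alone does not prevent linking through quasi-identifiers.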
Sampling
  Sampling is the disclosure control method in which only a subset of the records is released.
  If n is the number of records in the initial microdata and t is the number of released records, sf = t / n is called the sampling factor.
  Simple random sampling is the more frequently used form: each individual is chosen entirely by chance, and each member of the population has an equal chance of being included in the sample.
  Example with sf = 5 / 10 = 0.5:
    RecID | Age | State | Diagnosis | Income | Billing
    5 | 55 | MI | Asthma | 90,… |
    4 |  | MI | Asthma | 21,000 | 1,000
    8 |  | MI | AIDS | 66,000 | 2,200
    9 |  | MI | AIDS | 69,000 | 4,200
    7 |  | IN | Diabetes | 49,000 | 1,200
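Simple random sampling with a given sampling factor can be sketched with Python's standard `random` module. The function name and the `seed` parameter are assumptions made for this illustration; the seed is there only to make a run repeatable.

```python
import random


def simple_random_sample(records, sf, seed=None):
    """Release t = round(sf * n) records, each chosen with equal probability."""
    if not 0 <= sf <= 1:
        raise ValueError("sampling factor must be between 0 and 1")
    rng = random.Random(seed)
    t = round(sf * len(records))
    return rng.sample(records, t)  # sampling without replacement


rec_ids = list(range(1, 11))  # the ten RecIDs of the initial microdata
released = simple_random_sample(rec_ids, sf=0.5, seed=0)
```

With n = 10 and sf = 0.5, five distinct records are released; which five depends on the random draw.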
Microaggregation
  Order the records of the initial microdata by an attribute, create groups of consecutive values, and replace those values by the group average.
  Example: microaggregation for attribute Income with minimum group size 3.
  The total of all Income values remains the same (up to rounding).
    RecID | Age | State | Diagnosis | Income | Billing
    2 | 44 | MI | Asthma | 30,967 | 2,500
    4 |  | MI | Asthma | 30,967 | 1,000
    10 |  | MI | Tuberculosis | 30,967 | 3,100
    1 |  | MI | AIDS | 47,500 | 1,200
    6 |  | MI | Diabetes | 47,500 |
    7 |  | IN | Diabetes | 47,500 | 1,200
    3 |  | MI | AIDS | 73,000 | 3,000
    5 |  | MI | Asthma | 73,000 |
    8 |  | MI | AIDS | 73,000 | 2,200
    9 |  | MI | AIDS | 73,000 | 4,200
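A minimal sketch of fixed-size microaggregation for one attribute. Two points are assumptions rather than facts from the slides: the incomes of records 5 and 6 are taken as 90,000 and 48,000 (values consistent with the group averages 73,000 and 47,500 shown above), and leftover records are merged into the last group, which is one common convention.

```python
def microaggregate(values, k):
    """Replace each value by the mean of its group of at least k consecutive sorted values."""
    if not values:
        return []
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = max(1, len(values) // k)
    result = [0.0] * len(values)
    for g in range(n_groups):
        start = g * k
        # The last group absorbs any leftover records (assumed convention).
        end = (g + 1) * k if g < n_groups - 1 else len(values)
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result


# Incomes of RecID 1..10; 90000 and 48000 are assumed values (see lead-in).
incomes = [45500, 37900, 67000, 21000, 90000, 48000, 49000, 66000, 69000, 34000]
masked = microaggregate(incomes, k=3)
```

Under these assumptions the three groups have sizes 3, 3, and 4, and their means round to the values on the slide. Because each value is replaced by its group mean, the sum of the attribute is preserved.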
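Data Swapping
  In this disclosure control method, a sequence of so-called elementary swaps is applied to the microdata.
  An elementary swap consists of two actions:
    A random selection of two records i and j from the microdata.
    A swap (interchange) of the values of the attribute being swapped for records i and j.
  Example: the Income values of records 1 and 6 have been swapped.
    RecID | Age | State | Diagnosis | Income | Billing
    1 | 44 | MI | AIDS | 48,000 | 1,200
    2 |  | MI | Asthma | 37,900 | 2,500
    3 |  | MI | AIDS | 67,000 | 3,000
    4 |  | MI | Asthma | 21,000 | 1,000
    5 |  | MI | Asthma | 90,… |
    6 |  | MI | Diabetes | 45,… |
    7 |  | IN | Diabetes | 49,000 | 1,200
    8 |  | MI | AIDS | 66,000 | 2,200
    9 |  | MI | AIDS | 69,000 | 4,200
    10 |  | MI | Tuberculosis | 34,000 | 3,100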
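The two actions of an elementary swap translate directly into code. The sketch below is illustrative: the function names and the `seed` parameter are assumptions, and only four records with incomes from the initial microdata are used.

```python
import random


def elementary_swap(records, attribute, rng):
    """Pick two distinct records at random and interchange one attribute value."""
    i, j = rng.sample(range(len(records)), 2)
    records[i][attribute], records[j][attribute] = records[j][attribute], records[i][attribute]
    return i, j


def swap_attribute(records, attribute, n_swaps, seed=None):
    """Apply a sequence of elementary swaps to a copy of the microdata."""
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]  # leave the initial microdata untouched
    for _ in range(n_swaps):
        elementary_swap(swapped, attribute, rng)
    return swapped


records = [
    {"RecID": 1, "Diagnosis": "AIDS", "Income": 45500},
    {"RecID": 2, "Diagnosis": "Asthma", "Income": 37900},
    {"RecID": 3, "Diagnosis": "AIDS", "Income": 67000},
    {"RecID": 4, "Diagnosis": "Asthma", "Income": 21000},
]
swapped = swap_attribute(records, "Income", n_swaps=3, seed=1)
```

Swapping only permutes values within the column, so the set of Income values (and any statistic of that column alone) is unchanged while the link between Income and the other attributes is disturbed.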
Generalization and Suppression
  Generalization: replace a value with a less specific but semantically consistent value.
    # | Zip | Age | Nationality | Condition
     |  | < 40 | * | Heart Disease
     |  | < 40 | * | Heart Disease
     |  | < 40 | * | Cancer
     |  | < 40 | * | Cancer
  Suppression: do not release a value at all.
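A minimal sketch of both operations on single values. The ages are hypothetical, the cutoff of 40 mirrors the "< 40" entries in the table, and representing a suppressed value by "*" is an assumption made for this example.

```python
def generalize_age(age, cutoff=40):
    """Replace an exact age by a coarser range around the cutoff."""
    return f"< {cutoff}" if age < cutoff else f">= {cutoff}"


def suppress(value):
    """Withhold the value; "*" is the placeholder assumed here."""
    return "*"


# Hypothetical ages; nationalities and conditions follow the motivating example.
records = [
    {"age": 28, "nationality": "Brazilian", "condition": "Heart Disease"},
    {"age": 29, "nationality": "US", "condition": "Heart Disease"},
    {"age": 37, "nationality": "Indian", "condition": "Cancer"},
    {"age": 36, "nationality": "Japanese", "condition": "Cancer"},
]

masked = [
    {
        "age": generalize_age(r["age"]),
        "nationality": suppress(r["nationality"]),
        "condition": r["condition"],  # the sensitive value is released unchanged
    }
    for r in records
]
```

After masking, all four records share the same quasi-identifier values, so none of them can be singled out by age or nationality.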
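Domain and Value Generalization Hierarchies
  Domain generalization hierarchy for Zip: Z0 = {41075, 41076, 41095, 41099} -> Z1 = {4107*, 4109*} -> Z2 = {410**}
  Domain generalization hierarchy for Sex: S0 = {Male, Female} -> S1 = {Person}
  Value generalization hierarchy for Zip: 41075, 41076 -> 4107*; 41095, 41099 -> 4109*; 4107*, 4109* -> 410**
  Value generalization hierarchy for Sex: Male, Female -> Person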
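One way to encode a value generalization hierarchy is a child-to-parent map, sketched below for the Zip and Sex hierarchies of this slide. The dictionary layout and the `generalize` helper are assumptions made for illustration.

```python
# Value generalization hierarchies as child -> parent maps.
ZIP_PARENT = {
    "41075": "4107*", "41076": "4107*",
    "41095": "4109*", "41099": "4109*",
    "4107*": "410**", "4109*": "410**",
}
SEX_PARENT = {"Male": "Person", "Female": "Person"}


def generalize(value, parent, levels):
    """Climb the hierarchy by the given number of levels, stopping at the top."""
    for _ in range(levels):
        value = parent.get(value, value)  # the most general value has no parent
    return value


z1 = generalize("41075", ZIP_PARENT, 1)
z2 = generalize("41099", ZIP_PARENT, 2)
s1 = generalize("Female", SEX_PARENT, 1)
```

Generalizing a whole column by one level corresponds to moving from domain Z0 to Z1 (or S0 to S1) in the domain generalization hierarchy.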
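Generalization Lattice
  Z0 = {41075, 41076, 41095, 41099}, Z1 = {4107*, 4109*}, Z2 = {410**}
  S0 = {Male, Female}, S1 = {Person}
  Generalization lattice: nodes <S_i, Z_j> for i in {0, 1} and j in {0, 1, 2}
  Distance vector generalization lattice: [0, 0], [1, 0], [0, 1], [1, 1], [0, 2], [1, 2], where [i, j] gives the generalization level of Sex and of Zip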
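The distance vectors of the lattice can be enumerated mechanically. The sketch below assumes each vector is ordered as [level of Sex, level of Zip] and that an edge joins two nodes when exactly one attribute is generalized by one level; the function name is hypothetical.

```python
from itertools import product


def generalization_lattice(max_levels):
    """Return (nodes, edges) of the distance vector lattice for the given hierarchy heights."""
    nodes = list(product(*(range(m + 1) for m in max_levels)))
    edges = [
        (a, b)
        for a in nodes
        for b in nodes
        # b is a direct generalization of a: one attribute raised by one level
        if sum(b) - sum(a) == 1 and all(y >= x for x, y in zip(a, b))
    ]
    return nodes, edges


# Sex has levels S0..S1 (height 1); Zip has levels Z0..Z2 (height 2).
nodes, edges = generalization_lattice((1, 2))
```

The bottom node (0, 0) is the original table and the top node (1, 2) is the most generalized one; a search for a k-anonymous generalization walks this lattice from less to more general nodes.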
Generalization Tables
Outline Problem definition K-Anonymity Property Disclosure Control Methods Attacks and problems with k-Anonymity
Attacks against k-Anonymity
  Unsorted matching attack
  Complementary release attack
  Temporal attack
  Homogeneity attack for attribute disclosure
Unsorted Matching Attack
  The attack relies on the order in which rows appear in the released tables.
  Solution: randomly shuffle the rows before release.
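Shuffling before release is a one-liner with Python's standard `random` module. The helper name and the `seed` parameter are assumptions for this sketch; the seed only makes the example repeatable, and a real release would use an unpredictable source of randomness.

```python
import random


def release_shuffled(rows, seed=None):
    """Return a copy of the rows in random order so positions carry no information."""
    rng = random.Random(seed)
    shuffled = list(rows)  # do not reorder the owner's copy
    rng.shuffle(shuffled)
    return shuffled


generalized = [
    ("4107*", "Male", "Asthma"),
    ("4107*", "Male", "AIDS"),
    ("4109*", "Female", "Diabetes"),
    ("4109*", "Female", "Asthma"),
]
released = release_shuffled(generalized, seed=7)
```

If two releases of the same table are shuffled independently, the i-th row of one can no longer be matched to the i-th row of the other.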
Complementary Release Attack
Temporal Attack
  PT t1 (private table at time t1):
    black | 9/7/65 | male | headache
    black | 11/4/65 | male | rash
  GT t1 (generalized table released at time t1):
    black | 1965 | male | headache
    black | 1965 | male | rash
Homogeneity Attack
  k-Anonymity can create groups that leak information because of a lack of diversity in the sensitive attribute.
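A sketch of how such groups can be detected: group the records by quasi-identifier and flag every group whose sensitive attribute takes a single value. The table below is hypothetical, as is the function name.

```python
from collections import defaultdict


def homogeneous_groups(records, qid, sensitive):
    """Map each QID group with only one distinct sensitive value to that value."""
    values = defaultdict(set)
    for r in records:
        values[tuple(r[a] for a in qid)].add(r[sensitive])
    return {key: next(iter(v)) for key, v in values.items() if len(v) == 1}


# Hypothetical 2-anonymous table.
table = [
    {"zip": "130**", "age": "< 30", "condition": "Heart Disease"},
    {"zip": "130**", "age": "< 30", "condition": "Heart Disease"},
    {"zip": "130**", "age": ">= 30", "condition": "Cancer"},
    {"zip": "130**", "age": ">= 30", "condition": "Flu"},
]

leaks = homogeneous_groups(table, ("zip", "age"), "condition")
```

Every group here has two records, so the table is 2-anonymous, yet anyone known to be in the "< 30" group is revealed to have heart disease without being re-identified.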
Coming up Algorithms on k-anonymity Improved principles on k-anonymity