Data Anonymization – Introduction and k-anonymity Li Xiong CS573 Data Privacy and Security.


Outline
- Problem definition
- K-Anonymity Property
- Disclosure Control Methods
- Complexity
- Attacks against k-Anonymity

Defining Privacy in DB Publishing
- Traditional information security: protecting information and information systems from unauthorized access and use.
- Privacy-preserving data publishing: protecting private data while publishing useful information.
[Figure labels: No foul play; Modify data]

Problem: Disclosure Control
- Disclosure control is the discipline concerned with modifying data that contain confidential information about individual entities (persons, households, businesses, etc.) in order to prevent third parties working with these data from recognizing individuals in them.
- Related terms: privacy-preserving data publishing, anonymization, de-identification.
- Types of disclosure:
  - Identity disclosure: identification of an entity (person, institution).
  - Attribute disclosure: the intruder finds out something new about the target person.
  - A disclosure can be an identity disclosure, an attribute disclosure, or both.

Microdata and External Information
- Microdata: a series of records, each containing information on an individual unit such as a person, a firm, or an institution (in contrast to computed tables).
- Masked microdata: microdata from which names and other identifying information have been removed or modified.
- External information: any information known to a presumptive intruder about some individuals in the initial microdata.

Disclosure Risk and Information Loss
- Disclosure risk: the risk that a given form of disclosure will occur if the masked microdata is released.
- Information loss: the amount of information that exists in the initial microdata but, because of the disclosure control methods, is no longer present in the masked microdata.

Disclosure Control Problem
[Diagram, shown in three successive builds]
- Flow: individuals submit data; the data owner collects it, applies a masking process, and releases the masked data; the researcher and the intruder receive the masked data.
- Data owner's goals: confidentiality of individuals and preserving data utility.
- Measures: disclosure risk / anonymity properties, and information loss.
- Researcher: uses the masked data for statistical analysis.
- Intruder: uses the masked data and external data to disclose confidential information.

Types of Disclosure
Initial microdata (held by the data owner):
Name     SSN        Age  Zip    Diagnosis  Income
Alice    123456789  44   48202  AIDS       17,000
Bob      323232323  44   48202  AIDS       68,000
Charley  232345656  44   48201  Asthma     80,000
Dave     333333333  55   48310  Asthma     55,000
Eva      666666666  55   48310  Diabetes   23,000
Masked microdata (released by the data owner):
Age  Zip    Diagnosis  Income
44   48202  AIDS       17,000
44   48202  AIDS       68,000
44   48201  Asthma     80,000
55   48310  Asthma     55,000
55   48310  Diabetes   23,000
External information (known to the intruder):
Name     SSN        Age  Zip
Alice    123456789  44   48202
Charley  232345656  44   48201
Dave     333333333  55   48310
- Identity disclosure: Charley is the third record.
- Attribute disclosure: Alice has AIDS.
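
The two disclosure types in this example can be checked mechanically. The sketch below is illustrative only: the function name `classify_disclosure` is hypothetical, and it uses a deliberately strict reading of attribute disclosure (all records matching the person share one Diagnosis value), which is an assumption rather than the slide's definition.

```python
# Masked microdata released by the data owner (identifiers removed).
masked = [
    {"Age": 44, "Zip": "48202", "Diagnosis": "AIDS", "Income": 17000},
    {"Age": 44, "Zip": "48202", "Diagnosis": "AIDS", "Income": 68000},
    {"Age": 44, "Zip": "48201", "Diagnosis": "Asthma", "Income": 80000},
    {"Age": 55, "Zip": "48310", "Diagnosis": "Asthma", "Income": 55000},
    {"Age": 55, "Zip": "48310", "Diagnosis": "Diabetes", "Income": 23000},
]

# External information available to the intruder.
external = [
    {"Name": "Alice", "Age": 44, "Zip": "48202"},
    {"Name": "Charley", "Age": 44, "Zip": "48201"},
    {"Name": "Dave", "Age": 55, "Zip": "48310"},
]


def classify_disclosure(person, masked_rows, sensitive="Diagnosis"):
    """Return (1-based matching record numbers, disclosure labels) for one person.

    Assumption: 'attribute' means every matching record carries the same
    sensitive value, so the intruder learns it with certainty.
    """
    matches = [
        i
        for i, row in enumerate(masked_rows, start=1)
        if (row["Age"], row["Zip"]) == (person["Age"], person["Zip"])
    ]
    labels = []
    if len(matches) == 1:
        labels.append("identity")
    values = {masked_rows[i - 1][sensitive] for i in matches}
    if matches and len(values) == 1:
        labels.append("attribute")
    return matches, labels


for person in external:
    print(person["Name"], classify_disclosure(person, masked))
```

Under this reading, Charley matches only record 3 (identity disclosure, and with it his diagnosis), Alice matches records 1 and 2, which agree on the diagnosis (attribute disclosure without identity disclosure), and Dave matches two records with different diagnoses.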

Types of Disclosure (Zip shortened to three digits in the masked microdata)
The initial microdata and the external information are the same as on the previous slide.
Masked microdata:
Age  Zip  Diagnosis  Income
44   482  AIDS       17,000
44   482  AIDS       68,000
44   482  Asthma     80,000
55   483  Asthma     55,000
55   483  Diabetes   23,000
- Identity disclosure: Charley is the third record.
- Attribute disclosure: Alice has AIDS.

Disclosure Control for Tables vs. Microdata
- Microdata
- Precomputed statistics tables

Disclosure Control For Microdata

Disclosure Control for Tables

Outline
- Problem definition
- K-Anonymity Property
- Disclosure Control Methods
- Complexity and algorithms
- Attacks against k-Anonymity

History
- The term was introduced in 1998 by Samarati and Sweeney.
- Important papers:
  - Sweeney L. (2002), k-Anonymity: A Model for Protecting Privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, Vol. 10, No. 5, 557-570
  - Sweeney L. (2002), Achieving k-Anonymity Privacy Protection using Generalization and Suppression, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, Vol. 10, No. 5, 571-588
  - Samarati P. (2001), Protecting Respondents' Identities in Microdata Release, IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 6, 1010-1027

More recently
- Many new research papers in the last 4-5 years
- Theoretical results
- Many algorithms achieving k-anonymity
- Many improved principles and algorithms

Terminology (Sweeney)
- Data or microdata: person-specific information that is conceptually organized as a table of rows and columns.
- Tuple, row, or record: a row of the microdata.
- Attribute or field: a column of the microdata.
- Inference: belief in a new fact based on some other information.
- Disclosure: explicit and inferable information about a person.

Attribute Classification
- I_1, I_2, ..., I_m: identifier attributes. Ex: Name and SSN. Found in the initial microdata (IM) only. Information that leads to a specific entity.
- K_1, K_2, ..., K_p: key attributes (quasi-identifiers). Ex: Zip Code and Age. Found in IM and in the masked microdata (MM). May be known by an intruder.
- S_1, S_2, ..., S_q: confidential attributes. Ex: Principal Diagnosis and Annual Income. Found in IM and MM. Assumed to be unknown to an intruder.

Attribute Types
Identifier, Key (Quasi-Identifier), and Confidential Attributes
RecID  Name           SSN        Age  State  Diagnosis     Income  Billing
1      John Wayne     123456789  44   MI     AIDS          45,500  1,200
2      Mary Gore      323232323  44   MI     Asthma        37,900  2,500
3      John Banks     232345656  55   MI     AIDS          67,000  3,000
4      Jesse Casey    333333333  44   MI     Asthma        21,000  1,000
5      Jack Stone     444444444  55   MI     Asthma        90,000  900
6      Mike Kopi      666666666  45   MI     Diabetes      48,000  750
7      Angela Simms   777777777  25   IN     Diabetes      49,000  1,200
8      Nike Wood      888888888  35   MI     AIDS          66,000  2,200
9      Mikhail Aaron  999999999  55   MI     AIDS          69,000  4,200
10     Sam Pall       100000000  45   MI     Tuberculosis  34,000  3,100

Motivating Example
Non-sensitive data: Zip, Age, Nationality, Name. Sensitive data: Condition.
#  Zip    Age  Nationality  Name     Condition
1  13053  28   Brazilian    Ronaldo  Heart Disease
2  13067  29   US           Bob      Heart Disease
3  13053  37   Indian       Kumar    Cancer
4  13067  36   Japanese     Umeko    Cancer

Motivating Example (continued)
Published data: Alice publishes the data without the Name column.
Non-sensitive data: Zip, Age, Nationality. Sensitive data: Condition.
#  Zip    Age  Nationality  Condition
1  13053  28   Brazilian    Heart Disease
2  13067  29   US           Heart Disease
3  13053  37   Indian       Cancer
4  13067  36   Japanese     Cancer
Attacker's knowledge: voter registration list.
#  Name   Zip    Age  Nationality
1  John   13067  45   US
2  Paul   13067  22   US
3  Bob    13067  29   US
4  Chris  13067  23   US
Data leak: Bob is the only voter with Zip 13067, Age 29, and Nationality US, and these values match only published record 2, so the attacker learns that Bob has heart disease.
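
The leak above amounts to a join between the published table and the voter list on the shared columns. A minimal sketch of that join follows; the function name `link` and the rule "report a leak only when exactly one published record matches" are illustrative assumptions, not part of the slides.

```python
# Published table (Name removed) and the attacker's voter registration list.
published = [
    {"Zip": "13053", "Age": 28, "Nationality": "Brazilian", "Condition": "Heart Disease"},
    {"Zip": "13067", "Age": 29, "Nationality": "US", "Condition": "Heart Disease"},
    {"Zip": "13053", "Age": 37, "Nationality": "Indian", "Condition": "Cancer"},
    {"Zip": "13067", "Age": 36, "Nationality": "Japanese", "Condition": "Cancer"},
]
voters = [
    {"Name": "John", "Zip": "13067", "Age": 45, "Nationality": "US"},
    {"Name": "Paul", "Zip": "13067", "Age": 22, "Nationality": "US"},
    {"Name": "Bob", "Zip": "13067", "Age": 29, "Nationality": "US"},
    {"Name": "Chris", "Zip": "13067", "Age": 23, "Nationality": "US"},
]

QID = ("Zip", "Age", "Nationality")  # columns shared by both tables


def link(published_rows, voter_rows, qid=QID):
    """Return (name, condition) pairs for voters matching exactly one published row."""
    leaks = []
    for voter in voter_rows:
        key = tuple(voter[a] for a in qid)
        hits = [p for p in published_rows if tuple(p[a] for a in qid) == key]
        if len(hits) == 1:
            leaks.append((voter["Name"], hits[0]["Condition"]))
    return leaks


print(link(published, voters))
```

Only Bob links to a published record; the other three voters match nothing, so nothing is learned about them from this release.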

Data Re-identification
[Figure labels: Disease; Birthdate, Sex, Zip; Name]
87% of the population in the USA can be uniquely identified using Birthdate, Sex, and Zipcode.

Source of the Problem
Even if we do not publish the individuals:
- There are some fields that may uniquely identify some individual.
- The attacker can use them to join with other sources and identify the individuals.
[Table columns: #, Zip, Age, Nationality (non-sensitive data, labeled "Quasi Identifier"); Condition (sensitive data)]

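K-Anonymity Definition
A masked microdata (MM) satisfies the k-anonymity property with respect to a quasi-identifier set (QID) if every count in the frequency set of MM with respect to QID is greater than or equal to k. The frequency set holds, for each distinct combination of QID values in MM, the number of records with that combination.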

K-Anonymity Example
RecID  Age  Zip    Sex     Illness
1      50   41076  Male    AIDS
2      30   41076  Female  Asthma
3      30   41076  Female  AIDS
4      20   41076  Male    Asthma
5      20   41076  Male    Asthma
6      50   41076  Male    Diabetes
QID = { Age, Zip, Sex }
    SELECT COUNT(*) FROM Patient GROUP BY Sex, Zip, Age;
If the results include a group with a count less than k, the relation Patient does not have the k-anonymity property with respect to QID. Here every group has a count of 2, so Patient is 2-anonymous with respect to QID.
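
The same check can be written outside SQL. The sketch below mirrors the GROUP BY query with a `Counter`; the helper names `frequency_set` and `is_k_anonymous` are chosen here for illustration and do not come from the slides.

```python
from collections import Counter

patient = [
    {"RecID": 1, "Age": 50, "Zip": "41076", "Sex": "Male", "Illness": "AIDS"},
    {"RecID": 2, "Age": 30, "Zip": "41076", "Sex": "Female", "Illness": "Asthma"},
    {"RecID": 3, "Age": 30, "Zip": "41076", "Sex": "Female", "Illness": "AIDS"},
    {"RecID": 4, "Age": 20, "Zip": "41076", "Sex": "Male", "Illness": "Asthma"},
    {"RecID": 5, "Age": 20, "Zip": "41076", "Sex": "Male", "Illness": "Asthma"},
    {"RecID": 6, "Age": 50, "Zip": "41076", "Sex": "Male", "Illness": "Diabetes"},
]
QID = ("Age", "Zip", "Sex")


def frequency_set(rows, qid):
    """Count the records for each distinct combination of QID values."""
    return Counter(tuple(row[a] for a in qid) for row in rows)


def is_k_anonymous(rows, qid, k):
    """True if every count in the frequency set is at least k."""
    return all(count >= k for count in frequency_set(rows, qid).values())


print(frequency_set(patient, QID))
print(is_k_anonymous(patient, QID, 2), is_k_anonymous(patient, QID, 3))
```

The three QID groups each contain two records, so the check passes for k = 2 and fails for k = 3.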

Outline
- Problem definition
- K-Anonymity Property
- Disclosure Control Methods
- Attacks against k-Anonymity

Disclosure Control Techniques
- Remove identifiers
- Generalization
- Suppression
- Sampling
- Microaggregation
- Perturbation / randomization
- Rounding
- Data swapping
- Etc.

Disclosure Control Techniques
Different disclosure control techniques are applied to the following initial microdata:
RecID  Name           SSN        Age  State  Diagnosis     Income  Billing
1      John Wayne     123456789  44   MI     AIDS          45,500  1,200
2      Mary Gore      323232323  44   MI     Asthma        37,900  2,500
3      John Banks     232345656  55   MI     AIDS          67,000  3,000
4      Jesse Casey    333333333  44   MI     Asthma        21,000  1,000
5      Jack Stone     444444444  55   MI     Asthma        90,000  900
6      Mike Kopi      666666666  45   MI     Diabetes      48,000  750
7      Angela Simms   777777777  25   IN     Diabetes      49,000  1,200
8      Nike Wood      888888888  35   MI     AIDS          66,000  2,200
9      Mikhail Aaron  999999999  55   MI     AIDS          69,000  4,200
10     Sam Pall       100000000  45   MI     Tuberculosis  34,000  3,100

Remove Identifiers
Identifiers such as Name and SSN are removed.
RecID  Age  State  Diagnosis     Income  Billing
1      44   MI     AIDS          45,500  1,200
2      44   MI     Asthma        37,900  2,500
3      55   MI     AIDS          67,000  3,000
4      44   MI     Asthma        21,000  1,000
5      55   MI     Asthma        90,000  900
6      45   MI     Diabetes      48,000  750
7      25   IN     Diabetes      49,000  1,200
8      35   MI     AIDS          66,000  2,200
9      55   MI     AIDS          69,000  4,200
10     45   MI     Tuberculosis  34,000  3,100

Sampling
- Sampling is the disclosure control method in which only a subset of the records is released.
- If n is the number of records in the initial microdata and t is the number of released records, sf = t / n is called the sampling factor.
- Simple random sampling is frequently used: each individual is chosen entirely by chance, and every member of the population has an equal chance of being included in the sample.
RecID  Age  State  Diagnosis  Income  Billing
5      55   MI     Asthma     90,000  900
4      44   MI     Asthma     21,000  1,000
8      35   MI     AIDS       66,000  2,200
9      55   MI     AIDS       69,000  4,200
7      25   IN     Diabetes   49,000  1,200
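
A minimal sketch of simple random sampling with a given sampling factor follows. The function name `simple_random_sample`, the `seed` parameter, and the choice to round sf * n to the nearest integer are assumptions made for illustration.

```python
import random


def simple_random_sample(records, sf, seed=None):
    """Release t = round(sf * n) records, drawn uniformly without replacement."""
    if not 0 < sf <= 1:
        raise ValueError("sampling factor sf must be in (0, 1]")
    t = round(sf * len(records))  # assumption: round to the nearest integer
    rng = random.Random(seed)
    return rng.sample(records, t)


# RecIDs of the ten records in the initial microdata.
rec_ids = list(range(1, 11))
released = simple_random_sample(rec_ids, sf=0.5, seed=0)
print(released)
```

With n = 10 and sf = 0.5, five records are released, as in the table above; which five depends on the random draw.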

Microaggregation
- Order the records of the initial microdata by an attribute, create groups of consecutive values, and replace those values by the group average.
- Example: microaggregation for the attribute Income with minimum group size 3.
- The total sum of all Income values remains the same.
RecID  Age  State  Diagnosis     Income  Billing
2      44   MI     Asthma        30,967  2,500
4      44   MI     Asthma        30,967  1,000
10     45   MI     Tuberculosis  30,967  3,100
1      44   MI     AIDS          47,500  1,200
6      45   MI     Diabetes      47,500  750
7      25   IN     Diabetes      47,500  1,200
3      55   MI     AIDS          73,000  3,000
5      55   MI     Asthma        73,000  900
8      35   MI     AIDS          73,000  2,200
9      55   MI     AIDS          73,000  4,200
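
The Income column above can be reproduced with a short sketch. It assumes fixed-size groups over the sorted values, with leftover records joining the last group (which is what the ten-record example shows); the function name `microaggregate` is hypothetical.

```python
def microaggregate(values, min_size):
    """Replace each value by the mean of its group of consecutive sorted values.

    Assumption: groups have exactly min_size members, except the last group,
    which also absorbs any leftover records.
    """
    n = len(values)
    if n < min_size:
        raise ValueError("need at least min_size values")
    order = sorted(range(n), key=lambda i: values[i])  # indices sorted by value
    n_groups = n // min_size
    result = [0.0] * n
    for g in range(n_groups):
        start = g * min_size
        end = n if g == n_groups - 1 else start + min_size
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result


# Income of RecID 1..10 from the initial microdata.
income = [45500, 37900, 67000, 21000, 90000, 48000, 49000, 66000, 69000, 34000]
masked_income = microaggregate(income, min_size=3)
print([round(v) for v in masked_income])
```

Because each group is replaced by its own mean, the sum within every group, and hence the total, is unchanged up to floating-point rounding.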

Data Swapping
- In this disclosure control method, a sequence of so-called elementary swaps is applied to the microdata.
- An elementary swap consists of two actions:
  - a random selection of two records i and j from the microdata;
  - a swap (interchange) of the values of the attribute being swapped for records i and j.
- In the table below, the Income values of records 1 and 6 have been swapped.
RecID  Age  State  Diagnosis     Income  Billing
1      44   MI     AIDS          48,000  1,200
2      44   MI     Asthma        37,900  2,500
3      55   MI     AIDS          67,000  3,000
4      44   MI     Asthma        21,000  1,000
5      55   MI     Asthma        90,000  900
6      45   MI     Diabetes      45,500  750
7      25   IN     Diabetes      49,000  1,200
8      35   MI     AIDS          66,000  2,200
9      55   MI     AIDS          69,000  4,200
10     45   MI     Tuberculosis  34,000  3,100
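
The two actions of an elementary swap translate directly into code. This is a sketch only: the function names, the `n_swaps` and `seed` parameters, and the decision to work on a copy are illustrative assumptions.

```python
import random


def elementary_swap(records, attribute, rng):
    """Pick two distinct records at random and interchange one attribute value."""
    i, j = rng.sample(range(len(records)), 2)
    records[i][attribute], records[j][attribute] = (
        records[j][attribute],
        records[i][attribute],
    )
    return i, j


def swap_attribute(records, attribute, n_swaps, seed=None):
    """Apply a sequence of elementary swaps to a copy of the microdata."""
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]  # leave the initial microdata untouched
    for _ in range(n_swaps):
        elementary_swap(swapped, attribute, rng)
    return swapped


# RecID and Income of the first six records of the initial microdata.
records = [
    {"RecID": 1, "Income": 45500},
    {"RecID": 2, "Income": 37900},
    {"RecID": 3, "Income": 67000},
    {"RecID": 4, "Income": 21000},
    {"RecID": 5, "Income": 90000},
    {"RecID": 6, "Income": 48000},
]
swapped = swap_attribute(records, "Income", n_swaps=3, seed=1)
print(swapped)
```

Swapping only permutes the values of the attribute, so its overall distribution (for example the mean Income) is preserved, while the link between an Income value and the rest of its record is broken.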

Generalization and Suppression
- Generalization: replace the value with a less specific but semantically consistent value.
- Suppression: do not release a value at all.
#  Zip    Age   Nationality  Condition
1  41076  < 40  *            Heart Disease
2  48202  < 40  *            Heart Disease
3  41076  < 40  *            Cancer
4  48202  < 40  *            Cancer

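Domain and Value Generalization Hierarchies
Domain generalization hierarchies:
- Zip: Z0 = {41075, 41076, 41095, 41099}, Z1 = {4107*, 4109*}, Z2 = {410**}
- Sex: S0 = {Male, Female}, S1 = {Person}
Value generalization hierarchies:
- 410** generalizes 4107* and 4109*; 4107* generalizes 41075 and 41076; 4109* generalizes 41095 and 41099.
- Person generalizes Male and Female.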
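
For these two hierarchies, moving a value up by one level is simple string manipulation. The sketch below uses hypothetical helper names (`generalize_zip`, `generalize_sex`) and assumes that each Zip level masks one more trailing digit, which matches Z0, Z1, and Z2 above.

```python
def generalize_zip(zip_code, level):
    """Level 0 keeps the Zip; each further level masks one more trailing digit."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level out of range")
    return zip_code[: len(zip_code) - level] + "*" * level


def generalize_sex(sex, level):
    """Level 0 keeps the value; level 1 maps every value to 'Person'."""
    if level not in (0, 1):
        raise ValueError("level out of range")
    return sex if level == 0 else "Person"


Z0 = ["41075", "41076", "41095", "41099"]
Z1 = sorted({generalize_zip(z, 1) for z in Z0})
Z2 = sorted({generalize_zip(z, 2) for z in Z0})
print(Z1, Z2, generalize_sex("Female", 1))
```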

Generalization Lattice
- Zip: Z0 = {41075, 41076, 41095, 41099}, Z1 = {4107*, 4109*}, Z2 = {410**}
- Sex: S0 = {Male, Female}, S1 = {Person}
[Figures: generalization lattice; distance vector generalization lattice]
Distance vectors in the lattice: [0, 0], [1, 0], [0, 1], [1, 1], [0, 2], [1, 2]
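
The distance vectors of the lattice can be enumerated from the maximum level of each hierarchy. The sketch below assumes the vectors are ordered [Sex level, Zip level], which is consistent with the maximum levels 1 and 2 seen in [1, 2]; the function names are hypothetical.

```python
from itertools import product


def lattice_by_height(max_levels):
    """Group all distance vectors by height (the sum of their levels)."""
    by_height = {}
    for node in product(*(range(m + 1) for m in max_levels)):
        by_height.setdefault(sum(node), []).append(node)
    return by_height


def direct_generalizations(node, max_levels):
    """Vectors reached by raising exactly one attribute by one level."""
    return [
        node[:i] + (node[i] + 1,) + node[i + 1 :]
        for i in range(len(node))
        if node[i] < max_levels[i]
    ]


MAX_LEVELS = (1, 2)  # (Sex: S0..S1, Zip: Z0..Z2), ordering assumed
for height, nodes in sorted(lattice_by_height(MAX_LEVELS).items()):
    print(height, nodes)
print(direct_generalizations((0, 0), MAX_LEVELS))
```

The bottom node (0, 0) is the ungeneralized table and the top node (1, 2) generalizes both attributes fully; each edge of the lattice raises one attribute by one level.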

Generalization Tables

Outline
- Problem definition
- K-Anonymity Property
- Disclosure Control Methods
- Attacks and problems with k-Anonymity

Attacks against k-Anonymity
- Unsorted matching attack
- Complementary release attack
- Temporal attack
- Homogeneity attack for attribute disclosure

Unsorted Matching Attack
Solution: random shuffling of rows.

Complementary Release Attack

Temporal Attack
PT_t1:
black  9/7/65   male  02139  headache
black  11/4/65  male  02139  rash
GT_t1:
black  1965  male  02139  headache
black  1965  male  02139  rash

Homogeneity Attack
k-Anonymity can create groups that leak information because of a lack of diversity in the sensitive attribute.
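
A homogeneous group can be detected by collecting the sensitive values per QID group. The sketch below uses a small made-up 2-anonymous table (the rows are hypothetical, not taken from the slides) and a hypothetical function name `homogeneous_groups`.

```python
from collections import defaultdict


def homogeneous_groups(rows, qid, sensitive):
    """Return {QID values: sensitive value} for groups with a single sensitive value."""
    values_by_group = defaultdict(set)
    for row in rows:
        values_by_group[tuple(row[a] for a in qid)].add(row[sensitive])
    return {
        key: next(iter(values))
        for key, values in values_by_group.items()
        if len(values) == 1
    }


# Hypothetical 2-anonymous release: every (Zip, Age) group has two records.
release = [
    {"Zip": "130**", "Age": "< 30", "Condition": "Heart Disease"},
    {"Zip": "130**", "Age": "< 30", "Condition": "Heart Disease"},
    {"Zip": "130**", "Age": ">= 30", "Condition": "Cancer"},
    {"Zip": "130**", "Age": ">= 30", "Condition": "Flu"},
]
print(homogeneous_groups(release, ("Zip", "Age"), "Condition"))
```

Anyone known to fall in the first group is revealed to have heart disease even though no single record can be attributed to them, which is attribute disclosure without identity disclosure.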

Coming up
- Algorithms on k-anonymity
- Improved principles on k-anonymity
