Data Anonymization – Introduction and k-anonymity Li Xiong CS573 Data Privacy and Security.

Outline
- Problem definition
- K-Anonymity Property
- Disclosure Control Methods
- Complexity
- Attacks against k-Anonymity

Defining Privacy in DB Publishing
- Traditional information security: protecting information and information systems from unauthorized access and use.
- Privacy-preserving data publishing: protecting private data while publishing useful information.

Problem: Disclosure Control
Disclosure control is the discipline concerned with modifying data that contain confidential information about individual entities (persons, households, businesses, etc.) in order to prevent third parties working with these data from recognizing individuals in the data. Related terms: privacy-preserving data publishing, anonymization, de-identification.
Types of disclosure:
- Identity disclosure: identification of an entity (person, institution).
- Attribute disclosure: the intruder finds out something new about the target person.
- Disclosure: identity disclosure, attribute disclosure, or both.

Microdata and External Information
- Microdata: a series of records, each containing information on an individual unit such as a person, a firm, or an institution (in contrast to computed tables).
- Masked microdata: microdata from which names and other identifying information have been removed or modified.
- External information: any information known by a presumptive intruder about some individuals from the initial microdata.

Disclosure Risk and Information Loss
- Disclosure risk: the risk that a given form of disclosure will arise if a masked microdata set is released.
- Information loss: the quantity of information that exists in the initial microdata but, because of disclosure control methods, no longer appears in the masked microdata.

Disclosure Control Problem
[Figure: Individuals submit data; the Data Owner collects the data, applies a masking process, and releases masked data; the Researcher and the Intruder receive the masked data.]
- Researcher: uses masked data for statistical analysis.
- Intruder: uses masked data and external data to disclose confidential information.
- Confidentiality of individuals: assessed by disclosure risk / anonymity properties.
- Preserving data utility: assessed by information loss.

Types of Disclosure
Initial microdata (held by the data owner), with columns Name, SSN, Age, Zip, Diagnosis, Income:
Name | Diagnosis | Income
Alice | AIDS | 17,000
Bob | AIDS | 68,000
Charley | Asthma | 80,000
Dave | Asthma | 55,000
Eva | Diabetes | 23,000
Masked microdata (released): columns Age, Zip, Diagnosis, Income, with Name and SSN removed.
External information (known to the intruder): columns Name, SSN, Age, Zip, for Alice, Charley, and Dave.
Identity disclosure: Charley is the third record.
Attribute disclosure: Alice has AIDS.

Disclosure Control for Tables vs. Microdata
- Microdata
- Precomputed statistics tables

Disclosure Control For Microdata

Disclosure Control for Tables

Outline
- Problem definition
- K-Anonymity Property
- Disclosure Control Methods
- Complexity and algorithms
- Attacks against k-Anonymity

History
The term was introduced in 1998 by Samarati and Sweeney.
Important papers:
- Sweeney L. (2002), K-Anonymity: A Model for Protecting Privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, Vol. 10, No. 5.
- Sweeney L. (2002), Achieving K-Anonymity Privacy Protection using Generalization and Suppression, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, Vol. 10, No. 5.
- Samarati P. (2001), Protecting Respondents' Identities in Microdata Release, IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 6.

More recently
- Many new research papers in the last 4-5 years
- Theoretical results
- Many algorithms achieving k-anonymity
- Many improved principles and algorithms

Terminology (Sweeney)
- Data or Microdata: person-specific information that is conceptually organized as a table of rows and columns.
- Tuple, Row, or Record: a row from the microdata.
- Attribute or Field: a column from the microdata.
- Inference: belief in a new fact based on some other information.
- Disclosure: explicit and inferable information about a person.

Attribute Classification
- I1, I2, ..., Im: identifier attributes. Example: Name and SSN. Found in the initial microdata (IM) only. Information that leads to a specific entity.
- K1, K2, ..., Kp: key attributes (quasi-identifiers). Example: Zip Code and Age. Found in both IM and the masked microdata (MM). May be known by an intruder.
- S1, S2, ..., Sq: confidential attributes. Example: Principal Diagnosis and Annual Income. Found in both IM and MM. Assumed to be unknown to an intruder.

Attribute Types Identifier, Key (Quasi-Identifiers) and Confidential Attributes RecIDNameSSNAgeStateDiagnosisIncomeBilling 1John Wayne MIAIDS45,5001,200 2Mary Gore MIAsthma37,9002,500 3John Banks MIAIDS67,0003,000 4Jesse Casey MIAsthma21,0001,000 5Jack Stone MIAsthma90, Mike Kopi MIDiabetes48, Angela Simms INDiabetes49,0001,200 8Nike Wood MIAIDS66,0002,200 9Mikhail Aaron MIAIDS69,0004,200 10Sam Pall MITuberculosis34,0003,100

Motivating Example
Original data, with non-sensitive attributes (#, Zip, Age, Nationality) and sensitive attributes (Name, Condition):
Nationality | Name | Condition
Brazilian | Ronaldo | Heart Disease
US | Bob | Heart Disease
Indian | Kumar | Cancer
Japanese | Umeko | Cancer

Motivating Example (continued)
Published data: Alice publishes the data without the Name column, leaving the non-sensitive attributes (#, Zip, Age, Nationality) and the sensitive attribute Condition.
Nationality | Condition
Brazilian | Heart Disease
US | Heart Disease
Indian | Cancer
Japanese | Cancer
Attacker's knowledge: a voter registration list with columns #, Name, Zip, Age, Nationality.
# | Name | Nationality
1 | John | US
2 | Paul | US
3 | Bob | US
4 | Chris | US
Data leak!

Data Re-identification
[Figure: linking records through the attributes Disease, Birthdate, Sex, Zip, and Name.]
87% of the population in the USA can be uniquely identified using Birthdate, Sex, and Zipcode.

Source of the Problem
Even if we do not publish the individuals' identities:
- Some fields may still uniquely identify an individual.
- The attacker can use them to join with other sources and identify the individuals.
These fields (here Zip, Age, and Nationality, among the non-sensitive data) form the quasi-identifier; Condition is the sensitive data.
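The join described above can be sketched in a few lines of Python. This is only an illustration: the Zip and Age values are made-up assumptions (the transcript does not preserve them), and the helper name link is hypothetical.

```python
# Hypothetical data: Zip and Age values are illustrative assumptions.
published = [
    {"zip": "13053", "age": 28, "nationality": "Brazilian", "condition": "Heart Disease"},
    {"zip": "13067", "age": 29, "nationality": "US", "condition": "Heart Disease"},
    {"zip": "13053", "age": 37, "nationality": "Indian", "condition": "Cancer"},
    {"zip": "13067", "age": 36, "nationality": "Japanese", "condition": "Cancer"},
]
voters = [
    {"name": "John", "zip": "13053", "age": 25, "nationality": "US"},
    {"name": "Paul", "zip": "13068", "age": 31, "nationality": "US"},
    {"name": "Bob", "zip": "13067", "age": 29, "nationality": "US"},
    {"name": "Chris", "zip": "13053", "age": 23, "nationality": "US"},
]
QID = ("zip", "age", "nationality")


def link(published_rows, external_rows, qid):
    """Join on the quasi-identifier; report people who match exactly one published row."""
    index = {}
    for row in published_rows:
        index.setdefault(tuple(row[a] for a in qid), []).append(row)
    leaks = []
    for person in external_rows:
        matches = index.get(tuple(person[a] for a in qid), [])
        if len(matches) == 1:  # a unique match re-identifies the record
            leaks.append((person["name"], matches[0]["condition"]))
    return leaks


leaks = link(published, voters, QID)
print(leaks)
```

With these assumed values, only Bob's quasi-identifier combination occurs in the published table, so his condition is disclosed.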

K-Anonymity Definition
The k-anonymity property for a masked microdata (MM) is satisfied with respect to a quasi-identifier set (QID) if every count in the frequency set of MM with respect to QID is greater than or equal to k.

K-Anonymity Example
Relation Patient, with columns RecID, Age, Zip, Sex, Illness:
Sex | Illness
Male | AIDS
Female | Asthma
Female | AIDS
Male | Asthma
Male | Asthma
Male | Diabetes
QID = { Age, Zip, Sex }
SELECT COUNT(*) FROM Patient GROUP BY Sex, Zip, Age;
If the results include groups with a count less than k, the relation Patient does not have the k-anonymity property with respect to QID.
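The same frequency-set check can be sketched in Python. The Age and Zip values below are assumptions added for illustration (they are not preserved in the transcript); the function names are hypothetical.

```python
from collections import Counter


def frequency_set(rows, qid):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[a] for a in qid) for row in rows)


def is_k_anonymous(rows, qid, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return all(count >= k for count in frequency_set(rows, qid).values())


# Sex and Illness follow the example; Age and Zip are assumed values.
patients = [
    {"age": 50, "zip": "41076", "sex": "Male", "illness": "AIDS"},
    {"age": 30, "zip": "41076", "sex": "Female", "illness": "Asthma"},
    {"age": 30, "zip": "41076", "sex": "Female", "illness": "AIDS"},
    {"age": 20, "zip": "41076", "sex": "Male", "illness": "Asthma"},
    {"age": 20, "zip": "41076", "sex": "Male", "illness": "Asthma"},
    {"age": 50, "zip": "41076", "sex": "Male", "illness": "Diabetes"},
]
QID = ("age", "zip", "sex")

print(is_k_anonymous(patients, QID, 2))
print(is_k_anonymous(patients, QID, 3))
```

Under these assumed values each quasi-identifier combination appears twice, so the table is 2-anonymous but not 3-anonymous.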

Outline
- Problem definition
- K-Anonymity Property
- Disclosure Control Methods
- Attacks against k-Anonymity

Disclosure Control Techniques
- Remove identifiers
- Generalization
- Suppression
- Sampling
- Microaggregation
- Perturbation / randomization
- Rounding
- Data swapping
- Etc.

Disclosure Control Techniques Different disclosure control techniques are applied to the following initial microdata: RecIDNameSSNAgeStateDiagnosisIncomeBilling 1John Wayne MIAIDS45,5001,200 2Mary Gore MIAsthma37,9002,500 3John Banks MIAIDS67,0003,000 4Jesse Casey MIAsthma21,0001,000 5Jack Stone MIAsthma90, Mike Kopi MIDiabetes48, Angela Simms INDiabetes49,0001,200 8Nike Wood MIAIDS66,0002,200 9Mikhail Aaron MIAIDS69,0004,200 10Sam Pall MITuberculosis34,0003,100

Remove Identifiers Identifiers such as Names, SSN etc. are removed RecIDAgeStateDiagnosisIncomeBilling 144MIAIDS45,5001, MIAsthma37,9002, MIAIDS67,0003, MIAsthma21,0001, MIAsthma90, MIDiabetes48, INDiabetes49,0001, MIAIDS66,0002, MIAIDS69,0004, MITuberculosis34,0003,100

Sampling
Sampling is the disclosure control method in which only a subset of the records is released. If n is the number of records in the initial microdata and t is the number of released records, we call sf = t / n the sampling factor.
Simple random sampling is more frequently used. In this technique, each individual is chosen entirely by chance, and each member of the population has an equal chance of being included in the sample.
RecIDAgeStateDiagnosisIncomeBilling 555MIAsthma90, MIAsthma21,0001, MIAIDS66,0002, MIAIDS69,0004, INDiabetes49,0001,200
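A minimal sketch of simple random sampling with a given sampling factor, assuming the records fit in a list; the function name and the seed parameter are illustrative choices, not part of the original material.

```python
import random


def simple_random_sample(records, sf, seed=None):
    """Release t = round(sf * n) records chosen uniformly without replacement."""
    if not 0 < sf <= 1:
        raise ValueError("sampling factor must be in (0, 1]")
    t = round(sf * len(records))
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return rng.sample(records, t)


rec_ids = list(range(1, 11))  # ten records, as in the running example
sample = simple_random_sample(rec_ids, sf=0.5, seed=0)
print(sample)
```

With n = 10 and sf = 0.5, five records are released, which matches the size of the sample in the example.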

Microaggregation
Order the records from the initial microdata by an attribute, create groups of consecutive values, and replace those values by the group average.
Example: microaggregation for attribute Income with minimum group size 3. The total sum of all Income values remains the same.
RecIDAgeStateDiagnosisIncomeBilling 244MIAsthma30,9672, MIAsthma30,9671, MITuberculosis30,9673, MIAIDS47,5001, MIDiabetes47, INDiabetes47,5001, MIAIDS73,0003, MIAsthma73, MIAIDS73,0002, MIAIDS73,0004,200
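The procedure can be sketched as follows. This is one simple variant (fixed-size groups, with any short remainder merged into the last group), not necessarily the exact method behind the example. Two income values are truncated in the transcript; the sketch assumes 48,000 and 90,000 for them, which is consistent with the group averages shown.

```python
def microaggregate(values, min_size):
    """Replace each value by the mean of its group of consecutive sorted values."""
    if min_size < 1 or len(values) < min_size:
        raise ValueError("need at least min_size values")
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + min_size] for i in range(0, len(order), min_size)]
    if len(groups) > 1 and len(groups[-1]) < min_size:
        tail = groups.pop()      # remainder smaller than min_size
        groups[-1].extend(tail)  # merge it into the previous group
    masked = [0.0] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
    return masked


# Incomes for RecID 1..10; the values for RecID 5 and 6 are assumptions.
incomes = [45500, 37900, 67000, 21000, 90000, 48000, 49000, 66000, 69000, 34000]
masked = microaggregate(incomes, min_size=3)
print([round(v) for v in masked])
```

Because every group is replaced by its own mean, the sum over each group, and therefore the total, is unchanged.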

Data Swapping
In this disclosure control method, a sequence of so-called elementary swaps is applied to the microdata. An elementary swap consists of two actions:
- A random selection of two records i and j from the microdata.
- A swap (interchange) of the values of the attribute being swapped for records i and j.
RecIDAgeStateDiagnosisIncomeBilling 144MIAIDS48,0001, MIAsthma37,9002, MIAIDS67,0003, MIAsthma21,0001, MIAsthma90, MIDiabetes45, INDiabetes49,0001, MIAIDS66,0002, MIAIDS69,0004, MITuberculosis34,0003,100
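A minimal sketch of elementary swaps on one attribute, assuming records are stored as dictionaries; the function name, the number of swaps, and the sample records are illustrative assumptions.

```python
import random


def swap_attribute(records, attribute, n_swaps, seed=None):
    """Apply n_swaps elementary swaps of one attribute between random record pairs."""
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]  # work on copies, keep the input intact
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(swapped)), 2)  # two distinct records
        swapped[i][attribute], swapped[j][attribute] = (
            swapped[j][attribute],
            swapped[i][attribute],
        )
    return swapped


# Hypothetical records for illustration.
records = [
    {"rec_id": 1, "diagnosis": "AIDS", "income": 45500},
    {"rec_id": 2, "diagnosis": "Asthma", "income": 37900},
    {"rec_id": 3, "diagnosis": "AIDS", "income": 67000},
    {"rec_id": 4, "diagnosis": "Diabetes", "income": 48000},
]
swapped = swap_attribute(records, "income", n_swaps=3, seed=1)
print(swapped)
```

Swapping only permutes the values of the chosen attribute, so its overall distribution is preserved while the link between that value and the rest of each record is weakened.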

Generalization and Suppression
- Generalization: replace the value with a less specific but semantically consistent value.
- Suppression: do not release a value at all.
Example (columns #, Zip, Age, Nationality, Condition):
Age | Nationality | Condition
< 40 | * | Heart Disease
< 40 | * | Heart Disease
< 40 | * | Cancer
< 40 | * | Cancer
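Both operations can be sketched as small value-level functions. The function names, the age threshold parameter, and the use of "*" as the suppression marker are assumptions made for illustration.

```python
def generalize_zip(zip_code, level):
    """Mask the last `level` characters of a zip code with '*'."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level out of range")
    return zip_code[: len(zip_code) - level] + "*" * level


def generalize_age(age, threshold=40):
    """Replace an exact age by a coarse range around an assumed threshold."""
    return f"< {threshold}" if age < threshold else f">= {threshold}"


def suppress(_value):
    """Withhold the value entirely."""
    return "*"


# Hypothetical record for illustration.
row = {"zip": "41075", "age": 28, "nationality": "Brazilian", "condition": "Heart Disease"}
masked_row = {
    "zip": generalize_zip(row["zip"], 2),
    "age": generalize_age(row["age"]),
    "nationality": suppress(row["nationality"]),
    "condition": row["condition"],  # sensitive value is released unchanged
}
print(masked_row)
```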

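Domain and Value Generalization Hierarchies
Zip domains: Z0 = {41075, 41076, 41095, 41099}, Z1 = {4107*, 4109*}, Z2 = {410**}
Sex domains: S0 = {Male, Female}, S1 = {Person}
Value hierarchies: 41075 and 41076 generalize to 4107*, 41095 and 41099 generalize to 4109*, and both of these generalize to 410**; Male and Female generalize to Person.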

Generalization Lattice
Z0 = {41075, 41076, 41095, 41099}, Z1 = {4107*, 4109*}, Z2 = {410**}
S0 = {Male, Female}, S1 = {Person}
Distance vector generalization lattice: [0, 0], [1, 0], [0, 1], [1, 1], [0, 2], [1, 2]
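The distance vectors can be enumerated mechanically from the height of each domain hierarchy. The sketch below assumes the vector order [Sex level, Zip level], which matches the vectors listed above; the function name is hypothetical.

```python
from itertools import product


def lattice_by_height(max_levels):
    """Group all distance vectors by height (the sum of generalization levels)."""
    by_height = {}
    for vector in product(*(range(m + 1) for m in max_levels)):
        by_height.setdefault(sum(vector), []).append(list(vector))
    return by_height


# Sex has levels S0..S1 (max 1); Zip has levels Z0..Z2 (max 2).
levels = lattice_by_height((1, 2))
for height in sorted(levels):
    print(height, levels[height])
```

The bottom node [0, 0] is the original table and the top node [1, 2] is the most general one; an algorithm searching for a k-anonymous generalization would look for a low node in this lattice that satisfies the property.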

Generalization Tables

Outline
- Problem definition
- K-Anonymity Property
- Disclosure Control Methods
- Attacks and problems with k-Anonymity

Attacks against k-Anonymity
- Unsorted Matching Attack
- Complementary Release Attack
- Temporal Attack
- Homogeneity Attack for Attribute Disclosure

Unsorted Matching Attack Solution - Random shuffling of rows
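A minimal sketch of the countermeasure, assuming rows are held in a list before release. The underlying concern is that rows released in the same order across tables can be matched by position; the function name and sample rows are illustrative.

```python
import random


def shuffle_rows(rows, seed=None):
    """Return a randomly reordered copy of the rows before release."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical generalized rows for illustration.
rows = [("4107*", "Male"), ("4107*", "Female"), ("4109*", "Male"), ("4109*", "Female")]
released = shuffle_rows(rows, seed=42)
print(released)
```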

Complementary Release Attack

Temporal Attack black 9/7/65 male headache black 11/4/65 male rash black 1965 male headache black 1965 male rash GT t1 PT t1

Homogeneity Attack
k-Anonymity can create groups that leak information due to a lack of diversity in the sensitive attribute.
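One way to see the problem is to look for quasi-identifier groups whose records all share a single sensitive value. The sketch below uses made-up rows and a hypothetical function name.

```python
from collections import defaultdict


def homogeneous_groups(rows, qid, sensitive):
    """Return QID groups in which every record has the same sensitive value."""
    values = defaultdict(set)
    for row in rows:
        values[tuple(row[a] for a in qid)].add(row[sensitive])
    return {key: next(iter(v)) for key, v in values.items() if len(v) == 1}


# Hypothetical 2-anonymous table for illustration.
table = [
    {"zip": "4107*", "age": "< 40", "condition": "Cancer"},
    {"zip": "4107*", "age": "< 40", "condition": "Cancer"},
    {"zip": "4109*", "age": "< 40", "condition": "Heart Disease"},
    {"zip": "4109*", "age": "< 40", "condition": "Cancer"},
]
leaky = homogeneous_groups(table, ("zip", "age"), "condition")
print(leaky)
```

Here the table is 2-anonymous on (zip, age), yet anyone known to be in the first group is revealed to have Cancer without being individually identified.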

Coming up Algorithms on k-anonymity Improved principles on k-anonymity