CS573 Data Privacy and Security Anonymization methods Li Xiong.

Presentation transcript:

CS573 Data Privacy and Security Anonymization methods Li Xiong

Today Permutation based anonymization methods (cont.) Other privacy principles for microdata publishing Statistical databases

Anonymization methods Non-perturbative: don't distort the data – Generalization – Suppression Perturbative: distort the data – Microaggregation/clustering – Additive noise Anatomization and permutation – De-associate relationship between QID and sensitive attribute

Concept of the Anatomy Algorithm Release two tables, a quasi-identifier table (QIT) and a sensitive table (ST) Keep the same QI groups (which satisfy l-diversity); in the QIT, replace the sensitive attribute values with a Group-ID column Then produce the sensitive table with per-group Disease statistics

Specifications of Anatomy cont. DEFINITION 3 (Anatomy). Given an l-diverse partition, anatomy creates a QIT and an ST QIT is constructed with schema (A_1^qi, A_2^qi, ..., A_d^qi, Group-ID) ST is constructed with schema (Group-ID, A_s, Count)
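To make the two-table layout concrete, here is a minimal Python sketch of anatomization (an illustrative helper, not the paper's code); it assumes the input has already been partitioned into l-diverse QI groups, and the attribute values are toy data.

from collections import Counter

def anatomize(groups):
    """groups: list of QI groups, each a list of (qi_values, sensitive_value) tuples."""
    qit, st = [], []
    for gid, group in enumerate(groups, start=1):
        for qi_values, _ in group:
            qit.append(qi_values + (gid,))          # (A_1^qi, ..., A_d^qi, Group-ID)
        for s_value, count in Counter(s for _, s in group).items():
            st.append((gid, s_value, count))        # (Group-ID, A_s, Count)
    return qit, st

# Toy example: two 2-diverse groups.
groups = [
    [(("476**", "2*"), "dyspepsia"), (("476**", "2*"), "flu")],
    [(("4790*", ">=40"), "flu"), (("4790*", ">=40"), "cancer")],
]
qit, st = anatomize(groups)
print(qit)   # quasi-identifier table with Group-ID, exact QI values kept
print(st)    # sensitive table with per-group counts of each sensitive value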

Privacy properties THEOREM 1. Given a pair of QIT and ST, an adversary can infer the sensitive value of any individual with probability at most 1/l

Comparison with generalization Compare with generalization under two assumptions: A1: the adversary has the QI-values of the target individual A2: the adversary also knows that the individual is definitely in the microdata If A1 and A2 are both true, anatomy is as good as generalization: the 1/l bound still holds If A1 is true and A2 is false, generalization is stronger If A1 and A2 are both false, generalization is still stronger

Preserving Data Correlation Examine the correlation between Age and Disease in T using a probability density function (pdf) for each tuple Example: tuple t1

Preserving Data Correlation cont. To reconstruct an approximate pdf of t1 from the generalization table:

Preserving Data Correlation cont. To reconstruct an approximate pdf of t1 from the QIT and ST tables:

Preserving Data Correlation cont. For a more rigorous comparison, calculate the L2 distance between the reconstructed and the exact pdf: in the example, the distance for anatomy is 0.5 while the distance for generalization is 22.5

Preserving Data Correlation cont. Idea: measure the reconstruction error of each tuple (the distance between its reconstructed and exact pdf) Objective: minimize the total re-construction error (RCE) over all tuples t in T Algorithm: the Nearly-Optimal Anatomizing Algorithm

Experiments Dataset CENSUS containing the personal information of 500k American adults with 9 discrete attributes Two sets of microdata tables were created Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute A_s Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute A_s

Experiments cont.

Today Permutation based anonymization methods (cont.) Other privacy principles for microdata publishing Statistical databases Differential privacy

Attacks on k-Anonymity
A 3-anonymous patient table:
Zipcode | Age | Disease
476** | 2* | Heart Disease
476** | 2* | Heart Disease
476** | 2* | Heart Disease
4790* | ≥40 | Flu
4790* | ≥40 | Heart Disease
4790* | ≥40 | Cancer
476** | 3* | Heart Disease
476** | 3* | Cancer
476** | 3* | Cancer
Bob (Zipcode, Age known): homogeneity attack. Carl (Zipcode, Age known): background knowledge attack.
k-Anonymity does not provide privacy if – Sensitive values in an equivalence class lack diversity – The attacker has background knowledge

l-Diversity [Machanavajjhala et al. ICDE '06]
Sensitive attributes must be "diverse" within each quasi-identifier equivalence class:
Caucas | 787XX | Flu
Caucas | 787XX | Shingles
Caucas | 787XX | Acne
Caucas | 787XX | Flu
Caucas | 787XX | Acne
Caucas | 787XX | Flu
Asian/AfrAm | 78XXX | Flu
Asian/AfrAm | 78XXX | Flu
Asian/AfrAm | 78XXX | Acne
Asian/AfrAm | 78XXX | Shingles
Asian/AfrAm | 78XXX | Acne
Asian/AfrAm | 78XXX | Flu

Distinct l-Diversity Each equivalence class has at least l well-represented sensitive values Doesn't prevent probabilistic inference attacks – e.g., in an equivalence class of 10 records where 8 records have HIV and 2 records have other values, an adversary infers HIV with 80% confidence

Other Versions of l-Diversity Probabilistic l-diversity – The frequency of the most frequent value in an equivalence class is bounded by 1/l Entropy l-diversity – The entropy of the distribution of sensitive values in each equivalence class is at least log(l) Recursive (c,l)-diversity – r_1 < c(r_l + r_{l+1} + … + r_m), where r_i is the frequency of the i-th most frequent value – Intuition: the most frequent value does not appear too frequently
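A small Python sketch of how the three variants above could be checked for a single equivalence class (illustrative helper functions, not from the slides or the paper); the example class of 10 records with 8 HIV values mirrors the probabilistic-inference example above.

import math
from collections import Counter

def distinct_l_diverse(values, l):
    # at least l distinct sensitive values in the class
    return len(set(values)) >= l

def entropy_l_diverse(values, l):
    # entropy of the value distribution is at least log(l)
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
    return entropy >= math.log(l)

def recursive_cl_diverse(values, c, l):
    # r_1 < c * (r_l + r_{l+1} + ... + r_m), r_i = i-th largest frequency
    freqs = sorted(Counter(values).values(), reverse=True)
    if len(freqs) < l:
        return False
    return freqs[0] < c * sum(freqs[l - 1:])

eq_class = ["HIV"] * 8 + ["flu", "cancer"]
print(distinct_l_diverse(eq_class, 2))          # True, yet P(HIV) = 0.8
print(entropy_l_diverse(eq_class, 2))           # False: entropy ~0.64 < log 2 ~0.69
print(recursive_cl_diverse(eq_class, 3, 2))     # False: 8 >= 3 * (1 + 1)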

Neither Necessary, Nor Sufficient Original dataset: the sensitive attribute is Disease; 99% of the records have Cancer, the rest have Flu

Neither Necessary, Nor Sufficient (cont.) Original dataset: 99% have cancer. Anonymization A – Q1: Flu, Flu, Cancer, Flu, Cancer, Cancer; Q2: Cancer, Cancer, Cancer, Cancer, Cancer, Cancer. 50% cancer → the quasi-identifier group is "diverse"

Neither Necessary, Nor Sufficient (cont.) Anonymization A – Q1: Flu, Flu, Cancer, Flu, Cancer, Cancer; Q2: Cancer ×6. 50% cancer → the quasi-identifier group is "diverse", yet this leaks a ton of information. Anonymization B – Q1: Flu, Cancer, Cancer, Cancer, Cancer, Cancer; Q2: Cancer, Cancer, Cancer, Cancer, Flu, Flu. 99% cancer → the quasi-identifier group is not "diverse"

Limitations of l-Diversity Example: sensitive attribute is HIV+ (1%) or HIV- (99%) – Very different degrees of sensitivity! l-diversity is unnecessary – 2-diversity is unnecessary for an equivalence class that contains only HIV- records l-diversity is difficult to achieve – Suppose there are 10000 records in total – To have distinct 2-diversity, there can be at most 10000*1% = 100 equivalence classes

Skewness Attack Example: sensitive attribute is HIV+ (1%) or HIV- (99%) Consider an equivalence class that contains an equal number of HIV+ and HIV- records – Diverse, but potentially violates privacy! l-diversity does not differentiate: – Equivalence class 1: 49 HIV+ and 1 HIV- – Equivalence class 2: 1 HIV+ and 49 HIV- l-diversity does not consider the overall distribution of sensitive values!

Sensitive Attribute Disclosure (similarity attack)
A 3-diverse patient table:
Zipcode | Age | Salary | Disease
476** | 2* | 20K | Gastric Ulcer
476** | 2* | 30K | Gastritis
476** | 2* | 40K | Stomach Cancer
4790* | ≥40 | 50K | Gastritis
4790* | ≥40 | 100K | Flu
4790* | ≥40 | 70K | Bronchitis
476** | 3* | 60K | Bronchitis
476** | 3* | 80K | Pneumonia
476** | 3* | 90K | Stomach Cancer
Bob (Zip, Age known) falls in the first group. Conclusion: 1. Bob's salary is in [20k,40k], which is relatively low 2. Bob has some stomach-related disease. l-diversity does not consider semantics of sensitive values!

t-Closeness: A New Privacy Measure Rationale: the adversary's belief evolves as knowledge accumulates – B0 with external knowledge only, B1 after learning the overall distribution Q of sensitive values, B2 after learning the distribution P_i of sensitive values in each equi-class Observations: Q is public or can be derived; the potential knowledge gain about specific individuals comes from the gap between Q and P_i Principle: the distance between Q and P_i should be bounded by a threshold t

t-Closeness [Li et al. ICDE '07] The distribution of sensitive attributes within each quasi-identifier group should be "close" to their distribution in the entire original database (same example table as on the l-diversity slide above)

Distance Measures P = (p_1, p_2, …, p_m), Q = (q_1, q_2, …, q_m)
– Trace (variational) distance: D[P,Q] = (1/2) Σ_i |p_i - q_i|
– KL-divergence: D[P,Q] = Σ_i p_i log(p_i / q_i)
– Neither measure reflects the semantic distance among values. Example: Q = {3K,4K,5K,6K,7K,8K,9K,10K,11K}, P_1 = {3K,4K,5K}, P_2 = {5K,7K,10K}. Intuitively D[P_1,Q] > D[P_2,Q], yet both measures assign the two the same distance.
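For reference, a short sketch of the two measures using their standard textbook definitions (the slide's own formula images are not in the transcript); it also reproduces the salary example, treating Q as uniform over the nine values.

import math

def variational_distance(P, Q):
    # D[P,Q] = 1/2 * sum_i |p_i - q_i|
    return 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

def kl_divergence(P, Q):
    # D[P||Q] = sum_i p_i * log(p_i / q_i), terms with p_i = 0 contribute 0
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

Q  = [1/9] * 9                            # {3K, ..., 11K}, uniform
P1 = [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0]    # {3K, 4K, 5K}
P2 = [0, 0, 1/3, 0, 1/3, 0, 0, 1/3, 0]    # {5K, 7K, 10K}
print(variational_distance(P1, Q), variational_distance(P2, Q))  # both ~0.667
print(kl_divergence(P1, Q), kl_divergence(P2, Q))                # both log 3 ~1.10
# Neither measure can tell P1 from P2, even though P1 is intuitively "farther" from Q.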

Earth Mover's Distance If the distributions are interpreted as two different ways of piling up a certain amount of dirt over region D, EMD is the minimum cost of turning one pile into the other – the cost is the amount of dirt moved times the distance by which it is moved – Assume the two piles have the same amount of dirt Extensions for comparison of distributions with different total masses: – allow a partial match, discarding leftover "dirt" without cost – allow mass to be created or destroyed, but with a cost penalty

Earth Mover's Distance: formulation – P = (p_1, p_2, …, p_m), Q = (q_1, q_2, …, q_m) – d_ij: the ground distance between element i of P and element j of Q – Find a flow F = [f_ij], where f_ij is the flow of mass from element i of P to element j of Q, that minimizes the overall work WORK(P,Q,F) = Σ_i Σ_j d_ij f_ij subject to the constraints: f_ij ≥ 0 (1 ≤ i, j ≤ m); p_i - Σ_j f_ij + Σ_j f_ji = q_i (1 ≤ i ≤ m); Σ_i Σ_j f_ij = Σ_i p_i = Σ_i q_i = 1

How to calculate EMD (cont'd) EMD for categorical attributes – use the hierarchical distance, defined over the attribute's generalization hierarchy – hierarchical distance is a metric

Earth Mover's Distance Example – P1 = {3k,4k,5k}, Q = {3k,4k,5k,6k,7k,8k,9k,10k,11k} – Move 1/9 probability for each of the following pairs: 3k→6k, 3k→7k, cost 1/9*(3+4)/8; 4k→8k, 4k→9k, cost 1/9*(4+5)/8; 5k→10k, 5k→11k, cost 1/9*(5+6)/8 – Total cost: 1/9*27/8 = 0.375 – With P2 = {6k,8k,11k}, the total cost is 1/9*12/8 ≈ 0.167 < 0.375. This makes more sense than the other two distance measures.
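The same numbers can be checked with the ordered-distance form of EMD for a numerical attribute (this closed form is assumed from the t-closeness paper: for m ordered values, EMD is the normalized sum of absolute cumulative differences).

def ordered_emd(P, Q):
    # D[P,Q] = (1/(m-1)) * sum_i | sum_{j<=i} (p_j - q_j) |
    m = len(P)
    cum, total = 0.0, 0.0
    for p, q in zip(P, Q):
        cum += p - q
        total += abs(cum)
    return total / (m - 1)

Q  = [1/9] * 9                               # salaries {3K, ..., 11K}
P1 = [1/3, 1/3, 1/3, 0, 0, 0, 0, 0, 0]       # {3K, 4K, 5K}
P2 = [0, 0, 0, 1/3, 0, 1/3, 0, 0, 1/3]       # {6K, 8K, 11K}
print(ordered_emd(P1, Q))   # 0.375, matches the worked example
print(ordered_emd(P2, Q))   # ~0.167, P2 is "closer" to Q, as intuition suggests
# A QI group satisfies t-closeness when ordered_emd(P_group, Q_table) <= t.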

Experiments Goal – To show l-diversity does not provide sufficient privacy protection (the similarity attack) – To show the efficiency and data quality of using t-closeness are comparable with other privacy measures Setup – Adult dataset from the UC Irvine ML repository – tuples, 9 attributes (2 sensitive attributes) – Algorithm: Incognito

Experiments Comparisons of privacy measurements – k-Anonymity – Entropy l-diversity – Recursive (c,l)-diversity – k-Anonymity with t-closeness

Experiments Efficiency – The efficiency of using t-closeness is comparable with other privacy measurements

Experiments Data utility – Discernibility metric; Minimum average group size – The data quality of using t-closeness is comparable with other privacy measurements

Anonymous, "t-Close" Dataset
Caucas | 787XX | HIV+ | Flu
Asian/AfrAm | 787XX | HIV- | Flu
Asian/AfrAm | 787XX | HIV+ | Shingles
Caucas | 787XX | HIV- | Acne
Caucas | 787XX | HIV- | Shingles
Caucas | 787XX | HIV- | Acne
This is k-anonymous, l-diverse and t-close… so secure, right?

What Does the Attacker Know? (same dataset as above) "Bob is Caucasian and I heard he was admitted to hospital with flu…"

What Does the Attacker Know? (cont.) "Bob is Caucasian and I heard he was admitted to hospital… And I know three other Caucasians admitted to hospital with Acne or Shingles…" Those three account for the other Caucasian rows, so Bob's record must be the remaining one: HIV+ with Flu.

k-Anonymity and Partition-based notions Syntactic – Focuses on data transformation, not on what can be learned from the anonymized dataset – A "k-anonymous" dataset can leak sensitive information "Quasi-identifier" fallacy – Assumes a priori that the attacker will not know certain information about his target

Today Permutation based anonymization methods (cont.) Other privacy principles for microdata publishing Statistical databases – Definitions and early methods – Output perturbation and differential privacy

Statistical Data Release Originated from the study of statistical databases A statistical database is a database which provides statistics on subsets of records (OLAP vs. OLTP) Statistics may be computed as SUM, MEAN, MEDIAN, COUNT, MAX and MIN of records

Types of Statistical Databases – Static: made once and never changes (example: U.S. Census) – Dynamic: changes continuously to reflect real-time data (example: most online research databases)

Types of Statistical Databases – Centralized: one database – Decentralized: multiple decentralized databases – General purpose: like census – Special purpose: like bank, hospital, academia, etc.

Data Compromise Exact compromise – a user is able to determine the exact value of a sensitive attribute of an individual Partial compromise – a user is able to obtain an estimator for a sensitive attribute with a bounded variance Positive compromise – determine that an attribute has a particular value Negative compromise – determine that an attribute does not have a particular value Relative compromise – determine the ranking of some confidential values

Statistical Quality of Information Bias – difference between the unperturbed statistic and the expected value of its perturbed estimate Precision – variance of the estimators obtained by users Consistency – lack of contradictions and paradoxes – Contradictions: different responses to same query; average differs from sum/count – Paradox: negative count

Methods – Query restriction – Data perturbation/anonymization – Output perturbation

Data Perturbation

Output Perturbation Query Results

Statistical data release vs. data anonymization Data anonymization is one technique that can be used to build a statistical database Other techniques, such as query restriction and output perturbation, can be used to build a statistical database or release statistical data Different privacy principles can be used

Security Methods – Query restriction (early methods): query size control, query set overlap control, query auditing – Data perturbation/anonymization – Output perturbation

Query Set Size Control – Query-set-size control limits the number of records allowed in the result set – The query result is released only if the size of the query set |C| satisfies K <= |C| <= L - K, where L is the size of the database and K is a parameter with 0 <= K <= L/2
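A minimal Python sketch of the rule (illustrative; the helper name and data are made up): a query is answered only when its result set is neither too small nor so large that its complement is too small.

def size_control_answer(query_set, database_size, K, aggregate):
    """Answer an aggregate query only if K <= |C| <= L - K."""
    c = len(query_set)
    if K <= c <= database_size - K:
        return aggregate(query_set)
    return None  # refuse: query set too small, or its complement too small

# Example on a 100-record database with K = 5.
print(size_control_answer([42], database_size=100, K=5, aggregate=sum))        # None: singleton refused
print(size_control_answer(list(range(20)), database_size=100, K=5, aggregate=sum))  # answered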

Query Set Size Control

Tracker Q1: Count ( Sex = Female ) = A Q2: Count ( Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC) ) = B What if B = A+1?

Tracker Q1: Count(Sex = Female) = A Q2: Count(Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC)) = B If B = A+1, exactly one record matches the male predicate Q3: Count(Sex = Female OR (Age = 42 & Sex = Male & Employer = ABC & Diagnosis = Schizophrenia)) If Q3 = A+1 that individual has Schizophrenia; if Q3 = A, he does not Positively or negatively compromised!
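A toy Python illustration of the tracker (hypothetical records, not from the slides): the three COUNT queries each touch several records, yet together they pin down one individual's diagnosis.

records = [
    {"sex": "F", "age": 30, "employer": "XYZ", "dx": "flu"},
    {"sex": "F", "age": 25, "employer": "ABC", "dx": "asthma"},
    {"sex": "F", "age": 51, "employer": "XYZ", "dx": "flu"},
    {"sex": "M", "age": 42, "employer": "ABC", "dx": "schizophrenia"},  # the target
    {"sex": "M", "age": 33, "employer": "XYZ", "dx": "flu"},
]
count = lambda pred: sum(1 for r in records if pred(r))
target = lambda r: r["age"] == 42 and r["sex"] == "M" and r["employer"] == "ABC"

A = count(lambda r: r["sex"] == "F")                                   # Q1
B = count(lambda r: r["sex"] == "F" or target(r))                      # Q2
C = count(lambda r: r["sex"] == "F" or (target(r) and r["dx"] == "schizophrenia"))  # Q3
print(A, B, C)      # B == A + 1: the tracker predicate matches exactly one record
print(C == A + 1)   # True: that individual has schizophrenia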

Query set size control With query-set-size control alone, the database can be easily compromised within 4-5 queries (e.g., using a tracker) If the threshold K is large, too many legitimate queries are restricted And it still does not guarantee protection from compromise

Query Set Overlap Control Basic idea: successive queries are checked against the number of records they have in common; if the number of common records with any earlier query exceeds a given threshold, the requested statistic is not released A new query q(C) is only allowed if |q(C) ∩ q(D)| ≤ r for every previously answered query q(D), where r > 0 is set by the administrator
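A matching Python sketch for the overlap rule (illustrative): the server remembers the record sets of answered queries and refuses any new query that shares more than r records with one of them.

def overlap_control(new_set, answered_sets, r):
    """Allow a query only if it overlaps every answered query in at most r records."""
    new = set(new_set)
    for old in answered_sets:
        if len(new & old) > r:
            return False          # refuse: too much overlap with an earlier query
    answered_sets.append(new)     # remember the allowed query for future checks
    return True

history = []
print(overlap_control({1, 2, 3, 4}, history, r=1))   # True: first query is allowed
print(overlap_control({2, 3, 4, 5}, history, r=1))   # False: overlap of 3 records > r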

Query-set-overlap control Ineffective for cooperation of several users Statistics for a set and its subset cannot be released – limiting usefulness Need to keep user profile High processing overhead – every new query compared with all previous ones No formal privacy guarantee

Auditing Keep up-to-date logs of all queries made by each user and check for possible compromise when a new query is issued Excessive computation and storage requirements "Efficient" methods exist for special types of queries

Audit Expert (Chin 1982) Query auditing method for SUM queries A SUM query can be considered as a linear equation a_1 x_1 + a_2 x_2 + … + a_L x_L = q, where a_i ∈ {0,1} indicates whether record i belongs to the query set, x_i is the sensitive value, and q is the query result A set of SUM queries can be thought of as a system of linear equations Audit Expert maintains the binary matrix representing the linearly independent queries answered so far and updates it when a new query is issued A row with all 0s except for the i-th column indicates disclosure of record i
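A compact Python sketch of this bookkeeping (illustrative, not Chin's actual algorithm or data structure): reduce the 0/1 query vectors with exact Gauss-Jordan elimination and report any record whose coefficient row reduces to a unit vector.

from fractions import Fraction

def rref(matrix):
    # Gauss-Jordan elimination over exact rationals.
    m = [[Fraction(x) for x in row] for row in matrix]
    pivot_row = 0
    for col in range(len(m[0]) if m else 0):
        pr = next((r for r in range(pivot_row, len(m)) if m[r][col] != 0), None)
        if pr is None:
            continue
        m[pivot_row], m[pr] = m[pr], m[pivot_row]
        pivot_val = m[pivot_row][col]
        m[pivot_row] = [x / pivot_val for x in m[pivot_row]]
        for r in range(len(m)):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
    return [row for row in m if any(row)]

def disclosed_records(query_vectors):
    # Indices of records whose value is exactly determined by the answered SUM queries.
    return [next(i for i, x in enumerate(row) if x != 0)
            for row in rref(query_vectors)
            if sum(1 for x in row if x != 0) == 1]

queries = [[1, 1, 1, 0],   # SUM over records 0, 1, 2
           [0, 1, 1, 0]]   # SUM over records 1, 2
print(disclosed_records(queries))   # [0]: record 0's value is revealed by the difference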

Audit Expert Only stores linearly independent queries Not all queries are linearly independent, e.g.: Q1: Sum(Sex=M), Q2: Sum(Sex=M AND Age>20), Q3: Sum(Sex=M AND Age<=20); here Q1 = Q2 + Q3

Audit Expert O(L^2) time complexity; further work reduced this to O(L) time and space when the number of queries is less than L Only for SUM queries No restrictions on query set size Maximizing the non-confidential information that can be released is NP-complete

Auditing – recent developments Online auditing – "Detect and deny" queries that violate the privacy requirement – Denials themselves may implicitly disclose sensitive information Offline auditing – Check whether a privacy requirement has been violated after the queries have been executed – Does not prevent the violation

Security Methods – Query restriction – Data perturbation/anonymization – Output perturbation and differential privacy: sampling, output perturbation

Sources – Partial slides: – Adam, Nabil R.; Wortmann, John C. Security-Control Methods for Statistical Databases: A Comparative Study. ACM Computing Surveys, Vol. 21, No. 4, December 1989 – Fung et al. Privacy-Preserving Data Publishing: A Survey of Recent Developments. ACM Computing Surveys, in press, 2009