Fast Data Anonymization with Low Information Loss

Fast Data Anonymization with Low Information Loss
Gabriel Ghinita1 Panagiotis Karras2 Panos Kalnis1 Nikos Mamoulis2 1 National University of Singapore 2 Hong Kong University

Privacy-Preserving Data Publishing
Large amounts of public data Research or statistical purposes e.g. distribution of disease for age, city Data may contain sensitive information Ensure data privacy We address the topic of privacy-preserving data publishing. Currently, there are large amounts of data owned various organizations, such as hospitals and insurance companies. This data pertains to individuals. While publishing such data is of interest for statistical or research purposes, it is of paramount importance to preserve the privacy of individuals included in the data

Privacy Violation Example
Age ZipCode Disease 42 52000 Ulcer 47 43000 Pneumonia 51 32000 Flu 55 27000 Gastritis 62 41000 Dyspepsia 67 55000 Name Age ZipCode Disease Andy 42 52000 Ulcer Bill 47 43000 Pneumonia Ken 51 32000 Flu Nash 55 27000 Gastritis Mike 62 41000 Dyspepsia Sam 67 55000 To give an example of how privacy of individuals can be compromised, consider a sample table with hospital records (left hand side), where directly identifiable fields (name, SSID) have been removed in advance. Such data is also referred to as microdata The Disease attribute represents sensitive information about patients. The data contains so-called QID attributes, which can be joined with public data, as shown in the following example On the right hand side, we have a voting registration list, which is publicly available Note that, the Age and ZipCode fields appear in both tables, and joining the two tables, an attacker can infer with certainty which the individual behind each hospital record (a) Microdata (b) Voting Registration List (public)

k-anonymity[Sam01] QID generalization or suppression
Age ZipCode Disease 42-47 Ulcer Pneumonia 51-55 Flu Gastritis 62-67 Dyspepsia Name Age ZipCode Disease Andy 42 52000 Ulcer or Pneumonia Bill 47 43000 Ken 51 32000 Flu or Gastritis Nash 55 27000 Mike 62 41000 Dyspepsia Sam 67 55000 To prevent such attacks, the k-anon concept has been proposed, which dictates that each published record should be indistinguishable from at least k-1 other records along the QID set. K-anon is achieved through generalization (replace exact QID values with more general ones) or suppression. Note the inherent information loss. Fig (a) shows a 2-anonymous version of the hospital data in our example, and Fig (b) shows the result of the join with the voting registration list For the first three groups, the attacker can associate an individual to a disease with probability 50%.However, for the last group, the attacker can still know the disease with certainty, due to the fact that all diseases in the group are equal (homogeneity attack) 2-anonymous microdata (b) Voting Registration List (public) Privacy Violation! [Sam01] P. Samarati, "Protecting Respondent's Privacy in Microdata Release," in IEEE TKDE, vol. 13, n. 6, November/December 2001, pp

ℓ-diversity[MGKV06] At least ℓ sensitive attribute (SA) values “well-represented” in each group e.g. freq. of an SA value in a group < 1/ℓ L-diversity is another privacy-preserving paradigm, which requires that at least “l” sensitive values are well-represented in each group A practical interpretation of l-diversity is to bound the association probability between a record and an SA value to 1/l [MGKV06] A. Machanavajjhala et al. ℓ-diversity: Privacy Beyond k-anonymity, Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006

Problem Statement Find k-anonymous/ℓ-diverse transformation
Minimize information loss Incur reduced anonymization overhead The objective of effective data anonymization is to transform the data such that privacy is ensured, according to the k-anon or l-div paradigm At the same time, the data must remain as useful as possible, i.e. the information loss incurred by the anonymization should be minimized Finally, the overhead of anonymization should be low, since the published data may contain millions of records

Contributions 1D QID Generalization to multi-dimensional QID
Linear, optimal k-anonymous partitioning Polynomial, optimal ℓ-diverse partitioning Linear heuristic for ℓ-diverse partitioning Generalization to multi-dimensional QID Multi-to-1D mapping Hilbert Space-Filling Curve i-Distance Apply 1D algorithms In summary, our contributions are: - efficient solutions for the k-anon and l-div problems with 1D qid; specifically, an optimal, linear time solution to k-anon, an optimal poly-time solution for l-div, and a linear heuristic for l-div - We extend our algorithms to multi-dim qid using multi-to-1D transformations, specifically Hilbert and iDist; however, any transformation can be plugged-in into our framework

Multi-dimensional QID
Dimensionality Mapping Extension to multi-dim QID is performed through mapping, then application of 1D algos We consider Hilbert and iDist

State-of-the-art: Mondrian[FWR06]
Generalization-based data-space partitioning similar to k-d-trees split recursively as long as privacy condition holds k = 2 Age 20 40 60 40 60 Currently, there are two main paradigms to achieve privacy, generalization-based and permutation-based The state-of-the-art in generalization-based techniques is Mondrian, which recursively partitions the dataspace as long as each resulting partition satisfies the privacy requirement (either k-anon or l-div) The split dimension is the one with the largest normalized span; the split value is the median We show a 2-D example for k-anon, k=2 (briefly discuss example …) Weight 80 100 [FWR06] K. LeFevre et al. Mondrian Multidimensional k-anonymity, Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006

Motivating Example Age Weight 42 22 24 40 30 35 31 33 55 56 63 61 45
50 65 60 70 75 80 85 Mondrian k-anonymity, k = 4 We highlight the limitations of Mondrian, first for k-anon, k=4: Each disease is represented with a different texture, and the numbers represent Hilbert values in a 8x8 space Both dims have the same attr span, assume Age is chosen, two partitions of 6 each are created by Mondrian, and no subsequent splits are allowed Our approach simply splits the Hilbert-ordered sequence into three groups of 4 each, with considerably smaller information loss (i.e. group extent. We will formally define info loss later)

Motivating Example Age Weight 42 22 24 40 30 35 31 33 55 56 63 61 45
50 65 60 70 75 80 85 Our Method k-anonymity, k = 4 We highlight the limitations of Mondrian, first for k-anon, k=4: Each disease is represented with a different texture, and the numbers represent Hilbert values in a 8x8 space Both dims have the same attr span, assume Age is chosen, two partitions of 6 each are created by Mondrian, and no subsequent splits are allowed Our approach simply splits the Hilbert-ordered sequence into three groups of 4 each, with considerably smaller information loss (i.e. group extent. We will formally define info loss later)

Motivating Example Mondrian Performs NO SPLIT! ℓ -diversity, ℓ = 3 Age
Weight 42 22 24 40 30 35 31 33 55 56 63 61 45 50 65 60 70 75 80 85 Mondrian Performs NO SPLIT! The problem of Mondrian is more pronounced for l-diversity, because the group validation condition is more strict In the example, no split is possible according to Mondrian In our approach, we sort the data according to Hilbert value (along each SA value domain), and we create groups according to this ordering. In the example, 4 groups of 3 each are created Our technique uses local recoding (i.e. overlap is allowed).

Motivating Example ℓ -diversity, ℓ = 3 Age Weight 42 22 24 40 30 35 31
33 55 56 63 61 45 50 65 60 70 75 80 85 Our Method The problem of Mondrian is more pronounced for l-diversity, because the group validation condition is more strict In the example, no split is possible according to Mondrian In our approach, we sort the data according to Hilbert value (along each SA value domain), and we create groups according to this ordering. In the example, 4 groups of 3 each are created Our technique uses local recoding (i.e. overlap is allowed).

State-of-the-art: Anatomy[XT06]
Permutation-based method discloses exact QID values vulnerable to presence attacks “Anatomized” table |G|! permutations Age ZipCode Disease 42 52000 Ulcer 47 43000 Pneumonia 51 32000 Flu 55 27000 Gastritis 62 41000 Dyspepsia 67 55000 Age ZipCode 42 52000 47 43000 51 32000 62 41000 55 27000 67 55000 Disease Ulcer(1) Pneumonia(1) Flu(1) Dyspepsia(1) Gastritis(1) Dyspepsia(1) Anatomy is a permutation-based approach, and it “breaks” the mapping between qid and SA value. |G|! mappings are possible for each group. Groups are formed by randomly selecting records with distinct SA values Anatomy publishes the EXACT qid, hence it is vulnerable to “presence” attacks [XT06] X. Xiao and Y. Tao. Anatomy: simple and effective privacy preservation, Proceedings of the 32nd international conference on Very Large Data Bases (VLDB), 2006

Limitation of Anatomy SA: D3 D2 D1 Alzheimer QID: 20 40 60 80 100
It has another limitation, due to the fact that it does not consider proximity in the QID space in the group formation process Explain the example

Information Loss (Numerical Data)
Age Weight 42 22 24 40 30 35 31 33 G2 55 56 63 61 45 50 65 60 70 75 80 85 For a numerical attribute, IL is the normalized span if a group for that attribute

Information Loss (Categorical Data)
For a categorical attribute, IL is measured wrt the taxonomy tree Explain example … IL = IL({Italy, Spain}) = 3/5

Optimal 1D k-anonymity Properties of optimal solution
Groups do not overlap in QID space Group size bounded by 2k-1 DP Formulation O(kN) Our 1D k-anon algo is based on the following observations: Groups need not overlap in the qid space Group size is bounded by 2k-1 DP formulation with linear complexity in k and db size Mention similarity to optimal V-histogram from data summarization j: end record of previous group i: end record candidates for current group

Optimal 1D ℓ-diversity Properties of optimal solution
Group size bounded by 2ℓ-1 But groups MAY overlap in QID space SA Domain Representation For l-diversity, the problem is more difficult, since overlap may be required in the optimal solution, hence the search space increases considerably Group size is still bounded, by 2l-1 We consider a multi-domain representation of data, where each domain corresponds to a SA value, and the x-axis is the QID

Order of groups in each domain is THE SAME
Group Order Property Optimal grouping We have identified three properties of the optimal solution: Group Order states that the order of appearance of groups is consistent across all domains Explain example violation of group order Order of groups in each domain is THE SAME

“begin” and “end” records in each group follow the same order
Border Order Property Optimal grouping Border Order states that the order also applies to the begin and end records of each group (regardless of domain) Explain example violation of border order “begin” and “end” records in each group follow the same order

Cover Property Optimal grouping violation of cover order
WILL ONLY SHOW TITLE, SKIP THIS TO SAVE TIME! Cover Property states that when a record can be added to one or more groups (wrt the l-diversity requirement), it should belong to the group which is closest in the QID space violation of cover order record r that can be added to two groups should belong to the “closest” group to r

1D ℓ-diversity Heuristic
Optimal algorithm is polynomial But may be costly in practice Linear heuristic algorithm Considers single “frontier of search” Frontier consists of first non-assigned record in each domain Based on the three properties, we obtain an optimal solution using DP However, although polynomial, the complexity can be rather large in practice (the degree of the polynomial is the cardinality of SA) We devise an heuristic which works in linear time and prunes the search space

1D ℓ-diversity Heuristic
use “frontier” of search check “eligibility condition” (for termination) G1 G2 G3 G4 Explain example and optimization ℓ=3

Experimental Setting Census dataset
Data about 500,000 individuals General purpose information loss metric Based on group extent in QID space OLAP query accuracy KL-divergence pdf distance We consider a real dataset, CENSUS, and we measure the performance of the algos wrt execution time and data utility We measure utility both with a general-purpose metric (defined earlier) and also for real query workload – OLAP queries The OLAP cube can be seen as a multi-dimensional pdf function; we measure the KL-divergence (distance probability) between the cube obtained from the real data and that obtained from the anonymized data

k-anonymity TopDown is a clustering-based method [KDD06]; it is considerably slower than the rest of the methods Our methods outperform state-of-the-art. The “fast” versions are a speed optimization of their original counterparts, and compute the group extent in O(1), as opposed to O(|G|)

ℓ-diversity: General Info. Loss
For l-diversity, we consider first general info loss, and we do not include anatomy, which publishes exact qid We are considerably better than l-diverse Mondrian

ℓ-diversity: General Info. Loss
We also scale better than Mondrian with QID dimensionality

OLAP Queries Distance between actual and approximate OLAP cubes
SELECT QT1, QT2,..., QTi, COUNT(*) FROM Data WHERE SA = val GROUP BY QT1, QT2,..., QTi

OLAP Query Accuracy For OLAP, we include Anatomy in our results. Recall that Anatomy publishes exact qid, while we use generalization. Still, we outperform Anatomy for a large part of the “l”-values spectrum We also show the variation with the OLAP granularity: the larger the level, the better the granularity (i.e. more attributes in group-by). Anatomy tends to be better for smaller granularity, because it discloses exact qid values.

OLAP Query Accuracy

Conclusions Framework for k-anonymity and ℓ-diversity Results
Transform the multi-D QID problem to 1-D Apply linear optimal/heuristic 1D algorithms Results Clearly superior utility to Mondrian, with comparable execution time Similar (or better) utility as Anatomy for aggregate queries, where Anatomy excels

Fast Data Anonymization with Low Information Loss

Similar presentations

Presentation on theme: "Fast Data Anonymization with Low Information Loss"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Fast Data Anonymization with Low Information Loss

Similar presentations

Presentation on theme: "Fast Data Anonymization with Low Information Loss"— Presentation transcript:

Similar presentations

About project

Feedback