On the Anonymization of Sparse High-Dimensional Data

1 On the Anonymization of Sparse High-Dimensional Data
Gabriel Ghinita (1), Yufei Tao (2), Panos Kalnis (1)
(1) National University of Singapore, {ghinitag,kalnis}@comp.nus.edu.sg
(2) Chinese University of Hong Kong, taoyf@cse.cuhk.edu.hk

2 Publishing Transaction Data
- Publishing transaction data
  - Retail chain-owned shopping cart data
  - Infer consumer spending patterns
- Correlations among purchased items
  - e.g., 90% of cereal buyers also buy milk
- What about privacy?

3 Privacy Threat
[Figure labels: Quasi-identifying Items; Sensitive Items]

4 Privacy Paradigm
- ℓ-diversity
  - Prevents association between quasi-identifier and sensitive attributes
- Create groups of transactions
  - Frequency of any sensitive attribute (SA) value in a group < 1/p
- Objective
  - Enforce privacy
  - Preserve correlations among items
  - Challenge: high data dimensionality
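The grouping condition on this slide can be checked mechanically. The following is a minimal sketch, not taken from the presentation: the function name `satisfies_privacy`, the set-based transaction encoding, and the `strict` flag are assumptions made for illustration. With `strict=True` it applies the bound as written on the slide (< 1/p); `strict=False` relaxes it to "at most 1/p".

```python
from collections import Counter


def satisfies_privacy(group, sensitive_items, p, strict=True):
    """Return True if every sensitive item is rare enough in `group`.

    group: list of transactions, each a set of item names (assumed encoding).
    sensitive_items: set of item names treated as sensitive.
    p: privacy degree; the frequency bound is 1/p.
    strict: True uses `< 1/p` as on the slide, False uses `<= 1/p`.
    """
    if not group:
        return True
    counts = Counter(
        item for t in group for item in t if item in sensitive_items
    )
    for count in counts.values():
        # Compare count/|G| with 1/p using integers to avoid float rounding.
        lhs, rhs = count * p, len(group)
        violated = (lhs >= rhs) if strict else (lhs > rhs)
        if violated:
            return False
    return True
```

For example, a group of four transactions in which one sensitive item occurs once passes for p = 2 under either reading, but for p = 4 it passes only with `strict=False`.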

5 Data Re-organization
- Band matrix organization
- PRESERVES CORRELATIONS!

6 Published Data
[Figure label: Summary of Sensitive Items]

7 Contributions
- Novel data representation
  - Preserves correlation among items
- Efficient heuristic for group formation
  - Linear time in the data size
  - Supports multiple sensitive items

8 State-of-the-art: Mondrian [FWR06]
- Generalization-based
  - Data-space partitioning similar to k-d-trees
  - Split recursively as long as the privacy condition still holds
  - Constrained global recoding
- [Figure: partitioning example for k = 2; axes Age (20, 40, 60) and Weight (40, 60, 80, 100)]
- GENERALIZATION + HIGH DIMENSIONALITY = UNACCEPTABLE INFORMATION LOSS
[FWR06] K. LeFevre et al. Mondrian Multidimensional k-Anonymity. Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006.
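To make the recursive partitioning idea concrete, here is a simplified sketch in the spirit of Mondrian, not the implementation from [FWR06]. It assumes numeric attributes, median cuts, and a plain k-anonymity condition (each partition keeps at least k points); the names `mondrian` and `generalize` and the sample points in the usage note are hypothetical.

```python
def mondrian(points, k):
    """Split `points` (non-empty list of equal-length numeric tuples)
    recursively and return the list of resulting partitions."""
    dims = range(len(points[0]))
    # Try dimensions from widest to narrowest value range.
    by_range = sorted(
        dims,
        key=lambda d: max(p[d] for p in points) - min(p[d] for p in points),
        reverse=True,
    )
    for d in by_range:
        ordered = sorted(points, key=lambda p: p[d])
        median = ordered[len(ordered) // 2][d]
        left = [p for p in ordered if p[d] < median]
        right = [p for p in ordered if p[d] >= median]
        # A cut is allowed only if both halves keep at least k points.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    # No allowable cut on any dimension: this partition is final.
    return [points]


def generalize(partition):
    """Replace each attribute by its (min, max) range over the partition."""
    dims = range(len(partition[0]))
    return [
        (min(p[d] for p in partition), max(p[d] for p in partition))
        for d in dims
    ]
```

With six hypothetical (Age, Weight) points and k = 2, `mondrian` returns two partitions of three points each, and `generalize` turns each into a pair of ranges. Every additional dimension widens those ranges, which is the information loss the slide warns about.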

9 State-of-the-art: Anatomy [XT06]
- Permutation-based method
  - Discloses exact QID values
- Original table:
    Age  ZipCode  Disease
    42   52000    Ulcer
    47   43000    Pneumonia
    51   32000    Flu
    55   27000    Gastritis
    62   41000    Dyspepsia
    67   55000    Dyspepsia
- "Anatomized" table:
    Age  ZipCode
    42   52000
    47   43000
    51   32000
    62   41000
    55   27000
    67   55000
    Disease: Ulcer(1), Pneumonia(1), Flu(1), Dyspepsia(1), Gastritis(1), Dyspepsia(1)
- |G|! permutations
- RANDOM GROUP FORMATION DOES NOT PRESERVE CORRELATIONS
[XT06] X. Xiao and Y. Tao. Anatomy: Simple and Effective Privacy Preservation. Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), 2006.
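The separation that Anatomy performs can be sketched in a few lines. This is an illustrative sketch rather than the algorithm from [XT06]: the function name `anatomize`, the pair-based row encoding, and the split of the slide's six rows into two groups of three are assumptions.

```python
from collections import Counter


def anatomize(groups):
    """groups: list of groups; each group is a list of (qid, sensitive) pairs.

    Returns (qid_table, sensitive_table):
      qid_table: (group_id, qid) rows carrying the exact QID values.
      sensitive_table: (group_id, value, count) rows, one per distinct value.
    """
    qid_table, sensitive_table = [], []
    for gid, group in enumerate(groups, start=1):
        for qid, _ in group:
            qid_table.append((gid, qid))
        value_counts = Counter(value for _, value in group)
        for value, count in sorted(value_counts.items()):
            sensitive_table.append((gid, value, count))
    return qid_table, sensitive_table


# Rows from the slide, grouped three by three (the grouping is an assumption).
groups = [
    [((42, 52000), "Ulcer"), ((47, 43000), "Pneumonia"), ((51, 32000), "Flu")],
    [((62, 41000), "Dyspepsia"), ((55, 27000), "Gastritis"), ((67, 55000), "Dyspepsia")],
]
qid_table, sensitive_table = anatomize(groups)
```

Within a group, the published tables no longer say which QID row goes with which sensitive value. How well item correlations survive depends on which rows are grouped together.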

10 Band Matrix Representation
- Bandwidth = U + L + 1
- Minimizing bandwidth is NP-hard
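The bandwidth formula can be evaluated directly on a sparse 0/1 matrix. The sketch below assumes that U and L denote the upper and lower bandwidth in the usual sense (the largest distance of a nonzero above and below the main diagonal), which the slide does not spell out; the function name and the row encoding are hypothetical.

```python
def bandwidth(rows):
    """rows[i] is the set of column indices holding a nonzero in row i."""
    upper = 0  # U: largest distance of a nonzero above the main diagonal
    lower = 0  # L: largest distance of a nonzero below the main diagonal
    for i, cols in enumerate(rows):
        for j in cols:
            upper = max(upper, j - i)
            lower = max(lower, i - j)
    return upper + lower + 1
```

A matrix whose nonzeros hug the diagonal yields a small value, whereas an anti-diagonal 3 x 3 matrix yields 2 + 2 + 1 = 5.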

11 Reverse Cuthill-McKee (RCM)
- Heuristic bandwidth minimization
  - Solves the corresponding graph labeling problem
  - Permutes rows and columns
- Complexity: O(N * D * log D)
  - N = matrix rows (# transactions)
  - D = maximum degree of any vertex
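As a rough illustration of the graph labeling step, here is a textbook-style Reverse Cuthill-McKee sketch in plain Python. It is not the presenters' implementation. The input is assumed to be an undirected graph given as a dictionary of neighbour sets with comparable vertex labels; how that graph is derived from the transaction matrix is not stated on the slide and is left out here.

```python
from collections import deque


def reverse_cuthill_mckee(adj):
    """adj maps each vertex to the set of its neighbours (undirected graph).

    Returns the vertices in RCM order; position in the list is the new label.
    """
    degree = {v: len(ns) for v, ns in adj.items()}
    visited = set()
    order = []
    # Handle every connected component, starting each from a low-degree vertex.
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Enqueue unvisited neighbours in increasing order of degree.
            for w in sorted(adj[v] - visited, key=lambda u: (degree[u], u)):
                visited.add(w)
                queue.append(w)
    # Reversing the Cuthill-McKee order gives the RCM order.
    order.reverse()
    return order
```

On a path graph whose vertices carry scrambled labels, the returned order places neighbouring vertices next to each other, so the largest label difference across any edge drops to 1.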

12 Group Formation
- Correlation-aware Anonymization of High-Dimensional Data (CAHD)
- Use the order given by RCM
  - Consecutive transactions are highly correlated
- O(pN) complexity
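The idea of forming groups from neighbours in the RCM order can be sketched as follows. This is a deliberately simplified greedy variant, not the CAHD heuristic itself. It does not reach the O(pN) bound, it does not verify the leftover transactions, and it uses the relaxed bound "at most 1/p" so that a group of exactly p transactions with a single occurrence of a sensitive item qualifies. The function name and data encoding are hypothetical.

```python
def form_groups(transactions, sensitive_items, p):
    """Greedy grouping over transactions already sorted in RCM order.

    transactions: list of sets of item names (assumed encoding).
    Returns (groups, leftover) as lists of transaction indices.
    """
    n = len(transactions)
    sens = [t & sensitive_items for t in transactions]
    assigned = [False] * n
    groups = []
    for i in range(n):
        if assigned[i] or not sens[i]:
            continue
        group, used = [i], set(sens[i])
        # Walk outward from i, alternating right and left neighbours.
        for step in range(1, n):
            for j in (i + step, i - step):
                if len(group) == p:
                    break
                # Accept j only if it adds no sensitive item already in the group.
                if 0 <= j < n and not assigned[j] and not (sens[j] & used):
                    group.append(j)
                    used |= sens[j]
            if len(group) == p:
                break
        if len(group) == p:
            for j in group:
                assigned[j] = True
            groups.append(sorted(group))
    # Unassigned transactions must still be checked before publication.
    leftover = [i for i in range(n) if not assigned[i]]
    return groups, leftover
```

Because each accepted group has p members and no sensitive item appears in it more than once, every sensitive item has frequency at most 1/p in its group. The leftover list is returned separately because this sketch does not guarantee that it satisfies the privacy condition.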

13 Group Formation

14 Experimental Evaluation

15 RCM Visualization

16 Experimental Setting
- BMS dataset
- Compare with hybrid PermMondrian (PM)
  - Combines Mondrian with Anatomy
- Query workload
- Reconstruction error

17 Reconstruction Error vs p

18 Execution Time

19 Conclusions
- Anonymizing transaction data
  - High dimensionality
  - Preserving correlation
- Future work
  - Different encodings for data representation
  - Enhance correlation among consecutive rows
