On the Anonymization of Sparse High-Dimensional Data

1 On the Anonymization of Sparse High-Dimensional Data
Gabriel Ghinita (1), Yufei Tao (2), Panos Kalnis (1)
(1) National University of Singapore, {ghinitag,kalnis}@comp.nus.edu.sg
(2) Chinese University of Hong Kong, taoyf@cse.cuhk.edu.hk

2 Publishing Transaction Data
• Publishing transaction data
  Retail chain-owned shopping cart data
  Infer consumer spending patterns
• Correlations among purchased items
  e.g., 90% of cereal buyers also buy milk
• What about privacy?

3 Privacy Threat
[Figure labels: Quasi-identifying Items, Sensitive Items]

4 Privacy Paradigm
• ℓ-diversity
  Prevent association between quasi-identifier and sensitive attributes (SA)
• Create groups of transactions
  Frequency of an SA value in a group < 1/p
• Objective
  Enforce privacy
  Preserve correlations among items
Challenge: high data dimensionality
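
The group condition on this slide can be sketched in a few lines. This is only an illustration, assuming each transaction is a set of item names; the function name `satisfies_privacy` and the sample items are hypothetical. The slide states the bound as strict (< 1/p), and the sketch follows that wording.

```python
from collections import Counter

def satisfies_privacy(group, sensitive, p):
    """True if every sensitive item occurs in fewer than 1/p of the group."""
    # Each transaction is a set, so an item is counted once per transaction.
    counts = Counter(item for t in group for item in t if item in sensitive)
    # Integer form of count / len(group) < 1 / p (strict, as on the slide).
    # Use <= here if the intended bound is inclusive.
    return all(c * p < len(group) for c in counts.values())

# Hypothetical shopping-cart data: one of three transactions is sensitive.
group = [
    {"wine", "meat", "pregnancy test"},
    {"wine", "cream"},
    {"meat", "milk"},
]
sensitive = {"pregnancy test"}

print(satisfies_privacy(group, sensitive, p=2))  # True:  1/3 < 1/2
print(satisfies_privacy(group, sensitive, p=3))  # False: 1/3 is not < 1/3
```

Comparing `c * p` with the group size avoids floating-point division when the frequency sits exactly on the bound.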

5 Data Re-organization
Band Matrix Organization
PRESERVES CORRELATIONS!

6 Published Data
Summary of Sensitive Items

7 Contributions
• Novel data representation
  Preserves correlation among items
• Efficient heuristic for group formation
  Time linear in the data size
  Supports multiple sensitive items

8 State-of-the-art: Mondrian [FWR06]
• Generalization-based
  Data-space partitioning similar to k-d-trees
• Split recursively until a further split would violate the privacy condition
  Constrained global recoding
[Figure: partitioning of the Age (20 to 60) by Weight (40 to 100) data space, k = 2]
GENERALIZATION + HIGH DIMENSIONALITY = UNACCEPTABLE INFORMATION LOSS
[FWR06] K. LeFevre et al. Mondrian Multidimensional k-Anonymity. Proceedings of the 22nd International Conference on Data Engineering (ICDE), 2006.

9 State-of-the-art: Anatomy [XT06]
• Permutation-based method
  Discloses exact QID values

Original table:
  Age  ZipCode  Disease
  42   52000    Ulcer
  47   43000    Pneumonia
  51   32000    Flu
  55   27000    Gastritis
  62   41000    Dyspepsia
  67   55000    Dyspepsia

“Anatomized” table, QID part:
  Age  ZipCode
  42   52000
  47   43000
  51   32000
  62   41000
  55   27000
  67   55000

“Anatomized” table, Disease part:
  Ulcer(1) Pneumonia(1) Flu(1) Dyspepsia(1) Gastritis(1) Dyspepsia(1)

|G|! permutations
RANDOM GROUP FORMATION DOES NOT PRESERVE CORRELATIONS
[XT06] X. Xiao and Y. Tao. Anatomy: Simple and Effective Privacy Preservation. Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), 2006.
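
A rough sketch of the permutation-based idea: exact QID values are published together with a group id, and the sensitive values are published separately as per-group counts. The rows are the ones in the slide's table. The function name `anatomize` and the choice to group consecutive rows in blocks of three are assumptions made for illustration, not the grouping procedure of [XT06].

```python
from collections import Counter

def anatomize(rows, group_size):
    """Split (age, zipcode, disease) rows into a QID table and a sensitive table."""
    qid_table, sensitive_table = [], []
    for start in range(0, len(rows), group_size):
        gid = start // group_size + 1
        chunk = rows[start:start + group_size]
        # QID values are disclosed exactly, tagged only with the group id.
        qid_table.extend((age, zipcode, gid) for age, zipcode, _ in chunk)
        # Sensitive values are disclosed as counts per group, not per row.
        counts = Counter(disease for _, _, disease in chunk)
        sensitive_table.extend((gid, d, c) for d, c in sorted(counts.items()))
    return qid_table, sensitive_table

rows = [
    (42, 52000, "Ulcer"),
    (47, 43000, "Pneumonia"),
    (51, 32000, "Flu"),
    (55, 27000, "Gastritis"),
    (62, 41000, "Dyspepsia"),
    (67, 55000, "Dyspepsia"),
]
qid_table, sensitive_table = anatomize(rows, group_size=3)
for entry in sensitive_table:
    print(entry)
```

With this grouping the second group holds Dyspepsia twice, so an observer would guess Dyspepsia for any of its members with probability 2/3, which shows why the choice of groups matters.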

10 Band Matrix Representation
• Bandwidth = U + L + 1
• Minimizing bandwidth is NP-hard
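
The bandwidth formula can be checked on a small example. The sketch below assumes U and L are the upper and lower bandwidths, that is, the largest distance of a nonzero entry above and below the main diagonal; the slide does not spell this out. The function name `bandwidth` and the coordinate-list input are illustrative choices.

```python
def bandwidth(nonzeros):
    """Bandwidth U + L + 1 of a matrix given as (row, col) nonzero positions."""
    # U: farthest nonzero above the main diagonal (col > row).
    upper = max((j - i for i, j in nonzeros if j > i), default=0)
    # L: farthest nonzero below the main diagonal (row > col).
    lower = max((i - j for i, j in nonzeros if i > j), default=0)
    return upper + lower + 1

# Rows are transactions, columns are items; each pair marks a purchased item.
nonzeros = [(0, 0), (0, 2), (1, 1), (2, 1), (3, 0), (3, 3)]
print(bandwidth(nonzeros))  # 6: U = 2 from (0, 2), L = 3 from (3, 0)
```

A purely diagonal matrix has U = L = 0 and therefore bandwidth 1, the smallest possible value.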

11 Reverse Cuthill-McKee (RCM)
• Heuristic bandwidth minimization
  Solves corresponding graph labeling problem
  Permutes rows and columns
• Complexity: N * D * log D
  N = matrix rows (# transactions)
  D = maximum degree of any vertex
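
For readers unfamiliar with RCM, here is a minimal pure-Python sketch of the graph labeling step. It assumes the graph is already given as an adjacency dictionary with comparable vertex labels; how the graph is derived from the transaction matrix is not covered here. Starting each component from a minimum-degree vertex is a common simplification, and the function name is illustrative.

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    """Return a vertex ordering that tends to keep neighbours close together."""
    degree = {v: len(ns) for v, ns in adj.items()}
    visited, order = set(), []
    # Handle every connected component, lowest-degree start vertex first.
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Breadth-first search, visiting neighbours by increasing degree.
            for w in sorted(adj[v], key=lambda u: (degree[u], u)):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    # Reversing the Cuthill-McKee order gives the RCM order.
    return order[::-1]

# A path 0 - 3 - 1 - 4 - 2 whose labels are scrambled.
adj = {0: [3], 3: [0, 1], 1: [3, 4], 4: [1, 2], 2: [4]}
order = reverse_cuthill_mckee(adj)
print(order)  # [2, 4, 1, 3, 0]
```

In the returned order every edge of the path connects adjacent positions, so relabeling the vertices this way places all nonzeros of the adjacency matrix next to the diagonal.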

12 Group Formation
• Correlation-aware Anonymization of High-Dimensional Data (CAHD)
• Use the order given by RCM
  Consecutive transactions highly correlated
• O(pN) complexity
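
To illustrate why the RCM order helps, the sketch below forms groups from consecutive transactions only. It is a simplified greedy stand-in, not the CAHD heuristic itself, and it makes no attempt to reach the O(pN) bound. The names `form_groups` and `satisfies_privacy`, the minimum group size of p, and the sample data are assumptions; the frequency bound is strict, as on the Privacy Paradigm slide.

```python
from collections import Counter

def satisfies_privacy(group, sensitive, p):
    """True if every sensitive item occurs in fewer than 1/p of the group."""
    counts = Counter(item for t in group for item in t if item in sensitive)
    return all(c * p < len(group) for c in counts.values())

def form_groups(ordered, sensitive, p):
    """Greedily cut a band-ordered transaction list into consecutive groups."""
    def ok(g):
        return len(g) >= p and satisfies_privacy(g, sensitive, p)

    groups, current = [], []
    for t in ordered:
        current.append(t)
        if ok(current):
            groups.append(current)
            current = []
    # Merge leftover transactions backwards until the condition holds.
    while current and groups and not ok(current):
        current = groups.pop() + current
    if current:
        # May still violate the condition if the whole data set cannot meet it.
        groups.append(current)
    return groups

# Transactions in (assumed) RCM order; "s1" is the only sensitive item.
ordered = [{"a", "s1"}, {"a", "b"}, {"a", "c"}, {"b", "c", "s1"},
           {"c"}, {"c", "d"}, {"d"}]
groups = form_groups(ordered, {"s1"}, p=2)
print([len(g) for g in groups])  # [3, 4]
```

Because neighbours in the band order share many items, each group mixes a sensitive transaction with transactions that look similar on the non-sensitive items.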

13 Group Formation

14 Experimental Evaluation

15 RCM Visualization

16 Experimental Setting
• BMS dataset
• Compare with hybrid PermMondrian (PM)
  Combines Mondrian with Anatomy
• Query Workload
• Reconstruction Error

17 Reconstruction Error vs p

18 Execution Time

19 Conclusions
• Anonymizing transaction data
  High-dimensionality
  Preserving correlation
• Future work
  Different encodings for data representation
• Enhance correlation among consecutive rows
