Probabilistic k^m-anonymity (Efficient Anonymization of Large Set-valued Datasets). Gergely Acs (INRIA) gergely.acs@inria.fr, Jagdish Achara (INRIA) jagdish.achara@inria.fr, Claude Castelluccia (INRIA) claude.castelluccia@inria.fr

Presentation transcript:

Probabilistic k^m-anonymity (Efficient Anonymization of Large Set-valued Datasets). Gergely Acs (INRIA) gergely.acs@inria.fr, Jagdish Achara (INRIA) jagdish.achara@inria.fr, Claude Castelluccia (INRIA) claude.castelluccia@inria.fr

Overview
- Motivation
- Background: k^m-anonymity
- Why is k^m-anonymity impractical?
- Relaxation of k^m-anonymity: probabilistic k^m-anonymity
- How to anonymize so as to achieve probabilistic k^m-anonymity?
- Performance evaluation
- Conclusions

De-identification
- Personal data is any information relating to an identified or identifiable individual (EU Directive 95/46/EC).
- De-identification breaks the link between individuals' identities and their data (records).
- Regulations apply only to personal data! De-identified data is non-personal data and hence falls outside the regulation.
- NOTE: de-identification does NOT include control of (sensitive) attribute inference.
(Speaker notes: users are present in the intersection of the datasets! In the US, HIPAA deals only with re-identifiability and prescribes best practices: remove personal identifiers, rules on how to generalize birth dates, etc.)

Set-valued data
- Two equivalent representations: a binary table over items Item 1 … Item n, or each record as a set of items, e.g., Rec 1 = {Item 2, Item 3}, Rec 2 = {Item 1, Item 3, Item n}, …
- No direct personal ID in the dataset (e.g., phone numbers).
- Each user has a subset of items (e.g., visited locations, watched movies, purchased items, etc.).
- High-dimensional and sparse data!
- Y.-A. de Montjoye et al. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, March 2013.
- Y.-A. de Montjoye et al. Unique in the shopping mall: On the reidentifiability of credit card metadata. Science, January 2015.
(Speaker note: a full CDR is more detailed; it also includes the ID of the receiver, the call duration, etc.)

Privacy test: Location uniqueness
- Data: each record is a set of visited towers, e.g., Rec 1 = {Tower 2, Tower 3}, Rec 2 = {Tower 1, Tower 3, Tower 5}, …
- Derived from Call Data Records: 4,427,486 users, 1303 towers (i.e., locations), 01/09/2007 – 15/10/2007.
- Mean number of towers per user: 11.42 (std. dev.: 17.23); maximum number of towers per user: 422.

Privacy test: Location uniqueness
- If the adversary knows m towers of a user, what is the probability that the user is the only one who has these towers in the dataset? (A Monte-Carlo sketch of this test follows below.)
- Similar study: Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, March 2013.
(Speaker note: arbitrary external knowledge => perfect if the data is shared with the public!)
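As an illustration of the uniqueness question above, here is a minimal Monte-Carlo sketch in Python. It is an assumption-laden illustration, not the study's procedure: the function name is made up, records are sets of tower identifiers, and the m "known" towers are drawn uniformly from a randomly chosen user's record.

```python
import random

def uniqueness_probability(records, m, trials=10_000):
    """Monte-Carlo estimate of location uniqueness: pick a random user and
    m of their towers; how often is that user the only one whose record
    contains all m of those towers?"""
    records = [frozenset(r) for r in records]
    eligible = [r for r in records if len(r) >= m]   # users with at least m towers
    unique = 0
    for _ in range(trials):
        rec = random.choice(eligible)
        known = frozenset(random.sample(sorted(rec), m))   # adversary's knowledge
        matches = sum(1 for r in records if known <= r)
        if matches == 1:
            unique += 1
    return unique / trials
```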

Background: k^m-anonymity
- For ANY m items, there are at least k users who have these items.
- If m equals the maximum number of items per user, then k^m-anonymity is equivalent to k-anonymity.
- However, k-anonymity suffers from the curse of dimensionality [1] (i.e., very poor utility for high-dimensional, sparse data).
- Rationale of k^m-anonymity: the adversary is unlikely to know all the items of a user.
- Allows better utility by applying fewer generalizations (aggregations). (A naive exhaustive check of the property is sketched below.)
[1] C. C. Aggarwal, On k-Anonymity and the Curse of Dimensionality, VLDB, 2005.
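To make the definition concrete, here is a hedged Python sketch (not from the original paper; the function name and data layout are illustrative) of an exhaustive k^m-anonymity check: it enumerates every m-item combination occurring in some record and verifies that each is supported by at least k records. The combinatorial blow-up in m is exactly what the following slides address.

```python
from itertools import combinations
from collections import Counter

def is_km_anonymous(records, k, m):
    """Exhaustive k^m-anonymity check (illustrative only).

    records: iterable of sets of items (one set per user).
    Returns True iff every m-item combination that appears in at least
    one record appears in at least k records. The number of combinations
    grows exponentially with m, which motivates the sampling-based
    relaxation on the later slides.
    """
    support = Counter()
    for rec in records:
        for itemset in combinations(sorted(rec), m):
            support[itemset] += 1   # each record adds at most 1 per itemset
    return all(count >= k for count in support.values())

# Toy example reusing the "Set-valued data" slide's records:
data = [{"Item 2", "Item 3"}, {"Item 1", "Item 3", "Item n"}]
print(is_km_anonymous(data, k=2, m=2))   # False: e.g. {"Item 2", "Item 3"} is unique
```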

Example: k- vs. k^m-anonymity (the slide's worked example is not reproduced in the transcript).

Problem with k^m-anonymity
- Verifying k^m-anonymity can have exponential complexity in m [1] => impractical if m is large (typically when m ≥ 5).
- The exact speed depends on the structure of the generalization hierarchy and on the dataset itself [1] => DOES NOT WORK FOR MANY REAL-WORLD DATASETS!
[1] M. Terrovitis, N. Mamoulis, P. Kalnis, Privacy-preserving anonymization of set-valued data, VLDB, 2008.

Probabilistic k^m-anonymity
- For ANY m items, there are at least k users who have these items, with probability at least p, where p > 0.9 and typically around 0.99 or 0.999.
- Intuition: instead of checking all possible m-itemsets, we randomly select some of them from the dataset and check k-anonymity of only these samples => we obtain k-anonymity for ANY randomly selected m items with large probability (based on sampling theorems)! (A sketch of this sampling-based test follows below.)
- How to sample these m-itemsets? How many samples are needed?
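The following Python sketch illustrates the sampling-based test just described. It is a sketch under assumptions, not the paper's exact procedure: records are sets of items, and `sample_uniform_itemset` is a placeholder for the uniform MCMC sampler described on the later slides.

```python
def support(itemset, records):
    """Number of records that contain every item of `itemset`."""
    return sum(1 for rec in records if itemset <= rec)

def probably_km_anonymous(records, k, m, n_samples, sample_uniform_itemset):
    """Sampling-based test: draw n_samples m-itemsets uniformly at random
    among those occurring in `records` and check k-anonymity of each.
    If every sample passes, the dataset is declared probabilistically
    k^m-anonymous; the confidence p is governed by n_samples."""
    records = [frozenset(r) for r in records]
    for _ in range(n_samples):
        itemset = sample_uniform_itemset(records, m)
        if support(itemset, records) < k:
            return False
    return True
```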

How to sample m-itemsets?
- Naïve approach: 1. sample a record; 2. sample m items from this record. This is biased towards selecting more popular itemsets (e.g., popular places in location data)! However, the adversary may easily learn unpopular items; e.g., a home address is not necessarily popular…
- Our approach is more general: select among all m-itemsets uniformly at random using a fast-mixing Markov chain (see the MCMC slide at the end). The adversary can then learn any m-itemset with equal probability!

How many samples?
- From the Chernoff-Hoeffding bound, the number of samples N needed to obtain k^m-anonymity with probability p is independent of m, of the dataset size, and of the total number of items!
- p = 0.99: N ≈ 60 K; p = 0.999: N ≈ 5 M; p = 1: N = ∞.

Anonymization
INPUT: p – probability; k, m – privacy parameters; D – dataset.
1. SAMPLING: pick (uniformly at random) a single m-itemset from D using MCMC sampling.
2. IF the sample does NOT satisfy k-anonymity, GENERALIZE an item in the sample such that the generalization error is minimized (e.g., average cell size in location data).
3. REPEAT the above steps until the required number of consecutive samples satisfy k-anonymity.
AMPLIFY UTILITY: execute the above algorithm multiple times and select the run with the least generalization error. (A sketch of this loop follows below.)
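A minimal Python sketch of this loop, under simplifying assumptions (it is not the authors' implementation): `sample_uniform_itemset` is the MCMC sampler from the appendix slide, `generalize_item` is a hypothetical domain-specific helper (e.g., merging a tower cell with its neighbours), and `n_required` is the sample count implied by the chosen p.

```python
def anonymize(records, k, m, n_required, sample_uniform_itemset, generalize_item):
    """Sketch of the sampling-based anonymization loop.

    records: list of sets of items.
    Repeats until n_required consecutive uniformly sampled m-itemsets are
    k-anonymous; whenever a sample is not, one of its items is generalized
    so as to minimize the generalization error.
    """
    consecutive_ok = 0
    while consecutive_ok < n_required:
        itemset = sample_uniform_itemset(records, m)
        if sum(1 for rec in records if itemset <= rec) >= k:
            consecutive_ok += 1                          # sample is k-anonymous
        else:
            records = generalize_item(records, itemset)  # coarsen one item (hypothetical helper)
            consecutive_ok = 0                           # restart the consecutive count
    return records
```

For the utility-amplification step, one would run `anonymize` several times with independent random seeds and keep the output with the smallest total generalization error.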

Running complexity
- The complexity is determined by the required number of samples that must satisfy k-anonymity, by the running time of the Markov-chain sampling for each sample, and by the maximum number of generalizations, which is bounded by the number of possible items.
- Hence, the total complexity is polynomial in the number of records |D|, the number of possible items |I|, m, and the probability p.

Performance evaluation: Privacy guarantee
- RECALL: a user visits about 11 towers on average.
- We can use a different privacy guarantee (i.e., k, p) for different m!
- In the evaluation, for 1 ≤ m ≤ 11: when m ≤ 4, k is 10 or 20 and p = 1 (rationale: it is too easy for an adversary to learn 4 or fewer locations); when m ≥ 5, k is 10 or 20 and p is 0.99, 0.999, or 0 (no guarantee).
- Execution time: a couple of hours in all cases (dominated by the p = 1 part).

Performance evaluation
Privacy GOAL 1: if 1 ≤ m ≤ 4, 20^m-anonymity with prob. 1; if m = 5, 20^m-anonymity with prob. p; if m > 5, p = 0 (no guarantee).
(Figure: results on the original dataset and on the anonymized datasets with p = 0.99 and p = 0.999.)

Performance evaluation
Privacy GOAL 2: if 1 ≤ m ≤ 4, 20^m-anonymity with prob. 1; if 5 ≤ m ≤ 11, 20^m-anonymity with prob. p.
(Figure: results on the original dataset and on the anonymized datasets with p = 0.99 and p = 0.999.)

Average partition size
Average territory of the aggregated cells.
(Figure: results for p = 0.99 and p = 0.999.)

Conclusions
- k^m-anonymity is guaranteed with a certain confidence; adversarial knowledge is limited to any m items.
- The probabilistic relaxation improves scalability and utility.
- We proposed an anonymization algorithm to achieve this guarantee; its running time is polynomial in m, the dataset size, and the universe size.
- Is this enough? If so, how to choose k, m, p? Perform a privacy risk analysis.

Thank You! Q & A

MCMC for sampling m-itemsets
Start with any m-itemset S that exists in the dataset.
REPEAT
1. PROPOSAL:
   1.1 sample a user uniformly at random;
   1.2 select m items C from this user, also uniformly at random.
2. PROBABILISTIC ACCEPTANCE:
   2.1 accept C (i.e., set S = C) with probability min(1, Pr["S is proposed"] / Pr["C is proposed"]).
UNTIL convergence.
The main idea is that, at each step, we use the fast but biased sampling technique to propose a candidate C. We correct this bias probabilistically: the chain is more likely to accept samples that are less likely to be proposed, so the stationary distribution is uniform over all m-itemsets present in the dataset. (A Python sketch follows below.)
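A minimal Python sketch of this Metropolis-Hastings sampler, under assumptions (not the authors' code): records are sets of items, the proposal picks a record and then m of its items uniformly, and the record-based bias is corrected with the acceptance ratio above. This sampler can serve as the `sample_uniform_itemset` placeholder used in the earlier sketches.

```python
import random
from math import comb

def propose(records, m):
    """Biased proposal: pick a record uniformly, then m of its items uniformly."""
    rec = random.choice(records)
    return frozenset(random.sample(sorted(rec), m))

def proposal_prob(itemset, records, m):
    """Probability that the proposal above outputs `itemset`."""
    n = len(records)
    return sum(1.0 / (n * comb(len(rec), m))
               for rec in records if itemset <= rec)

def mcmc_sample_itemset(records, m, steps=1000):
    """Metropolis-Hastings walk whose stationary distribution is uniform
    over all m-itemsets occurring in at least one record."""
    records = [frozenset(r) for r in records if len(r) >= m]
    state = propose(records, m)
    p_state = proposal_prob(state, records, m)
    for _ in range(steps):
        cand = propose(records, m)
        p_cand = proposal_prob(cand, records, m)
        # Uniform target: accept with min(1, Pr[state is proposed] / Pr[cand is proposed]).
        if random.random() < min(1.0, p_state / p_cand):
            state, p_state = cand, p_cand
    return state

# Toy usage on hypothetical data: sample a 2-itemset (approximately) uniformly at random.
toy = [{"Tower 2", "Tower 3"}, {"Tower 1", "Tower 3", "Tower 5"}]
print(mcmc_sample_itemset(toy, m=2, steps=200))
```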

European Data Protection law
- Personal data is any information relating to an identified or identifiable individual; it can be used to identify him or her and to learn his or her habits.
- Account must be taken of all the means available […] to determine whether a person is identifiable.
- Any processing of personal data must be (1) transparent (to the individual), (2) for specified, explicit purpose(s), and (3) relevant and not excessive in relation to these purposes.
- Legally non-binding: all member states have enacted their own data protection legislation.
- Anonymized data is considered non-personal data, and as such the directive does not apply to it.