m-Invariance and Dynamic Datasets. Based on: Xiaokui Xiao, Yufei Tao, "m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets". Presented by Slawomir Goryczka.

Panta rhei (Heraclitus): "everything is in a state of flux". To provide the most recent anonymized data, the publisher needs to re-publish it. Most current approaches do not consider this! Exception (supports only insertions of data): J.-W. Byun, Y. Sohn, E. Bertino, and N. Li, Secure anonymization for incremental datasets (2006). Where is the problem?

Maybe it's simple? We just need to ensure that: the dataset is not published too often (the "movie effect"); we use a different algorithm for each dataset snapshot ("white" noise instead of the movie effect, but it may still be used to identify part of the data!); we play with the data to keep similar statistics of attribute values – but what about long-term trends, e.g. a flu pandemic, which change the global and local statistics of the data?

Deletion of tuples Deleting data may introduce a critical absence: by comparing snapshots, an adversary can infer that Bob has dyspepsia. Solution(?): ignore deletions.

Counterfeit generalization Add counterfeit tuples to avoid critical absence. Publish the number and location of these tuples (to preserve utility).

Counterfeit generalization (continued) To preserve privacy, it is crucial to ensure a certain invariance across all quasi-identifier (QI) groups that a tuple (here: Bob's tuple) is generalized into in different snapshots. Existing generalization schemes are special cases of counterfeited generalization with no counterfeits. Goal: minimize the number of counterfeit tuples while ensuring privacy across all snapshots. How?

m-Invariance m-unique: each QI group in the anonymized table T*(j) contains ≥ m tuples, all with different sensitive values. m-invariant: T*(j) is m-unique for all 1 ≤ j ≤ n, and for each tuple t, in every snapshot where t appears, its generalized QI group has the same set of distinct sensitive values. (Since each generalized QI group keeps a constant set of distinct sensitive values, there are no problems with critical absence, but each tuple has a limited number of QI groups it can belong to.)
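The m-uniqueness condition can be sketched as a simple check. This is a minimal illustration, not the paper's implementation: it assumes a simplified representation where each published tuple is a (qi_group_id, sensitive_value) pair, and the name `is_m_unique` is invented here.

```python
from collections import defaultdict

def is_m_unique(table, m):
    """Every QI group must hold at least m tuples, all with
    pairwise-distinct sensitive values."""
    groups = defaultdict(list)
    for gid, sv in table:
        groups[gid].append(sv)
    for values in groups.values():
        # a group fails if it is too small or repeats a sensitive value
        if len(values) < m or len(set(values)) != len(values):
            return False
    return True

snapshot = [(1, "flu"), (1, "dyspepsia"), (2, "bronchitis"), (2, "gastritis")]
print(is_m_unique(snapshot, 2))  # True
```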

Privacy disclosure risk Privacy disclosure risk for a tuple t: risk(t) = nis(t)/nrs, where nis(t) is the number of reasonable surjective functions that correctly reconstruct t, and nrs is the number of all reasonable surjections.

m-Invariance (properties) If {T*(1), ..., T*(n)} is m-invariant, then risk(t) ≤ 1/m for every tuple t in any T*(i), 1 ≤ i ≤ n. If {T*(1), ..., T*(n-1)} is m-invariant, then {T*(1), ..., T*(n)} is m-invariant if and only if: T*(n) is m-unique, and for any tuple, its generalized QI groups in snapshots T*(n-1) and T*(n) have the same signature (set of distinct sensitive values).
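The incremental condition above (same signature for every surviving tuple) can be sketched as follows. This is an illustration under an assumed representation — each snapshot is a dict mapping tuple_id to (qi_group_id, sensitive_value) — and the function names are not from the paper; m-uniqueness of T*(n) would be checked separately.

```python
def group_signatures(snapshot):
    """qi_group_id -> set of distinct sensitive values (the signature)."""
    sigs = {}
    for gid, sv in snapshot.values():
        sigs.setdefault(gid, set()).add(sv)
    return sigs

def invariance_step_ok(prev, curr):
    """Every tuple published in both T*(n-1) and T*(n) must sit in a
    QI group with an identical signature in the two snapshots."""
    prev_sigs, curr_sigs = group_signatures(prev), group_signatures(curr)
    return all(prev_sigs[prev[t][0]] == curr_sigs[curr[t][0]]
               for t in prev.keys() & curr.keys())

prev = {"bob": (1, "dyspepsia"), "alice": (1, "flu")}
curr = {"bob": (7, "dyspepsia"), "carol": (7, "flu")}
print(invariance_step_ok(prev, curr))  # True: Bob's signature is preserved
```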

m-Invariant algorithm The n-th publication is allowed only if T(n)-T(n-1) is m-eligible, that is, at most 1/m of the tuples in T(n)-T(n-1) share an identical sensitive value. Algorithm (4 phases): 1. Division 2. Balancing 3. Assignment 4. Split
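The m-eligibility precondition is a simple frequency test. A minimal sketch, assuming inserted tuples are (tuple_id, sensitive_value) pairs; the name `is_m_eligible` follows the slide's terminology but the code is illustrative:

```python
from collections import Counter

def is_m_eligible(new_tuples, m):
    """T(n)-T(n-1) is m-eligible if no single sensitive value accounts
    for more than a 1/m fraction of the inserted tuples."""
    if not new_tuples:
        return True
    counts = Counter(sv for _tid, sv in new_tuples)
    return max(counts.values()) * m <= len(new_tuples)

inserts = [("t1", "flu"), ("t2", "flu"), ("t3", "gastritis"), ("t4", "bronchitis")]
print(is_m_eligible(inserts, 2))  # True: "flu" is exactly half, not more
```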

m-Invariant algorithm (continued) Division – group the tuples common to T*(n-1) and T(n) that have the same signature into one bucket. Balancing – balance the number of tuples in the buckets, using counterfeits if necessary (counterfeits carry no values for QI attributes).
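The Division and Balancing phases can be sketched as below. This is a simplified illustration, not the paper's algorithm: tuples are (tuple_id, sensitive_value) records, the previous publication is a dict tuple_id -> (qi_group_id, sensitive_value), counterfeits are represented as tuples with id None, and balancing here pads every signature value to the same count so the bucket can later be split evenly.

```python
from collections import Counter

def divide(prev, current_ids):
    """Bucket tuples surviving from T*(n-1) into T(n) by the signature
    of their previous QI group."""
    sigs = {}
    for gid, sv in prev.values():
        sigs.setdefault(gid, set()).add(sv)
    buckets = {}
    for tid, (gid, sv) in prev.items():
        if tid in current_ids:
            buckets.setdefault(frozenset(sigs[gid]), []).append((tid, sv))
    return buckets

def balance(bucket, sig):
    """Pad the bucket with counterfeit tuples (id None, no QI values) so
    every sensitive value of the signature appears equally often."""
    counts = Counter(sv for _tid, sv in bucket)
    target = max(counts[v] for v in sig)
    for v in sig:
        bucket += [(None, v)] * (target - counts[v])
    return bucket
```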

m-Invariant algorithm (continued) Assignment – add the tuples that are in T(n) but were not in T*(n-1), using steps similar to Division and Balancing. Split – split each bucket B into |B|/s generalized QI groups, where s (≥ m) is the number of values in the signature of B. Each group has s tuples, taking the s sensitive values in the signature, respectively.
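The Split phase on a balanced bucket can be sketched as follows (same assumed (tuple_id, sensitive_value) representation as above; the grouping strategy shown — i-th tuple of each value goes to group i — is one simple way to realize "one tuple per signature value per group", not necessarily the paper's choice):

```python
def split(bucket, sig):
    """Split a balanced bucket B into |B|/s QI groups, s = |sig|, each
    group taking exactly one tuple per sensitive value in the signature."""
    by_value = {v: [t for t in bucket if t[1] == v] for v in sig}
    n_groups = len(bucket) // len(sig)
    return [[by_value[v][i] for v in sorted(sig)] for i in range(n_groups)]

bucket = [("t1", "flu"), ("t2", "gastritis"), ("t3", "flu"), ("t4", "gastritis")]
groups = split(bucket, {"flu", "gastritis"})
print(len(groups))  # 2 groups, each with one "flu" and one "gastritis" tuple
```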

Experiments Datasets (Tooc, Tsal): 400k tuples (600k in total). Attributes: Age, Gender, Education, Birthplace, Occupation, Salary.

Pros and cons Pros: incremental; small data disturbance; high data utility (measured as the median relative error of queries); ... Cons: preserves the current statistics of attribute values – what if they change? What about continuous (numerical) attributes? ...

Q & I* (* Ideas)