Privacy in Databases, October 2011

Roadmap: Motivation, Core ideas, Extensions

Roadmap: Motivation

Reasons for privacy-preserving data publishing: a vast amount of data is collected nowadays. Estimated user data per day: 8-10 GB of public content and ~4 TB of private content (e-mails, SMSs, content annotations, social networks, …).

Reasons for privacy-preserving data publishing: organizations (hospitals, ministries, internet providers, …) publicly release data concerning individual records (internet searches, medical records, …). The laws oblige these agencies to protect the individuals’ privacy.

Reasons for privacy-preserving data publishing: so, data are stripped of the attributes that can reveal the individuals’ identities. Unfortunately, this is not enough…

Sweeney’s breach of the governor’s medical record: “… In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees. For twenty dollars I purchased the voter registration list for Cambridge Massachusetts and received the information on two diskettes. The rightmost circle in Figure 1 shows that these data included the name, address, ZIP code, birth date, and gender of each voter. This information can be linked using ZIP code, birth date and gender to the medical information, thereby linking diagnosis, procedures, and medications to particularly named individuals. …”

Sweeney’s breach of the governor’s medical record: “… For example, William Weld was governor of Massachusetts at that time and his medical records were in the GIC data. Governor Weld lived in Cambridge Massachusetts. According to the Cambridge Voter list, six people had his particular birth date; only three of them were men; and, he was the only one in his 5-digit ZIP code. …”

AOL’s exposure of a user, August 2006: AOL publicized anonymized data for 21M user queries. Users’ queries showed strong geographic and thematic locality, so researchers could narrow the search further and further based on these queries. Ms. Arnold, 62, would prove to be the user behind the searches for medication, resorts, dogs, family members, …

The context of privacy-preserving data publishing [figure]: detailed microdata T; anonymized public data T*; Bob (the victim), to be hidden; Ben, the benevolent data miner; Alice, the external attacker; Deborah, a star DBA and a TRUSTED data publisher.

Roadmap: Core ideas

Anonymization. To retain privacy one must:
– Remove the attributes that directly identify individuals (name, SSN, …)
– Organize the tuples and the cell values of the data set so that the statistical properties of the data set are retained, while the attacker cannot tell, with statistically meaningful confidence, to which individual a tuple corresponds.

Fundamentals.
– Identifier(s): attribute(s) that explicitly reveal the identity of a person (name, SSN, …). These attributes are removed from the public data set.
– Quasi-identifier: attribute(s) that, if joined with external data, can reveal sensitive information (zip code, birth date, sex, …). Typically accompanied by “generalization hierarchies”.
– Sensitive attribute: contains the values that should be kept private (disease, salary, …).

Generalization hierarchies [figure]

General methods for anonymization.
– “Hide tuples in the crowd”: Generalization, Anatomization.
– “Lies to the attacker, truth to the statistician”: Noise injection, Value perturbation.
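As a toy illustration of the second family, a minimal value-perturbation sketch in Python (the function name, noise scale and sample values are assumptions for illustration, not from the slides):

```python
import random

def perturb_column(values, scale=500.0, seed=7):
    """Value perturbation sketch: add independent zero-mean Gaussian noise to
    each value, so that aggregates (sums, averages) stay roughly correct while
    individual values no longer reflect the truth."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in values]

salaries = [1200, 1350, 2800, 3100, 4050]
noisy = perturb_column(salaries)
print(round(sum(salaries) / 5), round(sum(noisy) / 5))  # averages stay close
```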

Generalization methods.
– Global recoding: all the values of an attribute are generalized to the same hierarchy level [Swee02a] [Sama01] [LeDR05].
– Multidimensional: the values of an attribute can be generalized to different levels, depending on the density of the groups that can be formed; however, each combination of QI values is always generalized to the same value [LeDR06].
– Local recoding: the values of an attribute can be generalized to different levels, depending on the density of the groups that can be formed; in fact, the same combination of QI values can be generalized to different values [Xu+06].
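To make the distinction concrete, here is a minimal sketch of global recoding over a zip-code attribute; the hierarchy (mask one trailing digit per level) and the sample rows are illustrative assumptions:

```python
def generalize_zip(zipcode, level):
    """One hierarchy step per level: replace the last `level` digits with '*'."""
    return zipcode if level == 0 else zipcode[:-level] + "*" * level

def global_recode(rows, attr, level):
    """Global recoding: every value of `attr` is generalized to the SAME level."""
    return [{**row, attr: generalize_zip(row[attr], level)} for row in rows]

rows = [{"zip": "13053", "disease": "flu"},
        {"zip": "13068", "disease": "cancer"}]
print(global_recode(rows, "zip", 2))   # both zips become '130**'
```

Local recoding would instead pick a (possibly different) level per group of rows; the Mondrian sketch later in the section illustrates that idea.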

k-anonymity (TKDE 01, IJUFKS 02). A relation T is k-anonymous when every tuple of the relation is identical to at least k-1 other tuples with respect to its quasi-identifier set of attributes.
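A minimal check of this definition, assuming tuples are represented as Python dictionaries (attribute names and values are illustrative):

```python
from collections import Counter

def is_k_anonymous(rows, qi_attrs, k):
    """True iff every combination of quasi-identifier values occurs in at least
    k tuples, i.e., every tuple is identical to at least k-1 others on the QIs."""
    group_sizes = Counter(tuple(row[a] for a in qi_attrs) for row in rows)
    return all(size >= k for size in group_sizes.values())

rows = [{"age": "3*", "zip": "130**", "disease": "flu"},
        {"age": "3*", "zip": "130**", "disease": "cancer"},
        {"age": "3*", "zip": "130**", "disease": "flu"}]
print(is_k_anonymous(rows, ["age", "zip"], k=3))   # True: one QI group of size 3
```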

Naïve l-diversity. A relation T satisfies the naïve l-diversity property whenever every group of the relation contains at least l different values in its sensitive attributes.
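The corresponding check, again a sketch over dictionary rows with illustrative attribute names:

```python
from collections import defaultdict

def is_naive_l_diverse(rows, qi_attrs, sensitive, l):
    """True iff every QI group contains at least l distinct sensitive values."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[a] for a in qi_attrs)].add(row[sensitive])
    return all(len(values) >= l for values in groups.values())

rows = [{"age": "3*", "zip": "130**", "disease": "flu"},
        {"age": "3*", "zip": "130**", "disease": "cancer"},
        {"age": "3*", "zip": "130**", "disease": "flu"}]
print(is_naive_l_diverse(rows, ["age", "zip"], "disease", l=2))   # True: {flu, cancer}
```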

Information utility. We must block the attackers by satisfying the privacy criterion (k for k-anonymity, l for l-diversity); the fundamental anonymization technique is to hide each individual in a group of identical QI values. At the same time, we must serve the well-meaning users by maximizing information utility, i.e., by minimizing (a) the tuples we remove (see next) and (b) the amount of generalization that we apply to the QI attributes.
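One very simple way to make this trade-off measurable; the scoring below is an assumption for illustration only, not a metric from any of the cited papers:

```python
def utility_cost(num_suppressed, generalization_levels, max_levels):
    """Toy information-loss summary: the number of suppressed tuples, and the
    average fraction of the hierarchy height used per QI attribute (0 = raw
    values, 1 = fully generalized).  Lower is better on both components."""
    gen_loss = sum(l / m for l, m in zip(generalization_levels, max_levels)) / len(max_levels)
    return num_suppressed, gen_loss

print(utility_cost(6, [1, 0, 2], [3, 2, 4]))   # (6, ~0.28)
```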

Generalization vs suppression [figure]: this anonymization suppressed no tuples and guarantees 3-anonymity. What if we want 4-anonymity?

Generalization vs suppression [figure]: with a low generalization height, 6 tuples are suppressed; with a higher height, no tuples are suppressed (the difference is in the work_class field).

Incognito (SIGMOD 2005). Two fundamental ideas can be exploited with hierarchies:
– If a data set generalized at a certain level (e.g., 1345*) is k-anonymous, then it remains k-anonymous at any more general level (e.g., 134**).
– If a data set of N attributes is not k-anonymous when n attributes are not fully generalized (age) and the other N-n are fully generalized (sex, zip), then it is still not k-anonymous when n+1 attributes are not fully generalized (age, sex) and N-n-1 are fully generalized (zip).
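A highly simplified, single-attribute sketch of how the first (roll-up) property prunes the search; this only illustrates the idea, not the full lattice algorithm of the paper, and it reuses the same illustrative zip hierarchy as before:

```python
from collections import Counter

def generalize_zip(zipcode, level):
    """Replace the last `level` digits with '*' (one hierarchy step per digit)."""
    return zipcode if level == 0 else zipcode[:-level] + "*" * level

def minimal_safe_level(zips, k, max_level=5):
    """Scan generalization levels bottom-up (most specific first) and return the
    first level at which every generalized zip occurs at least k times.  By the
    roll-up property, all more general levels are then k-anonymous too, so the
    scan can stop here and no higher level ever needs to be checked."""
    for level in range(max_level + 1):
        counts = Counter(generalize_zip(z, level) for z in zips)
        if all(c >= k for c in counts.values()):
            return level
    return None   # cannot happen once the whole value is masked and k <= len(zips)

print(minimal_safe_level(["13053", "13068", "13090"], k=3))   # 2: '130**' merges all rows
```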

Incognito [figure]: the candidate generalization lattices over birth date, zip code, and sex; combinations of 2 attributes.

Incognito [figure]: combinations of 3 attributes (birth date, zip code, sex), after non-anonymous generalizations have been pruned.

What disease is Bob suffering from? Since Alice is Bob’s neighbor, she knows that Bob is a 31-year-old American male and she knows his zip code. Therefore, Alice knows that Bob’s record is number 9, 10, 11, or 12. Now, all of those patients have the same medical condition (cancer), and so Alice concludes that Bob has cancer (the homogeneity attack). Umeko is a 21-year-old Japanese female whose zip code is also known, so Umeko’s information is contained in record number 1, 2, 3, or 4. Adding background knowledge: it is well known that Japanese have an extremely low incidence of heart disease. Therefore, Alice concludes with near certainty that Umeko has a viral infection (the background-knowledge attack).

l-diversity (ICDE 2006). Every q*-block (group) has:
– at least k tuples;
– at least l well-represented values.
Well-represented? Not all the values in a group are identical (there are at least l of them, l >= 2), and no value is so unlikely to be present that, when l is relatively small, one could infer that some other value must hold.

Well-represented:
– Distinct l-diversity: simply l different values.
– Entropy l-diversity: for each q*-block, the entropy of its sensitive-value distribution, -Σ_s p(q*, s) log p(q*, s) over all sensitive values s, must be at least log(l) for every group (and this can be guaranteed if it holds for the whole table, too).
– Recursive l-diversity: the most frequent values do not appear too frequently and the less frequent ones do not appear too rarely.
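A sketch of the entropy check, following the same dictionary-row convention as the earlier snippets:

```python
import math
from collections import Counter, defaultdict

def is_entropy_l_diverse(rows, qi_attrs, sensitive, l):
    """True iff, in every QI group, the entropy of the sensitive-value
    distribution, -sum_s p(s) * log p(s), is at least log(l)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in qi_attrs)].append(row[sensitive])
    for values in groups.values():
        n = len(values)
        entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
        if entropy < math.log(l):
            return False
    return True

rows = [{"zip": "130**", "disease": d} for d in ["flu", "flu", "cancer", "hepatitis"]]
print(is_entropy_l_diverse(rows, ["zip"], "disease", 2))   # True  (entropy ~1.04 >= ln 2)
print(is_entropy_l_diverse(rows, ["zip"], "disease", 3))   # False (entropy ~1.04 <  ln 3)
```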

Roadmap: Extensions

Mondrian (ICDE 2006). Why must we fully generalize every attribute? Some records lie in dense regions of the (age, zip) space, where anonymity is easily preserved even while giving out more information; others lie in sparse areas and need to be generalized more…
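A minimal sketch of the Mondrian idea for numerical quasi-identifiers; this simplified version only tries to cut the widest dimension at its median (the published algorithm considers all allowable cuts), and the sample records are invented:

```python
def mondrian(records, qi_attrs, k):
    """Recursively split the partition with the widest QI range at its median,
    as long as both halves keep at least k records; summarize each resulting
    partition by per-attribute (min, max) ranges, i.e., local recoding."""
    def split(part):
        # choose the dimension with the widest value range in this partition
        attr = max(qi_attrs, key=lambda a: max(r[a] for r in part) - min(r[a] for r in part))
        ordered = sorted(part, key=lambda r: r[attr])
        mid = len(ordered) // 2
        left, right = ordered[:mid], ordered[mid:]
        if len(left) >= k and len(right) >= k:
            return split(left) + split(right)
        return [part]          # no allowable cut: this partition becomes one group

    return [{a: (min(r[a] for r in p), max(r[a] for r in p)) for a in qi_attrs}
            for p in split(records)]

data = [{"age": 25, "zip": 13053}, {"age": 27, "zip": 13068},
        {"age": 41, "zip": 14850}, {"age": 43, "zip": 14853}]
print(mondrian(data, ["age", "zip"], k=2))   # two groups: young/low-zip vs older/high-zip
```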

Mondrian (ICDE 2006) [figure]: the original data, the partitioning obtained with global recoding, and the partitioning obtained with local recoding.

M-invariance (SIGMOD ’07). If I know that Bob is in group 1 and that he has been taken to the hospital twice, I can deduce bronchitis from the change of his group’s sensitive values across the two releases: {dyspepsia, bronchitis} → {dyspepsia, gastritis}.
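The reasoning behind this kind of re-publication attack is plain set intersection across releases; the sketch below uses illustrative group contents:

```python
# Sensitive values of the victim's QI group in each publication (illustrative).
release_1 = {"dyspepsia", "bronchitis"}
release_2 = {"dyspepsia", "gastritis"}

# Someone known to appear in both releases must carry a value common to both
# of his groups, so the attacker intersects the candidate sets...
common = release_1 & release_2
print(common)                # {'dyspepsia'}

# ...and, by elimination, can also pin down what the other member of the
# first group must have had.
print(release_1 - common)    # {'bronchitis'}
```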

M-invariance [figure]. Across all republications, every individual’s group must exhibit exactly the same set of sensitive values (adding counterfeit tuples where necessary), so that intersecting the releases yields no new information.

Many other extensions: multi-relational privacy, data perturbation, more sophisticated “local recoding” a la Mondrian, and trajectory, set-valued, OLAP, … data.