Personalized Privacy Preservation
Xiaokui Xiao, Yufei Tao
Chinese University of Hong Kong

Privacy preserving data publishing

Microdata. Purposes:
– Allow researchers to effectively study the correlation between various attributes
– Protect the privacy of every patient

Name   Age  Sex  Zipcode  Disease
Andy    4   M    12000    gastric ulcer
Bill    5   M    14000    dyspepsia
Ken     6   M    18000    pneumonia
Nash    9   M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu

A naïve solution

Publish the microdata with the names removed. It does not work, as shown next.

Microdata:

Name   Age  Sex  Zipcode  Disease
Andy    4   M    12000    gastric ulcer
Bill    5   M    14000    dyspepsia
Ken     6   M    18000    pneumonia
Nash    9   M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu

Published table:

Age  Sex  Zipcode  Disease
 4   M    12000    gastric ulcer
 5   M    14000    dyspepsia
 6   M    18000    pneumonia
 9   M    19000    bronchitis
12   F    22000    flu
19   F    24000    pneumonia
21   F    33000    gastritis
25   F    34000    gastritis
28   F    37000    flu
56   F    58000    flu

Inference attack

Published table:

Age  Sex  Zipcode  Disease
 4   M    12000    gastric ulcer
 5   M    14000    dyspepsia
 6   M    18000    pneumonia
 9   M    19000    bronchitis
12   F    22000    flu
19   F    24000    pneumonia
21   F    33000    gastritis
25   F    34000    gastritis
28   F    37000    flu
56   F    58000    flu

An external database (a voter registration list), available to an adversary:

Name   Age  Sex  Zipcode
Andy    4   M    12000
Bill    5   M    14000
Ken     6   M    18000
Nash    9   M    19000
Mike    7   M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

Age, Sex, and Zipcode are quasi-identifier (QI) attributes.

Generalization

Transform each QI value into a less specific form.

A generalized table:

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu

An external database:

Name   Age  Sex  Zipcode
Andy    4   M    12000
Bill    5   M    14000
Ken     6   M    18000
Nash    9   M    19000
Mike    7   M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

Generalization incurs information loss.
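As an illustrative sketch (not the paper's algorithm), a numeric QI value can be generalized by replacing it with the interval of a fixed partition that covers it. The breakpoints below are the age-range upper bounds from the generalized table above; the assumption that the domain starts at 1 matches the slide's ranges.

```python
import bisect

def generalize(value, breakpoints):
    """Map a numeric value to its covering interval.

    breakpoints are the inclusive upper bounds of the partition's
    intervals; the domain is assumed to start at 1 (as in the slide).
    """
    i = bisect.bisect_left(breakpoints, value)
    lo = breakpoints[i - 1] + 1 if i > 0 else 1
    return (lo, breakpoints[i])

# Age partition used in the slide's generalized table.
age_breaks = [5, 10, 20, 25, 60]
print(generalize(4, age_breaks))   # (1, 5)
print(generalize(28, age_breaks))  # (26, 60)
```

The coarser the partition, the larger the intervals and the greater the information loss.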

k-anonymity

The following table is 2-anonymous. Age, Sex, and Zipcode are the quasi-identifier (QI) attributes; Disease is the sensitive attribute. The table contains 5 QI groups.

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu
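A minimal sketch of the k-anonymity check: group the published rows by their QI part and verify every group has at least k rows. The row layout (QI columns first, sensitive value last) is an assumption made for brevity; the data is the 2-anonymous table above.

```python
from collections import Counter

def is_k_anonymous(rows, k):
    """Every QI combination (all columns except the last,
    sensitive one) must occur in at least k rows."""
    groups = Counter(r[:-1] for r in rows)
    return all(count >= k for count in groups.values())

rows = [
    ("[1, 5]",   "M", "[10001, 15000]", "gastric ulcer"),
    ("[1, 5]",   "M", "[10001, 15000]", "dyspepsia"),
    ("[6, 10]",  "M", "[15001, 20000]", "pneumonia"),
    ("[6, 10]",  "M", "[15001, 20000]", "bronchitis"),
    ("[11, 20]", "F", "[20001, 25000]", "flu"),
    ("[11, 20]", "F", "[20001, 25000]", "pneumonia"),
    ("[21, 25]", "F", "[30001, 35000]", "gastritis"),
    ("[21, 25]", "F", "[30001, 35000]", "gastritis"),
    ("[26, 60]", "F", "[35001, 60000]", "flu"),
    ("[26, 60]", "F", "[35001, 60000]", "flu"),
]
print(is_k_anonymous(rows, 2))  # True
print(is_k_anonymous(rows, 3))  # False: every QI group has exactly 2 rows
```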

Drawback of k-anonymity

What is the disease of Linda? Her QI values match only the group [21, 25], and both tuples there carry gastritis, so her disease is disclosed.

A 2-anonymous table:

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu

An external database:

Name   Age  Sex  Zipcode
Andy    4   M    12000
Bill    5   M    14000
Ken     6   M    18000
Nash    9   M    19000
Mike    7   M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

A better criterion: l-diversity

Each QI group
– has at least l different sensitive values
– and even the most frequent sensitive value covers only a small fraction of the group's tuples

A 2-diverse table:

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

An external database:

Name   Age  Sex  Zipcode
Andy    4   M    12000
Bill    5   M    14000
Ken     6   M    18000
Nash    9   M    19000
Mike    7   M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
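A sketch of the first condition (distinct l-diversity): collect the set of sensitive values per QI group and require at least l distinct values in each. The tuple layout mirrors the k-anonymity sketch; the data is the 2-diverse table above.

```python
from collections import defaultdict

def is_l_diverse(rows, l):
    """Distinct l-diversity: each QI group must contain
    at least l different sensitive values."""
    groups = defaultdict(set)
    for *qi, sensitive in rows:
        groups[tuple(qi)].add(sensitive)
    return all(len(values) >= l for values in groups.values())

rows = [
    ("[1, 5]",   "M", "[10001, 15000]", "gastric ulcer"),
    ("[1, 5]",   "M", "[10001, 15000]", "dyspepsia"),
    ("[6, 10]",  "M", "[15001, 20000]", "pneumonia"),
    ("[6, 10]",  "M", "[15001, 20000]", "bronchitis"),
    ("[11, 20]", "F", "[20001, 25000]", "flu"),
    ("[11, 20]", "F", "[20001, 25000]", "pneumonia"),
    ("[21, 60]", "F", "[30001, 60000]", "gastritis"),
    ("[21, 60]", "F", "[30001, 60000]", "gastritis"),
    ("[21, 60]", "F", "[30001, 60000]", "flu"),
    ("[21, 60]", "F", "[30001, 60000]", "flu"),
]
print(is_l_diverse(rows, 2))  # True
print(is_l_diverse(rows, 3))  # False
```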

Motivation 1: Personalization

Andy does not want anyone to know that he had a stomach problem. Sarah does not mind at all if others find out that she had flu.

A 2-diverse table:

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

An external database:

Name   Age  Sex  Zipcode
Andy    4   M    12000
Bill    5   M    14000
Ken     6   M    18000
Nash    9   M    19000
Mike    7   M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

Motivation 2: Non-primary case

Microdata (Andy appears twice):

Name   Age  Sex  Zipcode  Disease
Andy    4   M    12000    gastric ulcer
Andy    4   M    12000    dyspepsia
Ken     6   M    18000    pneumonia
Nash    9   M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu

Motivation 2: Non-primary case (cont.)

A 2-diverse table:

Age       Sex  Zipcode         Disease
4         M    12000           gastric ulcer
4         M    12000           dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

An external database:

Name   Age  Sex  Zipcode
Andy    4   M    12000
Ken     6   M    18000
Nash    9   M    19000
Mike    7   M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

The first QI group contains two distinct diseases, yet both tuples belong to Andy, so his privacy is breached despite 2-diversity.

Motivation 3: SA generalization

How many female patients are there with age above 30?

Estimated from the generalized table (assuming ages uniform in each range): 4 ∙ (60 − 30) / (60 − 21) ≈ 3

Real answer: 1

A generalized table:

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

An external database:

Name   Age  Sex  Zipcode
Andy    4   M    12000
Bill    5   M    14000
Ken     6   M    18000
Nash    9   M    19000
Mike    7   M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
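The over-estimate above follows from the standard uniform assumption for range queries over a generalized group; a minimal sketch (the helper name and signature are illustrative, not from the paper):

```python
def estimated_count(group_size, lo, hi, threshold):
    """Under the uniform assumption, estimate how many tuples of a QI
    group with age range [lo, hi] have age above the query threshold."""
    return group_size * (hi - threshold) / (hi - lo)

# The group [21, 60] holds 4 female tuples; the query asks for age > 30.
est = estimated_count(4, 21, 60, 30)
print(round(est))  # 3, while the real answer is 1
```

The wide range [21, 60] forces the estimate far from the truth, motivating generalizing the sensitive attribute instead of over-stretching the QI ranges.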

Motivation 3: SA generalization (cont.)

Generalization of the sensitive attribute is beneficial in this case.

A better generalized table:

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  flu
56        F    58000           respiratory infection

An external database:

Name   Age  Sex  Zipcode
Andy    4   M    12000
Bill    5   M    14000
Ken     6   M    18000
Nash    9   M    19000
Mike    7   M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

Personalized anonymity

We propose:
– a mechanism to capture personalized privacy requirements
– criteria for measuring the degree of security provided by a generalized table
– an algorithm for generating publishable tables

Guarding node

Andy does not want anyone to know that he had a stomach problem. He can specify "stomach disease" as the guarding node for his tuple. The data publisher should prevent an adversary from associating Andy with "stomach disease".

Name  Age  Sex  Zipcode  Disease        Guarding node
Andy   4   M    12000    gastric ulcer  stomach disease

Guarding node

Sarah is willing to disclose her exact symptom. She can specify Ø as the guarding node for her tuple.

Name   Age  Sex  Zipcode  Disease  Guarding node
Sarah  28   F    37000    flu      Ø

Guarding node

Bill does not have any special preference. He can specify the guarding node of his tuple to be the same as his sensitive value.

Name  Age  Sex  Zipcode  Disease    Guarding node
Bill   5   M    14000    dyspepsia  dyspepsia

A personalized approach

Name   Age  Sex  Zipcode  Disease        Guarding node
Andy    4   M    12000    gastric ulcer  stomach disease
Bill    5   M    14000    dyspepsia      dyspepsia
Ken     6   M    18000    pneumonia      respiratory infection
Nash    9   M    19000    bronchitis     bronchitis
Alice  12   F    22000    flu            flu
Betty  19   F    24000    pneumonia      pneumonia
Linda  21   F    33000    gastritis      gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu            flu

Personalized anonymity

A table satisfies personalized anonymity with a parameter p_breach iff no adversary can breach the privacy requirement of any tuple with a probability above p_breach.

If p_breach = 0.3, then any adversary should have no more than 30% probability of finding out that:
– Andy had a stomach disease
– Bill had dyspepsia
– etc.

Name   Age  Sex  Zipcode  Disease        Guarding node
Andy    4   M    12000    gastric ulcer  stomach disease
Bill    5   M    14000    dyspepsia      dyspepsia
Ken     6   M    18000    pneumonia      respiratory infection
Nash    9   M    19000    bronchitis     bronchitis
Alice  12   F    22000    flu            flu
Betty  19   F    24000    pneumonia      pneumonia
Linda  21   F    33000    gastritis      gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu            flu

Personalized anonymity

Personalized anonymity with respect to a predefined parameter p_breach: an adversary can breach the privacy requirement of any tuple with probability at most p_breach.

Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection

We need a method for calculating the breach probabilities. For example, what is the probability that Andy had some stomach problem?

Combinatorial reconstruction

Assumptions:
– the adversary has no prior knowledge about each individual
– every individual involved in the microdata also appears in the external database

Combinatorial reconstruction

Andy does not want anyone to know that he had some stomach problem. What is the probability that the adversary can find out that "Andy had a stomach disease"?

An external database:

Name   Age  Sex  Zipcode
Andy    4   M    12000
Bill    5   M    14000
Ken     6   M    18000
Nash    9   M    19000
Mike    7   M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

The published table:

Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection

Combinatorial reconstruction (cont.)

Can each individual appear more than once?
– No: the primary case
– Yes: the non-primary case

A reconstruction matches the candidate individuals {Andy, Bill, Ken, Nash, Mike} to the tuples {gastric ulcer, dyspepsia, pneumonia, bronchitis} of the first QI group. In the primary case each individual is matched to at most one tuple; in the non-primary case an individual may be matched to several.

Breach probability (primary)

There are 120 possible reconstructions in total. If Andy is associated with a stomach disease in n_b of them, the probability that the adversary associates Andy with some stomach problem is n_b / 120.

Andy is associated with:
– gastric ulcer in 24 reconstructions
– dyspepsia in 24 reconstructions
– gastritis in 0 reconstructions

Hence n_b = 48, and the breach probability for Andy's tuple is 48 / 120 = 2 / 5.
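The count can be verified by brute force. This is only a sanity-check sketch of the combinatorial argument, not the paper's closed-form computation: in the primary case, a reconstruction assigns the 4 tuples of the QI group to 4 distinct individuals among the 5 candidates.

```python
from fractions import Fraction
from itertools import permutations

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
group = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia"}  # diseases under Andy's guarding node

# Primary case: distinct individuals, so enumerate 5P4 = 120 assignments.
recons = list(permutations(people, len(group)))
breaches = sum(
    1 for assignment in recons
    if any(p == "Andy" and d in stomach for p, d in zip(assignment, group))
)
print(len(recons), breaches, Fraction(breaches, len(recons)))  # 120 48 2/5
```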

Breach probability (non-primary)

There are 625 possible reconstructions in total. Andy is associated with gastric ulcer, dyspepsia, or gastritis in 225 of them, so n_b = 225, and the breach probability for Andy's tuple is 225 / 625 = 9 / 25.
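The same brute-force sanity check works for the non-primary case, where each tuple may independently belong to any of the 5 candidates, giving 5^4 = 625 reconstructions (again a verification sketch, not the paper's formula):

```python
from fractions import Fraction
from itertools import product

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
group = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia"}  # diseases under Andy's guarding node

# Non-primary case: any individual may own any number of tuples.
recons = list(product(people, repeat=len(group)))
breaches = sum(
    1 for assignment in recons
    if any(p == "Andy" and d in stomach for p, d in zip(assignment, group))
)
print(len(recons), breaches, Fraction(breaches, len(recons)))  # 625 225 9/25
```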

Breach probability: Formal results

An external database:

Name   Age  Sex  Zipcode
Andy    4   M    12000
Bill    5   M    14000
Ken     6   M    18000
Nash    9   M    19000
Mike    7   M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

The published table:

Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection


More in our paper

An algorithm for computing generalized tables that:
– satisfy personalized anonymity with a predefined p_breach
– reduce information loss by employing generalization on both the QI attributes and the sensitive attribute

Experiment settings 1

Goal: to show that k-anonymity and l-diversity do not always provide sufficient privacy protection.

Real dataset: cardinality = 100k; attributes Age, Education, Gender, Marital-status, Occupation, Income.

Four configurations: Pri-leaf, Nonpri-leaf, Pri-mixed, Nonpri-mixed.

Degree of privacy protection (Pri-leaf): p_breach = 0.25 (k = 4, l = 4)

Degree of privacy protection (Nonpri-leaf): p_breach = 0.25 (k = 4, l = 4)

Degree of privacy protection (Pri-mixed): p_breach = 0.25 (k = 4, l = 4)

Degree of privacy protection (Nonpri-mixed): p_breach = 0.25 (k = 4, l = 4)

Experiment settings 2 Goal: To show that applying generalization on both the QI attributes and the sensitive attribute will lead to more effective data analysis

Accuracy of analysis (no personalization)

Accuracy of analysis (with personalization)

Conclusions

– k-anonymity and l-diversity are not sufficient for the non-primary case
– Guarding nodes allow individuals to describe their privacy requirements better
– Generalization on the sensitive attribute is beneficial

Thank you! Datasets and implementation are available for download at