Background Knowledge Attack for Generalization-based Privacy-Preserving Data Mining.

Presentation transcript:

Background Knowledge Attack for Generalization-based Privacy-Preserving Data Mining

Discussion Outline
- (sigmod08-4) Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification
- (kdd08-4) Composition Attacks and Auxiliary Information in Data Privacy
- (vldb07-4) Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge

Anonymization techniques
- Generalization & suppression
  - With the consistency property: multiple occurrences of the same value are always generalized in the same way (all older methods, and the more recent Incognito).
  - Without the consistency property (Mondrian).
- Anatomy (Tao, VLDB'06)
- Permutation (Koudas, ICDE'07)
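
To make the consistency property concrete, here is a minimal sketch of global recoding: every occurrence of a quasi-identifier value is mapped through the same generalization function, so identical values always generalize identically. The hierarchy and the records are illustrative assumptions, not data from any of the cited papers.

```python
# Minimal global-recoding sketch: one generalization function per QI
# attribute, applied uniformly, so the consistency property holds by
# construction.

def generalize_age(age: int) -> str:
    """Map an exact age to a 10-year interval."""
    lo = (age // 10) * 10
    return f"[{lo}-{lo + 9}]"

def generalize_zip(zipcode: str) -> str:
    """Keep only the first two digits of the zipcode."""
    return zipcode[:2] + "***"

records = [  # hypothetical microdata: (age, zipcode, disease)
    (23, "12000", "flu"),
    (27, "12999", "dyspepsia"),
    (23, "12000", "pneumonia"),  # same QI values as the first record
]

generalized = [(generalize_age(a), generalize_zip(z), s) for a, z, s in records]
print(generalized)
# The two records with age 23 / zip 12000 receive identical generalized
# values -- that is exactly the consistency property.
```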

Anonymization through Anatomy — "Anatomy: Simple and Effective Privacy Preservation" (Xiao & Tao, VLDB'06): release the quasi-identifiers and the sensitive values in two separate tables, linked only by a group ID.
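
A minimal sketch of the anatomy idea on a toy table with groups of size 2. Real Anatomy additionally requires each group to contain l distinct sensitive values; this toy partition just chunks the table in order.

```python
from collections import Counter

# Hypothetical microdata: (age, zipcode, disease)
table = [
    (23, "12000", "flu"),
    (27, "14000", "dyspepsia"),
    (31, "18000", "pneumonia"),
    (36, "19000", "bronchitis"),
]

group_size = 2
qit, st = [], []  # quasi-identifier table, sensitive table
for gid, start in enumerate(range(0, len(table), group_size)):
    group = table[start:start + group_size]
    for age, zipcode, disease in group:
        qit.append((age, zipcode, gid))           # QI values kept exact
    for disease, count in Counter(d for *_, d in group).items():
        st.append((gid, disease, count))          # SA values detached from QIs

print("QIT:", qit)
print("ST: ", st)
# Within group 0, 'flu' and 'dyspepsia' are each equally consistent with
# both tuples, so P(SA | QI) = 1/2 for either individual.
```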

Anonymization through permutation
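
A simplified illustration of permutation-based anonymization in the spirit of Koudas et al. (ICDE'07), not the paper's exact algorithm: within each QI group, the sensitive column is randomly permuted, decoupling sensitive values from exact quasi-identifiers.

```python
import random

# Hypothetical grouped microdata: within each group, decouple QIs from SAs
# by randomly permuting the sensitive column.
groups = [
    [(23, "12000", "flu"), (27, "14000", "dyspepsia")],
    [(31, "18000", "pneumonia"), (36, "19000", "bronchitis")],
]

random.seed(0)
published = []
for group in groups:
    sensitive = [s for *_, s in group]
    random.shuffle(sensitive)  # permute SAs within the group
    published.extend((age, z, s) for (age, z, _), s in zip(group, sensitive))

print(published)
# Exact QIs are published, but each is equally likely to match any SA
# in its group -- the same P(SA | QI) = 1/|group| guarantee as Anatomy.
```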

Background knowledge
- k-anonymity
  - The attacker has access to public databases, i.e., the quasi-identifier values of individuals.
  - The target individual is in the released database.
- l-diversity
  - The homogeneity attack.
  - Background knowledge about some individuals' sensitive attribute values.
- t-closeness
  - The distribution of the sensitive attribute in the overall table.
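
As a concrete illustration of the homogeneity attack: if every tuple in a QI group carries the same sensitive value, membership in the group alone reveals that value. A minimal distinct-l-diversity check over hypothetical groups:

```python
from collections import defaultdict

def distinct_l_diversity(rows, l):
    """Check distinct l-diversity: every QI group has >= l distinct SA values."""
    groups = defaultdict(set)
    for qi, sa in rows:
        groups[qi].add(sa)
    return all(len(values) >= l for values in groups.values())

# Hypothetical 2-anonymous release that is NOT 2-diverse:
rows = [(("2*", "1[2-4]***"), "flu"),
        (("2*", "1[2-4]***"), "flu"),       # homogeneous group -> disclosure
        (("3*", "1[8-9]***"), "pneumonia"),
        (("3*", "1[8-9]***"), "bronchitis")]

print(distinct_l_diversity(rows, 2))  # False: the first group is homogeneous
```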

Types of background knowledge
- Known facts
  - A male patient cannot have ovarian cancer.
- Demographic information
  - It is unlikely that a young patient of certain ethnic groups has heart disease.
  - Some combinations of quasi-identifier values cannot entail certain sensitive attribute values.
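
Such known facts act as hard constraints: within a QI group they eliminate candidate sensitive values outright and thereby sharpen P(SA | QI). A toy elimination step, with a made-up group:

```python
# Hypothetical group in an anatomized release: two males, SAs detached.
group_qis = [("M", 45), ("M", 52)]
group_sas = ["ovarian cancer", "gastric ulcer"]

def possible(qi, sa):
    sex, _age = qi
    return not (sex == "M" and sa == "ovarian cancer")  # known fact

for qi in group_qis:
    candidates = [sa for sa in group_sas if possible(qi, sa)]
    print(qi, "->", candidates)
# Both males map to ['gastric ulcer']: the known fact collapses
# P(SA | QI) from 1/2 to 1 for this group.
```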

Types of background knowledge
- Adversary-specific knowledge
  - The target individual does not have a specific sensitive attribute value, e.g., Bob does not have flu.
  - Sensitive attribute values of some other individuals, e.g., Joe, John, and Mike (Bob's neighbors) have flu.
  - Knowledge about a same-value family.

Some extensions
- Multiple sensitive values per individual, e.g., flu ∈ Bob[S].
  - Basic implications (adopted in Martin et al., ICDE'07) cannot practically express such set membership: |S| − 1 basic implications are needed.
- Probabilistic knowledge vs. deterministic knowledge.

Data sets have three kinds of attributes: Identifier | Quasi-Identifier (QI) | Sensitive Attribute (SA). The question is how much adversaries can know about an individual's sensitive attributes if they know the individual's quasi-identifiers.

So we need to measure P(SA | QI): the inference from Quasi-Identifier (QI) to Sensitive Attribute (SA), in the presence of background knowledge.
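
A minimal sketch of what "measuring P(SA | QI)" means on a published generalized table, ignoring background knowledge for the moment: count, within each generalized QI group, the fraction of tuples carrying each sensitive value. The release below is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical generalized release: (generalized QI, sensitive value)
release = [("[20-29]/12***", "flu"), ("[20-29]/12***", "dyspepsia"),
           ("[30-39]/1[8-9]***", "pneumonia"), ("[30-39]/1[8-9]***", "pneumonia")]

counts = defaultdict(Counter)
for qi, sa in release:
    counts[qi][sa] += 1

for qi, ctr in counts.items():
    total = sum(ctr.values())
    for sa, c in ctr.items():
        print(f"P({sa} | {qi}) = {c / total:.2f}")
# Without background knowledge, the baseline estimate of P(SA | QI) is the
# within-group relative frequency; background knowledge reshapes it.
```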

Impact of background knowledge. Example background knowledge: it is rare for a male to have breast cancer.
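
To see why such a statement matters, a worked update under assumed numbers (all probabilities hypothetical): suppose a 2-tuple QI group contains the sensitive values {breast cancer, flu} and one member is male, with the relative likelihood of breast cancer for a male taken as 0.01.

```latex
% Without knowledge, P(cancer | target in this group) = 1/2. With the
% assumed likelihoods 0.01 (cancer) vs. 0.99 (flu) for a male, Bayes' rule
% gives
\[
P(\text{cancer} \mid \text{male}) \;=\;
\frac{0.01 \cdot \tfrac12}{0.01 \cdot \tfrac12 + 0.99 \cdot \tfrac12}
\;\approx\; 0.01 ,
\]
% so the male member almost surely has flu -- and his group-mate almost
% surely has breast cancer: the knowledge shifts the risk onto the other
% tuple in the group.
```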

[Martin, et al. ICDE'07]: the first formal study of the effect of background knowledge on privacy-preserving data publishing.

Assumption: the attacker has complete information about individuals' non-sensitive data (full identification information).

Name  | Age | Sex | Zipcode | Disease
Andy  | 4   | M   | 12000   | gastric ulcer
Bill  | 5   | M   | 14000   | dyspepsia
Ken   | 6   | M   | 18000   | pneumonia
Nash  | 9   | M   | 19000   | bronchitis
Alice | 12  | F   | 22000   | flu

Rule-based knowledge
- Atom A_i: a predicate about a person and his/her sensitive values; e.g., t_Jack[Disease] = flu says that Jack's tuple has the value flu for the sensitive attribute Disease.
- Basic implication: an implication between atoms (a conjunction of atoms implying a disjunction of atoms).
- Background knowledge: formulated as a conjunction of k basic implications.
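
A minimal sketch of how such rule-based knowledge can be represented and used to prune "possible worlds" (assignments of sensitive values to individuals); the rule syntax here is my own simplification of the paper's language, and the group is hypothetical.

```python
from itertools import permutations

people = ["Andy", "Bill"]
diseases = ["gastric ulcer", "dyspepsia"]

# Atom: (person, value), meaning t_person[Disease] = value.
# Simplified basic implication: if all antecedent atoms hold, at least one
# consequent atom must hold; an empty antecedent means "unconditionally".
knowledge = [
    ([], [("Andy", "dyspepsia")]),  # fact: Andy has dyspepsia
]

def satisfies(world, rule):
    antecedent, consequent = rule
    if all(world[p] == v for p, v in antecedent):
        return any(world[p] == v for p, v in consequent)
    return True

# Possible worlds: assignments of the group's diseases to its members.
worlds = [dict(zip(people, perm)) for perm in permutations(diseases)]
consistent = [w for w in worlds if all(satisfies(w, r) for r in knowledge)]

# Disclosure: the fraction of consistent worlds in which an atom holds.
p = sum(w["Bill"] == "gastric ulcer" for w in consistent) / len(consistent)
print("P(Bill has gastric ulcer) =", p)  # 1.0 once the fact is known
```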

The idea: use k to bound the amount of background knowledge, and compute the maximum disclosure of a bucketized data set with respect to any background knowledge within that bound.

(vldb07-4) [Bee-Chung, et al. VLDB'07]: use a triple (l, k, m) to specify the bound on the background knowledge, rather than a single k.

Introduction
- [Martin, et al. ICDE'07] has the limitation of using a single number k to bound background knowledge.
- This work quantifies an adversary's external knowledge by a novel multidimensional approach.

Problem formulation: bound Pr(t has s | K, D*).
- The data owner has a table of data (denoted by D) and publishes the resulting release candidate D*.
- S: a sensitive attribute; s: a target sensitive value; t: a target individual; K: the adversary's background knowledge.
The new bound specifies that:
- the adversary knows l other people's sensitive values;
- the adversary knows k sensitive values that the target does not have;
- the adversary knows a group of m − 1 people who share the same sensitive value with the target.
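
In these terms, the guarantee being checked can be written as a worst-case bound over all admissible knowledge; this is my paraphrase of the slide, with c an assumed confidence threshold:

```latex
\[
\max_{K \in \mathcal{K}(l,\,k,\,m)} \Pr\!\left(t \text{ has } s \mid K, D^{*}\right) \;\le\; c,
\]
% where K(l, k, m) is the family of background-knowledge sets within the
% bound: at most l other individuals' sensitive values, k values the
% target does not have, and m - 1 people sharing the target's value.
```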

Theoretical framework

(sigmod08-4) [Wenliang, et al. SIGMOD’08]

Introduction
The impact of background knowledge:
- How does it affect privacy?
- How can we measure its impact on privacy?
This work integrates background knowledge into privacy quantification:
- Privacy-MaxEnt: a systematic approach.
- Based on a well-established theory: the maximum entropy estimate.

Challenges
- What do we want to compute? P(S | Q), given the background knowledge and the published data set.
- Directly computing P(S | Q) is hard.

Our approach: consider P(S | Q) as a variable x (a vector). Both the background knowledge and the published data provide public information, which is translated into constraints on x; we then solve for x, taking the most unbiased solution.

Maximum Entropy Principle: "Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum entropy estimate. It is the least biased estimate possible on the given information." — E. T. Jaynes, 1957.

The MaxEnt approach: background knowledge → constraints on P(S | Q); published data → constraints on P(S | Q); the maximum entropy estimate then yields an estimate of P(S | Q).

Entropy: because H(S | Q, B) = H(Q, S, B) − H(Q, B), the constraints should use P(Q, S, B) as the variables.
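
The identity is the chain rule for entropy; writing it out over the joint distribution (standard information theory, not specific to the paper):

```latex
\[
H(S \mid Q, B) \;=\; -\sum_{q,s,b} P(q,s,b)\,\log P(s \mid q,b)
\;=\; H(Q, S, B) - H(Q, B),
\]
% since P(s | q, b) = P(q, s, b) / P(q, b). Constraints stated over
% P(Q, S, B) therefore suffice, and the joint probabilities can serve
% directly as the unknowns of the optimization.
```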

Maximum Entropy Estimate
Let the vector x = P(Q, S, B). Find the value of x that maximizes the entropy H(Q, S, B) while satisfying:
- equality constraints h_1(x) = c_1, …, h_u(x) = c_u;
- inequality constraints g_1(x) ≤ d_1, …, g_v(x) ≤ d_v.
This is a special case of nonlinear programming.
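
A minimal numeric sketch of this nonlinear program; scipy is my choice of solver (the slides mention L-BFGS, TOMLAB, and KNITRO), and the constraint encoding a piece of knowledge is assumed for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy joint distribution over (Q, S) with 2 QI groups x 2 sensitive values,
# flattened into a vector x of 4 probabilities. Hypothetical constraints:
#   - probabilities sum to 1 (normalization)
#   - "knowledge": P(S = s0 | Q = q0) = 0.2, i.e. x[0] = 0.2 * (x[0] + x[1])

def neg_entropy(x):
    x = np.clip(x, 1e-12, 1.0)           # avoid log(0)
    return np.sum(x * np.log(x))         # minimizing this maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda x: np.sum(x) - 1.0},
    {"type": "eq", "fun": lambda x: x[0] - 0.2 * (x[0] + x[1])},
]
bounds = [(0.0, 1.0)] * 4
x0 = np.full(4, 0.25)                     # uniform starting point

result = minimize(neg_entropy, x0, bounds=bounds, constraints=constraints,
                  method="SLSQP")
x = result.x.reshape(2, 2)                # rows: Q values, columns: S values
print("P(Q,S) =\n", x)
print("P(s0|q0) =", x[0, 0] / x[0].sum())  # ~0.2, as constrained
```

Among all joint distributions satisfying the constraints, the solver returns the one with maximum entropy, i.e., the least biased estimate of P(S | Q) consistent with the knowledge.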

Putting them together: background knowledge and the published data are translated into constraints on P(S | Q), and the maximum entropy estimate yields P(S | Q). Solver tools: L-BFGS, TOMLAB, KNITRO, etc.

Conclusion: Privacy-MaxEnt is a systematic method that
- models various types of knowledge,
- models the information from the published data, and
- is based on well-established theory.

(kdd08-4) [Srivatsava, et al. KDD'08]

Introduction
- Reason about privacy in the face of rich, realistic sources of auxiliary information.
- Investigate the effectiveness of current anonymization schemes in preserving privacy when multiple organizations independently release anonymized data.
- Present composition attacks, in which an adversary uses independently anonymized releases to breach privacy.
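
A minimal sketch of the intersection idea behind a composition attack, on hypothetical releases: the target appears in two independently anonymized releases (say, two hospitals); intersecting the sensitive-value sets of the target's groups narrows down the true value.

```python
# Hypothetical: the target's QI group in each independent release, with the
# set of sensitive values published for that group.
release_a_group = {"flu", "diabetes", "hepatitis"}     # hospital A, 3-diverse
release_b_group = {"diabetes", "bronchitis", "ulcer"}  # hospital B, 3-diverse

# Composition: the target's true value must lie in both groups.
candidates = release_a_group & release_b_group
print(candidates)  # {'diabetes'} -- each release alone looked 3-diverse,
                   # but their composition pins the value down exactly.
```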

Summary
What is background knowledge?
- Probability-based knowledge: P(s | q) = 1; P(s | q) = 0; P(s | q) = 0.2; P(s | Alice) = …; … ≤ P(s | q) ≤ 0.5; P(s | q1) + P(s | q2) = 0.7.
- Logic-based knowledge (propositional / first-order / modal logic): e.g., one of Alice and Bob has "lung cancer".
- Numerical data: 50K ≤ salary of Alice ≤ 100K; age of Bob ≤ age of Alice.
- Linked data: degree of a node, topology information, …
- Domain knowledge: the mechanism or algorithm of anonymization used for data publication; anonymized data independently released by other organizations.
- And many, many others…

Summary
How to represent background knowledge? (over the same taxonomy as above)
- Rule-based: [Martin, et al. ICDE'07].
- Other formulations: [Wenliang, et al. SIGMOD'08], [Srivatsava, et al. KDD'08], [Raymond, et al. VLDB'07].
- A general knowledge framework: it is too hard to give a unified framework and a general solution.

Summary
How to quantify background knowledge?
- By the number of basic implications (association rules) [Martin, et al. ICDE'07].
- By a novel multidimensional approach [Bee-Chung, et al. VLDB'07].
- Formulated as linear constraints [Wenliang, et al. SIGMOD'08].
How can one reason about privacy in the presence of external knowledge?
- Quantify the privacy.
- Quantify the degree of randomization required [Charu ICDE'07].
- Quantify the precise effect of background knowledge [Martin, et al. ICDE'07], [Wenliang, et al. SIGMOD'08].

Questions? Thanks to Zhiwei Li