Privacy-preserving data publishing


Privacy-preserving data publishing
Paper presenter: Erik Wang
Discussion leader: XiaoXiao Ma

Overview
- Research background
- Paper walkthrough
- Key techniques: anonymization, information loss metric, privacy models
- Conclusions

Research background
Objective: when publishing data, sensitive information about individuals should not be disclosed.
Solution: modify the data so that an adversary cannot combine the published data with his background knowledge to infer sensitive information.
This is not a passive data leak but a deliberate release, so protection must be built into the published data itself.

Overview
- Research background
- Paper walkthrough
- Key techniques: anonymization, information loss metric, privacy models
- Conclusions

Paper walkthrough
This paper comprises the first two chapters of the book Privacy-Preserving Data Publishing: An Overview, published in 2010. It asks:
- How does the data owner modify the data?
- How does the data owner guarantee that the modified data contain no sensitive information?
- How much does the data need to be modified so that no sensitive information remains?
Chapter 1 gives the background of the research. Chapter 2 introduces the core concepts:
- Technique: anonymization
- Metric: information loss metric
- Model: privacy models

Identifying the problem
Neither law nor ethics alone prevents the misuse of published data, so technical safeguards are needed.

Concepts
- Sensitive (data, value, tuple, attribute, ...): information that would compromise privacy; something an individual is not willing to share with others.
- Quasi-identifier (a.k.a. QI): attributes that, in combination, can serve as an identifier for an individual.

Overview
- Research background
- Paper walkthrough
- Key techniques: anonymization, information loss metric, privacy models
- Conclusions

Anonymization
- Grouping-and-breaking: break the exact linkage between QI values and sensitive values.
- Perturbation: change values to, or generate, fake values.

Grouping-and-breaking
- Suppression: change a value to ANY, denoted by *.
- Generalization: change a value to another categorical value denoting a broader concept of the original one. Global vs. local recoding determines how deep (and how uniformly) the generalization is applied.
- Bucketization (breaking): divide the data into partitions, replace sensitive values with IDs, and generate a separate sensitive table that connects to the main table through those IDs.
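To make these operations concrete, here is a minimal Python sketch of suppression, generalization, and bucketization. The toy records, the 5-year age ranges, and the one-level job taxonomy are illustrative assumptions, not taken from the paper.

```python
# Sketch only: toy data and taxonomy are our assumptions, not from the paper.

def suppress(value):
    """Suppression: replace the value with ANY, denoted by *."""
    return "*"

def generalize_age(age):
    """Generalization of a numeric value to a 5-year range."""
    low = (age // 5) * 5
    return f"{low}-{low + 4}"

JOB_TAXONOMY = {"nurse": "medical", "doctor": "medical", "teacher": "education"}

def generalize_job(job):
    """Generalization of a categorical value to a broader concept."""
    return JOB_TAXONOMY.get(job, "*")

def bucketize(records, qi_attrs, sensitive_attr, bucket_size=2):
    """Bucketization: partition the records into buckets, publish the QI
    values with a bucket ID, and publish the sensitive values in a
    separate table keyed by the same ID, breaking the exact linkage."""
    qi_table, sens_table = [], []
    for i, rec in enumerate(records):
        bid = i // bucket_size
        qi_table.append({**{a: rec[a] for a in qi_attrs}, "bucket": bid})
        sens_table.append({"bucket": bid, sensitive_attr: rec[sensitive_attr]})
    return qi_table, sens_table

records = [{"age": 37, "job": "nurse", "disease": "HIV"},
           {"age": 42, "job": "teacher", "disease": "flu"}]
print([generalize_age(r["age"]) for r in records])   # ['35-39', '40-44']
print(suppress("nurse"), generalize_job("nurse"))    # * medical
print(bucketize(records, ["age", "job"], "disease"))
```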

Grouping-and-breaking: trade-offs
- Suppression. Advantage: easy to use; hides the value completely. Drawback: overkill, since all information in the value is lost.
- Generalization (e.g., a numeric value to a range, a categorical value to a broader concept). Advantage: retains partial information about the original value. Drawbacks: extra work to maintain the taxonomy; the original exact value is lost.
- Global recoding. Advantage: a consistent representation of the anonymized table. Drawback: more information loss.
- Local recoding. Advantage: the generated table is more similar to the original table, so data analysis based on it is more accurate. Drawback: cannot give as consistent a representation of the anonymized table as global recoding.
- Bucketization. Advantage: users can obtain the original specific values for data analysis. Drawbacks: needs an extra table; analyzing the data generated by bucketization requires more sophisticated processing.

Perturbation
- Adding noise: applicable to numeric attributes. If the original numeric value is v, adding noise changes it to v + ∆, where ∆ follows some distribution.
- Swapping: swap the two values (of the same attribute) of any two tuples in the dataset.
- Model-fitting-and-regenerating: modeling, then parameter estimation, then data regeneration. Example: condensation clusters the data, finds each cluster's center, radius, and size, and then regenerates new data according to the cluster.
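A minimal sketch of the three perturbation techniques, assuming Gaussian noise and a uniform regeneration model (both our choice of distribution; the paper only requires "some distribution"):

```python
import random

def add_noise(values, sigma=1.0):
    """Adding noise: replace each v with v + delta, delta ~ N(0, sigma^2)."""
    return [v + random.gauss(0.0, sigma) for v in values]

def swap(values):
    """Swapping: exchange one attribute's values between two random tuples;
    the attribute's domain stays unchanged."""
    out = list(values)
    i, j = random.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out

def condense(values, k=4):
    """Model-fitting-and-regenerating (condensation-style): fit a simple
    cluster model (center, radius, size) and regenerate k new points."""
    center = sum(values) / len(values)
    radius = max(abs(v - center) for v in values)
    return [random.uniform(center - radius, center + radius) for _ in range(k)]

ages = [34, 36, 41, 44]
print(add_noise(ages, sigma=2.0))
print(swap(ages))
print(condense(ages))
```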

Perturbation: trade-offs
- Adding noise. Advantage: maintains some statistical information, such as means and correlations. Drawback: may introduce values that do not exist in the real world.
- Swapping. Advantage: the domain of each single attribute remains unchanged after swapping. Drawback: the combination of a swapped value with a tuple's other attribute values may not exist in the original data.
- Regeneration. Advantage: the statistics of the data captured by the model are maintained. Drawback: may generate tuples that do not exist in the real data.

Overview
- Research background
- Paper walkthrough
- Key techniques: anonymization, information loss metric, privacy models
- Conclusions

Information loss metric
- The cost of anonymization is given by the distortion ratio of the resulting data set.
- Whenever the value of an attribute of a tuple is generalized, distortion is incurred. Let d_{i,j} be the distortion of the value of attribute A_i of tuple t_j.
- The distortion of the whole data set is dis = Σ_{i,j} d_{i,j}.
- The distortion ratio is dis_dataset / dis_fully_generalized, where the denominator is the distortion of the table in which every value is generalized to the root of its taxonomy.

Information loss metric: worked example
- Distortion of the anonymized table: 4 + 3 = 7.
- Distortion of the fully generalized table: 3 × 6 = 18.
- Distortion ratio: 7 / 18 = 38.89%.
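The same computation as a short Python sketch, assuming the common convention that d_{i,j} counts how many taxonomy levels a value was generalized (one interpretation; other distortion definitions are possible). The matrix below is invented to reproduce the 7/18 example:

```python
def distortion(d):
    """dis = sum of d_ij over all tuples and attributes."""
    return sum(sum(row) for row in d)

# d_ij for a toy table of 6 tuples x 3 QI attributes with 1-level taxonomies:
# attribute 1 contributes 4, attribute 2 contributes 3, attribute 3 nothing.
anonymized = [[1, 1, 0], [1, 0, 0], [1, 1, 0],
              [1, 0, 0], [0, 1, 0], [0, 0, 0]]
fully_generalized = [[1, 1, 1]] * 6   # every value generalized to the root

dis = distortion(anonymized)              # 4 + 3 = 7
dis_full = distortion(fully_generalized)  # 3 * 6 = 18
print(f"distortion ratio = {dis}/{dis_full} = {dis / dis_full:.2%}")  # 38.89%
```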

Overview
- Research background
- Paper walkthrough
- Key techniques: anonymization, information loss metric, privacy models
- Conclusions

Privacy models: k-anonymity
- A QI-group satisfies k-anonymity if its size is at least k. A table T satisfies k-anonymity (is k-anonymous) if each of its QI-groups does.
- The objective of k-anonymity is to make each individual indistinguishable from at least k − 1 other individuals in the table.
- Limitation: sensitive attributes are not protected. In a 2-anonymous table, if every tuple in the QI-group (female, NA, 36-40) has the sensitive value HIV, an adversary who knows that a person falls in this group can deduce that the person has HIV.
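A minimal k-anonymity checker in Python (a sketch; the table and attribute names are invented to mirror the slide's example):

```python
from collections import Counter

def is_k_anonymous(table, qi_attrs, k):
    """True iff every QI-group (tuples sharing all QI values) has size >= k."""
    groups = Counter(tuple(row[a] for a in qi_attrs) for row in table)
    return all(size >= k for size in groups.values())

table = [
    {"sex": "F", "region": "NA", "age": "36-40", "disease": "HIV"},
    {"sex": "F", "region": "NA", "age": "36-40", "disease": "HIV"},
    {"sex": "M", "region": "NA", "age": "31-35", "disease": "flu"},
    {"sex": "M", "region": "NA", "age": "31-35", "disease": "fever"},
]
print(is_k_anonymous(table, ["sex", "region", "age"], k=2))  # True
# ...yet every tuple in the first QI-group has HIV: the table is
# 2-anonymous but the sensitive attribute is not protected.
```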

Privacy models: l-diversity
- A QI-group satisfies l-diversity if the probability that any tuple in the group is linked to a given sensitive value is at most 1/l. The table satisfies l-diversity if each QI-group does.
- 1-diversity is meaningless: every group trivially satisfies it.
- Example of a 2-diversity table: in one QI-group p = 1/2 = 1/l, and in another p = 2/4 = 1/l, so l = 2 for both.
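A sketch of an l-diversity check, reading "at most 1/l" as a bound on the frequency of the most common sensitive value in each QI-group (one standard instantiation; the table values are invented to match the slide's example):

```python
from collections import Counter, defaultdict

def is_l_diverse(table, qi_attrs, sensitive, l):
    """True iff in every QI-group the most frequent sensitive value
    has frequency at most 1/l."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[a] for a in qi_attrs)].append(row[sensitive])
    return all(max(Counter(v).values()) / len(v) <= 1.0 / l
               for v in groups.values())

table = [
    {"sex": "F", "age": "36-40", "disease": "HIV"},
    {"sex": "F", "age": "36-40", "disease": "flu"},
    {"sex": "M", "age": "31-35", "disease": "flu"},
    {"sex": "M", "age": "31-35", "disease": "fever"},
    {"sex": "M", "age": "31-35", "disease": "flu"},
    {"sex": "M", "age": "31-35", "disease": "fever"},
]
print(is_l_diverse(table, ["sex", "age"], "disease", l=2))  # True
# group 1: p = 1/2; group 2: p = 2/4 = 1/2 -- both satisfy 2-diversity.
```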

Privacy models: (α, k)-anonymity
- Given a real number α ∈ [0, 1] and a positive integer k, a QI-group G satisfies (α, k)-anonymity if G contains at least k tuples and the frequency (as a fraction) of each sensitive value in G is at most α.
- With α = 1/l this reduces to a simplified l-diversity model.
- Example: a (0.5, 2)-anonymity table.

Privacy models: monotonicity
- Let R be a privacy model. R satisfies the monotonicity property if, for any two QI-groups G1 and G2 that each satisfy R, the QI-group obtained by merging all tuples of G1 and G2 also satisfies R.
- Both 2-diversity and (0.5, 2)-anonymity satisfy monotonicity.
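A small numeric illustration of why l-diversity is monotone: by the mediant inequality, the most frequent value's fraction in a merged group cannot exceed the larger of the two groups' fractions (a sketch; the sample values are ours):

```python
from collections import Counter

def max_fraction(values):
    """Fraction of the most frequent sensitive value in a group."""
    return max(Counter(values).values()) / len(values)

g1 = ["HIV", "flu"]                     # 2-diverse: max fraction 1/2
g2 = ["flu", "fever", "flu", "fever"]   # 2-diverse: max fraction 1/2
merged = g1 + g2
assert max_fraction(merged) <= max(max_fraction(g1), max_fraction(g2))
print(max_fraction(merged))  # 0.5 -> the merged QI-group is still 2-diverse
```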

Privacy models: numeric sensitive attributes
- Straightforward approach: transform the numeric attribute into a categorical one and then anonymize it; this leads to information loss.
- (k, e)-anonymity model: each QI-group has size at least k and a range of sensitive values of at least e.
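A sketch of a (k, e)-anonymity check (the salary figures and attribute names are invented):

```python
from collections import defaultdict

def is_k_e_anonymous(table, qi_attrs, sensitive, k, e):
    """True iff every QI-group has size >= k and sensitive range >= e."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[a] for a in qi_attrs)].append(row[sensitive])
    return all(len(v) >= k and max(v) - min(v) >= e
               for v in groups.values())

table = [{"dept": "A", "salary": 3000}, {"dept": "A", "salary": 8000},
         {"dept": "B", "salary": 4000}, {"dept": "B", "salary": 9000}]
print(is_k_e_anonymous(table, ["dept"], "salary", k=2, e=4000))  # True
```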

Privacy models: (ε, m)-anonymity
- ε is a non-negative real number and m is a positive integer.
- Each QI-group G must satisfy: for each sensitive numeric value s that appears in G, the frequency (as a fraction) of the tuples with sensitive values close to s is at most 1/m, where closeness is captured by ε.
- Absolute difference: s1 is close to s2 if |s1 − s2| ≤ ε. Relative difference: s1 is close to s2 if |s1 − s2| ≤ ε·s2.
- (ε, m)-anonymity does not obey the monotonicity property: merging two conforming groups can push the fraction of close values above 1/m (e.g., 3/4 > 1/2).
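A sketch of an (ε, m)-anonymity check using the absolute-difference notion of closeness (the relative variant would test |s1 − s2| ≤ ε·s2 instead; the data are invented):

```python
from collections import defaultdict

def is_eps_m_anonymous(table, qi_attrs, sensitive, eps, m):
    """True iff, in every QI-group and for every sensitive value s in it,
    the fraction of tuples with values within eps of s is at most 1/m."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[a] for a in qi_attrs)].append(row[sensitive])
    for values in groups.values():
        for s in values:
            close = sum(1 for v in values if abs(v - s) <= eps)
            if close / len(values) > 1.0 / m:
                return False
    return True

table = [{"dept": "A", "salary": 3000}, {"dept": "A", "salary": 3100},
         {"dept": "A", "salary": 3200}, {"dept": "A", "salary": 8000}]
print(is_eps_m_anonymous(table, ["dept"], "salary", eps=500, m=2))
# False: 3/4 of the group clusters within eps of 3100, and 3/4 > 1/2.
```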

Privacy models: personalized privacy
- Each individual can state a preference on the protection of his/her sensitive value, expressed as a guarding node in the taxonomy of the sensitive attribute.
- Any QI-group in the published table that may contain the individual should contain at most a fraction 1/l of tuples whose sensitive values fall under the guarding node.
- This is a variation of l-diversity, and it satisfies the monotonicity property.
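A sketch of the personalized-privacy check, using a tiny invented taxonomy: for each tuple's guarding node, the fraction of tuples in its QI-group whose sensitive value falls under that node must be at most 1/l.

```python
from collections import defaultdict

# Tiny invented taxonomy of the sensitive attribute (child -> parent).
PARENT = {"HIV": "infectious", "flu": "infectious",
          "ulcer": "stomach", "gastritis": "stomach",
          "infectious": "disease", "stomach": "disease"}

def under(value, node):
    """True iff value equals node or is a descendant of it in the taxonomy."""
    while value is not None:
        if value == node:
            return True
        value = PARENT.get(value)
    return False

def satisfies_personalized(table, qi_attrs, sensitive, guard, l):
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[a] for a in qi_attrs)].append(row)
    for rows in groups.values():
        for row in rows:
            covered = sum(1 for r in rows if under(r[sensitive], row[guard]))
            if covered / len(rows) > 1.0 / l:
                return False
    return True

table = [{"age": "36-40", "disease": "HIV", "guard": "infectious"},
         {"age": "36-40", "disease": "flu", "guard": "flu"},
         {"age": "36-40", "disease": "ulcer", "guard": "ulcer"},
         {"age": "36-40", "disease": "gastritis", "guard": "stomach"}]
print(satisfies_personalized(table, ["age"], "disease", "guard", l=2))
# True: no guarding node covers more than 1/2 of its QI-group.
```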

Privacy models: multiple QI attributes
- Extends the QI attributes from a single external source to multiple sources.
- The model satisfies the monotonicity property.

Privacy models: free-form anonymity
- Based on whether a value is easily observable: if a value is easily observed, it is assumed to be non-sensitive and is regarded as a quasi-identifier; otherwise, it is regarded as sensitive.
- In effect, this adds a condition to the definition of "sensitive".

Publishing additional tables
- Publish some additional tables that are not sensitive at all, so that together the tables provide better utility.
- Example: from the extra table you may learn that there are two additional male tuples, but you cannot determine which two they are.

Conclusion
- Fundamental concepts underlie all approaches to privacy-preserving data publishing.
- How to modify the data: suppression, generalization, bucketization, and perturbation.
- How to minimize the information loss of the modified data while still satisfying a privacy model.

Questions and Discussion