Differentially Private Marginals Release with Mutual Consistency and Error Independent of Sample Size. Cynthia Dwork, Microsoft.


Differentially Private Marginals Release with Mutual Consistency and Error Independent of Sample Size. Cynthia Dwork, Microsoft.

Full Papers • Privacy, Accuracy, and Consistency Too: A Holistic Solution to Contingency Table Release • Barak, Chaudhuri, Dwork, Kale, McSherry, and Talwar • ACM SIGMOD/PODS 2007 • Differentially Private Marginals Release with Mutual Consistency and Error Independent of Sample Size • Dwork, McSherry, and Talwar • This Workshop

Release of Contingency Table Marginals • Simultaneously ensure: • Consistency • Accuracy • Differential Privacy

Release of Contingency Table Marginals • Simultaneously ensure: • Consistency • Accuracy • Differential Privacy • Terms To Define: • Contingency Table • Marginal • Consistency • Accuracy • Differential Privacy

Contingency Tables and Marginals • Contingency Table: Histogram / Table of Counts • Each respondent (member of the data set) is described by a vector of k (binary) attributes • Population count in each of the 2^k cells • One cell for each setting of the k attributes • [figure: 2^3 cube of cells indexed by attributes A1, A2, A3]

Contingency Tables and Marginals • Contingency Table: Histogram / Table of Counts • Each respondent (member of the data set) is described by a vector of k (binary) attributes • Population count in each of the 2^k cells • One cell for each setting of the k attributes • Marginal: sub-table • Specified by a set of j ≤ k attributes, e.g., j = 1 • Histogram of the population in each of 2^j (e.g., 2) cells • One cell for each setting of the j selected attributes • A1 = 0: 3, A1 = 1: 4, so the A1 marginal is (3, 4) • [figure: the A1 marginal of the cube]
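
A minimal sketch of both objects in Python (the seven-respondent data set is hypothetical, chosen so that the A1 marginal comes out to (3, 4) as on the slide):

```python
from collections import Counter
from itertools import product

k = 3
# hypothetical data set D: 7 respondents, each a vector of k binary attributes
D = [(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]

# contingency table T: one count for each of the 2^k settings of the attributes
counts = Counter(D)
table = {cell: counts[cell] for cell in product((0, 1), repeat=k)}

def marginal(table, attrs):
    """Sub-table over a set of j attributes: sum out the other k - j."""
    M = Counter()
    for cell, count in table.items():
        M[tuple(cell[i] for i in attrs)] += count
    return dict(M)

print(marginal(table, [0]))  # A1 marginal: {(0,): 3, (1,): 4}
```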

All the Notation for the Entire Talk • D: the real data set • n: number of respondents • k: number of attributes • T = T(D): the contingency table representing D (2^k cells) • T*: contingency table of a fictional data set • M = M(T): a collection of marginals of T • M3: collection of all 3-way marginals • R = R(M) = R(M(T)): reported marginals • Typically noisy, to protect privacy: R(M(T)) ≠ M(T) • c = c(M): total number of cells in M • α: name of a marginal (i.e., a set of attributes) • ε: a privacy parameter

Consistency Across Reported Marginals There exists a fictional contingency table T* whose marginals equal the reported marginals • M(T*) = R(M(T)) • T*, M(T*) may have negative and/or non-integral counts • Who cares about consistency? • Not we. • Software?

Consistency Across Reported Marginals There exists a fictional contingency table T* whose marginals equal the reported marginals • M(T*) = R(M(T)) • T*, M(T*) may have negative and/or non-integral counts • Who cares about integrality, non-negativity? • Not we. • Software? • See the paper.

Accuracy of Reported Values • Roughly, described by E[ ||R(M(T)) − M(T)||_1 ] • Expected error in each cell: proportional to c(M)/ε • A little worse: probabilistic guarantees on the size of the max error • Similar to the change obtained by randomly adding/deleting c(M)/ε respondents to T and then computing M • Key Point: Error is Independent of n (and k) • Depends on the “complexity” of M • Depends on the privacy parameter ε
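
A quick simulation of the key point, assuming the Laplace calibration introduced on the next slides (noise of scale c(M)/ε per cell): the per-cell error is the same whether n is a thousand or a million, because n never enters the noise scale.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, c = 0.1, 10           # privacy parameter and number of cells in M (hypothetical)
for n in (10**3, 10**6):   # sample size plays no role below
    noise = rng.laplace(0, c / eps, size=100_000)
    print(n, np.mean(np.abs(noise)))  # mean absolute error ≈ c/eps either way
```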

ε-Differential Privacy For all x, for all reported values r: Pr[R(M) = r | x in D] ∈ exp(±ε) · Pr[R(M) = r | x not in D] • [figure: the two output distributions Pr[r]; their ratio is bounded at every r]

ε-Differential Privacy When ε is small: for all x, for all reported r, Pr[R(M) = r | x in D] ∈ (1 ± ε) · Pr[R(M) = r | x not in D] • Probabilities taken over coins flipped by the curator • Independent of other sources of data, databases, or even knowledge of every element in D \ {x} • “Anything, good or bad, is essentially equally likely to occur, whether I join the database or not.” • Generalizes to groups of respondents • Although, if the group is large, then outcomes should differ.
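
A concrete check of the definition on the standard Laplace-noised counting query (sensitivity 1; the neighboring true counts 100 and 99 are made up): the density of every output r under the two databases stays within a factor of exp(±ε).

```python
from math import exp

def lap_pdf(x, b):
    """Density of the Laplace(0, b) distribution at x."""
    return exp(-abs(x) / b) / (2 * b)

eps = 0.1
b = 1 / eps  # counting query: one person changes the count by at most 1
for r in (90.0, 99.5, 110.0):
    ratio = lap_pdf(r - 100, b) / lap_pdf(r - 99, b)  # x in D vs. x not in D
    print(r, ratio, exp(eps))  # the ratio always lies in [exp(-eps), exp(eps)]
```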

Why Differential Privacy? • Dalenius’ Goal: “Anything that can be learned about a respondent, given access to the statistical database, can be learned without access” is Provably Unachievable • Sam the (American) smoker tries to buy medical insurance • Statistical DB teaches that smoking causes cancer • Sam harmed: high premiums for medical insurance • Sam need not be in the database! • Differential Privacy guarantees that the risk to Sam will not substantially increase if he enters the DB • DBs have intrinsic social value

An Ad Omnia Guarantee • No perceptible risk is incurred by joining the data set • Anything the adversary can do to Sam, it could do even if his data were not in the data set • [figure: output distribution Pr[r] with the bad outcomes r marked]

Achieving Differential Privacy for d-ary f • Curator adds noise according to the Laplace distribution • “Hides” the presence/absence of any individual • How much can the data of one person affect M(T)? • ∀ α ∈ M, one person can affect one cell in α(T), by 1 • Δf = max_{D,x} ||f(D ∪ {x}) − f(D \ {x})||_1 • E.g., Δα = 1, ΔM ≤ |M|

Calibrate Noise to Sensitivity for d-ary f • Δf = max_{D,x} ||f(D ∪ {x}) − f(D \ {x})||_1 • Theorem: To achieve ε-differential privacy, use scaled symmetric noise ~ Lap(s)^d with s = Δf/ε • [figure: Laplace density with ticks at 0, ±s, ±2s, …; the densities for neighboring databases differ by a ratio of at most e^ε]
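
A direct rendering of the theorem in numpy, with Δf supplied by the caller; each of the instantiations on the next four slides is this one function with a different f and Δf:

```python
import numpy as np

def laplace_mechanism(f_value, sensitivity, eps, rng=None):
    """Release f(D) + Lap(s)^d noise with s = sensitivity / eps, one draw per cell."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.laplace(0.0, sensitivity / eps, size=np.shape(f_value))
    return np.asarray(f_value, dtype=float) + noise
```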

Calibrate Noise to Sensitivity for d-ary f • Δf = max_{D,x} ||f(D ∪ {x}) − f(D \ {x})||_1 • Theorem: To achieve ε-differential privacy, use scaled symmetric noise ~ Lap(s)^d with s = Δf/ε • f = α: s = 1/ε

Calibrate Noise to Sensitivity for d-ary f • Δf = max_{D,x} ||f(D ∪ {x}) − f(D \ {x})||_1 • Theorem: To achieve ε-differential privacy, use scaled symmetric noise ~ Lap(s)^d with s = Δf/ε • f = T: s = 1/ε

Calibrate Noise to Sensitivity for d-ary f • Δf = max_{D,x} ||f(D ∪ {x}) − f(D \ {x})||_1 • Theorem: To achieve ε-differential privacy, use scaled symmetric noise ~ Lap(s)^d with s = Δf/ε • f = M: s ≤ |M|/ε

Calibrate Noise to Sensitivity for d-ary f • Δf = max_{D,x} ||f(D ∪ {x}) − f(D \ {x})||_1 • Theorem: To achieve ε-differential privacy, use scaled symmetric noise ~ Lap(s)^d with s = Δf/ε • f = M3: s ≤ (k choose 3)/ε
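
Plugging in the four sensitivities just listed (k = 20 attributes and |M| = 10 marginals are made-up sizes):

```python
from math import comb

k, m, eps = 20, 10, 0.1
print("one marginal alpha:", 1 / eps)          # Δα = 1
print("full table T:      ", 1 / eps)          # ΔT = 1 (one person, one cell)
print("collection M:      ", m / eps)          # ΔM ≤ |M|
print("all 3-way M3:      ", comb(k, 3) / eps) # ΔM3 ≤ C(k,3) = 1140
```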

Application: Release of Marginals M • Release a noisy contingency table; compute marginals? • Consistency and differential privacy • Noise per cell of T: Lap(1/ε) • Noise per cell of M: about 2^{k/2}/ε for low-order marginals • Release noisy versions of all marginals in M? • Noise per cell of M: Lap(|M|/ε) • Differential privacy and better accuracy • Inconsistent
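
Both baseline strategies in a self-contained sketch (a hypothetical 2-attribute table; for realistic k, a marginal cell in Strategy 1 sums 2^{k-j} noisy table cells, which is where the 2^{k/2}/ε error comes from):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1
T = np.array([3., 1., 0., 2.])  # hypothetical table over cells {00, 01, 10, 11}

def marg(table, axis):
    """One-way marginal of a 2-attribute table: sum out the other attribute."""
    return table.reshape(2, 2).sum(axis=axis)

# Strategy 1: noise the table, then compute marginals -> consistent by
# construction, but every marginal cell aggregates many noisy table cells
noisy_T = T + rng.laplace(0, 1 / eps, size=T.shape)
R1 = [marg(noisy_T, 1), marg(noisy_T, 0)]

# Strategy 2: noise each marginal directly with scale |M|/eps -> per-cell error
# is |M|/eps rather than 2^{k/2}/eps, but the marginals no longer agree
m = 2
R2 = [marg(T, 1) + rng.laplace(0, m / eps, 2),
      marg(T, 0) + rng.laplace(0, m / eps, 2)]
print(R2[0].sum(), R2[1].sum())  # generally differ: inconsistent
```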

Move to the Fourier Domain • Just a change of basis. Why bother? • T represented by 2^k Fourier coefficients (it has 2^k cells) • To compute a j-ary marginal α(T), only need 2^j coefficients • For any M, expected noise/cell depends on the number of coefficients needed to compute M(T) • E.g., for M3, E[noise/cell] ≈ (k choose 3)/ε • The Algorithm for R(M(T)): • Compute the set of Fourier coefficients of T needed for M(T) • Add noise; this gives the Fourier coefficients of M(T*) • The 1-1 mapping between sets of Fourier coefficients and tables ensures consistency! • Convert back to obtain M(T*) • Release R(M(T)) = M(T*)
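
A simplified, self-contained sketch of the algorithm, with cells encoded as k-bit masks. The 2^{-k} Fourier normalization and the L1 sensitivity accounting (each coefficient moves by 2^{-k} per respondent, so |needed| · 2^{-k} in total) are assumptions of this sketch; the paper's exact calibration may differ.

```python
import itertools
import numpy as np

def chi(beta, x):
    """Fourier character over bitmask-encoded cells: (-1)^<beta, x>."""
    return -1 if bin(beta & x).count("1") % 2 else 1

def fourier_release(T, marginals, k, eps, rng=None):
    """Sketch of R(M(T)): add noise only to the Fourier coefficients M needs."""
    rng = np.random.default_rng() if rng is None else rng
    # downward closure: every subset beta of every requested attribute set alpha
    needed = {sum(1 << i for i in s)
              for alpha in marginals
              for r in range(len(alpha) + 1)
              for s in itertools.combinations(alpha, r)}
    # hat_T(beta) = 2^-k * sum_x T[x] * chi(beta, x); one respondent moves each
    # coefficient by 2^-k, so the L1 sensitivity here is |needed| / 2^k
    scale = len(needed) / (2 ** k * eps)
    noisy = {b: sum(T[x] * chi(b, x) for x in range(2 ** k)) / 2 ** k
                + rng.laplace(0, scale)
             for b in needed}
    # invert: M_alpha(y) = 2^(k-j) * sum_{beta subset of alpha} hat_T(beta) chi(beta, y).
    # Every marginal is read off the same noisy coefficient vector, which defines
    # a single table T* -- that is exactly the consistency guarantee.
    out = {}
    for alpha in marginals:
        j, mask = len(alpha), sum(1 << i for i in alpha)
        cells = np.zeros(2 ** j)
        for y in range(2 ** j):
            x = sum(((y >> t) & 1) << i for t, i in enumerate(alpha))  # embed y
            cells[y] = 2 ** (k - j) * sum(noisy[b] * chi(b, x)
                                          for b in needed if not b & ~mask)
        out[tuple(alpha)] = cells
    return out

# tiny demo: both 1-way marginals of a hypothetical 3-attribute table
# (cell x encodes the attributes in its bits: bit i of x = value of attribute i)
T = np.array([1., 0., 2., 1., 0., 1., 1., 1.])
R = fourier_release(T, [[0], [1]], k=3, eps=0.5, rng=np.random.default_rng(0))
print(R)  # the two noisy marginals sum to exactly the same noisy total
```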

Improving Accuracy • Gaussian noise, instead of Laplacian • E[noise/cell] for M3 looks more like O(log(1/δ) · k^{3/2}/ε) • Probabilistic (1 − δ) guarantee of ε-differential privacy • Use Domain-Specific Knowledge • We have, so far, avoided this! • If most attributes are considered (socially) insensitive, can add less noise, and to fewer coefficients • E.g., ΔM3 with 1 sensitive attribute ≈ k^2 (instead of k^3)
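
For the first bullet, a sketch using the now-standard (ε, δ) Gaussian calibration σ = sqrt(2 ln(1.25/δ)) · Δ2/ε, which is an assumption here (the talk predates this particular constant). The gain for M3 comes from measuring sensitivity in L2: one person moves each of the C(k, 3) affected cells by 1, so Δ2 = sqrt(C(k, 3)) ≈ k^{3/2}, versus Δ1 = C(k, 3) ≈ k^3.

```python
import numpy as np

def gaussian_mechanism(f_value, l2_sensitivity, eps, delta, rng=None):
    """(eps, delta)-differential privacy via Gaussian noise (valid for eps < 1)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * l2_sensitivity / eps
    return np.asarray(f_value, dtype=float) + rng.normal(0, sigma, np.shape(f_value))
```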

In Theory, Noise Must Depend on the Set M “Dinur-Nissim”-style results: 1 sensitive attribute • Overly accurate answers to too many questions permit reconstruction of the sensitive attribute of almost the entire database, say, 99% • Attacks use no linkage/external/auxiliary information • Rough Translation: there are “bad” databases in which a sensitive binary attribute can be learned for all respondents from, say, 2√n degree-2 marginals, if per-cell errors are strictly less than √n • The “badness” relates to the distribution of the occurrences of the insensitive attributes!

Summary • Introduced ε-differential privacy • Rigorous and ad omnia notion of privacy • Showed how to achieve differential privacy • In general • In the special case of marginal release • Simple! • Special attention paid to ensuring consistency among released marginals • Per-cell accuracy deteriorates with the complexity of the query and the degree of privacy • Noted that accuracy must deteriorate with the complexity of the query