Calibrating Noise to Sensitivity in Private Data Analysis


Calibrating Noise to Sensitivity in Private Data Analysis. Kobbi Nissim (BGU), with Cynthia Dwork, Frank McSherry, Adam Smith, and Enav Weinreb.

The Setting. The database x ∈ D^n (n rows, each from a domain D) sits behind a sanitizer San with its own random coins. Users (government, researchers, marketers, …) send a query and get back an answer. The honest user: “I just want to learn a few harmless global statistics.” The worry: “Can I combine these to learn some private info?”

What is privacy? Clearly we cannot undo the harm done by others. Can we minimize the additional harm while still providing utility? Goal: whether or not I contribute my data does not affect my privacy.

Output Perturbation. San holds x1, …, xn and random coins; given a query function f, it returns f(x) + noise. San controls which functions f are allowed and what kind of perturbation is applied.

When Can I Release f(x) Accurately? Intuition: global information is “insensitive” to individual data and is safe to release. f(x1, …, xn) is sensitive if changing a few entries can drastically change its value.

Talk Outline. A framework for output perturbation based on “sensitivity”: formalize “sensitivity” and relate it to privacy definitions; examples of sensitivity-based analysis; new ideas. Basic models for privacy: local vs. global, noninteractive vs. interactive.

Related Work. There is relevant work in statistics, data mining, computer security, and databases, but largely without precise definitions and analysis of privacy. Recently: a foundational approach [DN03, EGS03, DN04, BDMN05, KMN05, CDMSW05, CDMT05, MS06, CM06, …]. This work extends [DN03, DN04, BDMN05].

Privacy as Indistinguishability. Run San (with its random coins) on two databases x and x' that differ in a single row (say x2 vs. x2'). Each run yields a transcript of queries and answers: T(x) and T(x'). Requirement: the two transcript distributions are at “distance” < ε.

ε-Indistinguishability. A sanitizer is ε-indistinguishable if for all pairs x, x' ∈ D^n which differ on at most one entry, for all adversaries A, and for all transcripts t: Pr[T_A(x) = t] ≤ e^ε · Pr[T_A(x') = t].

Semantically Flavored Definitions. Indistinguishability is easy to work with, but it does not directly say what the adversary can do and learn. An “ideal” semantic definition: the adversary does not change his beliefs about me. Problem: dependencies, e.g. in the form of side information. Say you know that I am 20 pounds heavier than the average Israeli; then you will learn my weight from census results whether or not I participate. Ways to get around this: assume “independence” of X1, …, Xn [DN03, DN04, BDMN05], or compare “what A knows now” vs. “what A would have learned anyway” [DM].

Incremental Risk. Suppose the adversary has prior “beliefs” about x: a probability distribution over the random variable X = (X1, …, Xn). Given a transcript t, the adversary updates these beliefs according to Bayes’ rule, obtaining a new distribution X_i | T(X)=t.

Incremental Risk (cont.). Two options: I participate in the census (input = X), or I do not participate (input Y_i = X1, …, X_{i-1}, *, X_{i+1}, …, Xn). Privacy: whether or not I participate does not significantly influence the adversary’s posterior beliefs. For all transcripts t and all i: X_i | T(X)=t ≈ X_i | T(Y_i)=t. “Proof”: indistinguishability guarantees that the Bayesian updates agree to within a factor of 1 ± ε, so it’s (almost) the same whether you participate or not.

Recall – -Indistinguishability For all pairs x,x’  Dn s.t. dist(x,x’) = 1 For all transcripts t Pr[TA(x) = t]  e  Pr[TA(x’) = t]

An Example – Sum Queries. The database is x ∈ [0,1]^n. “Please let me know f_A(x) = Σ_{i∈A} x_i.” San (with its random coins) answers f_A(x) + noise.

Sum Queries – Answering a Query. x ∈ [0,1]^n, f_A(x) = Σ_{i∈A} x_i; note that |f_A(x) − f_A(x')| ≤ 1 for neighboring x, x'. Answer: Σ_{i∈A} x_i + Y where Y ∼ Lap(1/ε), the Laplace distribution with density h(y) ∝ e^{−ε|y|}. Sum queries can be used as a basis for other tasks: clustering, learning, classification, … [BDMN05].
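
A minimal sketch of answering one such query in Python (my own code, not from the talk; numpy is assumed), adding Lap(1/ε) noise to a sum over [0,1]-valued rows:

```python
import numpy as np

def answer_sum_query(x, subset, eps, rng=np.random.default_rng()):
    """Answer f_A(x) = sum_{i in A} x_i with Lap(1/eps) noise.

    Entries of x lie in [0,1], so changing one row moves the sum by at
    most 1; Laplace noise of scale 1/eps then hides that change."""
    true_sum = float(np.sum(x[list(subset)]))
    return true_sum + rng.laplace(loc=0.0, scale=1.0 / eps)

# Example: 1000 rows in [0,1], query the first 500 with eps = 0.1.
x = np.random.default_rng(0).random(1000)
print(answer_sum_query(x, range(500), eps=0.1))
```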

Sum Queries – Proof of ε-Indistinguishability. Property of Lap: for all x, y, h(x)/h(y) ≤ e^{ε|x−y|}. Here Pr[T(x)=t] ∝ e^{−ε|f_A(x)−t|} and Pr[T(x')=t] ∝ e^{−ε|f_A(x')−t|}, so Pr[T(x)=t] / Pr[T(x')=t] ≤ e^{ε|f_A(x)−f_A(x')|} ≤ e^ε, since max |f_A(x)−f_A(x')| = 1 over neighboring databases.

Sensitivity. We chose the noise magnitude to cover max |f(x)−f(x')| over neighboring databases. Global sensitivity: S_f = max_{dist(x,x')=1} ||f(x)−f(x')||_1 (maximum over all neighboring pairs). Local sensitivity: LS_f(x) = max_{dist(x,x')=1} ||f(x)−f(x')||_1 for the given x (maximum over its neighbors x' only). San answers f(x) + noise on x and f(x') + noise on the neighboring x'.

Calibrating Noise to Sensitivity. The database is x ∈ D^n. “Please let me know f(x).” San answers f(x) + Lap(S_f/ε), i.e., noise with density h(y) ∝ e^{−(ε/S_f)||y||_1}.
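
A hedged sketch of the general mechanism (the helper below and its name are mine; the analyst must supply the global L1 sensitivity S_f): add independent Lap(S_f/ε) noise to each coordinate of f(x).

```python
import numpy as np

def laplace_mechanism(f, sensitivity, eps, x, rng=np.random.default_rng()):
    """Release f(x) plus per-coordinate Lap(sensitivity/eps) noise, where
    `sensitivity` is S_f = max ||f(x) - f(x')||_1 over neighboring x, x'."""
    value = np.atleast_1d(np.asarray(f(x), dtype=float))
    noise = rng.laplace(scale=sensitivity / eps, size=value.shape)
    return value + noise

# Example: the mean of n values in [0,1] has sensitivity 1/n.
x = np.random.default_rng(1).random(500)
print(laplace_mechanism(np.mean, sensitivity=1.0 / len(x), eps=0.5, x=x))
```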

Calibrating Noise to Sensitivity – Why Does it Work? S_f = max_{dist(x,x')=1} ||f(x)−f(x')||_1 and the noise density is h(y) ∝ e^{−(ε/S_f)||y||_1}. Property of Lap: for all x, y, h(x)/h(y) ≤ e^{(ε/S_f)||x−y||_1}. Hence Pr[T(x)=t] / Pr[T(x')=t] ≤ e^{(ε/S_f)||f(x)−f(x')||_1} ≤ e^ε.

Main Result. Theorem: if a user U is limited to T adaptive queries, each of sensitivity S_f, then ε-indistinguishability holds when i.i.d. noise Lap(S_f·T/ε) is added to the query answers. The same idea works with other metrics and noise distributions. Which useful functions are insensitive? Arguably, all useful functions should be insensitive: statistical conclusions should not depend on small variations in the data.
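
One way to read the theorem is as a simple query budget; here is an illustrative sketch (a hypothetical wrapper, not part of the paper) that answers up to T adaptive queries, each of sensitivity at most S_f, with Lap(S_f·T/ε) noise per answer:

```python
import numpy as np

class InteractiveSanitizer:
    """Answer at most T adaptive queries of sensitivity <= sens,
    adding i.i.d. Lap(sens * T / eps) noise to every answer."""

    def __init__(self, x, eps, T, sens, seed=None):
        self.x, self.eps, self.T, self.sens = x, eps, T, sens
        self.answered = 0
        self.rng = np.random.default_rng(seed)

    def query(self, f):
        if self.answered >= self.T:
            raise RuntimeError("query budget exhausted")
        self.answered += 1
        scale = self.sens * self.T / self.eps
        return float(f(self.x)) + self.rng.laplace(scale=scale)

# Example: adaptive sum queries over a 0/1 database, overall eps = 1.
data = np.random.default_rng(2).integers(0, 2, size=1000)
san = InteractiveSanitizer(data, eps=1.0, T=3, sens=1.0, seed=0)
print(san.query(np.sum))                   # query 1
print(san.query(lambda x: x[:100].sum()))  # query 2, chosen adaptively
```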

Using Insensitive Functions. Strategies: (1) use the theorem and output f(x) + Lap(S_f/ε), but S_f may be hard to analyze or compute, and S_f can be high even for functions considered “insensitive”; (2) express f in terms of insensitive functions, in which case the resulting noise depends on the input (in both form and magnitude).

Example – Expressing f in Terms of Insensitive Functions. Let x ∈ {0,1}^n and f(x) = (Σ x_i)². Then S_f = n² − (n−1)² = 2n − 1, so the direct answer a_f = (Σ x_i)² + Lap(2n/ε) is dominated by noise whenever f(x) << n. However, f(x) = (g(x))² where g(x) = Σ x_i and S_g = 1, so it is better to query g: get a_g = Σ x_i + Lap(1/ε) and estimate f(x) as (a_g)². Taking ε constant, the error then has standard deviation O(Σ x_i) plus a lower-order (1/ε)² term.
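
A quick numerical illustration of this point (my sketch, not in the talk): compare answering f(x) = (Σ x_i)² directly, with sensitivity 2n−1, against squaring a noisy answer to g(x) = Σ x_i, which has sensitivity 1:

```python
import numpy as np

rng = np.random.default_rng(3)
n, eps = 10_000, 0.5
x = (rng.random(n) < 0.01).astype(float)   # sparse 0/1 data, so sum(x) << n
true = x.sum() ** 2

# Direct query for f(x) = (sum x_i)^2: sensitivity 2n - 1.
a_f = true + rng.laplace(scale=(2 * n - 1) / eps)

# Query g(x) = sum x_i (sensitivity 1), then square the noisy answer.
a_g = x.sum() + rng.laplace(scale=1.0 / eps)

print(f"true = {true:.0f}, direct = {a_f:.0f}, via g = {a_g ** 2:.0f}")
```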

Useful Insensitive Functions. Means, variances, … (with appropriate assumptions on the data); histograms and contingency tables; singular value decomposition; distance to a property; functions with low query complexity.

Histograms/Contingency Tables. x1, …, xn ∈ D, where D is partitioned into d disjoint bins b1, …, bd. Let h(x) = (v1, …, vd) where v_j = |{i : x_i ∈ b_j}|. Then S_h = 2: changing one value x_i changes the count vector by at most 2 in L1 norm, irrespective of d. So add Laplace noise of scale 2/ε to each count (this can also be done with sum queries).
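
A sketch of the histogram release (numeric data bucketed into d bins; the helper and its name are mine), adding Lap(2/ε) noise to each of the d counts:

```python
import numpy as np

def noisy_histogram(x, bin_edges, eps, rng=np.random.default_rng()):
    """Release h(x) = (v_1, ..., v_d) with Lap(2/eps) noise per count:
    changing one row moves the count vector by at most 2 in L1 norm."""
    counts, _ = np.histogram(x, bins=bin_edges)
    return counts + rng.laplace(scale=2.0 / eps, size=counts.shape)

# Example: ages bucketed into decades, eps = 0.5.
ages = np.random.default_rng(4).integers(0, 100, size=5000)
print(noisy_histogram(ages, bin_edges=np.arange(0, 101, 10), eps=0.5))
```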

Distance to a Property. Say P is a set of “good” databases. The distance to P is the minimum number of points of x that must be changed to put x in P. This always has sensitivity 1, so add Laplace noise of scale 1/ε. Examples: distance to being clusterable; weight of the minimum cut in a graph.
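
As an illustration with a toy property (my choice, not one of the talk’s examples): the distance of a 0/1 database to P = “at most k entries are 1” is max(0, Σ x_i − k), which changes by at most 1 when one entry changes, so Lap(1/ε) noise suffices:

```python
import numpy as np

def noisy_distance_to_at_most_k_ones(x, k, eps, rng=np.random.default_rng()):
    """Distance to P = {databases with at most k ones}: the number of
    entries that must be flipped, i.e. max(0, sum(x) - k). Like any
    distance-to-a-property function, it has sensitivity 1."""
    distance = max(0, int(np.sum(x)) - k)
    return distance + rng.laplace(scale=1.0 / eps)

x = np.random.default_rng(5).integers(0, 2, size=200)
print(noisy_distance_to_at_most_k_ones(x, k=50, eps=1.0))
```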

Approximations with Low Query Complexity. Lemma: assume an algorithm A randomly samples γn of the points and Pr[A(x) ∈ f(x) ± α] > (1+γ)/2. Then S_f ≤ 2α. Proof: consider x, x' that differ on point i, and let A_i be A conditioned on not choosing point i. Then Pr[A_i(x) ∈ f(x) ± α | point i not sampled] > 1/2 and Pr[A_i(x') ∈ f(x') ± α | point i not sampled] > 1/2. Since A_i(x) and A_i(x') are identically distributed (point i is never read), there exists a point p within distance α of both f(x) and f(x'), hence S_f ≤ 2α.

Local Sensitivity. The median is typically insensitive, yet it has large (global) sensitivity. LS_f(x) = max_{dist(x,x')=1} ||f(x)−f(x')||_1. Example: f(x) = min(Σ x_i, 10) where x_i ∈ {0,1}; then LS_f(x) = 1 if Σ x_i ≤ 10 and 0 otherwise.

Local Sensitivity – First Attempt. Calibrate the noise to LS_f(x): answer query f with f(x) + Lap(LS_f(x)/ε). If x1 = … = x10 = 1 and x11 = … = xn = 0, the answer is 10 + Lap(1/ε); if x1 = … = x11 = 1 and x12 = … = xn = 0, the answer is exactly 10. The noise magnitude itself may be disclosive!

How to Calibrate Noise to Local Sensitivity? The noise magnitude at a point x should depend on LS_f(y) for all y ∈ D^n: N*_f(x) = max_{y ∈ D^n} (LS_f(y) · e^{−ε·dist(x,y)}). (The slide illustrates this smoothed bound, e.g. for the median.)
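
For the running example f(x) = min(Σ x_i, 10) over 0/1 entries, the quantity N*_f from this slide has a closed form. The sketch below (my own; it only evaluates the slide’s formula and does not implement the full smooth-sensitivity mechanism of the later literature) shows how the bound decays smoothly instead of jumping from 1 to 0:

```python
import numpy as np

def local_sensitivity(sum_x):
    """LS_f for f(x) = min(sum(x), 10) over 0/1 entries: 1 iff sum(x) <= 10."""
    return 1.0 if sum_x <= 10 else 0.0

def smooth_bound(sum_x, eps):
    """N*_f(x) = max_y LS_f(y) * exp(-eps * dist(x, y)).  The nearest y with
    LS_f(y) = 1 is at distance max(0, sum(x) - 10), where the max is attained."""
    return float(np.exp(-eps * max(0, sum_x - 10)))

for s in [5, 10, 11, 15, 30]:
    print(f"sum(x) = {s:2d}: LS = {local_sensitivity(s):.0f}, "
          f"N* = {smooth_bound(s, eps=0.5):.4f}")
```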

Talk Outline (revisited). A framework for output perturbation based on “sensitivity”: formalize “sensitivity” and relate it to privacy definitions; examples of sensitivity-based analysis; new ideas. Basic models for privacy: local vs. global, noninteractive vs. interactive.

Models for Data Privacy. Individuals (you, Bob, Alice, …) contribute their data to a collection-and-sanitization process; users (government, researchers, marketers, …) access the result.

Models for Data Privacy – Local vs. Global. Global: individuals send their data to a central party that performs collection and sanitization (the central party can also be emulated via “SFE”). Local: each individual applies a sanitizer San to their own data before it is collected.

Models for Data Privacy – Interactive vs. Noninteractive. Interactive: users interact with the collection-and-sanitization process, issuing queries online. Noninteractive: the data is collected, sanitized once, and released.

Models for Data Privacy – Summary. Local (vs. global): no central trusted party; individuals interact directly with the (untrusted) user; individuals control their own privacy. Noninteractive (vs. interactive): easier distribution (web site, book, CD, …); more secure, since the data can be erased once it is processed. Almost all work in statistics and data mining is noninteractive!

Four Basic Models. (Diagram: the four combinations of local/global and interactive/noninteractive, with “incomparable” labels and question marks on the pairs whose relationships are examined next.)

Interactive vs. Noninteractive.

Separating Interactive from Noninteractive. Random samples: one can compute estimates for many statistics, with (essentially) no need to decide upon the queries ahead of time, but they are not private (unless both the domain and the sample are small [CM06]). Interaction gives the power of random samples with privacy, e.g., sum queries f(x) = Σ_i f_i(x_i), even chosen adaptively. Noninteractive schemes seem weaker. Intuition: privacy means one cannot answer all questions ahead of time (e.g. [DN03]); the sanitization must be tailored to specific functions.

Separating Interactive from Noninteractive. Theorem: if D = {0,1}^d, then for any private, noninteractive scheme, many sum queries cannot be learned unless d = o(log n). So the noninteractive model is weaker than the interactive one: it cannot emulate a random sample when the data is complex.

Local vs. Global.

Separating Local from Global. Let D = {0,1}^d with d on the order of log n, and view x as an n×d matrix. In the global model, rank(x) has sensitivity 1 and can be released with low noise. In the local model, one cannot distinguish whether rank(x) = k or is much larger than k, for a suitable choice of d, n, k.

To Sum Up. Defined privacy in terms of indistinguishability and considered semantic versions of the definitions (“crypto” with non-negligible error). Showed how to calibrate noise to sensitivity and to the number of queries; it seems that useful statistics should be insensitive. Some commonly used functions have low sensitivity; for others, local sensitivity? Began to explore the relationships between the basic models.

Questions. Which useful functions are insensitive, and what would you like to compute? Can we get stronger results using local sensitivity, computational assumptions [MS06], or entropy in the data? How to deal with small databases? Privacy in a broader context: rationalizing privacy and privacy-related decisions; which types of privacy? how to decide upon the privacy parameters? … Handling rich data: audio, video, pictures, text, …