Calibrating Noise to Sensitivity in Private Data Analysis


1 Calibrating Noise to Sensitivity in Private Data Analysis
Kobbi Nissim (BGU), with Cynthia Dwork, Frank McSherry, Adam Smith, and Enav Weinreb

2 The Setting
A sanitizer San holds a database x ∈ D^n (n rows, each from a domain D). Users (government, researchers, marketers, …) send San a query and receive an answer.
User: "I just want to learn a few harmless global statistics."
Adversary: "Can I combine these to learn some private info?"

3 What is privacy?
Clearly we cannot undo the harm done by others. Can we minimize the additional harm while still providing utility?
Goal: whether or not I contribute my data does not affect my privacy.

4 Output Perturbation
San holds x = (x1, …, xn). A user sends a function f; using its random coins, San responds with f(x) + noise.
San controls which functions f are allowed and the kind of perturbation.

5 When Can I Release f(x) Accurately?
Intuition: global information is "insensitive" to individual data and is safe to release. f(x1,…,xn) is sensitive if changing a few entries can drastically change its value.

6 Talk Outline
- A framework for output perturbation based on "sensitivity"
- Formalizing "sensitivity" and relating it to privacy definitions
- Examples of sensitivity-based analysis
- New ideas
- Basic models for privacy: local vs. global, noninteractive vs. interactive

7 Related Work
Relevant work in statistics, data mining, computer security, and databases, largely without precise definitions and analysis of privacy. Recently: a foundational approach [DN03, EGS03, DN04, BDMN05, KMN05, CDMSW05, CDMT05, MS06, CM06, …]. This work extends [DN03, DN04, BDMN05].

8 Privacy as Indistinguishability
Take x = (x1, x2, …, xn) and x' = (x1, x2', …, xn), differing in one row. The adversary issues query 1, …, query T and receives answers, producing a transcript T(x) (a random variable over San's random coins).
Requirement: the distributions of T(x) and T(x') are at "distance" at most ε.

9 ε-Indistinguishability
A sanitizer is ε-indistinguishable if for all pairs x, x' ∈ D^n that differ on at most one entry, for all adversaries A, and for all transcripts t:
Pr[T_A(x) = t] ≤ e^ε · Pr[T_A(x') = t]

10 Semantically Flavored Definitions
Indistinguishability is easy to work with, but it does not directly say what the adversary can do or learn.
"Ideal" semantic definition: the adversary does not change his beliefs about me. Problem: dependencies, e.g. in the form of side information. Say you know that I am 20 pounds heavier than the average Israeli; then you will learn my weight from census results, whether or not I participate.
Ways to get around this: assume "independence" of X1, …, Xn [DN03, DN04, BDMN05], or compare "what A knows now" vs. "what A would have learned anyway" [DM].

11 Incremental Risk
Suppose the adversary has prior "beliefs" about x: a probability distribution, r.v. X = (X1, …, Xn). Given a transcript t, the adversary updates his "beliefs" according to Bayes' rule, obtaining the new distribution Xi' | T(X) = t.

12 Incremental Risk
"Bugger! It's the same whether you participate or not."
Two options: I participate in the census (input = X), or I do not participate (input Yi = X1, …, Xi-1, *, Xi+1, …, Xn).
Privacy: whether I participate or not does not significantly influence the adversary's posterior beliefs. For all transcripts t and all i:
Xi' | T(X) = t ≈ Xi' | T(Yi) = t
"Proof": indistinguishability guarantees that the Bayesian updates are the same to within a factor of 1 ± ε.

13 Recall – ε-Indistinguishability
For all pairs x, x' ∈ D^n s.t. dist(x, x') = 1, and for all transcripts t:
Pr[T_A(x) = t] ≤ e^ε · Pr[T_A(x') = t]

14 An Example – Sum Queries
x ∈ [0,1]^n. User: "Please let me know f_A(x) = Σ_{i∈A} x_i." Using its random coins, San responds with f_A(x) + noise.

15 Sum Queries – Answering a Query
x ∈ [0,1]^n, f_A(x) = Σ_{i∈A} x_i. Note that |f_A(x) − f_A(x')| ≤ 1 for databases differing in one entry.
Answer: Σ_{i∈A} x_i + Y, where Y ∼ Lap(1/ε); the Laplace distribution has density h(y) ∝ e^{−ε|y|}.
Sum queries can be used as a basis for other tasks: clustering, learning, classification, … [BDMN05]
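A minimal sketch of this mechanism in Python (assuming numpy; the function name noisy_sum and its parameters are illustrative, not from the paper):

```python
import numpy as np

def noisy_sum(x, subset, epsilon, rng=None):
    """Answer f_A(x) = sum_{i in A} x_i with Lap(1/epsilon) noise.

    Each x_i lies in [0, 1], so changing one row moves the sum by at
    most 1; noise of scale 1/epsilon then gives epsilon-indistinguishability.
    """
    rng = rng or np.random.default_rng()
    true_sum = sum(x[i] for i in subset)
    return true_sum + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: 1000 rows in [0,1], query the first half, epsilon = 0.1
x = np.random.default_rng(0).random(1000)
print(noisy_sum(x, range(500), epsilon=0.1))
```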

16 Sum Queries – Proof of ε-Indistinguishability
Property of Lap: for all x, y: h(x)/h(y) ≤ e^{ε|x−y|}.
Here Pr[T(x) = t] ∝ e^{−ε|f_A(x) − t|} and Pr[T(x') = t] ∝ e^{−ε|f_A(x') − t|}, so
Pr[T(x) = t] / Pr[T(x') = t] ≤ e^{ε|f_A(x) − f_A(x')|} ≤ e^ε,
since max |f_A(x) − f_A(x')| = 1.

17 Sensitivity
We chose the noise magnitude to cover max |f(x) − f(x')| over databases x, x' with dist(x, x') = 1.
(Global) sensitivity: S_f = max_{dist(x,x')=1} ||f(x) − f(x')||_1
Local sensitivity: LS_f(x) = max_{x' : dist(x,x')=1} ||f(x) − f(x')||_1

18 Calibrating Noise to Sensitivity
x ∈ D^n. User: "Please let me know f(x)." Using its random coins, San responds with f(x) + Lap(S_f/ε), i.e., noise with density h(y) ∝ e^{−(ε/S_f)·||y||_1}.

19 Calibrating Noise to Sensitivity – Why Does it Work?
S_f = max_{dist(x,x')=1} ||f(x) − f(x')||_1, and h(y) ∝ e^{−(ε/S_f)·||y||_1}.
Property of Lap: for all x, y: h(x)/h(y) ≤ e^{(ε/S_f)·||x−y||_1}. Hence
Pr[T(x) = t] / Pr[T(x') = t] ≤ e^{(ε/S_f)·||f(x) − f(x')||_1} ≤ e^ε.

20 Main Result
Theorem: If a user U is limited to T adaptive queries of sensitivity S_f, then adding i.i.d. noise Lap(S_f·T/ε) to the query answers gives ε-indistinguishability. The same idea works with other metrics and noise distributions.
Which useful functions are insensitive? Arguably, all useful functions should be insensitive: statistical conclusions should not depend on small variations in the data.
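As a hedged sketch of how the theorem is used (illustrative code, not from the paper): to answer T queries of sensitivity 1 under a total budget ε, scale each query's noise up by T.

```python
import numpy as np

def answer_queries(x, queries, epsilon, rng=None):
    """Answer T queries, each of sensitivity 1, adding i.i.d. noise
    Lap(T/epsilon) to each answer; by the theorem the full transcript
    is epsilon-indistinguishable. (Adaptive queries would be answered
    one at a time with the same per-query scale.)"""
    rng = rng or np.random.default_rng()
    scale = len(queries) / epsilon  # S_f = 1 per query, T queries total
    return [f(x) + rng.laplace(0.0, scale) for f in queries]

x = np.random.default_rng(0).random(1000)
queries = [lambda x: x.sum(), lambda x: x[:500].sum()]  # two sum queries
print(answer_queries(x, queries, epsilon=0.2))
```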

21 Using Insensitive Functions
Strategies:
- Use the theorem directly: output f(x) + Lap(S_f/ε). But S_f may be hard to analyze or compute, and S_f can be high even for functions considered "insensitive".
- Express f in terms of insensitive functions. The resulting noise then depends on the input (in both form and magnitude).

22 Example – Expressing f in Terms of Insensitive Functions
x ∈ {0,1}^n, f(x) = (Σ x_i)^2. Then S_f = n^2 − (n−1)^2 = 2n − 1, so the direct answer is a_f = (Σ x_i)^2 + Lap(2n/ε). If f(x) << n, the noise dominates.
However, f(x) = (g(x))^2 where g(x) = Σ x_i, and S_g = 1. It is better to query for g: get a_g = Σ x_i + Lap(1/ε) and estimate f(x) as (a_g)^2. Taking ε constant, the error has standard deviation O(Σ x_i), plus an additive (1/ε)^2 term, as shown numerically below.

23 Useful Insensitive Functions
- Means, variances, … (with appropriate assumptions on the data)
- Histograms and contingency tables
- Singular value decomposition
- Distance to a property
- Functions with low query complexity

24 Histograms/Contingency Tables
x_1, …, x_n ∈ D, where D is partitioned into d disjoint bins b_1, …, b_d. Define h(x) = (v_1, …, v_d), where v_j = |{i : x_i ∈ b_j}|.
S_h = 2: changing one value x_i changes the count vector by at most 2 (in L1 norm), irrespective of d. So add Laplace noise Lap(2/ε) to each count; this can also be done with sum queries. A sketch follows.
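A sketch of the histogram release in Python (numpy; the bin edges and data are illustrative):

```python
import numpy as np

def noisy_histogram(x, bins, epsilon, rng=None):
    """Release all d counts at once. Changing one row moves the count
    vector by at most 2 in L1 norm, so Lap(2/epsilon) noise per bin
    gives epsilon-indistinguishability, independent of d."""
    rng = rng or np.random.default_rng()
    counts, _ = np.histogram(x, bins=bins)
    return counts + rng.laplace(0.0, 2.0 / epsilon, size=len(counts))

ages = np.random.default_rng(1).integers(0, 100, size=5000)
print(noisy_histogram(ages, bins=np.arange(0, 101, 10), epsilon=0.5))
```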

25 Distance to a Property
Let P ⊆ D^n be a set of "good" databases. The distance from x to P is the minimum number of points in x that must be changed to put x in P.
This function always has sensitivity 1, so add Laplace noise Lap(1/ε).
Examples: distance to being clusterable; weight of the minimum cut in a graph.

26 Approximations with Low Query Complexity
Lemma: Assume an algorithm A randomly samples γn points and Pr[A(x) ∈ f(x) ± Δ] > (1+γ)/2. Then S_f ≤ 2Δ.
Proof: Consider x, x' that differ on point i, and let A_i be A conditioned on not choosing point i. Then Pr[A_i(x) ∈ f(x) ± Δ] > 1/2 and Pr[A_i(x') ∈ f(x') ± Δ] > 1/2. Since A_i ignores point i, A_i(x) and A_i(x') are identically distributed, so there exists a point p in their support within distance Δ of both f(x) and f(x'). Hence S_f ≤ 2Δ.

27 Local Sensitivity
The median is typically insensitive, yet has large global sensitivity.
LS_f(x) = max_{x' : dist(x,x')=1} ||f(x) − f(x')||_1
Example: f(x) = min(Σ x_i, 10), where x_i ∈ {0,1}. Then LS_f(x) = 1 if Σ x_i ≤ 10, and 0 otherwise.

28 Local Sensitivity – First Attempt
Calibrate noise to LS_f(x): answer query f by f(x) + Lap(LS_f(x)/ε).
If x_1 = … = x_10 = 1 and x_11 = … = x_n = 0: answer = 10 + Lap(1/ε).
If x_1 = … = x_11 = 1 and x_12 = … = x_n = 0: answer = exactly 10.
The noise magnitude itself may be disclosive!

29 How to Calibrate Noise to Local Sensitivity?
The noise magnitude at a point x must depend on LS_f(y) for all y ∈ D^n, with the influence of y decaying with its distance from x:
N*_f(x) = max_{y ∈ D^n} LS_f(y) · e^{−ε·dist(x,y)}
(Running example: the median.)
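For intuition, a brute-force sketch of this quantity in Python. It enumerates all of D^n, so it is exponential in n and usable only on toy inputs; the function names are mine, and efficient computation (e.g. for the median) is beyond these slides:

```python
import itertools
import numpy as np

def local_sensitivity(f, x, domain):
    """LS_f(x): max |f(x) - f(x')| over x' differing from x in one entry."""
    fx = f(x)
    return max(abs(fx - f(x[:i] + (v,) + x[i + 1:]))
               for i in range(len(x)) for v in domain)

def smoothed_sensitivity(f, x, domain, epsilon):
    """N*_f(x) = max over y in D^n of LS_f(y) * exp(-epsilon * dist(x, y))."""
    return max(local_sensitivity(f, y, domain)
               * np.exp(-epsilon * sum(a != b for a, b in zip(x, y)))
               for y in itertools.product(domain, repeat=len(x)))

median = lambda x: float(np.median(x))
print(smoothed_sensitivity(median, (0, 0, 1, 1, 1), (0, 1), epsilon=0.5))
```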

30 Talk Outline
- A framework for output perturbation based on "sensitivity"
- Formalizing "sensitivity" and relating it to privacy definitions
- Examples of sensitivity-based analysis
- New ideas
- Basic models for privacy: local vs. global, noninteractive vs. interactive

31 Models for Data Privacy
Individuals (Alice, Bob, you, …) provide their data for collection and sanitization; users (government, researchers, marketers, …) access the result.

32 Models for Data Privacy – Local vs. Global
Global: individuals hand their data to a trusted center, which performs collection and sanitization.
Local: each individual sanitizes their own data before releasing it (no trusted center); this includes "SFE" (secure function evaluation).

33 Models for Data Privacy – Interactive vs. Noninteractive
Noninteractive: the data is collected, sanitized once, and published.
Interactive: users interact with the sanitizer, posing queries and receiving answers online.

34 Models for Data Privacy – Summary
Local (vs. global):
- No central trusted party; individuals interact directly with the (untrusted) user
- Individuals control their own privacy
Noninteractive (vs. interactive):
- Easier distribution: web site, book, CD, …
- More secure: the data can be erased once it is processed
- Almost all work in statistics and data mining is noninteractive!

Speaker notes: This talk is about database privacy. The term can mean many things, but the example to keep in mind is a government census: individuals provide information to a trusted government agency, which processes it and makes a sanitized version available for public use. Privacy here is required by law, is ethical, and is pragmatic: people won't answer unless they trust you. There are two goals: users should be able to extract global statistics about the population being studied, while the privacy of the individuals who participate is protected. This is a fundamental tradeoff between privacy and utility. The extremes are easy: publishing nothing at all provides complete privacy but no utility, and publishing the raw data provides full utility but no privacy. The first-order goal of this paper is to plot a middle course between the extremes: a compromise that allows users to obtain useful information while also providing a meaningful guarantee of privacy. This problem is not new; it is often called the "statistical database" problem. A second-order goal of this paper is to change the way the problem is approached and treated in the literature. Utility is easy to understand and to explain to a user: to prove that a scheme provides a particular utility, just give an algorithm and an analysis. Privacy is much harder to get a handle on.

35 Four Basic Models
The two axes give four basic models, ranging from local noninteractive to global interactive; some pairs of models are incomparable, and some relationships are still open (?).

36 Interactive vs. Noninteractive
Next: comparing the noninteractive and interactive models (local noninteractive vs. global interactive).

37 Separating Interactive from Noninteractive
Random samples: one can compute estimates of many statistics, with (essentially) no need to decide on the queries ahead of time. But a random sample is not private (unless both the domain and the sample are small [CM06]).
Interaction gives the power of random samples, with privacy: e.g., sum queries f(x) = Σ_i f_i(x_i), even chosen adaptively.
Noninteractive schemes seem weaker. Intuition: privacy means one cannot answer all questions ahead of time (e.g. [DN03]), so the sanitization must be tailored to specific functions.

38 Separating Interactive from Noninteractive
Theorem: If D = {0,1}^d, then for any private, noninteractive scheme, many sum queries cannot be learned unless d = o(log n).
So the noninteractive model is weaker than the interactive one: it cannot emulate a random sample when the data is complex.

39 Local vs. Global
Next: comparing the local and global models (local noninteractive vs. global interactive).

40 Separating Local from Global
Let D = {0,1}^d for d = Θ(log n), and view x as an n×d matrix.
Global: rank(x) has sensitivity 1, so it can be released with low noise.
Local: one cannot distinguish whether rank(x) = k or rank(x) is much larger than k, for a suitable choice of d, n, k.

41 To Sum Up
- Defined privacy in terms of indistinguishability; considered semantic versions of the definitions ("crypto" with non-negligible error).
- Showed how to calibrate noise to sensitivity and to the number of queries. It seems that useful statistics should be insensitive; some commonly used functions indeed have low sensitivity. For the others: local sensitivity?
- Began to explore the relationships between the basic models.

42 Questions
- Which useful functions are insensitive? What would you like to compute?
- Can we get stronger results using local sensitivity, computational assumptions [MS06], or entropy in the data?
- How to deal with small databases?
- Privacy in a broader context: rationalizing privacy and privacy-related decisions; which types of privacy? how to decide upon privacy parameters? …
- Handling rich data: audio, video, pictures, text, …

