
1 Foundations of Privacy Lecture 2 Lecturer: Moni Naor

2 Recap of last week's lecture: Privacy is hard to capture precisely, but we clarified a direction. Cryptography, Secure Function Evaluation, and Privacy. Examples of attacks on presumed sanitization. Dalenius's Goal. The Impossibility of Disclosure Prevention. Netflix.

3 Cryptography and Privacy. Extremely relevant, but does not solve the privacy problem. Secure Function Evaluation: how to distributively compute a function f(X_1, X_2, …, X_n), where X_j is known to party j. E.g., S = sum(a, b, c, …). The parties should only learn the final output (S). Many results, depending on the number of players, the means of communication, the power and model of the adversary, and how the function is represented. Here we are more worried about what to compute than how to compute it.

4 Example: Securely Computing Sums. Parties hold X_1, …, X_n with 0 ≤ X_i ≤ P-1 and want to compute S = Σ X_i mod P. Party 1 selects r ∈_R [0..P-1] and sends Y_1 = X_1 + r. Party i receives Y_{i-1} and sends Y_i = Y_{i-1} + X_i. Party 1 receives Y_n and announces S = Σ X_i = Y_n − r. All arithmetic is mod P.
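A minimal sketch of this ring protocol in Python (the function name and the modulus in the example are my own; the steps are the ones on the slide):

```python
import random

def secure_sum(inputs, P):
    """Ring protocol for computing sum(inputs) mod P.

    Each X_i satisfies 0 <= X_i <= P-1. Party 1 blinds its input with a
    random mask r; every other party adds its input to the running value;
    at the end party 1 removes r and announces the sum.
    """
    r = random.randrange(P)            # party 1's secret mask
    y = (inputs[0] + r) % P            # Y_1 = X_1 + r
    for x in inputs[1:]:               # party i sends Y_i = Y_{i-1} + X_i
        y = (y + x) % P
    return (y - r) % P                 # party 1 announces S = Y_n - r

# Example: five parties, modulus 101
xs = [3, 14, 15, 92, 6]
assert secure_sum(xs, 101) == sum(xs) % 101
```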

5 Is this Protocol Secure? To talk rigorously about cryptographic security we must: specify the power of the adversary (access to the data/system, computational power, "auxiliary" information); and define a break of the system (what is a compromise, what is a "win" for the adversary?). Here the adversary can be all powerful, and if it controls two players the protocol is insecure.

6 The Simulation Paradigm. A protocol is considered secure if for every adversary (of a certain type) there exists an (adversary) simulator that outputs an indistinguishable "transcript". Examples: encryption, zero-knowledge, secure function evaluation. The power of analogy.

7 SFE: Simulating the Ideal Model. A protocol is considered secure if for every adversary there exists a simulator operating in the "ideal" (trusted-party) model that outputs an indistinguishable transcript. Breaking = distinguishing! In the ideal model the simulator can obtain the value of the function at a given point. Major result: any function f that can be evaluated using polynomial resources can be securely evaluated using polynomial resources.

8 The Problem with SFE. SFE does not imply privacy: the problem is with the ideal model. E.g., S = sum(a,b). Each player learns only what can be deduced from S and her own input to f; if S and a yield b, so be it. We need ways of talking about leakage even in the ideal model.

9 Homework question 1: Suggest a simulator for the summation protocol above. Assume benign behavior of the adversary: it sees the messages of a single party, does not deviate from the protocol, and may have arbitrary knowledge of the inputs X_1, X_2, …, X_n.

10 Computational Security of Encryption: Semantic Security. Whatever an adversary A can compute about an encrypted string X ∈ {0,1}^n, an A' that does not see the encryption of X can also compute; A' simulates A's knowledge with respect to X. A selects a distribution D_n on {0,1}^n and a relation R(X,Y) computable in probabilistic polynomial time. For every pptm A there is a pptm A' such that for every pptm relation R and X ∈_R D_n: |Pr[R(X, A(E(X)))] − Pr[R(X, A'(·))]| is negligible. The outputs of A and A' are indistinguishable even to a tester who knows X.

11 [Diagram: A is given E(X) and outputs Y with R(X,Y); A' is given nothing and outputs Y with R(X,Y). In both cases X ∈_R D_n, and the output distributions of A and A' are ≈ indistinguishable.]

12 Homework question 2: the one-time pad. One-time pad encryption: let r ∈_R {0,1}^n and think of r and m as elements of a group. To encrypt m, send c = r + m; to decrypt z, compute m = z − r (in the bit-string view, c = m ⊕ r). Show that the one-time pad satisfies semantic security.
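A sketch of the bit-string (XOR) version in Python; the function names are mine:

```python
import secrets

def otp_encrypt(m: bytes) -> tuple[bytes, bytes]:
    """Encrypt m under a fresh, uniformly random pad r of the same length.
    Returns (r, c) where c = m XOR r."""
    r = secrets.token_bytes(len(m))
    c = bytes(mi ^ ri for mi, ri in zip(m, r))
    return r, c

def otp_decrypt(r: bytes, c: bytes) -> bytes:
    """Recover m = c XOR r (XOR is its own inverse)."""
    return bytes(ci ^ ri for ci, ri in zip(c, r))

r, c = otp_encrypt(b"attack at dawn")
assert otp_decrypt(r, c) == b"attack at dawn"
```

Because r is uniform and used only once, c is uniform regardless of m; that observation is the core of the semantic-security argument the homework asks for.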

13 Why Settle for Ad Hoc Notions of Privacy? Dalenius, 1977: anything that can be learned about a respondent from the statistical database can be learned without access to the database. Captures the possibility that "I" may be an extrovert; the database doesn't leak personal information; the adversary is a user. Analogous to semantic security for crypto (Goldwasser-Micali 1982): anything that can be learned from the ciphertext can be learned without the ciphertext; there the adversary is an eavesdropper.

14 Making it Slightly Less Vague: cryptographic rigor applied to privacy. Define a break of the system: what is a compromise, what is a "win" for the adversary? Specify the power of the adversary: access to the data, computational power, "auxiliary" information. Be conservative/paranoid by nature: protect against all feasible attacks.

15 In full generality, Dalenius's goal is impossible: the database teaches that smoking causes cancer; I smoke in public; access to the DB teaches that I am at increased risk for cancer. But what about cases where there is significant knowledge about the database distribution?

16 Outline. The framework. A general impossibility result: Dalenius's goal cannot be achieved in a very general sense. The proof: simplified, then the general case.

17 Two Models. Non-interactive: the data are sanitized and released (Database → San → Sanitized Database).

18 Two Models. Interactive: multiple queries, adaptively chosen (the user queries San, which has access to the Database).

19 Auxiliary Information. A common theme in many privacy horror stories: not taking side information into account. Netflix challenge: the sanitization SAN(DB) = remove names did not take into account the auxiliary information available on IMDb [Narayanan-Shmatikov].

20 Not Learning from the DB. With access to the database: A interacts with San and also has auxiliary information. Without access to the database: A' has only the auxiliary information. There is some utility of the DB that legitimate users should learn, and a possible breach of privacy. Goal: users learn the utility without the breach.

21 Not Learning from the DB. Want: anything that can be learned about an individual from the database can be learned without access to the database. ∀ D ∀ A ∃ A' such that, with high probability over DB ∈_R D and for all auxiliary information z: |Pr[A(z) ↔ DB wins] − Pr[A'(z) wins]| is small.

22 Illustrative Example of the Difficulty. Want: anything that can be learned about a respondent from the database can be learned without access to the database. More formally: ∀ D ∀ A ∃ A' such that, with high probability over DB ∈_R D and for all auxiliary information z, |Pr[A(z) ↔ DB wins] − Pr[A'(z) wins]| is small. Example: suppose the height of an individual is sensitive information and the average height in the DB is not known a priori. Aux z = "Adam is 5 cm shorter than the average in the DB". A learns the average height in the DB and hence also Adam's height; A' does not.

23 Defining "Win": The Compromise Function. A notion of privacy compromise: the adversary outputs y, and a decider tells (0/1) whether y is a privacy breach of DB. A privacy compromise should be non-trivial: it should not be possible to find a privacy breach from auxiliary information alone, which would hardly be a convincing compromise. A privacy breach should exist: given DB there should be a y that is a privacy breach, and it should be possible to find y efficiently. Fix a bound on the probability of finding a breach from auxiliary information alone.

24 Basic Concepts. Distribution D on (finite) databases: something about the database must be unknown; captures knowledge about the domain (e.g., rows of the database correspond to owners of 2 pets). Privacy mechanism San(D, DB): can be interactive or non-interactive; may have access to the distribution D. Auxiliary information generator AuxGen(D, DB): has access to the distribution and to DB; formalizes partial knowledge about DB. Utility vector w: answers to k questions about the DB; (most of) the utility vector can be learned by the user; the utility must inherit sufficient min-entropy from the source D.

25 Impossibility Theorem (informal). For any* distribution D on databases DB, any* reasonable privacy-compromise decider C, and any useful* privacy mechanism San: there is an auxiliary information generator AuxGen and an adversary A such that for all adversary simulators A', A(z) ↔ San(DB) wins (finds a compromise) but A'(z) does not win, where z = AuxGen(DB) tells us information we did not know.

26 Impossibility Theorem. Fix any useful* privacy mechanism San and any reasonable privacy-compromise decider C. Then there is an auxiliary information generator AuxGen and an adversary A such that for "all" distributions D and all adversary simulators A': Pr[A(D, San(D,DB), AuxGen(D,DB)) wins] − Pr[A'(D, AuxGen(D,DB)) wins] is at least some suitably large gap. The probability spaces are over the choice of DB ∈_R D and the coin flips of San, AuxGen, A, and A'. To completely specify the theorem we need an assumption on the entropy of the utility vector w and on how well San(w) behaves.

27 Strategy. The auxiliary information generator will provide a hint that, together with the utility vector w, yields the privacy breach. We want AuxGen to work without knowing D, just DB: find a privacy breach y and encode it in z, making sure that z alone does not give y, only z together with w does. Complication: is the utility vector w learned exactly by the user, or only approximately?

28 Entropy of Random Sources. A source is a probability distribution X on {0,1}^n that contains some "randomness". One measure of randomness is Shannon entropy, H(X) = − Σ_x Pr[X=x] log Pr[X=x], which represents how much we can compress X on average. But even a high-entropy source may have a point with probability 0.9. Min-entropy: H_min(X) (also written H_∞(X)) = − log max_x Pr[X=x], which captures the probability of the most likely value of X. Definition: X is a k-source if H_min(X) ≥ k, i.e., Pr[X = x] ≤ 2^{-k} for all x ∈ {0,1}^n.
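A small sketch computing both quantities for a distribution given as a list of probabilities (function names are mine):

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum p*log2(p), ignoring zero-probability points."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def min_entropy(probs):
    """H_min(X) = -log2 of the largest point probability."""
    return -math.log2(max(probs))

# One heavy point of mass 0.9 plus 1000 light points: Shannon entropy is
# about 1.47 bits, but min-entropy is only about 0.15 bits, so this is
# merely a 0.15-source despite the "spread out" tail.
probs = [0.9] + [0.1 / 1000] * 1000
print(shannon_entropy(probs))
print(min_entropy(probs))
```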

29 Strategy (picture). [Diagram: A interacts with San(DB) and also receives the auxiliary information z; A learns the utility w and outputs the breach y, which is checked by the decider. A' receives only z, and the question is whether it can output y. A plays the role of a legitimate user and learns the breach y.]

30 Min-entropy. Definition: X is a k-source if H_min(X) ≥ k, i.e., Pr[X = x] ≤ 2^{-k} for all x. Examples: Bit-fixing: some k coordinates of X are uniform, the rest are fixed or even depend arbitrarily on the others. Unpredictable source: ∀ i ∈ [n] and b_1, …, b_{i-1} ∈ {0,1}: k/n ≤ Pr[X_i = 1 | X_1, …, X_{i-1} = b_1, …, b_{i-1}] ≤ 1 − k/n. Flat k-source: uniform over a set S ⊆ {0,1}^n with |S| = 2^k. Fact: every k-source is a convex combination of flat ones.

31 Min-Entropy and Statistical Distance. For a probability distribution X over {0,1}^n, H_min(X) = − log max_x Pr[X = x]; X is a k-source if H_min(X) ≥ k. This captures the probability of the most likely value of X. Statistical distance: Δ(X,Y) = ½ Σ_a |Pr[X=a] − Pr[Y=a]|. We want the extracted output to be close to the uniform distribution.

32 Extractors. A universal procedure for "purifying" an imperfect source. Definition: Ext: {0,1}^n × {0,1}^d → {0,1}^ℓ is a (k, ε)-extractor if for any k-source X, Δ(Ext(X, U_d), U_ℓ) ≤ ε. [Diagram: a k-source x of length n (2^k strings) and a d-bit random seed s enter Ext, which outputs ℓ almost-uniform bits.]

33 Strong extractors. The output looks random even after seeing the seed. Definition: Ext is a (k, ε) strong extractor if Ext'(x,s) = s ∘ Ext(x,s) is a (k, ε)-extractor; i.e., for all k-sources X, for a 1 − ε' fraction of seeds s ∈ {0,1}^d, Ext(X, s) is ε-close to U_ℓ.

34 Extractors from Hash Functions. Leftover Hash Lemma [ILL89]: universal (pairwise-independent) hash functions yield strong extractors, with output length ℓ = k − O(1) and seed length d = O(n). Example: Ext(x, (a,b)) = first ℓ bits of a·x + b in GF[2^n]. With almost pairwise independence: seed length d = O(log n + k) and ℓ = k − 2 log(1/ε).
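A toy sketch of the pairwise-independent-hash extractor over GF(2^8); the field size, the irreducible polynomial, and the helper names are my choices for illustration, while the real construction works over GF(2^n) for the source length n:

```python
import secrets

IRRED = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf256_mul(a: int, b: int) -> int:
    """Carry-less multiplication of two bytes, reduced modulo IRRED."""
    res = 0
    while b:
        if b & 1:
            res ^= a
        a <<= 1
        if a & 0x100:
            a ^= IRRED
        b >>= 1
    return res

def ext(x: int, seed: tuple[int, int], ell: int) -> int:
    """Ext(x, (a, b)) = first ell bits of a*x + b in GF(2^8).
    By the Leftover Hash Lemma this family gives a strong extractor for
    k-sources with k roughly ell + 2*log(1/eps)."""
    a, b = seed
    h = gf256_mul(a, x) ^ b        # pairwise-independent hash of x
    return h >> (8 - ell)          # keep the first ell bits

seed = (secrets.randbelow(256), secrets.randbelow(256))
print(ext(0b10110001, seed, ell=3))
```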

35 Suppose w Is Learned Exactly. AuxGen and A share a secret: w. AuxGen(DB): find a privacy breach y of DB of length ℓ; find w from DB by simulating A; choose s ∈_R {0,1}^d and compute Ext(w,s); set z = (s, Ext(w,s) ⊕ y).
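A schematic sketch of this hint in Python. The extractor here is a stand-in (a pairwise-independent hash modulo a Mersenne prime rather than the GF(2^n) hash above), and all names and parameters are mine:

```python
import secrets

P = (1 << 61) - 1   # Mersenne prime; hash family h_{a,b}(w) = a*w + b mod P

def toy_ext(w: int, seed: tuple[int, int], ell: int) -> int:
    """Pairwise-independent hash used as a strong extractor (Leftover
    Hash Lemma); outputs the top ell bits of a*w + b mod P."""
    a, b = seed
    return ((a * w + b) % P) >> (61 - ell)

def aux_gen(w: int, y: int, ell: int):
    """AuxGen's hint: z = (s, Ext(w, s) XOR y). Without w, the second
    component is close to uniform, so y stays hidden."""
    seed = (secrets.randbelow(P), secrets.randbelow(P))
    return seed, toy_ext(w, seed, ell) ^ y

def adversary_recover(z, w: int, ell: int) -> int:
    """A learns w from interacting with San, recomputes Ext(w, s), and
    strips the pad to reveal the breach y."""
    seed, masked = z
    return toy_ext(w, seed, ell) ^ masked

w, y, ell = 123456789, 0b1011, 4     # w: utility vector, y: breach bits
z = aux_gen(w, y, ell)
assert adversary_recover(z, w, ell) == y
```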

36 Suppose w Is Learned Exactly (picture). AuxGen and A share a secret: w. [Diagram: on the real side, AuxGen reads DB and sends z = (s, Ext(w,s) ⊕ y) to A, which also interacts with San(DB), learns w, and submits y to the compromise decider C; on the simulated side, A' gets only z.] Technical conditions: H_min(W | y) ≥ |y| and |y| is "safe".

37 Why is it a compromise? AuxGen and A share a secret: w, and z = (s, Ext(w,s) ⊕ y). Why doesn't A' learn y? For each possible value of y, (s, Ext(w,s)) is ε-close to uniform; hence (s, Ext(w,s) ⊕ y) is ε-close to uniform. Needs H_min(W) ≥ 3ℓ + O(1). Technical conditions: H_min(W | y) ≥ |y| and |y| is "safe".

38 If w Is Not Learned Completely by A. Relaxed utility: something close to w is learned, so AuxGen(D, DB) does not know exactly what A will learn. We need that being close to w produces the same extracted randomness as w; ordinary extractors offer no such guarantee. Fuzzy extractors (m, ℓ, t, ε) [Dodis, Reyzin and Smith]: a pair (Gen, Rec). Gen(w) outputs an extracted r ∈ {0,1}^ℓ and a public string p. For any distribution W of min-entropy at least m: (R, P) ← Gen(W) implies (R, P) and (U_ℓ, P) are within statistical distance ε. Rec(p, w*) reconstructs r given p and any w* sufficiently close to w: (r, p) ← Gen(w) and ||w − w*||_1 ≤ t imply Rec(w*, p) = r.

39 Construction Based on ECC. Take an error-correcting code ECC in which any two codewords differ in at least 2t+1 bits, so it can correct t errors. Gen(w): p = w ⊕ ECC(r'), where r' is random, and r is extracted from r'. Given p and a w' close to w: compute w' ⊕ p, decode to get ECC(r'), and extract r from r' as previously.
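A minimal sketch of this code-offset construction over bit vectors, using a 3-fold repetition code as a toy ECC (so t = 1 error per block can be corrected); all names and parameter choices are mine, and a real fuzzy extractor would additionally run a strong extractor on r':

```python
import secrets

def ecc_encode(bits):                      # 3-fold repetition code
    return [b for b in bits for _ in range(3)]

def ecc_decode(bits):                      # majority vote per block of 3
    return [int(sum(bits[i:i+3]) >= 2) for i in range(0, len(bits), 3)]

def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def gen(w):
    """Gen(w): pick random r', publish p = w XOR ECC(r'); here r' itself
    plays the role of the extracted randomness."""
    r_prime = [secrets.randbelow(2) for _ in range(len(w) // 3)]
    p = xor(w, ecc_encode(r_prime))
    return r_prime, p

def rec(p, w_star):
    """Rec(p, w*): w* XOR p equals ECC(r') with at most t errors per
    block; decode to recover r'."""
    return ecc_decode(xor(w_star, p))

w = [1, 0, 1, 1, 0, 1, 0, 0, 1]            # the (approximately learned) utility
r, p = gen(w)
w_star = w[:]; w_star[4] ^= 1              # one bit of measurement noise
assert rec(p, w_star) == r
```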

40 Fuzzy Extractors and Biometric Data. The original motivation of fuzzy extractors: storing biometric data so that it is hard to retrieve the original, yet given another "measurement" of the data one can compare and retrieve the hidden data. The traditional use is to hide data; here we derive an impossibility result by hiding data.

41 Fuzzy Extractors (m, ℓ, t, ε): (Gen, Rec). Gen(w) outputs an extracted r ∈ {0,1}^ℓ and a public string p. For any distribution W of sufficient min-entropy, (R, P) ← Gen(W) implies (R, P) and (U_ℓ, P) are within statistical distance ε. Rec reconstructs r given p and any w* sufficiently close to w (distance measured to the codeword): (r, p) ← Gen(w) and ||w − w*||_1 ≤ t imply Rec(w*, p) = r. Idea: (r, p) ← Gen(w); set z = (p, r ⊕ y). A reconstructs r from a w* close to w, while r looks almost uniform to A' even given p. Problem: p might leak information about w and might disclose a different privacy breach y'. Solution: AuxGen interacts with DB (simulating the sanitizer) to learn a safe w'; (r, p) ← Gen(w'); set z = (p, r ⊕ y). The w'' learned by A and w' are both sufficiently close to w, hence w' and w'' are close to each other, so A(w'', p) can reconstruct r. By assumption, w' does not yield a breach!

42 Case: w Not Learned Completely (picture). AuxGen and A share a secret: r. [Diagram: AuxGen learns a safe w' from DB, computes (p, r) = Gen(w'), and sends z = (p, r ⊕ y). A interacts with San(DB), learns w'', and recovers r = Rec(p, w''); A' gets only z. r is almost uniform given p, and p should not be disclosive.]

43 Case: w Not Learned Completely (analysis). Pr[A'(z) wins] ≤ Pr[A' ↔ San(D, DB) wins] + ε, and both terms are small: r is almost uniform given p, and p should not be disclosive, since it is derived from the safe w'.

44 Case: w Not Learned Completely (parameters). Need extra min-entropy: H_min(W) ≥ 3ℓ + |p|. As before, Pr[A'(z) wins] stays small, while A recovers r = Rec(p, w'') and hence the breach y.

45 Open Problem: Dealing with Relations. What if the result of San(DB) varies a lot: there is some w such that (DB, w) satisfies a relation, and there could be many different w's that match a given DB. Is this useful? Is it possible to hide information in the auxiliary input in this case?

46 Conclusions. We should not give up on rigorous privacy, but should come up with more modest goals. The impossibility works even if the subject is not in the database! This motivates a definition based on the increased risk incurred by joining the database: the risk to Adam if he is in the database vs. the risk to Adam if he is not. Other considerations: computational efficiency, other notions. Next: Differential Privacy.

47 Desirable Properties of a Sanitization Mechanism. Composability: applying the sanitization several times yields a graceful degradation; q releases, each ε-DP, are q·ε-DP. Robustness to side information: no need to specify exactly what the adversary knows. Differential privacy satisfies both.

48 Differential Privacy [Dwork, McSherry, Nissim and Smith]. Adjacency: D+Me and D−Me. Protect individual participants: the probability of every bad event (or any event) increases only by a small multiplicative factor when I enter the DB, so I may as well participate. A is an ε-differentially private sanitizer if for all DBs D, all Me, and all events T: e^{−ε} ≤ Pr_A[A(D+Me) ∈ T] / Pr_A[A(D−Me) ∈ T] ≤ e^{ε} ≈ 1 + ε. This handles auxiliary input.

49 Differential Privacy. A gives ε-differential privacy if for all neighboring D_1 and D_2 and all T ⊆ range(A): Pr[A(D_1) ∈ T] ≤ e^{ε} Pr[A(D_2) ∈ T]. The ratio of response probabilities is bounded, even for the bad responses. This neutralizes all linkage attacks and composes unconditionally and automatically: the ε's add up (Σ_i ε_i).

50 Differential Privacy: Important Properties. Handles auxiliary information. Composes naturally: if A_1(D) is ε_1-diffP and, for all z_1, A_2(D, z_1) is ε_2-diffP, then A_2(D, A_1(D)) is (ε_1 + ε_2)-diffP. Proof: for all adjacent D, D' and all (z_1, z_2), writing P[z_1] = Pr_{z~A_1(D)}[z = z_1], P'[z_1] = Pr_{z~A_1(D')}[z = z_1], P[z_2] = Pr_{z~A_2(D, z_1)}[z = z_2], P'[z_2] = Pr_{z~A_2(D', z_1)}[z = z_2]: e^{−ε_1} ≤ P[z_1]/P'[z_1] ≤ e^{ε_1} and e^{−ε_2} ≤ P[z_2]/P'[z_2] ≤ e^{ε_2}, hence e^{−(ε_1+ε_2)} ≤ P[(z_1,z_2)]/P'[(z_1,z_2)] ≤ e^{ε_1+ε_2}.

51 Example: NO Differential Privacy. U is a set of (name, tag ∈ {0,1}) tuples; one counting query: the number of participants with tag = 1. Sanitizer A: choose and release a few random tags. Bad event T: only my tag is 1, and my tag is released. Then Pr_A[A(D+Me) ∈ T] ≥ 1/n while Pr_A[A(D−Me) ∈ T] = 0, so the ratio Pr_A[A(D+Me) ∈ T] / Pr_A[A(D−Me) ∈ T] is unbounded: not differentially private for any ε!
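A quick numerical sketch of this counterexample; the database contents, the subsampling size, and the names are my own:

```python
import random

def release_random_tags(db, k=1):
    """Non-private 'sanitizer': publish k randomly chosen (name, tag) pairs."""
    return random.sample(db, k)

def bad_event(released):
    """T: my record (name 'Me', tag 1) appears in the release."""
    return any(name == "Me" and tag == 1 for name, tag in released)

others = [(f"p{i}", 0) for i in range(9)]      # everyone else has tag 0
with_me, without_me = others + [("Me", 1)], others

trials = 100_000
p_with = sum(bad_event(release_random_tags(with_me)) for _ in range(trials)) / trials
p_without = sum(bad_event(release_random_tags(without_me)) for _ in range(trials)) / trials
print(p_with, p_without)   # roughly 0.1 vs exactly 0: no e^eps bound can hold
```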

52 Size of ε. How small can ε be? It cannot be negligible. Why? A hybrid argument: take D, D' to be totally unrelated databases, whose utility should be very different, and consider a sequence D_0 = D, D_1, D_2, …, D_n = D' where D_i and D_{i+1} are adjacent. For every output set T, Pr[T | D] ≥ Pr[T | D'] · e^{−εn}, so a negligible ε would force the behavior on D and D' to be essentially the same. How large can ε be? Think of a small constant.

53 Answering a Single Counting Query. U is a set of (name, tag ∈ {0,1}) tuples; one counting query: the number of participants with tag = 1. Sanitizer A: output the number of 1's plus noise. This is differentially private if the noise is chosen properly: choose it from the Laplace distribution.

54 Laplacian Noise. The Laplace distribution Y = Lap(b) has density Pr[Y = y] = (1/2b) e^{−|y|/b}, with standard deviation O(b). Taking b = 1/ε gives Pr[Y = y] ∝ e^{−ε|y|}.

55 Laplacian Noise: ε-Privacy. Take b = 1/ε, so Pr[Y = y] ∝ e^{−ε|y|}. Release q(D) + Lap(1/ε). For adjacent D, D': |q(D) − q(D')| ≤ 1, so for every output a: e^{−ε} ≤ Pr_D[a] / Pr_{D'}[a] ≤ e^{ε}.
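A sketch of the counting-query Laplace mechanism, with an empirical check of the e^ε density-ratio bound on a pair of adjacent databases (the databases, the candidate output, and the function names are my own):

```python
import math, random

def count_query(db):
    return sum(tag for _, tag in db)

def laplace_mechanism(db, eps):
    """Release q(D) + Lap(1/eps); the sensitivity of a counting query is 1.
    A Laplace sample is an exponential with a random sign."""
    return count_query(db) + random.expovariate(eps) * random.choice([-1, 1])

def laplace_density(y, b):
    return math.exp(-abs(y) / b) / (2 * b)

D_minus = [(f"p{i}", i % 2) for i in range(10)]
D_plus = D_minus + [("Me", 1)]               # adjacent: differs in one row

eps, a = 0.5, 5.8                            # any candidate output a
ratio = laplace_density(a - count_query(D_plus), 1 / eps) / \
        laplace_density(a - count_query(D_minus), 1 / eps)
assert math.exp(-eps) <= ratio <= math.exp(eps)
print(laplace_mechanism(D_plus, eps))
```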

56 Laplacian Noise: Õ(1/ε) Error. With b = 1/ε, Pr[Y = y] ∝ e^{−ε|y|} and Pr_{y~Y}[|y| > k·(1/ε)] = O(e^{−k}). The expected error is 1/ε, and with high probability the error is Õ(1/ε).

57 Randomized Response. The randomized response technique [Warner 1965] is a method for polling stigmatizing questions ("trust no one"). Idea: lie with a known probability. Specific answers are deniable, yet aggregate results are still valid, and the data is never stored "in the plain" (each bit is reported as value + noise). Popular in the DB literature.
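A minimal sketch of the classic coin-flip variant; the exact probabilities below are one common choice, not necessarily the ones used in the lecture:

```python
import random

def randomized_response(true_bit: int) -> int:
    """With probability 1/2 answer truthfully, otherwise answer a uniform
    bit, so Pr[report 1 | true 1] = 3/4 and Pr[report 1 | true 0] = 1/4."""
    return true_bit if random.random() < 0.5 else random.randrange(2)

def estimate_fraction(reports):
    """Unbiased estimate of the true fraction of 1s:
    E[report] = 1/4 + true_fraction/2, so invert that affine map."""
    return 2 * (sum(reports) / len(reports)) - 0.5

truth = [1] * 300 + [0] * 700                 # 30% true 1s
reports = [randomized_response(b) for b in truth]
print(estimate_fraction(reports))             # roughly 0.3
```

With these probabilities each reported bit is (ln 3)-differentially private, since the likelihood ratio for the two true values is at most (3/4)/(1/4) = 3.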

58 Randomized Response with Laplacian Noise. Initial idea: each user i, on input x_i ∈ {0,1}, adds independent Laplace noise of magnitude 1/ε to x_i. Privacy: since each increment is protected by Laplace noise, the report is differentially private whether x_i is 0 or 1. Accuracy: the noise cancels out, giving error Õ(√T), where T is the total number of users. For sparse streams this error is too high.

59 Scaling Noise to Sensitivity [DMNS06]. The (global) sensitivity of a query q: U^n → [0, n] is GS_q = max_{D,D'} |q(D) − q(D')| over adjacent D, D'. For a counting query q, GS_q = 1. The previous argument generalizes: for any query q: U^n → [0, n], releasing q(D) + Lap(GS_q/ε) is ε-private with error Õ(GS_q/ε).
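The general mechanism as a short sketch; the example query, the database, and the names are mine:

```python
import random

def laplace(scale: float) -> float:
    """Sample Lap(scale) as an exponential with a random sign."""
    return random.expovariate(1 / scale) * random.choice([-1, 1])

def laplace_mechanism(db, query, global_sensitivity, eps):
    """Release query(db) + Lap(GS_q / eps): eps-differentially private,
    with expected error GS_q / eps."""
    return query(db) + laplace(global_sensitivity / eps)

# Example: "how many records have tag 1?" is a counting query, so GS_q = 1.
db = [("p%d" % i, i % 3 == 0) for i in range(100)]
print(laplace_mechanism(db, lambda d: sum(t for _, t in d), 1, eps=0.1))
```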

60 Projects. Report on a paper; apply a notion studied in class to some known domain; check the state of privacy in some setting. Suggested topics: privacy in GWAS, privacy in crowdsourcing, privacy-preserving Wordle, unique identification bounds, how much worse estimation becomes under differential privacy guarantees, contextual privacy.

61 Official Description. The availability of fast and cheap computers coupled with massive storage devices has enabled the collection and mining of data on a scale previously unimaginable. This opens the door to potential abuse regarding individuals' information. There has been considerable research exploring the tension between utility and privacy in this context. The goal is to explore techniques and issues related to data privacy, in particular: definitions of data privacy; techniques for achieving privacy; limitations on privacy in various settings; privacy issues in specific settings.

62 Planned Topics. Privacy of data analysis: differential privacy (definition and properties, statistical databases, dynamic data), privacy of learning algorithms, privacy of genomic data. Interaction with cryptography: SFE, voting, entropic security, data structures, everlasting security. Privacy-enhancing technologies: mix nets.

63 Course Information. Foundations of Privacy - Spring 2010. Instructor: Moni Naor. When: Mondays, 11:00-13:00 (2 points). Where: Ziskind 1. Course web page: www.wisdom.weizmann.ac.il/~naor/COURSE/foundations_of_privacy.html. Prerequisites: familiarity with algorithms, data structures, probability theory, and linear algebra, at an undergraduate level; a basic course in computability is assumed. Requirements: participation in discussion in class (best: read the papers ahead of time); homework: there will be several homework assignments, which should be turned in on time (usually two weeks after they are given); class project and presentation; exam: none planned. Office: Ziskind 248. Phone: 3701. E-mail: moni.naor@

