Foundations of Privacy Lecture 11 Lecturer: Moni Naor
Recap of the previous lecture
Continually changing data
– Counters
– How to combine expert advice
– Multi-counter and the list update problem
Pan privacy
General transformation to continual output
The Dynamic Privacy Zoo Differentially Private Outputs Privacy under Continual Observation Pan Privacy User level Privacy Continual Pan Privacy Petting Sketch vs. Stream
Sanitization Can't Be Too Accurate
Usual counting queries
– Query: $q \subseteq [n]$
– Answer: $\sum_{i \in q} d_i$; Response = Answer + noise
Blatant non-privacy: the adversary correctly guesses 99% of the bits.
Theorem: If all responses are within $o(n)$ of the true answer, then the algorithm is blatantly non-private.
But: the attack requires an exponential number of queries.
Proof: Exponential Adversary
Focus on the column containing the super-private bit: "the database" $d$.
Assume all answers are within error bound $\mathcal{E}$.
Will show that $\mathcal{E}$ cannot be $o(n)$.
Proof: Exponential Adversary
Estimate the number of 1's in all possible sets
– $\forall S \subseteq [n]$: $|K(S) - \sum_{i \in S} d_i| \le \mathcal{E}$, where $K(S)$ is the response on $S$ and $\sum_{i \in S} d_i$ is the real answer on $S$
Weed out "distant" DBs
– For each possible candidate database $c$: if for any $S \subseteq [n]$: $|\sum_{i \in S} c_i - K(S)| > \mathcal{E}$, then rule out $c$
– If $c$ is not ruled out, halt and output $c$
Claim: the real database $d$ won't be ruled out.
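Proof: Exponential Adversary
Suppose: $\forall S \subseteq [n]$: $|K(S) - \sum_{i \in S} d_i| \le \mathcal{E}$.
Claim: for any $c$ that has not been ruled out, Hamming distance$(c,d) \le 4\mathcal{E}$.
Let $S_0 = \{i : d_i = 0\}$ and $S_1 = \{i : d_i = 1\}$.
– $|K(S_0) - \sum_{i \in S_0} c_i| \le \mathcal{E}$ ($c$ not ruled out), so by the triangle inequality $|\sum_{i \in S_0} c_i - \sum_{i \in S_0} d_i| \le 2\mathcal{E}$: $c$ and $d$ differ in at most $2\mathcal{E}$ positions of $S_0$
– $|K(S_1) - \sum_{i \in S_1} c_i| \le \mathcal{E}$ ($c$ not ruled out), so likewise $c$ and $d$ differ in at most $2\mathcal{E}$ positions of $S_1$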
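The exponential adversary can be sketched in a few lines for a toy database. This is only an illustration under assumptions not in the slides: subsets are encoded as bit masks, `respond` stands for the curator's noisy answer K(S), `err` is the error bound on every response, and all function names are hypothetical.

```python
from itertools import product

def subset_sum(bits, mask):
    """Sum of the bits whose index is selected by the bit mask."""
    return sum(b for i, b in enumerate(bits) if (mask >> i) & 1)

def exponential_attack(n, respond, err):
    """Query every subset of [n], then output the first candidate database
    that is within err of every response. Time is exponential in n."""
    responses = [respond(mask) for mask in range(1 << n)]
    for cand in product((0, 1), repeat=n):
        if all(abs(subset_sum(cand, m) - responses[m]) <= err
               for m in range(1 << n)):
            return cand  # not ruled out by any subset
    return None  # cannot happen if all responses are within err

def hamming(c, d):
    """Number of positions where c and d differ."""
    return sum(x != y for x, y in zip(c, d))
```

Because the real database is never ruled out, the search always halts with some candidate, and any candidate it outputs differs from the real database in at most 4·err positions (at most 2·err among the zero positions and 2·err among the one positions).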
Contradiction?
We have seen algorithms that answer each query with accuracy $o(n)$
– $O(\sqrt{n})$ and $O(n^{2/3})$
Why is there no contradiction with the current results?
What can we do efficiently?
We allowed "too" much power to the adversary
– Number of queries
– Computation
On the other hand: we assumed there are no wild errors in the responses.
Theorem: For any sanitization algorithm: if all responses are within $o(\sqrt{n})$ of the true answer, then it is blatantly non-private even against a polynomial-time adversary making $O(n \log^2 n)$ random queries.
Next: show the adversary.
The model
As before: the database $d$ is a bit string of length $n$.
Users query for subset sums:
– A query is a subset $q \subseteq \{1, \dots, n\}$
– The (exact) answer is $a_q = \sum_{i \in q} d_i$
$\mathcal{E}$-perturbation
– for an answer: $a_q \pm \mathcal{E}$
Privacy requires $\Omega(\sqrt{n})$ perturbation
Consider a database with perturbation $\mathcal{E} = o(\sqrt{n})$.
The adversary makes $t = n \log^2 n$ random queries $q_j$, getting noisy answers $a_j$.
Privacy-violating algorithm: construct a database $c = \{c_i\}_{1 \le i \le n}$ by solving the linear program
– $0 \le c_i \le 1$ for $1 \le i \le n$
– $a_j - \mathcal{E} \le \sum_{i \in q_j} c_i \le a_j + \mathcal{E}$ for $1 \le j \le t$
Round the solution
– if $c_i > 1/2$ set it to 1, and to 0 otherwise
A solution must exist: $d$ itself.
For every query $q_j$: its answer according to $c$ is at most $2\mathcal{E}$ far from its (real) answer in $d$.
Bad solutions to the LP do not survive
A query disqualifies a potential database $c$ if its answer for the query is more than $2\mathcal{E} + 1$ far from its real answer in $d$.
Idea: show that for a database $c$ that is far away from $d$, a random query disqualifies $c$ with some constant probability $\gamma$.
Want to use the union bound: all far-away solutions are disqualified w.p. at least $1 - n^n (1 - \gamma)^t = 1 - \mathrm{neg}(n)$.
How do we limit the solution space? Round each value to the closest multiple of $1/n$.
Privacy requires $\Omega(\sqrt{n})$ perturbation
A query disqualifies a potential database $c$ if its answer for the query is more than $2\mathcal{E} + 1$ far from its real answer in $d$.
Claim: a random query disqualifies a database $c$ that is far away from $d$ with some constant probability $\gamma$.
Therefore: $t = n \log^2 n$ queries leave a negligible probability for each far reconstruction.
Union bound: all far-away suggestions are disqualified w.p. at least $1 - n^n (1 - \gamma)^t = 1 - \mathrm{neg}(n)$.
Can apply the union bound by discretization.
Count the number of entries far from $d$.
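As a rough check of the union bound, the logarithm of the failure term n^n (1 - γ)^t can be evaluated numerically for t = n log² n. The sketch below assumes base-2 logarithms and writes the constant disqualification probability as `gamma`; the function name and the value 0.1 used in the note are illustrative choices, not values from the lecture.

```python
import math

def log2_failure_bound(n, gamma):
    """log2 of n^n * (1 - gamma)^t for t = n * (log2 n)^2 random queries.
    n^n counts the discretized candidates (entries rounded to multiples of 1/n)."""
    t = n * math.log2(n) ** 2
    return n * math.log2(n) + t * math.log2(1 - gamma)
```

For n = 1024 and gamma = 0.1 the value is roughly -5300, so the probability that some far-away candidate survives is at most about 2^-5300. The first term grows like n log n while the second shrinks like -n log² n, which is why the bound becomes negligible as n grows.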
Review and Conclusion
When the perturbation is $o(\sqrt{n})$, choosing $\tilde{O}(n)$ random queries gives enough information to efficiently reconstruct an $o(n)$-close db.
The database is reconstructed using linear programming: polynomial time.
Databases with $o(\sqrt{n})$ perturbation are blatantly non-private: they are reconstructable in poly(n) time.
$\Omega(\sqrt{n})$ lower bound revisited
An attack on an $o(\sqrt{n})$-perturbation database with substantially better performance
Previous attack: uses $n \log^2 n$ queries and runs in $n^5 \log^4 n$ time (LP)
New attack: issues $n$ queries and runs in $O(n \log n)$ time
New attack is deterministic
– Fixed set of queries for each size
– Not necessarily an advantage: must ask certain queries
The Fourier Attack
Treat the database $d$ as a function $\mathbb{Z}_2^{\log n} \to \mathbb{Z}_2$ (the vector defines a function).
Query specific subset sums from which the Fourier coefficients of the function can be calculated
– One for each Fourier coefficient
Round the reconstructed function's values to bits.
When the sums have $o(\sqrt{n})$ error, so do the coefficients
– the reconstruction can be shown to have $o(n)$ error
The Fourier transform can be computed in time $O(n \log n)$.
Key point: linearity of the Fourier transform implies that a small error in the coefficients also means a small error in the function.
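Fourier Transform
The characters of $\mathbb{Z}_2^k$: homomorphisms into $\{-1, 1\}$
There are $2^k$ characters, one for each $a = (a_1, a_2, \dots, a_k) \in \mathbb{Z}_2^k$: $\chi_a(x) = (-1)^{\sum_{i=1}^{k} a_i x_i}$
For a function $f: \mathbb{Z}_2^k \to \mathbb{R}$ with $k = \log n$, the Fourier coefficients are $\hat{f}(a) = \sum_x \chi_a(x) f(x)$
We have: $f(x) = \frac{1}{2^k} \sum_a \chi_a(x) \hat{f}(a)$
$H$ = the $2^k \times 2^k$ Hadamard matrix, $H_{a,b} = \chi_a(b)$
– $H H = 2^k I$
– $\hat{f} = H f$
– $f = \frac{1}{2^k} H \hat{f}$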
Parseval's Identity
Relates the absolute values of $f$ to the absolute values of the Fourier coefficients of $f$:
$\sum_{x \in \mathbb{Z}_2^k} |f(x)|^2 = \frac{1}{2^k} \sum_{a \in \mathbb{Z}_2^k} |\hat{f}(a)|^2$
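The Hadamard-matrix identities and Parseval's identity can be checked numerically with a fast Walsh-Hadamard transform. The sketch below is a standard in-place butterfly, not code from the lecture; it assumes the input length is a power of two and that index a encodes the vector (a_1, ..., a_k) as a bit mask, so the matrix entry is (-1) raised to the number of common 1 bits of a and b. The names `fwht` and `character` are illustrative.

```python
def fwht(v):
    """Return H v, where H is the 2^k x 2^k Hadamard matrix
    H[a][b] = (-1)^<a,b>. Runs in O(n log n) for n = len(v) = 2^k."""
    v = list(v)
    h = 1
    while h < len(v):
        for start in range(0, len(v), 2 * h):
            for i in range(start, start + h):
                x, y = v[i], v[i + h]
                v[i], v[i + h] = x + y, x - y  # butterfly step
        h *= 2
    return v

def character(a, x):
    """chi_a(x) = (-1)^<a,x>, with a and x given as bit masks."""
    return -1 if bin(a & x).count("1") % 2 else 1
```

Applying `fwht` twice multiplies the input by n = 2^k (the identity H H = 2^k I), and the sum of squares of the output equals n times the sum of squares of the input (Parseval's identity with the unnormalized coefficients used on these slides).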
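Evaluating Fourier Coefficients with Counting Queries
Let $\sigma_0 = \sum_x f(x)$.
For $a = (a_1, a_2, \dots, a_k)$ let $S_a = \{x \mid \langle a, x \rangle = 0 \bmod 2\}$; $|S_a| = 2^{k-1}$ for $a \ne 0$.
$\hat{f}(a) = 2 \sum_{x \in S_a} f(x) - \sigma_0$
An approximation of the counting query on $S_a$ yields an approximation of $\hat{f}(a)$ with a related error term.
$f = \frac{1}{2^k} H \hat{f} \Rightarrow \frac{1}{2^k} H (\hat{f} + e) = f + \frac{1}{2^k} H e$
$e = (e_1, e_2, \dots, e_n)$: error vector of the Fourier coefficients.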
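$f = \frac{1}{2^k} H \hat{f} \Rightarrow \frac{1}{2^k} H (\hat{f} + e) = f + \frac{1}{2^k} H e$, where $e = (e_1, e_2, \dots, e_n)$ is the error vector of the Fourier coefficients and $n = 2^k$.
If $\frac{1}{2^k} H e$ has $\Omega(n)$ entries that are $\ge 1/2$ in absolute value, then by Parseval's identity, $\sum_{x \in \mathbb{Z}_2^k} |f(x)|^2 = \frac{1}{2^k} \sum_{a \in \mathbb{Z}_2^k} |\hat{f}(a)|^2$, applied to the function $\frac{1}{2^k} H e$ (whose Fourier coefficients are $e$): $\frac{1}{2^k} \sum_{a \in \mathbb{Z}_2^k} |e_a|^2$ is $\Omega(n)$.
Hence: at least one $|e_a|$ is $\Omega(\sqrt{n})$, contradicting the assumption on accuracy.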
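The whole Fourier attack fits in a short sketch: query the n sets S_a, turn the noisy sums into estimated Fourier coefficients, invert with a fast Walsh-Hadamard transform, and round. Everything named here is an assumption for illustration: `make_curator` is a hypothetical curator adding uniform noise in [-err, err], indices encode vectors of Z_2^k as bit masks, and n must be a power of two.

```python
import random

def fwht(v):
    """Return H v for the Hadamard matrix H[a][b] = (-1)^<a,b>."""
    v = list(v)
    h = 1
    while h < len(v):
        for start in range(0, len(v), 2 * h):
            for i in range(start, start + h):
                x, y = v[i], v[i + h]
                v[i], v[i + h] = x + y, x - y
        h *= 2
    return v

def make_curator(db, err, seed=0):
    """Hypothetical curator: answers the sum of db over
    S_a = {x : <a,x> = 0 mod 2} plus noise in [-err, err]."""
    rng = random.Random(seed)
    n = len(db)
    def respond(a):
        exact = sum(db[x] for x in range(n)
                    if bin(a & x).count("1") % 2 == 0)
        return exact + rng.uniform(-err, err)
    return respond

def fourier_attack(n, respond):
    """Reconstruct an n-bit database from n subset-sum queries."""
    r = [respond(a) for a in range(n)]        # one query per coefficient
    sigma0 = r[0]                             # S_0 is the whole domain
    coeffs = [2 * r[a] - sigma0 for a in range(n)]
    recon = [c / n for c in fwht(coeffs)]     # f = (1/2^k) H f_hat
    return [1 if v > 0.5 else 0 for v in recon]
```

With per-query noise at most err, each estimated coefficient is off by at most 3·err, so by Parseval's identity at most 36·err² reconstructed bits can be wrong, independent of n. That is the o(n) reconstruction error whenever err is o(√n).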
Changing the Model: weighted counting
Previous attacks assume all query responses are within some small perturbation.
New model:
– To up to a $\tfrac{1}{2} - \varepsilon$ fraction of the queries, unbounded noise is added
– To the rest, "small" noise bounded by $\delta$ is added
Stronger query model: subset sums are weighted with weights $0, \dots, p-1$ for some prime $p = \Omega(1/\varepsilon^2 + \delta/\varepsilon)$
– Cannot "hide" single bits: all the weight might be there
– Want some randomness in the queries; otherwise repetition
Interpolation attack
Treat the database as a linear form of $n$ variables over $\mathbb{Z}_p$.
Treat a query $q = (q_1, \dots, q_n)$ as the evaluation of the form at a point: $f(q_1, \dots, q_n) = \sum_{i=1}^{n} d_i q_i \bmod p$
– An answer to the query $q = ((p-1)/2, 0, \dots, 0)$ that is within $(p-1)/4$ error tells us the first db bit
– Similarly for all other bits
No point in asking the query directly: these useful queries might have unbounded noise.
Need to deduce an (approximate) answer to $q$ from other queries.
By dropping info.
Interpolation attack - implementation
Want to evaluate a specific query $q$ with small error.
Pick a random degree-2 curve that passes through $q$ and issue queries for the $p$ points on the curve.
Key issue: the points on the curve are pairwise independent.
Therefore: for sufficiently many queries, with high probability interpolation gives a correct (up to small noise) answer for $q$.
Can try exhaustively all degree-2 polynomials (similar to Reed-Muller decoding).
Interpolation attack …
Interpolation is implemented by searching all $p^3$ degree-2 polynomials for one that is $\delta$-close on all but a $\tfrac{1}{2} - \varepsilon$ fraction of the entries
– the restriction of a linear form to a degree-2 curve is a degree-2 polynomial
Any two such polynomials must be $2\delta$-close, due to the low degree.
Hence the accuracy of the reconstructed answer to the query is $2\delta$.
For $(p-1)/4 > 2\delta$: can figure out any specific database bit with high probability.
Interpolation Attack: evaluating a query accurately
DB: $f(q_1, \dots, q_n) = \sum_{i=1}^{n} d_i q_i$ ($\mathbb{Z}_p^n \to \mathbb{Z}_p$)
Pick a curve: for two random points $u_1, u_2$ in $\mathbb{Z}_p^n$: $c(t) = q + u_1 t + u_2 t^2$ ($\mathbb{Z}_p \to \mathbb{Z}_p^n$)
Restriction of $f$ to $c$: $f|_c(t) = f(c(t))$; this is a degree-2 polynomial ($\mathbb{Z}_p \to \mathbb{Z}_p$)
Query all $p$ points of $c$ to get evaluations of $f|_c$
– the answers are inaccurate
Interpolate to find $f|_c$ up to a small error
Evaluate $f|_c(0) = f(q)$ accurately
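The steps above can be sketched as a toy implementation. It makes several assumptions that are not in the slides: `ToyCurator` is a hypothetical curator that gives an arbitrary answer to every fourth query and adds integer noise in [-delta, delta] to the rest, closeness is measured as cyclic distance in Z_p, and the search simply keeps the degree-2 polynomial that is delta-close to the most answers. All names are illustrative.

```python
import random

def cyc_dist(a, b, p):
    """Distance between a and b on the cycle Z_p."""
    d = (a - b) % p
    return min(d, p - d)

class ToyCurator:
    """Hypothetical curator answering weighted subset sums mod p."""
    def __init__(self, db, p, delta, seed=0):
        self.db, self.p, self.delta = db, p, delta
        self.rng = random.Random(seed)
        self.count = 0

    def answer(self, q):
        true = sum(d * w for d, w in zip(self.db, q)) % self.p
        self.count += 1
        if self.count % 4 == 0:              # unbounded (arbitrary) noise
            return self.rng.randrange(self.p)
        return (true + self.rng.randint(-self.delta, self.delta)) % self.p

def evaluate_query(curator, q, p, delta, rng):
    """Estimate f(q) from the p points of a random degree-2 curve through q."""
    n = len(q)
    u1 = [rng.randrange(p) for _ in range(n)]
    u2 = [rng.randrange(p) for _ in range(n)]
    answers = [curator.answer([(q[i] + u1[i] * t + u2[i] * t * t) % p
                               for i in range(n)]) for t in range(p)]
    best, best_score = 0, -1
    for a in range(p):                       # try all p^3 polynomials
        for b in range(p):
            for c in range(p):
                score = sum(1 for t in range(p)
                            if cyc_dist(a + b * t + c * t * t,
                                        answers[t], p) <= delta)
                if score > best_score:
                    best, best_score = a, score
    return best                              # value of the polynomial at t = 0

def recover_bit(curator, i, n, p, delta, rng):
    """Recover database bit i via the query ((p-1)/2) * e_i."""
    q = [0] * n
    q[i] = (p - 1) // 2
    v = evaluate_query(curator, q, p, delta, rng)
    return 1 if cyc_dist(v, (p - 1) // 2, p) < cyc_dist(v, 0, p) else 0
```

The running time per bit is about p^4 basic steps (p^3 polynomials checked at p points). In this toy setting with p = 23 and delta = 1, at least 17 of the 23 answers on any curve carry small noise, which is enough to force the winning polynomial to be within 2·delta of the true restriction at every point, including t = 0.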
Interpolation attack - performance
Time for finding any specific bit: $O(p^4) = O(\varepsilon^{-8})$
Independent of the db size $n$? (What about querying time? $|q| = \Theta(n)$)
– Can be used with very large databases if the interesting part is small
Time to construct the whole db with small error: $O(n)$ with $pn$ queries (or $O(n^2)$)
Summary
$\Omega(\sqrt{n})$ perturbation lower bound revisited: a simple and efficient attack
When queries allow sufficiently large weights, an adversary can:
– Handle unbounded noise on a large portion of the queries
– Find out private data in time independent of the size of the DB