Presentation is loading. Please wait.

Presentation is loading. Please wait.

Computational Complexity & Differential Privacy Salil Vadhan Harvard University Joint works with Cynthia Dwork, Kunal Talwar, Andrew McGregor, Ilya Mironov,

Similar presentations


Presentation on theme: "Computational Complexity & Differential Privacy Salil Vadhan Harvard University Joint works with Cynthia Dwork, Kunal Talwar, Andrew McGregor, Ilya Mironov,"— Presentation transcript:

1 Computational Complexity & Differential Privacy Salil Vadhan Harvard University Joint works with Cynthia Dwork, Kunal Talwar, Andrew McGregor, Ilya Mironov, Moni Naor, Omkant Pandey, Toni Pitassi, Omer Reingold, Guy Rothblum, Jon Ullman TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AA A A A

2 Data Privacy: The Problem Given a dataset with sensitive information, such as: Health records Census data Search engine logs How can we: Compute and release useful functions of the dataset (utility) Without compromising info about individuals (privacy)?

3 Data Privacy: The Challenge Traditional approach (e.g. in HIPAA): “anonymize” by removing “personally identifying information (PII)” Many supposedly anonymized datasets have been subject to reidentification: – Gov. Weld’s medical record re-identified by linkage with voter records [Swe97]. – Netflix Challenge database reidentified by linkage with IMDb [NS08] – AOL search users reidentified by contents of their queries [BZ06] – …

4 Differential Privacy [DN03,DN04,BDMN05,DMNS06] A strong, new notion of privacy that: Is robust to auxiliary information possessed by an adversary Degrades gracefully under repetition/composition Allows for many useful computations

5 Differential Privacy: Definition Def: A randomized algorithm C : X n  Y is (  ) differentially private if for every two databases D 1, D 2 that differ on one row, and every set T  Y, Pr[C(D 1 )  T]  exp(  )  Pr[C(D 2 )  T] +   Pr[C(D 2  T] Think of  as a small constant, e.g.  =.01,  as cryptographically small, e.g.  = Database D  X n C(D) C curator

6 Differential Privacy: Interpretations Def: A randomized algorithm C : X n  Y is (  ) differentially private if for every two databases D 1, D 2 that differ on one row, and every set T  Y, Pr[C(D 1 )  T].  Pr[C(D 2  T] My data affects what an adversary sees by at most . Whatever an adversary learns about me, it could have learned from everyone else’s data. Above interpretations hold regardless of adversary’s auxiliary information. Composes gracefully (k repetitions ) k  differentially private) But no protection for information that is not localized to a few rows.

7 D = (x 1,…,x n )  X n Goal: given ¼ : X ! {0,1} estimate counting query  (D):=(  i  (x i ))/n within error   Example: X = {0,1} d  = conjunction on  k variables Counting query = k-way marginal e.g. What fraction of people in D smoke and have cancer? Differential Privacy: Example >35Smoker?Cancer?

8 D = (x 1,…,x n )  X n Goal: estimate (fractional) counting query  (D):=(  i  (x i ))/n within error   Solution: C(D) =  (D) + Lap(1/ ² n), where Lap( ¾ ) has density / exp(- ¾ |x|) ) error · ® provided n & 1/( ²® ) For k queries, suffices to have n & k/( ²® ) or even n & k 1/2 /( ²® ) ) Differential Privacy: Example

9 Other Differentially Private Algorithms histograms [DMNS06] contingency tables [BCDKMT07, GHRU11], machine learning [BDMN05,KLNRS08], logistic regression [CM08] clustering [BDMN05] social network analysis [HLMJ09] approximation algorithms [GLMRT10] …

10 Computational Complexity When do computational resource constraints change what is possible? Examples: Computational Learning Theory [Valiant `84]: small VC dimension  learnable with efficient algorithms (bad news) Cryptography [Diffie & Hellman `76]: don’t need long shared secrets against a computationally bounded adversary (good news)

11 This talk: Computational Complexity in Differential Privacy Q: Do computational resource constraints change what is possible? Computationally bounded curator – Makes differential privacy harder – Differentially private & accurate synthetic data infeasible to construct – Open: release other types of summaries/models? Computationally bounded adversary – Makes differential privacy easier – Provable gain in accuracy for 2-party protocols (e.g. for estimating Hamming distance)

12 A More Ambitious Goal: Noninteractive Data Release Original Database DSanitization C(D) C Goal: From C(D), can answer many questions about D, e.g. all counting queries associated with a large family of predicates P = { ¼ : X ! {0,1}}

13 Noninteractive Data Release: Desidarata ( ,  )-differential privacy: for every D 1, D 2 that differ in one row and every set T, Pr[C(D 1 )  T]  exp(  )  Pr[C(D 2 )  T]+  with  negligible Utility: C(D) allows answering many questions about D Computational efficiency: C is polynomial-time computable.

14 D = (x 1,…,x n )  X n P = {  : X  {0,1}} For any  P, want to estimate (from C(D)) counting query  (D):=(  i  (x i ))/n within accuracy error   Example: X = {0,1} d P = {conjunctions on  k variables} Counting query = k-way marginal e.g. What fraction of people in D smoke and have cancer? Utility: Counting Queries >35Smoker?Cancer?

15 Form of Output Ideal: C(D) is a synthetic dataset –   P |  (C(D))-  (D)|   – Values consistent – Use existing software Alternatives? – Explicit list of |P| answers (e.g. contingency table) – Median of several synthetic datasets [RR10] – Program M s.t.   P |M(  )-  (D)|   >35Smoker?Cancer?

16 Positive Results minimum database sizecomputational complexity referencegeneral Pk-way marginalssyntheticgeneral Pk-way marginals [DN03,DN04, BDMN05] O(|P| 1/2 /  O(d k/2 /  ) N D = (x 1,…,x n )  ({0,1} d ) n P = {  : {0,1} d  {0,1}}  (D):=(1/n)  i  (x i )  accuracy error  = privacy

17 Positive Results minimum database sizecomputational complexity referencegeneral Pk-way marginalssyntheticgeneral Pk-way marginals [DN03,DN04, BDMN05] O(|P| 1/2 /  O(d k/2 /  ) Npoly(n,|P|)poly(n,d k ) D = (x 1,…,x n )  ({0,1} d ) n P = {  : {0,1} d  {0,1}}  (D):=(1/n)  i  (x i )  accuracy error  = privacy

18 Positive Results minimum database sizecomputational complexity referencegeneral Pk-way marginalssyntheticgeneral Pk-way marginals [DN03,DN04, BDMN05] O(|P| 1/2 /  O(d k/2 /  ) Npoly(n,|P|)poly(n,d k ) [BDCKMT07] Õ((2d) k /  ) Ypoly(n,2 d ) D = (x 1,…,x n )  ({0,1} d ) n P = {  : {0,1} d  {0,1}}  (D):=(1/n)  i  (x i )  accuracy error  = privacy

19 Positive Results minimum database sizecomputational complexity referencegeneral Pk-way marginalssyntheticgeneral Pk-way marginals [DN03,DN04, BDMN05] O(|P| 1/2 /  O(d k/2 /  ) Npoly(n,|P|)poly(n,d k ) [BDCKMT07] Õ((2d) k /  ) Ypoly(n,2 d ) [BLR08] O(d  log|P|/    )Õ(dk/    ) Y D = (x 1,…,x n )  ({0,1} d ) n P = {  : {0,1} d  {0,1}}  (D):=(1/n)  i  (x i )  accuracy error  = privacy

20 Positive Results minimum database sizecomputational complexity referencegeneral Pk-way marginalssyntheticgeneral Pk-way marginals [DN03,DN04, BDMN05] O(|P| 1/2 /  O(d k/2 /  ) Npoly(n,|P|)poly(n,d k ) [BDCKMT07] Õ((2d) k /  ) Ypoly(n,2 d ) [BLR08] O(d  log|P|/    )Õ(dk/    ) Yqpoly(n,|P|,2 d )qpoly(n,2 d ) D = (x 1,…,x n )  ({0,1} d ) n P = {  : {0,1} d  {0,1}}  (D):=(1/n)  i  (x i )  accuracy error  = privacy

21 Positive Results minimum database sizecomputational complexity referencegeneral Pk-way marginalssyntheticgeneral Pk-way marginals [DN03,DN04, BDMN05] O(|P| 1/2 /  O(d k/2 /  ) Npoly(n,|P|)poly(n,d k ) [BDCKMT07] Õ((2d) k /  ) Ypoly(n,2 d ) [BLR08] O(d  log|P|/    )Õ(dk/    ) Yqpoly(n,|P|,2 d )qpoly(n,2 d ) [DNRRV09, DRV10, HR10] O(d  log 2 |P|/     )Õ(dk 2 /     ) Ypoly(n,|P|,2 d ) D = (x 1,…,x n )  ({0,1} d ) n P = {  : {0,1} d  {0,1}}  (D):=(1/n)  i  (x i )  error in accuracy  = privacy Summary: Can construct synthetic databases accurate on huge families of counting queries, but complexity may be exponential in dimensions of data and query set P. Question: is the complexity inherent?

22 Our Result: Intractability of Synthetic Data Informally: Producing accurate & differentially private synthetic data is as hard as breaking cryptography (e.g. factoring large integers). Inherently exponential in dimensionality of data (and in dimensionality of queries).

23 Our Result: Intractability of Synthetic Data Thm [UV11, building on DNRRV10]: Under standard crypto assumptions (OWF), there is no n=poly(d) and curator that: – Produces synthetic databases. – Is differentially private. – Runs in time poly(n,d). – Achieves error  =.01 for 2-way marginals. Proof overview: 1.Use digital signatures to show hardness for a complicated (but efficiently computable) family of counting queries [DNRRV10] 2.Use PCPs to reduce complicated queries to simple ones [UV11].

24 Tool 1: Digital Signature Schemes A digital signature scheme consists of 3 poly-time algorithms (Gen,Sign,Ver): On security parameter d, Gen(d) = (SK,PK)  {0,1} d  {0,1} d On m  {0,1} d, can compute  =Sign SK (m)  {0,1} d s.t. Ver PK (m,  )=1 Given many (m,  ) pairs, infeasible to generate new (m’,  ’) satisfying Ver PK Thm [NY89,Rom90]: Digital signature schemes exist iff one-way functions exist.

25 Hard-to-Sanitize Databases I [DNRRV10] Ver PK (D) = 1 m1m1 Sign SK (m 1 ) m2m2 Sign SK (m 2 ) m3m3 Sign SK (m 3 )  mnmn Sign SK (m n ) Generate random (PK,SK)  Gen(d), m 1, m 2,…, m n  {0,1} d D m’ 1 11 m’ 2 22  m’ k kk curator C(D) Case 1: m’ i  D  Forgery! Case 2: m’ i  D  Reidentification! If error · ® wrt Ver PK : Ver PK (C(D))  1-  > 0  i Ver PK (m’ i,  i )=1

26 Tool 2: Probabilistically Checkable Proofs The PCP Theorem:  efficient algorithms (Red,Enc,Dec) s.t. w s.t. V(w)=1 Circuit V of size poly(d) Enc Red Set of 3-clauses on d’=poly(d) vars  V ={x 1  x 5   x 7,  x 1  v 5  x d’,…} z  {0,1} d’ satisfying all of  V z’  {0,1} d’ satisfying.99 fraction of  V Decw’ s.t. V(w’)=1

27 Hard-to-Sanitize Databases II [UV11] Let  PK = Red(Ver PK ) Each clause in  PK is satisfied by all z i m1m1 Sign SK (m 1 ) m2m2 Sign SK (m 2 ) m3m3 Sign SK (m 3 )  mnmn Sign SK (m n ) D z’ 1 z’ 2  z’ k curator C(D) Case 1: m’ i  D  Forgery! Case 2: m’ i  D  Reidentification! If error ·.01 wrt 3-way marginals: Each clause in  PK is satisfied by .99 fraction of z’ i  i s.t. z’ i satisfies  1-  fraction of clauses Dec(z’ i ) = valid (m’ i,  i ) z1z1 z2z2 z3z3  znzn Enc Ver PK

28 Conclusions Producing private, synthetic databases that preserve simple statistics requires computation exponential in the dimension of the data. How to bypass? Average-case accuracy: Heuristics that don’t give good accuracy on all databases, only those from some class of models. Non-synthetic data: – Hardness extends to some generalizations of synthetic data (e.g. medians of synthetic data). – Thm [DNRRV09]: For unnatural but efficiently computable P  efficient curators “iff”  efficient “traitor-tracing” schemes – But for natural P (e.g. P={all marginals}), wide open!

29 Computational Differential Privacy Differential privacy protects even against adversaries with unlimited computational power. Can we gain by restricting to adversaries with bounded (but still huge) computational power? – Better accuracy/utility? – Enormous success in cryptography from considering computationally bounded adversaries.

30 Computational Differential Privacy: state of affairs Implicit in works on distributed implementations of differential privacy [DKMMN06,BKO08] Formal definitions in [MPRV09], related to “dense model theorems” in additive combinatorics & pseudorandomness. Provable gains in accuracy for 2-party protocols – Computational differential privacy has error O (1/ ² ) [MPRV09] – 2-party differential privacy requires error £ ~(n 1/2 ) [MMPRTV10] – Info-theoretic 2-party DP related to communication complexity, randomness extraction. [MMPRTV10] Little gain for client/server setting [GKK11]

31 2-Party Privacy 2-party (& multiparty) privacy: each party has a sensitive dataset, want to do a joint computation f(D A,D B ) m1m1 m2m2 m3m3 m k-1 mkmk DADA x1x1 x2x2  xnxn DBDB y1y1 y2y2  ymym Z A  f(D A,D B ) Z B  f(D A,D B ) A’s view should be a (computational) differentially private function of D B (even if A deviates from protocol), and vice-versa

32 Conclusions Computational complexity is relevant to differential privacy. Bad news: producing synthetic data is intractable Good news: better protocols against bounded adversaries Interaction with differential privacy likely to benefit complexity theory too.


Download ppt "Computational Complexity & Differential Privacy Salil Vadhan Harvard University Joint works with Cynthia Dwork, Kunal Talwar, Andrew McGregor, Ilya Mironov,"

Similar presentations


Ads by Google