
1 Raef Bassily, Computer Science & Engineering, Pennsylvania State University. New Tools for Privacy-Preserving Statistical Analysis. Yahoo! Labs, Sunnyvale, February 24, 2015.

2 Privacy and Big Data
"Big Data is transforming our world!" – just about everyone
"'Big Data' is the amassing of huge amounts of statistical information on social and economic trends and human behavior." – M. Chen, The Nation
Hasn't data always been big? Particle physics, astronomy, geology, …
This data contains sensitive information about individuals.
This work: How can we get the benefits of Big Data and provide rigorous privacy guarantees?

3 Privacy in Statistical Databases
[Diagram: individuals contribute data to a curator/agency; users (government, researchers, businesses, or a malicious adversary) send queries and receive answers.]
Two conflicting goals:
• Utility: users can extract "aggregate" statistics.
• Privacy: individual information remains hidden.
Challenge: achieve both! Not easy!

4 Ubiquity of information
[Diagram: individuals x1, …, xn contribute data to a curator; users (government, researchers, businesses, or a malicious adversary) send queries and receive answers; external sources of information include the internet, social networks, and anonymized datasets.]
External sources of information:
• Cannot assume we know or control them.
• Cannot ignore them.
Ad-hoc anonymization schemes are regularly broken.

5 Some Published Attacks
• Anonymized datasets [Narayanan, Shmatikov '08, …]
• Social networks [Backstrom, Dwork, Kleinberg '07, NS '09, …]
• Genetic data (GWAS) [Homer et al. '08, …]
• Microtargeted advertising [Korolova '11, …]
• Recommendation systems [Calandrino et al. '11, …]
• Combining independent anonymized releases [Ganta, Kasiviswanathan, Smith '08]
[Diagram: an attacker combines releases from Hospital A and Hospital B.]

6 Attack on Recommender Systems [Calandrino et al. '11]
[Diagram: Bob (the attacker) has the side information that Alice bought items A, B, and C. Initially the related-item lists are List(A), List(B), List(C). After Alice buys G, each of List(A), List(B), List(C) now includes G, so Bob can infer that Alice bought G.]

7 "Aggregate" not necessarily safe
Several aggregate statistics taken together can encode information about individual records:
• Average salary before and after a professor resigns reveals that professor's salary.
Reconstruction attacks: too many, too "accurate" statistics allow reconstructing the data [Dinur, Nissim '03; Dwork, McSherry, Talwar '07; Kasiviswanathan, Rudelson, Smith, Ullman '10; KRS '13, …].

8 This work
Gives efficient algorithms for statistical data analysis with optimal accuracy under rigorous, provable privacy guarantees.

9 This talk
1. Background: Differential Privacy
2. Differentially private algorithms for:
   • Convex Empirical Risk Minimization in the centralized model
   • Estimating Succinct Histograms in the local model
3. Generic framework for relaxing Differential Privacy


11 Differential privacy [Dwork-McSherry-Nissim-Smith '06, Dwork-Kenthapadi-McSherry-Mironov-Naor '06]
[Diagram: two worlds, one in which the curator holds the dataset (x1, x2, …, xn) and one in which x2 is replaced by x2′; in each, the curator runs algorithm A and the user/analyst sees the output.]
Differential privacy requires that these two settings "look the same" to the user.

12 Differential privacy [DMNS'06, DKMMN'06]
Datasets x and x′ are called neighbors if they differ in one record.
Requirement: neighboring datasets induce close distributions on outputs.
Def.: A randomized algorithm A (using its own local random coins) is (ε, δ)-differentially private if, for all neighboring datasets x, x′ and for all events S,
    Pr[A(x) ∈ S] ≤ e^ε · Pr[A(x′) ∈ S] + δ.
Two regimes: pure ε-differential privacy (δ = 0) and approximate (ε, δ)-differential privacy.
"Almost the same" conclusions will be reached from the output regardless of whether any individual opts into or opts out of the data set.
Worst-case definition: DP gives the same guarantee regardless of the side information of the attacker.
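As an illustration (not from the slides), a minimal Python sketch of the classic Laplace mechanism for a counting query: a count changes by at most 1 when one record changes, so adding Laplace(1/ε) noise gives pure ε-differential privacy. Function and variable names are illustrative.

import numpy as np

def dp_count(data, predicate, epsilon, rng=np.random.default_rng()):
    """Differentially private count of records satisfying `predicate`."""
    true_count = sum(1 for record in data if predicate(record))
    # Counting queries have sensitivity 1, so scale = sensitivity / epsilon.
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: how many salaries exceed 100k, with epsilon = 0.5
salaries = [45_000, 120_000, 98_000, 150_000, 60_000]
print(dp_count(salaries, lambda s: s > 100_000, epsilon=0.5))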

13 Two models for private data analysis
Centralized model: individuals send their data x1, …, xn to a trusted curator, which runs an algorithm A that is differentially private w.r.t. datasets of size n.
Local model: there is no trusted curator; each individual i locally randomizes their own data, sending yi = Qi(xi) to an untrusted curator, where each Qi is differentially private w.r.t. datasets of size 1.

14 This talk
1. Background: Differential Privacy
2. Differentially private algorithms for:
   • Convex Empirical Risk Minimization in the centralized model
   • Estimating Succinct Histograms in the local model
3. Generic framework for relaxing Differential Privacy

15 Example of Convex ERM: Support Vector Machines
Goal: classify data points of different "types" by finding a hyperplane that separates them (e.g., tested positive vs. tested negative).
Many applications, e.g., medical studies: disease classification based on protein structures.
The coefficient vector of the hyperplane is the solution of a convex optimization problem defined by the data set; it is given by a linear combination of only a few data points, called support vectors.

16 Convex empirical risk minimization
Dataset of n records; convex constraint set C; empirical risk function defined as the average of the per-record losses, where each per-record loss is convex in the parameter.

17 Convex empirical risk minimization
Same setup: dataset, convex constraint set C, and empirical risk function (convex in the parameter for every record).
Goal: find a "parameter" in C that minimizes the empirical risk (the actual minimizer).

18 Convex empirical risk minimization: excess risk
Same setup: dataset, convex constraint set C, and empirical risk function (convex in the parameter for every record).
Goal: find a "parameter" in C that minimizes the empirical risk.
Output a parameter whose excess risk, the gap between its empirical risk and that of the actual minimizer, is small.
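One standard way to write the objects on slides 16-18 (the slide formulas themselves were images that did not carry over, so the notation below is the usual convex-ERM notation rather than necessarily the slide's):

\[
\hat{L}(\theta; D) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; x_i),
\qquad
\theta^{*} = \arg\min_{\theta \in \mathcal{C}} \hat{L}(\theta; D),
\qquad
\text{excess risk of output } \hat{\theta} = \hat{L}(\hat{\theta}; D) - \hat{L}(\theta^{*}; D),
\]

where each per-record loss \(\ell(\cdot\,; x_i)\) is convex over the convex constraint set \(\mathcal{C}\).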

19 Other examples
• Median
• Linear regression

20 Why is privacy hard to maintain in ERM?
• Dual form of SVM: the solution typically contains a subset of the exact data points in the clear.
• Median: the minimizer is always a data point.

21 Private convex ERM
Studied by [Chaudhuri et al. '11, Rubinstein et al. '11, Kifer-Smith-Thakurta '12, Smith-Thakurta '13, …].
[Diagram: the dataset, convex set C, and risk function are fed to an (ε, δ)-differentially private algorithm A with its own random coins.]
Privacy: A is differentially private in its input dataset.
Utility is measured by the (worst-case) expected excess risk.

22 Contributions [B, Smith, Thakurta '14]
1. Efficient algorithms with optimal excess risk (plus a separate set of algorithms for strongly convex risk functions).
2. Matching lower bounds on the excess risk.
Best previous work [Chaudhuri et al. '11, Kifer et al. '12] addresses a special case (smooth functions); applying it to many problems (e.g., SVM, median, …) introduces large additional error. This work improves on the previous excess risk bounds.

23 Results (dataset size n, constraint set C; normalized bounds: the risk is 1-Lipschitz on a parameter set C of diameter 1)

Privacy | Excess risk | Technique
ε-DP | [see note below] | Exponential sampling (inspired by [McSherry-Talwar '07])
(ε, δ)-DP | [see note below] | Noisy stochastic gradient descent (rigorous analysis of & improvements to [McSherry-Williams '10], [Jain-Kothari-Thakurta '12] and [Chaudhuri-Sarwate-Song '13])
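The excess-risk entries above were formula images that did not carry over to this transcript. For reference, the bounds in the published version of [B, Smith, Thakurta '14] for this normalized setting (n records, dimension d, 1-Lipschitz losses, diameter-1 set C) are, ignoring logarithmic factors,

\[
\varepsilon\text{-DP:}\;\; O\!\Big(\frac{d}{\varepsilon n}\Big),
\qquad
(\varepsilon,\delta)\text{-DP:}\;\; O\!\Big(\frac{\sqrt{d\,\log(1/\delta)}}{\varepsilon n}\Big),
\]

with matching lower bounds as claimed on slide 22.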


25 Exponential sampling
Define a probability distribution over C whose density decays exponentially in the empirical risk, as in the exponential mechanism, and output a sample from C drawn according to it.
Contributions:
• Efficient sampling based on a rapidly mixing MCMC.
• Tight analysis exploiting the structure of convex functions.
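A toy Python sketch of exponential sampling over a discretized one-dimensional parameter set. The talk's algorithm samples from a continuous convex body via a rapidly mixing MCMC walk, so this grid-based version only illustrates the target distribution; the grid, loss, and sensitivity bound below are illustrative assumptions.

import numpy as np

def exponential_mechanism_erm(data, loss, grid, epsilon, sensitivity,
                              rng=np.random.default_rng()):
    """Sample theta from `grid` with probability proportional to
    exp(-epsilon * (total empirical loss at theta) / (2 * sensitivity))."""
    total_loss = np.array([sum(loss(theta, x) for x in data) for theta in grid])
    # `sensitivity` bounds how much the total loss can change when one record changes.
    logits = -epsilon * total_loss / (2.0 * sensitivity)
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(grid, p=probs)

# Example: a private (approximate) median of points in [0, 1]; per-example loss
# |theta - x| lies in [0, 1], so the total loss has sensitivity at most 1.
data = [0.2, 0.4, 0.45, 0.5, 0.9]
grid = np.linspace(0.0, 1.0, 101)
theta = exponential_mechanism_erm(data, lambda t, x: abs(t - x), grid,
                                  epsilon=1.0, sensitivity=1.0)
print(theta)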

26 Noisy stochastic gradient descent
Run SGD with noisy gradient queries for sufficiently many iterations.
Contributions:
• Stochasticity (sampling) yields privacy amplification.
• Tight analysis.
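A minimal Python sketch of noisy stochastic gradient descent. The step size, iteration count, and noise calibration below are placeholders rather than the settings from the paper, which calibrates Gaussian noise using privacy amplification by sampling and a tight composition analysis.

import numpy as np

def project_to_ball(theta, radius=1.0):
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def noisy_sgd(data, grad_loss, dim, epsilon, delta, steps, lr, lipschitz=1.0,
              rng=np.random.default_rng()):
    n = len(data)
    # Illustrative noise scale only; the real calibration comes from the privacy analysis.
    sigma = lipschitz * np.sqrt(steps * np.log(1.0 / delta)) / (n * epsilon)
    theta = np.zeros(dim)
    for _ in range(steps):
        x = data[rng.integers(n)]                   # sample one record
        g = grad_loss(theta, x)                     # per-example (sub)gradient
        noise = rng.normal(0.0, sigma, size=dim)    # Gaussian noise
        theta = project_to_ball(theta - lr * (g + noise))
    return theta

For the SVM example on slide 15, grad_loss would be a subgradient of the hinge loss of one labeled example.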

27 This talk
1. Background: Differential Privacy
2. Differentially private algorithms for:
   • Convex Empirical Risk Minimization in the centralized model
   • Estimating Succinct Histograms in the local model
3. Generic framework for relaxing Differential Privacy

28 A conundrum
[Diagram: a server interacts with users who visit sites such as Finance.com, Fashion.com, and WeirdStuff.com; the server wants to answer questions like "How many users like Business.com?"]
How can the server compute aggregate statistics about users without storing user-specific information?

29 Succinct histograms
A set of items (e.g., websites) [d] = {1, …, d}; a set of users [n]. The frequency of an item a is f(a) = (# of users holding a)/n.
[Diagram: each of the n users holds an item such as Finance.com, Fashion.com, or WeirdStuff.com; an untrusted server builds a histogram over the items 1, …, d.]
Goal: produce a succinct histogram, i.e., a list of the frequent items ("heavy hitters") with estimates of their frequencies (items not on the list implicitly get frequency estimate 0), while providing rigorous privacy guarantees to the users.

30 Local model of Differential Privacy
Def.: An algorithm Q is ε-local differentially private (LDP) if, for any pair of values v, v′ ∈ [d] and for all events S,
    Pr[Q(v) ∈ S] ≤ e^ε · Pr[Q(v′) ∈ S].
[Diagram: user i holds item vi and sends zi = Qi(vi), a differentially private report of user i; the server aggregates the reports into a succinct histogram.]
LDP protocols for frequency estimation are used in the Chrome web browser (RAPPOR) [Erlingsson-Korolova-Pihur '14] and as a basis for other estimation tasks [Dwork-Nissim '04].
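A minimal Python sketch of the textbook ε-LDP mechanism, randomized response on a single bit (e.g., "do you hold item a?"). This is only an illustration of the local model, not the protocol from the talk; for d possible items the talk uses more structured encodings.

import numpy as np

def randomized_response(bit, epsilon, rng=np.random.default_rng()):
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it."""
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return bit if rng.random() < p_truth else 1 - bit

def estimate_frequency(reports, epsilon):
    """Unbiased estimate of the true fraction of 1s from the noisy reports."""
    p = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    observed = np.mean(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

true_bits = np.random.default_rng(0).integers(0, 2, size=10_000)
reports = [randomized_response(b, epsilon=1.0) for b in true_bits]
print(estimate_frequency(reports, epsilon=1.0))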

31 Performance measures
• Error is measured by the worst-case estimation error over items (the largest gap between an item's true and estimated frequency).
• A protocol is efficient if it runs in time poly(log(d), n).
• Communication complexity is measured by the number of bits transmitted per user.
Note: d is very large, e.g., the number of all possible URLs, so log(d) = the number of bits needed to describe a single URL.

32 Contributions [B, Smith '15]
1. Efficient ε-LDP protocol with optimal error: runs in time poly(log(d), n) and estimates all frequencies up to the optimal error.
2. Matching lower bound on the error.
3. Generic transformation reducing the communication complexity to 1 bit per user.
Previous protocols were either too slow [Mishra-Sandler '06, Hsu-Khanna-Roth '12, EKP '14] or had larger error [HKR '12], and the best previous lower bound was weaker.

33 Design paradigm
Reduction from a simpler problem with a unique heavy hitter (the UHH problem): an efficient protocol with optimal error for UHH yields an efficient protocol with optimal error for the general problem.
UHH: at least a given fraction of the users hold the same item, while the rest hold ⊥ (i.e., "no item").

34 Construction for the UHH problem
Setting: each user holds either the common item v* or ⊥; v* is unknown to the server. Goal: find v* and estimate f(v*).
Pipeline (similar to [Duchi et al. '13]): encode v* with an error-correcting code, apply a local noising operator to each user's report, aggregate, round, and decode.
Key idea: the signal-to-noise ratio determines when decoding succeeds.
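A simplified Python sketch of the UHH idea under illustrative assumptions: items are encoded as random ±1 codewords (standing in for an error-correcting code), each user reports one randomized-response-noised coordinate of their item's codeword (users with no item report a uniformly random sign), and the server decodes by correlating the debiased averages with every codeword. The code length, codebook, and parameters are ours, not the paper's.

import numpy as np

def make_codebook(num_items, code_len, rng):
    # Random ±1 codewords, one per item (illustrative stand-in for a real code).
    return rng.choice([-1.0, 1.0], size=(num_items, code_len))

def user_report(item, codebook, epsilon, rng):
    """Report one random coordinate of the item's codeword via randomized response."""
    code_len = codebook.shape[1]
    j = rng.integers(code_len)
    if item is None:                              # "no item": uniformly random sign
        return j, rng.choice([-1.0, 1.0])
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    sign = codebook[item, j]
    return j, (sign if rng.random() < p_truth else -sign)

def decode(reports, codebook, epsilon):
    code_len = codebook.shape[1]
    p = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    sums = np.zeros(code_len)
    counts = np.zeros(code_len)
    for j, s in reports:
        sums[j] += s
        counts[j] += 1
    # Debias the randomized response, then correlate with every codeword;
    # the score of the true item v* is approximately its frequency f(v*).
    est = sums / np.maximum(counts, 1) / (2.0 * p - 1.0)
    scores = codebook @ est / code_len
    v_hat = int(np.argmax(scores))
    return v_hat, float(scores[v_hat])

# Toy run: 20% of users hold item 7, the rest hold "no item".
rng = np.random.default_rng(1)
codebook = make_codebook(num_items=50, code_len=256, rng=rng)
items = [7 if rng.random() < 0.2 else None for _ in range(20_000)]
reports = [user_report(v, codebook, epsilon=1.0, rng=rng) for v in items]
print(decode(reports, codebook, epsilon=1.0))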

35 Construction for the general setting
Key insight: decompose the general scenario into multiple instances of UHH via hashing, and run parallel copies of the UHH protocol on these instances.
Hashing guarantees that, w.h.p., every heavy hitter (every item whose frequency is high enough) is allocated a "collision-free" copy of the UHH protocol.
[Diagram: items v1, …, vn are hashed into K channels, each of which runs one copy of UHH.]

36 Recap: construction of succinct histograms
Efficient private protocol for a unique heavy hitter (UHH) ⇒ efficient private protocol for estimating all heavy hitters: time poly(log(d), n), and all frequencies estimated up to the optimal error.

37 Transforming to a protocol with 1-bit reports
[Diagram: each user i sends a single bit to the server, which still outputs a succinct histogram.]
Our transformation gives essentially the same error and computational efficiency.

38 This talk
1. Background: Differential Privacy
2. Differentially private algorithms for:
   • Convex Empirical Risk Minimization in the centralized model
   • Estimating Succinct Histograms in the local model
3. Generic framework for relaxing Differential Privacy

39 Attacker's side information
[Diagram: individuals x1, …, xi, …, xn send data to the curator; an attacker queries A and receives answers, and also draws on external information from the internet, social networks, and anonymized datasets.]
The attacker's side information is the main reason privacy is hard.

40 Attacker's side information
Differential privacy is robust against arbitrary side information: it protects even against an omniscient attacker who knows everything except xi. Attackers, however, typically have limited knowledge.
Contributions [B, Groce, Katz, Smith '13]:
• A rigorous framework for formalizing and exploiting limited adversarial information.
• Protocols with higher accuracy than is possible under differential privacy.

41 Exploiting attacker's uncertainty [BGKS '13]
Given some restricted class Δ of possible attacker knowledge, the output of A must "look the same" to the attacker, for any side information in Δ, regardless of whether any single individual is in or out of the computation.
[Diagram: the curator holds x1, …, xi, …, xn; the attacker queries A and holds side information from Δ.]

42 Distributional Differential Privacy [BGKS '13]
Def. (informal): A is DDP if, for any distribution in Δ on the data set, for any index i, for any value v of a data entry, and for any event, the output distribution of A is almost the same whether or not individual i's data is included (up to the usual e^ε factor and additive δ).
Consequence: for any distribution in Δ, almost the same inferences will be made about Alice whether or not Alice's data is present in the data set.

43 What can we release exactly and privately?
Under modest distributional assumptions, we can release several exact statistics while satisfying DDP:
• Sums, whenever the data distribution has a small uniform component.
• Histograms constructed from a random sample from the population.
• Stable functions: functions with a small probability that the output changes when any single entry of the dataset changes.

44 Work in Progress

45 Reliable Adaptive Statistical Analysis
[Diagram: an analyst adaptively issues queries q1, …, qm; a curator holding the population would answer a1, …, am, while a curator holding only a sample x1, …, xn answers b1, …, bm.]
We want to minimize the worst-case error between the true answers based on the population and the answers based on the sample. Answers of differentially private algorithms do not depend on outliers: can DP limit this error, and for what queries?
DP gives rigorous error guarantees for statistical queries [Dwork et al. '15]; is this optimal?
Improvements (B., Smith, Steinke, Ullman, in progress): better error guarantees for a larger class of queries (ERM, PAC learning, …).

46 Future Work

47 Merging Differential Privacy & Secure Function Evaluation
SFE: individuals holding x1, x2, x3 want to compute a function f such that no party learns anything beyond f(x1, x2, x3).

48 Merging Differential Privacy & Secure Function Evaluation
SFE: no party learns anything beyond f(x1, x2, x3). DP: protect against what can be revealed by f(x1, x2, x3) itself.
Goal: secure MPC protocols for differentially private computation of f.
• Computational differential privacy [Mironov et al. '09] in the multiparty setting?
• Currently: limited results, restricted to the 2-party setting [McGregor et al. '10] or incurring large error [McGregor et al. '10, Beimel-Nissim-Omri '08].

49 Conclusions
Privacy is a pressing concern in "Big Data", but hard to define intuitively. Differential privacy is a sound, rigorous approach, robust against arbitrary side information.
This work:
• The first efficient differentially private algorithms with optimal accuracy guarantees for essential tasks in statistical data analysis.
• A generic definitional framework for privacy that relaxes DP.

