
1 Raef Bassily Computer Science & Engineering Pennsylvania State University New Tools for Privacy-Preserving Statistical Analysis IBM Research Almaden February 23, 2015

2 Privacy in Statistical Databases A curator holds a dataset x_1, x_2, ..., x_n of individuals' records. Users (government, researchers, businesses, or a malicious adversary) send queries and receive answers. Two conflicting goals: utility vs. privacy. Balancing these goals is tricky: there is no control over external sources of information (the internet, social networks, other anonymized datasets), and ad-hoc anonymization schemes are unreliable: [Narayanan-Shmatikov'08], [Korolova'11], [Calandrino et al.'12], ... We need algorithms with robust, provable privacy guarantees.

3 This work Gives efficient algorithms for statistical data analyses with optimal accuracy under rigorous, provable privacy guarantees.

4 Differential privacy [DMNS'06, DKMMN'06] Datasets x and x' are called neighbors if they differ in one record. Require: neighboring datasets induce close distributions on outputs. Def.: A randomized algorithm A is (ε, δ)-differentially private if, for all neighboring datasets x, x' and for all events S, Pr[A(x) ∈ S] ≤ e^ε · Pr[A(x') ∈ S] + δ. "Almost the same" conclusions will be reached from the output regardless of whether any individual opts into or out of the dataset. This is a worst-case definition: DP gives the same guarantee regardless of the attacker's side information. Two regimes: pure ε-differential privacy (δ = 0) and approximate (ε, δ)-differential privacy (δ > 0).
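
To make the definition concrete, here is a minimal sketch (my illustration, not from the original slides) of the textbook Laplace mechanism for a counting query; the noise scale 1/ε comes from the query's sensitivity of 1.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng=np.random.default_rng()):
    """ε-differentially private count of records satisfying `predicate`.

    A counting query changes by at most 1 when one record changes
    (sensitivity 1), so adding Laplace noise of scale 1/ε gives ε-DP.
    """
    true_count = sum(1 for x in data if predicate(x))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Toy usage: how many records exceed a threshold, with ε = 0.5.
data = [3.2, 7.9, 5.1, 8.8, 2.4]
print(laplace_count(data, lambda x: x > 5, epsilon=0.5))
```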

5 Two models for private data analysis Centralized model: individuals send their data x_1, ..., x_n to a trusted curator, who runs an algorithm A that is differentially private with respect to datasets of size n. Local model: individuals do not trust the curator; each user i applies a local randomizer Q_i to their own data x_i and sends the report y_i = Q_i(x_i) to the untrusted curator. Each Q_i is differentially private with respect to datasets of size 1.

6 This talk 1. Differentially private algorithms for: Convex Empirical Risk Minimization in the centralized model; Estimating Succinct Histograms in the local model. 2. A generic framework for relaxing Differential Privacy.

7 This talk 1. Differentially private algorithms for: Convex Empirical Risk Minimization in the centralized model; Estimating Succinct Histograms in the local model. 2. A generic framework for relaxing Differential Privacy.

8 Example of Convex ERM: Support Vector Machines Goal: classify data points of different "types" by finding a hyperplane separating them (e.g., tested +ve vs. tested -ve). Many applications, e.g., medical studies: disease classification based on protein structures. The coefficient vector of the hyperplane is the solution of a convex optimization problem defined by the dataset, and it is given by a linear combination of only a few data points, called support vectors.

9 Convex empirical risk minimization Dataset D = (d_1, ..., d_n). Convex constraint set C. Loss function L(θ; D) = (1/n) Σ_{i=1}^n ℓ(θ; d_i), where ℓ(·; d) is convex for all d.

10 Convex empirical risk minimization Dataset D = (d_1, ..., d_n). Convex constraint set C. Loss function L(θ; D) = (1/n) Σ_{i=1}^n ℓ(θ; d_i), where ℓ(·; d) is convex for all d. Goal: find a "parameter" θ ∈ C that minimizes L(θ; D); the actual minimizer is θ* = argmin_{θ ∈ C} L(θ; D).

11 Convex empirical risk minimization: excess risk Dataset D, convex constraint set C, loss L(θ; D) = (1/n) Σ_{i=1}^n ℓ(θ; d_i) with ℓ(·; d) convex. Goal: output θ̂ ∈ C such that the excess (empirical) risk L(θ̂; D) − min_{θ ∈ C} L(θ; D) is small.
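
A compact restatement of the ERM objective and excess risk in standard notation (the slide's formulas are images, so the symbols here are my own, hedged choice):

```latex
% Empirical risk over dataset D = (d_1, \dots, d_n), parameter set C:
\hat{L}(\theta; D) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; d_i),
\qquad \theta^{*} \;=\; \arg\min_{\theta \in C} \hat{L}(\theta; D).

% The algorithm outputs \hat{\theta} \in C; its excess empirical risk is
\mathrm{ExcessRisk}(\hat{\theta}) \;=\; \hat{L}(\hat{\theta}; D) \;-\; \hat{L}(\theta^{*}; D).
```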

12 Other examples Median Linear regression
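
For concreteness, standard convex losses for these examples (and for the SVM of slide 8); the exact forms are my reconstruction, since the slide's formulas are images:

```latex
% Median of scalar data d_i \in \mathbb{R}:
\ell(\theta; d) = |\theta - d| \quad\Rightarrow\quad
\arg\min_{\theta} \tfrac{1}{n}\textstyle\sum_i |\theta - d_i| \text{ is a median of } d_1,\dots,d_n.

% Linear regression on pairs d = (x, y):
\ell(\theta; (x, y)) = \big(\langle \theta, x\rangle - y\big)^2.

% SVM (hinge loss) on labeled pairs d = (x, y), \; y \in \{-1, +1\}:
\ell(\theta; (x, y)) = \max\big(0,\; 1 - y\,\langle \theta, x\rangle\big).
```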

13 Why is privacy hard to maintain in ERM? The dual form of SVM typically contains a subset of the exact data points in the clear. Median: the minimizer is always a data point.

14 Private convex ERM [Chaudhuri-Monteleoni '08, Chaudhuri-Monteleoni-Sarwate '11] Also studied by [Chaudhuri et al. '11, Rubinstein et al. '11, Kifer-Smith-Thakurta '12, Smith-Thakurta '13, ...]. Setting: dataset D, convex set C, loss ℓ; the algorithm A uses random coins and outputs θ̂ ∈ C. Privacy: A is differentially private in the input dataset. Utility is measured by the (worst-case) expected excess risk E[L(θ̂; D)] − min_{θ ∈ C} L(θ; D).

15 Contributions [B, Smith, Thakurta '14] Best previous work [Chaudhuri et al. '11, Kifer et al. '12] addresses a special case (smooth loss functions); applying it to many problems (e.g., SVM, median, ...) introduces large additional error. This work improves the previous excess risk bounds and gives: 1. New algorithms with optimal excess risk, assuming only that the loss function is Lipschitz and the parameter set C is bounded (with a separate set of algorithms for strongly convex losses). 2. Matching lower bounds.

16 Results (dataset size = n, parameter set C of dimension p) Normalized bounds: the loss is 1-Lipschitz on a parameter set C of diameter 1. ε-DP: excess risk on the order of p/(εn), via exponential sampling (inspired by [McSherry-Talwar'07]). (ε, δ)-DP: excess risk on the order of √(p·log(1/δ))/(εn), via noisy stochastic gradient descent (a rigorous analysis of, and improvements to, [McSherry-Williams'10], [Jain-Kothari-Thakurta'12], and [Chaudhuri-Sarwate-Song'13]).

18 Exponential sampling (our ε-DP algorithm) Define a probability distribution over C with density proportional to exp(−ε·L(θ; D)/(2Δ)), where Δ bounds how much L(·; D) can change when one record changes, and output a sample from C according to this distribution. This is an instance of the exponential mechanism [McSherry-Talwar'07]. Efficient construction based on a rapidly mixing MCMC sampler: it uses [Applegate-Kannan'91] as a subroutine, provides a purely multiplicative convergence guarantee, and does not follow directly from existing results. Tight utility analysis via a "peeling" argument that exploits the structure of convex functions: the level sets A_1, A_2, ... are decreasing in volume, which shows that the excess risk is small with high probability.
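
A minimal sketch (my own illustration, not the paper's MCMC-based sampler) of exponential sampling over a finite grid of candidate parameters; the actual algorithm samples from a continuous log-concave density over C.

```python
import numpy as np

def exponential_mechanism(candidates, scores, epsilon, sensitivity,
                          rng=np.random.default_rng()):
    """Sample a candidate with probability proportional to
    exp(-eps * score / (2 * sensitivity)).  Lower score = better (here the
    score is the empirical risk).  `sensitivity` bounds how much any score
    can change when one record of the dataset changes."""
    scores = np.asarray(scores, dtype=float)
    logits = -epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Toy usage: privately pick an approximate median of scalar data from a grid.
data = np.array([0.1, 0.4, 0.45, 0.5, 0.9])
grid = np.linspace(0.0, 1.0, 101)
risk = np.array([np.mean(np.abs(t - data)) for t in grid])  # average absolute-deviation loss
# Changing one of n records in [0, 1] moves the average loss by at most 1/n.
theta_hat = exponential_mechanism(grid, risk, epsilon=1.0, sensitivity=1.0 / len(data))
print(theta_hat)
```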

19 Noisy stochastic gradient descent (our (ε, δ)-DP algorithm) Run SGD with noisy gradient queries for sufficiently many iterations. Our contributions: a tight privacy analysis; the stochastic sampling of gradients gives privacy amplification; and running SGD for many iterations (T = n² iterations) gives the optimal excess risk. Remarks: the stochastic part is only for efficiency; empirically, [CSS'13] showed that a few iterations are enough in some cases.
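
A schematic sketch of noisy projected SGD in the spirit of this slide (Gaussian noise added to each sampled gradient, projection back onto C); the noise scale and iteration count required for a formal (ε, δ)-DP guarantee are set by the paper's analysis and are left as free parameters here.

```python
import numpy as np

def noisy_sgd(data, grad_loss, theta0, radius, steps, lr_fn, noise_sigma,
              rng=np.random.default_rng()):
    """Projected SGD on the empirical risk with Gaussian noise added to each
    single-example gradient.  Projection keeps the iterate inside the
    Euclidean ball of the given radius (standing in for the convex set C)."""
    theta = np.array(theta0, dtype=float)
    n = len(data)
    for t in range(steps):
        d = data[rng.integers(n)]                        # sample one record
        g = grad_loss(theta, d) + rng.normal(0.0, noise_sigma, size=theta.shape)
        theta = theta - lr_fn(t) * g
        norm = np.linalg.norm(theta)
        if norm > radius:                                # project back onto C
            theta = theta * (radius / norm)
    return theta

# Toy usage: noisy-SGD-style linear regression on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)); w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=200)
records = list(zip(X, y))
grad = lambda th, d: 2 * (th @ d[0] - d[1]) * d[0]       # gradient of squared loss
theta_hat = noisy_sgd(records, grad, np.zeros(5), radius=10.0,
                      steps=len(records) ** 2, lr_fn=lambda t: 0.5 / (t + 1),
                      noise_sigma=0.5, rng=rng)
print(theta_hat)
```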

20 Generalization error For a distribution P over records, the generalization error at θ is the expected loss of θ on a fresh sample from P. For any distribution P, the output of any (ε, δ)-DP algorithm satisfies a lower bound on the excess generalization error, and we give an (ε, δ)-DP algorithm whose excess generalization error matches it; for the generalized linear model case, the rate we get is optimal.
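
The standard definitions behind this slide, in my notation (the specific rates on the slide are images and are not reproduced here):

```latex
% Population (generalization) risk of \theta under a distribution P:
L_P(\theta) \;=\; \mathbb{E}_{d \sim P}\big[\ell(\theta; d)\big].

% Excess generalization error of the algorithm's output \hat{\theta}:
L_P(\hat{\theta}) \;-\; \min_{\theta \in C} L_P(\theta).
```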

21 This talk 1. Differentially private algorithms for: Convex Empirical Risk Minimization in the centralized model; Estimating Succinct Histograms in the local model. 2. A generic framework for relaxing Differential Privacy.

22 A conundrum Users visit websites (Finance.com, Fashion.com, WeirdStuff.com, ...), and the server wants to answer questions like "How many users like Business.com?" How can the server compute aggregate statistics about users without storing user-specific information?

23 Succinct histograms A set of items (e.g., websites) [d] = {1, ..., d}; a set of users [n]; each user holds an item; the server is untrusted. The frequency of an item a is f(a) = (♯ users holding a)/n. Goal: produce a succinct histogram, i.e., a list of frequent items ("heavy hitters") and estimates of their frequencies, with all unlisted items implicitly estimated as frequency 0, while providing rigorous privacy guarantees to the users.

24 Local model of Differential Privacy Each user i holds an item v_i and sends a differentially private report z_i = Q_i(v_i); the untrusted server computes a succinct histogram from the reports. An algorithm Q is ε-local differentially private (ε-LDP) if for any pair of items v, v' ∈ [d] and all events S, Pr[Q(v) ∈ S] ≤ e^ε · Pr[Q(v') ∈ S]. LDP protocols for frequency estimation are used in the Chrome web browser (RAPPOR) [Erlingsson-Korolova-Pihur'14] and as a basis for other estimation tasks [Dwork-Nissim'04].
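
A minimal illustration (mine, not the paper's protocol) of the simplest ε-LDP primitive, binary randomized response, together with the unbiased frequency estimate the server can compute from the noisy bits.

```python
import numpy as np

def randomized_response(bit, epsilon, rng=np.random.default_rng()):
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it.
    This satisfies eps-local differential privacy for a single bit."""
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return bit if rng.random() < p_truth else 1 - bit

def estimate_frequency(reports, epsilon):
    """Debias the mean of the reports to get an unbiased estimate of the
    true fraction of users whose bit is 1."""
    p = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return (np.mean(reports) - (1.0 - p)) / (2.0 * p - 1.0)

# Toy usage: 30% of 10,000 users hold the item; each reports under eps = 1.
rng = np.random.default_rng(1)
true_bits = (rng.random(10_000) < 0.3).astype(int)
reports = [randomized_response(b, 1.0, rng) for b in true_bits]
print(estimate_frequency(reports, 1.0))   # close to 0.3
```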

25 Performance measures Error is measured by the worst-case estimation error max_a |f̂(a) − f(a)| over all items a. A protocol is efficient if it runs in time poly(log(d), n). Communication complexity is measured by the number of bits transmitted per user. Note that d is very large, e.g., the number of all possible URLs, and log(d) is the number of bits needed to describe a single URL.

26 Contributions [B, Smith '15] 1. An efficient ε-LDP protocol with optimal error: it runs in time poly(log(d), n) and estimates all frequencies up to the optimal error. 2. A matching lower bound on the error. 3. A generic transformation reducing the communication complexity to 1 bit per user. Previous protocols either ran in time polynomial in d rather than log(d) (too slow) [Mishra-Sandler'06, Hsu-Khanna-Roth'12, EKP'14] or had larger error [HKR'12]; the best previous lower bound was weaker.

27 Design paradigm Reduction from a simpler problem with a unique heavy hitter (the UHH problem): in UHH, at least a certain fraction of users have the same item v*, while the rest have ⊥ (i.e., "no item"). An efficient protocol with optimal error for UHH yields an efficient protocol with optimal error for the general problem.

28 Construction for the UHH problem Each user has either the common item v* or ⊥; v* is unknown to the server. Goal: find v* and estimate f(v*). Construction: encode v* with an error-correcting code; each user applies an ε-LDP noising operator to their encoded item and sends the noisy report; the server aggregates the reports, rounds, and decodes to recover v*. Similar in spirit to [Duchi et al.'13]. Key idea: the signal-to-noise ratio governs success; decoding succeeds when f(v*) is sufficiently large relative to the noise introduced for privacy.
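
A toy simulation (mine) of the signal-plus-noise decoding idea on this slide: users holding v* contribute its codeword, everyone's report is perturbed by noise, and the server decodes by correlating the aggregate with every candidate codeword. This is not an ε-LDP protocol as written (the noise here is generic Gaussian rather than a calibrated LDP randomizer, and the code is a random one rather than the paper's efficiently decodable code); it only illustrates why a heavy hitter survives aggregation.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, m = 1000, 5000, 256          # item universe, number of users, code length
codebook = rng.choice([-1.0, 1.0], size=(d, m))   # random +/-1 "codeword" per item

v_star, frac = 42, 0.2             # the unique heavy hitter and its true frequency
holds_item = rng.random(n) < frac

noise_scale = 8.0                  # stands in for the noise an LDP randomizer would add
reports = np.array([
    (codebook[v_star] if h else np.zeros(m)) + rng.normal(0.0, noise_scale, m)
    for h in holds_item
])

avg = reports.mean(axis=0)                    # aggregate the noisy reports
correlations = codebook @ avg / m             # correlation with every candidate codeword
decoded = int(np.argmax(correlations))
print(decoded, correlations[decoded])         # recovers 42 and roughly f(v*) = 0.2
```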

29 Construction for the general setting Key insight: decompose the general scenario into multiple instances of UHH via hashing, and run parallel copies of the UHH protocol on these instances. Users hash their items v_1, ..., v_n into K buckets (one copy of the UHH protocol per bucket); this guarantees that, w.h.p., every heavy hitter (every item whose frequency is above the threshold) is allocated a "collision-free" copy of the UHH protocol.

30 Recap: construction of succinct histograms An efficient private protocol for a unique heavy hitter (UHH) yields an efficient private protocol for estimating all heavy hitters: it runs in time poly(log(d), n) and estimates all frequencies up to the optimal error.

31 Transforming to a protocol with 1-bit reports Generate a public random string s_i for each user i. User i sends a single biased bit B_i = Gen(Q_i, v_i, s_i) such that, conditioned on B_i = 1, the public string s_i has the same distribution as the output of the local randomizer Q_i. Server's rule: if B_i = 1, treat s_i as the report of user i; otherwise, ignore user i. Key idea: what matters is only the distribution of each local randomizer's output. The public string does not depend on private data, so it can be generated by the untrusted server. This transformation works for any local protocol, not only heavy hitters. For our heavy-hitters protocol, it gives essentially the same error and computational efficiency (Gen can be computed in O(log(d)) time).
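
A small sketch (my illustration, for a finite output alphabet) of the distributional identity this slide relies on: if the user accepts the public sample with the right probability, then conditioned on the bit being 1 the public string is distributed like the randomizer's output. The actual Gen in the paper is additionally designed so that the bit itself is private and cheap to compute; that part is not shown here.

```python
import numpy as np

def send_bit(target_probs, public_probs, public_sample, c, rng=np.random.default_rng()):
    """User-side rule: accept the public sample with probability
    target(s) / (c * public(s)), where c >= max_s target(s)/public(s).
    Conditioned on the bit being 1, the public sample is distributed
    according to target_probs (standard rejection-sampling identity)."""
    ratio = target_probs[public_sample] / (c * public_probs[public_sample])
    return int(rng.random() < ratio)

# Toy check over a 3-symbol output alphabet.
rng = np.random.default_rng(3)
public = np.array([1/3, 1/3, 1/3])     # distribution of the public random string
target = np.array([0.7, 0.2, 0.1])     # output distribution of this user's randomizer Q_i(v_i)
c = float(np.max(target / public))

accepted = []
for _ in range(200_000):
    s = rng.choice(3, p=public)               # public string, independent of private data
    if send_bit(target, public, s, c, rng):   # user sends one biased bit
        accepted.append(s)
print(np.bincount(accepted, minlength=3) / len(accepted))   # approx [0.7, 0.2, 0.1]
```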

32 This talk 1. Differentially private algorithms for: Convex Empirical Risk Minimization in the centralized model; Estimating Succinct Histograms in the local model. 2. A generic framework for relaxing Differential Privacy.

33 Attacker's side information A curator holds a dataset x_1, ..., x_i, ..., x_n; an attacker issues queries and receives answers, and also has side information from external sources (the internet, social networks, anonymized datasets). The attacker's side information is the main reason privacy is hard.

34 Attacker's side information Differential privacy is robust against an omniscient attacker, one who knows everything except x_i; that is, it is robust against arbitrary side information. Attackers, however, typically have limited knowledge. Contributions [B, Groce, Katz, Smith'13]: a rigorous framework, coupled-worlds privacy, for formalizing and exploiting limited adversarial information, and algorithms with higher accuracy than is possible under differential privacy.

35 Exploiting the attacker's uncertainty [BGKS'13] Given some restricted class Δ of attacker's knowledge, for any side information in Δ, the output of A must "look the same" to the attacker regardless of whether any single individual is in or out of the computation.

36 Distributional Differential Privacy [BGKS'13] A is (ε, δ)-DDP if, for any distribution in Δ on the dataset, for any index i, for any value v of a data entry, and for any event S, the output distribution of A conditioned on x_i = v is (up to an e^ε factor and an additive δ) the same as it would be if entry i were removed from the dataset. This implies: for any distribution in Δ and for all i, almost the same inferences will be made about Alice whether or not Alice's data is present in the dataset.

37 What can we release exactly and privately? Under modest distributional assumptions, we can release several exact statistics while satisfying DDP: Sums, whenever the data distribution has a small uniform component. Histograms, when constructed from a random sample from the population. Stable functions, i.e., functions with a small probability that the output changes when any single entry of the dataset changes.

38 Conclusions Privacy is a pressing concern in "Big Data", but it is hard to define intuitively. Differential privacy is a sound, rigorous approach that is robust against arbitrary side information. This work gives: the first efficient differentially private algorithms with optimal accuracy guarantees for essential tasks in statistical data analysis, and a generic definitional framework for privacy that relaxes DP.


Download ppt "Raef Bassily Computer Science & Engineering Pennsylvania State University New Tools for Privacy-Preserving Statistical Analysis IBM Research Almaden February."

Similar presentations


Ads by Google