Raef Bassily, Computer Science & Engineering, Pennsylvania State University. New Tools for Privacy-Preserving Statistical Analysis. Yahoo! Labs, Sunnyvale, February 24, 2015.

Similar presentations
1+eps-Approximate Sparse Recovery Eric Price MIT David Woodruff IBM Almaden.

The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.
Raef Bassily Penn State Local, Private, Efficient Protocols for Succinct Histograms Based on joint work with Adam Smith (Penn State) (To appear in STOC.
Differentially Private Recommendation Systems Jeremiah Blocki Fall A: Foundations of Security and Privacy.
Private Analysis of Graph Structure With Vishesh Karwa, Sofya Raskhodnikova and Adam Smith Pennsylvania State University Grigory Yaroslavtsev
Raef Bassily Adam Smith Abhradeep Thakurta Penn State Yahoo! Labs Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds Penn.
Amortizing Garbled Circuits Yan Huang, Jonathan Katz, Alex Malozemoff (UMD) Vlad Kolesnikov (Bell Labs) Ranjit Kumaresan (Technion) Cut-and-Choose Yao-Based.
Eran Omri, Bar-Ilan University Joint work with Amos Beimel and Ilan Orlov, BGU Ilan Orlov…!??!!
Seminar in Foundations of Privacy 1.Adding Consistency to Differential Privacy 2.Attacks on Anonymized Social Networks Inbal Talgam March 2008.
Raef Bassily Computer Science & Engineering Pennsylvania State University New Tools for Privacy-Preserving Statistical Analysis IBM Research Almaden February.
Anatomy: Simple and Effective Privacy Preservation Israel Chernyak DB Seminar (winter 2009)
Privacy without Noise Yitao Duan NetEase Youdao R&D Beijing China CIKM 2009.
Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining Petros Drineas Rensselaer Polytechnic Institute (joint.
Locally Decodable Codes Uri Nadav. Contents What is Locally Decodable Code (LDC) ? Constructions Lower Bounds Reduction from Private Information Retrieval.
Foundations of Privacy Lecture 11 Lecturer: Moni Naor.
On Everlasting Security in the Hybrid Bounded Storage Model Danny Harnik Moni Naor.
Private Analysis of Graphs
The Complexity of Differential Privacy Salil Vadhan Harvard University.
Overview of Privacy Preserving Techniques.  This is a high-level summary of the state-of-the-art privacy preserving techniques and research areas  Focus.
1 Privacy-Preserving Distributed Information Sharing Nan Zhang and Wei Zhao Texas A&M University, USA.
Data mining and machine learning A brief introduction.
Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 2.
Adaptive CSMA under the SINR Model: Fast convergence using the Bethe Approximation Krishna Jagannathan IIT Madras (Joint work with) Peruru Subrahmanya.
Outline What Neural Networks are and why they are desirable Historical background Applications Strengths neural networks and advantages Status N.N and.
Trust-Aware Optimal Crowdsourcing With Budget Constraint Xiangyang Liu 1, He He 2, and John S. Baras 1 1 Institute for Systems Research and Department.
Differentially Private Marginals Release with Mutual Consistency and Error Independent of Sample Size Cynthia Dwork, Microsoft.
Privacy by Learning the Database Moritz Hardt DIMACS, October 24, 2012.
Refined privacy models
Disclosure risk when responding to queries with deterministic guarantees Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University.
Personalized Social Recommendations – Accurate or Private? A. Machanavajjhala (Yahoo!), with A. Korolova (Stanford), A. Das Sarma (Google) 1.
On the Communication Complexity of SFE with Long Output Daniel Wichs (Northeastern) joint work with Pavel Hubáček.
Privacy vs. Utility Xintao Wu University of North Carolina at Charlotte Nov 10, 2008.
Foundations of Privacy Lecture 5 Lecturer: Moni Naor.
Differential Privacy Some contents are borrowed from Adam Smith’s slides.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Massive Data Sets and Information Theory Ziv Bar-Yossef Department of Electrical Engineering Technion.
Fast Query-Optimized Kernel Machine Classification Via Incremental Approximate Nearest Support Vectors by Dennis DeCoste and Dominic Mazzoni International.
Probabilistic km-anonymity (Efficient Anonymization of Large Set-valued Datasets) Gergely Acs (INRIA) Jagdish Achara (INRIA)
Differential Privacy (1). Outline  Background  Definition.
Differential Privacy Xintao Wu Oct 31, Sanitization approaches Input perturbation –Add noise to data –Generalize data Summary statistics –Means,
A Brief Maximum Entropy Tutorial Presenter: Davidson Date: 2009/02/04 Original Author: Adam Berger, 1996/07/05
Secure Data Outsourcing
Yang, et al. Differentially Private Data Publication and Analysis. Tutorial at SIGMOD’12 Part 4: Data Dependent Query Processing Methods Yin “David” Yang.
Learning with General Similarity Functions Maria-Florina Balcan.
Sergey Yekhanin Institute for Advanced Study Lower Bounds on Noise.
Differential Privacy with Bounded Priors: Reconciling Utility and Privacy in Genome-Wide Association Studies Florian Tramèr, Zhicong Huang, Erman Ayday,
Reconciling Confidentiality Risk Measures from Statistics and Computer Science Jerry Reiter Department of Statistical Science Duke University.
University of Texas at El Paso
Data Transformation: Normalization
Private Data Management with Verification
MIRA, SVM, k-NN Lirong Xia.
Table 1. Advantages and Disadvantages of Traditional DM/ML Methods
Understanding Generalization in Adaptive Data Analysis
Privacy-preserving Release of Statistics: Differential Privacy
Generalization and adaptivity in stochastic convex optimization
Privacy and Fault-Tolerance in Distributed Optimization Nitin Vaidya University of Illinois at Urbana-Champaign.
Privacy-Preserving Classification
Differential Privacy in Practice
Differential Privacy in the Local Setting
Vitaly (the West Coast) Feldman
Current Developments in Differential Privacy
Differential Privacy and Statistical Inference: A TCS Perspective
Privacy-preserving Prediction
Binghui Wang, Le Zhang, Neil Zhenqiang Gong
Generalization bounds for uniformly stable algorithms
Published in: IEEE Transactions on Industrial Informatics
Presentation transcript:

Raef Bassily Computer Science & Engineering Pennsylvania State University New Tools for Privacy-Preserving Statistical Analysis Yahoo! Labs Sunnyvale February 24, 2015

Privacy and Big Data. “Big Data is transforming our world!” – just about everyone. “‘Big Data’ is the amassing of huge amounts of statistical information on social and economic trends and human behavior.” – M. Chen, The Nation. Hasn’t data always been big?! (Particle physics, astronomy, geology, …) The difference: this data contains sensitive information about individuals. This work: How can we get the benefits of Big Data and still provide rigorous privacy guarantees?

Privacy in Statistical Databases. Setting: individuals contribute their data to a curator/agency; users (government, researchers, businesses, or a malicious adversary) send queries and receive answers. Two conflicting goals. Utility: users can extract “aggregate” statistics. Privacy: individual information remains hidden. Challenge: achieve both! Not easy!

Ubiquity of information. The curator holds individuals’ data x_1, x_2, …, x_n and answers queries from users (government, researchers, businesses, or a malicious adversary). External sources of information abound: the internet, social networks, anonymized datasets. We cannot assume we know or control them, and we cannot ignore them. Ad-hoc anonymization schemes are regularly broken.

Some Published Attacks. Anonymized datasets [Narayanan, Shmatikov ’08, …]. Social networks [Backstrom, Dwork, Kleinberg ’07, NS ’09, …]. Genetic data (GWAS) [Homer et al. ’08, …]. Microtargeted advertising [Korolova ’11, …]. Recommendation systems [Calandrino et al. ’11, …]. Combining independent anonymized releases, e.g., from Hospital A and Hospital B [Ganta, Kasiviswanathan, Smith ’08].

Attack on Recommender Systems [Calandrino et al. ’11]. Bob (the attacker) has the side information that Alice bought items A, B, and C. Alice now also buys G. Before her purchase, the public “related items” lists are List(A), List(B), List(C); later, each of these lists also contains G. From that change alone, Bob infers Alice’s new purchase.

“Aggregate” is not necessarily safe. Even aggregate statistics encode information about individual records, e.g., the average salary of a department released before and after a professor resigns reveals that professor’s salary. Reconstruction attacks: answering too many, too “accurate” statistics lets an attacker reconstruct the data [Dinur, Nissim ’03, Dwork, McSherry, Talwar ’07, Kasiviswanathan, Rudelson, Smith, Ullman ’10, KRS ’13, …].
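As a minimal numeric illustration of the salary example above (all numbers are invented), two exact averages are enough to recover one person's salary:

```python
# Toy differencing attack: the department's average salary released before
# and after one professor resigns pins down that professor's salary exactly.
# All numbers below are made up for illustration.
n_before = 20
avg_before = 95_000.0    # exact average over 20 faculty members
avg_after = 93_500.0     # exact average over the remaining 19
salary_of_leaver = n_before * avg_before - (n_before - 1) * avg_after
print(salary_of_leaver)  # 123500.0
```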

This work gives efficient algorithms for statistical data analyses with optimal accuracy under rigorous, provable privacy guarantees.

This talk. 1. Background: Differential Privacy. 2. Differentially private algorithms for: Convex Empirical Risk Minimization in the centralized model; Estimating Succinct Histograms in the local model. 3. A generic framework for relaxing Differential Privacy.


Differential privacy [Dwork-McSherry-Nissim-Smith ’06, Dwork-Kenthapadi-McSherry-Mironov-Naor ’06]. Picture two settings: individuals hand their data x_1, x_2, …, x_n to the curator, or the same data with one record replaced (x_2 changed to x_2’); the algorithm A produces an output for the user/analyst in each case. Differential privacy requires that these two settings “look the same” to the user.

Differential privacy [DMNS ’06, DKMMN ’06]. Datasets x and x’ are called neighbors if they differ in one record. Require: neighboring datasets induce close distributions on outputs. Definition: a randomized algorithm A (using its own local random coins) is (ε, δ)-differentially private if, for all neighboring datasets x and x’ and for all events S, Pr[A(x) ∈ S] ≤ e^ε · Pr[A(x’) ∈ S] + δ. Think of δ as very small (much smaller than 1/n). Two regimes: pure ε-differential privacy (δ = 0) and (ε, δ)-differential privacy. Interpretation: “almost the same” conclusions will be reached from the output regardless of whether any individual opts into or opts out of the data set. This is a worst-case definition: DP gives the same guarantee regardless of the side information of the attacker.
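To make the definition concrete, here is a minimal sketch (not from the talk; the function name and example data are invented) of the standard Laplace mechanism for a counting query, which satisfies pure ε-DP:

```python
import numpy as np

def private_count(data, predicate, epsilon, rng=None):
    # Release a counting query under (epsilon, 0)-differential privacy.
    # Changing one record changes the true count by at most 1 (sensitivity 1),
    # so adding Laplace noise of scale 1/epsilon gives
    # Pr[A(x) in S] <= e^epsilon * Pr[A(x') in S] for neighboring x, x'.
    rng = rng or np.random.default_rng()
    true_count = sum(1 for record in data if predicate(record))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: a noisy answer to "how many records exceed 5?"
print(private_count([3.2, 7.5, 1.1, 9.8, 4.4], lambda r: r > 5, epsilon=0.5))
```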

Two models for private data analysis. Centralized model: individuals send their data x_1, …, x_n to a trusted curator, which runs an algorithm A that is differentially private with respect to datasets of size n. Local model: the curator is untrusted; each individual i applies a randomizer Q_i to their own data x_i and sends only the report y_i = Q_i(x_i), so each Q_i must be differentially private with respect to datasets of size 1.

This talk. 1. Background: Differential Privacy. 2. Differentially private algorithms for: Convex Empirical Risk Minimization in the centralized model; Estimating Succinct Histograms in the local model. 3. A generic framework for relaxing Differential Privacy.

Example of convex ERM: Support Vector Machines (SVM). Goal: classify data points of different “types” (e.g., tested positive vs. tested negative) by finding a hyperplane separating the two types of points. Many applications, e.g., medical studies: disease classification based on protein structures. The coefficient vector of the hyperplane is the solution of a convex optimization problem defined by the data set, and it is given by a linear combination of only a few data points, called support vectors.

Convex empirical risk minimization. Dataset x = (x_1, …, x_n). Convex constraint set C ⊆ R^d. Empirical risk function L(θ; x) = (1/n) Σ_{i=1}^{n} ℓ(θ; x_i), where ℓ(·; x_i) is convex for every i.

Convex empirical risk minimization. Dataset x = (x_1, …, x_n). Convex constraint set C ⊆ R^d. Empirical risk function L(θ; x) = (1/n) Σ_{i=1}^{n} ℓ(θ; x_i), where ℓ(·; x_i) is convex for every i. Goal: find a “parameter” θ ∈ C that minimizes L(θ; x); call the actual minimizer θ*.

Excess risk. Same setting: dataset x = (x_1, …, x_n), convex constraint set C, empirical risk L(θ; x) = (1/n) Σ_{i=1}^{n} ℓ(θ; x_i) with each ℓ(·; x_i) convex. Goal: find a “parameter” that minimizes L(θ; x). The algorithm outputs θ̂ ∈ C, and its quality is measured by the excess risk L(θ̂; x) − min_{θ ∈ C} L(θ; x).
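To ground the definitions, a small numpy sketch (not from the talk; the helper names are invented) of an empirical risk with the convex hinge loss used by SVM, and the excess risk of a candidate output:

```python
import numpy as np

def empirical_risk(theta, X, y):
    # L(theta; x) = (1/n) * sum_i loss(theta; x_i), with the hinge loss;
    # X holds one data point per row and y holds +-1 labels.
    margins = y * (X @ theta)
    return np.mean(np.maximum(0.0, 1.0 - margins))

def excess_risk(theta_out, theta_star, X, y):
    # Excess risk of an output relative to the actual minimizer theta_star.
    return empirical_risk(theta_out, X, y) - empirical_risk(theta_star, X, y)
```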

Other examples: the median, linear regression.

Why is privacy hard to maintain in ERM? The dual form of SVM typically contains a subset of the exact data points in the clear. The median: the minimizer is always a data point.

Private convex ERM. Studied by [Chaudhuri et al. ’11, Rubinstein et al. ’11, Kifer-Smith-Thakurta ’12, Smith-Thakurta ’13, …]. An (ε, δ)-differentially private algorithm A receives the dataset, the convex set C, and the risk function, flips its own random coins, and outputs θ̂. Privacy: A is differentially private in its input dataset. Utility is measured by the (worst-case) expected excess risk E[L(θ̂; x)] − min_{θ ∈ C} L(θ; x).

Contributions [B, Smith, Thakurta ’14]: 1. Efficient algorithms with optimal excess risk (and a separate set of algorithms for strongly convex risk functions). 2. Matching lower bounds on the excess risk. The best previous work [Chaudhuri et al. ’11, Kifer et al. ’12] addresses a special case (smooth functions); applying it to many problems (e.g., SVM, the median, …) introduces large additional error. This work improves on the previous excess risk bounds.

Results (dataset size n, parameter set C). Normalized bounds: the risk is 1-Lipschitz on a parameter set C of diameter 1. For each privacy regime, an algorithm achieving the optimal excess risk: ε-DP via exponential sampling (inspired by [McSherry-Talwar ’07]); (ε, δ)-DP via noisy stochastic gradient descent (a rigorous analysis of, and improvements to, [McSherry-Williams ’10], [Jain-Kothari-Thakurta ’12], and [Chaudhuri-Sarwate-Song ’13]).


Exponential sampling. Define a probability distribution over C that places exponentially more mass on parameters with smaller empirical risk (the exponential mechanism of [McSherry-Talwar ’07]), and output a sample from C according to that distribution. Contributions: efficient sampling based on a rapidly mixing MCMC, and a tight analysis exploiting the structure of convex functions.
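A minimal sketch of exponential sampling over a finite grid of candidate parameters covering C (the talk's algorithm samples from the continuous set via MCMC; the grid, the sensitivity argument, and the function names here are simplifying assumptions):

```python
import numpy as np

def exponential_sampling(risk_fn, grid, epsilon, n, sensitivity=1.0, rng=None):
    # Exponential mechanism for ERM on a finite grid: each candidate theta is
    # drawn with probability proportional to
    #     exp(-epsilon * n * L(theta) / (2 * sensitivity)),
    # where `sensitivity` bounds how much a single per-example loss can change
    # when one record of the dataset changes.
    rng = rng or np.random.default_rng()
    risks = np.array([risk_fn(theta) for theta in grid])
    logits = -epsilon * n * risks / (2.0 * sensitivity)
    logits -= logits.max()          # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return grid[rng.choice(len(grid), p=probs)]
```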

Noisy stochastic gradient descent. Run SGD with noisy gradient queries for sufficiently many iterations. Contributions: stochasticity yields privacy amplification, and a tight analysis.
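A sketch of noisy SGD (not the paper's exact algorithm): at each step, take the gradient of the loss at one randomly sampled example, add Gaussian noise, and project back onto C. The noise scale below is only an illustrative placeholder; the real calibration comes from the paper's privacy analysis.

```python
import numpy as np

def noisy_sgd(grad_fn, project, X, y, epsilon, delta, steps, step_size,
              lipschitz=1.0, rng=None):
    # grad_fn(theta, x_i, y_i): gradient of a single-example loss, assumed
    # bounded in norm by `lipschitz`; project(.) maps a point back onto C.
    rng = rng or np.random.default_rng()
    n, d = X.shape
    # Illustrative noise scale only; not the exact calibration in the paper.
    sigma = lipschitz * np.sqrt(steps * np.log(1.0 / delta)) / (n * epsilon)
    theta = project(np.zeros(d))
    for _ in range(steps):
        i = rng.integers(n)   # sampling one example also gives privacy amplification
        noisy_grad = grad_fn(theta, X[i], y[i]) + rng.normal(0.0, sigma, size=d)
        theta = project(theta - step_size * noisy_grad)
    return theta
```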

This talk. 1. Background: Differential Privacy. 2. Differentially private algorithms for: Convex Empirical Risk Minimization in the centralized model; Estimating Succinct Histograms in the local model. 3. A generic framework for relaxing Differential Privacy.

A conundrum. Users visit sites such as Finance.com, Fashion.com, WeirdStuff.com, … The server wants to know, e.g., how many users like Business.com. How can the server compute aggregate statistics about users without storing user-specific information?

Succinct histograms. There are n users and an untrusted server. The set of items (e.g., websites) is [d] = {1, …, d}; the set of users is [n]. The frequency of an item a is f(a) = (♯ users holding a) / n. A succinct histogram is a short list of frequent items (“heavy hitters”) together with estimates of their frequencies; every item not on the list implicitly gets estimated frequency 0. Goal: produce a succinct histogram while providing rigorous privacy guarantees to the users.
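For reference, a non-private baseline (a sketch with invented names) that computes the succinct histogram directly from the raw items; the LDP protocols below approximate this without the server ever seeing a raw item:

```python
from collections import Counter

def succinct_histogram(items, threshold):
    # Keep only items whose empirical frequency f(a) = (# users holding a) / n
    # reaches the threshold (the "heavy hitters"), with their frequencies;
    # every other item implicitly gets frequency 0.
    n = len(items)
    return {a: c / n for a, c in Counter(items).items() if c / n >= threshold}

# Example over a tiny item universe:
print(succinct_histogram(["news.com", "news.com", "blog.net", "news.com"], 0.5))
```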

Local model of Differential Privacy. User i holds an item v_i and applies a local randomizer Q_i to produce a differentially private report z_i = Q_i(v_i); the server aggregates the reports z_1, …, z_n into a succinct histogram. An algorithm Q is ε-local differentially private (LDP) if for any pair v, v’ ∈ [d] and for all events S, Pr[Q(v) ∈ S] ≤ e^ε · Pr[Q(v’) ∈ S]. LDP protocols for frequency estimation are used in the Chrome web browser (RAPPOR) [Erlingsson-Korolova-Pihur ’14] and as a basis for other estimation tasks [Dwork-Nissim ’04].
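The canonical example of an ε-LDP randomizer is binary randomized response; a minimal sketch for a single bit per user (an illustration of the local model, not the talk's histogram protocol):

```python
import numpy as np

def randomized_response(bit, epsilon, rng=None):
    # Report the true bit with probability e^eps / (e^eps + 1), else flip it.
    # For a single bit, this local randomizer is epsilon-LDP.
    rng = rng or np.random.default_rng()
    keep = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return bit if rng.random() < keep else 1 - bit

def estimate_fraction(reports, epsilon):
    # Unbiased estimate, from the noisy reports, of the fraction of users
    # whose true bit is 1.
    keep = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return (np.mean(reports) - (1.0 - keep)) / (2.0 * keep - 1.0)
```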

Performance measures. Error is measured by the worst-case estimation error max_a |f̂(a) − f(a)|, where items omitted from the succinct histogram count as estimate 0. A protocol is efficient if it runs in time poly(log(d), n). Communication complexity is measured by the number of bits transmitted per user. Note that d is very large, e.g., the number of all possible URLs, so log(d) is the number of bits needed to describe a single URL.
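A small sketch of the error measure just described (the helper name is invented); items omitted from the succinct histogram are treated as estimated 0:

```python
def worst_case_error(true_freqs, est_freqs):
    # max over items a of |f_hat(a) - f(a)|; items missing from either
    # dictionary count as frequency 0.
    items = set(true_freqs) | set(est_freqs)
    return max(abs(true_freqs.get(a, 0.0) - est_freqs.get(a, 0.0)) for a in items)
```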

Contributions [B, Smith ’15]: 1. An efficient ε-LDP protocol with optimal error: it runs in time poly(log(d), n) and estimates all frequencies up to the optimal error. 2. A matching lower bound on the error. 3. A generic transformation reducing the communication complexity to 1 bit per user. Previous protocols either ran in time polynomial in d (too slow when d is huge) [Mishra-Sandler ’06, Hsu-Khanna-Roth ’12, EKP ’14] or had larger error [HKR ’12], and the best previously known lower bound was weaker.

Design paradigm. Reduction from a simpler problem with a unique heavy hitter (the UHH problem): at least a certain fraction of the users hold the same item, while the rest hold nothing (i.e., “no item”). An efficient protocol with optimal error for UHH yields an efficient protocol with optimal error for the general problem.

Construction for the UHH problem. Each user holds either v* or “no item”; v* is unknown to the server. Goal: find v* and estimate f(v*). Each user encodes v* with an error-correcting code, passes the encoding through a noising operator (for privacy), and sends the result z_i; the server averages the reports, rounds, and decodes (similar to [Duchi et al. ’13]). Key idea: the averaged reports have a signal-to-noise ratio governed by f(v*), and decoding succeeds when that ratio is large enough.
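A toy, non-private signal-plus-noise simulation of the decoding idea (random ±1 codewords, per-user noise, correlation decoding). The sizes, the Gaussian noise, and its scale are illustrative assumptions; this is not the paper's actual protocol:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 1000, 256, 5000          # items, codeword length, users (toy sizes)
codes = rng.choice([-1.0, 1.0], size=(d, m))   # a random +-1 codeword per item
v_star, f_star = 42, 0.3           # the unique heavy hitter and its frequency

# Users holding v_star contribute a noisy copy of its codeword; the remaining
# users contribute pure noise, standing in for the "no item" users.
holds = (rng.random(n) < f_star)[:, None]
reports = holds * codes[v_star] + rng.normal(0.0, 3.0, size=(n, m))

mean_report = reports.mean(axis=0)
correlations = codes @ mean_report / m     # correlate with every codeword
decoded = int(np.argmax(correlations))     # succeeds when signal-to-noise is high
print(decoded == v_star, round(correlations[decoded], 3))  # estimate of f(v*)
```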

Construction for the general setting. Key insight: decompose the general scenario into multiple instances of UHH via hashing, and run parallel copies of the UHH protocol on those instances. Hashing the users’ items v_1, …, v_n into K buckets guarantees that, w.h.p., every heavy hitter (every item whose frequency exceeds the threshold) is allocated a “collision-free” copy of the UHH protocol.
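A minimal sketch of the hashing step (the toy hash and names are assumptions; the real protocol would use a proper hash family): items are mapped to one of K buckets, and when K is large relative to the number of heavy hitters, each heavy hitter sits alone in its bucket with high probability, so each bucket can be treated as a UHH instance:

```python
import numpy as np

def assign_buckets(item_ids, num_buckets, seed=0):
    # Toy hash: a fixed random table mapping each item id to a bucket.
    table = np.random.default_rng(seed).integers(0, num_buckets, size=max(item_ids) + 1)
    return table[np.asarray(item_ids)]

# Example: a few heavy hitters are likely to land in distinct buckets.
print(assign_buckets([7, 123, 980], num_buckets=64))
```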

Recap: construction of succinct histograms. An efficient private protocol for a unique heavy hitter (UHH) yields an efficient private protocol for estimating all heavy hitters, running in time poly(log(d), n) and estimating all frequencies up to the optimal error.

Transforming to a protocol with 1-bit reports. Each user i sends only a single bit to the server in place of the full report Q_i(v_i); our generic transformation gives essentially the same error and computational efficiency.
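A toy illustration of report compression (not the paper's exact transformation): a coordinate of each user's long ±1 report is chosen by randomness shared with the server (here derived from a seed and the user id), so only one bit is transmitted, and averaging the rescaled bits recovers the mean report in expectation:

```python
import numpy as np

def one_bit_report(full_report, user_id, round_seed):
    # Pick a coordinate using shared (public) randomness and send its sign bit.
    m = len(full_report)
    j = np.random.default_rng(round_seed + user_id).integers(m)
    return 1 if full_report[j] > 0 else 0

def aggregate_bits(bits, user_ids, round_seed, m):
    # Server side: rebuild an unbiased estimate of the average +-1 report,
    # scaling each observed coordinate by m since it was sampled w.p. 1/m.
    n = len(bits)
    est = np.zeros(m)
    for b, uid in zip(bits, user_ids):
        j = np.random.default_rng(round_seed + uid).integers(m)
        est[j] += m * (2 * b - 1) / n
    return est
```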

This talk. 1. Background: Differential Privacy. 2. Differentially private algorithms for: Convex Empirical Risk Minimization in the centralized model; Estimating Succinct Histograms in the local model. 3. A generic framework for relaxing Differential Privacy.

Attacker’s side information. The curator holds x_1, …, x_i, …, x_n and answers the attacker’s queries; the attacker can also draw on external sources such as the internet, social networks, and anonymized datasets. The attacker’s side information is the main reason privacy is hard.

Attacker’s side information. Differential privacy is robust against arbitrary side information: it protects even against an omniscient attacker who knows everything except x_i. Attackers, however, typically have limited knowledge. Contributions [B, Groce, Katz, Smith ’13]: a rigorous framework for formalizing and exploiting limited adversarial information, and protocols with higher accuracy than is possible under differential privacy.

Exploiting the attacker’s uncertainty [BGKS ’13]. Given some restricted class Δ of the attacker’s possible knowledge, the output of A must “look the same” to the attacker, for any side information in Δ, regardless of whether any single individual is in or out of the computation.

Distributional Differential Privacy [BGKS ’13]. A (using its own local random coins) is (ε, δ)-DDP if, for any distribution in Δ on the data set, any index i, any value v of a data entry, and any event, the distribution of A’s output changes by at most an e^ε factor (plus an additive δ) whether or not the i-th entry is included, conditioned on that entry’s value being v. This implies: for all distributions in Δ and all i, with high probability, almost the same inferences will be made about Alice whether or not Alice’s data is present in the data set.

What can we release exactly and privately? Under modest distributional assumptions, we can release several exact statistics while satisfying DDP: sums, whenever the data distribution has a small uniform component; histograms constructed from a random sample from the population; and stable functions, i.e., functions with a small probability that the output changes when any single entry of the dataset changes.

Work in Progress

Reliable Adaptive Statistical Analysis. An analyst adaptively asks queries q_1, …, q_m; the curator holds a sample x_1, …, x_n from a population and returns answers a_1, …, a_m, while b_1, …, b_m denote the answers computed on the population itself. We want to minimize the worst error between the true answers based on the population and the answers based on the sample. Answers of differentially private algorithms do not depend on outliers; can DP limit this error, and for which queries? DP already gives rigorous error guarantees for statistical queries [Dwork et al. ’15]; is this optimal? Improvements (B., Smith, Steinke, Ullman, in progress): better error guarantees for a larger class of queries (ERM, PAC learning, …).

Future Work

Merging Differential Privacy & Secure Function Evaluation. SFE: individuals holding x_1, x_2, x_3 want to compute a function f such that no party learns anything beyond f(x_1, x_2, x_3).

Merging Differential Privacy & Secure Function Evaluation. SFE: individuals want to compute f(x_1, x_2, x_3) such that no party learns anything beyond the output; DP protects against what can be revealed by f(x_1, x_2, x_3) itself. Goal: secure MPC protocols for differentially private computation of f. Does computational differential privacy [Mironov et al. ’09] extend to the multiparty setting? Currently there are only limited results: the 2-party setting [McGregor et al. ’10], or protocols with large error [McGregor et al. ’10, Beimel-Nissim-Omri ’08].

Conclusions. Privacy is a pressing concern in “Big Data”, but it is hard to define intuitively. Differential privacy is a sound, rigorous approach that is robust against arbitrary side information. This work: the first efficient differentially private algorithms with optimal accuracy guarantees for essential tasks in statistical data analysis, and a generic definitional framework for privacy that relaxes DP.