Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds
Raef Bassily (Penn State), Adam Smith (Penn State; work done at BU/Harvard), Abhradeep Thakurta (Yahoo! Labs; work done at MSR/Stanford)

Privacy in Statistical Databases
Users contribute records x_1, x_2, …, x_n (internet activity, social-network data, "anonymized" datasets) to a trusted curator, which runs an algorithm A that answers queries from government, researchers, and businesses, or from a malicious adversary.
Two conflicting goals: utility vs. privacy. Balancing these goals is tricky:
- No control over external sources of information.
- Anonymization is unreliable: [Narayanan-Shmatikov'08], [Korolova'11], [Calandrino et al.'12], …

Differential privacy [Dwork-McSherry-Nissim-Smith'06]
Datasets x = (x_1, …, x_n) and x' are called neighbors if they differ in one record (say x_2 is replaced by x_2').
Require: neighboring datasets induce close distributions on the outputs of the (randomized) algorithm A.
Def.: A randomized algorithm A is (ε, δ)-differentially private if, for all datasets x and x' that differ in one element and for all events S,
    Pr[A(x) ∈ S] ≤ e^ε · Pr[A(x') ∈ S] + δ.
"Almost the same" conclusions will be reached whether or not any individual opts into or out of the dataset.
Two regimes: ε-differential privacy (δ = 0), and (ε, δ)-differential privacy with δ > 0 (typically negligible).
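As a concrete illustration of the definition (our example, not from the talk), here is a minimal sketch of the classic Laplace mechanism for a counting query; the function name and the query are ours, and the noise scale 1/ε is the standard calibration for a sensitivity-1 statistic.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, seed=None):
    """epsilon-DP estimate of how many records satisfy `predicate`.

    Changing one record changes the true count by at most 1 (sensitivity 1),
    so adding Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    """
    rng = np.random.default_rng(seed)
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Neighboring datasets: they differ in the last record only.
x       = [0, 1, 1, 0, 1]
x_prime = [0, 1, 1, 0, 0]
print(laplace_count(x, lambda r: r == 1, epsilon=0.1))
print(laplace_count(x_prime, lambda r: r == 1, epsilon=0.1))
```

On neighboring datasets the two output distributions are shifted by at most the sensitivity, so their densities differ by a factor of at most e^ε at every point, which is exactly the (ε, 0) case of the definition above.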

This work: construct efficient, differentially private algorithms for convex empirical risk minimization with optimal excess risk.

Convex empirical risk minimization
Dataset D = (d_1, …, d_n). Convex set C ⊆ R^p. Loss function L(θ; D) = (1/n) Σ_{i=1}^{n} ℓ(θ; d_i), where ℓ(·; d) is convex over C for every record d.
Goal: find a "parameter" θ ∈ C that minimizes the empirical risk; the actual minimizer is θ* = argmin_{θ ∈ C} L(θ; D).
Output θ_priv ∈ C such that the excess risk L(θ_priv; D) − L(θ*; D) is small.

Examples
- Median: ℓ(θ; d) = |θ − d|.
- Linear regression: ℓ(θ; (x, y)) = (y − ⟨θ, x⟩)².
- Support vector machine: ℓ(θ; (x, y)) = hinge(y⟨θ, x⟩), where hinge(z) = max(0, 1 − z).
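For concreteness, a small sketch (ours, not from the slides) of two of these per-example losses and the resulting empirical risk; the function names are illustrative.

```python
import numpy as np

def hinge_loss(theta, x, y):
    """SVM loss for one example (x, y) with label y in {-1, +1}."""
    return max(0.0, 1.0 - y * np.dot(theta, x))

def squared_loss(theta, x, y):
    """Linear-regression loss for one example (x, y)."""
    return (y - np.dot(theta, x)) ** 2

def empirical_risk(theta, data, loss):
    """Average per-example loss over the dataset D = [(x_1, y_1), ...]."""
    return float(np.mean([loss(theta, x, y) for x, y in data]))

# Toy usage
rng = np.random.default_rng(0)
data = [(rng.standard_normal(3), rng.choice([-1.0, 1.0])) for _ in range(10)]
print(empirical_risk(np.zeros(3), data, hinge_loss))
```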

Private convex ERM [Chaudhuri-Monteleoni'08, Chaudhuri-Monteleoni-Sarwate'11]
Studied by [Chaudhuri et al.'11, Rubinstein et al.'11, Kifer-Smith-Thakurta'12, Smith-Thakurta'13, …].
Setup: the algorithm A takes the dataset D, the convex set C, and the loss ℓ, uses its own random coins, and outputs θ_priv.
Privacy: A is (ε, δ)-differentially private in its input D.
Utility is measured by the (worst-case) expected excess risk E[L(θ_priv; D)] − L(θ*; D), where the expectation is over the coins of A. (Recall that θ* = argmin_{θ ∈ C} L(θ; D).)

Why care about privacy in ERM?
- Dual form of SVM: the solution typically contains a subset of the exact data points (the support vectors) in the clear (see the example below).
- Median: the minimizer is always a data point.
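A quick illustration of the SVM point (our example, not from the talk): scikit-learn exposes the support vectors of a fitted SVM directly, and each one is a verbatim row of the training data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = SVC(kernel="linear").fit(X, y)

# Every support vector is an exact training record, released in the clear
# as part of the model itself.
for sv in model.support_vectors_:
    assert any(np.allclose(sv, row) for row in X)
print(len(model.support_vectors_), "training points appear verbatim in the model")
```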

Contributions
1. New algorithms with optimal excess risk, assuming only that:
   - the loss function ℓ is Lipschitz, and
   - the parameter set C is bounded.
   (Separate set of algorithms for strongly convex loss.)
2. Matching lower bounds.
The best previous work [Chaudhuri et al.'11, Kifer et al.'12] additionally assumes ℓ is smooth (bounded second derivative); this work improves on their bounds.
- Non-smooth losses are common: SVM, median, …
- Applying their technique to a general loss requires first smoothing it, which introduces extra error (see the sketch below).
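To make the smoothing point concrete, here is one standard way (our sketch, not necessarily the construction used in the prior work) to smooth the hinge loss; the parameter beta trades smoothness against approximation error.

```python
def smoothed_hinge(z, beta=0.1):
    """Huber-style smoothing of hinge(z) = max(0, 1 - z).

    The smoothed loss has second derivative at most 1/beta, but it deviates
    from the true hinge by up to beta/2, so smoothing itself adds error.
    """
    if z >= 1.0:
        return 0.0
    if z <= 1.0 - beta:
        return 1.0 - z - beta / 2.0
    return (1.0 - z) ** 2 / (2.0 * beta)
```

Shrinking beta reduces the approximation error but blows up the smoothness constant, and the guarantees of the smooth-loss algorithms degrade as that constant grows.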

Results – upper bounds (dataset size n, C ⊆ R^p; ℓ is 1-Lipschitz on a parameter set C of diameter 1)
- ε-DP, Lipschitz: excess risk Õ(p / (nε)); technique: exponential sampling (inspired by [McSherry-Talwar'07]).
- (ε, δ)-DP, Lipschitz: excess risk Õ(√(p log(1/δ)) / (nε)); technique: noisy stochastic gradient descent (rigorous analysis of, and improvements to, [McSherry-Williams'10], [Jain-Kothari-Thakurta'12], and [Chaudhuri-Sarwate-Song'13]).
- ε-DP, Λ-strongly convex: excess risk Õ(p² / (Λ n² ε²)); technique: localization (new technique).
- (ε, δ)-DP, Λ-strongly convex: excess risk Õ(p log(1/δ) / (Λ n² ε²)); technique: noisy stochastic gradient descent (or localization).


Results – lower bounds (matching the upper bounds up to logarithmic factors)
- ε-DP, Lipschitz: the hard loss is linear; reduction from ε-DP release of 1-way marginals [HT'10].
- (ε, δ)-DP, Lipschitz: the hard loss is linear; reduction from (ε, δ)-DP release of 1-way marginals [BUV'13].
- Strongly convex (both privacy regimes): the hard loss is quadratic; same reductions.

Optimal Noisy Stochastic Gradient Descent Algorithm

Noisy stochastic gradient descent algorithm
Inputs: dataset D of size n, 1-Lipschitz loss ℓ, convex set C, privacy parameters ε, δ.
- Choose an arbitrary starting point θ_1 ∈ C.
- At iteration t: draw a fresh, uniformly random data sample d from D and take a noisy projected gradient step
      θ_{t+1} = Π_C( θ_t − η_t (∇ℓ(θ_t; d) + z_t) ),
  where z_t is Gaussian noise calibrated to the Lipschitz constant and the privacy parameters, η_t is the learning rate, and Π_C is Euclidean projection onto C.
- Repeat for T = n² iterations, then output the final iterate.
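A minimal Python sketch of the above (ours; the noise scale sigma and the step-size schedule are placeholders rather than the exact calibration from the paper):

```python
import numpy as np

def project_l2_ball(theta, radius=1.0):
    """Euclidean projection onto C = {theta : ||theta||_2 <= radius}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def noisy_sgd(data, grad_loss, dim, sigma, radius=1.0, seed=0):
    """Noisy projected SGD: one random example plus Gaussian noise per step."""
    rng = np.random.default_rng(seed)
    n = len(data)
    theta = np.zeros(dim)                      # arbitrary starting point in C
    for t in range(1, n * n + 1):              # T = n^2 iterations
        d = data[rng.integers(n)]              # fresh random data sample
        noise = rng.normal(0.0, sigma, size=dim)
        eta = radius / np.sqrt(t)              # placeholder learning-rate schedule
        theta = project_l2_ball(theta - eta * (grad_loss(theta, d) + noise), radius)
    return theta

# Example: (sub)gradient of the hinge loss for a record d = (x, y)
def grad_hinge(theta, d):
    x, y = d
    return -y * x if y * np.dot(theta, x) < 1.0 else np.zeros_like(theta)
```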

Privacy of the noisy SGD
The noisy SGD algorithm is (ε, δ)-differentially private.
- Each iteration touches a single data point and adds Gaussian noise, so a single step is differentially private on its own.
- After T iterations, by strong composition [DRV'10], the privacy parameter degrades by roughly a factor of √(T log(1/δ)).
- Sampling amplifies privacy [KLNRS'08]: running a private step on one uniformly random record out of n shrinks its privacy parameter by roughly a factor of n.
Key point: amplification by sampling offsets the degradation from composition, so even T = n² noisy steps fit within the overall (ε, δ) budget.
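A back-of-the-envelope version of that accounting (our sketch; constants are suppressed and the per-step privacy parameter ε₀ is an assumed given):

```latex
\varepsilon_{\mathrm{step}} \approx \frac{\varepsilon_0}{n}
  \quad\text{(amplification by sampling one record out of $n$)}, \qquad
\varepsilon_{\mathrm{total}} \approx \varepsilon_{\mathrm{step}} \sqrt{T \log(1/\delta)}
  \quad\text{(strong composition over $T$ steps)}.
```

With T = n² iterations, ε_total ≈ (ε₀/n) · n · √(log(1/δ)) = ε₀ √(log(1/δ)), which no longer grows with n; a constant per-step noise scale therefore suffices to meet the overall budget.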

Optimal Exponential Sampling Algorithm

Exponential sampling algorithm
Define a probability distribution over C with density p(θ) ∝ exp(−(ε/(2Δ)) · L(θ; D)), where Δ bounds the sensitivity of the empirical risk to changing one record, and output a sample from it.
This is an instance of the exponential mechanism [McSherry-Talwar'07].
- Efficient construction based on rapidly mixing MCMC; uses [Applegate-Kannan'91] as a subroutine.
- Provides a purely multiplicative convergence guarantee, which does not follow directly from existing results.
- Tight utility analysis via a "peeling" argument that exploits the structure of convex functions: the level sets A_1 ⊇ A_2 ⊇ … are decreasing in volume.
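To illustrate the exponential-mechanism step (our sketch, not the MCMC construction from the talk), here is a discretized version that samples a candidate from a finite grid over C with probability proportional to exp(−ε·L(θ)/(2Δ)); the grid and the sensitivity bound delta_sens are assumptions of the sketch, and the actual algorithm samples from the continuous distribution over C.

```python
import numpy as np

def exponential_mechanism(candidates, risks, epsilon, delta_sens, seed=None):
    """Sample one candidate with probability proportional to exp(-eps * risk / (2 * sensitivity)).

    `candidates` is a finite set discretizing C; `risks[i]` is the empirical
    risk of candidates[i]; `delta_sens` bounds how much any risks[i] can
    change when one record of the dataset changes.
    """
    rng = np.random.default_rng(seed)
    scores = -epsilon * np.asarray(risks) / (2.0 * delta_sens)
    scores -= scores.max()                     # for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    idx = rng.choice(len(candidates), p=probs)
    return candidates[idx]

# Toy usage: 1-D median on a grid over C = [0, 1]; sensitivity of the
# averaged loss (1/n) * sum |theta - d_i| is 1/n.
data = np.array([0.1, 0.2, 0.25, 0.7])
grid = np.linspace(0.0, 1.0, 101)
risks = [np.mean(np.abs(theta - data)) for theta in grid]
print(exponential_mechanism(grid, risks, epsilon=0.5, delta_sens=1.0 / len(data)))
```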

Summary
1. New algorithms with optimal excess risk, assuming only that the loss function is Lipschitz and the parameter set C is bounded. (Separate set of algorithms for strongly convex loss.)
2. Matching lower bounds.
Not in this talk:
- New localization technique: an optimal algorithm for strongly convex loss.
- Generalization error guarantees (not known to be tight in general).