Generalization bounds for uniformly stable algorithms

Presentation transcript:

Generalization bounds for uniformly stable algorithms
Vitaly Feldman (Google Brain), with Jan Vondrák

Uniform stability
Domain $Z$ (e.g. $X \times Y$)
Dataset $S = (z_1, \dots, z_n) \in Z^n$
Learning algorithm $A : Z^n \to W$
Loss function $\ell : W \times Z \to \mathbb{R}^+$
Uniform stability [Bousquet, Elisseeff 02]: $A$ has uniform stability $\gamma$ w.r.t. $\ell$ if for all neighboring $S, S'$ and all $z \in Z$, $|\ell(A(S), z) - \ell(A(S'), z)| \le \gamma$.
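As an illustration (not from the talk): a minimal Python sketch of the definition that brute-forces the uniform-stability parameter of a toy learner, a hypothetical ridge-regularized mean with a clipped absolute loss, taking the sup over neighboring datasets and test points on a small discrete domain.

```python
import itertools
import numpy as np

def learn(S, lam=1.0):
    """Toy learner A(S): minimizer of (1/n) * sum_i (w - z_i)^2 + lam * w^2."""
    return np.mean(S) / (1.0 + lam)

def loss(w, z):
    """Loss ell(w, z) with range [0, 1]."""
    return min(abs(w - z), 1.0)

def estimate_uniform_stability(Z, n, lam=1.0):
    """sup over neighboring S, S' (differing in one point) and test points z
    of |ell(A(S), z) - ell(A(S'), z)|, by brute force over a finite domain Z."""
    gamma = 0.0
    for S in itertools.product(Z, repeat=n):
        for i in range(n):
            for z_new in Z:                       # S' = S with position i replaced
                S_prime = list(S)
                S_prime[i] = z_new
                w, w_prime = learn(S, lam), learn(S_prime, lam)
                for z in Z:                       # sup over test points
                    gamma = max(gamma, abs(loss(w, z) - loss(w_prime, z)))
    return gamma

print(estimate_uniform_stability(Z=[0.0, 0.5, 1.0], n=4))   # ~ 1 / ((1 + lam) * n) = 0.125
```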

Generalization error
Probability distribution $P$ over $Z$, $S \sim P^n$
Population loss: $\mathbf{E}_P[\ell(w)] = \mathbf{E}_{z \sim P}[\ell(w, z)]$
Empirical loss: $\mathcal{E}_S[\ell(w)] = \frac{1}{n}\sum_{i=1}^{n} \ell(w, z_i)$
Generalization error/gap: $\Delta_S(\ell(A)) = \mathbf{E}_P[\ell(A(S))] - \mathcal{E}_S[\ell(A(S))]$

Stochastic convex optimization
$W = \mathbb{B}_2^d(1) \doteq \{w : \|w\|_2 \le 1\}$
For all $z \in Z$, $\ell(w, z)$ is convex and 1-Lipschitz in $w$
Minimize $\mathbf{E}_P[\ell(w)] \doteq \mathbf{E}_{z \sim P}[\ell(w, z)]$ over $W$, where $L^* = \min_{w \in W} \mathbf{E}_P[\ell(w)]$
For all $P$, with $A$ being SGD with rate $\eta = 1/\sqrt{n}$: $\mathbf{Pr}_{S \sim P^n}\left[\mathbf{E}_P[\ell(A(S))] \ge L^* + O\left(\sqrt{\frac{\log(1/\delta)}{n}}\right)\right] \le \delta$
Uniform convergence error: $\gtrsim \sqrt{d/n}$; ERM might have generalization error $\gtrsim \sqrt{d/n}$ [Shalev-Shwartz, Shamir, Srebro, Sridharan '09; Feldman '16]
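A sketch, under illustrative assumptions of my own (the loss $\ell(w,(x,y)) = |\langle w, x\rangle - y|$ and the synthetic data are not the constructions behind the cited lower bounds), of the SGD baseline referenced here: one pass of projected SGD over the unit $\ell_2$ ball with step size $\eta = 1/\sqrt{n}$.

```python
import numpy as np

def project_ball(w):
    """Project onto the unit l2 ball W = {w : ||w||_2 <= 1}."""
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm

def sgd(X, y, eta):
    """One pass of projected SGD on ell(w, (x, y)) = |<w, x> - y|, which is
    convex and 1-Lipschitz in w whenever ||x||_2 <= 1; returns the average iterate."""
    n, d = X.shape
    w, avg = np.zeros(d), np.zeros(d)
    for x, yi in zip(X, y):
        g = np.sign(np.dot(w, x) - yi) * x        # subgradient w.r.t. w
        w = project_ball(w - eta * g)
        avg += w / n
    return avg

rng = np.random.default_rng(0)
d, n = 20, 2000
w_star = project_ball(rng.normal(size=d))                        # planted parameter
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # enforce ||x||_2 <= 1
y = X @ w_star + 0.1 * rng.normal(size=n)

w_hat = sgd(X, y, eta=1.0 / np.sqrt(n))
X_te = rng.normal(size=(20000, d))
X_te /= np.maximum(np.linalg.norm(X_te, axis=1, keepdims=True), 1.0)
y_te = X_te @ w_star + 0.1 * rng.normal(size=20000)
print("population loss of SGD output:", np.mean(np.abs(X_te @ w_hat - y_te)))
```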

Stable optimization
Strongly convex ERM [BE 02, SSSS 09]: $A_\lambda(S) = \operatorname{argmin}_{w \in W} \mathcal{E}_S[\ell(w)] + \frac{\lambda}{2}\|w\|_2^2$
$A_\lambda$ is $\frac{1}{\lambda n}$-uniformly stable and minimizes $\mathcal{E}_S[\ell(w)]$ within $\frac{\lambda}{2}$
Gradient descent on smooth losses [Hardt, Recht, Singer 16]: $A_T(S)$ runs $T$ steps of GD on $\mathcal{E}_S[\ell(w)]$ with $\eta = \frac{1}{\sqrt{T}}$
$A_T$ is $\frac{\sqrt{T}}{n}$-uniformly stable and minimizes $\mathcal{E}_S[\ell(w)]$ within $\frac{2}{\sqrt{T}}$
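A sketch of the two constructions under simplifying assumptions of my own: ridge regression stands in for the $\lambda$-strongly-convex ERM because its minimizer has a closed form (squared loss is Lipschitz only on a bounded domain), and `gd_T` runs $T$ full-batch gradient steps with $\eta = 1/\sqrt{T}$ on the smooth empirical squared loss. The $\frac{1}{\lambda n}$ and $\frac{\sqrt{T}}{n}$ stability claims are the slide's statements, not something this code proves; it only shows how far the output moves when one example is replaced.

```python
import numpy as np

def ridge_erm(X, y, lam):
    """Closed-form minimizer of (1/(2n)) * ||X w - y||^2 + (lam/2) * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def gd_T(X, y, T):
    """T full-batch gradient steps with eta = 1/sqrt(T) on (1/(2n)) * ||X w - y||^2."""
    n, d = X.shape
    w, eta = np.zeros(d), 1.0 / np.sqrt(T)
    for _ in range(T):
        w -= eta * X.T @ (X @ w - y) / n
    return w

rng = np.random.default_rng(1)
n, d, lam, T = 1000, 10, 0.1, 100
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # ||x_i||_2 <= 1
y = np.clip(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n), -1, 1)

X2, y2 = X.copy(), y.copy()
X2[0], y2[0] = -X2[0], -y2[0]                 # neighboring dataset: one example replaced

print("ridge: ||A(S) - A(S')|| =", np.linalg.norm(ridge_erm(X, y, lam) - ridge_erm(X2, y2, lam)),
      " (compare 1/(lam*n) =", 1 / (lam * n), ")")
print("GD:    ||A(S) - A(S')|| =", np.linalg.norm(gd_T(X, y, T) - gd_T(X2, y2, T)),
      " (compare sqrt(T)/n =", np.sqrt(T) / n, ")")
```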

Generalization bounds
For $A$ and $\ell$ with range $[0,1]$ and uniform stability $\gamma \in \left[\frac{1}{n}, 1\right]$:
$\mathbf{E}_{S \sim P^n}\left[\Delta_S(\ell(A))\right] \le \gamma$  [Rogers, Wagner 78]
$\mathbf{Pr}_{S \sim P^n}\left[\Delta_S(\ell(A)) \ge \gamma\sqrt{n \log(1/\delta)}\right] \le \delta$  [Bousquet, Elisseeff 02] (vacuous when $\gamma \ge 1/\sqrt{n}$)
NEW: $\mathbf{Pr}_{S \sim P^n}\left[\Delta_S(\ell(A)) \ge \sqrt{\gamma \log(1/\delta)}\right] \le \delta$
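To make the comparison concrete (my addition, with all constants set to 1, which oversimplifies the actual statements):

```python
import numpy as np

def expectation_bound(gamma):            # [Rogers, Wagner 78], bound in expectation
    return gamma

def be02_bound(gamma, n, delta):         # [Bousquet, Elisseeff 02], high probability
    return gamma * np.sqrt(n * np.log(1 / delta))

def new_bound(gamma, delta):             # this work, high probability
    return np.sqrt(gamma * np.log(1 / delta))

n, delta = 10**4, 0.01
for gamma in [1 / n, 1 / np.sqrt(n), 10 / np.sqrt(n)]:
    print(f"gamma={gamma:.4f}:  E-bound={expectation_bound(gamma):.4f}  "
          f"BE02={be02_bound(gamma, n, delta):.3f}  new={new_bound(gamma, delta):.3f}")
```

At $\gamma = 1/\sqrt{n}$ the [BE 02] bound already exceeds 1, hence is vacuous for a loss with range $[0,1]$, while the new bound is still about $0.2$ here.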

Comparison
[Figure: plot of the [BE 02] bound and this work's bound on the generalization error as a function of the stability parameter $\gamma$, for $n = 100$.]

Second moment
Previously [Devroye, Wagner 79; BE 02]: $\mathbf{E}_{S \sim P^n}\left[\Delta_S(\ell(A))^2\right] \le \gamma + \frac{1}{n}$
NEW (tight): $\mathbf{E}_{S \sim P^n}\left[\Delta_S(\ell(A))^2\right] \le \gamma^2 + \frac{1}{n}$
Chebyshev: $\mathbf{Pr}_{S \sim P^n}\left[\Delta_S(\ell(A)) \ge \frac{\gamma + 1/\sqrt{n}}{\sqrt{\delta}}\right] \le \delta$
[Figure: the resulting [BE 02] and this-work bounds on the generalization error as a function of $\gamma$.]
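Again purely as a numeric illustration (constants dropped), the Chebyshev tails implied by the old and new second-moment bounds:

```python
import numpy as np

def chebyshev_old(gamma, n, delta):
    """Tail from the earlier bound E[Delta^2] <= gamma + 1/n."""
    return np.sqrt((gamma + 1 / n) / delta)

def chebyshev_new(gamma, n, delta):
    """Tail from the new bound E[Delta^2] <= gamma^2 + 1/n (using sqrt(a^2 + b^2) <= a + b)."""
    return (gamma + 1 / np.sqrt(n)) / np.sqrt(delta)

n, delta = 10**4, 0.05
for gamma in [1 / n, 1 / np.sqrt(n), 0.05]:
    print(f"gamma={gamma:.4f}:  old={chebyshev_old(gamma, n, delta):.3f}  "
          f"new={chebyshev_new(gamma, n, delta):.3f}")
```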

Implications
Differentially private prediction [Dwork, Feldman 18]: $Z = X \times Y$. A randomized algorithm $A(S, x)$ is an $\epsilon$-DP prediction algorithm if for all neighboring $S, S'$ and all $x \in X$, $D_\infty\left(A(S, x)\,\|\,A(S', x)\right) \le \epsilon$.
For any loss $L : Y \times Y \to [0,1]$, $\mathbf{E}_A[L(A(S, x), y)]$ has uniform stability $e^\epsilon - 1 \approx \epsilon$, so the new results give stronger generalization bounds for DP prediction algorithms.
Excess-loss guarantees improve accordingly, from the form $\mathbf{Pr}_{S \sim P^n}\left[\mathbf{E}_P[\ell(A(S))] \ge L^* + O\left(\frac{1}{\delta^{1/4}\sqrt{n}}\right)\right] \le \delta$ to $\mathbf{Pr}_{S \sim P^n}\left[\mathbf{E}_P[\ell(A(S))] \ge L^* + O\left(\left(\frac{\log(1/\delta)}{n}\right)^{1/3}\right)\right] \le \delta$.
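For concreteness (my addition, constants dropped): the two excess-loss terms above at a few values of $n$ and $\delta$. The gain of the new bound is its logarithmic dependence on $1/\delta$, so it dominates for small $\delta$; for moderate $\delta$ and large $n$ the $1/\sqrt{n}$ term of the older bound can still be smaller.

```python
import numpy as np

def old_excess(n, delta):                 # O(1 / (delta^{1/4} * sqrt(n)))
    return 1.0 / (delta ** 0.25 * np.sqrt(n))

def new_excess(n, delta):                 # O((log(1/delta) / n)^{1/3})
    return (np.log(1 / delta) / n) ** (1 / 3)

for n in [10**3, 10**5]:
    for delta in [0.1, 1e-6]:
        print(f"n={n:>6}  delta={delta:g}:  old={old_excess(n, delta):.3f}  "
              f"new={new_excess(n, delta):.3f}")
```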

Data-dependent functions
Consider $M : Z^n \times Z \to \mathbb{R}$, e.g. $M(S, z) \equiv \ell(A(S), z)$
$M$ has uniform stability $\gamma$ if for all neighboring $S, S'$: $\|M(S, \cdot) - M(S', \cdot)\|_\infty \le \gamma$, i.e. $|M(S, z) - M(S', z)| \le \gamma$ for all $z \in Z$
Generalization error/gap: $\Delta_S(M) = \mathbf{E}_P[M(S)] - \mathcal{E}_S[M(S)]$, where $\mathbf{E}_P[M(S)] = \mathbf{E}_{z \sim P}[M(S, z)]$ and $\mathcal{E}_S[M(S)] = \frac{1}{n}\sum_{i=1}^{n} M(S, z_i)$

Generalization in expectation
Goal: $\mathbf{E}_{S \sim P^n}\left[\mathbf{E}_P[M(S)] - \mathcal{E}_S[M(S)]\right] \le \gamma$
$\mathbf{E}_{S \sim P^n}\left[\mathcal{E}_S[M(S)]\right] = \mathbf{E}_{S \sim P^n,\, i \sim [n]}\left[M(S, z_i)\right]$
For all $i$ and $z \in Z$: $M(S, z_i) \le M(S^{i \leftarrow z}, z_i) + \gamma$, hence $M(S, z_i) \le \mathbf{E}_{z \sim P}\left[M(S^{i \leftarrow z}, z_i)\right] + \gamma$
Therefore $\mathbf{E}_{S \sim P^n,\, i \sim [n]}\left[M(S, z_i)\right] \le \mathbf{E}_{S \sim P^n,\, z \sim P,\, i \sim [n]}\left[M(S^{i \leftarrow z}, z_i)\right] + \gamma = \mathbf{E}_{S \sim P^n,\, z \sim P}\left[M(S, z)\right] + \gamma = \mathbf{E}_{S \sim P^n}\left[\mathbf{E}_P[M(S)]\right] + \gamma$,
using that $(S^{i \leftarrow z}, z_i)$ is distributed as $P^{n+1}$, i.e. exactly as $(S, z)$. The symmetric argument, with the roles of $M(S, z_i)$ and $M(S^{i \leftarrow z}, z_i)$ exchanged, gives the stated goal.
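A quick Monte Carlo check (mine) of the in-expectation bound on a self-contained toy instance: $M(S, z) = \ell(A(S), z)$ with $A(S) = \operatorname{mean}(S)/(1 + \lambda)$, $\ell(w, z) = |w - z|$ and $P$ uniform on $[0, 1]$, for which $A$ has uniform stability $1/((1+\lambda)n)$ and the population loss has a closed form.

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam, trials = 50, 1.0, 200_000
gamma = 1.0 / ((1 + lam) * n)                   # uniform stability of the toy learner

S = rng.uniform(size=(trials, n))               # independent samples S ~ P^n
w = S.mean(axis=1, keepdims=True) / (1 + lam)   # A(S)
emp = np.abs(w - S).mean(axis=1)                # empirical loss  E_S[M(S)]
pop = (w**2 + (1 - w) ** 2).ravel() / 2         # exact E_{z~U[0,1]} |w - z|
print("estimated E[Delta_S(M)] =", (pop - emp).mean(), "   gamma =", gamma)
```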

Concentration via McDiarmid
For all neighboring $S, S'$: $\left|\Delta_S(M) - \Delta_{S'}(M)\right| \le 2\gamma + \frac{1}{n}$, since
1. $\left|\mathbf{E}_{z \sim P}[M(S, z)] - \mathbf{E}_{z \sim P}[M(S', z)]\right| \le \gamma$
2. $\left|\mathcal{E}_S[M(S)] - \mathcal{E}_{S'}[M(S')]\right| \le \left|\mathcal{E}_S[M(S)] - \mathcal{E}_S[M(S')]\right| + \left|\mathcal{E}_S[M(S')] - \mathcal{E}_{S'}[M(S')]\right| \le \gamma + \frac{1}{n}$
McDiarmid: $\mathbf{Pr}_{S \sim P^n}\left[\Delta_S(M) \ge \mu + \left(2\gamma + \frac{1}{n}\right)\sqrt{n \log(1/\delta)}\right] \le \delta$, where $\mu = \mathbf{E}_{S \sim P^n}\left[\Delta_S(M)\right] \le \gamma$

Proof technique
Based on [Nissim, Stemmer 15; BNSSSU 16].
Fact: for any distribution $Q$ over $\mathbb{R}$, $\mathbf{Pr}_{v \sim Q}\left[v \ge 2\,\mathbf{E}_{v_1, \dots, v_m \sim Q}\left[\max\{0, v_1, \dots, v_m\}\right]\right] \le \frac{\ln 2}{m}$.
Let $S_1, \dots, S_m \sim P^n$ for $m = \frac{\ln 2}{\delta}$. By the fact, it suffices to bound $\mathbf{E}_{S_1, \dots, S_m \sim P^n}\left[\max_{j \in [m]} \Delta_{S_j}(M)\right]$.
Is $\mathbf{E}_{S_1, \dots, S_m \sim P^n,\ \ell = \operatorname{argmax}_j\{\Delta_{S_j}(M)\}}\left[\mathcal{E}_{S_\ell}[M(S_\ell)]\right] \approx \mathbf{E}_{S_1, \dots, S_m \sim P^n,\ \ell = \operatorname{argmax}_j\{\Delta_{S_j}(M)\}}\left[\mathbf{E}_P[M(S_\ell)]\right]$? No: the argmax selection is UNSTABLE!
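The fact above is what converts an expected-max bound over $m = \frac{\ln 2}{\delta}$ independent copies into a high-probability bound for a single copy. A quick Monte Carlo spot check (mine, for two arbitrary choices of $Q$; not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 50, 100_000

for name, draw in [("exponential", rng.exponential), ("standard normal", rng.standard_normal)]:
    samples = draw(size=(trials, m))
    threshold = 2 * np.maximum(samples, 0).max(axis=1).mean()   # 2 * E[max(0, v_1, ..., v_m)]
    tail = (draw(size=trials) >= threshold).mean()              # Pr_{v ~ Q}[v >= threshold]
    print(f"{name}: tail = {tail:.5f}  <=  ln(2)/m = {np.log(2) / m:.5f}")
```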

Stable max
Exponential mechanism [McSherry, Talwar 07] $\mathrm{EM}_\alpha$: sample $j \propto e^{\alpha \Delta_{S_j}(M)}$
Stable: $2\alpha\left(2\gamma + \frac{1}{n}\right)$-differentially private, hence
$\mathbf{E}_{S_1, \dots, S_m \sim P^n,\ \ell = \mathrm{EM}_\alpha}\left[\mathcal{E}_{S_\ell}[M(S_\ell)]\right] \le \mathbf{E}_{S_1, \dots, S_m \sim P^n,\ \ell = \mathrm{EM}_\alpha}\left[\mathbf{E}_P[M(S_\ell)]\right] + \left(e^{2\alpha\left(2\gamma + \frac{1}{n}\right)} - 1\right) + \gamma$
Approximates the max: $\mathbf{E}_{S_1, \dots, S_m \sim P^n,\ \ell = \mathrm{EM}_\alpha}\left[\Delta_{S_\ell}(M)\right] \ge \mathbf{E}_{S_1, \dots, S_m \sim P^n}\left[\max_{j \in [m]} \Delta_{S_j}(M)\right] - \frac{\ln m}{\alpha}$
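A minimal sketch (my addition) of the selection step: sampling an index $j$ with probability proportional to $e^{\alpha \Delta_{S_j}(M)}$ via a numerically stable softmax; the gap values are made-up numbers.

```python
import numpy as np

def exponential_mechanism(scores, alpha, rng):
    """Sample index j with probability proportional to exp(alpha * scores[j])."""
    logits = alpha * np.asarray(scores, dtype=float)
    logits -= logits.max()                 # shift for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return rng.choice(len(scores), p=p)

rng = np.random.default_rng(0)
deltas = [0.05, 0.20, 0.19, 0.01]          # hypothetical values of Delta_{S_j}(M)
picks = [exponential_mechanism(deltas, alpha=50.0, rng=rng) for _ in range(10_000)]
print(np.bincount(picks, minlength=len(deltas)) / len(picks))   # mass concentrates on near-maximal j
```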

Game over
$\mathbf{E}_{S_1, \dots, S_m \sim P^n}\left[\max_{j \in [m]} \Delta_{S_j}(M)\right] \le \frac{\ln m}{\alpha} + \left(e^{2\alpha\left(2\gamma + \frac{1}{n}\right)} - 1\right) + \gamma$
Pick $\alpha = \sqrt{\frac{\ln m}{\gamma + 1/n}}$ to get $O\left(\sqrt{\left(\gamma + \frac{1}{n}\right)\ln\frac{1}{\delta}}\right)$.
Recall: for any distribution $Q$ over $\mathbb{R}$, $\mathbf{Pr}_{v \sim Q}\left[v \ge 2\,\mathbf{E}_{v_1, \dots, v_m \sim Q}\left[\max\{0, v_1, \dots, v_m\}\right]\right] \le \frac{\ln 2}{m}$, so with $m = \frac{\ln 2}{\delta}$ this expected-max bound yields the high-probability bound.

Conclusions
Better understanding of uniform stability
New technique
Open: gap between upper and lower bounds; high-probability generalization without strong uniformity