Semi-Stochastic Gradient Descent Methods Jakub Konečný University of Edinburgh Lehigh University December 3, 2014

Based on
- Basic method: S2GD
  - Konečný and Richtárik. Semi-Stochastic Gradient Descent Methods, December 2013
- Mini-batching (& proximal setting): mS2GD
  - Konečný, Liu, Richtárik and Takáč. mS2GD: Minibatch Semi-Stochastic Gradient Descent in the Proximal Setting, October 2014
- Coordinate descent variant: S2CD
  - Konečný, Qu and Richtárik. S2CD: Semi-Stochastic Coordinate Descent, October 2014

Introduction

Large scale problem setting
- Problems are often structured: minimize f(x) = (1/n) Σ_{i=1}^{n} f_i(x), where the number of functions n is BIG
- Such problems arise frequently in machine learning

Examples
- Linear regression (least squares): f_i(x) = ½ (a_i^T x − b_i)²
- Logistic regression (classification): f_i(x) = log(1 + exp(−b_i a_i^T x)), with labels b_i ∈ {−1, +1}
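As a concrete illustration (not from the slides), here is a minimal Python sketch of these two component losses and their gradients; the function names and the per-example arguments a_i, b_i are assumptions of the sketch:

```python
import numpy as np

def least_squares_loss_grad(x, a_i, b_i):
    """Component f_i(x) = 0.5 * (a_i^T x - b_i)^2 and its gradient."""
    r = a_i @ x - b_i
    return 0.5 * r ** 2, r * a_i

def logistic_loss_grad(x, a_i, b_i):
    """Component f_i(x) = log(1 + exp(-b_i * a_i^T x)) and its gradient, b_i in {-1, +1}."""
    margin = b_i * (a_i @ x)
    return np.log1p(np.exp(-margin)), -b_i * a_i / (1.0 + np.exp(margin))
```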

Assumptions
- Lipschitz continuity of the derivative (gradient) of each f_i, with constant L
- Strong convexity of f, with parameter μ

Gradient Descent (GD)
- Update rule: x_{k+1} = x_k − h ∇f(x_k)
- Fast (linear) convergence rate; alternatively, for accuracy ε we need O(κ log(1/ε)) iterations, where κ = L/μ
- Complexity of a single iteration: n (measured in component gradient evaluations)

Stochastic Gradient Descent (SGD)
- Update rule: x_{k+1} = x_k − h_k ∇f_i(x_k), with i picked uniformly at random and h_k a step-size parameter
- Why it works: the stochastic gradient is unbiased, E[∇f_i(x)] = ∇f(x)
- Slow (sublinear) convergence
- Complexity of a single iteration: 1 (measured in component gradient evaluations)

Goal
- GD: fast convergence, but n gradient evaluations in each iteration
- SGD: slow convergence, but the complexity of each iteration is independent of n
- Goal: combine the strengths of both in a single algorithm
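For reference, a minimal sketch of the two update rules being compared; grad_f (full gradient), grad_fi (single component gradient), the stepsize h and the NumPy generator rng are assumed to be supplied by the caller:

```python
def gd_step(x, grad_f, h):
    # One GD iteration: costs n component-gradient evaluations.
    return x - h * grad_f(x)

def sgd_step(x, grad_fi, n, h, rng):
    # One SGD iteration: costs a single component-gradient evaluation.
    # Unbiased, since E[grad_fi(x, i)] = grad_f(x) for i uniform in {0, ..., n-1}.
    i = rng.integers(n)
    return x - h * grad_fi(x, i)

# Example usage: rng = numpy.random.default_rng(0); x = sgd_step(x, grad_fi, n, h, rng)
```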

Semi-Stochastic Gradient Descent S2GD

Intuition
- The gradient does not change drastically between nearby points
- We could therefore reuse the information from an “old” gradient

Modifying the “old” gradient
- Imagine someone gives us a “good” point y and the gradient ∇f(y)
- The gradient at a point x near y can be expressed as ∇f(x) = ∇f(y) + [∇f(x) − ∇f(y)], i.e. an already computed gradient plus a gradient change
- Approximation of the gradient: we can try to estimate the change using a single random component, ∇f(x) ≈ ∇f(y) + [∇f_i(x) − ∇f_i(y)]

The S2GD Algorithm
- Epochs: compute one full gradient, then run an inner loop of cheap stochastic steps that reuse it
- Simplification: the size of the inner loop is random, following a geometric rule
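The algorithm itself appeared as a figure on the slide; the following Python sketch is an illustrative reconstruction under the simplification above (random inner-loop length with geometrically decaying probabilities). The callables grad_f and grad_fi and the parameter names are assumptions of the sketch, not the released code:

```python
import numpy as np

def s2gd(x0, grad_f, grad_fi, n, h, n_epochs, m, nu, rng):
    """Illustrative sketch of S2GD.

    grad_f(x)     -- full gradient (costs n component evaluations)
    grad_fi(x, i) -- gradient of the i-th component
    h             -- stepsize, m -- maximal inner-loop length
    nu            -- lower bound on the strong convexity parameter (nu = 0 allowed)
    """
    x = x0.copy()
    ts = np.arange(1, m + 1)
    for _ in range(n_epochs):
        y = x.copy()
        g = grad_f(y)                                  # one full gradient per epoch
        # Random inner-loop length: P(t) proportional to (1 - nu*h)^(m - t).
        probs = (1.0 - nu * h) ** (m - ts)
        t = rng.choice(ts, p=probs / probs.sum())
        for _ in range(t):
            i = rng.integers(n)
            # Variance-reduced direction: old full gradient + change of one component.
            x = x - h * (g + grad_fi(x, i) - grad_fi(y, i))
    return x
```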

Theorem

Convergence rate
- How to set the parameters? The rate involves two terms:
  - One can be made arbitrarily small by decreasing the stepsize h
  - For any fixed stepsize, the other can be made arbitrarily small by increasing the inner-loop size m

Setting the parameters
- Fix a target accuracy ε
- The accuracy is achieved by setting the number of epochs, the stepsize and the number of inner iterations appropriately
- Total complexity (in gradient evaluations): (# of epochs) × (one full gradient evaluation, costing n, plus m cheap iterations)

Complexity
- S2GD complexity: O((n + κ) log(1/ε)) gradient evaluations
- GD complexity: O(κ log(1/ε)) iterations × complexity n of a single iteration = O(nκ log(1/ε)) in total
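As a rough illustration of the gap between the two bounds, here is a back-of-the-envelope comparison for example values of n, κ and ε (constants ignored; the numbers are illustrative only):

```python
import math

n, kappa, eps = 10**6, 10**4, 1e-6        # illustrative values only
log_term = math.log(1.0 / eps)

gd_work = n * kappa * log_term            # GD:   O(n * kappa * log(1/eps))
s2gd_work = (n + kappa) * log_term        # S2GD: O((n + kappa) * log(1/eps))

print(f"GD:   {gd_work:.2e} gradient evaluations")
print(f"S2GD: {s2gd_work:.2e} gradient evaluations")   # roughly 10^4 times fewer here
```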

Related Methods
- SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
  - Refreshes a single stochastic gradient in each iteration
  - Needs to store n gradients
  - Similar convergence rate
  - Cumbersome analysis
- SAGA (Aaron Defazio, Francis Bach, Simon Lacoste-Julien, 2014)
  - Refined analysis
- MISO – Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)
  - Similar to SAG, slightly worse performance
  - Elegant analysis

Related Methods
- SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)
  - Arises as a special case of S2GD
- Prox-SVRG (Lin Xiao, Tong Zhang, 2014)
  - Extends the method to the proximal setting
- EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)
  - Handles simple constraints
  - Worse convergence rate

Experiment (logistic regression on: ijcnn, rcv, real-sim, url)

Extensions

Extensions summary
- S2GD:
  - Efficient handling of sparse data
  - Pre-processing with SGD (S2GD+)
  - Non-strongly convex losses
  - High-probability result
- Mini-batching: mS2GD
  - Konečný, Liu, Richtárik and Takáč. mS2GD: Minibatch Semi-Stochastic Gradient Descent in the Proximal Setting, October 2014
- Coordinate descent variant: S2CD
  - Konečný, Qu and Richtárik. S2CD: Semi-Stochastic Coordinate Descent, October 2014

Sparse data
- For linear/logistic regression, the component gradient ∇f_i(x) copies the sparsity pattern of the example a_i (SPARSE)
- But the update direction ∇f(y) + ∇f_i(x) − ∇f_i(y) is fully DENSE, because of the full-gradient term
- Can we do something about it?

Sparse data
- Yes we can!
- To compute ∇f_i(x), we only need the coordinates of x corresponding to nonzero elements of the example a_i
- For each coordinate, remember when it was updated the last time
- Before computing ∇f_i(x) in inner iteration number t, bring the required coordinates up to date
- The catch-up step applies the “old gradient” ∇f(y) once for every iteration in which the coordinate was not updated
- Then compute the direction and make a single update

Sparse data implementation
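The implementation slide was shown as a figure; below is an illustrative sketch of the lazy-update idea for one inner loop, assuming the data matrix A is stored in SciPy CSR format and each component gradient shares the sparsity pattern of the corresponding row. All names are illustrative, not taken from the released code:

```python
import numpy as np

def s2gd_inner_loop_sparse(x, y, g, A, grad_fi, t_inner, h, rng):
    """Inner loop with lazy application of the dense term g = grad_f(y).

    A is a scipy.sparse.csr_matrix whose rows are the examples; x is updated
    in place and returned. last_update[j] records the last inner iteration in
    which coordinate j was brought fully up to date.
    """
    n, d = A.shape
    last_update = np.zeros(d, dtype=np.int64)
    for t in range(1, t_inner + 1):
        i = rng.integers(n)
        nz = A.indices[A.indptr[i]:A.indptr[i + 1]]   # nonzero coordinates of example i
        # Catch up: apply the "old gradient" for all skipped iterations of these coordinates.
        x[nz] -= h * (t - 1 - last_update[nz]) * g[nz]
        last_update[nz] = t
        # Sparse part of the update: change of a single component gradient...
        x[nz] -= h * (grad_fi(x, i) - grad_fi(y, i))[nz]
        # ...plus this iteration's dense step on the touched coordinates.
        x[nz] -= h * g[nz]
    # Flush the remaining lazy updates so every coordinate has seen t_inner dense steps.
    x -= h * (t_inner - last_update) * g
    return x
```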

S2GD+
- Observation: SGD can already make reasonable progress in the time S2GD spends computing its first full gradient (when starting from an arbitrary point)
- This motivates the following algorithm (S2GD+): pre-process with a pass of SGD, then switch to S2GD

S2GD+ Experiment

High Probability Result
- The convergence result above holds only in expectation
- Can we say anything about the concentration of the result in practice?
- Yes: for any failure probability δ, the guarantee holds with probability at least 1 − δ, paying just a logarithm of the probability, log(1/δ), independently of the other parameters

Code
- An efficient implementation for logistic regression is available at MLOSS

mS2GD (mini-batch S2GD)
- How does mini-batching influence the algorithm?
- Replace the single component gradient ∇f_i by an average over a random mini-batch of components
- Provides a two-fold speedup:
  - Provably fewer gradient evaluations are needed (up to a certain number of mini-batches)
  - Easy possibility of parallelism
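A minimal sketch of the change to the inner-loop direction; the helper name and batch_size parameter are illustrative assumptions:

```python
import numpy as np

def minibatch_direction(x, y, g, grad_fi, n, batch_size, rng):
    """Variance-reduced direction with a mini-batch instead of a single index i."""
    batch = rng.choice(n, size=batch_size, replace=False)
    correction = np.mean([grad_fi(x, i) - grad_fi(y, i) for i in batch], axis=0)
    return g + correction
```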

S2CD (Semi-Stochastic Coordinate Descent)
- SGD-type methods: sample rows (training examples) of the data matrix
- Coordinate descent-type methods: sample columns (features) of the data matrix
- Question: Can we do both? Sample both rows and columns

S2CD (Semi-Stochastic Coordinate Descent)
- Complexity: of the same form as for S2GD, with the condition number replaced by a coordinate-wise analogue

S2GD as a Learning Algorithm

Problem with “us” optimizers
- Optimizers care about optimization; statisticians care about statistics – each in isolation
- The practical need to control both statistical predictive power and the effort spent on optimization is not well understood
- Optimizers should be aware of…
- The following framework is mostly due to Bottou and Bousquet, 2007

Machine Learning Setting
- Space of input-output pairs (x, y) ∈ X × Y
- Unknown distribution P(x, y) – a relationship between inputs and outputs
- Loss function ℓ(ŷ, y) to measure the discrepancy between the predicted and the real output
- Define the Expected Risk: E(f) = ∫ ℓ(f(x), y) dP(x, y)

Machine Learning Setting
- Ideal goal: find f* such that f* = arg min_f E(f)
- But we cannot even evaluate the Expected Risk E(f), since the distribution P is unknown

Machine Learning Setting
- We at least have n i.i.d. samples (x_1, y_1), …, (x_n, y_n)
- Define the Empirical Risk: E_n(f) = (1/n) Σ_{i=1}^{n} ℓ(f(x_i), y_i)

Machine Learning Setting
- First learning principle – fix a family F of candidate prediction functions
- Find the Empirical Minimizer: f_n = arg min_{f ∈ F} E_n(f)

Machine Learning Setting
- Since the optimal f* is unlikely to belong to F, we also define the best function within the family: f*_F = arg min_{f ∈ F} E(f)

Machine Learning Setting
- Finding f_n by minimizing the Empirical Risk exactly is often computationally expensive
- Instead, run an optimization algorithm that returns an approximate minimizer f̃_n such that E_n(f̃_n) ≤ E_n(f_n) + ρ, where ρ is the optimization accuracy

Recapitulation
- f*: the ideal optimum
- f*_F: the “best” function from our family F
- f_n: the Empirical Minimizer
- f̃_n: the result of approximate optimization

Machine Learning Goal
- The big goal is to minimize the Excess Risk E(f̃_n) − E(f*), which decomposes into three terms (see the decomposition below):
  - Approximation error
  - Estimation error
  - Optimization error
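In the notation recapped above, this is the standard decomposition of Bottou and Bousquet; written out in full it reads:

```latex
\underbrace{\mathbb{E}\big[E(\tilde f_n) - E(f^*)\big]}_{\text{excess risk}}
  = \underbrace{\mathbb{E}\big[E(f^*_{\mathcal F}) - E(f^*)\big]}_{\text{approximation error}}
  + \underbrace{\mathbb{E}\big[E(f_n) - E(f^*_{\mathcal F})\big]}_{\text{estimation error}}
  + \underbrace{\mathbb{E}\big[E(\tilde f_n) - E(f_n)\big]}_{\text{optimization error}}
```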

Generic Machine Learning Problem
- All this leads to a complicated compromise between three variables:
  - The family of functions F
  - The number of examples n
  - The optimization accuracy ρ
- Subject to two constraints:
  - The maximal number of examples available
  - The maximal computational time available

Generic Machine Learning Problem
- Small scale learning problem: the first constraint (number of examples) is tight
  - The optimization error can be reduced to insignificant levels, recovering the classical approximation–estimation tradeoff (well studied)
- Large scale learning problem: the second constraint (computational time) is tight
  - A more complicated compromise

Solving the Large Scale ML Problem
- Several simplifications are needed
- We do not carefully balance the three terms; instead we only ensure that asymptotically they are of comparable size
- Consider a fixed family of functions, linearly parameterized by a vector – effectively setting the approximation error to be a constant
- This simplifies the problem to the Estimation–Optimization tradeoff

Estimation–Optimization tradeoff
- Using uniform convergence bounds, one can obtain a first bound on the estimation error
- Such bounds are often considered weak

Estimation–Optimization tradeoff
- Using Localized Bounds (Bousquet, PhD thesis, 2004) or Isomorphic Coordinate Projections (Bartlett and Mendelson, 2006), we get a sharper bound…
- …if we can establish a suitable variance condition
- This condition often holds, for example under strong convexity, or when making assumptions on the data distribution

Estimation–Optimization tradeoff
- Using the previous bounds yields a bound on the estimation-plus-optimization error involving an absolute constant
- We want to push this term below the target accuracy ε
- Choosing the number of examples n and the optimization accuracy ρ accordingly, we get the following table