Semi-Stochastic Gradient Descent Methods Jakub Konečný University of Edinburgh Lehigh University December 3, 2014.

1 Semi-Stochastic Gradient Descent Methods Jakub Konečný University of Edinburgh Lehigh University December 3, 2014

2 Based on
- Basic method: S2GD. Konečný and Richtárik. Semi-Stochastic Gradient Descent Methods, December 2013
- Mini-batching (and proximal setting): mS2GD. Konečný, Liu, Richtárik and Takáč. mS2GD: Minibatch semi-stochastic gradient descent in the proximal setting, October 2014
- Coordinate descent variant: S2CD. Konečný, Qu and Richtárik. S2CD: Semi-Stochastic Coordinate Descent, October 2014

3 Introduction

4 Large scale problem setting
- Problems are often structured
- Frequently arising in machine learning
- Structure: the objective is a sum of functions, and the number of functions n is BIG

5 Examples
- Linear regression (least squares)
- Logistic regression (classification)
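The formulas for the problem and the two examples are not reproduced in this transcript. In the usual notation they would read as follows; the symbols $x \in \mathbb{R}^d$ (parameters), $a_i$ (features) and $b_i$ (targets or labels) are assumptions made here, not taken from the slides.

```latex
\begin{align*}
  \min_{x \in \mathbb{R}^d} \; f(x) &:= \frac{1}{n} \sum_{i=1}^{n} f_i(x)
    && \text{(sum structure, $n$ large)} \\
  f_i(x) &= \tfrac{1}{2} \bigl(a_i^\top x - b_i\bigr)^2
    && \text{(least squares)} \\
  f_i(x) &= \log\bigl(1 + \exp(-b_i \, a_i^\top x)\bigr), \quad b_i \in \{-1, +1\}
    && \text{(logistic regression)}
\end{align*}
```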

6 Assumptions
- Lipschitz continuity of the derivative of each component function f_i
- Strong convexity of the objective f
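Written out, the two assumptions take the standard form below. The constant names $L$ and $\mu$, and the condition number $\kappa$, are the conventional ones and are assumed here rather than read off the slide.

```latex
\begin{align*}
  \|\nabla f_i(x) - \nabla f_i(y)\| &\le L \, \|x - y\|
    && \text{for all } x, y \text{ and all } i, \\
  f(y) &\ge f(x) + \langle \nabla f(x), \, y - x \rangle + \frac{\mu}{2} \|y - x\|^2
    && \text{for all } x, y, \\
  \kappa &:= L / \mu
    && \text{(condition number)}.
\end{align*}
```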

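7 Gradient Descent (GD)
- Update rule: x_{k+1} = x_k - h ∇f(x_k), with stepsize h
- Fast (linear) convergence rate
- Alternatively, for accuracy ε we need O(κ log(1/ε)) iterations, where κ is the condition number
- Complexity of a single iteration: n (measured in gradient evaluations)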
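A minimal sketch of the GD update on a sum-structured objective may help make the per-iteration cost concrete. The least-squares components, the helper name `make_ls_grad`, and the toy data are all hypothetical choices for illustration.

```python
def make_ls_grad(a, b):
    """Gradient of the hypothetical component f_i(x) = 0.5 * (a.x - b)^2."""
    def grad(x):
        r = sum(aj * xj for aj, xj in zip(a, x)) - b
        return [r * aj for aj in a]
    return grad


def gradient_descent(grads, x0, h, iters):
    """Minimize f(x) = (1/n) * sum_i f_i(x) with a constant stepsize h."""
    n = len(grads)
    x = list(x0)
    for _ in range(iters):
        # One iteration touches every f_i: n gradient evaluations.
        per_example = [g(x) for g in grads]
        full = [sum(col) / n for col in zip(*per_example)]
        x = [xj - h * gj for xj, gj in zip(x, full)]
    return x


# Toy consistent system (assumed data) whose exact solution is x* = (1, 2).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 3.0]
grads = [make_ls_grad(a_i, b_i) for a_i, b_i in zip(A, b)]
x_gd = gradient_descent(grads, [0.0, 0.0], h=0.5, iters=200)
```

Each of the 200 iterations here evaluates all three component gradients, which is the cost that becomes prohibitive when n is large.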

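8 Stochastic Gradient Descent (SGD)
- Update rule: x_{k+1} = x_k - h_k ∇f_i(x_k), where i is chosen uniformly at random and h_k is a step-size parameter
- Why it works: in expectation over i, the stochastic gradient ∇f_i(x) equals the full gradient ∇f(x)
- Slow (sublinear) convergence
- Complexity of a single iteration: 1 (measured in gradient evaluations)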
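The SGD update can be sketched in the same style. The components, helper names and data below are again hypothetical. The toy system is consistent (every component gradient vanishes at the solution), which is why a constant stepsize suffices in this example; in general SGD needs a decreasing stepsize schedule, so the sketch takes the schedule as a function.

```python
import random


def make_ls_grad(a, b):
    """Gradient of the hypothetical component f_i(x) = 0.5 * (a.x - b)^2."""
    def grad(x):
        r = sum(aj * xj for aj, xj in zip(a, x)) - b
        return [r * aj for aj in a]
    return grad


def sgd(grads, x0, stepsize, iters, seed=0):
    """Plain SGD: one uniformly sampled component gradient per iteration."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(iters):
        i = rng.randrange(len(grads))   # uniform random index
        g = grads[i](x)                 # a single gradient evaluation
        h = stepsize(k)                 # stepsize schedule h_k
        x = [xj - h * gj for xj, gj in zip(x, g)]
    return x


# Toy consistent system (assumed data) whose exact solution is x* = (1, 2).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 3.0]
grads = [make_ls_grad(a_i, b_i) for a_i, b_i in zip(A, b)]
x_sgd = sgd(grads, [0.0, 0.0], stepsize=lambda k: 0.5, iters=500)
```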

9 Goal
- GD: fast convergence; n gradient evaluations in each iteration
- SGD: slow convergence; complexity of an iteration independent of n
- Combine the two in a single algorithm

10 Semi-Stochastic Gradient Descent S2GD

11 Intuition
- The gradient does not change drastically
- We could reuse the information from an "old" gradient

12 Modifying "old" gradient
- Imagine someone gives us a "good" point and the gradient at that point
- The gradient at a nearby point can be expressed as: already computed gradient + gradient change
- The gradient change is the part we can try to estimate
- This gives an approximation of the gradient
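In symbols, the idea on this slide is usually written as below. The names $\tilde{x}$ for the "good" point and $i$ for a uniformly sampled index are assumptions made for this sketch.

```latex
\begin{align*}
  \nabla f(x)
    &= \underbrace{\nabla f(\tilde{x})}_{\text{already computed}}
     + \underbrace{\nabla f(x) - \nabla f(\tilde{x})}_{\text{gradient change}} \\
    &\approx \nabla f(\tilde{x}) + \nabla f_i(x) - \nabla f_i(\tilde{x}),
    \qquad i \text{ uniform on } \{1, \dots, n\}.
\end{align*}
```

The estimate is unbiased, since $\mathbb{E}_i[\nabla f_i(x) - \nabla f_i(\tilde{x})] = \nabla f(x) - \nabla f(\tilde{x})$, and it costs two component gradient evaluations.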

13 The S2GD Algorithm Simplification; size of the inner loop is random, following a geometric rule
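The algorithm itself was shown as an image. The following is a sketch in the spirit of the description above, not the authors' implementation: an outer loop computes a full gradient, and an inner loop of random length takes cheap variance-reduced steps. The weighting (1 - nu*h)^(m - t) for the inner loop length, the parameter names `h`, `m`, `nu`, and the least-squares toy problem are assumptions chosen for illustration.

```python
import random


def make_ls_grad(a, b):
    """Gradient of the hypothetical component f_i(x) = 0.5 * (a.x - b)^2."""
    def grad(x):
        r = sum(aj * xj for aj, xj in zip(a, x)) - b
        return [r * aj for aj in a]
    return grad


def s2gd(grads, x0, h, m, epochs, nu=0.0, seed=0):
    """Semi-stochastic gradient descent sketch (assumed parameterization)."""
    rng = random.Random(seed)
    n = len(grads)
    x = list(x0)
    # Geometric-style weights over inner loop lengths t = 1..m.
    # With nu = 0 the length is uniform on {1, ..., m}.
    weights = [(1.0 - nu * h) ** (m - t) for t in range(1, m + 1)]
    for _ in range(epochs):
        # Expensive part: one full gradient at the epoch's starting point.
        full = [sum(col) / n for col in zip(*(g(x) for g in grads))]
        t_j = rng.choices(range(1, m + 1), weights=weights)[0]
        y = list(x)
        for _ in range(t_j):
            # Cheap part: two component gradients per inner step.
            i = rng.randrange(n)
            g_y, g_x = grads[i](y), grads[i](x)
            y = [yk - h * (fk + gy - gx)
                 for yk, fk, gy, gx in zip(y, full, g_y, g_x)]
        x = y
    return x


# Toy consistent system (assumed data) whose exact solution is x* = (1, 2).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 3.0]
grads = [make_ls_grad(a_i, b_i) for a_i, b_i in zip(A, b)]
x_s2gd = s2gd(grads, [0.0, 0.0], h=0.05, m=400, epochs=20, nu=1.0 / 3.0)
```

Using a fixed inner loop length instead of a random one gives the simpler variant that the slide's note appears to contrast with.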

14 Theorem

15 Convergence rate
- How to set the parameters?
- One term of the rate can be made arbitrarily small by decreasing the stepsize
- For any fixed stepsize, the other term can be made arbitrarily small by increasing the number of inner iterations
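The theorem and its rate were images on the slides. The form below is the rate as given in the S2GD paper, as best it can be reconstructed here; the exact constants should be treated as an assumption and checked against the paper. Here $h$ is the stepsize, $m$ the maximal inner loop size, $\nu \le \mu$ a lower bound on the strong convexity constant, and $x_j$ the iterate after $j$ epochs.

```latex
\begin{align*}
  \mathbb{E}\bigl[f(x_j) - f(x_*)\bigr] &\le c^{\,j} \bigl(f(x_0) - f(x_*)\bigr), \\
  c &= \underbrace{\frac{(1 - \nu h)^m}{\beta \, \mu \, h \, (1 - 2Lh)}}_{\to\, 0 \text{ as } m \to \infty}
     + \underbrace{\frac{2 (L - \mu) h}{1 - 2Lh}}_{\to\, 0 \text{ as } h \to 0},
  \qquad \beta = \sum_{t=1}^{m} (1 - \nu h)^{m - t}.
\end{align*}
```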

16 Setting the parameters
- Fix a target accuracy
- The accuracy is achieved by setting three parameters: the number of epochs, the stepsize, and the number of inner iterations
- Total complexity (in gradient evaluations): # of epochs × (full gradient evaluation + cheap iterations)

17 Complexity
- S2GD complexity: O((n + κ) log(1/ε))
- GD complexity: O(κ log(1/ε)) iterations, each of complexity n
- GD total: O(nκ log(1/ε))
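A short worked comparison shows what the difference between the two totals means. The values of $n$ and $\kappa$ are hypothetical, chosen only to make the arithmetic easy; constants hidden in the $O(\cdot)$ notation are ignored.

```latex
\begin{align*}
  \text{GD:}   &\quad n \kappa \log(1/\varepsilon), \\
  \text{S2GD:} &\quad (n + \kappa) \log(1/\varepsilon), \\
  \text{ratio:} &\quad \frac{n \kappa}{n + \kappa}
     = \frac{10^{6} \cdot 10^{3}}{10^{6} + 10^{3}} \approx 999
     \qquad \text{for } n = 10^{6},\ \kappa = 10^{3}.
\end{align*}
```

When $\kappa \ll n$ the ratio is close to $\kappa$, so the condition number stops multiplying the data size.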

18 Related Methods
- SAG: Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
  - Refreshes a single stochastic gradient in each iteration
  - Needs to store n gradients
  - Similar convergence rate
  - Cumbersome analysis
- SAGA (Aaron Defazio, Francis Bach, Simon Lacoste-Julien, 2014)
  - Refined analysis
- MISO: Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)
  - Similar to SAG, slightly worse performance
  - Elegant analysis

19 Related Methods
- SVRG: Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)
  - Arises as a special case in S2GD
- Prox-SVRG (Tong Zhang, Lin Xiao, 2014)
  - Extended to proximal setting
- EMGD: Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)
  - Handles simple constraints
  - Worse convergence rate

20 Experiment (logistic regression on: ijcnn, rcv, real-sim, url)

21 Extensions

22 Extensions summary
- S2GD:
  - Efficient handling of sparse data
  - Pre-processing with SGD (S2GD+)
  - Non-strongly convex losses
  - High-probability result
- Minibatching: mS2GD. Konečný, Liu, Richtárik and Takáč. mS2GD: Minibatch semi-stochastic gradient descent in the proximal setting, October 2014
- Coordinate descent variant: S2CD. Konečný, Qu, Richtárik. S2CD: Semi-Stochastic Coordinate Descent, October 2014

23 Sparse data
- For linear/logistic regression, the gradient copies the sparsity pattern of the example
- But the update direction is fully dense: the full-gradient part is DENSE, the stochastic part is SPARSE
- Can we do something about it?

24 Sparse data
- Yes we can!
- To compute the stochastic gradient for an example, we only need the coordinates of the current iterate corresponding to nonzero elements of that example
- For each coordinate, remember when it was last updated
- Before computing the stochastic gradient in a given inner iteration, update the required coordinates
- The step being: (number of iterations when the coordinate was not updated) × stepsize × (the "old gradient" coordinate)
- Compute the direction and make a single update
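The lazy-update idea above can be sketched as follows and checked against a dense reference loop. This is an illustration under assumptions, not the implementation from the slides: examples are stored as hypothetical `{coordinate: value}` dictionaries, the components are least-squares terms (so the gradient change is `a_i * (a_i . (y - x_tilde))`), and `full_grad` stands in for the stored "old" full gradient.

```python
def dense_inner_loop(rows, x_tilde, full_grad, h, picks):
    """Reference: every coordinate is updated in every inner iteration."""
    y = list(x_tilde)
    for i in picks:
        a = rows[i]
        diff = sum(v * (y[j] - x_tilde[j]) for j, v in a.items())
        for j in range(len(y)):
            y[j] -= h * (full_grad[j] + diff * a.get(j, 0.0))
    return y


def lazy_inner_loop(rows, x_tilde, full_grad, h, picks):
    """Same iterates, but only coordinates in the example's support are touched."""
    y = list(x_tilde)
    last = [0] * len(y)  # iteration up to which each coordinate is current
    for t, i in enumerate(picks):
        a = rows[i]
        # Catch up: apply the skipped dense steps for the needed coordinates.
        for j in a:
            y[j] -= (t - last[j]) * h * full_grad[j]
        diff = sum(v * (y[j] - x_tilde[j]) for j, v in a.items())
        # Single update of iteration t on the support only.
        for j, v in a.items():
            y[j] -= h * (full_grad[j] + diff * v)
            last[j] = t + 1
    # Flush the pending dense steps once, at the end of the inner loop.
    total = len(picks)
    for j in range(len(y)):
        y[j] -= (total - last[j]) * h * full_grad[j]
    return y


# Hypothetical sparse examples and stored gradient, for illustration only.
rows = [{0: 1.0}, {2: 2.0}, {0: 1.0, 3: -1.0}]
x_tilde = [0.5, -1.0, 2.0, 0.0]
full_grad = [0.1, -0.2, 0.3, 0.4]
picks = [0, 2, 1, 1, 0, 2]
y_dense = dense_inner_loop(rows, x_tilde, full_grad, 0.1, picks)
y_lazy = lazy_inner_loop(rows, x_tilde, full_grad, 0.1, picks)
```

The lazy loop does work proportional to the number of nonzeros per iteration plus one pass over all coordinates at the end, while producing the same iterate up to rounding.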

25 Sparse data implementation

26 S2GD+
- Observing that SGD can make reasonable progress while S2GD is still computing its first full gradient (when starting from an arbitrary point), we can formulate the following algorithm (S2GD+)

27 S2GD+ Experiment

28 High Probability Result
- The convergence result holds only in expectation
- Can we say anything about the concentration of the result in practice?
- For any target probability we have a corresponding bound
  - Paying just the logarithm of the probability
  - Independent of the other parameters
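One standard way to obtain a statement of this kind is Markov's inequality applied to the in-expectation rate. The derivation below is a sketch under the assumption that $\mathbb{E}[f(x_k) - f(x_*)] \le c^k (f(x_0) - f(x_*))$ holds with some $c < 1$; the symbols $\rho$ (failure probability) and $\varepsilon$ (relative accuracy) are chosen here for illustration.

```latex
\begin{align*}
  \mathbb{P}\!\left( \frac{f(x_k) - f(x_*)}{f(x_0) - f(x_*)} > \varepsilon \right)
    &\le \frac{\mathbb{E}\bigl[f(x_k) - f(x_*)\bigr]}{\varepsilon \, \bigl(f(x_0) - f(x_*)\bigr)}
     \le \frac{c^{k}}{\varepsilon}
     \le \rho
  \quad \text{whenever} \quad
  k \ge \frac{\log\bigl(1 / (\varepsilon \rho)\bigr)}{\log(1 / c)} .
\end{align*}
```

The confidence level enters only through the additive $\log(1/\rho)$ in the number of epochs, which matches the "paying just the logarithm of the probability" remark.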

29 Code
- Efficient implementation for logistic regression, available at MLOSS

30 mS2GD (mini-batch S2GD)
- How does mini-batching influence the algorithm?
- Replace the single stochastic gradient by an average over a mini-batch
- Provides two-fold speedup
  - Provably fewer gradient evaluations are needed (up to a certain number of mini-batches)
  - Easy possibility of parallelism
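The replacement on this slide was shown as formulas that are missing from the transcript. A natural reading, stated here as an assumption, is that the single-example gradient change is averaged over a mini-batch $A$ of size $b$ sampled uniformly at random; $g = \nabla f(\tilde{x})$ denotes the stored full gradient.

```latex
\begin{align*}
  \text{S2GD direction:} \quad
    & g + \nabla f_i(y) - \nabla f_i(\tilde{x}), \\
  \text{mS2GD direction:} \quad
    & g + \frac{1}{b} \sum_{i \in A} \bigl( \nabla f_i(y) - \nabla f_i(\tilde{x}) \bigr),
    \qquad |A| = b .
\end{align*}
```

Both directions are unbiased estimates of $\nabla f(y)$, and the $b$ terms of the mini-batch sum can be computed in parallel.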

31 S2CD (Semi-Stochastic Coordinate Descent)
- SGD type methods
  - Sampling rows (training examples) of data matrix
- Coordinate Descent type methods
  - Sampling columns (features) of data matrix
- Question: Can we do both?
  - Sample both columns and rows

32 S2CD (Semi-Stochastic Coordinate Descent)
- Complexity
- S2GD

33 S2GD as a Learning Algorithm

34 Problem with "us" optimizers
- Optimizers care about optimization
- Statisticians care about statistics
- Each in isolation
- The practical need to control both statistical predictive power and effort spent on optimization is not well understood
- Optimizers should be aware of…
- The following framework is mostly due to Bottou and Bousquet, 2007

35 Machine Learning Setting
- Space of input-output pairs
- Unknown distribution
- A relationship between inputs and outputs
- Loss function to measure discrepancy between predicted and real output
- Define Expected Risk

36 Machine Learning Setting
- Ideal goal: find the prediction function that minimizes the Expected Risk
- But you cannot even evaluate the Expected Risk, since the distribution is unknown

37 Machine Learning Setting
- We at least have i.i.d. samples
- Define Empirical Risk

38 Machine Learning Setting
- First learning principle: fix a family of candidate prediction functions
- Find the Empirical Minimizer

39 Machine Learning Setting
- Since the optimal prediction function is unlikely to belong to the chosen family, we also define the best function within the family

40 Machine Learning Setting
- Finding the Empirical Minimizer by minimizing the Empirical Risk exactly is often computationally expensive
- Run an optimization algorithm that returns an approximate minimizer, whose Empirical Risk is within a chosen tolerance of the minimum

41 Recapitulation
- Ideal optimum
- "Best" from our family
- Empirical Minimizer
- From approximate optimization

42 Machine Learning Goal
- The big goal is to minimize the Excess Risk, which splits into three parts:
  - Approximation error
  - Estimation error
  - Optimization error
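The definitions from the preceding slides and the resulting decomposition can be written out as follows. The notation follows the usual presentation of the Bottou and Bousquet framework and is assumed here, since the slide formulas are missing; note that in this part $x$ is an input and $y$ an output, $\ell$ is the loss, $\mathcal{F}$ the chosen family, and $\rho$ the optimization tolerance.

```latex
\begin{align*}
  E(f) &= \int \ell\bigl(f(x), y\bigr) \, dP(x, y), &
  E_n(f) &= \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr), \\
  f^* &= \arg\min_{f} E(f), &
  f^*_{\mathcal{F}} &= \arg\min_{f \in \mathcal{F}} E(f), \\
  f_n &= \arg\min_{f \in \mathcal{F}} E_n(f), &
  \tilde{f}_n &: \; E_n(\tilde{f}_n) \le E_n(f_n) + \rho,
\end{align*}
\begin{align*}
  E(\tilde{f}_n) - E(f^*)
    = \underbrace{E(f^*_{\mathcal{F}}) - E(f^*)}_{\text{approximation error}}
    + \underbrace{E(f_n) - E(f^*_{\mathcal{F}})}_{\text{estimation error}}
    + \underbrace{E(\tilde{f}_n) - E(f_n)}_{\text{optimization error}} .
\end{align*}
```

The decomposition is a telescoping identity, so it holds exactly for any choice of family and tolerance.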

43 Generic Machine Learning Problem
- All this leads to a complicated compromise
- Three variables
  - Family of functions
  - Number of examples
  - Optimization accuracy
- Two constraints
  - Maximal number of examples
  - Maximal computational time available
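Stated as an optimization problem, the compromise reads as below. The symbols $\mathcal{F}$, $n$, $\rho$ for the three variables, $n_{\max}$ and $T_{\max}$ for the two budgets, and $T(\cdot)$ for the computing time are assumed notation in the style of Bottou and Bousquet.

```latex
\begin{align*}
  \min_{\mathcal{F},\, \rho,\, n} \quad
    & E_{\text{app}} + E_{\text{est}} + E_{\text{opt}} \\
  \text{subject to} \quad
    & n \le n_{\max}, \\
    & T(\mathcal{F}, \rho, n) \le T_{\max} .
\end{align*}
```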

44 Generic Machine Learning Problem
- Small scale learning problem
  - If the first inequality (number of examples) is tight
  - Can reduce the optimization error to insignificant levels and recover the approximation-estimation tradeoff (well studied)
- Large scale learning problem
  - If the second inequality (computational time) is tight
  - More complicated compromise

45 Solving Large Scale ML Problem
- Several simplifications needed
- Do not carefully balance the three terms; instead we only ensure that, asymptotically, they decrease at the same rate
- Consider a fixed family of functions, linearly parameterized by a vector
  - Effectively setting the approximation error to be a constant
  - Simplifies to Estimation–Optimization tradeoff

46 Estimation–Optimization tradeoff
- Using uniform convergence bounds, one can obtain a first bound
- Often considered weak

47 Estimation–Optimization tradeoff
- Using Localized Bounds (Bousquet, PhD thesis, 2004) or Isomorphic Coordinate Projections (Bartlett and Mendelson, 2006), we get a sharper bound
- … if we can establish the following variance condition
- The condition often holds, for example under strong convexity, or when making assumptions on the data distribution

48 Estimation–Optimization tradeoff
- Using the previous bounds yields a combined bound involving an absolute constant
- We want to push this term below a target level
- Choosing the parameters accordingly, we get the following table
