Download presentation

Presentation is loading. Please wait.

Published byTylor Cockman Modified over 2 years ago

1
Semi-Stochastic Gradient Descent Methods Jakub Konečný University of Edinburgh Lehigh University December 3, 2014

2
Based on Basic Method: S2GD Konečný and Richtárik. Semi-Stochastic Gradient Descent Methods, December 2013 Mini-batching (& proximal setting): mS2GD Konečný, Liu, Richtárik and Takáč. mS2GD: mS2GD: Minibatch semi-stochastic gradient descent in the proximal setting, October 2014 Coordinate descent variant: S2CD Konečný, Qu and Richtárik. S2CD: Semi-Stochastic Coordinate Descent, October 2014

3
Introduction

4
Large scale problem setting Problems are often structured Frequently arising in machine learning Structure – sum of functions is BIG

5
Examples Linear regression (least squares) Logistic regression (classification)

6
Assumptions Lipschitz continuity of derivative of Strong convexity of

7
Gradient Descent (GD) Update rule Fast convergence rate Alternatively, for accuracy we need iterations Complexity of single iteration – (measured in gradient evaluations)

8
Stochastic Gradient Descent (SGD) Update rule Why it works Slow convergence Complexity of single iteration – (measured in gradient evaluations) a step-size parameter

9
Goal GD SGD Fast convergence gradient evaluations in each iteration Slow convergence Complexity of iteration independent of Combine in a single algorithm

10
Semi-Stochastic Gradient Descent S2GD

11
Intuition The gradient does not change drastically We could reuse the information from “old” gradient

12
Modifying “old” gradient Imagine someone gives us a “good” point and Gradient at point, near, can be expressed as Approximation of the gradient Already computed gradientGradient change We can try to estimate

13
The S2GD Algorithm Simplification; size of the inner loop is random, following a geometric rule

14
Theorem

15
Convergence rate How to set the parameters ? Can be made arbitrarily small, by decreasing For any fixed, can be made arbitrarily small by increasing

16
Setting the parameters The accuracy is achieved by setting Total complexity (in gradient evaluations) # of epochs full gradient evaluation cheap iterations # of epochs stepsize # of iterations Fix target accuracy

17
Complexity S2GD complexity GD complexity iterations complexity of a single iteration Total

18
Related Methods SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013) Refresh single stochastic gradient in each iteration Need to store gradients. Similar convergence rate Cumbersome analysis SAGA (Aaron Defazio, Francis Bach, Simon Lacoste-Julien, 2014) Refined analysis MISO - Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014) Similar to SAG, slightly worse performance Elegant analysis

19
Related Methods SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013) Arises as a special case in S2GD Prox-SVRG (Tong Zhang, Lin Xiao, 2014) Extended to proximal setting EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013) Handles simple constraints, Worse convergence rate

20
Experiment (logistic regression on: ijcnn, rcv, real-sim, url)

21
Extensions

22
Extensions summary S2GD: Efficient handling of sparse data Pre-processing with SGD (S2GD+) Non-strongly convex losses High-probability result Minibatching: mS2GD Konečný, Liu, Richtárik and Takáč. mS2GD: mS2GD: Minibatch semi-stochastic gradient descent in the proximal setting, October 2014 Coordinate descent variant:S2CD Konečný, Qu, Richtárik. S2CD: Semi-Stochastic Coordinate Descent, October 2014

23
Sparse data For linear/logistic regression, gradient copies sparsity pattern of example. But the update direction is fully dense Can we do something about it? DENSESPARSE

24
Sparse data Yes we can! To compute, we only need coordinates of corresponding to nonzero elements of For each coordinate, remember when was it updated last time – Before computing in inner iteration number, update required coordinates Step being Compute direction and make a single update Number of iterations when the coordinate was not updated The “old gradient”

25
Sparse data implementation

26
S2GD+ Observing that SGD can make reasonable progress, while S2GD computes first full gradient (in case we are starting from arbitrary point), we can formulate the following algorithm (S2GD+)

27
S2GD+ Experiment

28
High Probability Result The result holds only in expectation Can we say anything about the concentration of the result in practice? For any we have: Paying just logarithm of probability Independent from other parameters

29
Code Efficient implementation for logistic regression - available at MLOSS http://mloss.org/software/view/556/

30
mS2GD (mini-batch S2GD) How does mini-batching influence the algorithm? Replace by Provides two-fold speedup Provably less gradient evaluations are needed (up to certain number of mini-batches) Easy possibility of parallelism

31
S2CD (Semi-Stochastic Coordinate Descent) SGD type methods Sampling rows (training examples) of data matrix Coordinate Descent type methods Sampling columns (features) of data matrix Question: Can we do both? Sample both columns and rows

32
S2CD (Semi-Stochastic Coordinate Descent) Comlpexity S2GD

33
S2GD as a Learning Algorithm

34
Problem with “us” optimizers Optimizers care about optimization Statisticians care about statistics In isolation Practical need to control both statistical predictive power and effort spent on optimization, is not well understood. Optimizers should be aware of… The following framework is mostly due to Bottou and Bousquets, 2007

35
Machine Learning Setting Space of input-output pairs Unknown distribution A relationship between inputs and outputs Loss function to measure discrepancy between predicted and real output Define Expected Risk

36
Machine Learning Setting Ideal goal: Find such that, But you cannot even evaluate Define Expected Risk

37
Machine Learning Setting We at least have iid samples Define Empirical Risk

38
First learning principle – fix a family of candidate prediction functions Find Empirical Minimizer Define Empirical Risk Machine Learning Setting

39
Since optimal is unlikely to belong to, we also define Define Empirical Risk Machine Learning Setting

40
Finding by minimizing the Empirical Risk exactly is often computationally expensive Run optimization algorithm that returns such that Define Empirical Risk Machine Learning Setting

41
Recapitulation Ideal optimum “Best” from our family Empirical Minimizer From approximate optimization

42
Machine Learning Goal Big goal is to minimize the Excess Risk Approximation error Estimation Error Optimization Error

43
Generic Machine Learning Problem All this leads to a complicated compromise Three variables Family of functions Number or examples Optimization accuracy Two constraints Maximal number of examples Maximal computational time available

44
Generic Machine Learning Problem Small scale learning problem If first inequality is tight Can reduce to insignificant levels and recover approximation-estimation tradeoff (well studied) Large scale learning problem If second inequality is tight More complicated compromise

45
Solving Large Scale ML Problem Several simplifications needed Not carefully balance the three terms; instead we only ensure that asymptotically Consider fixed family of functions, linearly parameterized by a vector Effectively setting to be a constant Simplifies to Estimation–Optimization tradeoff

46
Estimation–Optimization tradeoff Using uniform convergence bounds, one can obtain Often considered weak

47
Estimation–Optimization tradeoff Using Localized Bounds (Bousquet, PhD thesis, 2004) or Isomorphic Coordinate Projections (Bartlett and Mendelson, 2006), we get … if we can establish the following variance condition Often, for example under strong convexity, or making assumptions on the data distribution

48
Estimation–Optimization tradeoff Using the previous bounds yields where is an absolute constant We want to push this term below Choosing and using and we get the following table

Similar presentations

OK

Giansalvo EXIN Cirrincione unit #4 Single-layer networks They directly compute linear discriminant functions using the TS without need of determining.

Giansalvo EXIN Cirrincione unit #4 Single-layer networks They directly compute linear discriminant functions using the TS without need of determining.

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google

Ppt on geotextile Ppt on united nations and its various organs Ppt on surface water quality English ppt on reported speech Ppt on simple carburetors Convert pdf ppt to ppt online form Ppt on the origin of the universe Ppt on famous indian entrepreneurs Ppt on operating systems Ppt on the elements of the short story