
Semi-Stochastic Gradient Descent Peter Richtárik ANC/DTC Seminar, School of Informatics, University of Edinburgh Edinburgh - November 4, 2014

Based on
- Basic method: S2GD & S2GD+
  Konecny and Richtarik. Semi-Stochastic Gradient Descent Methods, arXiv preprint, December 2013
- Mini-batching (& proximal setup): mS2GD
  Konecny, Liu, Richtarik and Takac. mS2GD: Minibatch Semi-Stochastic Gradient Descent in the Proximal Setting, October 2014
- Coordinate descent variant: S2CD
  Konecny, Qu and Richtarik. S2CD: Semi-Stochastic Coordinate Descent, October 2014

The Problem

Minimizing Average Loss
- Problems are often structured: minimize f(x) = (1/n) Σ_{i=1}^{n} f_i(x)
- Frequently arising in machine learning
- Structure: a sum of functions, where the number of functions n is BIG

Examples  Linear regression (least squares)   Logistic regression (classification) 

Assumptions  Lipschitz continuity of the gradient of  Strong convexity of

Applications

SPAM DETECTION

PAGE RANKING

FAKE DETECTION

RECOMMENDER SYSTEMS

GEOTAGGING

Gradient Descent vs Stochastic Gradient Descent

Gradient Descent (GD)
- Update rule: x_{k+1} = x_k − h ∇f(x_k)
- Fast convergence rate: the error shrinks by a constant factor per iteration
- Alternatively: for accuracy ε we need O(κ log(1/ε)) iterations, where κ = L/μ is the condition number
- Complexity of a single iteration: n (measured in gradient evaluations)

Stochastic Gradient Descent (SGD)
- Update rule: x_{k+1} = x_k − h_k ∇f_i(x_k), with i chosen uniformly at random and h_k a step-size parameter
- Why it works: E_i[∇f_i(x)] = ∇f(x), an unbiased estimate of the full gradient
- Slow convergence: O(1/ε) iterations for accuracy ε
- Complexity of a single iteration: 1 (measured in gradient evaluations)

Dream…
- GD: fast convergence, but n gradient evaluations in each iteration
- SGD: slow convergence, but complexity of each iteration independent of n
- Goal: combine the strengths of both in a single algorithm
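To make the comparison concrete, a short Python sketch of the two update rules for f(x) = (1/n) Σ f_i(x); grad_i(x, i) is assumed to return ∇f_i(x) (for instance the logistic gradient above), and the stepsize h is left to the caller.

```python
import numpy as np

def gd_step(x, grad_i, n, h):
    """One GD step: averages all n gradients, so it costs n evaluations."""
    g = sum(grad_i(x, i) for i in range(n)) / n
    return x - h * g

def sgd_step(x, grad_i, n, h):
    """One SGD step: a single sampled gradient, so it costs 1 evaluation."""
    i = np.random.randint(n)
    return x - h * grad_i(x, i)
```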

S2GD: Semi-Stochastic Gradient Descent

Why the dream may come true…
- The gradient does not change drastically between nearby points
- We could reuse old gradient information

Modifying the “old” gradient
- Imagine someone gives us a “good” point y and the full gradient ∇f(y)
- The gradient at a point x near y can be expressed as ∇f(x) = ∇f(y) + (∇f(x) − ∇f(y)), i.e., the already computed gradient plus the gradient change
- We can try to estimate the gradient change cheaply from a single randomly sampled function f_i
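Written out, the resulting estimate of ∇f(x) is the variance-reduced estimator also used by SVRG, and it is unbiased:

```latex
g = \nabla f(y) + \nabla f_i(x) - \nabla f_i(y),
\qquad
\mathbb{E}_i\,[g] = \nabla f(y) + \nabla f(x) - \nabla f(y) = \nabla f(x).
```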

The S2GD Algorithm
Outer loop: compute and store the full gradient ∇f(y) at the current point y. Inner loop: take cheap stochastic steps using the estimate above, then reset y to the result. Simplification: the size of the inner loop is random, following a geometric rule.
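A minimal Python sketch of the method as described: an outer loop that recomputes the full gradient at a snapshot point y, and an inner loop of cheap steps using the estimate above. For simplicity the inner loop here has a fixed length m rather than the random, geometrically distributed length used in the actual algorithm, and all parameter names are ours.

```python
import numpy as np

def s2gd(x0, grad_i, n, epochs, m, h):
    """Simplified semi-stochastic gradient descent sketch."""
    y = x0.copy()
    for _ in range(epochs):
        # Expensive step: full gradient at the snapshot y (n evaluations).
        full_grad = sum(grad_i(y, i) for i in range(n)) / n
        x = y.copy()
        for _ in range(m):
            i = np.random.randint(n)
            # Cheap step: variance-reduced estimate of the gradient at x.
            g = full_grad + grad_i(x, i) - grad_i(y, i)
            x = x - h * g
        y = x
    return y
```

With the logistic example above one would pass, say, grad_i = lambda x, i: logistic_grad_i(x, A[i], b[i]) for a data matrix A and label vector b (these names are again ours).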

Theorem

Convergence Rate
- How do we set the parameters (stepsize h, inner-loop size m, number of epochs)?
- The convergence factor is a sum of two terms: one can be made arbitrarily small by decreasing h, and for any fixed h the other can be made arbitrarily small by increasing m

Setting the Parameters
- Fix a target accuracy ε
- The accuracy is achieved by a suitable choice of the number of epochs (roughly log(1/ε) of them), the stepsize h, and the number of inner iterations m
- Total complexity (in gradient evaluations): # of epochs × (one full gradient evaluation + m cheap iterations)

Complexity  S2GD complexity  GD complexity  iterations  complexity of a single iteration  Total

Experiment (logistic regression on: ijcnn, rcv, real-sim, url)

Related Methods  SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)  Refresh single stochastic gradient in each iteration  Need to store gradients.  Similar convergence rate  Cumbersome analysis  SAGA (Aaron Defazio, Francis Bach, Simon Lacoste-Julien, 2014)  Refined analysis  MISO - Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)  Similar to SAG, slightly worse performance  Elegant analysis

Related Methods  SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)  Arises as a special case in S2GD  Prox-SVRG (Tong Zhang, Lin Xiao, 2014)  Extended to proximal setting  EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)  Handles simple constraints,  Worse convergence rate

Extensions

- S2GD:
  - Efficient handling of sparse data
  - Pre-processing with SGD (-> S2GD+)
  - Inexact computation of gradients
  - Non-strongly convex losses
  - High-probability result
- Mini-batching: mS2GD
  - Konecny, Liu, Richtarik and Takac. mS2GD: Minibatch Semi-Stochastic Gradient Descent in the Proximal Setting, October 2014
- Coordinate variant: S2CD
  - Konecny, Qu and Richtarik. S2CD: Semi-Stochastic Coordinate Descent, October 2014

Semi-Stochastic Coordinate Descent (S2CD)
In each cheap inner iteration, only a single randomly chosen coordinate is updated, using a variance-reduced estimate of the corresponding partial derivative. Its complexity has the same form as that of S2GD, O((n + κ) log(1/ε)), with the condition number κ replaced by a coordinate-wise analogue.

Mini-batch Semi-Stochastic Gradient Descent

Sparse Data  For linear/logistic regression, gradient copies sparsity pattern of example.  But the update direction is fully dense  Can we do something about it? DENSESPARSE

Sparse Data (Continued)
- Yes we can!
- To compute ∇f_i(x), we only need the coordinates of x corresponding to nonzero elements of a_i
- For each coordinate j, remember when it was last updated, χ_j
- Before computing ∇f_i(x) in inner iteration number t, bring only the required coordinates up to date: apply the dense part of the step, coming from the “old gradient” ∇f(y), once with an effective stepsize of (t − χ_j) h, where t − χ_j is the number of iterations in which the coordinate was not updated
- Then compute the direction and make a single sparse update

S2GD: Implementation for Sparse Data
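A hedged Python sketch of the lazy-update trick from the previous slide, applied to the simplified inner loop shown earlier: the dense part of each step, coming from ∇f(y), is applied to a coordinate only when that coordinate is actually needed, using a per-coordinate counter chi of the last iteration at which it was touched. Sparse gradients are represented as {coordinate: value} dictionaries, and all names are ours.

```python
import numpy as np

def s2gd_inner_sparse(x, y, full_grad, sparse_grad_i, n, m, h):
    """One S2GD inner loop with lazy application of the dense term.
    sparse_grad_i(z, i) returns {j: df_i(z)/dz_j} over the nonzeros of a_i."""
    chi = np.zeros(len(x), dtype=int)  # last iteration each coordinate was updated
    for t in range(1, m + 1):
        i = np.random.randint(n)
        gy = sparse_grad_i(y, i)  # sparse "old" gradient of f_i
        for j in gy:
            # Catch up on the dense steps this coordinate skipped so far.
            x[j] -= (t - 1 - chi[j]) * h * full_grad[j]
        gx = sparse_grad_i(x, i)  # sparse gradient at the (now updated) point
        for j in gy:
            # Variance-reduced step on the touched coordinates only.
            x[j] -= h * (full_grad[j] + gx[j] - gy[j])
            chi[j] = t
    # Flush the remaining dense steps so every coordinate has seen all m of them.
    for j in range(len(x)):
        x[j] -= (m - chi[j]) * h * full_grad[j]
    return x
```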

S2GD+  Observing that SGD can make reasonable progress while S2GD computes the first full gradient, we can formulate the following algorithm (S2GD+)

S2GD+ Experiment

High Probability Result
- The convergence result above holds only in expectation
- Can we say anything about the concentration of the result in practice?
- Yes: for any failure probability ρ ∈ (0, 1), the target accuracy is reached with probability at least 1 − ρ at the cost of a factor of log(1/ρ) more epochs
- We pay just the logarithm of the probability, independently of the other parameters

Inexact Case
- Question: what if we only have access to an inexact oracle?
- Assume we can get the same update direction, but with an error δ
- The S2GD algorithm in this setting still converges, up to an additional error term governed by δ

Code  Efficient implementation for logistic regression - available at MLOSS