Semi-Stochastic Gradient Descent Methods
Jakub Konečný, University of Edinburgh
BASP Frontiers Workshop, January 28, 2014


Based on
- Basic Method: S2GD
  - Konečný and Richtárik. Semi-Stochastic Gradient Descent Methods, December 2013
- Mini-batching (& proximal setting): mS2GD
  - Konečný, Liu, Richtárik and Takáč. mS2GD: Minibatch semi-stochastic gradient descent in the proximal setting, October 2014
- Coordinate descent variant: S2CD
  - Konečný, Qu and Richtárik. Semi-stochastic coordinate descent, December 2014

Introduction

Large scale problem setting
- Problems are often structured: minimize f(x) = (1/n) Σ_{i=1}^n f_i(x)
- Frequently arising in machine learning
- Structure – a sum of functions; the number of functions n is BIG

Examples
- Linear regression (least squares): f_i(x) = (a_i^T x - b_i)^2
- Logistic regression (classification): f_i(x) = log(1 + exp(-b_i a_i^T x))
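To make the two examples concrete, here is a minimal Python sketch of the summands f_i and their gradients (illustrative only; the array names A, b and the helper functions are assumptions, not code from the talk):

```python
import numpy as np

def least_squares_loss(x, A, b):
    """f(x) = (1/n) * sum_i (a_i^T x - b_i)^2 -- average squared error."""
    r = A @ x - b
    return np.mean(r ** 2)

def least_squares_grad_i(x, A, b, i):
    """Gradient of the i-th summand f_i(x) = (a_i^T x - b_i)^2."""
    return 2.0 * (A[i] @ x - b[i]) * A[i]

def logistic_loss(x, A, b):
    """f(x) = (1/n) * sum_i log(1 + exp(-b_i * a_i^T x)), labels b_i in {-1, +1}."""
    return np.mean(np.logaddexp(0.0, -b * (A @ x)))

def logistic_grad_i(x, A, b, i):
    """Gradient of the i-th summand of the logistic loss."""
    margin = b[i] * (A[i] @ x)
    return -b[i] * A[i] / (1.0 + np.exp(margin))
```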

Assumptions
- Lipschitz continuity of the derivative of each f_i (with constant L)
- Strong convexity of f (with constant μ)

Gradient Descent (GD)
- Update rule: x_{k+1} = x_k - h ∇f(x_k)
- Fast (linear) convergence rate
- Alternatively: for accuracy ε we need O(κ log(1/ε)) iterations, where κ = L/μ is the condition number
- Complexity of a single iteration – n (measured in gradient evaluations)

Stochastic Gradient Descent (SGD)
- Update rule: x_{k+1} = x_k - h_k ∇f_i(x_k), with i picked uniformly at random (h_k is a step-size parameter)
- Why it works: E[∇f_i(x)] = ∇f(x), so the stochastic gradient is an unbiased estimate of the gradient
- Slow (sublinear) convergence
- Complexity of a single iteration – 1 (measured in gradient evaluations)
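To make the per-iteration costs concrete, the sketch below contrasts one GD step (n component-gradient evaluations) with one SGD step (a single evaluation) on a small least-squares instance; the data, step sizes and iteration counts are illustrative assumptions, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))                 # n = 1000 examples, d = 20 features
b = A @ rng.standard_normal(20) + 0.1 * rng.standard_normal(1000)

def full_grad(x):
    """Gradient of f(x) = (1/n) * ||Ax - b||^2 -- costs n component-gradient evaluations."""
    return 2.0 * A.T @ (A @ x - b) / len(b)

def grad_i(x, i):
    """Gradient of the single summand f_i(x) = (a_i^T x - b_i)^2 -- costs 1 evaluation."""
    return 2.0 * (A[i] @ x - b[i]) * A[i]

x_gd, x_sgd, h = np.zeros(20), np.zeros(20), 0.01
for k in range(1000):
    x_gd -= h * full_grad(x_gd)                       # GD: fast rate, expensive iteration
    i = rng.integers(len(b))
    x_sgd -= h / np.sqrt(k + 1) * grad_i(x_sgd, i)    # SGD: cheap iteration, decaying step, slow rate
```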

Goal
- GD: fast convergence, but n gradient evaluations in each iteration
- SGD: slow convergence, but the complexity of an iteration is independent of n
- Goal: combine the two in a single algorithm

Semi-Stochastic Gradient Descent S2GD

Intuition
- The gradient does not change drastically
- We could reuse the information from an "old" gradient

Modifying the "old" gradient
- Imagine someone gives us a "good" point y and the full gradient ∇f(y)
- The gradient at a point x, near y, can be expressed as
  ∇f(x) = ∇f(y) + (∇f(x) - ∇f(y)),
  i.e., an already computed gradient plus a gradient change
- Approximation of the gradient: we can try to estimate the change by a single stochastic term, giving the estimate ∇f(y) + ∇f_i(x) - ∇f_i(y)

The S2GD Algorithm
- Simplification: the size of the inner loop is random, following a geometric rule
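The algorithm box itself is not preserved in this transcript, so the following Python sketch reconstructs the method as described above: an outer loop that recomputes the full gradient at a reference point y, and an inner loop of random length (drawn from the geometric-like rule mentioned on the slide) that steps along ∇f(y) + ∇f_i(x) − ∇f_i(y). Function names, default parameter values and the least-squares test problem are illustrative assumptions.

```python
import numpy as np

def s2gd(full_grad, grad_i, n, x0, epochs=10, m=500, h=0.01, nu=0.0, rng=None):
    """Sketch of semi-stochastic gradient descent (S2GD).

    full_grad(x): gradient of f at x (costs n component evaluations).
    grad_i(x, i): gradient of the i-th summand f_i at x (costs 1 evaluation).
    m: maximum inner-loop length; h: stepsize; nu: lower bound on the strong
    convexity parameter (nu = 0 makes the inner-loop length uniform on {1,...,m}).
    """
    rng = rng or np.random.default_rng(0)
    # Inner-loop length t is random, with P(t) proportional to (1 - nu*h)^(m - t).
    weights = (1.0 - nu * h) ** (m - np.arange(1, m + 1))
    probs = weights / weights.sum()
    y = x0.copy()
    for _ in range(epochs):
        g = full_grad(y)                      # one full gradient per epoch
        t = rng.choice(np.arange(1, m + 1), p=probs)
        x = y.copy()
        for _ in range(t):
            i = rng.integers(n)
            # Variance-reduced direction: old full gradient plus the change of f_i.
            x -= h * (g + grad_i(x, i) - grad_i(y, i))
        y = x
    return y

# Illustrative use on a small least-squares problem.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 10))
b = A @ rng.standard_normal(10)
fg = lambda x: 2.0 * A.T @ (A @ x - b) / len(b)
gi = lambda x, i: 2.0 * (A[i] @ x - b[i]) * A[i]
x_opt = s2gd(fg, gi, n=len(b), x0=np.zeros(10))
```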

Theorem

Convergence rate
- How to set the parameters (the stepsize h and the inner-loop size m)?
- One part of the rate can be made arbitrarily small by decreasing the stepsize h; for any fixed stepsize, the other part can be made arbitrarily small by increasing the inner-loop size m

Setting the parameters
- Fix a target accuracy ε
- The accuracy is achieved by setting the number of epochs, the stepsize and the number of inner iterations appropriately
- Total complexity (in gradient evaluations): (# of epochs) × (one full gradient evaluation + the cheap inner iterations)

Complexity
- S2GD complexity: O((n + κ) log(1/ε)) gradient evaluations in total
- GD complexity:
  - O(κ log(1/ε)) iterations
  - n – complexity of a single iteration
  - Total: O(n κ log(1/ε))
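As a purely illustrative comparison of the two totals (the values n = 10^6 and κ = 10^3 are assumed here, not taken from the talk):

```latex
% Illustrative arithmetic only; n = 10^6 and \kappa = 10^3 are assumed values.
\text{GD:}\quad n\,\kappa\,\log(1/\varepsilon)
  = 10^6 \cdot 10^3 \cdot \log(1/\varepsilon)
  = 10^9 \,\log(1/\varepsilon) \ \text{gradient evaluations}
\qquad
\text{S2GD:}\quad (n+\kappa)\,\log(1/\varepsilon)
  \approx 10^6 \,\log(1/\varepsilon) \ \text{gradient evaluations}
```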

Related Methods
- SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)
  - Refreshes a single stochastic gradient in each iteration
  - Needs to store n gradients
  - Similar convergence rate
  - Cumbersome analysis
- SAGA (Aaron Defazio, Francis Bach, Simon Lacoste-Julien, 2014)
  - Refined analysis
- MISO – Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)
  - Similar to SAG, more general
  - Elegant analysis, but not really applicable for sparse data
  - (Julien is here)

Related Methods
- SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)
  - Arises as a special case of S2GD
- Prox-SVRG (Tong Zhang, Lin Xiao, 2014)
  - Extended to the proximal setting
- EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)
  - Handles simple constraints
  - Worse convergence rate

Experiment (logistic regression on: ijcnn, rcv, real-sim, url)

Extensions

Sparse data
- For linear/logistic regression, the stochastic gradient ∇f_i(x) copies the sparsity pattern of the example a_i (SPARSE)
- But the update direction ∇f(y) + ∇f_i(x) - ∇f_i(y) is fully DENSE, because the full gradient ∇f(y) is dense
- Can we do something about it?

Sparse data
- Yes we can!
- To compute ∇f_i(x), we only need the coordinates of x corresponding to the nonzero elements of a_i
- For each coordinate j, remember when it was updated last time – χ_j
- Before computing ∇f_i(x) in inner iteration number t, update the required coordinates lazily
- Step being x_j ← x_j - (t - χ_j) h ∇_j f(y), where t - χ_j is the number of iterations for which the coordinate was not updated and ∇_j f(y) is the corresponding coordinate of the "old" gradient
- Then compute the direction and make a single sparse update

Sparse data implementation
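The implementation shown on this slide is not preserved in the transcript; the following is a hedged Python sketch of the lazy ("just-in-time") update idea from the previous slide, assuming least-squares summands f_i(x) = (a_i^T x - b_i)^2 and a SciPy CSR data matrix. All helper names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def s2gd_inner_loop_sparse(A, b, y, g, t, h, rng):
    """One S2GD inner loop with lazy updates for sparse data.

    A: scipy.sparse CSR matrix of examples; y: reference point;
    g = grad f(y) (dense); t: number of inner iterations; h: stepsize.
    Only coordinates touched by the sampled rows are ever written.
    """
    x = y.copy()
    last_update = np.zeros(A.shape[1], dtype=np.int64)   # chi_j: last iteration coord j was touched
    for k in range(1, t + 1):
        i = rng.integers(A.shape[0])
        idx = A.indices[A.indptr[i]:A.indptr[i + 1]]      # nonzero coordinates of row a_i
        vals = A.data[A.indptr[i]:A.indptr[i + 1]]
        # Catch up: apply the accumulated "old gradient" steps these coordinates missed.
        x[idx] -= (k - 1 - last_update[idx]) * h * g[idx]
        last_update[idx] = k
        # Sparse part of the variance-reduced step: grad f_i(x) - grad f_i(y).
        diff = 2.0 * ((vals @ x[idx] - b[i]) - (vals @ y[idx] - b[i]))
        x[idx] -= h * (g[idx] + diff * vals)
    # Final catch-up so every coordinate has seen all t "old gradient" steps.
    x -= (t - last_update) * h * g
    return x

# Illustrative use on a random sparse least-squares problem.
rng = np.random.default_rng(0)
A = sp.random(200, 50, density=0.05, format="csr", random_state=0)
b = np.asarray(A @ rng.standard_normal(50)).ravel()
y = np.zeros(50)
g = 2.0 * np.asarray(A.T @ (A @ y - b)).ravel() / A.shape[0]
x_new = s2gd_inner_loop_sparse(A, b, y, g, t=500, h=0.01, rng=rng)
```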

High Probability Result
- The convergence result above holds only in expectation
- Can we say anything about the concentration of the result in practice?
- For any failure probability ρ we have a high-probability guarantee, paying just the logarithm of the probability (a log(1/ρ) factor), independent of the other parameters

Code
- Efficient implementation for logistic regression – available at MLOSS

mS2GD (mini-batch S2GD)
- How does mini-batching influence the algorithm?
- Replace the single stochastic gradient ∇f_i(x) by an average over a mini-batch of examples (sketched below)
- Provides a two-fold speedup:
  - Provably fewer gradient evaluations are needed (up to a certain number of mini-batches)
  - Easy possibility of parallelism
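A hedged sketch of the only change to the inner loop: the single sampled index is replaced by a mini-batch, and the variance-reduced direction averages the gradient changes over the batch (the helper name and signature are illustrative, not from the mS2GD paper):

```python
import numpy as np

def minibatch_direction(grad_i, x, y, g, n, batch_size, rng):
    """Mini-batch variance-reduced direction: g + average of (grad f_i(x) - grad f_i(y))."""
    batch = rng.choice(n, size=batch_size, replace=False)
    change = sum(grad_i(x, i) - grad_i(y, i) for i in batch) / batch_size
    return g + change
```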

S2CD (Semi-Stochastic Coordinate Descent)
- SGD-type methods: sampling rows (training examples) of the data matrix
- Coordinate descent-type methods: sampling columns (features) of the data matrix
- Question: Can we do both? Sample both columns and rows

S2CD (Semi-Stochastic Coordinate Descent)
- Complexity, compared with S2GD