Semi-Stochastic Gradient Descent Methods Jakub Konečný (joint work with Peter Richtárik) University of Edinburgh

Introduction

Large scale problem setting  Problems are often structured  Frequently arising in machine learning  Structure – sum of functions: $\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)$, where $n$ is BIG

Examples  Linear regression (least squares): $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$  Logistic regression (classification): $f_i(x) = \log(1 + \exp(-b_i\, a_i^\top x))$
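To make the examples concrete, here is a minimal NumPy sketch (not from the slides) of the per-function gradients for both objectives; the toy data A, b, the sizes, and all names are hypothetical:

```python
# Hypothetical toy setup for the two example objectives; in the
# talk's setting n would be BIG, here it is small for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))      # rows are the feature vectors a_i
b = rng.choice([-1.0, 1.0], size=n)  # labels b_i

def least_squares_grad_i(x, i):
    """Gradient of f_i(x) = (a_i^T x - b_i)^2 / 2."""
    return (A[i] @ x - b[i]) * A[i]

def logistic_grad_i(x, i):
    """Gradient of f_i(x) = log(1 + exp(-b_i a_i^T x))."""
    z = -b[i] * (A[i] @ x)
    return -b[i] * A[i] / (1.0 + np.exp(-z))

def full_grad(x, grad_i):
    """Gradient of f(x) = (1/n) sum_i f_i(x): costs n gradient evaluations."""
    return np.mean([grad_i(x, i) for i in range(n)], axis=0)
```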

Assumptions  Lipschitz continuity of the derivative of each $f_i$: $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\,\|x - y\|$  Strong convexity of $f$ with parameter $\mu > 0$: $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|^2$

Gradient Descent (GD)  Update rule: $x_{k+1} = x_k - h\,\nabla f(x_k)$, where $h$ is a step-size parameter  Fast (linear) convergence rate: $f(x_k) - f(x_*) \le (1 - \mu/L)^k\,(f(x_0) - f(x_*))$  Alternatively, for accuracy $\varepsilon$ we need $O(\kappa \log(1/\varepsilon))$ iterations, where $\kappa = L/\mu$  Complexity of a single iteration – $n$ (measured in gradient evaluations)
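A sketch of this update rule, reusing the toy functions from the sketch above; the fixed stepsize is an assumed value, in practice one would take $h \approx 1/L$:

```python
def gd(x0, grad_i, h=0.1, iters=100):
    """Gradient descent: each step costs n gradient evaluations."""
    x = x0.copy()
    for _ in range(iters):
        x -= h * full_grad(x, grad_i)  # full gradient at the current point
    return x

# e.g. x_gd = gd(np.zeros(d), logistic_grad_i)
```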

Stochastic Gradient Descent (SGD)  Update rule: $x_{k+1} = x_k - h_k\,\nabla f_i(x_k)$, with $i$ chosen uniformly at random and $h_k$ a step-size parameter  Why it works: the random gradient is unbiased, $\mathbb{E}_i[\nabla f_i(x)] = \nabla f(x)$  Slow convergence: $O(1/\varepsilon)$ iterations for accuracy $\varepsilon$  Complexity of a single iteration – $1$ (measured in gradient evaluations)
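The corresponding SGD sketch, again reusing the toy setup; the $1/(k+1)$ stepsize decay is one common choice, not necessarily the schedule used in the talk:

```python
def sgd(x0, grad_i, h0=1.0, iters=10_000):
    """SGD: each step costs a single gradient evaluation."""
    x = x0.copy()
    for k in range(iters):
        i = rng.integers(n)                 # uniformly random index
        x -= (h0 / (k + 1)) * grad_i(x, i)  # unbiased gradient estimate
    return x
```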

Goal  GD: fast convergence, but $n$ gradient evaluations in each iteration  SGD: slow convergence, but complexity of each iteration is independent of $n$  Goal: combine the strengths of both in a single algorithm

Semi-Stochastic Gradient Descent (S2GD)

Intuition  The gradient does not change drastically between nearby points  We could reuse the information from an “old” gradient

Modifying the “old” gradient  Imagine someone gives us a “good” point $y$ and $\nabla f(y)$  The gradient at a point $x$ near $y$ can be expressed as $\nabla f(x) = \nabla f(y) + [\nabla f(x) - \nabla f(y)]$ – an already computed gradient plus a gradient change  We can try to estimate the change stochastically: $\nabla f(x) \approx \nabla f(y) + \nabla f_i(x) - \nabla f_i(y)$ for a random index $i$
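The combined estimate from this slide as a one-liner (continuing the toy sketch); note it is unbiased: taking the expectation over $i$ recovers $\nabla f(x)$ exactly:

```python
def s2gd_estimate(x, y, g, grad_i, i):
    """Estimate of grad f(x): stored full gradient g = grad f(y),
    corrected by the change in one randomly chosen component."""
    return g + grad_i(x, i) - grad_i(y, i)
```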

The S2GD Algorithm  Outer loop (epochs): compute and store the full gradient $g = \nabla f(y)$ at the current point $y$  Inner loop: repeat $x \leftarrow x - h\,(g + \nabla f_i(x) - \nabla f_i(y))$ with $i$ chosen uniformly at random, then set $y \leftarrow x$  Simplification: the size of the inner loop is random, following a geometric rule (see the sketch below)
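A minimal sketch of the whole method under this simplification, built on the functions above; the parameter values, and drawing the inner-loop size from a capped geometric law, are my assumptions (the paper's exact distribution over {1, ..., m} differs slightly):

```python
def s2gd(x0, grad_i, h=0.05, m=2 * n, epochs=20):
    """S2GD sketch: per epoch, one full pass plus a few cheap steps."""
    y = x0.copy()
    for _ in range(epochs):
        g = full_grad(y, grad_i)                 # one full pass per epoch
        t = min(int(rng.geometric(1.0 / m)), m)  # random inner-loop size
        x = y.copy()
        for _ in range(t):
            i = rng.integers(n)                  # cheap stochastic step
            x -= h * s2gd_estimate(x, y, g, grad_i, i)
        y = x                                    # next epoch starts from x
    return y
```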

Theorem  Under the assumptions above, S2GD converges linearly in expectation: $\mathbb{E}[f(y_j) - f(x_*)] \le c^j\,(f(y_0) - f(x_*))$, where $y_j$ is the point after $j$ epochs and the rate $c < 1$ depends on the stepsize $h$ and the inner-loop size $m$

Convergence rate  How to set the parameters $h$ (stepsize) and $m$ (inner-loop size)?  The rate $c$ is a sum of two terms: one can be made arbitrarily small by decreasing $h$  For any fixed $h$, the other can be made arbitrarily small by increasing $m$

Setting the parameters  Fix a target accuracy $\varepsilon$  The accuracy is achieved by setting the # of epochs to $O(\log(1/\varepsilon))$, the stepsize $h = O(1/L)$, and the # of inner iterations $m = O(\kappa)$  Total complexity (in gradient evaluations): # of epochs × (one full gradient evaluation + $m$ cheap iterations) $= O((n + \kappa)\log(1/\varepsilon))$
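An illustrative translation of these choices into code; only the scalings ($O(\log(1/\varepsilon))$ epochs, $m \sim \kappa$, $h \sim 1/L$) follow the slide, the constants 10 and 20 are placeholders:

```python
import math

def s2gd_parameters(n, L, mu, eps):
    """Return illustrative S2GD parameters for target accuracy eps."""
    kappa = L / mu                            # condition number
    epochs = math.ceil(math.log(1.0 / eps))   # number of epochs
    m = math.ceil(20 * kappa)                 # cheap inner iterations per epoch
    h = 1.0 / (10 * L)                        # stepsize
    total = epochs * (n + 2 * m)              # total gradient evaluations
    return epochs, m, h, total
```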

Complexity  S2GD complexity: $O((n + \kappa)\log(1/\varepsilon))$ gradient evaluations in total  GD complexity: $O(\kappa\log(1/\varepsilon))$ iterations  $n$ gradient evaluations per iteration  Total for GD: $O(n\kappa\log(1/\varepsilon))$
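For a rough sense of scale (hypothetical numbers, not from the talk): with $n = 10^6$ and $\kappa = 10^4$, GD needs roughly $n\kappa\log(1/\varepsilon) = 10^{10}\log(1/\varepsilon)$ gradient evaluations, while S2GD needs roughly $(n + \kappa)\log(1/\varepsilon) \approx 10^6\log(1/\varepsilon)$ – about a $10^4$-fold saving.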

Related Methods  SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)  Refreshes a single stochastic gradient in each iteration  Needs to store $n$ gradients  Similar convergence rate  Cumbersome analysis  MISO – Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)  Similar to SAG, slightly worse performance  Elegant analysis

Related Methods  SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)  Arises as a special case of S2GD  Prox-SVRG (Tong Zhang, Lin Xiao, 2014)  Extends SVRG to the proximal setting  EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)  Handles simple constraints  Worse convergence rate

Experiment  Example problem, with …