Semi-Stochastic Gradient Descent Methods Jakub Konečný (joint work with Peter Richtárik) University of Edinburgh

Introduction

Large scale problem setting  Problems are often structured  Frequently arising in machine learning  Structure – sum of functions: $\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)$, where $n$ is BIG

Examples  Linear regression (least squares): $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$  Logistic regression (classification): $f_i(x) = \log(1 + \exp(-b_i\, a_i^\top x))$
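To make the examples concrete, here is a minimal NumPy sketch (not from the slides) of the per-function gradients for both objectives; the toy data A, b, the sizes, and all names are hypothetical:

```python
# Hypothetical toy setup for the two example objectives; in the
# talk's setting n would be BIG, here it is small for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))      # rows are the feature vectors a_i
b = rng.choice([-1.0, 1.0], size=n)  # labels b_i

def least_squares_grad_i(x, i):
    """Gradient of f_i(x) = (a_i^T x - b_i)^2 / 2."""
    return (A[i] @ x - b[i]) * A[i]

def logistic_grad_i(x, i):
    """Gradient of f_i(x) = log(1 + exp(-b_i a_i^T x))."""
    z = -b[i] * (A[i] @ x)
    return -b[i] * A[i] / (1.0 + np.exp(-z))

def full_grad(x, grad_i):
    """Gradient of f(x) = (1/n) sum_i f_i(x): costs n gradient evaluations."""
    return np.mean([grad_i(x, i) for i in range(n)], axis=0)
```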

Assumptions  Lipschitz continuity of the derivative of each $f_i$: $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\,\|x - y\|$  Strong convexity of $f$ with parameter $\mu > 0$: $f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|^2$

Gradient Descent (GD)  Update rule: $x_{k+1} = x_k - h\,\nabla f(x_k)$, where $h$ is a step-size parameter  Fast (linear) convergence rate: $f(x_k) - f(x_*) \le (1 - \mu/L)^k\,(f(x_0) - f(x_*))$  Alternatively, for accuracy $\varepsilon$ we need $O(\kappa \log(1/\varepsilon))$ iterations, where $\kappa = L/\mu$  Complexity of a single iteration – $n$ (measured in gradient evaluations)
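A sketch of this update rule, reusing the toy functions from the sketch above; the fixed stepsize is an assumed value, in practice one would take $h \approx 1/L$:

```python
def gd(x0, grad_i, h=0.1, iters=100):
    """Gradient descent: each step costs n gradient evaluations."""
    x = x0.copy()
    for _ in range(iters):
        x -= h * full_grad(x, grad_i)  # full gradient at the current point
    return x

# e.g. x_gd = gd(np.zeros(d), logistic_grad_i)
```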

Stochastic Gradient Descent (SGD)  Update rule: $x_{k+1} = x_k - h_k\,\nabla f_i(x_k)$, with $i$ chosen uniformly at random and $h_k$ a step-size parameter  Why it works: the random gradient is unbiased, $\mathbb{E}_i[\nabla f_i(x)] = \nabla f(x)$  Slow convergence: $O(1/\varepsilon)$ iterations for accuracy $\varepsilon$  Complexity of a single iteration – $1$ (measured in gradient evaluations)
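The corresponding SGD sketch, again reusing the toy setup; the $1/(k+1)$ stepsize decay is one common choice, not necessarily the schedule used in the talk:

```python
def sgd(x0, grad_i, h0=1.0, iters=10_000):
    """SGD: each step costs a single gradient evaluation."""
    x = x0.copy()
    for k in range(iters):
        i = rng.integers(n)                 # uniformly random index
        x -= (h0 / (k + 1)) * grad_i(x, i)  # unbiased gradient estimate
    return x
```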

Goal  GD: fast convergence, but $n$ gradient evaluations in each iteration  SGD: slow convergence, but complexity of each iteration is independent of $n$  Goal: combine the strengths of both in a single algorithm

Semi-Stochastic Gradient Descent (S2GD)

Intuition  The gradient does not change drastically between nearby points  We could reuse the information from an “old” gradient

Modifying the “old” gradient  Imagine someone gives us a “good” point $y$ and $\nabla f(y)$  The gradient at a point $x$ near $y$ can be expressed as $\nabla f(x) = \nabla f(y) + [\nabla f(x) - \nabla f(y)]$ – an already computed gradient plus a gradient change  We can try to estimate the change stochastically: $\nabla f(x) \approx \nabla f(y) + \nabla f_i(x) - \nabla f_i(y)$ for a random index $i$
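The combined estimate from this slide as a one-liner (continuing the toy sketch); note it is unbiased: taking the expectation over $i$ recovers $\nabla f(x)$ exactly:

```python
def s2gd_estimate(x, y, g, grad_i, i):
    """Estimate of grad f(x): stored full gradient g = grad f(y),
    corrected by the change in one randomly chosen component."""
    return g + grad_i(x, i) - grad_i(y, i)
```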

The S2GD Algorithm  Outer loop (epochs): compute and store the full gradient $g = \nabla f(y)$ at the current point $y$  Inner loop: repeat $x \leftarrow x - h\,(g + \nabla f_i(x) - \nabla f_i(y))$ with $i$ chosen uniformly at random, then set $y \leftarrow x$  Simplification: the size of the inner loop is random, following a geometric rule (see the sketch below)
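A minimal sketch of the whole method under this simplification, built on the functions above; the parameter values, and drawing the inner-loop size from a capped geometric law, are my assumptions (the paper's exact distribution over {1, ..., m} differs slightly):

```python
def s2gd(x0, grad_i, h=0.05, m=2 * n, epochs=20):
    """S2GD sketch: per epoch, one full pass plus a few cheap steps."""
    y = x0.copy()
    for _ in range(epochs):
        g = full_grad(y, grad_i)                 # one full pass per epoch
        t = min(int(rng.geometric(1.0 / m)), m)  # random inner-loop size
        x = y.copy()
        for _ in range(t):
            i = rng.integers(n)                  # cheap stochastic step
            x -= h * s2gd_estimate(x, y, g, grad_i, i)
        y = x                                    # next epoch starts from x
    return y
```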

Theorem  Under the assumptions above, S2GD converges linearly in expectation: $\mathbb{E}[f(y_j) - f(x_*)] \le c^j\,(f(y_0) - f(x_*))$, where $y_j$ is the point after $j$ epochs and the rate $c < 1$ depends on the stepsize $h$ and the inner-loop size $m$

Convergence rate  How to set the parameters $h$ (stepsize) and $m$ (inner-loop size)?  The rate $c$ is a sum of two terms: one can be made arbitrarily small by decreasing $h$  For any fixed $h$, the other can be made arbitrarily small by increasing $m$

Setting the parameters  Fix a target accuracy $\varepsilon$  The accuracy is achieved by setting the # of epochs to $O(\log(1/\varepsilon))$, the stepsize $h = O(1/L)$, and the # of inner iterations $m = O(\kappa)$  Total complexity (in gradient evaluations): # of epochs × (one full gradient evaluation + $m$ cheap iterations) $= O((n + \kappa)\log(1/\varepsilon))$
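An illustrative translation of these choices into code; only the scalings ($O(\log(1/\varepsilon))$ epochs, $m \sim \kappa$, $h \sim 1/L$) follow the slide, the constants 10 and 20 are placeholders:

```python
import math

def s2gd_parameters(n, L, mu, eps):
    """Return illustrative S2GD parameters for target accuracy eps."""
    kappa = L / mu                            # condition number
    epochs = math.ceil(math.log(1.0 / eps))   # number of epochs
    m = math.ceil(20 * kappa)                 # cheap inner iterations per epoch
    h = 1.0 / (10 * L)                        # stepsize
    total = epochs * (n + 2 * m)              # total gradient evaluations
    return epochs, m, h, total
```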

Complexity  S2GD complexity: $O((n + \kappa)\log(1/\varepsilon))$ gradient evaluations in total  GD complexity: $O(\kappa\log(1/\varepsilon))$ iterations  $n$ gradient evaluations per iteration  Total for GD: $O(n\kappa\log(1/\varepsilon))$
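For a rough sense of scale (hypothetical numbers, not from the talk): with $n = 10^6$ and $\kappa = 10^4$, GD needs roughly $n\kappa\log(1/\varepsilon) = 10^{10}\log(1/\varepsilon)$ gradient evaluations, while S2GD needs roughly $(n + \kappa)\log(1/\varepsilon) \approx 10^6\log(1/\varepsilon)$ – about a $10^4$-fold saving.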

Related Methods  SAG – Stochastic Average Gradient (Mark Schmidt, Nicolas Le Roux, Francis Bach, 2013)  Refreshes a single stochastic gradient in each iteration  Needs to store $n$ gradients  Similar convergence rate  Cumbersome analysis  MISO – Minimization by Incremental Surrogate Optimization (Julien Mairal, 2014)  Similar to SAG, slightly worse performance  Elegant analysis

Related Methods  SVRG – Stochastic Variance Reduced Gradient (Rie Johnson, Tong Zhang, 2013)  Arises as a special case of S2GD  Prox-SVRG (Tong Zhang, Lin Xiao, 2014)  Extends SVRG to the proximal setting  EMGD – Epoch Mixed Gradient Descent (Lijun Zhang, Mehrdad Mahdavi, Rong Jin, 2013)  Handles simple constraints  Worse convergence rate

Experiment  Example problem, with …