APPROX: Accelerated, Parallel and PROXimal coordinate descent. Peter Richtárik (joint work with Olivier Fercoq, arXiv:1312.5799). IPAM, February 2014.

Contributions

Variants of Randomized Coordinate Descent Methods
Block: can operate on "blocks" of coordinates, as opposed to just on individual coordinates.
General: applies to "general" (= smooth convex) functions, as opposed to special ones such as quadratics.
Proximal: admits a "nonsmooth regularizer" that is kept intact when solving the subproblems; the regularizer is neither smoothed nor approximated.
Parallel: operates on multiple blocks/coordinates in parallel, as opposed to just one block/coordinate at a time.
Accelerated: achieves the O(1/k^2) convergence rate for convex functions, as opposed to O(1/k).
Efficient: avoids adding two full feature vectors.

Brief History of Randomized Coordinate Descent Methods + new long stepsizes

Introduction

I. Block Structure
II. Block Sampling
III. Proximal Setup
IV. Fast or Normal?

I. Block Structure

N = # coordinates (variables) n = # blocks

II. Block Sampling. A block sampling is a random subset of the n blocks; its key parameter is the average number of blocks selected by the sampling.
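For concreteness, one standard choice (the "tau-nice" sampling from the PCDM paper) selects tau of the n blocks uniformly at random, so the average number of selected blocks is exactly tau. A minimal sketch, not part of the slides; the function name and the use of NumPy are my own:

```python
import numpy as np

def tau_nice_sampling(n, tau, rng=None):
    """Select tau of the n blocks uniformly at random (so E|S| = tau)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(n, size=tau, replace=False)
```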

III. Proximal Setup. Minimize F(x) = f(x) + psi(x), where f is the loss (convex & smooth) and psi is the regularizer (convex & nonsmooth).

III. Proximal Setup. Loss functions, examples: quadratic loss, L-infinity, L1 regression, exponential loss, logistic loss, square hinge loss. [BKBG'11, RT'11b, TBRS'13, RT'13a, FR'13]
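As a pointwise illustration of a few of these losses (my addition, not from the slides; function names and the convention that the label b is in {-1, +1} for the classification losses are my own):

```python
import numpy as np

def quadratic_loss(pred, b):
    return 0.5 * (pred - b) ** 2           # least squares

def l1_loss(pred, b):
    return np.abs(pred - b)                # L1 regression

def logistic_loss(pred, b):                # labels b in {-1, +1}
    return np.log1p(np.exp(-b * pred))

def square_hinge_loss(pred, b):            # labels b in {-1, +1}
    return np.maximum(0.0, 1.0 - b * pred) ** 2
```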

III. Proximal Setup. Regularizers, examples: no regularizer, weighted L1 norm (e.g., LASSO), weighted L2 norm, box constraints (e.g., SVM dual).
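Since the proximal variants keep the regularizer intact, each regularizer only enters the method through its (separable) proximal operator. A minimal sketch of two of them (my own names; `step` is the per-coordinate stepsize):

```python
import numpy as np

def prox_weighted_l1(x, w, step):
    """Prox of psi(x) = sum_i w_i * |x_i|: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - step * w, 0.0)

def prox_box(x, lo, hi):
    """Prox of the indicator of the box [lo, hi] (e.g., the SVM dual constraints):
    projection onto the box; independent of the stepsize."""
    return np.clip(x, lo, hi)
```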

The Algorithm

APPROX
Olivier Fercoq and P. R. Accelerated, parallel and proximal coordinate descent, arXiv:1312.5799, December 2013.

Part B: GRADIENT METHODS
B1 Gradient Descent
B2 Projected Gradient Descent
B3 Proximal Gradient Descent (ISTA)
B4 Fast Proximal Gradient Descent (FISTA)
Part C: RANDOMIZED COORDINATE DESCENT
C1 Proximal Coordinate Descent
C2 Parallel Coordinate Descent
C3 Distributed Coordinate Descent
C4 Fast Parallel Coordinate Descent (new)
Olivier Fercoq and P.R. Accelerated, parallel and proximal coordinate descent, arXiv:1312.5799, Dec 2013.
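To make the C4 box concrete, here is a minimal sketch of an APPROX-style iteration for the LASSO problem F(x) = 0.5*||Ax - b||^2 + lam*||x||_1 with unit blocks (N = n) and a dense NumPy matrix A. It follows the accelerated scheme of arXiv:1312.5799 in spirit, but the stepsize weights below are a deliberately conservative choice (not the tighter ESO-based weights discussed later in the talk), and it uses naive full-vector updates rather than the efficient implementation promised by the "Efficient" bullet above; all names are my own.

```python
import numpy as np

def soft_threshold(u, t):
    """Componentwise soft-thresholding: prox of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def approx_lasso_sketch(A, b, lam, tau, iters=1000, seed=0):
    """Sketch of an APPROX-style accelerated parallel proximal coordinate
    descent loop for F(x) = 0.5*||Ax - b||^2 + lam*||x||_1 (unit blocks)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # Conservative per-coordinate stepsize weights (an assumption, not the
    # paper's tighter ESO weights): v_i = tau * ||A[:, i]||^2.
    v = tau * np.sum(A * A, axis=0)
    x = np.zeros(n)
    z = np.zeros(n)
    theta = tau / n
    for _ in range(iters):
        y = (1.0 - theta) * x + theta * z           # interpolation step
        S = rng.choice(n, size=tau, replace=False)  # tau-nice block sampling
        grad_S = A[:, S].T @ (A @ y - b)            # partial derivatives of f at y
        M = n * theta * v[S] / tau                  # prox curvature for the chosen blocks
        z_new_S = soft_threshold(z[S] - grad_S / M, lam / M)
        x = y.copy()
        x[S] = y[S] + (n * theta / tau) * (z_new_S - z[S])
        z[S] = z_new_S
        theta = 0.5 * (np.sqrt(theta**4 + 4.0 * theta**2) - theta**2)
    return x
```

A call such as approx_lasso_sketch(A, b, lam=0.1, tau=8) runs the loop for the default 1000 iterations and returns the last iterate.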

PCDM
P. R. and Martin Takáč. Parallel coordinate descent methods for big data optimization, arXiv preprint, December 2012.
IMA Fox Prize in Numerical Analysis, 2013.

2D Example

Convergence Rate

Theorem [Fercoq & R. 12/2013]: in terms of the average number of coordinates updated per iteration and the number of blocks, the stated number of iterations implies that the expected suboptimality is at most epsilon.
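The precise constant is not recoverable from this transcript, but the O(1/k^2) accelerated rate advertised earlier means the guarantee has roughly the following shape (my paraphrase; tau is the average number of coordinates updated per iteration, n the number of blocks, and C_0 a constant depending on the initial objective gap and the initial distance to a minimizer):

```latex
\[
\mathbb{E}\big[F(x_k) - F^*\big] \;=\; O\!\left(\Big(\tfrac{n}{\tau k}\Big)^{2} C_0\right),
\qquad\text{hence}\qquad
k \;=\; O\!\left(\tfrac{n}{\tau}\sqrt{C_0/\epsilon}\right)
\;\Longrightarrow\;
\mathbb{E}\big[F(x_k) - F^*\big] \le \epsilon .
\]
```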

Special Case: Fully Parallel Variant. All blocks are updated in each iteration; with the weights normalized to sum to n, the stated number of iterations again implies that the expected suboptimality is at most epsilon.

New Stepsizes

Expected Separable Overapproximation (ESO): How to Choose Block Stepsizes?
P. R. and Martin Takáč. Parallel coordinate descent methods for big data optimization, arXiv preprint, December 2012 (PCDM).
Olivier Fercoq and P. R. Smooth minimization of nonsmooth functions by parallel coordinate descent methods, arXiv preprint, September 2013 (SPCDM).
P. R. and Martin Takáč. Distributed coordinate descent methods for learning with big data, arXiv preprint, October 2013.
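For reference, the ESO inequality that these papers use to certify valid block stepsizes v = (v_1, ..., v_n) has, in my paraphrase of the PCDM paper, the following form, where h_[S-hat] keeps the blocks of h indexed by the random sampling S-hat and zeros out the rest:

```latex
\[
\mathbb{E}\Big[f\big(x + h_{[\hat S]}\big)\Big]
\;\le\;
f(x) + \frac{\mathbb{E}|\hat S|}{n}\Big(\langle \nabla f(x), h\rangle + \tfrac12 \|h\|_v^2\Big),
\qquad
\|h\|_v^2 := \sum_{i=1}^{n} v_i \,\|h^{(i)}\|^2 .
\]
```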

Assumptions on the function f: conditions (a), (b), (c) (given as formulas on the slide), together with an example.

Visualizing Assumption (c)

New ESO. Theorem (Fercoq & R. 12/2013), parts (i) and (ii) (formulas given on the slide).

Comparison with Other Stepsizes for Parallel Coordinate Descent Methods (example given on the slide).

Complexity for New Stepsizes. The bound is stated in terms of the average degree of separability and an "average" of the Lipschitz constants; with the new stepsizes, the complexity shown on the slide follows.

Work in 1 Iteration

Cost of 1 Iteration of APPROX. Assume N = n (all blocks are of size 1) and that the data matrix A is sparse. Then the average cost of 1 iteration of APPROX is proportional to the average number of nonzeros in a column of A; evaluating the derivative of a scalar function costs O(1) arithmetic ops.

Bottleneck: Computation of Partial Derivatives (the quantity needed for this is maintained between iterations).
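The reason a single partial derivative is cheap is that a residual-type vector can be kept up to date. A minimal sketch for f(x) = 0.5*||Ax - b||^2 with a sparse matrix stored column-wise (names and the SciPy-based implementation are my own, not the paper's code):

```python
import numpy as np
import scipy.sparse as sp  # A_csc below is expected to be a sp.csc_matrix

def column(A_csc, i):
    """Row indices and values of the nonzeros in column i of a CSC matrix."""
    start, end = A_csc.indptr[i], A_csc.indptr[i + 1]
    return A_csc.indices[start:end], A_csc.data[start:end]

def partial_derivative(A_csc, r, i):
    """i-th partial derivative of 0.5*||Ax - b||^2, given the residual r = Ax - b."""
    rows, vals = column(A_csc, i)
    return vals @ r[rows]                  # cost ~ # nonzeros in column i

def apply_coordinate_update(A_csc, r, i, delta):
    """After x_i += delta, refresh the maintained residual r = Ax - b in place."""
    rows, vals = column(A_csc, i)
    r[rows] += delta * vals                # same cost: nonzeros of column i only
```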

Preliminary Experiments

L1 Regularized L1 Regression, Dorothea dataset: comparison of the Gradient Method, Nesterov's Accelerated Gradient Method, SPCDM, and APPROX (plot on the slide).

L1 Regularized L1 Regression

L1 Regularized Least Squares (LASSO), KDDB dataset: comparison of PCDM and APPROX (plot on the slide).

Training Linear SVMs, Malicious URL dataset (plot on the slide).

Importance Sampling

APPROX with Importance Sampling
Zheng Qu and P. R. Accelerated coordinate descent with importance sampling, manuscript, 2014.
Nonuniform ESO: P. R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods, arXiv preprint, 2013.

Convergence Rate Theorem [Qu & R. 2014]

Serial Case: Optimal Probabilities. Nonuniform serial sampling: optimal probabilities vs. uniform probabilities.
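To the best of my reading of the cited RT'13 paper, the comparison in the serial case is between sampling coordinate i with probability proportional to its coordinate-wise Lipschitz constant L_i and sampling uniformly:

```latex
\[
p_i \;=\; \frac{L_i}{\sum_{j=1}^{n} L_j}
\quad\text{(optimal, nonuniform)}
\qquad\text{vs.}\qquad
p_i \;=\; \frac{1}{n}
\quad\text{(uniform)} .
\]
```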

Extra 40 Slides