Parallel coordinate descent methods
Peter Richtárik
Simons Institute for the Theory of Computing, Berkeley
Parallel and Distributed Algorithms for Inference and Optimization, October 23, 2013

Randomized Coordinate Descent in 2D

2D Optimization. Goal: find the minimizer of a function, shown by its contours.

[Figure sequence: randomized coordinate descent in 2D. At each step the method picks one of the coordinate directions (marked N, S, E, W) at random and moves along it, producing iterates 1 through 7, at which point the minimizer is reached — SOLVED!]
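To make the picture concrete, here is a minimal Python sketch of serial randomized coordinate descent on a 2D quadratic. The quadratic, the exact coordinate-wise minimization step, and the iteration count are my own illustrative choices, not taken from the slides.

```python
import numpy as np

# Minimize f(x) = 0.5 * x^T A x - b^T x  (A symmetric positive definite)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def coordinate_descent(x, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    for _ in range(iters):
        i = rng.integers(n)          # pick a coordinate uniformly at random
        grad_i = A[i] @ x - b[i]     # partial derivative with respect to x_i
        x[i] -= grad_i / A[i, i]     # exact minimization along coordinate i
    return x

x_star = np.linalg.solve(A, b)       # true minimizer, for comparison
print("coordinate descent:", coordinate_descent(np.zeros(2)), " exact:", x_star)
```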

Convergence of Randomized Coordinate Descent, for three classes of objective: strongly convex F; smooth or simple nonsmooth F; difficult nonsmooth F. Focus on the dependence on n (big data = big n).
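For orientation, the iteration complexities typically established for serial randomized coordinate descent in these three regimes (see, e.g., Nesterov's SIAM J. Optimization paper and [RT11b]) are, up to constants, of the following orders; this is a summary of standard results rather than the exact bounds shown on the slide:

```latex
\text{strongly convex } F:\quad O\!\left(n \log \tfrac{1}{\epsilon}\right), \qquad
\text{smooth or simple nonsmooth } F:\quad O\!\left(\tfrac{n}{\epsilon}\right), \qquad
\text{difficult nonsmooth } F:\quad O\!\left(\tfrac{n^2}{\epsilon^2}\right).
```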

Parallelization Dream. Serial vs. parallel: what do we actually get, and what do we WANT? It depends on the extent to which we can add up the individual updates, which in turn depends on the properties of F and on the way the coordinates are chosen at each iteration.

How (not) to Parallelize Coordinate Descent

Naive parallelization: do the same thing as before, but for MORE or ALL coordinates, and ADD UP the updates.

[Figure sequence: failure of naive parallelization. Starting from point 0, two coordinate updates (1a and 1b) are computed independently; adding them up produces point 1, which overshoots. Repeating gives 2a, 2b and point 2, and the iterates bounce around the minimizer — OOPS!]

[Figure: idea — averaging the updates may help. In the same example, combining 1a and 1b by averaging instead of adding lands on the minimizer — SOLVED!]

[Figure sequence: averaging can be too conservative. In a different example, the averaged steps (1, 2, ...) make only small progress per iteration ("and so on..."), which is BAD compared with the point we WANTED to reach.]
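The two failure modes can be reproduced numerically. The following Python sketch is my own illustrative example, not from the slides: on a quadratic with strongly coupled coordinates, adding up the independently computed coordinate updates diverges, while on a nearly separable quadratic, averaging converges but is unnecessarily conservative.

```python
import numpy as np

def make_quadratic(n, rho):
    """f(x) = 0.5 x^T A x with unit diagonal and constant coupling rho off the diagonal."""
    return np.full((n, n), rho) + (1.0 - rho) * np.eye(n)

def parallel_step(A, x, combine):
    """Compute all coordinate minimizers independently, then combine them."""
    h = -(A @ x) / np.diag(A)   # exact update for each coordinate, others held fixed
    return x + combine * h      # combine = 1.0: add up; combine = 1/n: average

def run(A, combine, iters=30):
    x = np.ones(A.shape[0])
    for _ in range(iters):
        x = parallel_step(A, x, combine)
    return np.linalg.norm(x)    # distance to the minimizer x* = 0

n = 4
for rho in (0.9, 0.05):
    A = make_quadratic(n, rho)
    print(f"rho={rho}: summation -> {run(A, 1.0):.2e}, averaging -> {run(A, 1.0 / n):.2e}")

# rho = 0.9 : summation diverges ("OOPS!"), averaging converges.
# rho = 0.05: summation converges quickly; averaging also converges,
#             but is far more conservative (slower) than necessary.
```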

What to do? Figure out when one can safely use averaging and when summation of the updates (where the update to coordinate i is applied along the i-th unit coordinate vector).
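The two candidate update rules, shown as formulas on the slide, presumably take the following standard form, with S_k the sampled set of coordinates, h^(i) the update to coordinate i, and e_i the i-th unit coordinate vector:

```latex
\text{Averaging:}\quad x_{k+1} = x_k + \frac{1}{|S_k|} \sum_{i \in S_k} h^{(i)} e_i,
\qquad
\text{Summation:}\quad x_{k+1} = x_k + \sum_{i \in S_k} h^{(i)} e_i.
```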

Optimization Problems

Problem: minimize the sum of a Loss and a Regularizer. Both are convex (smooth or nonsmooth); the regularizer is separable and is allowed to take infinite values.
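In symbols (reconstructing the formulation that the slide displays; this matches the setting of [RT11b]/[RT12]):

```latex
\min_{x \in \mathbb{R}^n} \; F(x) = f(x) + \Omega(x),
\qquad
\Omega(x) = \sum_{i=1}^n \Omega_i\big(x^{(i)}\big),
\qquad
\Omega_i : \mathbb{R} \to \mathbb{R} \cup \{+\infty\},
```

with f a convex (smooth or nonsmooth) loss and Ω a convex, separable regularizer allowed to take the value +∞.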

Regularizer: examples. No regularizer; weighted L1 norm (e.g., LASSO); weighted L2 norm; box constraints (e.g., SVM dual).
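The standard instances behind these examples (my reconstruction, not copied from the slide) are:

```latex
\Omega(x) = 0, \qquad
\Omega(x) = \sum_i \lambda_i |x^{(i)}| \ \ (\text{weighted } L_1,\ \text{e.g., LASSO}), \qquad
\Omega(x) = \tfrac{1}{2}\sum_i \lambda_i \big(x^{(i)}\big)^2 \ \ (\text{weighted } L_2),
```

```latex
\Omega_i\big(x^{(i)}\big) =
\begin{cases}
0 & x^{(i)} \in [a_i, b_i],\\
+\infty & \text{otherwise}
\end{cases}
\qquad (\text{box constraints, e.g., the SVM dual}).
```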

Loss: examples. Quadratic loss; L-infinity; L1 regression; exponential loss; logistic loss; square hinge loss. (Analyzed in [BKBG11], [RT11b], [TBRS13], [RT13a], [FR13].)
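Written out in their standard forms (the slide shows them as formulas; the notation with data matrix A, rows a_j, targets b and labels y_j is my assumption):

```latex
\text{quadratic:}\ \tfrac{1}{2}\|Ax-b\|_2^2, \qquad
L_\infty:\ \|Ax-b\|_\infty, \qquad
L_1\ \text{regression:}\ \|Ax-b\|_1,
```

```latex
\text{exponential:}\ \sum_j e^{-y_j a_j^\top x}, \qquad
\text{logistic:}\ \sum_j \log\!\big(1+e^{-y_j a_j^\top x}\big), \qquad
\text{square hinge:}\ \sum_j \max\{0,\,1-y_j a_j^\top x\}^2.
```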

Three models for f (each with an associated "small" structural constant): 1. smooth partially separable f [RT11b]; 2. nonsmooth max-type f [FR13]; 3. f with bounded Hessian [BKBG11, RT13a].

General Theory

Randomized Parallel Coordinate Descent Method. At each iteration a random set of coordinates (the sampling) is chosen; the new iterate is obtained from the current iterate by applying, for each chosen coordinate i, the update to the i-th coordinate along the i-th unit coordinate vector.
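In the notation of [RT12] the update rule reads as follows (reconstructed, so treat the exact symbols as my assumption): Ŝ_k is the random set of coordinates, x_k the current iterate, e_i the i-th unit coordinate vector, and h_k^(i) the update computed for coordinate i.

```latex
x_{k+1} = x_k + \sum_{i \in \hat{S}_k} h_k^{(i)} e_i.
```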

ESO: Expected Separable Overapproximation. Definition [RT11b]. The overapproximation is (1) separable in h, so it (2) can be minimized in parallel, and (3) the updates need to be computed only for the sampled coordinates.
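A commonly stated form of the ESO inequality, as in [RT12] (reconstructed here; w is a vector of positive weights, β ≥ 1, and Ŝ is a uniform sampling):

```latex
\mathbf{E}\!\left[ f\!\Big(x + \sum_{i \in \hat{S}} h^{(i)} e_i\Big) \right]
\;\le\;
f(x) + \frac{\mathbf{E}|\hat{S}|}{n}
\left( \langle \nabla f(x), h \rangle + \frac{\beta}{2} \|h\|_w^2 \right),
\qquad
\|h\|_w^2 = \sum_{i=1}^n w_i \big(h^{(i)}\big)^2,
```

with the shorthand (f, Ŝ) ~ ESO(β, w). The right-hand side is separable in h, hence can be minimized in parallel, and only the coordinates i ∈ Ŝ need their updates computed.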

Convergence rate: convex f. Theorem [RT11b]: a bound on the number of iterations — expressed in terms of the number of coordinates, the average number of updated coordinates per iteration, the stepsize parameter and the error tolerance — implies that the target accuracy is reached (with high probability).

Convergence rate: strongly convex f. Theorem [RT11b]: an analogous iteration bound, now involving the strong convexity constants of the loss f and of the regularizer, implies the target accuracy.
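The precise bounds for the two preceding slides are in [RT11b]/[RT12]; in leading order (my paraphrase, suppressing constants and the dependence on the initial point), they read:

```latex
\text{convex } f:\quad k \;\gtrsim\; \frac{n\beta}{\tau}\cdot\frac{1}{\epsilon}
\qquad\Longrightarrow\qquad
\mathbf{P}\big(F(x_k)-F^* \le \epsilon\big) \ge 1-\rho,
```

```latex
\text{strongly convex } f:\quad
k \;\gtrsim\; \frac{n}{\tau}\cdot\frac{\beta+\mu_\Omega}{\mu_f+\mu_\Omega}\cdot\log\frac{1}{\epsilon\rho},
```

where τ = E|Ŝ| is the average number of updated coordinates per iteration, β is the ESO (stepsize) parameter, and μ_f, μ_Ω are the strong convexity constants of the loss and the regularizer.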

Partial Separability and Doubly Uniform Samplings

Serial uniform sampling. Probability law: a single coordinate is chosen, each with equal probability.

τ-nice sampling. Probability law: a subset of τ coordinates is chosen uniformly at random. Good for shared-memory systems.

Doubly uniform sampling. Probability law: the probability of a set depends only on its cardinality. Can model unreliable processors / machines.
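Formally, the probability laws on these three slides are presumably the standard ones from [RT12]:

```latex
\text{serial uniform:}\quad \mathbf{P}\big(\hat{S}=\{i\}\big) = \frac{1}{n},\quad i=1,\dots,n, \qquad
\tau\text{-nice:}\quad \mathbf{P}\big(\hat{S}=S\big) = \binom{n}{\tau}^{-1}\ \text{for all } |S|=\tau,
```

```latex
\text{doubly uniform:}\quad \mathbf{P}\big(\hat{S}=S\big)\ \text{depends on } S \text{ only through } |S|.
```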

ESO for partially separable functions and doubly uniform samplings. Theorem [RT11b], for model 1: smooth partially separable f [RT11b].
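For the special case of a τ-nice sampling and a smooth f that is partially separable of degree ω, the ESO parameter given in [RT12] is (reconstructed; I believe this is the formula the slide shows):

```latex
\beta = 1 + \frac{(\omega-1)(\tau-1)}{n-1} \qquad (n>1),
\qquad (f,\hat{S}) \sim \mathrm{ESO}(\beta, w).
```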

[Figure: theoretical speedup as a function of the degree of partial separability (relative to the number of coordinates) and the number of coordinate updates per iteration. Non-separable (dense) problems: weak or no speedup. Nearly separable (sparse) problems: linear or good speedup. Much of Big Data is here!]
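Using the β from the previous formula, the theoretical speedup of updating τ coordinates per iteration over the serial method is roughly τ/β. A tiny Python sketch for n = 1000 coordinates (matching the following slides; the ω values are arbitrary illustrations):

```python
def speedup(n, omega, tau):
    """Theoretical speedup tau / beta with beta = 1 + (omega-1)(tau-1)/(n-1)."""
    beta = 1.0 + (omega - 1) * (tau - 1) / (n - 1)
    return tau / beta

n = 1000
for omega in (5, 100, 1000):   # degree of partial separability: sparse ... dense
    print(f"omega={omega}:",
          [round(speedup(n, omega, tau), 1) for tau in (1, 10, 100, 1000)])

# Sparse problems (small omega): nearly linear speedup in tau.
# Dense problems (omega close to n): weak or no speedup.
```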

[Figures: theoretical vs. observed speedup for n = 1000 coordinates — "Theory" and "Practice" plots.]

Experiment with a 1 billion-by-2 billion LASSO problem

Optimization with Big Data = Extreme* Mountain Climbing (* in a billion dimensional space on a foggy day).

[Figures: progress of the method on the LASSO experiment, plotted against coordinate updates, iterations, and wall time.]

Distributed-Memory Coordinate Descent

Distributed τ-nice sampling. Probability law: the coordinates are partitioned across machines (Machine 1, Machine 2, Machine 3 in the illustration), and each machine samples from its own part. Good for a distributed version of coordinate descent.
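A minimal Python sketch of this sampling, under my reading of the slide (each machine independently picks τ coordinates uniformly from its own partition; the function name and details are illustrative assumptions):

```python
import numpy as np

def distributed_tau_nice_sampling(n, c, tau, rng):
    """Partition {0,...,n-1} into c groups (one per machine); each machine
    independently samples tau coordinates uniformly from its own group."""
    groups = np.array_split(np.arange(n), c)
    return [rng.choice(g, size=tau, replace=False) for g in groups]

rng = np.random.default_rng(0)
print(distributed_tau_nice_sampling(n=12, c=3, tau=2, rng=rng))
```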

ESO: distributed setting. Theorem [RT13b], for model 3: f with bounded Hessian [BKBG11, RT13a]. The ESO parameter involves the spectral norm of the data.

Bad partitioning at most doubles the number of iterations. Theorem [RT13b]: a bound on the number of iterations — in terms of the spectral norm of the partitioning, the number of nodes, and the number of updates per node — implies the target accuracy, and a poor partitioning increases this bound by at most a factor of two.

LASSO with a 3TB data matrix. Hardware: 128 Cray XE6 nodes, 4 MPI processes per node (c = 512); each node has 2 x 16 cores with 32GB RAM.

Conclusions. Coordinate descent methods scale very well to big data problems of special structure: partial separability (sparsity); small spectral norm of the data; Nesterov separability, ... Care is needed when combining updates (add them up? average?).

References: serial coordinate descent
- Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for L1-regularized loss minimization. JMLR.
- Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2).
- [RT11b] P.R. and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming.
- Rachael Tappenden, P.R. and Jacek Gondzio. Inexact coordinate descent: complexity and preconditioning. arXiv preprint.
- Ion Necoara, Yurii Nesterov, and Francois Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest.
- Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Technical report, Microsoft Research.

References: parallel coordinate descent
- [BKBG11] Joseph Bradley, Aapo Kyrola, Danny Bickson and Carlos Guestrin. Parallel Coordinate Descent for L1-Regularized Loss Minimization. ICML 2011.
- [RT12] P.R. and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv preprint, 2012.
- [TBRS13] Martin Takáč, Avleen Bijral, P.R., and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML 2013.
- [FR13] Olivier Fercoq and P.R. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv preprint, 2013.
- [RT13a] P.R. and Martin Takáč. Distributed coordinate descent method for big data learning. arXiv preprint, 2013.
- [RT13b] P.R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. arXiv preprint, 2013.
Good entry point to the topic (4p paper).

References: parallel coordinate descent (continued)
- P.R. and Martin Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings.
- Rachael Tappenden, P.R. and Burak Buke. Separable approximations and decomposition methods for the augmented Lagrangian. arXiv preprint.
- Indranil Palit and Chandan K. Reddy. Scalable and parallel boosting with MapReduce. IEEE Transactions on Knowledge and Data Engineering, 24(10).
- Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. NIPS.
TALK TOMORROW