Peter Richtárik: Distributed Coordinate Descent for Big Data Optimization. Numerical Algorithms and Intelligent Software, Edinburgh, December 5
Collaborators: Olivier Fercoq (Edinburgh / ParisTech), Zheng Qu (Edinburgh), Martin Takáč (Edinburgh / Lehigh)
P.R. and M. Takáč, Distributed coordinate descent for big data optimization, arXiv: , 2013
O. Fercoq, Z. Qu, P.R. and M. Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Machine Learning for Signal Processing, 2014
Part 1 OPTIMIZATION
Big Data Optimization: minimize a convex function of d variables, where d = BIG.
God’s Algorithm = Teleportation
Iterative Procedure: x0 → x1 → x2 → x3 → ...
SPAM DETECTION
PAGE RANKING
FAKE DETECTION
RECOMMENDER SYSTEMS
GEOTAGGING
Part 2 RANDOMIZED COORDINATE DESCENT IN 2D
2D Optimization: contours of a function. Goal: find the minimizer.
Randomized Coordinate Descent in 2D: at each step, pick one of the coordinate directions (N, S, E, W) at random and move to the best point along that direction; after a handful of such steps (1, 2, ..., 7 in the walk-through) the problem is SOLVED!
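A minimal sketch of the 2D walk-through above, assuming exact minimization along the chosen coordinate of a simple quadratic; the function and starting point are illustrative, not taken from the slides.

import numpy as np

# Illustrative 2D quadratic: f(x) = 0.5 * x^T Q x - b^T x (not from the slides)
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def f(x):
    return 0.5 * x @ Q @ x - b @ x

x = np.array([5.0, -5.0])           # starting point x0
rng = np.random.default_rng(0)

for k in range(50):
    i = rng.integers(2)             # pick a coordinate (E/W <-> i = 0, N/S <-> i = 1) uniformly at random
    # Exact minimization of f along coordinate i:
    # d/dt f(x + t e_i) = Q[i] @ x + t * Q[i, i] - b[i] = 0
    t = (b[i] - Q[i] @ x) / Q[i, i]
    x[i] += t

print(x, f(x))                      # x approaches the minimizer Q^{-1} b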
Part 3 PARALLEL COORDINATE DESCENT
Convergence of Randomized Coordinate Descent: the rate depends on whether f is strongly convex, smooth or 'simple' nonsmooth, or 'difficult' nonsmooth. Focus on the dependence on d (big data = big d).
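For orientation, the standard iteration bounds for serial randomized coordinate descent (cf. Nesterov 2012, RT'11b), quoted here from the literature rather than read off the slide, scale linearly in d:

k = O\left( d \log \tfrac{1}{\epsilon} \right) \;\text{ for strongly convex } f, \qquad k = O\left( \tfrac{d}{\epsilon} \right) \;\text{ for smooth or 'simple' nonsmooth } f,

while the 'difficult' nonsmooth case is substantially slower (typically a 1/\epsilon^2 dependence).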
Parallelization Dream: Serial vs. Parallel. In reality we get something in between.
Failure of naive parallelization: updates 1a and 1b are computed in parallel from the same point 0, but applying both of them simultaneously overshoots the minimizer (OOPS!), and the iterates can keep oscillating without converging.
Idea: averaging the parallel updates may help; it SOLVES the 2D example above in one step.
Averaging may be too conservative, however: on other problems the averaged step (BAD!) is much shorter than the step we WANT, and so on.
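A tiny numerical illustration of the regimes above, using the hypothetical functions f(x, y) = (x + y)^2 (coupled) and f(x, y) = x^2 + y^2 (separable), which are not the functions drawn on the slides: for the coupled function, adding both parallel coordinate updates oscillates forever while averaging them solves the problem in one step; for the separable function, averaging is needlessly conservative.

import numpy as np

# Coupled function f(x, y) = (x + y)^2: the exact minimizer over one coordinate is minus the other.
def step_coupled(x, combine):
    deltas = np.array([-x[1] - x[0], -x[0] - x[1]])  # per-coordinate exact updates, computed "in parallel"
    return x + combine * deltas                      # combine = 1.0 adds the updates, 0.5 averages them

# Separable function f(x, y) = x^2 + y^2: the exact minimizer over each coordinate is 0.
def step_separable(x, combine):
    return x + combine * (-x)

x0 = np.array([1.0, 1.0])
print(step_coupled(x0, 1.0))    # add both updates    -> [-1, -1]; the next step returns to [1, 1]: oscillation
print(step_coupled(x0, 0.5))    # average the updates -> [ 0,  0]: solved
print(step_separable(x0, 1.0))  # add both updates    -> [ 0,  0]: solved
print(step_separable(x0, 0.5))  # average the updates -> [0.5, 0.5]: too conservative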
Part 4 THE PROBLEM (IN MORE DETAIL)
The Problem: minimize f(x) + R(x), a Loss plus a Regularizer.
The loss f is convex and smooth, with A = data matrix:
f(x) + (\nabla f(x))^T h \leq f(x+h) \leq f(x) + (\nabla f(x))^T h + \tfrac{1}{2} h^T A^T A h
The regularizer R is separable, R(x) = \sum_{i=1}^d R_i(x^i), with each R_i convex and closed.
\sigma = \lambda_{\max}\left( D^{-1/2} A^T A D^{-1/2} \right)
Regularizer: examples. No regularizer; weighted L1 norm (e.g., LASSO); weighted L2 norm; box constraints (e.g., SVM dual).
Loss: examples. Quadratic loss, L-infinity regression, L1 regression, exponential loss, logistic loss, square hinge loss. [BKBG'11, RT'11b, TBRS'13, RT'13a, FR'13]
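As a concrete instance of the loss-plus-separable-regularizer structure above, here is a hypothetical LASSO setup (quadratic loss plus weighted L1 norm) with synthetic data; the sizes and variable names are illustrative, not taken from the talk.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 500                      # n examples, d features / coordinates
A = rng.standard_normal((n, d))      # data matrix
b = rng.standard_normal(n)           # labels / right-hand side
lam = 0.1                            # regularization weight

def f(x):                            # smooth convex loss: quadratic loss
    r = A @ x - b
    return 0.5 * r @ r

def R(x):                            # separable regularizer: weighted L1 norm
    return lam * np.abs(x).sum()

def F(x):                            # the composite objective to be minimized
    return f(x) + R(x)

print(F(np.zeros(d)))                # objective value at x = 0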
Part 5 ALGORITHMS & THEORY
DISTRIBUTED COORDINATE DESCENT
Distribution of Data: d = # features / variables / coordinates; the columns of the data matrix (i.e., the coordinates) are partitioned across the compute nodes.
Choice of Coordinates
Random set of coordinates (‘sampling’)
Computing Updates to the Selected Coordinates: a random set of coordinates ('sampling') is drawn, and the update to the i-th coordinate moves the current iterate to the new iterate. All nodes need to be able to compute this (communication).
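A minimal sketch of one such parallel round, assuming the quadratic loss and L1 regularizer from the LASSO example above; each 'node' owns a block of coordinates, samples tau of them, and applies a closed-form (soft-thresholding) coordinate update. The parameter beta stands in for the ESO constant of the next slides; the partitioning, residual bookkeeping, and names are illustrative, not the exact Hydra implementation.

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def parallel_round(A, b, x, lam, owned, tau, beta, rng):
    # One "parallel" round: every node samples tau of its own coordinates and
    # computes its updates from the same current iterate; all updates are then
    # applied at once (synchronizing the residual is the communication step).
    r = A @ x - b                          # residual at the current iterate
    D = (A * A).sum(axis=0)                # D_ii = ||A_i||^2, per-coordinate curvature
    delta = np.zeros_like(x)
    for coords in owned:                   # this loop simulates the c nodes
        S = rng.choice(coords, size=min(tau, len(coords)), replace=False)
        for i in S:
            g = A[:, i] @ r                # i-th partial derivative of the quadratic loss
            step = 1.0 / (beta * D[i])
            delta[i] = soft_threshold(x[i] - step * g, lam * step) - x[i]
    return x + delta

# Usage with the synthetic LASSO data above (c = 4 nodes, tau = 10 coordinates per node):
# owned = np.array_split(np.arange(d), 4)
# x = parallel_round(A, b, np.zeros(d), lam, owned, tau=10, beta=2.0, rng=rng)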
Iteration Complexity, Theorem [RT'13]: a lower bound on the number of iterations k, expressed in terms of d = # coordinates, c = # nodes, τ = # coordinates updated per node, and the strong convexity constants of the loss f and of the regularizer, implies an ε-accurate solution.
Expected Separable Overapproximation (ESO):
\mathbb{E}\left[ f\left(x + h_{[\hat{S}]}\right) \right] \leq f(x) + \frac{\mathbb{E}[|\hat{S}|]}{d} \left( (\nabla f(x))^T h + \tfrac{1}{2} h^T D h \right)
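For orientation, in the multicore PCDM setting of Richtárik & Takáč the matrix D in the ESO can be taken to be of the form D = β · Diag(A^T A), and for the τ-nice sampling (τ coordinates chosen uniformly at random) the constant below suffices; this formula is quoted from the PCDM paper as an assumption about the omitted details, not read off this slide:

\beta = 1 + \frac{(\omega - 1)(\tau - 1)}{d - 1}

where ω is the degree of partial separability of f (for the quadratic loss, the maximum number of nonzero entries in a row of A).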
There is no bad partitioning: the constant β depends on the data and on the partitioning, while β_1 depends on the data only (Theorems [RT'13], [FQRT'14]), and
\tau \geq 2 \;\Rightarrow\; \beta_1 \leq \beta \leq 2\beta_1,
so no choice of partitioning can cost more than a factor of 2.
FAST DISTRIBUTED COORDINATE DESCENT (Hydra²)
Iteration Complexity, Theorem [FR'13, FQRT'14] (d = # coordinates, c = # nodes, τ = # coordinates updated per node):
k \geq \left( \frac{d}{c\tau} \right) \sqrt{\frac{C_1 + C_2}{\epsilon \rho}} + 1
implies an ε-accurate solution.
The Algorithm (Hydra²): a random set of coordinates ('sampling') is drawn, and the update to the i-th coordinate moves the current iterate to the new iterate; all nodes need to be able to compute this (communication). The update to the i-th coordinate is
t_k^i \leftarrow \arg\min_{t\in\mathbb{R}} \left\{ \nabla_i f\left(\theta_k^2 u_k + z_k\right) t + \frac{s\,\theta_k D_{ii}}{2\tau}\, t^2 + R_i\left( z_k^i + t \right) \right\}
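In accelerated coordinate descent schemes of this type (the APPROX method of Fercoq & Richtárik, on which Hydra² builds), the sequence θ_k is typically initialized to the fraction of coordinates updated per iteration, θ_0 = cτ/d, and updated by the recursion below; this is quoted from the APPROX paper as an assumption about details not shown here, not read off the slide:

\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} \; - \; \theta_k^2}{2}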
Part 6 EXPERIMENTS
Experiment 1 Machine: Archer Supercomputer Problem: SVM dual, astro-ph dataset, n = 99,757, d = 29,882 Algorithms: Hydra 2 with c = 32 and tau = 10
4 Stepsizes for Hydra/Hydra 2
Experiment 2 Machine: 1 cluster node with 24 cores Problem: LASSO, n = 2 billion, d = 1 billion Algorithm: Hydra with c = 1 (= PCDM)
P.R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, arXiv: (to appear in Mathematical Programming)
16th IMA Leslie Fox Prize in Numerical Analysis, 2013, 2nd Prize (for M.T.)
[Plots: Coordinate Updates, Iterations, Wall Time]
Experiment 3 Machine: 128 nodes of Hector Supercomputer (4096 cores) Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB Algorithm: Hydra with c = 512 P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv: , 2013
[Plot] LASSO: 3 TB data, 128 nodes
Experiment 4 Machine: 128 nodes of Archer Supercomputer Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A) Algorithm: Hydra 2 with c = 256 Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, Machine Learning for Signal Processing, 2014
[Plot] LASSO: 5 TB data, 128 nodes