Presentation transcript: Peter Richtárik, "Distributed Coordinate Descent for Big Data Optimization", Numerical Algorithms and Intelligent Software, Edinburgh, December 5, 2014.

1 Peter Richtárik: Distributed Coordinate Descent for Big Data Optimization. Numerical Algorithms and Intelligent Software, Edinburgh, December 5, 2014

2 Collaborators: Olivier Fercoq (Edinburgh / ParisTech), Zheng Qu (Edinburgh), Martin Takáč (Edinburgh / Lehigh).
P.R. and M. Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013.
O. Fercoq, Z. Qu, P.R. and M. Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Machine Learning for Signal Processing, 2014.

3 Part 1 OPTIMIZATION

4 Big Data Optimization d = BIG convex

5 God’s Algorithm = Teleportation

6 Iterative Procedure: x_0, x_1, x_2, x_3, ...

7 SPAM DETECTION

8 PAGE RANKING

9 FAKE DETECTION

10 RECOMMENDER SYSTEMS

11 GEOTAGGING

12 Part 2 RANDOMIZED COORDINATE DESCENT IN 2D

13 2D Optimization. Contours of a function. Goal: find the minimizer.

14 Randomized Coordinate Descent in 2D [Figure: contour plot with the four axis search directions N, S, E, W]

15-21 [Figure sequence: seven successive coordinate descent steps, each a move along one of the N/S/E/W axis directions, ending at the minimizer. Solved!]
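
To make the cartoon concrete, here is a minimal sketch of randomized coordinate descent in 2D with exact minimization along the chosen axis; the quadratic objective and iteration count below are assumptions chosen only for illustration.

import numpy as np

# Randomized coordinate descent on f(x) = 0.5 x^T A x - b^T x (illustrative A, b).
rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite Hessian
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x

x = np.array([2.0, -2.0])                # starting point x_0
for k in range(20):
    i = rng.integers(2)                  # pick the E/W (i=0) or N/S (i=1) axis at random
    t = (b[i] - A[i] @ x) / A[i, i]      # exact minimization along that axis
    x[i] += t
print(x, f(x))                           # close to the minimizer A^{-1} b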

22 Part 3 PARALLEL COORDINATE DESCENT

23 Convergence of Randomized Coordinate Descent [Table of rates by problem class: strongly convex f; smooth or 'simple' nonsmooth f; 'difficult' nonsmooth f]. The focus is on the dependence on d (big data = big d).

24 Parallelization Dream: Serial vs. Parallel. In reality we get something in between.

25-29 Failure of naive parallelization [Figure sequence: starting from iterate 0, both coordinate updates 1a and 1b are computed in parallel and applied at once to give iterate 1; repeating with updates 2a and 2b gives iterate 2, which overshoots the minimizer. Oops!]

30 Idea: averaging the updates may help [Figure: starting from iterate 0, the two parallel updates 1a and 1b are averaged, and the resulting iterate 1 lands at the minimizer. Solved!]

31 Averaging can be too conservative [Figure sequence: averaged updates 1a/1b give iterate 1, averaged updates 2a/2b give iterate 2, and so on; progress toward the minimizer is slow.]

32 Averaging may be too conservative [Figure: the averaged step (labelled BAD) is much shorter than the step we want (labelled WANT).]
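
A toy computation (with assumed objectives, not from the talk) shows both effects: naive parallel updates can fail on a coupled function, while plain averaging is needlessly conservative on a separable one.

import numpy as np

# Coupled case: f(x) = (x1 + x2)^2. The exact coordinate update for either
# coordinate is the full step -(x1 + x2).
x = np.array([1.0, 3.0])
for _ in range(3):
    step = -(x[0] + x[1])
    x = x + np.array([step, step])       # naive: apply both full updates at once
    print("naive:", x)                   # oscillates between (1, 3) and (-3, -1)

x = np.array([1.0, 3.0])
step = -(x[0] + x[1])
x = x + 0.5 * np.array([step, step])     # averaged update (divide by 2)
print("averaged:", x)                    # (-1, 1): a minimizer, f = 0

# Separable case: f(x) = x1^2 + x2^2. Full parallel steps solve it in one
# iteration; averaging only halves the distance, i.e. it is too conservative.
x = np.array([1.0, 3.0])
print("full parallel:", x + np.array([-x[0], -x[1]]))        # [0. 0.]
print("averaged:     ", x + 0.5 * np.array([-x[0], -x[1]]))  # [0.5 1.5]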

33 Part 4 THE PROBLEM (IN MORE DETAIL)

34 The Problem: minimize F(x) = f(x) + R(x) over x in R^d.
Loss: f is convex and smooth; with A the data matrix,
f(x) + (\nabla f(x))^T h \leq f(x+h) \leq f(x) + (\nabla f(x))^T h + \tfrac{1}{2} h^T A^T A h
Regularizer: R is separable, R(x) = \sum_{i=1}^d R_i(x^i), and each R_i is convex and closed.
Also define \sigma = \lambda_{\max}(D^{-1/2} A^T A D^{-1/2}).
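
As a concrete instance (an assumption for illustration, not the talk's specific data), take the quadratic loss with an L1 regularizer; for this f the quadratic upper bound above holds with equality, which the sketch checks numerically.

import numpy as np

# F(x) = 0.5*||Ax - b||^2 + lam*||x||_1, with R separable across coordinates.
rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.1
A = rng.standard_normal((n, d))          # data matrix
b = rng.standard_normal(n)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)
R = lambda x: lam * np.sum(np.abs(x))
F = lambda x: f(x) + R(x)

# For this quadratic loss the upper bound
#   f(x + h) <= f(x) + grad_f(x)^T h + 0.5 * h^T A^T A h
# holds with equality; check it numerically at random x, h.
x, h = rng.standard_normal(d), rng.standard_normal(d)
lhs = f(x + h)
rhs = f(x) + grad_f(x) @ h + 0.5 * h @ (A.T @ A) @ h
print(abs(lhs - rhs))                    # ~1e-13 (floating-point error only)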

35 Regularizer: examples. No regularizer; weighted L1 norm (e.g., LASSO); weighted L2 norm; box constraints (e.g., SVM dual).
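
Separability means each R_i can be handled through a one-dimensional proximal step. Below is a sketch of the corresponding proximal operators; the exact weighted forms are assumptions, chosen only to illustrate the examples.

import numpy as np

def prox_weighted_l1(x, step, w):
    # R(x) = sum_i w_i |x_i|  ->  coordinate-wise soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - step * w, 0.0)

def prox_weighted_l2_squared(x, step, w):
    # R(x) = 0.5 * sum_i w_i x_i^2  ->  coordinate-wise shrinkage
    return x / (1.0 + step * w)

def prox_box(x, lo, hi):
    # box constraints lo <= x <= hi (e.g., SVM dual)  ->  projection
    return np.clip(x, lo, hi)

x = np.array([-2.0, 0.3, 1.5])
print(prox_weighted_l1(x, 0.5, np.ones(3)))   # [-1.5  0.   1. ]
print(prox_box(x, 0.0, 1.0))                  # [0.  0.3 1. ]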

36 Loss: examples. Quadratic loss; L-infinity; L1 regression; exponential loss; logistic loss; square hinge loss. (Slide references: BKBG'11, RT'11b, TBRS'13, RT'13a, FR'13.)
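
For reference, illustrative per-example forms of a few of these losses for a linear predictor t = a^T x with label y; these specific formulas are standard conventions assumed here, not taken from the slide.

import numpy as np

def quadratic_loss(t, y):
    return 0.5 * (t - y) ** 2

def logistic_loss(t, y):        # y in {-1, +1}
    return np.log1p(np.exp(-y * t))

def exponential_loss(t, y):     # y in {-1, +1}
    return np.exp(-y * t)

def square_hinge_loss(t, y):    # y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * t) ** 2

print(logistic_loss(0.0, 1), square_hinge_loss(0.5, 1))   # 0.693..., 0.25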

37 Part 5 ALGORITHMS & THEORY

38 DISTRIBUTED COORDINATE DESCENT

39 Distribution of Data. d = # features / variables / coordinates. [Figure: the data matrix, with its coordinates (columns) partitioned across the compute nodes.]

40 Choice of Coordinates

41 Random set of coordinates (‘sampling’)

42 Computing Updates to Selected Coordinates. A random set of coordinates (a 'sampling') is drawn; for each selected coordinate i, an update to the i-th coordinate is computed and added to the current iterate to form the new iterate. All nodes need to be able to compute this (communication).
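
A serial simulation of this update pattern follows; it is a sketch under assumptions (LASSO objective, a simple choice of the stepsize parameters D_ii, no actual message passing), not the authors' implementation. Each "node" samples tau of the coordinates it owns and applies a closed-form coordinate step.

import numpy as np

rng = np.random.default_rng(0)
n, d, c, tau, lam, beta = 100, 40, 4, 5, 0.1, 2.0
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
D = beta * np.sum(A * A, axis=0)              # D_ii = beta * ||A_:i||^2 (assumed)
owned = np.array_split(np.arange(d), c)       # coordinates owned by each node

x = np.zeros(d)
r = A @ x - b                                 # residual, kept in sync with x
for epoch in range(50):
    for coords in owned:                      # in reality the nodes run in parallel
        S = rng.choice(coords, size=tau, replace=False)
        for i in S:
            g = A[:, i] @ r                   # i-th partial derivative of the loss
            u = x[i] - g / D[i]
            x_new = np.sign(u) * max(abs(u) - lam / D[i], 0.0)  # prox of lam|.|
            r += A[:, i] * (x_new - x[i])
            x[i] = x_new
print(0.5 * r @ r + lam * np.sum(np.abs(x)))  # final LASSO objective value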

43 Iteration Complexity. Theorem [RT'13]: the bound (in the slide graphic) is stated in terms of the # of coordinates d, the # of nodes c, the # of coordinates updated per node tau, the strong convexity constant of the loss f and the strong convexity constant of the regularizer; satisfying it implies the stated accuracy guarantee.

44 Expected Separable Overapproximation (ESO):
\mathbb{E}\left[f(x + h_{[\hat{S}]})\right] \leq f(x) + \frac{\mathbb{E}[|\hat{S}|]}{d}\left((\nabla f(x))^T h + \tfrac{1}{2} h^T D h\right)
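
For orientation, one commonly quoted way such a bound is instantiated (recalled here from the parallel coordinate descent literature and best treated as an assumption, not necessarily the exact choice in this talk): for a tau-nice sampling \hat{S} with |\hat{S}| = \tau and a data matrix A with at most \omega nonzeros per row, the ESO holds with D_{ii} = \beta \|A_{:i}\|^2, where

\beta = 1 + \frac{(\omega - 1)(\tau - 1)}{\max(1,\, d - 1)}

For \tau = 1 this reduces to \beta = 1 (the serial stepsize), while denser coupling (larger \omega) forces smaller per-coordinate steps.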

45 There is no bad partitioning. \beta (Theorem [RT'13]) depends on the data and the partitioning; \beta_1 depends on the data only, and Theorem [FQRT'14] shows that
\tau \geq 2 \;\Rightarrow\; \beta_1 \leq \beta \leq 2\beta_1

46 FAST DISTRIBUTED COORDINATE DESCENT 2

47

48 Iteration Complexity. Theorem [FR'13, FQRT'14]: with d = # coordinates, c = # nodes and \tau = # coordinates updated per node,
k \geq \left(\frac{d}{c\tau}\right)\sqrt{\frac{C_1 + C_2}{\epsilon \rho}} + 1
implies the accuracy guarantee (stated in the slide graphic).

49 The Algorithm. A random set of coordinates (a 'sampling') is drawn; the update to the i-th coordinate, taking the current iterate to the new iterate, is
t_k^i \leftarrow \arg\min_{t\in \mathbb{R}}\left\{ \nabla_i f(\theta_k^2 u_k + z_k)\, t + \frac{s\, \theta_k D_{ii}}{2\tau}\, t^2 + R_i(z_k^i + t) \right\}
All nodes need to be able to compute this (communication).
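
A simplified serial sketch of an accelerated step of this shape (APPROX-style), with R = 0, a smooth quadratic loss and assumed stepsize parameters; this is only an illustration, not the authors' Hydra 2 implementation.

import numpy as np

rng = np.random.default_rng(0)
n, d, tau, beta = 200, 50, 5, 2.0
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
v = beta * np.sum(A * A, axis=0)                  # assumed ESO-style stepsizes v_i

x = np.zeros(d)
z = x.copy()
theta = tau / d
for k in range(2000):
    y = (1 - theta) * x + theta * z               # gradient point (plays the role of theta_k^2 u_k + z_k)
    S = rng.choice(d, size=tau, replace=False)    # tau coordinates per round
    grad_S = A[:, S].T @ (A @ y - b)              # partial derivatives at y
    dz = np.zeros(d)
    dz[S] = -(tau / (d * theta)) * grad_S / v[S]  # coordinate steps applied to z
    z = z + dz
    x = y + (d * theta / tau) * dz                # accelerated interpolation step
    theta = 0.5 * (np.sqrt(theta**4 + 4 * theta**2) - theta**2)
print(0.5 * np.sum((A @ x - b) ** 2))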

50 Part 6 EXPERIMENTS

51 Experiment 1 Machine: Archer Supercomputer Problem: SVM dual, astro-ph dataset, n = 99,757, d = 29,882 Algorithms: Hydra 2 with c = 32 and tau = 10

52 4 Stepsizes for Hydra/Hydra 2

53 Experiment 2 Machine: 1 cluster node with 24 cores Problem: LASSO, n = 2 billion, d = 1 billion Algorithm: Hydra with c = 1 (= PCDM). P.R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, arXiv:1212.0873 (to appear in Mathematical Programming). 16th IMA Fox Prize in Numerical Analysis, 2013, 2nd Prize (for M.T.)

54 Coordinate Updates

55 Iterations

56 Wall Time

57 Experiment 3 Machine: 128 nodes of Hector Supercomputer (4096 cores) Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB Algorithm: Hydra with c = 512 P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013

58 LASSO: 3TB data + 128 nodes

59 Experiment 4 Machine: 128 nodes of Archer Supercomputer Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A) Algorithm: Hydra 2 with c = 256 Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, Machine Learning for Signal Processing, 2014

60 LASSO: 5TB data + 128 nodes

