1
Peter Richtárik: Distributed Coordinate Descent for Big Data Optimization. Numerical Algorithms and Intelligent Software, Edinburgh, December 5, 2014

2
Collaborators: Olivier Fercoq (Edinburgh / ParisTech), Zheng Qu (Edinburgh), Martin Takáč (Edinburgh / Lehigh).
P.R. and M. Takáč, Distributed coordinate descent for big data optimization, arXiv:1310.2059, 2013.
O. Fercoq, Z. Qu, P.R. and M. Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Machine Learning for Signal Processing, 2014.

3
Part 1 OPTIMIZATION

4
Big Data Optimization: minimize a convex function of d variables, d = BIG

5
God’s Algorithm = Teleportation

6
Iterative Procedure: x_0 → x_1 → x_2 → x_3 → …

7
SPAM DETECTION

8
PAGE RANKING

9
FAKE DETECTION

10
RECOMMENDER SYSTEMS

11
GEOTAGGING

12
Part 2 RANDOMIZED COORDINATE DESCENT IN 2D

13
2D Optimization: contours of a function. Goal: find the minimizer.

14
Randomized Coordinate Descent in 2D (possible moves: N, S, E, W)

15
Slides 15–21: animation of randomized coordinate descent in 2D. The iterate takes axis-aligned steps 1 through 7, each along a randomly chosen direction (N, S, E or W), until the minimizer is reached. SOLVED!
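The behaviour animated above can be sketched in a few lines. The quadratic Q and all parameter choices below are my own illustrative assumptions, not from the talk: at each step we pick one of the two axes at random and minimize exactly along it.

```python
import random

# f(x) = 0.5 * x^T Q x with positive definite Q, so the minimizer is x* = 0.
Q = [[2.0, 1.0], [1.0, 3.0]]

def f(x):
    return 0.5 * sum(Q[i][j] * x[i] * x[j] for i in range(2) for j in range(2))

def coordinate_descent_2d(x0, iters, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        i = rng.randrange(2)              # random direction: E/W (i=0) or N/S (i=1)
        j = 1 - i
        x[i] = -Q[i][j] * x[j] / Q[i][i]  # exact minimization along coordinate i
    return x

x = coordinate_descent_2d([5.0, -3.0], 200)
```

Because each 1D subproblem is solved exactly, every step can only decrease f; the randomness only affects how quickly the contraction happens.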

22
Part 3 PARALLEL COORDINATE DESCENT

23
Convergence of Randomized Coordinate Descent: three regimes (strongly convex f; smooth or 'simple' nonsmooth f; 'difficult' nonsmooth f). Focus on d (big data = big d).

24
Parallelization Dream: Serial vs. Parallel. In reality we get something in between.

25
Slides 25–29: animation of the failure of naive parallelization. Starting from point 0, updates 1a and 1b are each computed as if applied alone; applying both at once overshoots to point 1, and the next pair (2a, 2b) overshoots again to point 2. OOPS!

30
Idea: averaging the updates may help. Starting from point 0, the average of updates 1a and 1b lands at point 1, the minimizer. SOLVED!

31
Averaging can be too conservative: point 1 is the average of 1a and 1b, point 2 the average of 2a and 2b, and so on...

32
Averaging may be too conservative: the averaged step lands short of the point we WANT. BAD!!!
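Both failure modes can be reproduced on two toy quadratics (my own choices, not the talk's figures): naively applying exact coordinate updates in parallel oscillates when the coordinates are coupled, while averaging is safe but wastes progress when they are independent.

```python
# Exact per-coordinate updates for f(x) = 0.5*(x1 + x2)^2 (fully coupled):
# coordinate i moves to its 1D minimizer given the *current* other coordinate.
def coupled_updates(x):
    return [-x[1] - x[0], -x[0] - x[1]]

def apply_step(x, h, weight):
    return [x[0] + weight * h[0], x[1] + weight * h[1]]

# Naive parallelization: apply both full updates at once -> oscillation.
x = [1.0, 0.0]
for _ in range(10):
    x = apply_step(x, coupled_updates(x), 1.0)
# x is back at (1, 0): the iterates jump between (1, 0) and (0, -1) forever.

# Averaging on a *separable* problem f(x) = 0.5*(x1^2 + x2^2): the full updates
# h = (-x1, -x2) would solve it in one step, but averaging applies only half.
y = [1.0, 1.0]
for _ in range(10):
    y = apply_step(y, [-y[0], -y[1]], 0.5)
# y = (0.5**10, 0.5**10): convergent, but needlessly slow.
```

The point of the theory in the later slides is precisely to pick a data-dependent weight between these two extremes.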

33
Part 4 THE PROBLEM (IN MORE DETAIL)

34
The Problem: minimize f(x) + R(x), where f is the loss and R the regularizer.
f is convex and smooth (A = data matrix):
f(x) + (\nabla f(x))^T h \leq f(x+h) \leq f(x) + (\nabla f(x))^T h + \tfrac{1}{2} h^T A^T A h
R is separable, with each R_i convex and closed:
R(x) = \sum_{i=1}^d R_i(x^i)
\sigma = \lambda_{\max}(D^{-1/2} A^T A D^{-1/2})
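For the quadratic loss f(x) = 0.5 * ||A x - b||^2 the upper bound above holds with equality, which makes it easy to check numerically. The matrix A, vector b and the points x, h below are arbitrary toy values of mine.

```python
A = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]
b = [1.0, -1.0]
d = 3

def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

def f(x):
    res = [yi - bi for yi, bi in zip(matvec(A, x), b)]   # residual A x - b
    return 0.5 * sum(ri * ri for ri in res)

def grad(x):                                             # grad f(x) = A^T (A x - b)
    res = [yi - bi for yi, bi in zip(matvec(A, x), b)]
    return [sum(A[r][c] * res[r] for r in range(len(A))) for c in range(d)]

x = [0.5, -0.25, 1.0]
h = [0.1, -0.2, 0.3]

Ah = matvec(A, h)
model = (f(x) + sum(g * hi for g, hi in zip(grad(x), h))
         + 0.5 * sum(v * v for v in Ah))                 # f(x) + <grad, h> + 0.5 h^T A^T A h
actual = f([xi + hi for xi, hi in zip(x, h)])
# model and actual agree up to rounding
```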

35
Regularizer: examples. No regularizer; weighted L1 norm (e.g., LASSO); weighted L2 norm; box constraints (e.g., SVM dual).
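Separability of R means each regularizer on this slide reduces, inside the algorithm, to an independent one-dimensional proximal problem per coordinate. A sketch of those 1D solutions, with parameter names of my own choosing:

```python
def prox_l1(v, lam):
    # soft-thresholding: argmin_t 0.5*(t - v)^2 + lam*|t|
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def prox_l2_squared(v, lam):
    # shrinkage: argmin_t 0.5*(t - v)^2 + 0.5*lam*t^2
    return v / (1.0 + lam)

def prox_box(v, lo, hi):
    # projection: argmin over t in [lo, hi] of 0.5*(t - v)^2
    return min(max(v, lo), hi)
```

Each node can apply these coordinate-wise to the coordinates it owns, with no communication.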

36
Loss: examples. Quadratic loss, L-infinity, L1 regression, exponential loss, logistic loss, square hinge loss. References: BKBG'11, RT'11b, TBRS'13, RT'13a, FR'13.

37
Part 5 ALGORITHMS & THEORY

38
DISTRIBUTED COORDINATE DESCENT

39
Distribution of Data: the data matrix is split across the nodes; d = # features / variables / coordinates.

40
Choice of Coordinates

41
Random set of coordinates (‘sampling’)

42
Computing Updates to Selected Coordinates. A random set of coordinates (a 'sampling') is chosen; the update to the i-th coordinate takes the current iterate to the new iterate. All nodes need to be able to compute this (communication).

43
Iteration Complexity. Theorem [RT'13]: the iteration bound implies the target accuracy; it depends on the strong convexity constants of the loss f and of the regularizer, on the # coordinates, the # nodes, and the # coordinates updated per node.

44
Expected Separable Overapproximation:
\mathbb{E}\left[f(x + h_{[\hat{S}]})\right] \leq f(x) + \tfrac{\mathbb{E}[|\hat{S}|]}{d}\left((\nabla f(x))^T h + \tfrac{1}{2} h^T D h\right)
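As a sanity check (toy instance of my own), take the simplest sampling: \hat{S} picks a single coordinate uniformly at random, so E|\hat{S}| = 1. For a quadratic f(x) = 0.5 * ||A x||^2 with D = diag(A^T A), the ESO above then holds with equality.

```python
A = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]
d = 3

def f(x):
    return 0.5 * sum(sum(A[r][c] * x[c] for c in range(d)) ** 2
                     for r in range(len(A)))

def grad(x):                         # grad f(x) = A^T A x
    Ax = [sum(A[r][c] * x[c] for c in range(d)) for r in range(len(A))]
    return [sum(A[r][c] * Ax[r] for r in range(len(A))) for c in range(d)]

D = [sum(A[r][c] ** 2 for r in range(len(A))) for c in range(d)]  # diag(A^T A)

x = [1.0, -2.0, 0.5]
h = [0.3, 0.1, -0.4]

# Left-hand side: expectation over the random coordinate i of f(x + h_i e_i)
lhs = sum(f([x[j] + (h[j] if j == i else 0.0) for j in range(d)])
          for i in range(d)) / d

g = grad(x)
rhs = f(x) + (1.0 / d) * (sum(g[i] * h[i] for i in range(d))
                          + 0.5 * sum(D[i] * h[i] ** 2 for i in range(d)))
```

For larger samplings the inequality is no longer tight; the theory's job is to find the smallest safe D for a given sampling and partitioning.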

45
There is no bad partitioning. Theorem [FQRT'14]: \tau \geq 2 \;\Rightarrow\; \beta_1 \leq \beta \leq 2\beta_1, where \beta depends on the data and the partitioning, and \beta_1 (Theorem [RT'13]) depends on the data only.

46
FAST DISTRIBUTED COORDINATE DESCENT (Hydra 2)

47

48
Iteration Complexity. Theorem [FR'13, FQRT'14] (d = # coordinates, c = # nodes, \tau = # coordinates updated / node):
k \geq \left(\frac{d}{c\tau}\right)\sqrt{\frac{C_1 + C_2}{\epsilon\rho}} + 1
implies the target accuracy.

49
The Algorithm. A random set of coordinates (a 'sampling') is chosen; the update to the i-th coordinate takes the current iterate to the new iterate, and all nodes need to be able to compute it (communication):
t_k^i \leftarrow \arg\min_{t \in \mathbb{R}} \left\{ \nabla_i f(\theta_k^2 u_k + z_k)\, t + \frac{s \theta_k D_{ii}}{2\tau} t^2 + R_i(z_k^i + t) \right\}
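Each t_k^i above is a one-dimensional problem of the form argmin_t { g*t + (a/2)*t^2 + R_i(z + t) }, with g the partial derivative and a the stepsize coefficient s*theta_k*D_ii/tau. For the L1 regularizer R_i(w) = lam*|w| (my example choice) it has a closed form via soft-thresholding; the grid search below is only an independent verification of the formula.

```python
def soft_threshold(v, lam):
    return max(abs(v) - lam, 0.0) * (1.0 if v >= 0 else -1.0)

def coordinate_update(g, a, z, lam):
    # substitute w = z + t:  argmin_w  (a/2)*w^2 + (g - a*z)*w + lam*|w|
    w = soft_threshold(z - g / a, lam / a)
    return w - z

g, a, z, lam = 0.7, 2.0, 0.3, 0.5        # arbitrary toy values
t_closed = coordinate_update(g, a, z, lam)

obj = lambda t: g * t + 0.5 * a * t * t + lam * abs(z + t)
t_grid = min((i / 10000.0 - 2.0 for i in range(40001)), key=obj)
```

The closed form is what each node evaluates for every coordinate it owns, so its cost per coordinate is O(1) once the partial derivative is available.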

50
Part 6 EXPERIMENTS

51
Experiment 1. Machine: Archer supercomputer. Problem: SVM dual, astro-ph dataset, n = 99,757, d = 29,882. Algorithm: Hydra 2 with c = 32 and \tau = 10.

52
4 Stepsizes for Hydra/Hydra 2

53
Experiment 2. Machine: 1 cluster node with 24 cores. Problem: LASSO, n = 2 billion, d = 1 billion. Algorithm: Hydra with c = 1 (= PCDM). P.R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, arXiv:1212.0873 (to appear in Mathematical Programming). 16th IMA Leslie Fox Prize in Numerical Analysis, 2013, 2nd Prize (for M.T.).

54
Coordinate Updates

55
Iterations

56
Wall Time

57
Experiment 3. Machine: 128 nodes of the Hector supercomputer (4096 cores). Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB. Algorithm: Hydra with c = 512. P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013.

58
LASSO: 3TB data + 128 nodes

59
Experiment 4. Machine: 128 nodes of the Archer supercomputer. Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A). Algorithm: Hydra 2 with c = 256. Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, Machine Learning for Signal Processing, 2014.

60
LASSO: 5TB data + 128 nodes
