Published by Alexander Jopling. Modified over 2 years ago.

1
Peter Richtárik
Distributed Coordinate Descent for Big Data Optimization
Numerical Algorithms and Intelligent Software, Edinburgh, December 5, 2014

2
Collaborators
Olivier Fercoq (Edinburgh/Paris Tech)
Zheng Qu (Edinburgh)
Martin Takáč (Edinburgh/Lehigh)
P.R. and M. Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013
O. Fercoq, Z. Qu, P.R. and M. Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Machine Learning for Signal Processing, 2014

3
Part 1 OPTIMIZATION

4
Big Data Optimization
[Formula labels: d = BIG; convex]

5
God’s Algorithm = Teleportation

6
Iterative Procedure
$x_0, x_1, x_2, x_3$

7
SPAM DETECTION

8
PAGE RANKING

9
FA’K’E DETECTION

10
RECOMMENDER SYSTEMS

11
GEOTAGGING

12
Part 2 RANDOMIZED COORDINATE DESCENT IN 2D

13
2D Optimization
Contours of a function
Goal: Find the minimizer

14
Randomized Coordinate Descent in 2D
[Figure sequence, slides 14 to 21: contours of the function with compass directions N, S, E, W. Steps 1 through 7 are marked one per slide, and the last slide reads “SOLVED!”]
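
The 2D picture above can be mimicked with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the slides: the quadratic `f`, its minimizer (1, -2), the starting point, and the choice of exact minimization along the chosen axis. Coordinate 0 plays the role of the east-west direction and coordinate 1 the north-south direction.

```python
import random

def f(x):
    # Hypothetical convex quadratic with elliptical contours; minimizer is (1, -2).
    u, v = x[0] - 1.0, x[1] + 2.0
    return u * u + 2.0 * v * v + u * v

def exact_coordinate_step(x, i):
    """Return a copy of x with coordinate i set to its exact 1D minimizer."""
    u, v = x[0] - 1.0, x[1] + 2.0
    if i == 0:
        # d/du (u^2 + 2 v^2 + u v) = 2u + v = 0  =>  u = -v / 2
        return [1.0 - v / 2.0, x[1]]
    # d/dv (u^2 + 2 v^2 + u v) = 4v + u = 0  =>  v = -u / 4
    return [x[0], -2.0 - u / 4.0]

def randomized_cd(x0, iters, seed=0):
    """Run randomized coordinate descent; return final point and objective history."""
    rng = random.Random(seed)
    x = list(x0)
    history = [f(x)]
    for _ in range(iters):
        i = rng.randrange(2)  # pick one of the two axes uniformly at random
        x = exact_coordinate_step(x, i)
        history.append(f(x))
    return x, history

if __name__ == "__main__":
    x, history = randomized_cd([4.0, 3.0], iters=20)
    print(x, history[-1])
```

Because each step minimizes exactly along one axis, the objective never increases; picking the same axis twice in a row simply leaves the point where it is.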

22
Part 3 PARALLEL COORDINATE DESCENT

23
Convergence of Randomized Coordinate Descent
Strongly convex f
Smooth or ‘simple’ nonsmooth f
‘Difficult’ nonsmooth f
Focus on d (big data = big d)

24
Parallelization Dream
Serial vs. Parallel
In reality we get something in between

25
Failure of naive parallelization
[Figure sequence, slides 25 to 29: from point 0, two updates 1a and 1b are computed and combined into point 1; from point 1, updates 2a and 2b are combined into point 2. The last slide reads “OOPS!”]

30
Idea: averaging updates may help
[Figure: points 0, 1a, 1b and 1. “SOLVED!”]

31
Averaging can be too conservative
[Figure: points 0, 1a, 1b, 1, 2a, 2b, 2, and so on...]

32
Averaging may be too conservative
[Figure labels: WANT; BAD!!!]
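
The trade-off between summing and averaging parallel updates can be reproduced with a small sketch. The two test functions are assumptions chosen for illustration (the slides show only pictures): a coupled function f(x) = (x1 + x2 - 1)^2, where summing the two coordinate updates overshoots, and a separable function f(x) = x1^2 + x2^2, where averaging is needlessly slow.

```python
def parallel_step(grad, curvature, x, combine):
    """One parallel step on a quadratic.

    Every coordinate computes its exact 1D minimizing step from the same
    point x. With combine == "sum" the steps are added as they are (naive
    parallelization); with combine == "average" they are scaled by 1/len(x).
    """
    g = grad(x)
    steps = [-g[i] / curvature[i] for i in range(len(x))]
    scale = 1.0 if combine == "sum" else 1.0 / len(x)
    return [x[i] + scale * steps[i] for i in range(len(x))]

# Hypothetical coupled example: f(x) = (x1 + x2 - 1)^2
def f_coupled(x):
    return (x[0] + x[1] - 1.0) ** 2

def grad_coupled(x):
    r = 2.0 * (x[0] + x[1] - 1.0)
    return [r, r]

# Hypothetical separable example: f(x) = x1^2 + x2^2
def f_separable(x):
    return x[0] ** 2 + x[1] ** 2

def grad_separable(x):
    return [2.0 * x[0], 2.0 * x[1]]

# Second derivative along each coordinate; it is 2 for both functions.
CURVATURE = [2.0, 2.0]

if __name__ == "__main__":
    for name in ("sum", "average"):
        x = [0.0, 0.0]
        for _ in range(2):
            x = parallel_step(grad_coupled, CURVATURE, x, name)
        print("coupled", name, x, f_coupled(x))
        y = [1.0, 1.0]
        for _ in range(2):
            y = parallel_step(grad_separable, CURVATURE, y, name)
        print("separable", name, y, f_separable(y))
```

On the coupled function, summing jumps from (0, 0) to (1, 1) and back without improving the objective, while averaging lands on a minimizer in one step. On the separable function the roles flip: summing solves the problem in one step and averaging only halves the distance each time.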

33
Part 4 THE PROBLEM (IN MORE DETAIL)

34
The Problem
$\min_{x \in \mathbb{R}^d} f(x) + R(x)$, where f is the loss and R is the regularizer
f is convex
f is smooth: $f(x) + (\nabla f(x))^T h \leq f(x+h) \leq f(x) + (\nabla f(x))^T h + \tfrac{1}{2} h^T A^T A h$
A = data matrix
R is separable: $R(x) = \sum_{i=1}^d R_i(x^i)$
$R_i$ is convex and closed
$\sigma = \lambda_{\text{max}}(D^{-1/2} A^T A D^{-1/2})$

35
Regularizer: examples
No regularizer
Weighted L1 norm (e.g., LASSO)
Weighted L2 norm
Box constraints (e.g., SVM dual)

36
Loss: examples
Quadratic loss
L-infinity
L1 regression
Exponential loss
Logistic loss
Square hinge loss
References: BKBG’11, RT’11b, TBRS’13, RT’13a, FR’13

37
Part 5 ALGORITHMS & THEORY

38
DISTRIBUTED COORDINATE DESCENT

39
Distribution of Data
d = # features / variables / coordinates
[Figure: data matrix]

40
Choice of Coordinates

41
Random set of coordinates (‘sampling’)

42
Computing Updates to Selected Coordinates
[Formula labels: random set of coordinates (‘sampling’); current iterate; new iterate; update to i-th coordinate]
All nodes need to be able to compute this (communication)
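
The choice of coordinates in a distributed setting can be sketched as follows. This is an illustration under stated assumptions, not the implementation behind the slides: coordinates are split into equal contiguous blocks (one per node), and each node independently draws the same number `tau` of its own coordinates uniformly at random. The names `partition_coordinates` and `distributed_sampling` are hypothetical.

```python
import random

def partition_coordinates(d, c):
    """Split coordinates 0..d-1 into c contiguous blocks of equal size."""
    if d % c != 0:
        raise ValueError("this sketch assumes c divides d")
    s = d // c  # coordinates owned by each node
    return [list(range(k * s, (k + 1) * s)) for k in range(c)]

def distributed_sampling(partition, tau, rng):
    """Each node picks tau of its own coordinates uniformly at random."""
    return [sorted(rng.sample(block, tau)) for block in partition]

if __name__ == "__main__":
    rng = random.Random(0)
    partition = partition_coordinates(d=12, c=3)
    for node, chosen in enumerate(distributed_sampling(partition, 2, rng)):
        print("node", node, "updates coordinates", chosen)
```

In each iteration a total of c * tau coordinates are updated, and a node only ever touches coordinates from its own block.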

43
Iteration Complexity
Theorem [RT’13]
[Formula residue: one bound “implies” another. Annotated quantities: # coordinates; # nodes; # coordinates updated / node; strong convexity constant of the loss f; strong convexity constant of the regularizer]

44
Expected Separable Overapproximation
$$\mathbb{E}\left[f(x+h_{[{\color{blue}\hat{S}}]})\right] \leq f(x) + \tfrac{\mathbb{E}[|{\color{blue}\hat{S}}|]}{d}\left((\nabla f(x))^T h + \tfrac{1}{2} h^T {\color{red}D} h\right)$$
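
A small numerical check can make the inequality concrete. The sketch below works under assumptions that are not stated on the slide: the loss is the quadratic f(x) = ½‖Ax − b‖², the sampling picks a single coordinate uniformly at random (so the expected sample size is 1), and D is taken to be the diagonal of AᵀA. In this special case the two sides coincide, which the code verifies on hypothetical data.

```python
def residual(A, b, x):
    """Compute A x - b for a dense matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]

def f(A, b, x):
    """Quadratic loss 0.5 * ||A x - b||^2."""
    return 0.5 * sum(r * r for r in residual(A, b, x))

def grad(A, b, x):
    """Gradient A^T (A x - b)."""
    r = residual(A, b, x)
    return [sum(A[j][i] * r[j] for j in range(len(A))) for i in range(len(x))]

def eso_sides(A, b, x, h):
    """Return (lhs, rhs) of the inequality for uniform single-coordinate sampling."""
    d = len(x)
    # Left side: average of f(x + h_i e_i) over the d equally likely coordinates.
    lhs = 0.0
    for i in range(d):
        y = list(x)
        y[i] += h[i]
        lhs += f(A, b, y) / d
    # Right side with E|S| = 1 and D = Diag(A^T A) (an assumption of this sketch).
    g = grad(A, b, x)
    D = [sum(A[j][i] ** 2 for j in range(len(A))) for i in range(d)]
    linear = sum(gi * hi for gi, hi in zip(g, h))
    quadratic = 0.5 * sum(Di * hi * hi for Di, hi in zip(D, h))
    rhs = f(A, b, x) + (1.0 / d) * (linear + quadratic)
    return lhs, rhs

if __name__ == "__main__":
    A = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]  # hypothetical data matrix
    b = [1.0, 0.0, 2.0]
    print(eso_sides(A, b, x=[0.5, -1.0], h=[0.3, 0.7]))
```

Larger random samples generally need a larger D for the bound to hold, since updates to different coordinates then interact through the off-diagonal entries of AᵀA.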

45
There is no bad partitioning
Theorem [FQRT’14], Theorem [RT’13]
[Formula labels: depends on the data and the partitioning; depends on the data]
$\tau \geq 2 \; \Rightarrow \; \beta_1 \leq {\color{red}\beta} \leq 2\beta_1$

46
FAST DISTRIBUTED COORDINATE DESCENT 2

48
Iteration Complexity
Theorem [FR’13, FQRT’14]
$$k \geq \left(\frac{d}{c\tau}\right) \sqrt{\frac{C_1+C_2}{{\color{blue}\epsilon}{\color{red}\rho}}} + 1$$
implies [formula: resulting guarantee]
d = # coordinates, c = # nodes, $\tau$ = # coordinates updated / node
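
To see how the bound scales with the amount of parallelism, it can simply be evaluated. The sketch below is only arithmetic on the displayed expression; the slide does not define C_1, C_2 or ρ, so the numbers passed in are hypothetical placeholders.

```python
import math

def iteration_bound(d, c, tau, C1, C2, eps, rho):
    """Evaluate (d / (c * tau)) * sqrt((C1 + C2) / (eps * rho)) + 1."""
    return (d / (c * tau)) * math.sqrt((C1 + C2) / (eps * rho)) + 1

if __name__ == "__main__":
    # Hypothetical constants; only the dependence on c and tau is of interest.
    for c in (10, 20, 40):
        print(c, iteration_bound(d=1000, c=c, tau=10, C1=1.0, C2=1.0, eps=0.5, rho=1.0))
```

With the other quantities held fixed, doubling the number of nodes c (or the per-node batch size τ) halves the leading term of the bound.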

49
The Algorithm
[Formula labels: random set of coordinates (‘sampling’); current iterate; new iterate; update to i-th coordinate]
All nodes need to be able to compute this (communication)
$$t_k^i \leftarrow \arg \min_{t\in \mathbb{R}}\left\{ \nabla_i f(\theta_k^2 u_k + z_k)\, t + \frac{ s \theta_k D_{ii} }{ 2\tau } t^2 + R_i( z_k^i + t ) \right\}$$
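
The one-dimensional subproblem in the update has a closed-form solution for some common regularizers. The sketch below covers two cases as an illustration: an L1 term R_i(y) = λ|y| and a box constraint lo ≤ y ≤ hi. Here `g` stands for the partial derivative in the formula and `beta` for the quadratic coefficient s θ_k D_ii / τ; the function names and the sample numbers are hypothetical.

```python
def soft_threshold(v, kappa):
    """Shrink v toward zero by kappa (the proximal map of kappa * |.|)."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0

def update_l1(g, beta, z, lam):
    """argmin over t of  g*t + (beta/2)*t^2 + lam*|z + t|,  with beta > 0."""
    # Substituting y = z + t turns the problem into a prox of lam*|y| at z - g/beta.
    return soft_threshold(z - g / beta, lam / beta) - z

def update_box(g, beta, z, lo, hi):
    """argmin over t of  g*t + (beta/2)*t^2  subject to  lo <= z + t <= hi."""
    # The unconstrained minimizer z - g/beta is clipped to the interval [lo, hi].
    return min(max(z - g / beta, lo), hi) - z

if __name__ == "__main__":
    print(update_l1(g=1.0, beta=2.0, z=0.0, lam=0.5))
    print(update_box(g=-4.0, beta=2.0, z=0.5, lo=0.0, hi=1.0))
```

With no regularizer the update reduces to t = -g / beta, which both functions recover when `lam` is 0 or the box is wide enough.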

50
Part 6 EXPERIMENTS

51
Experiment 1
Machine: Archer Supercomputer
Problem: SVM dual, astro-ph dataset, n = 99,757, d = 29,882
Algorithms: Hydra 2 with c = 32 and tau = 10

52
4 Stepsizes for Hydra/Hydra 2

53
Experiment 2
Machine: 1 cluster node with 24 cores
Problem: LASSO, n = 2 billion, d = 1 billion
Algorithm: Hydra with c = 1 (=PCDM)
P.R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, arXiv:1212.0873 (to appear in Mathematical Programming)
16th IMA Fox Prize in Numerical Analysis, 2013, 2nd Prize (for M.T.)

54
Coordinate Updates

55
Iterations

56
Wall Time

57
Experiment 3
Machine: 128 nodes of Hector Supercomputer (4096 cores)
Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB
Algorithm: Hydra with c = 512
P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv:1310.2059, 2013

58
LASSO: 3TB data + 128 nodes

59
Experiment 4
Machine: 128 nodes of Archer Supercomputer
Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A)
Algorithm: Hydra 2 with c = 256
Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, Machine Learning for Signal Processing, 2014

60
LASSO: 5TB data + 128 nodes
