Peter Richtárik: Distributed Coordinate Descent For Big Data Optimization. Numerical Algorithms and Intelligent Software, Edinburgh, December 5, 2014


1 Peter Richtárik: Distributed Coordinate Descent For Big Data Optimization. Numerical Algorithms and Intelligent Software, Edinburgh, December 5, 2014

2 Collaborators: Olivier Fercoq (Edinburgh/Paris Tech), Zheng Qu (Edinburgh), Martin Takáč (Edinburgh/Lehigh). P.R. and M. Takáč, Distributed coordinate descent for big data optimization, arXiv: , 2013. O. Fercoq, Z. Qu, P.R. and M. Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, IEEE Machine Learning for Signal Processing, 2014.

3 Part 1 OPTIMIZATION

4 Big Data Optimization d = BIG convex

5 God’s Algorithm = Teleportation

6 Iterative Procedure $x_0, x_1, x_2, x_3$

7 SPAM DETECTION

8 PAGE RANKING

9 FA’K’E DETECTION

10 RECOMMENDER SYSTEMS

11 GEOTAGGING

12 Part 2 RANDOMIZED COORDINATE DESCENT IN 2D

13 Find the minimizer 2D Optimization Contours of a function Goal:

14 Randomized Coordinate Descent in 2D N S E W

15 1 N S E W

16 1 N S E W 2

17 1 2 3 N S E W

18 N S E W

19 N S E W 5

20 N S E W

21 N S E W 6 7 SOLVED!
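
The 2D walk-through on slides 13 to 21 can be sketched in a few lines. This is only an illustration under assumptions: the slides do not give the function being minimized, so the sketch uses a hypothetical strictly convex quadratic, and the names `randomized_cd_2d`, `a`, `b`, `c`, `p`, `q` are invented here. Each step picks one of the two axes at random (East-West or North-South) and minimizes exactly along it.

```python
import random

def randomized_cd_2d(a, b, c, p, q, x, steps, seed=0):
    """Minimize f(x) = 0.5*(a*x0^2 + 2*b*x0*x1 + c*x1^2) - p*x0 - q*x1
    by exact minimization along one randomly chosen axis per step.
    Assumes a > 0, c > 0 and a*c > b*b (strictly convex quadratic)."""
    rng = random.Random(seed)
    x = list(x)
    for _ in range(steps):
        i = rng.randrange(2)           # 0: move East-West, 1: move North-South
        if i == 0:
            x[0] = (p - b * x[1]) / a  # solves df/dx0 = 0 with x1 fixed
        else:
            x[1] = (q - b * x[0]) / c  # solves df/dx1 = 0 with x0 fixed
    return x

# Hypothetical contours: the minimizer solves 2*x0 + x1 = 1, x0 + 2*x1 = 1.
x_final = randomized_cd_2d(2.0, 1.0, 2.0, 1.0, 1.0, [5.0, -3.0], steps=200)
```

With these assumed coefficients the iterates approach the point (1/3, 1/3); repeating the same axis twice in a row changes nothing, which is why progress only happens when the direction switches.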

22 Part 3 PARALLEL COORDINATE DESCENT

23 Convergence of Randomized Coordinate Descent: strongly convex f; smooth or ‘simple’ nonsmooth f; ‘difficult’ nonsmooth f. Focus on d (big data = big d)

24 Parallelization Dream: Serial vs. Parallel. In reality we get something in between

25 Failure of naive parallelization 1a 1b 0

26 Failure of naive parallelization 1 1a 1b 0

27 Failure of naive parallelization 1 2a 2b

28 Failure of naive parallelization 1 2a 2b 2

29 Failure of naive parallelization 2 OOPS!

30 1 1a 1b 0 Idea: averaging updates may help SOLVED!

31 Averaging can be too conservative 1a 1b 0 1 2a 2b 2 and so on...

32 Averaging may be too conservative WANT BAD!!!
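
The two failure modes on slides 25 to 32 can be reproduced with a toy sketch. The objectives below are assumptions chosen to mimic the pictures, not functions taken from the talk, and `parallel_step`, `grad_coupled`, `grad_separable` are hypothetical names. Every coordinate computes its exact one-dimensional update from the same point; the updates are then either summed (naive) or averaged.

```python
def parallel_step(grad, curv, x, mode):
    """One parallel step: each coordinate computes its own exact 1D update
    from the same point x; the updates are then summed or averaged."""
    g = grad(x)
    h = [-g[i] / curv[i] for i in range(len(x))]    # exact per-coordinate steps
    scale = 1.0 if mode == "sum" else 1.0 / len(x)  # "sum" or "average"
    return [x[i] + scale * h[i] for i in range(len(x))]

# Assumed coupled objective f(x) = (x0 + x1 - 1)^2: summing overshoots.
grad_coupled = lambda x: [2 * (x[0] + x[1] - 1)] * 2
# Assumed separable objective f(x) = (x0 - 1)^2 + (x1 - 1)^2: averaging is timid.
grad_separable = lambda x: [2 * (x[0] - 1), 2 * (x[1] - 1)]
curv = [2.0, 2.0]  # second derivative along each coordinate in both examples

start = [0.0, 0.0]
overshoot = parallel_step(grad_coupled, curv, start, "sum")      # jumps past the valley
fixed = parallel_step(grad_coupled, curv, start, "average")      # lands on a minimizer
timid = parallel_step(grad_separable, curv, start, "average")    # only halfway there
```

On the coupled example the summed step bounces between (0, 0) and (1, 1) forever, while averaging solves it in one step. On the separable example the summed step is exact and averaging only halves the distance each time, which is the "too conservative" case.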

33 Part 4 THE PROBLEM (IN MORE DETAIL)

34 The Problem: minimize $f(x) + R(x)$ over $x \in \mathbb{R}^d$, where f is the loss and R is the regularizer. f is convex and smooth: $f(x) + (\nabla f(x))^T h \leq f(x+h) \leq f(x) + (\nabla f(x))^T h + \tfrac{1}{2} h^T A^T A h$, where A = data matrix. R is separable, $R(x) = \sum_{i=1}^d R_i(x^i)$, and each $R_i$ is convex and closed. $\sigma = \lambda_{\max}(D^{-1/2} A^T A D^{-1/2})$
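
The quantity σ on this slide can be computed directly for a small dense matrix. The sketch below assumes that D is the diagonal of A^T A, which the transcript does not state, and that every column of A is nonzero; the function name `sigma` is chosen here for illustration. Under that assumption the scaled matrix has unit diagonal, so σ lies between 1 (orthogonal columns) and d (all columns parallel).

```python
import numpy as np

def sigma(A):
    """lambda_max(D^{-1/2} A^T A D^{-1/2}), assuming D = Diag(A^T A)."""
    M = A.T @ A
    d_inv_sqrt = 1.0 / np.sqrt(np.diag(M))     # requires nonzero columns
    N = M * np.outer(d_inv_sqrt, d_inv_sqrt)   # symmetric diagonal scaling
    return float(np.linalg.eigvalsh(N)[-1])    # eigvalsh returns ascending order

sigma_orthogonal = sigma(np.eye(3))        # columns do not interact
sigma_parallel = sigma(np.ones((4, 3)))    # three identical columns
```

A dense eigenvalue solve is only practical for small d; it is used here to make the definition concrete, not as a recipe for the problem sizes in the experiments.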

35 Regularizer: examples. No regularizer; weighted L1 norm (e.g., LASSO); weighted L2 norm; box constraints (e.g., SVM dual)

36 Loss: examples. Quadratic loss, L-infinity, L1 regression, exponential loss, logistic loss, square hinge loss (BKBG’11, RT’11b, TBRS’13, RT’13a, FR’13)

37 Part 5 ALGORITHMS & THEORY

38 DISTRIBUTED COORDINATE DESCENT

39 Distribution of Data d = # features / variables / coordinates Data matrix

40 Choice of Coordinates

41 Random set of coordinates (‘sampling’)

42 Computing Updates to Selected Coordinates: random set of coordinates (‘sampling’); current iterate; new iterate; update to i-th coordinate. All nodes need to be able to compute this (communication)
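
One way the distributed sampling of slides 39 to 42 might look is sketched below. The details are assumptions, not taken from the talk: coordinates are split into equal contiguous blocks, one per node, and each node draws its coordinates uniformly without replacement. The name `distributed_sampling` is hypothetical; d, c and tau stand for the number of coordinates, the number of nodes, and the number of coordinates updated per node.

```python
import random

def distributed_sampling(d, c, tau, rng):
    """Split coordinates 0..d-1 into c equal contiguous blocks (one per node)
    and let each node pick tau distinct coordinates from its own block."""
    if d % c != 0:
        raise ValueError("this sketch assumes c divides d")
    s = d // c                                  # coordinates owned by each node
    return [sorted(rng.sample(range(node * s, (node + 1) * s), tau))
            for node in range(c)]

# 12 coordinates, 3 nodes, 2 coordinates updated per node and iteration.
picked = distributed_sampling(12, 3, 2, random.Random(0))
```

Each inner list holds the coordinates one node would update in the current iteration; no coordinate is ever touched by two nodes, because the blocks do not overlap.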

43 Iteration Complexity (Theorem [RT’13]): the bound involves # coordinates, # nodes, # coordinates updated / node, the strong convexity constant of the regularizer, and the strong convexity constant of the loss f; it implies the convergence guarantee shown on the slide

44 Expected Separable Overapproximation: $\mathbb{E}\left[f(x+h_{[\hat{S}]})\right] \leq f(x) + \tfrac{\mathbb{E}[|\hat{S}|]}{d}\left((\nabla f(x))^T h + \tfrac{1}{2} h^T D h\right)$

45 There is no bad partitioning (Theorem [FQRT’14], Theorem [RT’13]): $\beta$ depends on the data and the partitioning, $\beta_1$ depends on the data, and $\tau \geq 2 \;\Rightarrow\; \beta_1 \leq \beta \leq 2\beta_1$

46 FAST DISTRIBUTED COORDINATE DESCENT 2

47

48 Iteration Complexity (Theorem [FR’13, FQRT’14]): $k \geq \left(\frac{d}{c\tau}\right)\sqrt{\frac{C_1+C_2}{\epsilon\rho}} + 1$, where d = # coordinates, c = # nodes, and $\tau$ = # coordinates updated / node; this bound implies the convergence guarantee shown on the slide
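
To get a feel for the bound on this slide, it can be evaluated numerically. The transcript does not define C_1, C_2, ε or ρ, so the sketch below simply treats them as given positive numbers; the function name and the sample values are made up for illustration.

```python
import math

def hydra2_iteration_bound(d, c, tau, C1, C2, eps, rho):
    """Smallest integer k with k >= (d/(c*tau)) * sqrt((C1+C2)/(eps*rho)) + 1."""
    return math.ceil(d / (c * tau) * math.sqrt((C1 + C2) / (eps * rho)) + 1)

# Made-up inputs: doubling tau roughly halves the number of iterations.
k_tau10 = hydra2_iteration_bound(1000, 10, 10, 3.0, 1.0, 0.25, 1.0)
k_tau20 = hydra2_iteration_bound(1000, 10, 20, 3.0, 1.0, 0.25, 1.0)
```

The leading factor d/(cτ) is the number of iterations needed to touch every coordinate once on average, which is why more nodes or more updates per node reduce the bound proportionally.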

49 The Algorithm: random set of coordinates (‘sampling’); current iterate; new iterate; update to i-th coordinate: $t_k^i \leftarrow \arg\min_{t\in\mathbb{R}}\left\{ \nabla_i f(\theta_k^2 u_k + z_k)\, t + \frac{s \theta_k D_{ii}}{2\tau} t^2 + R_i(z_k^i + t) \right\}$. All nodes need to be able to compute this (communication)
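
For a concrete regularizer the one-dimensional minimization on this slide has a closed form. The sketch below assumes the L1 case R_i(x) = lam * |x| (as in LASSO); the names are hypothetical: `g` stands for the partial derivative, `quad` for the whole coefficient s θ_k D_ii / τ passed in as one positive number, and `z` for z_k^i. The solution is the usual soft-thresholding step.

```python
def l1_coordinate_update(g, quad, z, lam):
    """Closed-form argmin over t of  g*t + (quad/2)*t**2 + lam*abs(z + t),
    assuming quad > 0 and lam >= 0."""
    v = z - g / quad                       # minimizer if the L1 term were absent
    shrunk = max(abs(v) - lam / quad, 0.0) # soft-thresholding of the magnitude
    u = shrunk if v > 0 else -shrunk       # restore the sign; u is the new z + t
    return u - z

t_zeroed = l1_coordinate_update(g=1.0, quad=2.0, z=0.3, lam=0.5)   # drives z + t to 0
t_shrunk = l1_coordinate_update(g=-4.0, quad=2.0, z=0.0, lam=1.0)  # shortened step
```

In the first call the threshold wins and the coordinate is set exactly to zero, which is how L1 regularization produces sparse solutions; in the second the unregularized step of 2 is shortened to 1.5.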

50 Part 6 EXPERIMENTS

51 Experiment 1. Machine: Archer Supercomputer. Problem: SVM dual, astro-ph dataset, n = 99,757, d = 29,882. Algorithms: Hydra 2 with c = 32 and tau = 10

52 4 Stepsizes for Hydra/Hydra 2

53 Experiment 2. Machine: 1 cluster node with 24 cores. Problem: LASSO, n = 2 billion, d = 1 billion. Algorithm: Hydra with c = 1 (= PCDM). P.R. and Martin Takáč, Parallel coordinate descent methods for big data optimization, arXiv: (to appear in Mathematical Programming). 16th IMA Fox Prize in Numerical Analysis, 2013, 2nd Prize (for M.T.)

54 Coordinate Updates

55 Iterations

56 Wall Time

57 Experiment 3. Machine: 128 nodes of Hector Supercomputer (4096 cores). Problem: LASSO, n = 1 billion, d = 0.5 billion, 3 TB. Algorithm: Hydra with c = 512. P.R. and Martin Takáč, Distributed coordinate descent method for learning with big data, arXiv: , 2013

58 LASSO: 3TB data nodes

59 Experiment 4. Machine: 128 nodes of Archer Supercomputer. Problem: LASSO, n = 5 million, d = 50 billion, 5 TB (60,000 nnz per row of A). Algorithm: Hydra 2 with c = 256. Olivier Fercoq, Zheng Qu, P.R. and Martin Takáč, Fast distributed coordinate descent for minimizing non-strongly convex losses, Machine Learning for Signal Processing, 2014

60 LASSO: 5TB data nodes

