
1 Distributed Coordinate Descent Method. Peter Richtárik (joint work with Martin Takáč). AmpLab All Hands Meeting, Berkeley, October 29, 2013

2 Randomized Coordinate Descent in 2D

3 2D Optimization. Goal: find the minimizer. [Figure: contours of a function]

4 Randomized Coordinate Descent in 2D. [Figure: contour plot; from the starting point, the four axis-aligned directions are labeled N, S, E, W]

5-11 [Figure sequence: iterates 1 through 7, each obtained by minimizing along a single randomly chosen coordinate direction (N/S/E/W); slide 11 reaches the minimizer. SOLVED!]
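
The walk in the figures is the whole algorithm: repeatedly pick one coordinate at random and minimize along it. A minimal Python sketch on a made-up 2D quadratic (illustration only, not code from the talk):

    import numpy as np

    # Randomized coordinate descent on the 2D quadratic
    # f(x) = 0.5 * x^T A x - b^T x  (hypothetical example).
    A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
    b = np.array([1.0, 1.0])
    L = np.diag(A)                           # coordinate-wise Lipschitz constants

    rng = np.random.default_rng(0)
    x = np.zeros(2)
    for k in range(50):
        i = rng.integers(2)                  # move E/W (i = 0) or N/S (i = 1)
        x[i] -= (A[i] @ x - b[i]) / L[i]     # exact minimization along axis i
    print(x, np.linalg.solve(A, b))          # iterate vs. true minimizer

For a quadratic, the step -(A[i] @ x - b[i]) / A[i, i] lands exactly at the minimizer along coordinate i, which is what each N/S/E/W move in the figures does.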

12 Convergence of Randomized Coordinate Descent. Iteration complexity to reach accuracy ε (the standard rates; the focus is on the dependence on d, since big data = big d):
- Strongly convex f: O(d log(1/ε))
- Smooth or ‘simple’ nonsmooth f: O(d/ε)
- ‘Difficult’ nonsmooth f: O(d/ε^2)

13 Parallelization Dream. Serial: one coordinate update per iteration. Parallel dream: update many coordinates at once and converge proportionally faster. In reality we get something in between.

14 How (not) to Parallelize Coordinate Descent

15 “Naive” parallelization. Do the same thing as before, but with more (or all) coordinates at once, and add up the updates.

16-20 Failure of naive parallelization. [Figure sequence: from iterate 0, the per-coordinate updates 1a and 1b each point toward the minimizer along its own axis, but their sum overshoots to iterate 1; the summed step built from 2a and 2b overshoots again to iterate 2. OOPS!]

21 Idea: averaging updates may help. [Figure: averaging the per-coordinate updates 1a and 1b from iterate 0 lands iterate 1 at the minimizer. SOLVED!]

22 Averaging can be too conservative. [Figure: from iterate 0, averaged steps built from 1a/1b, then 2a/2b, creep toward the minimizer through iterates 1, 2, and so on...]

23 Averaging may be too conservative. [Figure: the averaged step (BAD!!!) is far shorter than the step we WANT]
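
Both failure modes are easy to reproduce numerically. A toy Python illustration (my own example, not the speakers'): summing per-coordinate updates fails when the coordinates are strongly coupled, while averaging them is needlessly slow when they are not.

    import numpy as np

    # Coupled case: f(x) = (x1 + x2)^2. The exact minimizer along each
    # coordinate, holding the other fixed, is x_i = -x_other.
    x = np.array([1.0, 1.0])
    for _ in range(5):
        h = np.array([-x[0] - x[1], -x[0] - x[1]])  # per-coordinate updates
        x = x + h                                   # naive sum of updates
    print("sum, coupled f:   ", x)  # oscillates between (1,1) and (-1,-1)

    # Separable case: f(x) = x1^2 + x2^2. Full steps would solve it in
    # one iteration, but averaging halves every step.
    x = np.array([1.0, 1.0])
    for _ in range(5):
        h = -x                      # exact per-coordinate updates
        x = x + h / 2               # averaged update
    print("avg, separable f:", x)   # (1/32, 1/32): converges, but slowly

The route taken in the speakers' papers is to scale the combined update by how separable the problem actually is, rather than always summing or always averaging.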

24 Minimizing Regularized Loss

25 Minimize F(x) = f(x) + Ω(x). Loss f: convex (smooth). Regularizer Ω: convex (smooth or nonsmooth), separable, and allowed to take the value +∞.
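
Written out (the standard formulation; notation mine, reconstructed from the slide's labels):

    \min_{x \in \mathbb{R}^d} \; F(x) = f(x) + \Omega(x),
    \qquad
    \Omega(x) = \sum_{i=1}^{d} \Omega_i(x^{(i)}),
    \qquad
    \Omega_i : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}.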

26 Regularizer: examples.
- No regularizer
- Weighted L1 norm (e.g., LASSO)
- Weighted L2 norm
- Box constraints (e.g., SVM dual)
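
In standard form (the usual definitions of the listed regularizers; not transcribed from the slide):

    \Omega(x) = 0,
    \qquad
    \Omega(x) = \sum_{i=1}^{d} \lambda_i \lvert x^{(i)} \rvert,
    \qquad
    \Omega(x) = \tfrac{1}{2} \sum_{i=1}^{d} \lambda_i \bigl(x^{(i)}\bigr)^2,
    \qquad
    \Omega(x) = \sum_{i=1}^{d} \mathbb{I}_{[a_i, b_i]}\bigl(x^{(i)}\bigr),

where \mathbb{I}_{[a_i, b_i]} is 0 on [a_i, b_i] and +\infty outside; this is how box constraints fit the convention that \Omega may take the value +\infty.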

27 Structure of f. Considered in [BKBG, ICML 2011] (Bradley, Kyrola, Bickson and Guestrin, “Parallel Coordinate Descent for L1-Regularized Loss Minimization”).

28 Loss: examples, and papers analyzing (parallel) coordinate descent for them:
- Quadratic loss
- L-infinity
- L1 regression
- Exponential loss
- Logistic loss
- Square hinge loss
References: BKBG’11, RT’11b, TBRS’13, RT’13a, FR’13.
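
For concreteness, two of the listed losses in their standard form, writing f(x) = \sum_j \phi_j(a_j^\top x) over examples (a_j, y_j) (standard definitions; notation mine):

    \phi_j(t) = \tfrac{1}{2} (t - y_j)^2 \quad \text{(quadratic loss)},
    \qquad
    \phi_j(t) = \log\bigl(1 + e^{-y_j t}\bigr) \quad \text{(logistic loss)}.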

29 Distributed Coordinate Descent Method

30 I. Distribution of Data. d = # features / variables / coordinates. [Figure: the data matrix, with its d columns (coordinates) partitioned across the nodes]

31 II. Choice of Coordinates

32 Random set of coordinates (‘sampling’)

33 III. Computing Updates to Selected Coordinates. Given the current iterate x_k and a random set of coordinates S_k (the ‘sampling’), the new iterate is x_{k+1} = x_k + Σ_{i ∈ S_k} h^{(i)} e_i, where h^{(i)} is the update to the i-th coordinate. All nodes need to be able to compute this (communication).
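
As a rough illustration of steps I-III together, here is a serial Python simulation (a sketch under my own simplifications: least-squares loss, no regularizer, uniform sampling of tau coordinates per node, and an assumed damping parameter w; it is not the exact RT’13 method):

    import numpy as np

    rng = np.random.default_rng(1)
    d, c, tau = 8, 2, 2                  # coordinates, nodes, updates per node
    A = rng.standard_normal((20, d))     # synthetic data matrix
    b = rng.standard_normal(20)
    x = np.zeros(d)

    partition = np.array_split(np.arange(d), c)  # coordinates owned by each node
    L = (A ** 2).sum(axis=0)             # coordinate-wise Lipschitz constants
    w = 2.0                              # stepsize damping (assumed)

    for k in range(200):
        g = A.T @ (A @ x - b)            # gradient of f(x) = 0.5 * ||Ax - b||^2
        h = np.zeros(d)
        for node in range(c):            # runs in parallel in the real method
            S = rng.choice(partition[node], size=tau, replace=False)
            h[S] = -g[S] / (w * L[S])    # update to each selected coordinate
        x = x + h                        # all sampled coordinates move at once
    print(0.5 * np.linalg.norm(A @ x - b) ** 2)

In the actual distributed method each node holds only its own columns of the data matrix and computes the updates for its own coordinates; keeping the information those computations need current across nodes is where the communication noted on the slide comes in.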

34 Iteration Complexity. Theorem [RT’13]: a lower bound on the number of iterations k implies F(x_k) - F* ≤ ε with high probability. The quantities entering the bound: d = # coordinates, c = # nodes, τ = # coordinates updated per node, the strong convexity constant of the loss f, and the strong convexity constant of the regularizer.
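
Schematically, the bound has the following shape (my paraphrase of what survives of the slide, not the exact theorem statement):

    k \;\ge\; \frac{d}{c \, \tau} \cdot \frac{\beta}{\mu_f + \mu_\Omega} \cdot \log \frac{1}{\epsilon \rho}
    \quad \Longrightarrow \quad
    \mathbb{P}\bigl(F(x_k) - F^* \le \epsilon\bigr) \ge 1 - \rho,

where \mu_f and \mu_\Omega are the strong convexity constants of the loss and the regularizer, and \beta is a constant determined by the data and the sampling.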

35 Theorem [RT’13]: a bad partitioning of the coordinates across the nodes at most doubles the number of iterations; the effect of the partitioning enters the bound through its spectral norm.

36 Experiment 1: 1 node (c = 1). LASSO problem: n = 2 billion, d = 1 billion.

37-39 [Plots: progress of the method measured in coordinate updates, in iterations, and in wall time]

40 Experiment 2: 128 nodes (c = 512, 4096 cores). LASSO problem: n = 1 billion, d = 0.5 billion; data size = 3 TB.

41 LASSO: 3 TB data + 128 nodes. [Plot: results of Experiment 2]

