Slide 1: Distributed Coordinate Descent Method. Peter Richtárik (joint work with Martin Takáč). AMPLab All Hands Meeting, Berkeley, October 29, 2013.
Slide 2: Randomized Coordinate Descent in 2D
Slide 3: 2D optimization. The contours of a function are shown; the goal is to find its minimizer.
Slides 4–11: Step-by-step animation of randomized coordinate descent in 2D (figure residue: iteration counters 1–7 and the coordinate directions N, S, E, W). At each iteration the method moves along one coordinate direction; after 7 steps the example problem is SOLVED!
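A minimal Python sketch of what the animation shows, assuming a generic smooth function of two variables; the grid line search, step size, and iteration count are illustrative choices, not taken from the slides.

import random

def coordinate_descent_2d(f, x0, step=0.1, iters=7, rng=random):
    # Randomized coordinate descent in 2D: repeatedly pick one of the two
    # coordinates at random and move to the best point along that axis
    # (a crude grid search stands in for an exact line search).
    x = list(x0)
    for _ in range(iters):
        i = rng.randrange(2)  # pick a coordinate direction at random
        candidates = [x[i] + t * step for t in range(-40, 41)]
        x[i] = min(candidates, key=lambda v: f(*(x[:i] + [v] + x[i + 1:])))
    return x

# Example: a simple quadratic bowl with minimizer (1, -0.5), started elsewhere.
f = lambda a, b: (a - 1.0) ** 2 + 2.0 * (b + 0.5) ** 2
print(coordinate_descent_2d(f, [3.0, -2.0]))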
Slide 12: Convergence of randomized coordinate descent. The rates (shown in the original table) cover three cases: strongly convex f, smooth or 'simple' nonsmooth f, and 'difficult' nonsmooth f. The focus is on the dependence on d (big data = big d).
Slide 13: The parallelization dream: serial vs. parallel. In reality we get something in between.
Slide 14: How (Not) to Parallelize Coordinate Descent
Slide 15: "Naive" parallelization: do the same thing as before, but with several (or all) coordinates at once, and add up the updates.
Slides 16–20: Animation of the failure of naive parallelization (figure residue: updates labelled 1a, 1b, then 2a, 2b). The independently computed coordinate updates are added together, and the combined steps take the iterates away from the minimizer rather than toward it. OOPS!
Slide 21: Idea: averaging the updates (instead of adding them) may help; on the previous example it solves the problem. SOLVED!
Slides 22–23: Averaging can, however, be too conservative: on a different example the averaged steps (1a, 1b, then 2a, 2b, and so on...) crawl toward the solution. The figure contrasts the step we WANT with the overly short averaged step, marked BAD!!!
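A tiny Python illustration of the tension the last few slides describe, using two hand-picked 2D quadratics (illustrative choices, not the functions drawn in the figures): on a separable objective, adding the two coordinate updates solves the problem in one step while averaging is overly cautious; on a strongly coupled objective, adding overshoots while averaging solves it.

def combined_step(x, deltas, mode):
    # Each delta is the exact coordinate-wise minimizer computed with the other
    # coordinate frozen; the two updates are then combined by adding or averaging.
    scale = 1.0 if mode == "add" else 0.5
    return [xi + scale * di for xi, di in zip(x, deltas)]

# Separable objective f(x) = 0.5*(x1^2 + x2^2): the deltas are (-x1, -x2).
x = [1.0, 1.0]
deltas = [-x[0], -x[1]]
print(combined_step(x, deltas, "add"))      # [0.0, 0.0] -> solved in one step
print(combined_step(x, deltas, "average"))  # [0.5, 0.5] -> only halfway there

# Coupled objective f(x) = 0.5*(x1 + x2)^2: both deltas equal -(x1 + x2).
x = [1.0, 1.0]
deltas = [-(x[0] + x[1])] * 2
print(combined_step(x, deltas, "add"))      # [-1.0, -1.0] -> overshoots, f is unchanged
print(combined_step(x, deltas, "average"))  # [0.0, 0.0]   -> solved

How much of the summed update can safely be applied depends on how strongly the coordinates interact through f, which is why, as slide 13 says, in reality we get something in between plain adding and plain averaging.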
Slide 24: Minimizing Regularized Loss
Slide 25: Problem setup: minimize the sum of a loss f (convex, smooth) and a regularizer R (convex, smooth or nonsmooth, separable, allowed to be extended-valued).
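In symbols, a standard way to write this setup (the separable form of R below matches the slide's 'separable' label; the exact notation on the slide is not reproduced):

\min_{x \in \mathbb{R}^d} F(x) := f(x) + R(x), \qquad R(x) = \sum_{i=1}^{d} R_i(x_i),

where f is convex and smooth, and each R_i is convex, possibly nonsmooth, and allowed to take the value +\infty (so that constraints can be encoded).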
Slide 26: Regularizer examples: no regularizer; weighted L1 norm (e.g., LASSO); weighted L2 norm; box constraints (e.g., SVM dual).
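A minimal Python sketch of these per-coordinate regularizers; the weights and box bounds are illustrative placeholders, and the full regularizer is the sum of R_i over the coordinates.

import math

def no_regularizer(t):
    return 0.0

def weighted_l1(t, w=1.0):
    # e.g., LASSO
    return w * abs(t)

def weighted_l2(t, w=1.0):
    # squared form, one common convention for a "weighted L2" penalty
    return 0.5 * w * t * t

def box_constraint(t, lo=0.0, hi=1.0):
    # e.g., SVM dual; +inf outside the box encodes the constraint
    return 0.0 if lo <= t <= hi else math.inf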
Slide 27: Structure of the loss f (the form considered in [BKBG, ICML 2011]).
Slide 28: Loss examples: quadratic loss, L-infinity, L1 regression, exponential loss, logistic loss, square hinge loss (studied in BKBG'11, RT'11b, TBRS'13, RT'13a, FR'13).
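A minimal Python sketch of several of these per-example losses, written in their standard textbook forms and assuming labels y in {-1, +1} for the classification losses; the exact parameterisation used on the slide may differ.

import math

def quadratic_loss(pred, y):
    return 0.5 * (pred - y) ** 2

def logistic_loss(pred, y):
    return math.log(1.0 + math.exp(-y * pred))

def square_hinge_loss(pred, y):
    return max(0.0, 1.0 - y * pred) ** 2

def exponential_loss(pred, y):
    return math.exp(-y * pred)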
Slide 29: Distributed Coordinate Descent Method
Slide 30: Step I. Distribution of data. The data matrix has d columns (d = # of features / variables / coordinates), and the coordinates are distributed across the compute nodes.
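A minimal sketch of the data distribution step, assuming the coordinates (columns of the data matrix) are split into roughly equal contiguous blocks, one block per node; the contiguous-block scheme is an illustrative assumption, not necessarily the partitioning shown in the figure.

def partition_coordinates(d, c):
    # Split the d coordinates into c roughly equal blocks; node k owns block k.
    return [list(range(k * d // c, (k + 1) * d // c)) for k in range(c)]

# Example: 10 coordinates over 4 nodes ->
# [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
print(partition_coordinates(10, 4))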
Slide 31: Step II. Choice of coordinates.
Slide 32: At each iteration, a random set of coordinates (a 'sampling') is chosen to be updated.
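A minimal sketch of the sampling step, assuming each node independently picks tau coordinates uniformly at random from the block it owns (tau is the "# coordinates updated / node" appearing on slide 34):

import random

def sample_coordinates(blocks, tau, rng=random):
    # blocks: one list of owned coordinates per node (e.g., from partition_coordinates)
    return [rng.sample(block, min(tau, len(block))) for block in blocks]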
Slide 33: Step III. Computing updates to the selected coordinates. Given the current iterate and the random set of sampled coordinates, each node computes an update to each of its selected coordinates and applies it to obtain the new iterate; all nodes need to be able to compute the quantity this update depends on, which requires communication.
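A schematic of one iteration, under simplifying assumptions: compute_update(i, x) is a hypothetical helper that returns the (suitably damped) step for coordinate i, and sync_shared_state(x) is a hypothetical stand-in for the communication the slide mentions; neither is the method's actual formula.

def iterate_once(x, blocks, tau, compute_update, sync_shared_state):
    sampled = sample_coordinates(blocks, tau)
    for node_coords in sampled:           # conceptually, the nodes run in parallel
        for i in node_coords:
            x[i] += compute_update(i, x)  # update only the sampled coordinates
    sync_shared_state(x)                  # exchange whatever every node needs
                                          # to compute its next updates
    return x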
Slide 34: Iteration complexity. Theorem [RT'13]: the required number of iterations is bounded in terms of the strong convexity constant of the loss f, the strong convexity constant of the regularizer, the number of coordinates, the number of nodes, and the number of coordinates updated per node.
Slide 35: Theorem [RT'13]: a bad partitioning at most doubles the number of iterations; the effect is controlled by the spectral norm of the "partitioning".
Slide 36: Experiment 1: a LASSO problem with n = 2 billion and d = 1 billion, run on 1 node (c = 1).
Slides 37–39: Plots of progress measured against the number of coordinate updates, the number of iterations, and wall time.
Slide 40: Experiment 2: a LASSO problem with n = 1 billion, d = 0.5 billion, and data size of 3 TB, run on 128 nodes (c = 512, 4096 cores).
Slide 41: LASSO: 3 TB data + 128 nodes (results plot).