
1 Distributed Coordinate Descent Method. Peter Richtárik (joint work with Martin Takáč). AmpLab All Hands Meeting, Berkeley, October 29, 2013

2 Randomized Coordinate Descent in 2D

3 2D Optimization. Goal: find the minimizer. [Figure: contours of a function]

4 Randomized Coordinate Descent in 2D. [Figure: contour plot; from the starting point, the four axis-aligned directions are labeled N, S, E, W]

5-11 [Figure sequence: iterates 1 through 7, each obtained by minimizing along a single randomly chosen coordinate direction (N/S/E/W); slide 11 reaches the minimizer. SOLVED!]
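
The walk in the figures is the whole algorithm: repeatedly pick one coordinate at random and minimize along it. A minimal Python sketch on a made-up 2D quadratic (illustration only, not code from the talk):

    import numpy as np

    # Randomized coordinate descent on the 2D quadratic
    # f(x) = 0.5 * x^T A x - b^T x  (hypothetical example).
    A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
    b = np.array([1.0, 1.0])
    L = np.diag(A)                           # coordinate-wise Lipschitz constants

    rng = np.random.default_rng(0)
    x = np.zeros(2)
    for k in range(50):
        i = rng.integers(2)                  # move E/W (i = 0) or N/S (i = 1)
        x[i] -= (A[i] @ x - b[i]) / L[i]     # exact minimization along axis i
    print(x, np.linalg.solve(A, b))          # iterate vs. true minimizer

For a quadratic, the step -(A[i] @ x - b[i]) / A[i, i] lands exactly at the minimizer along coordinate i, which is what each N/S/E/W move in the figures does.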

12 Convergence of Randomized Coordinate Descent. Iteration complexity to reach accuracy ε (the standard rates; the focus is on the dependence on d, since big data = big d):
- Strongly convex f: O(d log(1/ε))
- Smooth or ‘simple’ nonsmooth f: O(d/ε)
- ‘Difficult’ nonsmooth f: O(d/ε^2)

13 Parallelization Dream. Serial: one coordinate update per iteration. Parallel dream: update many coordinates at once and converge proportionally faster. In reality we get something in between.

14 How (not) to Parallelize Coordinate Descent

15 “Naive” parallelization. Do the same thing as before, but with more (or all) coordinates at once, and add up the updates.

16-20 Failure of naive parallelization. [Figure sequence: from iterate 0, the per-coordinate updates 1a and 1b each point toward the minimizer along its own axis, but their sum overshoots to iterate 1; the summed step built from 2a and 2b overshoots again to iterate 2. OOPS!]

21 Idea: averaging updates may help. [Figure: averaging the per-coordinate updates 1a and 1b from iterate 0 lands iterate 1 at the minimizer. SOLVED!]

22 Averaging can be too conservative. [Figure: from iterate 0, averaged steps built from 1a/1b, then 2a/2b, creep toward the minimizer through iterates 1, 2, and so on...]

23 Averaging may be too conservative. [Figure: the averaged step (BAD!!!) is far shorter than the step we WANT]
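
Both failure modes are easy to reproduce numerically. A toy Python illustration (my own example, not the speakers'): summing per-coordinate updates fails when the coordinates are strongly coupled, while averaging them is needlessly slow when they are not.

    import numpy as np

    # Coupled case: f(x) = (x1 + x2)^2. The exact minimizer along each
    # coordinate, holding the other fixed, is x_i = -x_other.
    x = np.array([1.0, 1.0])
    for _ in range(5):
        h = np.array([-x[0] - x[1], -x[0] - x[1]])  # per-coordinate updates
        x = x + h                                   # naive sum of updates
    print("sum, coupled f:   ", x)  # oscillates between (1,1) and (-1,-1)

    # Separable case: f(x) = x1^2 + x2^2. Full steps would solve it in
    # one iteration, but averaging halves every step.
    x = np.array([1.0, 1.0])
    for _ in range(5):
        h = -x                      # exact per-coordinate updates
        x = x + h / 2               # averaged update
    print("avg, separable f:", x)   # (1/32, 1/32): converges, but slowly

The route taken in the speakers' papers is to scale the combined update by how separable the problem actually is, rather than always summing or always averaging.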

24 Minimizing Regularized Loss

25 Minimize F(x) = f(x) + Ω(x). Loss f: convex (smooth). Regularizer Ω: convex (smooth or nonsmooth), separable, and allowed to take the value +∞.
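
Written out (the standard formulation; notation mine, reconstructed from the slide's labels):

    \min_{x \in \mathbb{R}^d} \; F(x) = f(x) + \Omega(x),
    \qquad
    \Omega(x) = \sum_{i=1}^{d} \Omega_i(x^{(i)}),
    \qquad
    \Omega_i : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}.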

26 Regularizer: examples.
- No regularizer
- Weighted L1 norm (e.g., LASSO)
- Weighted L2 norm
- Box constraints (e.g., SVM dual)
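
In standard form (the usual definitions of the listed regularizers; not transcribed from the slide):

    \Omega(x) = 0,
    \qquad
    \Omega(x) = \sum_{i=1}^{d} \lambda_i \lvert x^{(i)} \rvert,
    \qquad
    \Omega(x) = \tfrac{1}{2} \sum_{i=1}^{d} \lambda_i \bigl(x^{(i)}\bigr)^2,
    \qquad
    \Omega(x) = \sum_{i=1}^{d} \mathbb{I}_{[a_i, b_i]}\bigl(x^{(i)}\bigr),

where \mathbb{I}_{[a_i, b_i]} is 0 on [a_i, b_i] and +\infty outside; this is how box constraints fit the convention that \Omega may take the value +\infty.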

27 Structure of f. Considered in [BKBG, ICML 2011] (Bradley, Kyrola, Bickson and Guestrin, “Parallel Coordinate Descent for L1-Regularized Loss Minimization”).

28 Loss: examples, and papers analyzing (parallel) coordinate descent for them:
- Quadratic loss
- L-infinity
- L1 regression
- Exponential loss
- Logistic loss
- Square hinge loss
References: BKBG’11, RT’11b, TBRS’13, RT’13a, FR’13.
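
For concreteness, two of the listed losses in their standard form, writing f(x) = \sum_j \phi_j(a_j^\top x) over examples (a_j, y_j) (standard definitions; notation mine):

    \phi_j(t) = \tfrac{1}{2} (t - y_j)^2 \quad \text{(quadratic loss)},
    \qquad
    \phi_j(t) = \log\bigl(1 + e^{-y_j t}\bigr) \quad \text{(logistic loss)}.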

29 Distributed Coordinate Descent Method

30 I. Distribution of Data. d = # features / variables / coordinates. [Figure: the data matrix, with its d columns (coordinates) partitioned across the nodes]

31 II. Choice of Coordinates

32 Random set of coordinates (‘sampling’)

33 III. Computing Updates to Selected Coordinates. Given the current iterate x_k and a random set of coordinates S_k (the ‘sampling’), the new iterate is x_{k+1} = x_k + Σ_{i ∈ S_k} h^{(i)} e_i, where h^{(i)} is the update to the i-th coordinate. All nodes need to be able to compute this (communication).
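
As a rough illustration of steps I-III together, here is a serial Python simulation (a sketch under my own simplifications: least-squares loss, no regularizer, uniform sampling of tau coordinates per node, and an assumed damping parameter w; it is not the exact RT’13 method):

    import numpy as np

    rng = np.random.default_rng(1)
    d, c, tau = 8, 2, 2                  # coordinates, nodes, updates per node
    A = rng.standard_normal((20, d))     # synthetic data matrix
    b = rng.standard_normal(20)
    x = np.zeros(d)

    partition = np.array_split(np.arange(d), c)  # coordinates owned by each node
    L = (A ** 2).sum(axis=0)             # coordinate-wise Lipschitz constants
    w = 2.0                              # stepsize damping (assumed)

    for k in range(200):
        g = A.T @ (A @ x - b)            # gradient of f(x) = 0.5 * ||Ax - b||^2
        h = np.zeros(d)
        for node in range(c):            # runs in parallel in the real method
            S = rng.choice(partition[node], size=tau, replace=False)
            h[S] = -g[S] / (w * L[S])    # update to each selected coordinate
        x = x + h                        # all sampled coordinates move at once
    print(0.5 * np.linalg.norm(A @ x - b) ** 2)

In the actual distributed method each node holds only its own columns of the data matrix and computes the updates for its own coordinates; keeping the information those computations need current across nodes is where the communication noted on the slide comes in.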

34 Iteration Complexity. Theorem [RT’13]: a lower bound on the number of iterations k implies F(x_k) - F* ≤ ε with high probability. The quantities entering the bound: d = # coordinates, c = # nodes, τ = # coordinates updated per node, the strong convexity constant of the loss f, and the strong convexity constant of the regularizer.
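
Schematically, the bound has the following shape (my paraphrase of what survives of the slide, not the exact theorem statement):

    k \;\ge\; \frac{d}{c \, \tau} \cdot \frac{\beta}{\mu_f + \mu_\Omega} \cdot \log \frac{1}{\epsilon \rho}
    \quad \Longrightarrow \quad
    \mathbb{P}\bigl(F(x_k) - F^* \le \epsilon\bigr) \ge 1 - \rho,

where \mu_f and \mu_\Omega are the strong convexity constants of the loss and the regularizer, and \beta is a constant determined by the data and the sampling.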

35 Theorem [RT’13]: a bad partitioning of the coordinates across the nodes at most doubles the number of iterations; the effect of the partitioning enters the bound through its spectral norm.

36 Experiment 1: 1 node (c = 1). LASSO problem: n = 2 billion, d = 1 billion.

37-39 [Plots: progress of the method measured in coordinate updates, in iterations, and in wall time]

40 Experiment 2: 128 nodes (c = 512, 4096 cores). LASSO problem: n = 1 billion, d = 0.5 billion; data size = 3 TB.

41 LASSO: 3 TB data + 128 nodes. [Plot: results of Experiment 2]

