Peter Richtárik: Why parallelizing like crazy and being lazy can be good
I. Optimization
Optimization with Big Data* = Extreme* Mountain Climbing (* in a billion dimensional space, on a foggy day)
Western General Hospital (Creutzfeldt-Jakob Disease) · Arup (Truss Topology Design) · Ministry of Defence dstl lab (Algorithms for Data Simplicity) · Royal Observatory (Optimal Planet Growth)
Big Data: digital images & videos, transaction records, government records, health records, defence, internet activity (social media, Wikipedia, ...), scientific measurements (physics, climate models, ...). BIG Volume, BIG Velocity, BIG Variety.
God’s Algorithm = Teleportation
If You Are Not a God... x0 → x1 → x2 → x3
II. Randomized Coordinate Descent Methods [the cardinal directions of big data optimization]
P. R. and M. Takáč. Iteration complexity of randomized block coordinate descent methods for minimizing a composite function. Mathematical Programming A, 2012. / Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optimization, 2012.
2D Optimization. Goal: find the minimizer of the function (contours of the function shown).
Randomized Coordinate Descent in 2D
[Animation over several slides: from the starting point, each iteration picks one of the cardinal directions N, S, E, W at random and updates the iterate along that single coordinate; after a handful of iterations the minimizer is reached. SOLVED!]
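The 2D animation above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the quadratic objective, step size, and iteration count are illustrative assumptions; each iteration picks one coordinate (a cardinal direction) at random and steps along it only.

```python
import random

def coordinate_descent(grad, x, step=0.1, iters=200):
    """Randomized coordinate descent: at each iteration pick one
    coordinate uniformly at random and take a gradient step
    along that coordinate only."""
    for _ in range(iters):
        i = random.randrange(len(x))   # N/S = coordinate 1, E/W = coordinate 0
        x[i] -= step * grad(x)[i]      # move along the chosen axis only
    return x

# Illustrative objective: f(x) = (x0 - 1)^2 + 2*(x1 + 2)^2, minimizer (1, -2)
grad = lambda x: [2 * (x[0] - 1), 4 * (x[1] + 2)]

random.seed(0)
x = coordinate_descent(grad, [5.0, 5.0])
print(x)  # close to [1, -2]
```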
1 Billion Rows & 100 Million Variables
Bridges are Indeed Optimal!
P. R. and M. Takáč. Parallel coordinate descent methods for big data optimization. arXiv, 2012. / M. Takáč, A. Bijral, P. R. and N. Srebro. Mini-batch primal and dual methods for SVMs. ICML 2013.
Failure of Naive Parallelization
[Animation over several slides: from iterate 0, two coordinates are updated in parallel (steps 1a, 1b), each computed as if the other coordinate stayed fixed; applying both updates at once overshoots the minimizer, and repeating the procedure (steps 2a, 2b) fails to converge.]
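The failure mode above can be reproduced on a tiny example. This is a hypothetical illustration (the objective f(x1, x2) = (x1 + x2)^2 is my choice, not from the talk): each coordinate's exact minimizer is computed as if the other coordinate stayed fixed, and applying both at once makes the iterates oscillate forever, while the same updates applied serially solve the problem in one sweep.

```python
def parallel_step(x):
    # Each coordinate exactly minimizes f(x1, x2) = (x1 + x2)^2
    # over itself, using the OLD value of the other coordinate.
    return [-x[1], -x[0]]

def serial_step(x):
    x = x[:]
    x[0] = -x[1]          # minimize over x1 first ...
    x[1] = -x[0]          # ... then over x2, using the fresh x1
    return x

f = lambda x: (x[0] + x[1]) ** 2

x_par = [1.0, 1.0]
for _ in range(10):
    x_par = parallel_step(x_par)   # oscillates between (1,1) and (-1,-1)

x_ser = serial_step([1.0, 1.0])    # reaches (-1, 1), where f = 0

print(f(x_par), f(x_ser))  # 4.0 0.0
```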
Parallel Coordinate Descent
Theory
Reality
A Problem with Billion Variables
P. R. and M. Takáč. Distributed coordinate descent methods for big data optimization. Manuscript, 2013.
Distributed Coordinate Descent 1.2 TB LASSO problem solved on the HECToR supercomputer with 2048 cores
III. Randomized Lock-Free Methods [optimization as lock breaking]
A Lock with 4 Dials. Setup: a combination x = (x1, x2, x3, x4); a function F(x) = F(x1, x2, x3, x4) representing the "quality" of a combination; the combination maximizing F opens the lock. Optimization Problem: find the combination maximizing F.
Optimization Algorithm
P. R. and M. Takáč. Randomized lock-free gradient methods. Manuscript, 2013. / F. Niu, B. Recht, C. Ré, and S. Wright. HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS, 2011.
A System of a Billion Locks with Shared Dials. Dials x1, x2, x3, x4, ..., xn; # dials = n = # locks. 1) Nodes in the graph correspond to dials. 2) Nodes in the graph also correspond to locks: each lock (= node) owns the dials connected to it in the graph by an edge.
How Do We Measure the Quality of a Combination? F : R^n → R. Each lock j has its own quality function F_j depending on the dials it owns. However, lock j does NOT open when F_j is maximized: the system of locks opens when F = F_1 + F_2 + ... + F_n is maximized.
An Algorithm with (Too Much?) Randomization. 1) Randomly select a lock. 2) Randomly select a dial belonging to the lock. 3) Adjust the value of the selected dial based only on the information corresponding to the selected lock.
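The three-step algorithm can be sketched on a hypothetical toy instance. The cycle graph, the target values, and the step size below are my assumptions, not from the talk: each lock j owns two dials, its quality F_j is maximized when those dials sum to a target, and F = F_1 + ... + F_n is maximized by repeatedly picking a random lock, a random owned dial, and taking an ascent step using only F_j's information.

```python
import random

random.seed(0)
n = 6
# Hypothetical instance: dials on a cycle; lock j owns dials j and (j+1) % n.
owns = [(j, (j + 1) % n) for j in range(n)]
target = [1.0] * n  # lock j's quality peaks when its two dials sum to target[j]

def F_j(j, x):
    a, b = owns[j]
    return -(x[a] + x[b] - target[j]) ** 2

def F(x):
    return sum(F_j(j, x) for j in range(n))

def grad_ji(j, i, x):
    # dF_j/dx_i is the same for both dials owned by lock j
    a, b = owns[j]
    return -2.0 * (x[a] + x[b] - target[j])

x = [random.uniform(-2, 2) for _ in range(n)]
start = F(x)
step = 0.1
for _ in range(5000):
    j = random.randrange(n)          # 1) random lock
    i = random.choice(owns[j])       # 2) random dial owned by that lock
    x[i] += step * grad_ji(j, i, x)  # 3) ascent step using only F_j's info

print(start, F(x))  # F increases toward its maximum, 0
```

Each update looks only at one lock's quality function, yet in expectation it follows the gradient of the full objective F, which is why all this randomization still works.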
Synchronous Parallelization
[Diagram: jobs J1–J9 assigned to Processors 1–3 in synchronized rounds; faster processors sit IDLE until the slowest job of each round finishes. WASTEFUL]
Crazy (Lock-Free) Parallelization
[Diagram: Processors 1–3 grab jobs J1–J9 as soon as they finish their previous job, with no synchronization barriers and no idle time. NO WASTE]
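The gap between the two diagrams can be quantified with a small scheduling simulation. The job durations below are hypothetical, not from the talk: a synchronous scheme pays for the slowest job of every round, while the asynchronous scheme lets each processor grab the next job the moment it is free.

```python
import heapq

# Hypothetical job durations: 9 jobs, 3 processors.
jobs = [4, 1, 2, 3, 5, 1, 2, 6, 1]
P = 3

# Synchronous: jobs run in rounds of P; a barrier makes every
# round last as long as its slowest job.
rounds = [jobs[i:i + P] for i in range(0, len(jobs), P)]
sync_time = sum(max(r) for r in rounds)

# Lock-free / asynchronous: each processor takes the next job as
# soon as it is free (greedy list scheduling); total time is the makespan.
free = [0.0] * P               # time at which each processor becomes free
heapq.heapify(free)
for d in jobs:
    t = heapq.heappop(free)    # earliest-free processor takes the job
    heapq.heappush(free, t + d)
async_time = max(free)

print(sync_time, async_time)  # 15 11.0
```

On this instance the barrier-free schedule finishes in 11 time units versus 15 with barriers; no processor ever waits while work remains.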
Crazy Parallelization
Theoretical Result
[The complexity/speedup bound is stated in terms of four quantities: the average # of dials in a lock, the average # of dials common to two locks, the # of locks, and the # of processors.]
Computational Insights
IV. Final Two Slides
Why parallelizing like crazy and being lazy can be good? Randomization: effectiveness, tractability, efficiency, scalability (big data). Parallelization: parallelism, distribution, asynchronicity.
Tools: Probability, Machine Learning, Matrix Theory, HPC