Published by Bailee Goulder. Modified over 2 years ago.

1
Peter Richtarik: Why parallelizing like crazy and being lazy can be good

3
I. Optimization

4
Optimization with Big Data = Extreme* Mountain Climbing (* in a billion-dimensional space on a foggy day)

5
Western General Hospital (Creutzfeldt-Jakob Disease); Arup (Truss Topology Design); Ministry of Defence dstl lab (Algorithms for Data Simplicity); Royal Observatory (Optimal Planet Growth)

6
Big Data: digital images & videos; transaction records; government records; health records; defence; internet activity (social media, wikipedia, ...); scientific measurements (physics, climate models, ...). BIG Volume, BIG Velocity, BIG Variety.

7
God’s Algorithm = Teleportation

8
If You Are Not a God... x_0, x_1, x_2, x_3

9
II. Randomized Coordinate Descent Methods [the cardinal directions of big data optimization]

10
P. R. and M. Takáč, Iteration complexity of randomized block coordinate descent methods for minimizing a composite function, Mathematical Programming A, 2012.
Yu. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM J Optimization, 2012.

11
2D Optimization. Goal: find the minimizer of the function. [Figure: contours of the function]

12
Randomized Coordinate Descent in 2D
[Figure, animated over slides 12 to 20: contour plot with the compass directions N, S, E, W; iterates 1 through 8 each move along one compass direction, ending with "SOLVED!"]
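
The 2D picture above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method analyzed in the cited papers: the objective is a hypothetical strongly convex quadratic f(x) = 0.5 x^T A x - b^T x, the matrix `A`, vector `b`, starting point and step count are made up for the example, and each step minimizes f exactly along one randomly chosen axis (East-West or North-South).

```python
import random

def randomized_coordinate_descent(a, b, x0, steps=200, seed=0):
    """Minimize f(x) = 0.5 * x^T A x - b^T x for a 2x2 symmetric positive definite A."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        # Pick the East-West axis (0) or the North-South axis (1) at random.
        i = rng.randrange(2)
        # Partial derivative of f with respect to coordinate i.
        grad_i = a[i][0] * x[0] + a[i][1] * x[1] - b[i]
        # Exact minimization along coordinate i uses step size 1 / A[i][i].
        x[i] -= grad_i / a[i][i]
    return x

# Hypothetical problem data (not from the slides).
A = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = randomized_coordinate_descent(A, b, x0=[5.0, -4.0])
print(x)  # close to the solution of A x = b
```

Each step touches a single coordinate and needs only one partial derivative, which is what keeps the per-iteration cost low when the number of variables is huge.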

21
1 Billion Rows & 100 Million Variables

22
Bridges are Indeed Optimal!

23
P. R. and M. Takáč, Parallel coordinate descent methods for big data optimization, ArXiv:1212.0873, 2012.
M. Takáč, A. Bijral, P. R. and N. Srebro, Mini-batch primal and dual methods for SVMs, ICML 2013.

24
Failure of Naive Parallelization
[Figure, animated over slides 24 to 28: the labels 0, 1a, 1b, 1, 2a, 2b, 2 mark the starting point, the two parallel coordinate updates in each round, and the combined iterates 1 and 2]
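
One way to see how naive parallelization can fail is the small sketch below. It is an assumption-laden illustration, not a reconstruction of the slide's figure: the function f(x1, x2) = (x1 + x2 - 1)^2 and the `combine` parameter are chosen for the example. Two processors each compute the exact minimizing step for their own coordinate from the same point; adding both steps in full overshoots and the iterates oscillate, while averaging the steps reaches the minimum.

```python
def f(x):
    """Hypothetical objective: f(x1, x2) = (x1 + x2 - 1)^2."""
    return (x[0] + x[1] - 1.0) ** 2

def coordinate_steps(x):
    """Exact minimizing step for each coordinate, both computed from the same point."""
    residual = x[0] + x[1] - 1.0
    # Changing either coordinate alone by -residual already drives f to zero.
    return [-residual, -residual]

def run(x, combine, rounds=4):
    """Apply both coordinate steps at once, each scaled by `combine`."""
    values = [f(x)]
    for _ in range(rounds):
        h = coordinate_steps(x)
        x = [x[0] + combine * h[0], x[1] + combine * h[1]]
        values.append(f(x))
    return values

naive = run([0.0, 0.0], combine=1.0)     # add both updates in full
averaged = run([0.0, 0.0], combine=0.5)  # average the two updates
print(naive)     # objective never decreases
print(averaged)  # objective drops to zero after one round
```

In the naive run the point jumps between (0, 0) and (1, 1), so the objective stays at 1; scaling down the combined step is the simplest safeguard.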

29
Parallel Coordinate Descent

31
Theory

32
Reality

33
A Problem with Billion Variables

34
P. R. and M. Takáč Distributed coordinate descent methods for big data optimization Manuscript, 2013

36
Distributed Coordinate Descent: 1.2 TB LASSO problem solved on the HECToR supercomputer with 2048 cores

37
III. Randomized Lock-Free Methods [optimization as lock breaking]

38
A Lock with 4 Dials. Setup: the combination maximizing F opens the lock. x = (x_1, x_2, x_3, x_4); F(x) = F(x_1, x_2, x_3, x_4) is a function representing the "quality" of a combination. Optimization Problem: find the combination maximizing F.
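
For a single lock with 4 dials the optimization problem can be solved by trying every combination. The sketch below assumes each dial has ten positions (0 to 9) and uses a hypothetical quality function `quality` that peaks at a made-up secret combination; neither assumption comes from the slides.

```python
from itertools import product

SECRET = (3, 1, 4, 1)  # hypothetical combination that opens the lock

def quality(x):
    """Hypothetical F: larger is better, maximal (zero) exactly at SECRET."""
    return -sum((xi - si) ** 2 for xi, si in zip(x, SECRET))

positions = range(10)  # assumed: each dial has positions 0..9
# Exhaustive search over all 10**4 combinations.
best = max(product(positions, repeat=4), key=quality)
print(best)
```

Exhaustive search needs 10^n evaluations for n dials, so it is hopeless for the billion-dial systems discussed next; that is the motivation for cheap randomized updates.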

39
Optimization Algorithm

40
P. R. and M. Takáč, Randomized lock-free gradient methods, Manuscript, 2013.
F. Niu, B. Recht, C. Re, and S. Wright, HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, NIPS, 2011.

41
A System of Billion Locks with Shared Dials. # dials = n = # locks. Dials: x_1, x_2, x_3, x_4, ..., x_n. 1) Nodes in the graph correspond to dials. 2) Nodes in the graph also correspond to locks: each lock (= node) owns the dials connected to it in the graph by an edge.

42
How do we Measure the Quality of a Combination? F : R^n → R. Each lock j has its own quality function F_j depending on the dials it owns. However, it does NOT open when F_j is maximized. The system of locks opens when F = F_1 + F_2 + ... + F_n is maximized.

43
An Algorithm with (too much?) Randomization: 1) Randomly select a lock. 2) Randomly select a dial belonging to the lock. 3) Adjust the value on the selected dial based only on the info corresponding to the selected lock.
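
The three steps above can be sketched as follows. Everything specific here is an assumption for illustration: the graph is a ring of 7 nodes, lock j owns the dials of its two neighbours, each quality function is F_j(x) = -0.5 (sum of owned dials - t_j)^2, and the targets t_j are generated from a made-up secret combination so that all locks can open simultaneously. The step size 0.5 is likewise a choice for this toy problem, not a value from the cited manuscript.

```python
import random

def system_quality(x, owned, targets):
    """F(x) = F_1(x) + ... + F_n(x); each F_j <= 0, and F(x) = 0 opens the system."""
    total = 0.0
    for dials, t in zip(owned, targets):
        total += -0.5 * (sum(x[i] for i in dials) - t) ** 2
    return total

def randomized_lock_breaking(owned, targets, n, steps=10000, step_size=0.5, seed=0):
    rng = random.Random(seed)
    x = [0.0] * n
    for _ in range(steps):
        j = rng.randrange(len(owned))  # 1) randomly select a lock
        i = rng.choice(owned[j])       # 2) randomly select a dial it owns
        # 3) adjust that dial using only lock j's information:
        #    dF_j/dx_i = -residual, so this is an ascent step on F_j.
        residual = sum(x[k] for k in owned[j]) - targets[j]
        x[i] -= step_size * residual
    return x

n = 7
owned = [[(j - 1) % n, (j + 1) % n] for j in range(n)]  # hypothetical ring graph
secret = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]            # made-up combination
targets = [sum(secret[i] for i in dials) for dials in owned]

x = randomized_lock_breaking(owned, targets, n)
print(system_quality([0.0] * n, owned, targets))  # quality at the start
print(system_quality(x, owned, targets))          # quality after the random updates
```

Each update reads only the few dials owned by one lock, so its cost does not grow with the total number of locks.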

44
Synchronous Parallelization [Figure: timeline of jobs J1 to J9 on Processors 1, 2 and 3, with IDLE gaps. WASTEFUL]

45
Crazy (Lock-Free) Parallelization [Figure: timeline of jobs J1 to J9 on Processors 1, 2 and 3, with no idle gaps. NO WASTE]
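
A rough sketch of the lock-free idea, using Python threads that update a shared list of dials without any synchronization. The setup is hypothetical (a ring of 7 locks, each owning its two neighbouring dials, with quadratic quality functions and made-up targets), and Python threads only illustrate the access pattern; they do not deliver the speed-ups of a real multicore implementation.

```python
import random
import threading

def system_quality(x, owned, targets):
    """F(x) = sum of F_j(x) with F_j(x) = -0.5 * (sum of owned dials - t_j)^2."""
    return sum(-0.5 * (sum(x[i] for i in d) - t) ** 2 for d, t in zip(owned, targets))

def worker(x, owned, targets, steps, step_size, seed):
    rng = random.Random(seed)
    for _ in range(steps):
        j = rng.randrange(len(owned))  # random lock
        i = rng.choice(owned[j])       # random dial of that lock
        # Read shared dials without a lock; another thread may be changing them.
        residual = sum(x[k] for k in owned[j]) - targets[j]
        # Write without a lock too; the update may rest on slightly stale reads.
        x[i] -= step_size * residual

n = 7
owned = [[(j - 1) % n, (j + 1) % n] for j in range(n)]  # hypothetical ring graph
secret = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]            # made-up combination
targets = [sum(secret[i] for i in dials) for dials in owned]

x = [0.0] * n  # shared state, never protected by a mutex
threads = [
    threading.Thread(target=worker, args=(x, owned, targets, 5000, 0.25, seed))
    for seed in range(3)  # three "processors"
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(system_quality(x, owned, targets))  # near zero: the system is (almost) open
```

No thread ever waits for another, which is the "NO WASTE" picture; the price is that some updates use outdated dial values, which stays harmless here because each lock touches only a few dials.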

46
Crazy Parallelization

50
Theoretical Result [Formula not captured in the transcript; the quantities it involves: average # of dials in a lock, average # of dials common to 2 locks, # locks, # processors]

51
Computational Insights

56
IV. Final Two Slides

57
Why parallelizing like crazy and being lazy can be good? Randomization; Effectivity; Tractability; Efficiency; Scalability (big data); Parallelism; Distribution; Asynchronicity; Parallelization

58
Tools: Probability, Machine Learning, Matrix Theory, HPC
