
1 APPROX: Accelerated, Parallel and PROXimal coordinate descent. Moscow, February 2014. Peter Richtárik (joint work with Olivier Fercoq, arXiv:1312.5799)

2 Optimization Problem

3 Problem: minimize the sum of a loss and a regularizer. The loss is convex (smooth or nonsmooth); the regularizer is convex (smooth or nonsmooth), separable, and allowed to be extended-valued (so constraints can be encoded).
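The formula itself appears only as an image on the slide; as a hedged reconstruction based on the accompanying paper (arXiv:1312.5799), the problem has the composite form below, with the regularizer separable across the n coordinate blocks:

```latex
% Reconstruction of the composite problem (notation follows arXiv:1312.5799)
\min_{x \in \mathbb{R}^N} \; F(x) := f(x) + \psi(x),
\qquad
\psi(x) = \sum_{i=1}^{n} \psi_i\big(x^{(i)}\big),
```

where f is the convex loss and each \psi_i is convex, possibly nonsmooth, and allowed to take the value +\infty.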

4 Regularizer: examples. No regularizer; weighted L1 norm (e.g., LASSO); weighted L2 norm; box constraints (e.g., SVM dual).
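The slide shows these as formulas; for reference, the standard forms they usually take are written below (the weights \lambda_i and the box [a_i, b_i] are my notation):

```latex
\psi(x) = 0 \ \text{(no regularizer)}, \qquad
\psi(x) = \sum_{i=1}^{n} \lambda_i |x_i| \ \text{(weighted L1; e.g., LASSO)},
```
```latex
\psi(x) = \tfrac{1}{2}\sum_{i=1}^{n} \lambda_i x_i^2 \ \text{(weighted L2)}, \qquad
\psi(x) = \sum_{i=1}^{n} \mathbb{I}_{[a_i,\,b_i]}(x_i) \ \text{(box constraints; e.g., SVM dual)}.
```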

5 Loss: examples. Quadratic loss; L-infinity; L1 regression; exponential loss; logistic loss; square hinge loss. (References: BKBG’11, RT’11b, TBRS’13, RT’13a, FR’13.)
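The slide only names the losses; as a reminder, their standard forms (for a data matrix A with rows a_j, right-hand side b, labels y_j; all of this is my notation, not taken from the slide) are:

```latex
\text{quadratic: } \tfrac{1}{2}\|Ax-b\|_2^2, \qquad
\text{L1 regression: } \|Ax-b\|_1, \qquad
\text{L-infinity: } \|Ax-b\|_\infty,
```
```latex
\text{logistic: } \textstyle\sum_j \log\big(1+e^{-y_j a_j^\top x}\big), \quad
\text{exponential: } \textstyle\sum_j e^{-y_j a_j^\top x}, \quad
\text{square hinge: } \textstyle\sum_j \max\big(0,\,1-y_j a_j^\top x\big)^2.
```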

6 RANDOMIZED COORDINATE DESCENT IN 2D

7 2D Optimization. Goal: find the minimizer of a function, shown via its contours.

8–15 Randomized Coordinate Descent in 2D (animation: starting from an initial point, each step picks a random coordinate direction, North/South or East/West, and moves to the minimizer along that line; after 7 such steps the problem is SOLVED!)

16 CONTRIBUTIONS

17 Variants of Randomized Coordinate Descent Methods
- Block: can operate on “blocks” of coordinates, as opposed to just individual coordinates
- General: applies to “general” (= smooth convex) functions, as opposed to special ones such as quadratics
- Proximal: admits a “nonsmooth regularizer” that is kept intact when solving the subproblems (the regularizer is neither smoothed nor approximated)
- Parallel: operates on multiple blocks / coordinates in parallel, as opposed to just 1 block / coordinate at a time
- Accelerated: achieves an O(1/k^2) convergence rate for convex functions, as opposed to O(1/k)
- Efficient: the complexity of 1 iteration is O(1) per processor on sparse problems, as opposed to O(# coordinates); avoids adding two full vectors

18 Brief History of Randomized Coordinate Descent Methods + new long stepsizes

19 APPROX

20 APPROX = “ACCELERATED” + “PARALLEL” + “PROXIMAL”

21 PCDM (R. & Takáč, 2012) = APPROX if we force a particular parameter choice (shown as a formula on the slide), which effectively switches acceleration off.

22 APPROX: Smooth Case. The slide shows the update for coordinate i, built from the partial derivative of f at the current point; the stepsize is chosen to be as large as possible.
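The update is shown on the slide as a formula; below is a minimal Python sketch of an APPROX-style iteration in the smooth case, following the structure described in arXiv:1312.5799. It is illustrative only, not the authors' reference code: the gradient oracle grad_i, the stepsize parameters v (ESO weights), and the sampling size tau are assumed to be supplied by the user, and the interpolation point y is formed explicitly for clarity (the paper's efficient implementation avoids full-vector operations).

```python
import numpy as np

def approx_smooth(grad_i, n, v, tau, iters, x0, rng=None):
    """Sketch of an accelerated parallel coordinate descent iteration
    (smooth case, all blocks of size 1).

    grad_i(y, i): i-th partial derivative of f at y (user-supplied oracle)
    v:            coordinate stepsize parameters (ESO weights), length n
    tau:          number of coordinates updated per iteration
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    z = x.copy()
    theta = tau / n
    for _ in range(iters):
        y = (1.0 - theta) * x + theta * z               # interpolation of x and z
        S = rng.choice(n, size=tau, replace=False)      # tau coordinates, uniformly at random
        dz = np.zeros(n)
        for i in S:                                     # these updates can run in parallel
            dz[i] = -(tau / (n * theta * v[i])) * grad_i(y, i)
        z = z + dz                                      # update the sampled coordinates of z
        x = y + (n * theta / tau) * dz                  # momentum-style correction
        theta = 0.5 * (np.sqrt(theta**4 + 4 * theta**2) - theta**2)  # decrease theta
    return x
```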

23 CONVERGENCE RATE

24 Convergence Rate. Theorem [FR’13b]: under the key assumption (an ESO, introduced later), a stated number of iterations, expressed in terms of the number of coordinates and the average number of coordinates updated per iteration, implies an ε-accurate solution in expectation.
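The precise constants are on the slide image; schematically (constants omitted), the theorem gives an accelerated O(1/k^2) rate whose leading factor scales with n/τ:

```latex
\mathbb{E}\big[F(x_k) - F^*\big] \;=\; O\!\left(\Big(\tfrac{n}{\tau\,k}\Big)^{2}\right)
\quad\Longrightarrow\quad
k \;=\; O\!\left(\tfrac{n}{\tau}\cdot\tfrac{1}{\sqrt{\varepsilon}}\right)
\ \text{iterations suffice for an } \varepsilon\text{-accurate solution in expectation.}
```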

25 Special Case: Fully Parallel Variant. All coordinates are updated in each iteration; the iteration bound is expressed in terms of normalized weights (summing to n).

26 Special Case: Effect of New Stepsizes. With the new stepsizes (mentioned later!), the bound involves the average degree of separability and an “average” of the Lipschitz constants.

27 “EFFICIENCY” OF APPROX

28 Cost of 1 Iteration of APPROX. Assume N = n (all blocks are of size 1), that the data matrix A is sparse, and that the scalar loss functions have derivatives computable in O(1) arithmetic operations. Then the average cost of 1 iteration of APPROX is proportional to the average number of nonzeros in a column of A.

29 Bottleneck: Computation of Partial Derivatives. A vector of products of A with the iterate is maintained and updated incrementally, so each partial derivative can be computed cheaply.
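As an illustration of the maintained quantity, here is a hedged sketch for the quadratic loss f(x) = 0.5·||Ax − b||^2 (the choice of loss and all names are mine, for illustration only): by maintaining the residual r = Ax − b, computing a partial derivative and applying a coordinate update each touch only the nonzeros of one column of A.

```python
import numpy as np
import scipy.sparse as sp

def make_residual_oracle(A, b, x0):
    """For f(x) = 0.5*||Ax - b||^2, maintain r = A@x - b so that
    df/dx_i = A[:, i]^T r costs O(nnz of column i), and a coordinate
    update touches only those nonzeros as well."""
    A = sp.csc_matrix(A)                 # column slicing is cheap in CSC format
    x = np.asarray(x0, dtype=float).copy()
    r = A @ x - b                        # maintained residual vector

    def partial_derivative(i):
        start, end = A.indptr[i], A.indptr[i + 1]
        return float(np.dot(A.data[start:end], r[A.indices[start:end]]))

    def update_coordinate(i, delta):
        x[i] += delta
        start, end = A.indptr[i], A.indptr[i + 1]
        r[A.indices[start:end]] += delta * A.data[start:end]   # keep r = A@x - b

    return partial_derivative, update_coordinate
```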

30 PRELIMINARY EXPERIMENTS

31 L1 Regularized L1 Regression (Dorothea dataset): comparison of the Gradient Method, Nesterov’s Accelerated Gradient Method, SPCDM, and APPROX.

32 L1 Regularized L1 Regression

33 L1 Regularized Least Squares (LASSO), KDDB dataset: comparison of PCDM and APPROX.

34 Training Linear SVMs (Malicious URL dataset).

35 Choice of Stepsizes: How (not) to Parallelize Coordinate Descent

36 Convergence of Randomized Coordinate Descent. The rate depends on the problem class: strongly convex F (simple method); smooth or ‘simple’ nonsmooth F (accelerated method); ‘difficult’ nonsmooth F (accelerated method) or smooth F (simple method); ‘difficult’ nonsmooth F (simple method). Focus on the dependence on n (big data = big n).

37 Parallelization Dream: turn the serial method into a parallel one by adding up individual coordinate updates. Whether this works depends on the properties of F and on the way coordinates are chosen at each iteration. What do we actually get, versus what we WANT?

38 “Naive” parallelization Do the same thing as before, but for MORE or ALL coordinates & ADD UP the updates

39–43 Failure of naive parallelization (animation: computing both coordinate updates from the same point and adding them up overshoots the minimizer, and the iterates keep missing it. OOPS!)

44 Idea: averaging the updates may help (in the 2D example, averaging the two updates lands at the minimizer: SOLVED!)

45–46 Averaging can be too conservative (animation: when the individual updates do not conflict, averaging takes only a fraction of each step, and so on... The result is much less progress than the point we actually wanted. BAD!)

47 What to do? Two ways to combine the updates to the chosen coordinates (each applied along the i-th unit coordinate vector): averaging or summation. The goal is to figure out when one can safely use summation.
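In symbols (with e_i the i-th unit coordinate vector, S_k the sampled set, and h^{(i)} the update to coordinate i), the two options read roughly as follows:

```latex
\text{Averaging:}\quad x_{k+1} = x_k + \frac{1}{|S_k|} \sum_{i \in S_k} h^{(i)} e_i,
\qquad\qquad
\text{Summation:}\quad x_{k+1} = x_k + \sum_{i \in S_k} h^{(i)} e_i.
```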

48 ESO: Expected Separable Overapproximation

49 Five Models for f Admitting a Small ESO Parameter: (1) smooth partially separable f [RT’11b]; (2) nonsmooth max-type f [FR’13]; (3) f with ‘bounded Hessian’ [BKBG’11, RT’13a].

50 Five Models for f Admitting a Small ESO Parameter (continued): (4) partially separable f with smooth components [NC’13]; (5) partially separable f with block-smooth components [FR’13b].

51 Randomized Parallel Coordinate Descent Method: the new iterate is obtained from the current iterate by adding, for each coordinate i in a random set of coordinates (the sampling), an update to the i-th coordinate along the i-th unit coordinate vector.

52 ESO: Expected Separable Overapproximation. Definition [RT’11b]: an inequality bounding the expected function value after a random update by a quadratic that is (1) separable in h, so that (2) it can be minimized in parallel, and (3) updates need to be computed only for the sampled coordinates. A shorthand notation is introduced. Minimize in h.
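The definition appears as a formula on the slide; one common way to write it (following RT’11b / FR’13b, with Ŝ the sampling and weights w; quoted here from memory, so treat it as indicative) is:

```latex
\mathbb{E}\Big[f\big(x + h_{[\hat{S}]}\big)\Big]
\;\le\;
f(x) + \frac{\mathbb{E}|\hat{S}|}{n}\left(\langle \nabla f(x), h\rangle + \frac{\beta}{2}\,\|h\|_w^2\right),
\qquad
\|h\|_w^2 := \sum_{i=1}^{n} w_i \big(h^{(i)}\big)^2,
```

where h_{[Ŝ]} keeps the coordinates of h in Ŝ and zeroes out the rest. The right-hand side is separable in h, which is exactly what makes points 1–3 on the slide possible.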

53 Convergence Rate of APPROX (restating slide 24). Theorem [FR’13b]: under the key ESO assumption, the stated number of iterations, in terms of the number of coordinates and the average number of coordinates updated per iteration, implies an ε-accurate solution in expectation.

54 Convergence Rate of PCDM: Convex f. Theorem [RT’11b]: a number of iterations expressed in terms of the number of coordinates, the average number of updated coordinates per iteration, a stepsize parameter, and the error tolerance implies an ε-accurate solution in expectation.

55 Convergence Rate of PCDM: Strongly Convex f. Theorem [RT’11b]: the bound involves the strong convexity constants of the loss f and of the regularizer.

56 PART II. ADDITIONAL TOPICS

57 Partial Separability and Doubly Uniform Samplings

58 Serial uniform sampling. Probability law: a single coordinate is chosen, each with equal probability.

59 τ-nice sampling. Probability law: a subset of τ coordinates is chosen uniformly at random from all subsets of size τ. Good for shared-memory systems.

60 Doubly uniform sampling. Probability law: all subsets of the same cardinality are equally likely, while the distribution over cardinalities is arbitrary. Can model unreliable processors / machines.
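In symbols, the probability laws referred to on slides 58–60 can be written as follows (standard definitions from the coordinate descent literature, restated here since the slides show them only as images):

```latex
\text{Serial uniform:}\quad \mathbb{P}\big(\hat{S}=\{i\}\big)=\tfrac{1}{n};
\qquad
\tau\text{-nice:}\quad \mathbb{P}\big(\hat{S}=S'\big)=\binom{n}{\tau}^{-1}\ \text{for every } S'\subseteq\{1,\dots,n\},\ |S'|=\tau;
```
```latex
\text{Doubly uniform:}\quad \mathbb{P}\big(\hat{S}=S'\big)\ \text{depends on } S' \text{ only through } |S'|
\ \ \text{(the distribution of } |\hat{S}| \text{ is arbitrary).}
```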

61 ESO for partially separable functions and doubly uniform samplings. Theorem [RT’11b] (model 1: smooth partially separable f): an ESO holds with a parameter determined by the degree of partial separability and the sampling.
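The theorem is stated as a formula on the slide; for the special case of a τ-nice sampling and a smooth partially separable f with degree of separability ω, the ESO parameter from RT’11b is (quoted from memory, so treat the constant as indicative):

```latex
\beta \;=\; 1 + \frac{(\omega-1)(\tau-1)}{\max(1,\;n-1)},
\qquad\text{so the expected speedup of PCDM is roughly } \frac{\tau}{\beta}.
```

This is also the quantity behind slide 62: β stays close to 1 (near-linear speedup) when ω is small relative to n, i.e. for nearly separable (sparse) problems.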

62 PCDM: Theoretical Speedup. The speedup depends on the degree of partial separability, the number of coordinates, and the number of coordinate updates per iteration. Weak or no speedup for non-separable (dense) problems; linear or good speedup for nearly separable (sparse) problems. Much of Big Data is in the sparse regime!

63 (figure only, no text)

64 Theoretical speedup, n = 1000 (# coordinates).

65 Speedup observed in practice, n = 1000 (# coordinates).

66 PCDM: Experiment with a 1 billion-by-2 billion LASSO problem

67 Optimization with Big Data = Extreme* Mountain Climbing (* in a billion-dimensional space on a foggy day).

68 Coordinate Updates

69 Iterations

70 Wall Time

71 Distributed-Memory Coordinate Descent

72 Distributed τ-nice sampling. Probability law: the coordinates are partitioned across machines (Machine 1, Machine 2, Machine 3 in the illustration), and each machine samples τ of its own coordinates uniformly at random. Good for a distributed version of coordinate descent.

73 ESO: Distributed setting. Theorem [RT’13b] (model 3: f with ‘bounded Hessian’ [BKBG’11, RT’13a]): the ESO parameter involves the spectral norm of the data.

74 Bad partitioning at most doubles the number of iterations. Theorem [RT’13b]: the iteration bound involves the spectral norm of the partitioning, the number of nodes, and the number of updates per node; even in the worst case, a bad partitioning at most doubles the number of iterations.

75 (figure only, no text)

76 LASSO with a 3TB data matrix: 128 Cray XE6 nodes with 4 MPI processes each (c = 512); each node: 2 x 16-core CPUs with 32GB RAM.

77 References: serial coordinate descent
- Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for L1-regularized loss minimization. JMLR, 2011.
- Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
- [RT’11b] P.R. and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2012.
- Rachael Tappenden, P.R. and Jacek Gondzio. Inexact coordinate descent: complexity and preconditioning. arXiv:1304.5530, 2013.
- Ion Necoara, Yurii Nesterov, and Francois Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, Politehnica University of Bucharest, 2012.
- Zhaosong Lu and Lin Xiao. On the complexity analysis of randomized block-coordinate descent methods. Technical report, Microsoft Research, 2013.

78 References: parallel coordinate descent
- [BKBG’11] Joseph Bradley, Aapo Kyrola, Danny Bickson and Carlos Guestrin. Parallel Coordinate Descent for L1-Regularized Loss Minimization. ICML 2011.
- [RT’12] P.R. and Martin Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.
- Martin Takáč, Avleen Bijral, P.R., and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML 2013.
- [FR’13a] Olivier Fercoq and P.R. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv:1309.5885, 2013.
- [RT’13a] P.R. and Martin Takáč. Distributed coordinate descent method for big data learning. arXiv:1310.2059, 2013.
- [RT’13b] P.R. and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. arXiv:1310.3438, 2013.
(Good entry point to the topic: the 4-page paper.)

79 References: parallel coordinate descent (continued)
- P.R. and Martin Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings, 2012.
- Rachael Tappenden, P.R. and Burak Buke. Separable approximations and decomposition methods for the augmented Lagrangian. arXiv:1308.6774, 2013.
- Indranil Palit and Chandan K. Reddy. Scalable and parallel boosting with MapReduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1904-1916, 2012.
- [FR’13b] Olivier Fercoq and P.R. Accelerated, Parallel and Proximal coordinate descent. arXiv:1312.5799, 2013.

