Parallel and Distributed Block Coordinate Frank Wolfe


1 Parallel and Distributed Block Coordinate Frank Wolfe
Yu-Xiang Wang. Joint work with Veeru Sadhanala, Willie Neiswanger, Wei Dai, Suvrit Sra (MIT LIDS), and Eric Xing. CMU. ICML 2016.

2 Frank-Wolfe and why Frank-Wolfe?
Frank and Wolfe (1956): Can we use LP to iteratively solve QP?
$\min_x f(x)$ s.t. $x \in \mathcal{D}$
Linear oracle: $\min_{s \in \mathcal{D}} \langle s, \nabla f(\cdot) \rangle$
Recent renaissance in Big ML: projection free; affine invariant; induces atomic structure (sparse / low-rank); duality gap for free.
See, e.g., recent work from Jaggi, Lacoste-Julien, Schmidt, Takac, Grigas, Freund…

3 Standard Frank-Wolfe Algorithm
Let $x^{(0)} \in \mathcal{D}$
for $k = 0, \dots, K$ do
  Compute $s := \arg\min_{s \in \mathcal{D}} \langle s, \nabla f(x^{(k)}) \rangle$
  Update $x^{(k+1)} := (1 - \gamma)\, x^{(k)} + \gamma s$, for $\gamma := \frac{2}{k+2}$
end for
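The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the domain is assumed to be the probability simplex (so the linear oracle returns a vertex), and the function name `frank_wolfe_simplex` and the toy quadratic objective are made up for the example.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, num_iters=200):
    """Standard Frank-Wolfe over the probability simplex (an assumed domain)."""
    x = x0.copy()
    gap = np.inf
    for k in range(num_iters):
        g = grad(x)
        # Linear oracle on the simplex: the minimizer of <s, g> is a vertex.
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        # Duality gap <x - s, g> at the current iterate, free from the oracle output.
        gap = float(g @ (x - s))
        gamma = 2.0 / (k + 2)
        x = (1.0 - gamma) * x + gamma * s
    return x, gap

# Toy objective (assumption): f(x) = 0.5 * ||x - b||^2 with b inside the simplex.
b = np.array([0.2, 0.5, 0.3])
x, gap = frank_wolfe_simplex(lambda x: x - b, np.array([1.0, 0.0, 0.0]))
```

Every iterate is a convex combination of simplex vertices, so no projection step is needed to stay feasible.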

4 Block Coordinate Frank-Wolfe (BCFW)
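Lacoste-Julien et al. (2013)
$\min_x f(x)$ s.t. $x = [x_1, \dots, x_n] \in \mathcal{D}_1 \times \cdots \times \mathcal{D}_n =: \mathcal{D}$
Because the domain is a product, the linear oracle $\min_{s \in \mathcal{D}} \langle s, \nabla f(\cdot) \rangle$ separates into $n$ block subproblems:
$\min_{s_1 \in \mathcal{D}_1} \langle s_1, \nabla_1 f(x) \rangle$, …, $\min_{s_i \in \mathcal{D}_i} \langle s_i, \nabla_i f(x) \rangle$, …, $\min_{s_n \in \mathcal{D}_n} \langle s_n, \nabla_n f(x) \rangle$
Algorithm: Randomly pick $i$ in $\{1, 2, \dots, n\}$. Solve for $s_i$. Do block coordinate update.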
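A minimal sketch of the block coordinate update, under stated assumptions: each block lives on its own probability simplex (one row of `x` per block), the objective is a toy separable quadratic, and the step size $2n/(k+2n)$ is the one used by Lacoste-Julien et al. (2013). The name `bcfw` and the data are illustrative only.

```python
import numpy as np

def bcfw(grad_block, x0, num_iters, rng):
    """Block-coordinate Frank-Wolfe; x0 has shape (n, m), one simplex per row."""
    x = x0.copy()
    n = x.shape[0]
    for k in range(num_iters):
        i = rng.integers(n)               # pick one block uniformly at random
        g_i = grad_block(x, i)            # partial gradient for block i
        s_i = np.zeros(x.shape[1])
        s_i[np.argmin(g_i)] = 1.0         # block linear oracle: a simplex vertex
        gamma = 2.0 * n / (k + 2.0 * n)   # step size from Lacoste-Julien et al. (2013)
        x[i] = (1.0 - gamma) * x[i] + gamma * s_i
    return x

# Toy objective (assumption): f(x) = 0.5 * ||x - B||_F^2, rows of B on the simplex.
B = np.array([[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.3, 0.3, 0.4], [0.1, 0.1, 0.8]])
f = lambda x: 0.5 * np.sum((x - B) ** 2)
x0 = np.tile(np.array([1.0, 0.0, 0.0]), (4, 1))
x = bcfw(lambda x, i: x[i] - B[i], x0, num_iters=2000, rng=np.random.default_rng(0))
```

Only one block is touched per iteration, so each step costs one block oracle call instead of $n$.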

5 Can we parallelize it? A mini-batch version of BCFW
Randomly pick a subset $S \subset [n]$ with $|S| =: \tau$. Solve the subroutine for each $i \in S$. Update the parameter vector.
What if we can solve this in parallel? Can we speed things up further?
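The mini-batch variant can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: blocks are simplices (rows of `x`), the coupled toy objective is made up, the step size $2n\tau/(\tau^2 k + 2n)$ is assumed, and clipping it to 1 is a safeguard added here to keep iterates feasible.

```python
import numpy as np

def minibatch_bcfw(grad, x0, tau, num_iters, rng):
    """Mini-batch BCFW sketch; x0 has shape (n, m), one simplex per row."""
    x = x0.copy()
    n, m = x.shape
    for k in range(num_iters):
        S = rng.choice(n, size=tau, replace=False)   # random subset, |S| = tau
        g = grad(x)                                  # all oracles see the same iterate
        # The tau block oracles are independent, so they could run in parallel.
        s = np.zeros((tau, m))
        s[np.arange(tau), np.argmin(g[S], axis=1)] = 1.0
        # Assumed step size 2*n*tau / (tau^2 * k + 2*n), clipped to 1 as a safeguard.
        gamma = min(1.0, 2.0 * n * tau / (tau ** 2 * k + 2.0 * n))
        x[S] = (1.0 - gamma) * x[S] + gamma * s
    return x

# Toy coupled objective (assumption): blocks interact through their sum.
B = np.array([[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.3, 0.3, 0.4], [0.1, 0.1, 0.8]])
c, mu = B.sum(axis=0), 0.5
f = lambda x: 0.5 * np.sum((x - B) ** 2) + 0.5 * mu * np.sum((x.sum(axis=0) - c) ** 2)
grad = lambda x: (x - B) + mu * (x.sum(axis=0) - c)
x0 = np.tile(np.array([1.0, 0.0, 0.0]), (4, 1))
x = minibatch_bcfw(grad, x0, tau=2, num_iters=2000, rng=np.random.default_rng(0))
```

Setting `tau=1` recovers a single-block update, and `tau=n` updates every block from the same gradient, as in batch Frank-Wolfe.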

6 Questions of interest
Does it converge, and at what rate? Yes.
Does it converge faster than BCFW? Sometimes; it depends on the problem at hand.
Is it robust to delayed updates? Yes, very much so.

7 "Cloud" Oracle model
Different types of randomization. Various system schemes. Parallel and distributed.

8 "Cloud" Oracle model
A1. Updates received are i.i.d. uniform over $[n]$.
A2. Each update is an approximate solution to (2) in expectation:
$\mathbb{E}\left[ \langle s_S, \nabla_S f^{(k)} \rangle - \min_{s' \in \mathcal{D}_S} \langle s', \nabla_S f^{(k)} \rangle \right] \le \frac{\delta \gamma_k C_f^\tau}{2}$
Much weaker than what was required previously!

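9 Set Curvatures
Curvature, set curvature $C_f^S$:
$f(y) \le f(x) + \gamma \langle s_S - x_S, \nabla_S f(x) \rangle + \frac{\gamma^2}{2} C_f^S$, $\forall \gamma \in [0,1]$, $\forall x, s \in \mathcal{D}$, $y = x + \gamma (s_S - x_S)$
Expected set curvature: $C_f^\tau := \mathbb{E}_{S : |S| = \tau}\left[ C_f^S \right]$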

10 Question 1: Does it converge?
For appropriately chosen step sizes $\gamma_k = \frac{2n\tau}{\tau^2 k + 2n}$:
$\mathbb{E}[\text{DualityGap}] = O\left( \frac{n^2 C_f^\tau (1+\delta)}{\tau^2 k} \right)$
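The quantity bounded here is the Frank-Wolfe duality gap, which falls out of one pass of the block linear oracle. The sketch below assumes a product of simplices (one per row of `x`); the function name and the toy objective are illustrative, not from the slides.

```python
import numpy as np

def duality_gap(x, g):
    """Frank-Wolfe duality gap on a product of simplices (rows of x).

    g is the full gradient at x, same shape as x. For each block the linear
    oracle picks the vertex with the smallest gradient entry, so
    gap = sum_i <x_i - s_i, g_i> = sum_i (<x_i, g_i> - min_j g_ij).
    """
    return float(np.sum(x * g) - np.sum(g.min(axis=1)))

# Toy objective (assumption): f(x) = 0.5 * ||x - B||_F^2 with minimizer x = B.
B = np.array([[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.3, 0.3, 0.4], [0.1, 0.1, 0.8]])
f = lambda x: 0.5 * np.sum((x - B) ** 2)
x0 = np.tile(np.array([1.0, 0.0, 0.0]), (4, 1))
gap_start = duality_gap(x0, x0 - B)   # upper-bounds f(x0) - f(B) by convexity
gap_opt = duality_gap(B, B - B)       # zero gradient at the minimizer
```

Because the gap upper-bounds the suboptimality for convex $f$, it can serve as a stopping criterion without knowing the optimal value.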

11 Question 2: Does it converge faster?
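$\mathbb{E}[\text{DualityGap}] = O\left( \frac{n^2 C_f^\tau (1+\delta)}{\tau^2 k} \right)$
When $\tau = 1$, this is BCFW. When $\tau = n$, this reduces to batch FW.
Whether a larger $\tau$ helps boils down to the curvature constant:
$\Omega\left( \frac{\tau}{n^2} \right) \le C_f^\tau \le O\left( \frac{\tau^2}{n^2} \right)$
Substituting the upper bound gives a rate of $O\left( \frac{1}{k} \right)$ (no gain from the minibatch); substituting the lower bound gives $O\left( \frac{1}{\tau k} \right)$ (a $\tau$-fold speed-up).
These bounds hide a possible problem-specific constant that does not depend on $\tau$.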

12 A coupling condition
$f(y) \le f(x) + \langle y - x, \nabla f(x) \rangle + (y - x)^\top H (y - x)$
[Figure: example matrices $H$ for the three cases below; the entries are not recoverable from the transcript.]
(a) No coupling: $C_f^\tau = O\left( \frac{\tau}{n^2} \right)$
(b) High coupling: $C_f^\tau = O\left( \frac{\tau^2}{n^2} \right)$
(c) Typical coupling: in between; $O\left( \frac{\tau}{n^2} \right)$ if $H$ is SDD
Lower coupling implies faster convergence with a larger minibatch.
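The contrast between the no-coupling and high-coupling regimes can be illustrated numerically. The sketch below is an illustration only: it assumes scalar blocks and uses the expected sum of $|H_{ij}|$ over a uniformly random $\tau$-subset as a crude, unnormalized stand-in for the set curvature. Both the function name and the proxy are assumptions, not something defined on the slides.

```python
import numpy as np

def expected_coupling(H, tau):
    """E over uniform subsets S with |S| = tau of sum_{i,j in S} |H_ij| (needs n >= 2)."""
    n = H.shape[0]
    A = np.abs(H)
    diag = np.trace(A)
    off = A.sum() - diag
    # P(i in S) = tau/n; P(i and j both in S) = tau*(tau-1) / (n*(n-1)) for i != j.
    return tau / n * diag + tau * (tau - 1) / (n * (n - 1)) * off

n, tau = 8, 4
no_coupling = expected_coupling(np.eye(n), tau)          # diagonal H: grows like tau
high_coupling = expected_coupling(np.ones((n, n)), tau)  # dense H: grows like tau**2
```

With a diagonal $H$ the proxy is linear in $\tau$, while a dense $H$ makes it quadratic, mirroring the $O(\tau)$ versus $O(\tau^2)$ scaling in cases (a) and (b) up to normalization.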

13 Concrete examples
Group fused lasso over an arbitrary graph: $C_f^\tau = O(\tau \lambda^2)$
Multiclass SVM (a special model): $C_f^\tau = O_p\left( \frac{\tau}{n} \right)$ for $\tau \le$ # of classes

14 Speed-up in simulation
[Figures: speed-up over BCFW on the OCR dataset and on group fused lasso, measured in number of iterations.]

15 Question 3: delayed updates?
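Idea: bound the oracle error $\delta$ in $\mathbb{E}[\text{DualityGap}] = O\left( \frac{n^2 C_f^\tau (1+\delta)}{\tau^2 k} \right)$.
Delay as a random variable $\kappa$. Expected delay $\bar{\kappa} := \mathbb{E}\kappa$. Max delay: $\mathbb{P}(\kappa < \kappa_{\max}) = 1$.
Theorem 6. Let $L_\tau$ and $D_\tau$ be the coordinate-Lipschitz constant and the diameter w.r.t. subsets of $\tau$ blocks. Then
$\delta \le \frac{4 \bar{\kappa} \tau L_1 D_1 D_\tau}{C_f^\tau}$.
Or, if $\kappa_{\max} \tau = O(n \log n)$:
$\delta = O\left( \frac{\tau L_1 D_1 \, \mathbb{E} D_{\kappa\tau}}{C_f^\tau} \right)$, annotated on the slide with $\approx \kappa D_\tau$.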
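A delayed update can be simulated by letting each oracle call read a stale copy of the iterate while the update is applied to the current one. The sketch below is a toy single-process simulation under assumptions: simplex blocks, a separable quadratic objective, delays drawn uniformly from the available history, and the single-block step size $2n/(k+2n)$. None of these choices comes from the slides.

```python
import numpy as np
from collections import deque

def delayed_bcfw(grad_block, x0, max_delay, num_iters, rng):
    """BCFW where each oracle call sees an iterate up to max_delay steps old."""
    x = x0.copy()
    n, m = x.shape
    history = deque([x.copy()], maxlen=max_delay + 1)   # most recent iterate is last
    for k in range(num_iters):
        delay = rng.integers(0, len(history))           # random staleness (assumed uniform)
        x_stale = history[-1 - delay]
        i = rng.integers(n)
        s_i = np.zeros(m)
        s_i[np.argmin(grad_block(x_stale, i))] = 1.0    # oracle solved on stale data
        gamma = 2.0 * n / (k + 2.0 * n)
        x[i] = (1.0 - gamma) * x[i] + gamma * s_i       # applied to the current iterate
        history.append(x.copy())
    return x

# Toy objective (assumption): f(x) = 0.5 * ||x - B||_F^2, rows of B on the simplex.
B = np.array([[0.2, 0.5, 0.3], [0.6, 0.1, 0.3], [0.3, 0.3, 0.4], [0.1, 0.1, 0.8]])
f = lambda x: 0.5 * np.sum((x - B) ** 2)
x0 = np.tile(np.array([1.0, 0.0, 0.0]), (4, 1))
x = delayed_bcfw(lambda x, i: x[i] - B[i], x0, max_delay=5,
                 num_iters=2000, rng=np.random.default_rng(0))
```

A stale oracle answer is simply a less accurate solution of the block subproblem, which is why delay can be absorbed into the approximation parameter $\delta$.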

16 Compare to Async SGD and Async BCD
Delay     | AP-BCFW (this work) | AP-BCD (Liu et al., 2013) | Hogwild! (Niu et al., 2011)
Unbounded | $O(\kappa)$         | -                         | -
Bounded   | often $O(\kappa)$   | $O(\exp(\kappa_{\max}))$  | $O(\kappa_{\max}^2)$
Open problem: Can we get a similar bound for AP-BCD?
Improved to $\kappa$ recently for SGD, but this requires a second moment bound: Suvrit Sra, Adams Wei Yu, Mu Li, AdaDelay (AISTATS'16).

17 Proof idea of getting sublinear rate
A delay of $\kappa_{\max}$ does not mean a block got updated $\kappa_{\max}$ times.
Load balancing: in the past $\kappa_{\max}$ iterations, throw $\kappa_{\max} \tau$ random balls into $n$ bins; expected max load $= O(\log n)$ if $\kappa_{\max} \tau \le n \log n$.
Mitzenmacher, Michael. "The power of two choices in randomized load balancing." IEEE Transactions on Parallel and Distributed Systems (2001).
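The balls-into-bins picture is easy to check empirically. The sketch below throws about $n \log n$ balls uniformly into $n$ bins and records the fullest bin; the bin count, number of trials, and function name are arbitrary choices made for the illustration.

```python
import numpy as np

def max_load(num_balls, num_bins, rng):
    """Throw num_balls uniformly at random into num_bins; return the fullest bin's count."""
    bins = rng.integers(num_bins, size=num_balls)
    return int(np.bincount(bins, minlength=num_bins).max())

rng = np.random.default_rng(0)
n = 1000
num_balls = int(n * np.log(n))          # plays the role of kappa_max * tau = n log n
loads = [max_load(num_balls, n, rng) for _ in range(20)]
avg_max_load = float(np.mean(loads))    # compare against log(n), about 6.9 here
```

In this regime the fullest bin stays within a constant factor of $\log n$, so within a window of $\kappa_{\max}$ iterations no single block is expected to have received more than $O(\log n)$ of the updates.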

18 Effect of delay and straggler
[Figures: convergence with heavy-tailed delay (measured by number of iterations); effect of a straggler worker.]

19 System implications and caveats
Heterogeneous workers? No problem. Average performance.
Heterogeneous blocks? This may break A1 (uniform over blocks). Additional algorithmic tricks are needed to enforce A1.
Is it lock-free? Almost. Atomic operations are over blocks (rather than over a "double" as in Hogwild).

20 Speed-up in real clock time
[Figures: real-data experiments on OCR; results for a more complex subroutine solve.]

21 Summary
Minibatch BCFW converges.
It converges provably faster than BCFW for problems with low coupling over blocks.
It converges under delayed updates. The rate depends only on the expected delay, sometimes sublinearly.

22 Open problems
Solve problems with heterogeneous blocks without "padding".
Can AP-BCD be improved to handle "delay" better?
Frank-Wolfe: projection free; affine invariant; induces atomic structure (sparse / low-rank); duality gap for free; robust to delay (?)

