
1 Distributed Parameter Synchronization in DNN Hucheng Zhou (MSRA) Zheng Zhang (MSRA) Minjie Wang (SJTU)

2 Training data: TBs of data. Model: several layers, thousands of neurons per layer, millions of edges between two layers, several GBs of model size.

3 DNN model training could take weeks or even longer.

4 What if we could train the DNN model in one day? It is still a dream. Fast training needs parallelism, even in a distributed fashion.

5 Model Parallelism: the model is partitioned across machines and trained in parallel.

6 Model Parallelism: network traffic stays bounded, but the speedup is non-linear and training is still slow with large data sets.

7 Another dimension of parallelism, data parallelism, is required

8 Data Parallelism: 1. The training data is partitioned, and multiple model replicas are trained in parallel. 2. The intermediate trained results (model parameters) are synchronized.
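As a concrete illustration of this data-parallel scheme, the toy Python sketch below (not from the talk; the linear model and the plain averaging step are assumptions) trains one replica per data shard and then synchronizes the intermediate parameters.

import numpy as np

def compute_gradient(params, x, y):
    # Gradient of the squared error for a toy linear model y ≈ params @ x.
    return 2.0 * (params @ x - y) * x

def train_data_parallel(shards, init_params, rounds=5, lr=0.01):
    # One full model replica per worker/shard.
    replicas = [init_params.copy() for _ in shards]
    for _ in range(rounds):
        # 1. Each worker trains its own replica on its own data shard.
        for replica, shard in zip(replicas, shards):
            for x, y in shard:
                replica -= lr * compute_gradient(replica, x, y)
        # 2. The intermediate results (model parameters) are synchronized;
        #    a simple average stands in for the parameter-server exchange.
        avg = sum(replicas) / len(replicas)
        replicas = [avg.copy() for _ in replicas]
    return replicas[0]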

9 Outline Problem statement Design goals Design Evaluation

10 It is not a good idea to combine model training and model synchronization

11 Separate model training from model synchronization: build a dedicated system, a Parameter Server (PS), that the training application uses to synchronize the intermediate model parameters, as in DistBelief (NIPS 2012).

12 Outline Problem statement Design goals Design Evaluation

13 How to build a scalable, reliable and still efficient parameter server?

14 A Centralized Approach: Asynchronous Stochastic Gradient Descent (A-SGD). Model workers compute updates Δp over their data and push them to the parameter server, which applies p' = p + Δp, then p'' = p' + Δp', and so on.

15 In this centralized approach, Δp is a float vector or matrix rather than a key-value pair, and p' = p + Δp is commutative and associative, which makes synchronization in bulk possible.
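A minimal Python sketch of why this matters (the class names are illustrative, not the system's API): because p' = p + Δp is commutative and associative, a worker can fold many small updates into one accumulator and push them in bulk, and the server still ends up with the same parameters.

import numpy as np

class CentralParameterServer:
    def __init__(self, params):
        self.params = params.astype(float)

    def push(self, delta):
        # p' = p + Δp; the order of pushes does not matter.
        self.params += delta

class BulkWorker:
    def __init__(self, dim):
        self.accumulated = np.zeros(dim)   # sum of local Δp not yet pushed

    def add_update(self, delta):
        self.accumulated += delta          # fold instead of pushing immediately

    def flush(self, server):
        server.push(self.accumulated)      # one bulk push instead of many small ones
        self.accumulated[:] = 0.0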

16 However, a single centralized server does not scale when there are many model workers.

17 Whether the single parameter server is overloaded depends on: the size of the model parameters (240 MB); the model update rate (3 times/s, thus 720 MB/s); the number of model workers (overloaded if n is large); and the GPU scenario.

18 Partitioning the model parameters helps (Wei Dai, Jinliang Wei, Xun Zheng, Jin Kyu Kim, Seunghak Lee, Junming Yin, Qirong Ho and E. P. Xing, Petuum: A Framework for Iterative-Convergent Distributed ML, manuscript, Dec 2013).

19 A local cache of the model parameters also helps: parameter slaves hold replicas and serve the workers on behalf of the parameter master.

20 However, the parameter master may still be the bottleneck, which motivates a decentralized (peer-to-peer) system design.

21 And what if faults happen?

22 Possible faults: 1. network delay or outage; 2. machine crash and restart; 3. software crash, data loss, job preemption.

23 Again, the system is not reliable without fault-tolerance support, which motivates a fault-tolerant system design.

24 How about performance if staleness (consistency) is required?

25 Staleness (consistency) is required. Each model worker pulls the parameters p from the parameter server; worker 1 pushes Δp1 and the server applies p1 = p + Δp1.

26 Meanwhile, workers run at different speeds (fast, slower, slowest); a slower worker pushes Δp2 computed against the old p after the server has already moved to p1 = p + Δp1.

27 Controlling staleness is required for fast model convergence: with coordination, the updates by worker 1 and worker 2 move from the initialization toward the global optimum; without coordination (worker 2 works on an over-staled model), training may fall into a local optimum.

28 The working pace of each worker should be coordinated, e.g., by a central coordinator (as in L-BFGS training).

29 However, a centralized coordinator is costly, and the system performance (parallelism) is not fully exploited. A balance between system performance and model convergence rate is motivated.

30 Outline Problem statement Design goals Design Evaluation

31 1. Each worker machine has a local parameter server (model replica), and the system is responsible for parameter synchronization

32 System Architecture: each local parameter server exchanges only the accumulated updates (commutative and associative), which reduces network traffic; training is non-blocking and synchronization is asynchronous.
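The sketch below illustrates this architecture under stated assumptions (the class and method names are mine, not the system's): each machine keeps a full model replica plus one accumulator of pending updates per neighbor, so training never blocks on the network and only accumulated updates cross it.

import numpy as np

class LocalParameterServer:
    def __init__(self, params, neighbors):
        self.params = params.astype(float)                  # local model replica
        # Accumulated (commutative, associative) updates not yet sent to each neighbor.
        self.pending = {n: np.zeros_like(self.params) for n in neighbors}

    def local_update(self, delta):
        # Called by the co-located trainer; applies immediately, never blocks.
        self.params += delta
        for acc in self.pending.values():
            acc += delta                                    # remember it for every neighbor

    def take_outgoing(self, neighbor):
        # Hand the accumulator to the asynchronous exchange thread and reset it.
        out = self.pending[neighbor]
        self.pending[neighbor] = np.zeros_like(self.params)
        return out

    def receive(self, delta):
        # Merge a neighbor's accumulated update into the local replica.
        # (Relaying it onward to other neighbors is omitted in this sketch.)
        self.params += delta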

33 2. How to mutually exchange parameter updates between two connected local parameter servers, with fault tolerance against network delay or even outage?

34 Pairwise fault-tolerant update exchange protocol.

35 Pairwise Protocol Invariants (figure: nodes p, q, and r, with φqp and φrp denoting the accumulated updates exchanged between the pairs).

36 Pairwise Protocol Invariants (continued figure: nodes p, q, and r under the pairwise fault-tolerant update exchange protocol).

37 Pairwise Protocol Details

38–50 (Figure-only slides: pairwise protocol details.)
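The figure-only slides above carry the actual protocol details; the Python sketch below is only one plausible reading of a pairwise fault-tolerant exchange, with all names assumed: each side keeps the accumulated update it last sent, together with a version number, until the peer acknowledges it, so a lost message can be retransmitted unchanged and a duplicate can be detected by its version and skipped.

import numpy as np

class PairwiseLink:
    """One direction of the exchange from this node toward a single neighbor."""

    def __init__(self, dim):
        self.pending = np.zeros(dim)    # updates accumulated, not yet put in a message
        self.in_flight = None           # (version, delta) sent but not yet acknowledged
        self.send_version = 0
        self.recv_version = 0

    def accumulate(self, delta):
        self.pending += delta           # + is commutative/associative, so folding is safe

    def next_message(self):
        # Retransmit the unacknowledged message unchanged, or package a fresh one.
        if self.in_flight is None:
            self.send_version += 1
            self.in_flight = (self.send_version, self.pending.copy())
            self.pending[:] = 0.0
        return self.in_flight

    def on_ack(self, version):
        if self.in_flight is not None and version == self.in_flight[0]:
            self.in_flight = None       # the peer applied it; safe to forget

    def on_receive(self, version, delta, params):
        # A duplicate caused by retransmission after a lost ack: re-ack but skip.
        if version <= self.recv_version:
            return params, version
        self.recv_version = version
        return params + delta, version  # apply the peer's accumulated update, then ack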

51 3. How about flow control?

52 Straightforward: control the timing of synchronization, e.g., with a timer, by the version gap, or even with dynamic adjustment.
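For illustration only (the parameter names are assumptions), the decision can be reduced to a small predicate combining a timer with a version gap; a dynamic policy would simply adjust the two bounds at run time.

def may_push(now, last_push_time, local_version, acked_version,
             min_interval=0.5, max_version_gap=4):
    # Push when enough time has passed (timer) AND the receiver is not
    # too far behind (version gap); either bound can be tuned dynamically.
    return (now - last_push_time >= min_interval and
            local_version - acked_version < max_version_gap)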

53 4. How about fault tolerance?

54 NOT based on redundancy (multiple copies). (Cf. Mu Li, Li Zhou, Zichao Yang, Aaron Li, Fei Xia, Dave Andersen and Alex Smola, Parameter Server for Distributed Machine Learning, Big Learning Workshop, NIPS 2013.)

56 Failure types: temporary outage, scheduled failure, permanent failure.

57 Dynamically adding or removing model replicas follows the same logic as fault tolerance.

58 5. How are local parameter servers connected (topology)?

59 The right topology is hard for the system to determine; it depends on the application: model size, update rate, network bandwidth, the number of neighbors, etc. Therefore, topology configuration is motivated.

60 Furthermore, as workers leave and join, the topology should be adjusted accordingly; therefore, topology re-configuration is necessary. For example, incrementally adding model replicas would be helpful for DNN training.

61 Master-slave topology: shortest propagation delay (one hop), but high workload on the master.

62 Tree-based topology: decentralized, longer propagation delay (multiple hops), but without a bottleneck.
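A small sketch of the two wirings above, using plain adjacency lists (the helper names are illustrative, not the system's configuration format):

def star_topology(nodes):
    # Master-slave: every slave connects only to the master (one hop, hot master).
    master, slaves = nodes[0], nodes[1:]
    return {master: list(slaves), **{s: [master] for s in slaves}}

def tree_topology(nodes, fanout=2):
    # Tree: no single hotspot, at the cost of multi-hop propagation.
    edges = {n: [] for n in nodes}
    for i, child in enumerate(nodes[1:], start=1):
        parent = nodes[(i - 1) // fanout]
        edges[parent].append(child)
        edges[child].append(parent)
    return edges

# Example: tree_topology(["a", "b", "c", "d"]) links a-b, a-c, and b-d.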

63 Scalability is sensitive to topology

64 Topology affects staleness

65 6. And how to set the right staleness to balance the system performance and model convergence rate?

66 Application-defined staleness is supported, such as: best effort (no extra requirement); maximal delayed time (block a push if the previous n pushes have not completed); user-defined filters (only push significant updates); and SSP* (bound the maximum gap between the fastest and slowest worker), either by bounding the update version gap or by bounding the parameter value gap. (*) Q. Ho, J. Cipar, H. Cui, J.-K. Kim, S. Lee, P. B. Gibbons, G. Gibson, G. R. Ganger and E. P. Xing, More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server, NIPS 2013.
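As a concrete example of the SSP option, the check below (a sketch; the clock representation is an assumption) lets a worker proceed only while the gap between it and the slowest worker stays within the staleness bound s.

def ssp_can_proceed(worker_clock, all_clocks, staleness=3):
    # SSP-style bound: the fastest worker may be at most `staleness` clocks
    # ahead of the slowest one; otherwise it must wait.
    return worker_clock - min(all_clocks) <= staleness

# With staleness 3, a worker at clock 7 may proceed while the slowest is at 4,
# but a worker at clock 8 must wait while the slowest is still at 4.
assert ssp_can_proceed(7, [4, 5, 7], staleness=3)
assert not ssp_can_proceed(8, [4, 5, 8], staleness=3)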

67 Outline Problem statement Design goals Design Evaluation

69 Recap: re-configurability is the king in system design, and the layered design is beautiful: pure P2P design, pairwise protocol, flow control, fault tolerance, node joining or leaving, configurable topology, configurable staleness.

70 Future work: the parameter server design is not only for DNN but also for general inference problems, such as generalized linear models with a single massive vector, topic models with sparse vectors, and graphical models with plates. The design also works for areas other than machine learning, i.e., scenarios with structured data whose aggregation is both commutative and associative, such as sensor networks collecting aggregated data.

71 Related work:
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large Scale Distributed Deep Networks, NIPS 2012.
Q. Ho, J. Cipar, H. Cui, J.-K. Kim, S. Lee, P. B. Gibbons, G. Gibson, G. R. Ganger, and E. P. Xing. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server, NIPS 2013.
Jinliang Wei, Wei Dai, Abhimanu Kumar, Xun Zheng, Qirong Ho, and E. P. Xing. Consistent Bounded-Asynchronous Parameter Servers for Distributed ML, manuscript, Dec 2013.
Wei Dai, Jinliang Wei, Xun Zheng, Jin Kyu Kim, Seunghak Lee, Junming Yin, Qirong Ho, and E. P. Xing. Petuum: A Framework for Iterative-Convergent Distributed ML, manuscript, Dec 2013.
Mu Li, Li Zhou, Zichao Yang, Aaron Li, Fei Xia, Dave Andersen, and Alex Smola. Parameter Server for Distributed Machine Learning, Big Learning Workshop, NIPS 2013.

72 Thanks! Questions?

73 Backup

