
1 Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications Jonathan Lifflander, UIUC Sriram Krishnamoorthy, PNNL* Laxmikant Kale, UIUC HPDC 2012

2 Dynamic load balancing on 100,000 processor cores and beyond

3 HPDC’12, Delft Iterative Applications Applications repeatedly executing the same computation, with static or slowly evolving execution characteristics. Execution characteristics preclude static balancing: application characteristics (comm. pattern, sparsity, …) and execution environment (topology, asymmetry, …). Challenge: load-balancing such applications.

4 HPDC’12, Delft Overdecomposition Expose greater levels of concurrency than supported by the hardware. The middleware (runtime) dynamically maps the concurrent tasks to hardware resources. The abstraction supports continuous optimization and adaptation: improvements to load balancing, new metrics (power, energy, graceful degradation, …), and new features such as fault tolerance and power/energy-awareness.
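To make the idea concrete, here is a minimal sketch of overdecomposition in C++ (illustrative names only, not TASCEL's API): the iteration space is cut into many more tasks than there are workers, and mapping those tasks onto cores is left entirely to the runtime.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Task { std::size_t begin, end; };   // a chunk of a larger iteration space

    // Split n_iters units of work into roughly tasks_per_worker * n_workers tasks.
    std::vector<Task> overdecompose(std::size_t n_iters, std::size_t n_workers,
                                    std::size_t tasks_per_worker /* e.g. 16-64 */) {
      std::size_t n_tasks = n_workers * tasks_per_worker;
      std::size_t chunk   = (n_iters + n_tasks - 1) / n_tasks;
      std::vector<Task> tasks;
      for (std::size_t b = 0; b < n_iters; b += chunk)
        tasks.push_back({b, std::min(b + chunk, n_iters)});
      return tasks;   // the runtime, not the user, decides where these execute
    }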

5 HPDC’12, Delft Problem Statement Scalable load balancers for iterative overdecomposed applications. We consider two alternatives: persistence-based load balancing and work stealing. How do these algorithms behave at scale? How do they compare?

6 HPDC’12, Delft Related Work Overdecomposition is a widely used approach. Inspector-executor approaches employ start-time load balancers. Prior hierarchical load balancers typically do not consider localization. The scalability of work stealing is not well understood – the largest prior demonstration was on 8,192 cores. There has been no comparative evaluation of the two schemes.

7 HPDC’12, Delft TASCEL: Task Scheduling Library A runtime library for task-parallel programs. Manages task collections for execution on distributed-memory machines. Compatible with native MPI programs. Phase-based switching between SPMD and non-SPMD modes of execution.

8 HPDC’12, Delft TASCEL Execution Task: the basic unit of migratable execution. Typical workflow: create a task collection, seed it with one or more tasks, and process tasks in the collection until termination is detected. Processing of task collections manages concurrency, faults, …; trade-offs are exposed through implementation specializations: dynamic load balancing schemes, fault tolerance protocols, …
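The workflow on this slide can be mirrored by a small, single-process analogue (TASCEL itself targets distributed memory; the sketch below only reproduces the create/seed/process-until-termination structure, and every name in it is illustrative rather than the library's real API):

    #include <deque>
    #include <iostream>

    struct Task { int unit; };                      // basic unit of migratable execution

    int main() {
      std::deque<Task> collection;                  // create a task collection
      collection.push_back({0});                    // seed it with one or more tasks
      while (!collection.empty()) {                 // stand-in for termination detection
        Task t = collection.front(); collection.pop_front();
        std::cout << "executing task " << t.unit << "\n";
        if (t.unit < 3) collection.push_back({t.unit + 1});   // tasks may create new tasks
      }
      return 0;
    }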

9 HPDC’12, Delft Load Balancers Greedy localized hierarchical persistence-based load balancing Retentive work stealing

10-11 HPDC’12, Delft Greedy Localized Hierarchical Persistence-based LB [Figure, animated over two slides: tasks labeled 0-5 rebalanced across a processor hierarchy] Intuition: satisfy local imbalance first.
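As a sketch of what one greedy, persistence-based balancing step looks like (illustrative code, not the implementation from the talk): task times measured in the previous iteration are sorted in decreasing order and each task is assigned to the currently least-loaded processor. In the localized hierarchical scheme above, a step like this is applied within small local groups first, so most imbalance is satisfied locally and only the residue is exposed to higher levels.

    #include <algorithm>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    struct TaskLoad { int task_id; double time; };   // time measured in the last iteration

    // Greedy assignment within one group; assumes task_ids are 0..tasks.size()-1.
    // Returns owner[i] = processor (within this group) that task i is assigned to.
    std::vector<int> greedy_assign(std::vector<TaskLoad> tasks, int procs_in_group) {
      std::sort(tasks.begin(), tasks.end(),
                [](const TaskLoad& a, const TaskLoad& b) { return a.time > b.time; });
      using Slot = std::pair<double, int>;           // (accumulated load, processor)
      std::priority_queue<Slot, std::vector<Slot>, std::greater<Slot>> load;  // min-heap
      for (int p = 0; p < procs_in_group; ++p) load.push({0.0, p});
      std::vector<int> owner(tasks.size());
      for (const TaskLoad& t : tasks) {
        auto [l, p] = load.top(); load.pop();        // least-loaded processor so far
        owner[t.task_id] = p;
        load.push({l + t.time, p});
      }
      return owner;
    }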

12 HPDC’12, Delft Retentive Work Stealing [Figure: a local queue per processor (Proc 1, Proc 2, Proc 3, …, Proc n) drawing from a shared work pool]

13 HPDC’12, Delft Retentive Work Stealing [Figure: split task queue with head, split, and stail pointers dividing the local and remote (stealable) regions]

14 HPDC’12, Delft Retentive Work Stealing [Figure: split queue with head, split, and stail pointers over a buffer of locally executed tasks] addTask(): add a task to the local region. getTask(): remove a task from the local region.

15 HPDC’12, Delft Retentive Work Stealing [Figure: split queue with head, split, and stail pointers] releaseToShared(): move tasks to the shared portion. acquireFromShared(): move tasks to the local portion.
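The queue structure on slides 13-15 can be sketched as follows (a simplified, single-threaded shared-memory analogue; field names follow the slide labels, and wraparound and the synchronization from slide 16 are omitted). Indices grow from stail to head: [stail, split) is the shared region visible to thieves, [split, head) is the worker's private region.

    #include <algorithm>
    #include <vector>

    struct SplitQueue {
      std::vector<int> buf = std::vector<int>(1 << 20);  // task buffer; wraparound omitted
      long stail = 0, split = 0, head = 0;               // stail <= split <= head

      void addTask(int t) { buf[head++] = t; }           // add task to local region
      bool getTask(int& t) {                             // remove task from local region
        if (head == split && !acquireFromShared(1)) return false;
        t = buf[--head];
        return true;
      }
      void releaseToShared(long k) {                     // move tasks to shared portion
        split = std::min(split + k, head);
      }
      bool acquireFromShared(long k) {                   // move tasks back to local portion
        long take = std::min(k, split - stail);
        if (take <= 0) return false;
        split -= take;
        return true;
      }
    };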

16 HPDC’12, Delft Retentive Work Stealing [Figure: split queue with head, split, stail, itail, and ctail pointers] stail: beginning of tasks available to be stolen. itail: number of tasks that have finished transfer. ctail: past this marker it is safe to reuse the buffer. Steal protocol: 1. Mark tasks stolen at stail and begin transfer. 2. Atomically increment itail on completion of transfer. 3. The worker updates ctail when stail == itail.
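A sketch of the steal-completion bookkeeping above, using shared-memory atomics as a stand-in for the one-sided/active-message operations a distributed implementation would use (names follow the slide; bounds checks against the split pointer are omitted):

    #include <atomic>

    struct StealCounters {
      std::atomic<long> stail{0};   // start of tasks still available to be stolen
      std::atomic<long> itail{0};   // count of stolen tasks whose transfer has completed
      long ctail = 0;               // below this marker the buffer is safe to reuse
    };

    // Thief, step 1: mark k tasks stolen at stail and begin the transfer.
    long begin_steal(StealCounters& c, long k) {
      return c.stail.fetch_add(k);          // old value = index of the first stolen task
    }
    // Thief, step 2: atomically account for the k tasks once the transfer completes.
    void finish_steal(StealCounters& c, long k) {
      c.itail.fetch_add(k);
    }
    // Worker, step 3: advance ctail when stail == itail, i.e. every initiated
    // steal has finished transferring; storage below ctail may then be reused.
    void try_advance_ctail(StealCounters& c) {
      long s = c.stail.load();
      if (s == c.itail.load()) c.ctail = s;
    }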

17 HPDC’12, Delft Retentive Work Stealing [Figure: per-processor queues (Proc 1 … Proc n) as seeded at the start of an iteration vs. the tasks each processor actually executed] Intuition: stealing indicates poor initial balance.

18 HPDC’12, Delft Retentive Work Stealing Active-message-based work stealing optimized for distributed memory. Exploits persistence across work stealing iterations: in each work stealing phase, track the tasks executed by this worker in this iteration, and seed this worker with those tasks for the next iteration.
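The retention step itself is small; a sketch (illustrative types and names, not the talk's implementation) of the bookkeeping described above: each worker records every task it actually executes during a stealing phase, and that record becomes the seed for its queue in the next iteration.

    #include <utility>
    #include <vector>

    struct Task { int id; };

    struct RetentiveWorker {
      std::vector<Task> seed;       // tasks this worker starts the iteration with
      std::vector<Task> executed;   // tasks it actually ran (seeded here or stolen)

      void record(const Task& t) {  // called as each task completes
        executed.push_back(t);
      }
      void retain() {               // called at the end of the work stealing phase
        seed = std::move(executed); // seed the next iteration with what was executed
        executed.clear();
      }
    };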

19 HPDC’12, Delft Experimental Setup Multi-threaded MPI; one core per node dedicated to active messages. “Flat” execution – each core is an independent worker.

                         No. nodes   Cores per node   Memory per node   Max cores used
    Hopper (Cray XE6)        6384          24              32 GB            146,400
    Intrepid (BG/P)         40960           4               4 GB            163,840
    Titan (Cray XK6)        18688          16              32 GB            298,592

20 HPDC’12, Delft Hartree-Fock Benchmark Basis for several electronic structure theories. Two-electron contribution. Schwarz screening: data-dependent sparsity screening at runtime. Tasks vary in size from milliseconds to seconds.

                      HF-Be512 (20)   HF-Be512 (40)
    Total tasks         2.2x10^10       1.4x10^9
    Non-null tasks      9.1x10^6        8.6x10^5
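Schwarz screening decides at run time whether a task's block of two-electron integrals can be skipped: by the Cauchy-Schwarz inequality, |(ij|kl)| <= sqrt((ij|ij)) * sqrt((kl|kl)), so if the product of the precomputed diagonal bounds falls below a threshold the block contributes negligibly and the task is null. A minimal sketch (names and threshold are illustrative, not the benchmark's actual code):

    // Precomputed per-shell-pair bounds: bound_ij = sqrt((ij|ij)), bound_kl = sqrt((kl|kl)).
    bool survives_schwarz(double bound_ij, double bound_kl, double threshold = 1e-10) {
      return bound_ij * bound_kl >= threshold;   // false => "null" task, skipped at runtime
    }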

21 HPDC’12, Delft Hopper: Performance Persistence-based load balancing “converges” faster. Retentive stealing also improves efficiency. Stealing is effective even with limited parallelism. [Plots: efficiency vs. core count and vs. avg. tasks per core, for persistence-based load balancing and retentive stealing]

22 HPDC’12, Delft Intrepid: Performance Much worse performance for the first iteration. Converges to a better efficiency than on Hopper. [Plots: efficiency vs. core count and vs. avg. tasks per core, for persistence-based load balancing and retentive stealing]

23 HPDC’12, Delft Titan: Performance Similar behavior to that on Intrepid. [Plots: efficiency vs. core count and vs. avg. tasks per core, for persistence-based load balancing and retentive stealing]

24 HPDC’12, Delft Intrepid: Num. Steals Retentive stealing stabilizes stealing costs. Similar trends on all systems. [Plots: number of attempted and successful steals vs. core count]

25 HPDC’12, Delft Utilization HF-Be256 on 9,600 cores of Hopper. Initial stealing has high costs during ramp-down; retentive stealing does a better job of reducing this cost. [Utilization-over-time plots: Steal (13.6 s), StealRet-final (12.6 s), PLB (12.2 s)]

26 HPDC’12, Delft Summary of Insights Retentive work stealing can scale – demonstrated on up to 163,840 cores of Intrepid, 146,400 cores of Hopper, and 128,000 cores of Titan. Retentive stealing and persistence-based load balancing perform comparably. Retentive stealing incrementally improves balance, and the number of steals does not grow substantially with scale. The greedy hierarchical persistence-based load balancer achieves load balance quality comparable to that of a centralized scheme (details in the paper).

