
1 Work Stealing and Persistence-based Load Balancers for Iterative Overdecomposed Applications Jonathan Lifflander, UIUC Sriram Krishnamoorthy, PNNL* Laxmikant Kale, UIUC HPDC 2012

2 Dynamic load balancing on 100,000 processor cores and beyond

3 HPDC’12, Delft Iterative Applications Applications repeatedly executing the same computation, with static or slowly evolving execution characteristics. Execution characteristics preclude static balancing: application characteristics (comm. pattern, sparsity, …) and execution environment (topology, asymmetry, …). Challenge: load-balancing such applications.

4 HPDC’12, Delft Overdecomposition Expose greater levels of concurrency than supported by the hardware. The middleware (runtime) dynamically maps the concurrent tasks to hardware resources. The abstraction supports continuous optimization and adaptation: improvements to load balancing, new metrics (power, energy, graceful degradation, …), and new features such as fault tolerance and power/energy-awareness.
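To make the idea concrete, here is a minimal sketch of overdecomposition in C++ (illustrative names only, not TASCEL's API): the iteration space is cut into many more tasks than there are workers, and mapping those tasks onto cores is left entirely to the runtime.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Task { std::size_t begin, end; };   // a chunk of a larger iteration space

    // Split n_iters units of work into roughly tasks_per_worker * n_workers tasks.
    std::vector<Task> overdecompose(std::size_t n_iters, std::size_t n_workers,
                                    std::size_t tasks_per_worker /* e.g. 16-64 */) {
      std::size_t n_tasks = n_workers * tasks_per_worker;
      std::size_t chunk   = (n_iters + n_tasks - 1) / n_tasks;
      std::vector<Task> tasks;
      for (std::size_t b = 0; b < n_iters; b += chunk)
        tasks.push_back({b, std::min(b + chunk, n_iters)});
      return tasks;   // the runtime, not the user, decides where these execute
    }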

5 HPDC’12, Delft Problem Statement Scalable load balancers for iterative overdecomposed applications. We consider two alternatives: persistence-based load balancing and work stealing. How do these algorithms behave at scale? How do they compare?

6 HPDC’12, Delft Related Work Overdecomposition is a widely used approach. Inspector-executor approaches employ start-time load balancers. Prior hierarchical load balancers typically do not consider localization. The scalability of work stealing is not well understood – the largest prior demonstration was on 8,192 cores. There has been no comparative evaluation of the two schemes.

7 HPDC’12, Delft TASCEL: Task Scheduling Library A runtime library for task-parallel programs. Manages task collections for execution on distributed-memory machines. Compatible with native MPI programs. Phase-based switching between SPMD and non-SPMD modes of execution.

8 HPDC’12, Delft TASCEL Execution Task: the basic unit of migratable execution. Typical workflow: create a task collection, seed it with one or more tasks, and process tasks in the collection until termination is detected. Processing of task collections manages concurrency, faults, …; trade-offs are exposed through implementation specializations: dynamic load balancing schemes, fault tolerance protocols, …
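The workflow on this slide can be mirrored by a small, single-process analogue (TASCEL itself targets distributed memory; the sketch below only reproduces the create/seed/process-until-termination structure, and every name in it is illustrative rather than the library's real API):

    #include <deque>
    #include <iostream>

    struct Task { int unit; };                      // basic unit of migratable execution

    int main() {
      std::deque<Task> collection;                  // create a task collection
      collection.push_back({0});                    // seed it with one or more tasks
      while (!collection.empty()) {                 // stand-in for termination detection
        Task t = collection.front(); collection.pop_front();
        std::cout << "executing task " << t.unit << "\n";
        if (t.unit < 3) collection.push_back({t.unit + 1});   // tasks may create new tasks
      }
      return 0;
    }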

9 HPDC’12, Delft Load Balancers Greedy localized hierarchical persistence-based load balancing Retentive work stealing

10-11 HPDC’12, Delft Greedy Localized Hierarchical Persistence-based LB [Figure, animated over two slides: tasks labeled 0-5 rebalanced across a processor hierarchy] Intuition: satisfy local imbalance first.
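As a sketch of what one greedy, persistence-based balancing step looks like (illustrative code, not the implementation from the talk): task times measured in the previous iteration are sorted in decreasing order and each task is assigned to the currently least-loaded processor. In the localized hierarchical scheme above, a step like this is applied within small local groups first, so most imbalance is satisfied locally and only the residue is exposed to higher levels.

    #include <algorithm>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    struct TaskLoad { int task_id; double time; };   // time measured in the last iteration

    // Greedy assignment within one group; assumes task_ids are 0..tasks.size()-1.
    // Returns owner[i] = processor (within this group) that task i is assigned to.
    std::vector<int> greedy_assign(std::vector<TaskLoad> tasks, int procs_in_group) {
      std::sort(tasks.begin(), tasks.end(),
                [](const TaskLoad& a, const TaskLoad& b) { return a.time > b.time; });
      using Slot = std::pair<double, int>;           // (accumulated load, processor)
      std::priority_queue<Slot, std::vector<Slot>, std::greater<Slot>> load;  // min-heap
      for (int p = 0; p < procs_in_group; ++p) load.push({0.0, p});
      std::vector<int> owner(tasks.size());
      for (const TaskLoad& t : tasks) {
        auto [l, p] = load.top(); load.pop();        // least-loaded processor so far
        owner[t.task_id] = p;
        load.push({l + t.time, p});
      }
      return owner;
    }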

12 HPDC’12, Delft Retentive Work Stealing [Figure: a local queue per processor (Proc 1, Proc 2, Proc 3, …, Proc n) drawing from a shared work pool]

13 HPDC’12, Delft Retentive Work Stealing [Figure: split task queue with head, split, and stail pointers dividing the local and remote (stealable) regions]

14 HPDC’12, Delft Retentive Work Stealing [Figure: split queue with head, split, and stail pointers over a buffer of locally executed tasks] addTask(): add a task to the local region. getTask(): remove a task from the local region.

15 HPDC’12, Delft Retentive Work Stealing [Figure: split queue with head, split, and stail pointers] releaseToShared(): move tasks to the shared portion. acquireFromShared(): move tasks to the local portion.
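The queue structure on slides 13-15 can be sketched as follows (a simplified, single-threaded shared-memory analogue; field names follow the slide labels, and wraparound and the synchronization from slide 16 are omitted). Indices grow from stail to head: [stail, split) is the shared region visible to thieves, [split, head) is the worker's private region.

    #include <algorithm>
    #include <vector>

    struct SplitQueue {
      std::vector<int> buf = std::vector<int>(1 << 20);  // task buffer; wraparound omitted
      long stail = 0, split = 0, head = 0;               // stail <= split <= head

      void addTask(int t) { buf[head++] = t; }           // add task to local region
      bool getTask(int& t) {                             // remove task from local region
        if (head == split && !acquireFromShared(1)) return false;
        t = buf[--head];
        return true;
      }
      void releaseToShared(long k) {                     // move tasks to shared portion
        split = std::min(split + k, head);
      }
      bool acquireFromShared(long k) {                   // move tasks back to local portion
        long take = std::min(k, split - stail);
        if (take <= 0) return false;
        split -= take;
        return true;
      }
    };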

16 HPDC’12, Delft Retentive Work Stealing [Figure: split queue with head, split, stail, itail, and ctail pointers] stail: beginning of tasks available to be stolen. itail: number of tasks that have finished transfer. ctail: past this marker it is safe to reuse the buffer. Steal protocol: 1. Mark tasks stolen at stail and begin transfer. 2. Atomically increment itail on completion of transfer. 3. The worker updates ctail when stail == itail.
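A sketch of the steal-completion bookkeeping above, using shared-memory atomics as a stand-in for the one-sided/active-message operations a distributed implementation would use (names follow the slide; bounds checks against the split pointer are omitted):

    #include <atomic>

    struct StealCounters {
      std::atomic<long> stail{0};   // start of tasks still available to be stolen
      std::atomic<long> itail{0};   // count of stolen tasks whose transfer has completed
      long ctail = 0;               // below this marker the buffer is safe to reuse
    };

    // Thief, step 1: mark k tasks stolen at stail and begin the transfer.
    long begin_steal(StealCounters& c, long k) {
      return c.stail.fetch_add(k);          // old value = index of the first stolen task
    }
    // Thief, step 2: atomically account for the k tasks once the transfer completes.
    void finish_steal(StealCounters& c, long k) {
      c.itail.fetch_add(k);
    }
    // Worker, step 3: advance ctail when stail == itail, i.e. every initiated
    // steal has finished transferring; storage below ctail may then be reused.
    void try_advance_ctail(StealCounters& c) {
      long s = c.stail.load();
      if (s == c.itail.load()) c.ctail = s;
    }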

17 HPDC’12, Delft Retentive Work Stealing [Figure: per-processor queues (Proc 1 … Proc n) as seeded at the start of an iteration vs. the tasks each processor actually executed] Intuition: stealing indicates poor initial balance.

18 HPDC’12, Delft Retentive Work Stealing Active-message-based work stealing optimized for distributed memory. Exploits persistence across work stealing iterations: in each work stealing phase, track the tasks executed by this worker in this iteration, and seed this worker with those tasks for the next iteration.
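The retention step itself is small; a sketch (illustrative types and names, not the talk's implementation) of the bookkeeping described above: each worker records every task it actually executes during a stealing phase, and that record becomes the seed for its queue in the next iteration.

    #include <utility>
    #include <vector>

    struct Task { int id; };

    struct RetentiveWorker {
      std::vector<Task> seed;       // tasks this worker starts the iteration with
      std::vector<Task> executed;   // tasks it actually ran (seeded here or stolen)

      void record(const Task& t) {  // called as each task completes
        executed.push_back(t);
      }
      void retain() {               // called at the end of the work stealing phase
        seed = std::move(executed); // seed the next iteration with what was executed
        executed.clear();
      }
    };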

19 HPDC’12, Delft Experimental Setup Multi-threaded MPI; one core per node dedicated to active messages. “Flat” execution – each core is an independent worker.

                         No. nodes   Cores per node   Memory per node   Max cores used
    Hopper (Cray XE6)        6384          24              32 GB            146,400
    Intrepid (BG/P)         40960           4               4 GB            163,840
    Titan (Cray XK6)        18688          16              32 GB            298,592

20 HPDC’12, Delft Hartree-Fock Benchmark Basis for several electronic structure theories. Two-electron contribution. Schwarz screening: data-dependent sparsity screening at runtime. Tasks vary in size from milliseconds to seconds.

                      HF-Be512 (20)   HF-Be512 (40)
    Total tasks         2.2x10^10       1.4x10^9
    Non-null tasks      9.1x10^6        8.6x10^5
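Schwarz screening decides at run time whether a task's block of two-electron integrals can be skipped: by the Cauchy-Schwarz inequality, |(ij|kl)| <= sqrt((ij|ij)) * sqrt((kl|kl)), so if the product of the precomputed diagonal bounds falls below a threshold the block contributes negligibly and the task is null. A minimal sketch (names and threshold are illustrative, not the benchmark's actual code):

    // Precomputed per-shell-pair bounds: bound_ij = sqrt((ij|ij)), bound_kl = sqrt((kl|kl)).
    bool survives_schwarz(double bound_ij, double bound_kl, double threshold = 1e-10) {
      return bound_ij * bound_kl >= threshold;   // false => "null" task, skipped at runtime
    }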

21 HPDC’12, Delft Hopper: Performance Persistence-based load balancing “converges” faster. Retentive stealing also improves efficiency. Stealing is effective even with limited parallelism. [Plots: efficiency vs. core count and vs. avg. tasks per core, for persistence-based load balancing and retentive stealing]

22 HPDC’12, Delft Intrepid: Performance Much worse performance for the first iteration. Converges to a better efficiency than on Hopper. [Plots: efficiency vs. core count and vs. avg. tasks per core, for persistence-based load balancing and retentive stealing]

23 HPDC’12, Delft Titan: Performance Similar behavior to that on Intrepid. [Plots: efficiency vs. core count and vs. avg. tasks per core, for persistence-based load balancing and retentive stealing]

24 HPDC’12, Delft Intrepid: Num. Steals Retentive stealing stabilizes stealing costs. Similar trends on all systems. [Plots: number of attempted and successful steals vs. core count]

25 HPDC’12, Delft Utilization HF-Be256 on 9,600 cores of Hopper. Initial stealing has high costs during ramp-down; retentive stealing does a better job of reducing this cost. [Utilization-over-time plots: Steal (13.6 s), StealRet-final (12.6 s), PLB (12.2 s)]

26 HPDC’12, Delft Summary of Insights Retentive work stealing can scale – demonstrated on up to 163,840 cores of Intrepid, 146,400 cores of Hopper, and 128,000 cores of Titan. Retentive stealing and persistence-based load balancing perform comparably. Retentive stealing incrementally improves balance, and the number of steals does not grow substantially with scale. The greedy hierarchical persistence-based load balancer achieves load balance quality comparable to that of a centralized scheme (details in the paper).

