
1 Charm++ Load Balancing Framework
Gengbin Zheng (gzheng@uiuc.edu)
Parallel Programming Laboratory, Department of Computer Science
University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu

2 Motivation
Irregular or dynamic applications
- Initial static load balancing
- Application behaviors change dynamically
- Difficult to implement with good parallel efficiency
Versatile, automatic load balancers
- Application independent
- Little or no user effort is needed for load balancing
- Based on Charm++ and Adaptive MPI

3 Parallel Objects, Adaptive Runtime System, Libraries and Tools
(overview diagram of applications built on the runtime: Molecular Dynamics, Computational Cosmology, Rocket Simulation, Protein Folding, Quantum Chemistry (QM/MM), Crack Propagation, Dendritic Growth, Space-time Meshes)

4 Load Balancing in Charm++
Viewing an application as a collection of communicating objects
Object migration as the mechanism for adjusting load
Measurement-based strategy
- Principle of persistent computation and communication structure
- Instrument CPU usage and communication
Overloaded vs. underloaded processors

5 Load Balancing - Graph Partitioning
(diagram: the Charm++ mapping of objects onto PEs, viewed by the load balancer as a weighted object graph to be partitioned)
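The weighted object graph can be pictured with a small data structure. The following sketch is illustrative only (the type and field names are not the Charm++ LB database API); it shows the information a strategy works from: per-object CPU load on the vertices, communication volume on the edges, and the current object-to-PE mapping.

    #include <vector>

    // Illustrative weighted object graph (not the actual Charm++ LB database API).
    struct ObjVertex {
      int    objId;      // which migratable object
      int    currentPE;  // processor the object currently lives on
      double cpuLoad;    // CPU time measured during the instrumented period
    };

    struct CommEdge {
      int  fromObj, toObj;  // endpoints in the object graph
      long numMessages;     // measured message count between the two objects
      long numBytes;        // measured bytes exchanged
    };

    struct ObjGraph {
      std::vector<ObjVertex> vertices;
      std::vector<CommEdge>  edges;
    };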

6 Load Balancing Framework
(diagram of the LB framework)

7 Centralized vs. Distributed Load Balancing
Centralized:
- Object load data are sent to processor 0
- Integrated into a complete object graph
- Migration decisions are broadcast from processor 0
- Requires a global barrier
Distributed:
- Load balancing among neighboring processors
- Builds a partial object graph
- Migration decisions are sent only to neighbors
- No global barrier

8 Load Balancing Strategies

9 Strategy Example - GreedyCommLB
Greedy algorithm (see the sketch below)
- Put the heaviest object on the most underloaded processor
- Object load is its CPU load plus its communication cost
- Communication cost is computed as α + βm
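A minimal sketch of the greedy idea described above, not the actual GreedyCommLB source: each object's load is assumed to already include its CPU load plus its communication cost (alpha per message plus beta per byte, i.e. α + βm), and the heaviest remaining object is repeatedly placed on the currently least-loaded processor.

    #include <algorithm>
    #include <queue>
    #include <utility>
    #include <vector>

    // Greedy placement sketch: returns a new object-to-processor mapping.
    std::vector<int> greedyAssign(const std::vector<double> &objLoad, int numPEs) {
      // Visit objects from heaviest to lightest.
      std::vector<int> order(objLoad.size());
      for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
      std::sort(order.begin(), order.end(),
                [&](int a, int b) { return objLoad[a] > objLoad[b]; });

      // Min-heap of (current processor load, processor id).
      using PE = std::pair<double, int>;
      std::priority_queue<PE, std::vector<PE>, std::greater<PE>> pes;
      for (int p = 0; p < numPEs; ++p) pes.push({0.0, p});

      std::vector<int> newPE(objLoad.size());
      for (int obj : order) {
        PE lightest = pes.top();          // the most underloaded processor so far
        pes.pop();
        newPE[obj] = lightest.second;     // place the heaviest remaining object there
        lightest.first += objLoad[obj];
        pes.push(lightest);
      }
      return newPE;
    }

The real strategy also re-evaluates the communication term against where an object's partners have already been placed, so the choice of processor balances CPU load against keeping communicating objects close.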

10-12 Strategy Example - GreedyCommLB
(figure-only slides continuing the GreedyCommLB example step by step)

13 Comparison of Strategies
Jacobi1D program with 2048 chares on 64 PEs and 10240 chares on 1024 PEs

                    64 processors                  1024 processors
                    Min load  Max load  Ave load   Min load  Max load  Ave load
(no LB)             13.952    15.505    14.388     42.801    45.971    44.784
GreedyRefLB         14.104    14.589    14.351     43.585    45.195    44.777
GreedyCommLB        13.748    14.396    14.025     40.519    46.922    43.777
RecBisectBfLB       11.701    13.771    12.709     35.907    48.889    43.953
MetisLB             14.061    14.506    14.341     41.477    48.077    44.772
RefineLB            14.043    14.977    14.388     42.801    45.971    44.783
RefineCommLB        14.015    15.176    14.388     42.801    45.971    44.783
OrbLB               11.350    12.414    11.891     31.269    44.940    38.200

14 Comparison of Strategies
NAMD ATPase benchmark, 327506 atoms
Number of chares: 31811, migratable: 31107

                    1000 processors
                    Min load  Max load  Ave load
(no LB)             0         0.354490  0.197485
GreedyLB            0.190424  0.244135  0.197485
GreedyRefLB         0.191403  0.201179  0.197485
GreedyCommLB        0.197262  0.198238  0.197485
RefineLB            0.193369  0.200194  0.197485
RefineCommLB        0.193369  0.200194  0.197485
OrbLB               0.179689  0.220700  0.197485

15 User Interfaces
Fully automatic load balancing
- Nothing needs to be changed in application code
- Load balancing happens periodically and transparently
- +LBPeriod controls the load balancing interval
User-controlled load balancing (see the sketch below)
- Insert AtSync() calls at places ready for load balancing (a hint)
- The LB passes control back to ResumeFromSync() after migration finishes
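A minimal sketch of the user-controlled path. It assumes a 1D chare array named Worker declared in the .ci file with doStep() as an entry method; the class name, doStep(), computeOneTimestep(), and the period of 100 steps are illustrative, while AtSync(), ResumeFromSync(), and usesAtSync are the Charm++ hooks the slide refers to.

    // User-controlled load balancing for an array element (illustrative names).
    class Worker : public CBase_Worker {
      int step;
    public:
      Worker() : step(0) {
        usesAtSync = true;               // this chare participates in AtSync load balancing
      }
      Worker(CkMigrateMessage *m) {}     // migration constructor

      void doStep() {
        computeOneTimestep();            // hypothetical application work
        ++step;
        if (step % 100 == 0)
          AtSync();                      // hint: this chare is ready for load balancing
        else
          thisProxy[thisIndex].doStep(); // otherwise go straight to the next timestep
      }

      void ResumeFromSync() {            // the runtime calls this after migration finishes
        thisProxy[thisIndex].doStep();
      }

    private:
      void computeOneTimestep() { /* ... */ }
    };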

16 Migrating Objects
Moving data
- The runtime packs object data into a message and sends it to the destination
- The runtime unpacks the data and recreates the object
- The user needs to write a pup function for packing/unpacking object data (sketched below)
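A minimal pup routine sketch for the same illustrative Worker element; the member names and sizes are made up, while PUP::er, PUParray, and isUnpacking() are the standard Charm++ pup interface. The same routine runs on both sides of a migration: it writes on the source processor and reads (after allocating) on the destination.

    class Worker : public CBase_Worker {
      int     n;      // number of local elements (illustrative state)
      double *data;   // dynamically allocated state that must migrate with the object
    public:
      void pup(PUP::er &p) {
        CBase_Worker::pup(p);        // pup the superclass state first
        p | n;
        if (p.isUnpacking())         // destination side: allocate before reading
          data = new double[n];
        PUParray(p, data, n);        // pack or unpack the array contents
      }
    };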

17 Compiler Interface
Link-time options
- -module: link load balancers as modules
- Multiple modules can be linked into the binary
Runtime options
- +balancer: choose which load balancer to invoke
- Multiple load balancers can be given, e.g.
  +balancer GreedyCommLB +balancer RefineLB
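An illustrative link-and-run pair of commands combining the options above; the program name jacobi, the object file, and the processor count are placeholders, and the balancer names are the ones from the slide.

    charmc -o jacobi jacobi.o -module GreedyCommLB -module RefineLB
    ./charmrun +p64 ./jacobi +balancer GreedyCommLB +balancer RefineLB +LBPeriod 10.0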

18 NAMD Case Study
Molecular dynamics
- Atoms move slowly
- Initial load balancing can be as simple as round-robin
- Load balancing is only needed once in a while, typically once every thousand steps
- Greedy balancer followed by a Refine strategy

19 Load Balancing Steps
(timeline: regular timesteps, instrumented timesteps, detailed/aggressive load balancing, refinement load balancing)

20 Processor Utilization against Time on (a) 128 and (b) 1024 Processors
On 128 processors, a single load balancing step suffices, but on 1024 processors we need a "refinement" step.
(plots annotated with: Load Balancing, Aggressive Load Balancing, Refinement Load Balancing)

21 Processor Utilization across Processors after (a) Greedy Load Balancing and (b) Refining
Note that the underloaded processors are left underloaded (as they don't impact performance); refinement deals only with the overloaded ones.
(figure callout: some overloaded processors)

22 Profile view of a 3000-processor run of NAMD (white shows idle time)

23 Load Balancing Research with Blue Gene
Centralized load balancer
- Bottleneck for communication on processor 0
- Memory constraint
Fully distributed load balancer
- Neighborhood balancing
- Without global load information
Hierarchical distributed load balancer (sketched below)
- Divide into processor groups
- Different strategies at each level
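The hierarchical idea can be pictured with a simple grouping rule. The sketch below is only an illustration of the concept, not the actual Charm++ hierarchical balancer; the group size, type, and function names are invented. Each group of consecutive processors balances its own objects with a detailed strategy, and only aggregate group summaries travel upward, so no single processor has to hold the global object graph.

    #include <vector>

    const int GROUP_SIZE = 512;  // assumed processors per group

    struct GroupSummary {
      int    groupId;
      double totalLoad;    // sum of the group's measured object loads
      int    numObjects;
    };

    int groupOf(int pe)       { return pe / GROUP_SIZE; }          // which group a PE belongs to
    int groupLeader(int grp)  { return grp * GROUP_SIZE; }         // first PE in the group acts as leader

    // At the top level only one summary per group is compared, instead of
    // per-object data from every processor.
    double averageGroupLoad(const std::vector<GroupSummary> &groups) {
      double total = 0.0;
      for (const GroupSummary &g : groups) total += g.totalLoad;
      return groups.empty() ? 0.0 : total / groups.size();
    }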

