Adaptive MPI
Milind A. Bhandarkar (milind@cs.uiuc.edu)


1 Adaptive MPI
Milind A. Bhandarkar (milind@cs.uiuc.edu)

2 Motivation
Many CSE applications exhibit dynamic behavior
  Adaptive mesh refinement (AMR)
  Pressure-driven crack propagation in solid propellants
Also, non-dedicated supercomputing platforms such as clusters affect processor availability
These factors cause severe load imbalance
Can the parallel language / runtime system help?

3 Load Imbalance in Crack-propagation Application
(More and more cohesive elements activated between iterations 35 to 40.)

4 Load Imbalance in an AMR Application
(Mesh is refined at 25th time step)

5 Multi-partition Decomposition
Basic idea:
  Decompose the problem into a number of partitions, independent of the number of processors
  # partitions > # processors
  The system maps partitions to processors
  The system maps and re-maps objects as needed
  Re-mapping strategies help adapt to dynamic variations
To make this work, need:
  Load balancing framework and runtime support
But isn’t there a high overhead of multi-partitioning?
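The re-mapping idea on this slide can be illustrated with a small sketch. It assumes a simple greedy heuristic (heaviest partition first, always onto the least-loaded processor); the slides do not say which strategies the load balancing framework actually uses, and the names `greedy_map` and `max_proc_load` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of Charm++ or AMPI): assign each partition
// to the currently least-loaded processor, visiting partitions from
// heaviest to lightest. Returns owner[i] = processor index of partition i.
std::vector<int> greedy_map(const std::vector<double>& load, int num_procs) {
    assert(num_procs > 0);
    std::vector<std::size_t> order(load.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return load[a] > load[b]; });

    std::vector<double> proc_load(num_procs, 0.0);
    std::vector<int> owner(load.size(), -1);
    for (std::size_t p : order) {
        // min_element returns the first minimum, so ties go to the lowest rank.
        int target = static_cast<int>(
            std::min_element(proc_load.begin(), proc_load.end()) - proc_load.begin());
        owner[p] = target;
        proc_load[target] += load[p];
    }
    return owner;
}

// Largest total load on any processor under a given mapping.
double max_proc_load(const std::vector<double>& load,
                     const std::vector<int>& owner, int num_procs) {
    assert(num_procs > 0 && owner.size() == load.size());
    std::vector<double> proc_load(num_procs, 0.0);
    for (std::size_t i = 0; i < load.size(); ++i) proc_load[owner[i]] += load[i];
    return *std::max_element(proc_load.begin(), proc_load.end());
}
```

With eight partitions on two processors, where one partition has grown to five times the cost of the others, a fixed block mapping leaves one processor with a load of 8 and the other with 4. Re-mapping with the greedy rule gives both a load of 6. This only works because there are more partitions than processors, so the system has units small enough to move.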

6 “Overhead” of Multi-partition Decomposition
(Crack Propagation code, with 70k elements)

7 Charm++
Supports data-driven objects
  Singleton objects, object arrays, groups, ...
  Many objects per processor, with method execution scheduled with availability of data
Supports object migration, with automatic forwarding
Excellent results with highly irregular and dynamic applications
  Molecular dynamics application NAMD: speedup of 1250 on 2000 processors of ASCI-red
  Brunner, Phillips, & Kale. “Scalable molecular dynamics”, Gordon Bell finalist, SC2000.

8 Charm++: System Mapped Objects
(Figure: objects numbered 1 to 13, drawn in two arrangements.)

9 Load Balancing Framework

10 However…
Many CSE applications are written in Fortran, MPI
Conversion to a parallel object-based language such as Charm++ is cumbersome
  Message-driven style requires split-phase transactions
  Often results in a complete rewrite
How to convert existing MPI applications without extensive rewrite?

11 Solution:
Each partition implemented as a user-level thread associated with a message-driven object
Communication library for these threads same in syntax and semantics as MPI
But what about the overheads associated with threads?

12 AMPI Threads Vs MD Objects (1D Decomposition)

13 AMPI Threads Vs MD Objects (3D Decomposition)

14 Thread Migration
Thread stacks may contain references to local variables
  May not be valid upon migration to a different address space
Solution: thread stacks should span the same virtual address space on any processor where they may migrate (Isomalloc)
  Split the virtual space into per-processor allocation pools
Scalability issues
  Not important on 64-bit processors
  Constrained load balancing (limit the thread’s migratability to fewer processors)
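The address-space split can be sketched as plain arithmetic. This is only an illustration of the idea, not the Isomalloc implementation: the helper `slot_for`, the base address, and the region sizes are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstdint>

// A half-open range [begin, end) of virtual addresses.
struct Range {
    std::uint64_t begin;
    std::uint64_t end;
};

// Hypothetical helper: split the region [base, base + size) into equal
// slots, one per processor, and return the slot that processor `rank`
// allocates thread stacks from. Every processor computes the same layout,
// so addresses inside rank r's slot stay unused on all other processors.
Range slot_for(std::uint64_t base, std::uint64_t size,
               std::uint64_t num_procs, std::uint64_t rank) {
    assert(num_procs > 0 && rank < num_procs);
    std::uint64_t slot = size / num_procs;
    return Range{base + rank * slot, base + (rank + 1) * slot};
}
```

Because a stack created in one processor's slot can be copied to the same addresses on any other processor, pointers into it remain valid after migration. The sizes also show the scalability concern: an assumed 1 GiB region shared by 1024 processors leaves 1 MiB per processor, while an assumed 1 TiB region on a 64-bit machine leaves 1 GiB each.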

15 AMPI Issues: Thread-safety
Multiple threads mapped to each processor
  “Process data” to be localized
  Make them instance variables of a “class”
  All subroutines become instance methods of this class
AMPIzer: a source-to-source translator
  Based on Polaris front-end
  Recognizes all global variables
  Puts them in a thread-private area
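The transformation described on this slide can be sketched by hand in C++. The class `Chunk` and its members are hypothetical; the slide describes AMPIzer applying the same kind of rewrite to existing sources, and this is not its actual output.

```cpp
#include <cassert>

// Before (unsafe once several threads share one process, because every
// thread would update the same variables):
//   int    step_count = 0;
//   double total      = 0.0;
//   void advance(double dt) { ++step_count; total += dt; }

// After: the former globals become instance variables and the subroutine
// becomes an instance method, so each thread works on its own copy.
class Chunk {
public:
    void advance(double dt) {
        assert(dt >= 0.0);  // precondition assumed for this example only
        ++step_count_;
        total_ += dt;
    }
    int step_count() const { return step_count_; }
    double total() const { return total_; }

private:
    int step_count_ = 0;
    double total_ = 0.0;
};
```

Two `Chunk` instances on the same processor can now advance independently, which is what running multiple threads per processor requires.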

16 AMPI Issues: Data Migration
Thread-private data needs to be migrated with the thread
Developer has to write subroutines for packing and unpacking data
  Writing separate subroutines is error-prone
Puppers (“pup” = “pack-unpack”)
  A single subroutine to “show” the data to the runtime system
  Fortran90 generic procedures make writing the pupper easy
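A minimal sketch of the pupper idea follows. The `Pupper` class here is a hypothetical stand-in written for this example, not the interface that Charm++ or AMPI provides, and it assumes plain, trivially copyable fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the runtime's pack/unpack machinery.
class Pupper {
public:
    enum Mode { Sizing, Packing, Unpacking };

    explicit Pupper(Mode mode, std::vector<char>* buffer = nullptr)
        : mode_(mode), buffer_(buffer) {}

    // "Show" n bytes at p to the runtime; the mode decides what happens.
    void bytes(void* p, std::size_t n) {
        if (n == 0) return;
        if (mode_ == Packing) {
            const char* src = static_cast<const char*>(p);
            buffer_->insert(buffer_->end(), src, src + n);
        } else if (mode_ == Unpacking) {
            assert(offset_ + n <= buffer_->size());
            std::memcpy(p, buffer_->data() + offset_, n);
        }
        offset_ += n;  // in Sizing mode this is the only effect
    }

    template <typename T>
    void value(T& v) { bytes(&v, sizeof(T)); }

    bool unpacking() const { return mode_ == Unpacking; }
    std::size_t size() const { return offset_; }

private:
    Mode mode_;
    std::vector<char>* buffer_;
    std::size_t offset_ = 0;
};

// Example thread-private data with one routine that describes it.
struct ChunkData {
    int step = 0;
    std::vector<double> temperature;

    void pup(Pupper& p) {
        p.value(step);
        std::size_t n = temperature.size();
        p.value(n);                               // overwritten when unpacking
        if (p.unpacking()) temperature.resize(n);
        p.bytes(temperature.data(), n * sizeof(double));
    }
};

// Pack: one sizing pass to reserve space, then one packing pass.
std::vector<char> pack(ChunkData& d) {
    Pupper sizer(Pupper::Sizing);
    d.pup(sizer);
    std::vector<char> buf;
    buf.reserve(sizer.size());
    Pupper packer(Pupper::Packing, &buf);
    d.pup(packer);
    return buf;
}

// Unpack: the same pup routine rebuilds the data from the buffer.
ChunkData unpack(std::vector<char>& buf) {
    ChunkData d;
    Pupper unpacker(Pupper::Unpacking, &buf);
    d.pup(unpacker);
    return d;
}
```

Because sizing, packing, and unpacking all walk the same `pup` routine, the field order cannot drift between a separate pack routine and unpack routine, which is the error the slide warns about.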

17 AMPI: Other Features
Automatic checkpoint and restart
  On different number of processors
    Number of chunks remains the same, but can be mapped to a different number of processors
  No additional work is needed
    Same pupper used for migration is also used for checkpointing and restart
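Restarting on a different processor count amounts to re-running the chunk-to-processor mapping with a new denominator. The round-robin rule below is an assumption made for illustration; the slides do not state how AMPI places chunks on restart, and `chunks_on` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Hypothetical placement rule: chunk c is restored on processor
// c % num_procs. Returns the chunks that land on processor `rank`.
std::vector<int> chunks_on(int rank, int num_chunks, int num_procs) {
    assert(num_procs > 0 && rank >= 0 && rank < num_procs);
    std::vector<int> mine;
    for (int c = rank; c < num_chunks; c += num_procs) mine.push_back(c);
    return mine;
}
```

A checkpoint of 8 chunks taken on 4 processors puts chunks 1 and 5 on processor 1. Restarted on 2 processors, the same 8 chunks are all still present, and processor 1 now holds chunks 1, 3, 5, and 7.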

18 Adaptive Multi-MPI
Integration of multiple MPI-based modules
  Example: integrated rocket simulation
    ROCFLO, ROCSOLID, ROCBURN, ROCFACE
Each module gets its own MPI_COMM_WORLD
  All MPI_COMM_WORLDs form MPI_COMM_UNIVERSE
Point-to-point communication between different MPI_COMM_WORLDs using the same AMPI functions
Communication across modules is also considered while balancing load

19 Experimental Results

20 AMR Application With Load Balancing
(Load balancer is activated at time steps 20, 40, 60, and 80.)

21 AMPI Load Balancing on Heterogeneous Clusters
(Experiment carried out on a cluster of Linux™ workstations.)

22 AMPI Vs MPI
(This is a scaled problem.)

23 AMPI “Overhead”

24 AMPI: Status
Over 70 commonly used functions from MPI 1.1
  All point-to-point communication functions
  All collective communication functions
  User-defined MPI data types
C, C++, and Fortran (77/90) bindings
Tested on Origin 2000, IBM SP, Linux and Solaris clusters
Should run on any platform supported by Charm++ that has mmap

