Performance Evaluation of Adaptive MPI


1 Performance Evaluation of Adaptive MPI
Chao Huang (1), Gengbin Zheng (1), Sameer Kumar (2), Laxmikant Kale (1)
(1) University of Illinois at Urbana-Champaign
(2) IBM T. J. Watson Research Center
PPoPP 06

2 Motivation
Challenges
- Applications with a dynamic nature: shifting workload, adaptive refinement, etc.
- Traditional MPI implementations have limited support for such dynamic applications
Adaptive MPI
- Virtual processes (VPs) implemented via migratable objects
- A powerful run-time system that offers various novel features and performance benefits
- A mature implementation that is being used in real-world applications

3 Outline
- Motivation
- Design and Implementation
- Features and Benefits
  - Adaptive Overlapping
  - Automatic Load Balancing
  - Communication Optimizations
  - Flexibility and Overhead
- Conclusion

4 Processor Virtualization
Basic idea of processor virtualization
- The user specifies the interaction between objects (VPs)
- The RTS maps VPs onto physical processors
- Typically the number of VPs >> P, to allow for various optimizations
[Figure: user view of VPs vs. the system implementation on physical processors]
Before going into the details of Adaptive MPI, we introduce the concept of processor virtualization. This idea is realized in Charm++, an object-based parallel programming framework developed by our group. Among its advantages:
- Run a program on any given number of processors
- Adaptive overlapping of computation and communication
- Automatic load balancing, shrink/expand, etc.

5 AMPI: MPI with Virtualization
- Each AMPI virtual process is implemented by a user-level thread embedded in a migratable object
- The RTS takes advantage of the freedom in mapping VPs onto processors
[Figure: MPI "processes" (VPs) mapped onto real processors]
The threads are lightweight, with a context-switch time of about 1 microsecond. They are used, for example, to block a process; the Charm++ object communicates with other Charm++ objects on behalf of its thread, which is how point-to-point message passing is implemented.
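To make the model concrete, here is a minimal sketch of what an AMPI program looks like: it is ordinary MPI code, and only the launch changes. The file name, the VP and processor counts, and the exact charmrun/+vp flags are illustrative assumptions based on common AMPI usage and may differ across Charm++/AMPI versions.

```c
/* hello_ampi.c -- an ordinary MPI program; under AMPI each "rank" is a
 * user-level thread (VP), so MPI_Comm_size reports the number of VPs,
 * which can exceed the number of physical processors. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nvp;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nvp);   /* number of VPs, not processors */
    printf("VP %d of %d\n", rank, nvp);
    MPI_Finalize();
    return 0;
}

/* Typical AMPI-style launch (flags may differ between versions):
 *   charmrun ./hello_ampi +p4 +vp16     # 16 VPs on 4 physical processors */
```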

6 Outline
- Motivation
- Design and Implementation
- Features and Benefits
  - Adaptive Overlapping
  - Automatic Load Balancing
  - Communication Optimizations
  - Flexibility and Overhead
- Conclusion

7 Adaptive Overlap
Problem: gap between completion time and CPU overhead
Solution: overlap communication with computation
[Figure: completion time and CPU overhead of a 2-way ping-pong program on the Turing (Apple G5) cluster]

8 Adaptive Overlap
[Figure: timeline of a 3D stencil calculation with different VP/P ratios: 1 VP/P, 2 VP/P, 4 VP/P]
These three graphs show timelines of a 3D stencil calculation. The first graph shows the timeline without virtualization; with 2 or 4 VPs per processor, one VP's communication overlaps with another VP's computation. Sometimes, when VP/P is not large enough, the overhead remains the dominant factor.
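The kind of code that benefits from this overlap looks like the sketch below. This is not the benchmark code from the talk, only an illustration of a 1D-decomposed, stencil-style halo exchange with invented names; the point is that when this VP blocks in MPI_Waitall, the AMPI runtime can switch to another VP on the same processor.

```c
/* One iteration of a halo exchange for a simple 3-point stencil.
 * Standard MPI calls only; under AMPI, blocking in MPI_Waitall yields this
 * user-level thread so another VP on the processor can compute. */
#include <mpi.h>

void halo_iteration(double *u, double *unew, int n, int left, int right)
{
    MPI_Request reqs[4];
    double ghost_l, ghost_r;

    /* Post receives and sends for the two ghost cells. */
    MPI_Irecv(&ghost_l, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&ghost_r, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&u[0],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&u[n - 1], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

    /* Interior points need no ghost data: compute them first. */
    for (int i = 1; i < n - 1; i++)
        unew[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0;

    /* Blocking here suspends only this VP; others can run meanwhile. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    unew[0]     = (ghost_l + u[0] + u[1]) / 3.0;
    unew[n - 1] = (u[n - 2] + u[n - 1] + ghost_r) / 3.0;
}
```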

9 Automatic Load Balancing
Challenge
- Dynamically varying applications
- Load imbalance impacts overall performance
Solution
- Measurement-based load balancing
  - Scientific applications are typically iteration-based, and the principle of persistence holds
  - The RTS collects CPU and network usage of the VPs
- Load balancing by migrating threads (VPs); a sketch follows below
  - Threads can be packed and shipped as needed; heaps and stacks are migrated with the threads
- Load is shifted dynamically, with different variations of load balancing strategies: centralized, distributed, aggressive, refinement-based, random, etc.
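In AMPI, the application typically exposes load-balancing points by calling a migrate routine at iteration boundaries, where the thread's stack is shallow. The sketch below assumes the no-argument MPI_Migrate() extension described in the AMPI papers (declared by AMPI's mpi.h); newer AMPI releases spell it AMPI_Migrate() and take an MPI_Info argument. LB_PERIOD and compute_step() are placeholders.

```c
/* Sketch of a typical AMPI main loop with a periodic load-balancing point.
 * At MPI_Migrate(), the RTS may pack this thread (stack and heap) and move
 * it to another processor based on measured CPU and network load. */
#include <mpi.h>

#define LB_PERIOD 50            /* invoke the load balancer every 50 steps */

void compute_step(void);        /* application-specific work (placeholder) */

void run(int nsteps)
{
    for (int step = 0; step < nsteps; step++) {
        compute_step();
        if (step % LB_PERIOD == 0)
            MPI_Migrate();      /* AMPI extension: allow this VP to migrate */
    }
}
```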

10 Automatic Load Balancing
Application: Fractography3D
- Models fracture propagation in a material

11 Automatic Load Balancing
A wave propagates through a bar of material, making it go from elastic to plastic. With load balancing (the zigzag curve in the figure), the run finishes early: 16 hours vs. 22 hours.
[Figure: CPU utilization of Fractography3D without vs. with load balancing]

12 Communication Optimizations
The AMPI run-time has the capability of
- Observing communication patterns
- Applying communication optimizations accordingly
- Switching between communication algorithms automatically
Examples
- Streaming strategy for point-to-point communication
- Optimizations for collectives

13 Streaming Strategy
- Combines short messages to reduce per-message overhead (a sketch of the targeted communication pattern follows)
[Figure: streaming strategy for point-to-point communication on the NCSA IA-64 cluster]
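For illustration, the pattern the streaming strategy targets looks like the loop below: many tiny point-to-point messages whose per-message overhead dominates the cost. The combining itself happens inside the AMPI runtime rather than in user code; the function and variable names here are made up.

```c
/* Fine-grained point-to-point traffic as a streaming strategy would see it.
 * The runtime can batch messages headed to the same processor and flush
 * them as one larger message, amortizing the per-message overhead. */
#include <mpi.h>

void send_updates(const double *vals, const int *dest, int count, int tag)
{
    for (int i = 0; i < count; i++)
        MPI_Send(&vals[i], 1, MPI_DOUBLE, dest[i], tag, MPI_COMM_WORLD);
}
```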

14 Optimizing Collectives
- A number of optimizations have been developed to improve collective communication performance
- An asynchronous collective interface allows higher CPU utilization during collectives: computation is only a small proportion of the elapsed time, so the rest can be overlapped (see the sketch below)
[Figure: time breakdown of an all-to-all operation using the Mesh library]
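A hedged sketch of how an asynchronous collective enables this overlap is shown below. AMPI exposed a non-blocking all-to-all of this shape before it was standardized; the call is written here with the MPI-3 spelling (MPI_Ialltoall), which may differ from the exact AMPI-era name, and do_independent_work() is a placeholder.

```c
/* Overlapping computation with an all-to-all. Buffer allocation is assumed
 * to be done by the caller; do_independent_work() stands in for work that
 * does not depend on the exchanged data. */
#include <mpi.h>

void do_independent_work(void);   /* application-specific (placeholder) */

void exchange_and_work(double *sendbuf, double *recvbuf, int n_per_rank)
{
    MPI_Request req;

    /* Start the collective: only the small CPU portion is paid here. */
    MPI_Ialltoall(sendbuf, n_per_rank, MPI_DOUBLE,
                  recvbuf, n_per_rank, MPI_DOUBLE,
                  MPI_COMM_WORLD, &req);

    do_independent_work();        /* overlapped with the network transfer */

    /* Complete the collective before touching recvbuf. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```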

15 Virtualization Overhead
- Compared with the performance benefits, the overhead is very small
- It is usually offset by the caching effect alone
- Performance is better when the features above are applied, unless the application is very fine-grained
[Figure: performance of point-to-point communication on the NCSA IA-64 cluster]

16 Flexibility
- Running on an arbitrary number of processors
- Runs with a specific number of MPI processes
- Big runs on a few processors
[Figure: 3D stencil calculation of size 240^3 run on Lemieux]

17 Outline
- Motivation
- Design and Implementation
- Features and Benefits
  - Adaptive Overlapping
  - Automatic Load Balancing
  - Communication Optimizations
  - Flexibility and Overhead
- Conclusion

18 Conclusion
Adaptive MPI provides the following benefits
- Adaptive overlap
- Automatic load balancing
- Communication optimizations
- Flexibility
- Automatic checkpoint/restart mechanism
- Shrink/expand
AMPI is being used in real-world parallel applications and frameworks
- Rocket simulation at CSAR
- FEM framework
AMPI is a mature system, portable to a variety of HPC platforms

19 Future Work
Performance improvement
- Reducing overhead
- Intelligent communication strategy substitution
- Machine-topology-specific load balancing
Performance analysis
- More direct support for AMPI programs

20 Thank You!
AMPI is available for download at: http://charm.cs.uiuc.edu/
Parallel Programming Lab, University of Illinois

