Bronis R. de Supinski and Jeffrey S. Vetter Center for Applied Scientific Computing August 15, 2000 Umpire: Making MPI Programs Safe.


1 Umpire: Making MPI Programs Safe
Bronis R. de Supinski and Jeffrey S. Vetter
Center for Applied Scientific Computing
August 15, 2000

2 Umpire
- Writing correct MPI programs is hard
- Unsafe or erroneous MPI programs
  - Deadlock
  - Resource errors
- Umpire
  - Automatically detect MPI programming errors
  - Dynamic software testing
  - Shared memory implementation

3 Umpire Architecture
[Diagram: an MPI application with Tasks 0 through N-1 runs on the MPI runtime system. Umpire interposes on each task using the MPI profiling layer, and the tasks send transactions via shared memory to the Umpire Manager, which runs the verification algorithms.]

4 Collection system
- Calling task
  - Use MPI profiling layer
  - Perform local checks
  - Communicate with manager if necessary
    - Call parameters
    - Return program counter (PC)
    - Call-specific information (e.g., buffer checksum)
- Manager
  - Allocate Unix shared memory
  - Receive transactions from calling tasks
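The "buffer checksum" item above can be illustrated with a small sketch: a checksum is taken when a non-blocking send starts and compared again when the send completes, so a write to the buffer in between shows up as a mismatch. This is only a model under stated assumptions. The slides do not say which checksum Umpire uses, so FNV-1a is an arbitrary choice here, and `pending_send`, `record_send`, and `send_buffer_was_modified` are hypothetical names, not Umpire's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a hash over a byte buffer. Assumption: the slides do not name the
   checksum Umpire uses; FNV-1a is chosen only to keep the sketch short. */
uint32_t buffer_checksum(const void *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    uint32_t h = 2166136261u;            /* FNV offset basis */
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;                  /* FNV prime */
    }
    return h;
}

/* Hypothetical record kept for one pending non-blocking send. */
typedef struct {
    const void *buf;
    size_t      len;
    uint32_t    checksum_at_start;
} pending_send;

/* Would run where the wrapper for the non-blocking send is invoked. */
pending_send record_send(const void *buf, size_t len)
{
    pending_send s = { buf, len, buffer_checksum(buf, len) };
    return s;
}

/* Would run where the wrapper for the completion call is invoked.
   Returns 1 if the buffer contents changed while the send was pending. */
int send_buffer_was_modified(const pending_send *s)
{
    return buffer_checksum(s->buf, s->len) != s->checksum_at_start;
}
```

In a real tool these two hooks would sit inside profiling-layer wrappers around the send and completion calls; here they are plain functions so the idea can be tried without an MPI installation.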

5 Manager
- Detects global programming errors
- Unix shared memory communication
- History queues
  - One per MPI task
  - Chronological lists of MPI operations
- Resource registry
  - Communicators
  - Derived datatypes
  - Required for message matching
- Perform verification algorithms

6 Configuration Dependent Deadlock
- Unsafe MPI programming practice
- Code result depends on:
  - MPI implementation limitations
  - User input parameters
- Classic example code, executed by both Task 0 and Task 1:
    MPI_Send
    MPI_Recv
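Why the result depends on the MPI implementation can be shown with a toy model. Many implementations let a standard-mode send return before the receive is posted when the message fits in internal buffering, and block otherwise. The sketch below assumes exactly that behavior; `eager_limit` is a hypothetical parameter standing in for the implementation's buffering threshold, and the function names are illustrative only. It models tasks arranged in a ring, each sending to its neighbor before receiving; two tasks give the pairwise exchange on the slide.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { EXCHANGE_COMPLETES, EXCHANGE_DEADLOCKS } exchange_result;

/* Assumed model: a standard-mode send returns without a posted receive
   only if the message fits in the implementation's internal buffering.
   eager_limit (bytes) is a hypothetical stand-in for that threshold. */
int send_returns_before_recv(size_t msg_bytes, size_t eager_limit)
{
    return msg_bytes <= eager_limit;
}

/* Every task sends msg_bytes[t] to its neighbor and only then receives.
   If any one send returns early, that task reaches its receive and the
   chain of blocked sends unwinds. Deadlock requires every send to block. */
exchange_result ring_send_then_recv(const size_t *msg_bytes, int n_tasks,
                                    size_t eager_limit)
{
    assert(n_tasks >= 2);
    for (int t = 0; t < n_tasks; t++) {
        if (send_returns_before_recv(msg_bytes[t], eager_limit))
            return EXCHANGE_COMPLETES;
    }
    return EXCHANGE_DEADLOCKS;
}
```

With a 4096-byte limit, two tasks exchanging 1 KiB messages complete, while the same code exchanging 1 MiB messages deadlocks. That is the sense in which the outcome depends on both the implementation and the user's input sizes.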

7 Mismatched Collective Operations
- Erroneous MPI programming practice
- Simple example code:
    Tasks 0, 1, & 2    Task 3
    MPI_Bcast          MPI_Barrier
    MPI_Barrier        MPI_Bcast
- Possible code results:
  - Deadlock
  - Correct message matching
  - Incorrect message matching
  - Mysterious error messages
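The mismatch in the example can be found by lining up each task's sequence of collective calls and looking for the first position where they disagree. The sketch below is a simplified offline check written for illustration; it is not the algorithm the slides describe, which works on history queues as transactions arrive. The enum and function names are hypothetical, and all calls are assumed to use the same communicator.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_BARRIER, OP_BCAST } collective_op;

/* calls is a row-major table: calls[t * n_calls + i] is the i-th
   collective issued by task t. Returns the index of the first call at
   which some task differs from task 0, or -1 if all tasks agree. */
int first_collective_mismatch(const collective_op *calls,
                              int n_tasks, int n_calls)
{
    assert(calls != NULL && n_tasks > 0 && n_calls >= 0);
    for (int i = 0; i < n_calls; i++) {
        for (int t = 1; t < n_tasks; t++) {
            if (calls[(size_t)t * n_calls + i] != calls[i])
                return i;
        }
    }
    return -1;
}
```

For the slide's four tasks, the very first call already disagrees (three tasks issue a broadcast while Task 3 issues a barrier), so the function reports index 0.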

8 Deadlock detection
- MPI history queues
  - One per task in Manager
  - Track MPI messaging operations
    - Items added through transactions
    - Remove when safely matched
- Automatically detect deadlocks
  - MPI operations only
  - Wait-for graph
  - Recursive algorithm
  - Invoke when queue head changes
- Also support timeouts
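A wait-for graph searched by a recursive algorithm can be sketched as a depth-first search for a cycle. The version below is a generic illustration, not Umpire's code: the type and function names are hypothetical, the graph is a fixed-size adjacency matrix, and it assumes that a blocked task needs every task it waits for, so that a cycle implies deadlock.

```c
#include <assert.h>
#include <string.h>

#define MAX_TASKS 64

/* waits_for[a][b] != 0 means task a cannot proceed until task b acts. */
typedef struct {
    int n_tasks;
    unsigned char waits_for[MAX_TASKS][MAX_TASKS];
} wait_for_graph;

void wfg_init(wait_for_graph *g, int n_tasks)
{
    assert(n_tasks > 0 && n_tasks <= MAX_TASKS);
    memset(g, 0, sizeof *g);
    g->n_tasks = n_tasks;
}

void wfg_add_edge(wait_for_graph *g, int from, int to)
{
    assert(from >= 0 && from < g->n_tasks);
    assert(to >= 0 && to < g->n_tasks);
    g->waits_for[from][to] = 1;
}

/* state: 0 = unvisited, 1 = on the current search path, 2 = finished. */
static int visit(const wait_for_graph *g, int task, unsigned char *state)
{
    state[task] = 1;
    for (int next = 0; next < g->n_tasks; next++) {
        if (!g->waits_for[task][next])
            continue;
        if (state[next] == 1)
            return 1;                    /* back edge closes a cycle */
        if (state[next] == 0 && visit(g, next, state))
            return 1;
    }
    state[task] = 2;
    return 0;
}

/* Returns 1 if the graph contains a cycle, 0 otherwise. */
int wfg_has_deadlock(const wait_for_graph *g)
{
    unsigned char state[MAX_TASKS] = { 0 };
    for (int t = 0; t < g->n_tasks; t++) {
        if (state[t] == 0 && visit(g, t, state))
            return 1;
    }
    return 0;
}
```

A manager following the slide would rebuild or update such a graph from the heads of the history queues and rerun the search whenever a queue head changes.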

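9 Deadlock Detection Example
[Animated figure: the Manager's history queues for Tasks 0 through 3; three of the queues show Bcast followed by Barrier. Transactions shown, in the order they appear in the transcript: Task 1: MPI_Bcast; Task 0: MPI_Bcast; Task 0: MPI_Barrier; Task 2: MPI_Bcast; Task 3: MPI_Barrier; "ERROR! Report it!"; Task 2: MPI_Barrier; Task 1: MPI_Barrier.]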

10 Resource Tracking Errors
- Many MPI features require resource allocations
  - Communicators, datatypes, and requests
  - Detect “leaks” automatically
- Simple “lost request” example:
    MPI_Irecv (..., &req);
    MPI_Wait (&req,…)
- Complicated by assignment
- Also detect errant writes to send buffers
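Leak detection of the kind listed above can be modeled with a small registry that records every request when it is started and clears it when it completes; whatever is still marked at the end was lost. This is a minimal sketch under assumptions: integer ids stand in for `MPI_Request` handles, and `request_registry` and its functions are hypothetical names rather than Umpire's interface.

```c
#include <assert.h>
#include <string.h>

#define MAX_REQUESTS 128

typedef struct {
    int next_id;                         /* id given to the next request */
    int outstanding[MAX_REQUESTS];       /* 1 while started, not completed */
} request_registry;

void registry_init(request_registry *r)
{
    memset(r, 0, sizeof *r);
}

/* Would run where a wrapper for a non-blocking call creates a request. */
int registry_start(request_registry *r)
{
    assert(r->next_id < MAX_REQUESTS);
    r->outstanding[r->next_id] = 1;
    return r->next_id++;
}

/* Would run where a wrapper for the completion call retires a request. */
void registry_complete(request_registry *r, int id)
{
    assert(id >= 0 && id < r->next_id);
    r->outstanding[id] = 0;
}

/* Number of requests that were started but never completed. */
int registry_count_leaks(const request_registry *r)
{
    int leaks = 0;
    for (int id = 0; id < r->next_id; id++)
        leaks += r->outstanding[id];
    return leaks;
}
```

The "complicated by assignment" case appears when the program's only handle variable is overwritten by a second start before the first request completes: the program can no longer wait on the first request, but the registry still holds it, so it is counted as a leak.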

11 Conclusion
- First automated MPI debugging tool
  - Detect deadlocks
  - Eliminate resource leaks
  - Assure correct non-blocking sends
- Performance
  - Low overhead (21% for sPPM)
  - Located deadlock in code set-up
- Limitations
  - MPI_Waitany and MPI_Cancel
  - Shared memory implementation
  - Prototype only

12 Future Work
- Further prototype testing
- Improve user interface
- Handle all MPI calls
- Tool distribution
  - LLNL application group testing
  - Exploring mechanisms for wider availability
- Detection of other errors
  - Datatype matching
  - Others?
- Distributed memory implementation

13 Work performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.
UCRL-VG-139184

