
1 Choosing MPI Alternatives
  • MPI offers many ways to accomplish the same task
  • Which is best?
    » Just like everything else, it depends on the vendor and system architecture
    » Like C and Fortran, MPI provides the programmer with the tools to achieve high performance without sacrificing portability
  • Experiments with a Jacobi relaxation example

2 Tuning for MPI’s Send/Receive Protocols
  • Aggressive Eager
    » Performance problem: extra copies
    » Possible deadlock if eager buffering is inadequate
    » Ensure that receives are posted before sends
    » MPI_Issend can be used to express “wait until receive is posted” (see the sketch below)
  • Rendezvous with sender push
    » Extra latency
    » Possible delays while waiting for the sender to begin
  • Rendezvous with receiver pull
    » Possible delays while waiting for the receiver to begin
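A minimal sketch of the MPI_Issend idea on the last Aggressive Eager bullet: a synchronous-mode send cannot complete until the matching receive has been posted, so testing the request tells the sender when the receiver is ready. The function name, buffer, peer, and tag here are illustrative, not part of the original example.

    #include <mpi.h>

    /* Illustrative only: "peer", "tag", and the buffer are placeholders.
       MPI_Issend starts a synchronous-mode send; the request cannot
       complete until the matching receive has been posted, so MPI_Test
       doubles as a probe for "has the receiver posted its receive yet?" */
    void send_when_receiver_ready(double *buf, int count, int peer, int tag)
    {
        MPI_Request req;
        int posted = 0;
        MPI_Issend(buf, count, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD, &req);
        while (!posted) {
            MPI_Test(&req, &posted, MPI_STATUS_IGNORE);
            /* do other useful work here while waiting */
        }
    }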

3 Rendezvous Blocking
  • What happens once sender and receiver rendezvous?
    » Sender (push) or receiver (pull) may complete the operation
    » May block other operations while completing
  • Performance tradeoff
    » If the operation does not block (by checking for other requests), it adds latency or reduces bandwidth
  • Can reduce performance if a receiver, having acknowledged a send, must wait for the sender to complete a separate operation that it has started

4 Tuning for Rendezvous with Sender Push
  • Ensure receives are posted before sends
    » Better: ensure receives match sends before computation starts; it may be better to do the sends before the receives
  • Ensure that sends have time to start transfers
  • Can use short control messages (see the sketch below)
  • Beware of the cost of extra messages
    » The Intel i860 encouraged use of control messages with ready send (force type)
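One hedged way to combine these ideas: the receiver posts its receive and then sends a zero-byte control message, after which the sender may use a ready send (the role the i860 “force type” played). The names iam_receiver, buf, COUNT, peer, DATA_TAG, and READY_TAG are illustrative; the control message is itself the “extra message” whose cost the last bullet warns about.

    /* Sketch: short control message guarantees the receive is posted
       before the ready send starts.  All names are placeholders. */
    MPI_Request req;
    if (iam_receiver) {
        MPI_Irecv(buf, COUNT, MPI_DOUBLE, peer, DATA_TAG, MPI_COMM_WORLD, &req);
        MPI_Send(NULL, 0, MPI_BYTE, peer, READY_TAG, MPI_COMM_WORLD);  /* "I'm ready" */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(NULL, 0, MPI_BYTE, peer, READY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Rsend(buf, COUNT, MPI_DOUBLE, peer, DATA_TAG, MPI_COMM_WORLD);
    }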

5 Tuning for Rendezvous with Receiver Pull
  • Place MPI_Isends before receives
  • Use short control messages to ensure matches
  • Beware of the cost of extra messages

6 Experiments with MPI Implementations
  • Multiparty data exchange
  • Jacobi iteration in 2 dimensions
    » Model for PDEs, matrix-vector products
    » Algorithms with surface/volume behavior
    » Issues similar to unstructured grid problems (but harder to illustrate)
  • Others at

7 Jacobi Iteration
  • Simple parallel data structure
  • Processes exchange rows with neighbors
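A minimal layout sketch for this data structure, assuming a 1-D row decomposition with one ghost row above and below the locally owned rows. NLOCAL, NY, and the variable names are illustrative (not from the original code) and are reused by the exchange sketches on the later slides.

    #include <mpi.h>

    #define NLOCAL 100      /* rows owned by this rank (assumed size) */
    #define NY     100      /* points per row (assumed size) */

    /* inside main(), after MPI_Init(&argc, &argv): */
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* rows 0 and NLOCAL+1 are ghost copies of the neighbors' boundary rows */
    double u[NLOCAL + 2][NY];

    /* MPI_PROC_NULL makes the edge ranks' sends/receives no-ops */
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;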

8 Background to Tests
  • Goals
    » Identify better-performing idioms for the same communication operation
    » Understand these by understanding the underlying MPI process
    » Provide a starting point for evaluating additional options (there are many ways to write even simple codes)

9 Some Send/Receive Approaches
  • Each approach is based on a hypothesis about the underlying operation; most apply to polling-mode implementations. The experiments test each of the following hypotheses:
    » Better to start receives first
    » Ensure recvs are posted before sends
    » Ordered (no overlap)
    » Nonblocking operations, with effective overlap
    » Use of Ssend and Rsend versions (EPCC/T3D can prefer Ssend over Send; uses Send for buffered send)
    » Manually advance the automaton
  • Persistent operations

10 Scheduling Communications
  • Is it better to use MPI_Waitall or to schedule/order the requests yourself?
    » Does the implementation complete a Waitall in any order, or does it prefer the requests as ordered in the array of requests?
  • In principle, it should always be best to let MPI schedule the operations. In practice, it may be better to order either the short or the long messages first, depending on how data is transferred.
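A hedged sketch of the two alternatives, for some array reqs of n outstanding requests (names illustrative):

    /* (a) Let the implementation pick the completion order. */
    MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);

    /* (b) Take control of the order: either call MPI_Wait on the requests
       in the order you want them serviced (e.g. short messages first), or
       use MPI_Waitany to service whichever request finishes next. */
    for (int done = 0; done < n; done++) {
        int which;
        MPI_Waitany(n, reqs, &which, MPI_STATUS_IGNORE);
        /* handle the message belonging to reqs[which] here */
    }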

11 Some Example Results
  • Summarizes several different approaches
  • More details at mpiexmpl/src3/runs.html

12 Send and Recv
  • Simplest use of send and recv
  • Very poor performance on the SP2
    » Rendezvous sequentializes the sends/receives
  • OK performance on the T3D (the implementation tends to buffer operations)
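A minimal sketch of this simplest exchange, reusing the illustrative u/up/down/NLOCAL/NY names assumed on slide 7: each rank issues blocking sends of its boundary rows and only then receives its ghost rows. This is exactly the ordering a rendezvous protocol serializes, since each MPI_Send waits for its matching receive to be posted (and the pattern can deadlock outright if nothing is buffered).

    /* Blocking sends first, then blocking receives (unsafe ordering). */
    MPI_Send(u[1],      NY, MPI_DOUBLE, up,   0, MPI_COMM_WORLD);
    MPI_Send(u[NLOCAL], NY, MPI_DOUBLE, down, 1, MPI_COMM_WORLD);
    MPI_Recv(u[NLOCAL + 1], NY, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(u[0],          NY, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);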

13 Better to start receives first
  • Irecv, Isend, Waitall
  • OK performance
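A sketch of the same exchange in the Irecv/Isend/Waitall idiom, with the receives posted before the sends (same illustrative names as above):

    MPI_Request reqs[4];
    MPI_Irecv(u[0],          NY, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(u[NLOCAL + 1], NY, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(u[1],          NY, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(u[NLOCAL],     NY, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);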

14 Ensure recvs posted before sends
  • Irecv, Sendrecv/Barrier, Rsend, Waitall
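One hedged reading of this recipe, using a barrier as the synchronization step (the slide also allows MPI_Sendrecv for the same purpose); the barrier plays the role of the per-pair control message sketched after slide 4, and the names again come from the slide 7 sketch:

    /* Post receives, establish that everyone has posted theirs, then use
       ready sends, which require the matching receive to already exist. */
    MPI_Request recv_reqs[2];
    MPI_Irecv(u[0],          NY, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &recv_reqs[0]);
    MPI_Irecv(u[NLOCAL + 1], NY, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &recv_reqs[1]);
    MPI_Barrier(MPI_COMM_WORLD);   /* every rank has now posted its receives */
    MPI_Rsend(u[1],      NY, MPI_DOUBLE, up,   0, MPI_COMM_WORLD);
    MPI_Rsend(u[NLOCAL], NY, MPI_DOUBLE, down, 1, MPI_COMM_WORLD);
    MPI_Waitall(2, recv_reqs, MPI_STATUSES_IGNORE);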

15 Use of Ssend versions
  • Ssend allows the send to wait until the receive is ready
    » At least one implementation (T3D) gives better performance for Ssend than for Send

16 Nonblocking Operations, Overlap Effective
  • Isend, Irecv, Waitall
  • A variant uses Waitsome with computation (sketched below)
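A hedged sketch of the Waitsome variant: post the exchange, update the interior points that need no ghost data while the messages are in flight, and finish the boundary once they have arrived. update_interior() and update_boundary() are hypothetical helpers, not part of the original example; the other names come from the slide 7 sketch.

    MPI_Request reqs[4];
    MPI_Irecv(u[0],          NY, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(u[NLOCAL + 1], NY, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(u[1],          NY, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(u[NLOCAL],     NY, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &reqs[3]);

    update_interior();                  /* needs no ghost data; overlaps the exchange */

    int remaining = 4, outcount, indices[4];
    while (remaining > 0) {
        MPI_Waitsome(4, reqs, &outcount, indices, MPI_STATUSES_IGNORE);
        remaining -= outcount;
    }
    update_boundary();                  /* ghost rows are now valid */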

17 Persistent Operations
  • Potential savings
    » Allocation of the MPI_Request
    » Validating and storing arguments
    » Fewer interactions with the “polling” engine
  • Variations of the example (the first is sketched below)
    » sendinit, recvinit, startall, waitall
    » startall(recvs), sendrecv/barrier, startall(rsends), waitall
  • Some vendor implementations are buggy
  • Persistent operations may be slightly slower
    » if the vendor optimizes for non-persistent operations
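A hedged sketch of the first variation (sendinit, recvinit, startall, waitall): the init calls validate and store the arguments once, and each iteration only starts and completes the fixed set of requests. niters is illustrative, and the other names again come from the slide 7 sketch.

    MPI_Request reqs[4];
    MPI_Recv_init(u[0],          NY, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(u[NLOCAL + 1], NY, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Send_init(u[1],          NY, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Send_init(u[NLOCAL],     NY, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &reqs[3]);

    for (int it = 0; it < niters; it++) {
        MPI_Startall(4, reqs);
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        /* ... Jacobi update using the refreshed ghost rows ... */
    }
    for (int i = 0; i < 4; i++)
        MPI_Request_free(&reqs[i]);

Note that this sketch assumes the update writes back into u rather than swapping buffers, since persistent requests stay bound to the buffers given at init time.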

18 Polling Engine
  • Some implementations poll for incoming control messages on “important” MPI calls
    » Each polling operation costs time
    » HP/Convex is one vendor implementation that does this
  • Operations that create persistent operations don’t poll

19 Summary of Results
  • Better to start sends before receives
    » Most implementations use rendezvous protocols for long messages (Cray, IBM, SGI)
  • Synchronous sends better on the T3D
    » otherwise the system buffers
  • MPI_Rsend can offer some performance gain on the IBM SP
    » as long as receives can be guaranteed without extra messages