High Performance Parallel Programming Dirk van der Knijff Advanced Research Computing Information Division.


High Performance Parallel Programming Dirk van der Knijff Advanced Research Computing Information Division

High Performance Parallel Programming Lecture 8: Message Passing Interface (MPI) (part 2)

High Performance Parallel Programming Example problem One-dimensional smoothing: –each element is set to the average of its neighbours (figure: the array is distributed across processes P1...Pn, with boundary cells and halo cells marked)

High Performance Parallel Programming Deadlock If we implement an algorithm like this
for (iterations) {
  update all cells;
  send boundary values to neighbours;
  receive halo values from neighbours;
}
every process sends before it receives, so with synchronous (or unbuffered) sends all processes block in the send and the program deadlocks

High Performance Parallel Programming Non-blocking communications
The routine returns before the communication completes
Separate communication into phases:
–initiate non-blocking communication
–do some work (perhaps involving other communications)
–wait for the non-blocking communication to complete
Can test for completion before waiting (or instead of waiting)

High Performance Parallel Programming Solution So our algorithm now looks like this
for (iterations) {
  update boundary cells;
  initiate sending of boundary values;
  initiate receipt of halo values;
  update non-boundary cells;
  wait for completion of sending boundary values;
  wait for completion of receiving halo values;
}
–Deadlock cannot occur
–Communication can occur simultaneously in each direction
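A minimal C sketch of one such exchange (the function name, the neighbour ranks left and right, and the array layout with interior cells u[1..n] and halo cells u[0] and u[n+1] are all invented for illustration; end processes can pass MPI_PROC_NULL, and the boundary cells u[1] and u[n] are assumed already updated by the caller):

    #include <mpi.h>

    /* Exchange halo values with both neighbours while the interior is updated. */
    void halo_exchange(double *u, int n, int left, int right)
    {
        MPI_Request req[4];
        MPI_Status  stat[4];

        /* initiate receipt of halo values */
        MPI_Irecv(&u[0],   1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&u[n+1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
        /* initiate sending of boundary values */
        MPI_Isend(&u[1],   1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(&u[n],   1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

        /* ... update the non-boundary cells u[2..n-1] here ... */

        /* wait for completion of all four communications */
        MPI_Waitall(4, req, stat);
    }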

High Performance Parallel Programming Non-blocking communication in MPI All the same arguments as the blocking counterparts plus an extra argument This argument, request, is a handle which is used to test when the operation has completed. Same communication modes as blocking:
MPI_Isend  standard send
MPI_Issend synchronous send
MPI_Ibsend buffered send
MPI_Irsend ready send
MPI_Irecv  receive

High Performance Parallel Programming Handles
datatype – same as blocking: MPI_Datatype (or INTEGER in Fortran)
communicator – same as blocking: MPI_Comm (or INTEGER)
request – MPI_Request (or INTEGER); a request handle is allocated when a communication is initiated
MPI_Issend(buf, count, datatype, dest, tag, comm, handle)

High Performance Parallel Programming Testing for completion Two types:
WAIT type
–block until the communication has completed
–useful when the data or buffer is required
MPI_Wait(request, status)
TEST type
–return a TRUE or FALSE value depending on completion
–do not block
–useful if the data is not yet required
MPI_Test(request, flag, status)
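For example, MPI_Test can be polled so that local work continues until the message arrives (a sketch; the buffer, its size and the source rank are arbitrary choices, and the fragment assumes MPI has already been initialised):

    double buf[100];
    int flag = 0;
    MPI_Request request;
    MPI_Status  status;

    MPI_Irecv(buf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &request);
    while (!flag) {
        /* ... do some useful local work here ... */
        MPI_Test(&request, &flag, &status);   /* returns immediately, sets flag */
    }
    /* the receive has completed: buf now holds the message */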

High Performance Parallel Programming Blocking and non-blocking A non-blocking routine followed by a wait is equivalent to a blocking routine Send and receive can be blocking or non-blocking A blocking send can be used with a non-blocking receive and vice-versa Non-blocking sends can use any mode –standard, synchronous, buffered, ready Synchronous mode affects completion, not initiation Cannot alter send buffer until send completed

High Performance Parallel Programming Multiple communications Sometimes have many non-blocking communications posted at the same time. MPI provides routines to test multiple communications Three types –test for all –test for any –test for some Each type comes in both wait and test versions

High Performance Parallel Programming MPI_Waitall Tests all of the specified communications –blocking: MPI_Waitall(count, array_of_requests, array_of_statuses) –non-blocking: MPI_Testall(count, array_of_requests, flag, array_of_statuses) Information about each communication is returned in array_of_statuses flag is set to true if all the communications have completed

High Performance Parallel Programming MPI_Waitany Tests if any communications have completed MPI_Waitany(count, array_of_requests, index, status) and MPI_Testany(count, array_of_requests, index, flag, status) The index in array_of_requests and status of the completed communication are returned in index and status If more than one has completed, the choice is arbitrary

High Performance Parallel Programming MPI_Waitsome Differs from Waitany in behaviour if more than one communication can complete Returns the status of all communications that have completed
MPI_Waitsome(count, array_of_requests, outcount, array_of_indices, array_of_statuses)
MPI_Testsome(count, array_of_requests, outcount, array_of_indices, array_of_statuses)
Both obey a fairness rule to help prevent starvation Note: all completion tests deallocate the request object when they return as complete. The handle is set to MPI_REQUEST_NULL
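As an illustration of the 'any' form (a sketch, not from the slides; nworkers, req, result and process_result are invented names, and one receive per worker is assumed to have been posted into req already):

    int i, index;
    MPI_Status status;

    for (i = 0; i < nworkers; i++) {
        /* block until any one of the posted receives completes */
        MPI_Waitany(nworkers, req, &index, &status);
        process_result(result[index]);
        /* req[index] has been deallocated and set to MPI_REQUEST_NULL */
    }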

High Performance Parallel Programming Derived datatypes As discussed last lecture, there are occasions when we wish to pass data that doesn’t fit the basic model... e.g. –a matrix sub-block or matrix section ( a(5,:) ) (non-contiguous data items) –a structure (contiguous, differing type) –a set of variables ( n, set(n) ) (random) There are solutions using standard types but clumsy...

High Performance Parallel Programming Derived datatypes (cont.) Two-stage process: –construct the datatype –commit the datatype Datatype is constructed from basic datatypes using –MPI_Type_contiguous –MPI_Type_vector –MPI_Type_hvector –MPI_Type_indexed –MPI_Type_hindexed –MPI_Type_struct

High Performance Parallel Programming Derived datatypes (cont.) Once the new datatype is constructed it must be committed. MPI_Type_commit(datatype) After use a datatype can be de-allocated MPI_Type_free(datatype) –Any messages in progress are unaffected when a type is freed –Datatypes derived from datatype are also unaffected

High Performance Parallel Programming Derived datatypes - Type Maps Any datatype is specified by its type map A type map is a list of (basic datatype, displacement) pairs:
(basic datatype 0, displacement of datatype 0)
(basic datatype 1, displacement of datatype 1)
...
(basic datatype n, displacement of datatype n)
Displacements may be positive, zero, or negative Displacements are from the start of the communication buffer

High Performance Parallel Programming MPI_TYPE_VECTOR
MPI_Type_vector(count, blocklength, stride, oldtype, newtype)
e.g.
MPI_Datatype new;
MPI_Type_vector(2, 3, 5, MPI_DOUBLE, &new);
(figure: the new type is count=2 blocks, each of blocklength=3 MPI_DOUBLEs, with the starts of consecutive blocks stride=5 elements apart)
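A common use of MPI_Type_vector is to describe a column of a row-major C matrix (a sketch; the 10x10 array, the destination rank and the tag are arbitrary choices):

    double a[10][10];
    MPI_Datatype column;

    /* 10 blocks of 1 double, the start of each block 10 doubles apart:
       i.e. one column of the row-major array */
    MPI_Type_vector(10, 1, 10, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* send column 2: a[0][2], a[1][2], ..., a[9][2] */
    MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);

    MPI_Type_free(&column);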

High Performance Parallel Programming MPI_Type_struct
MPI_Type_struct(count, array_of_blocklengths, array_of_displacements, array_of_types, newtype)
(figure: newtype built from an MPI_INT block followed by an MPI_DOUBLE block, as in the example on the next slide)

High Performance Parallel Programming MPI_Type_struct - example
int blocklen[2];
MPI_Aint disp[2], extent;
MPI_Datatype type[2], new;
struct { int n; double dble[3]; } msg;
disp[0] = 0;
MPI_Type_extent(MPI_INT, &extent);
disp[1] = extent;
type[0] = MPI_INT;
type[1] = MPI_DOUBLE;
blocklen[0] = 1;
blocklen[1] = 3;
MPI_Type_struct(2, blocklen, disp, type, &new);
MPI_Type_commit(&new);
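Once committed, the new type describes the whole structure, so it can be sent as a single message (a sketch; destination rank 1 and tag 0 are chosen arbitrarily):

    MPI_Send(&msg, 1, new, 1, 0, MPI_COMM_WORLD);

Note that taking disp[1] from the extent of MPI_INT assumes the compiler inserts no padding between the two members; computing the displacements with MPI_Address on the actual struct members is the more robust approach.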

High Performance Parallel Programming Derived datatypes - other routines MPI_Type_size(datatype, size) –returns the total size of all the data items in datatype MPI_Type_extent(datatype, extent) –returns the distance between the lower and upper bounds MPI_Type_lb(datatype, lb) –returns the lower bound of the datatype (offset in bytes) MPI_Type_ub(datatype, ub) –returns the upper bound of the datatype

High Performance Parallel Programming Matching rules A send and receive are correctly matched if the type maps of the specified datatypes with the displacements ignored match according to the rules for basic datatypes. The number of basic elements received can be found using MPI_Get_elements. MPI_Get_count returns the number of received elements of the specified datatype –may not be a whole number –if so returns MPI_UNDEFINED
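A sketch of the difference (it assumes vectype is a committed derived type containing 6 doubles, such as the one from the MPI_TYPE_VECTOR slide, that the matching send supplied 9 doubles, and that source and tag are 0):

    double buf[64];
    MPI_Status status;
    int count, nelems;

    MPI_Recv(buf, 4, vectype, 0, 0, MPI_COMM_WORLD, &status);

    MPI_Get_elements(&status, vectype, &nelems);  /* nelems = 9 basic elements */
    MPI_Get_count(&status, vectype, &count);      /* 9 is not a whole number of
                                                     vectypes, so count = MPI_UNDEFINED */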

High Performance Parallel Programming Virtual Topologies Convenient process naming Naming scheme to fit communication pattern Simplifies writing of code Can allow MPI to optimise communications Creating a topology produces a new communicator MPI provides mapping functions Mapping functions compute processor ranks based on the topology naming scheme

High Performance Parallel Programming Example - a 2D torus 12 processes, shown as rank (coordinates):
 0 (0,0)   3 (1,0)   6 (2,0)   9 (3,0)
 1 (0,1)   4 (1,1)   7 (2,1)  10 (3,1)
 2 (0,2)   5 (1,2)   8 (2,2)  11 (3,2)

High Performance Parallel Programming Topology Types Cartesian topologies –each process is connected to its neighbours in a virtual grid –boundaries can be cyclic, or not –processes are identified by cartesian coordinates Graph topologies –general connected graphs –I’m not going to cover them

High Performance Parallel Programming Creating a cartesian topology MPI_Cart_create(comm_old, ndims, dims, periods, reorder, comm_cart) ndims number of dimensions dims number of processors in each dimension periods true or false specifying cyclic reorder false => data already distributed - use existing ranks true => MPI can reorder ranks
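For example, the 4 x 3 torus shown earlier could be created like this (a sketch; it assumes at least 12 processes in MPI_COMM_WORLD, and cart_comm is an invented name):

    MPI_Comm cart_comm;
    int dims[2]    = {4, 3};   /* 4 x 3 process grid */
    int periods[2] = {1, 1};   /* cyclic in both dimensions => a torus */
    int reorder    = 1;        /* allow MPI to reorder ranks */

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &cart_comm);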

High Performance Parallel Programming Cartesian mapping functions MPI_Cart_rank(comm, coords, rank) –used to determine the rank of a process with the specified coordinates MPI_Cart_coords(comm, rank, maxdims, coords) –converts process rank to grid coords MPI_Cart_shift(comm, direction, disp, rank_source, rank_dest) –provides the correct ranks for a shift –these can then be used in sends and receives –direction is the dimension in which the shift occurs –no support for diagonal shifts
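A sketch of a shift by one position along dimension 0, reusing the cart_comm from the previous sketch (MPI_Sendrecv is the standard blocking combined send/receive; the data values are placeholders):

    int rank_source, rank_dest;
    double sendval = 1.0, recvval;
    MPI_Status status;

    /* for a displacement of +1 in dimension 0: who sends to me, who do I send to */
    MPI_Cart_shift(cart_comm, 0, 1, &rank_source, &rank_dest);

    MPI_Sendrecv(&sendval, 1, MPI_DOUBLE, rank_dest,   0,
                 &recvval, 1, MPI_DOUBLE, rank_source, 0,
                 cart_comm, &status);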

High Performance Parallel Programming Cartesian partitioning It is possible to create a partition of a cartesian topology Often used to create communicators for row (or slice) operations MPI_Cart_sub(comm, remain_dims, new_comm) If comm defines a 2x3x4 grid and remain_dims = (true, false, true) then MPI_Cart_sub will create 3 new communicators each with 8 processors in a 2x4 grid Note that only one communicator is returned - the one which contains the calling process.
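A sketch for the 2x3x4 example above (grid_comm is an assumed communicator carrying that cartesian topology):

    MPI_Comm slice_comm;
    int remain_dims[3] = {1, 0, 1};   /* keep dimensions 0 and 2, drop dimension 1 */

    /* each process receives the 2x4 sub-grid communicator that contains it */
    MPI_Cart_sub(grid_comm, remain_dims, &slice_comm);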

High Performance Parallel Programming Local notes for ping-pong Contact me for an account (or see me, Shaoib, or Srikumar). We are having some queue problems but will fix them ASAP Remember to add /usr/local/mpi/bin to your PATH Use mpicc to compile (don’t add -lmpi) You need ssh to connect to charm The other nodes are on a private LAN (switch)

High Performance Parallel Programming mpich After the MPI standard was announced, a portable implementation, mpich, was produced by ANL. It consists of:
–libraries and include files - libmpi, mpi.h
–compilers - mpicc, mpif90. These know about things like where the relevant include and library files are
–runtime loader - mpirun. Has arguments -np and -machinefile; implements the SPMD paradigm by starting a copy of the program on each node. The program must therefore do any differentiation itself (using the MPI_Comm_size() and MPI_Comm_rank() functions, as in the sketch below).
NOTE: our version gets the number of CPUs and their addresses from PBS (i.e., don't use -np and/or -machinefile)
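As a minimal illustration of that differentiation (a standard hello-world sketch, compiled with mpicc and launched with mpirun or through PBS as described on the next slides):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            printf("Master: running on %d processes\n", size);
        else
            printf("Worker %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }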

High Performance Parallel Programming PBS PBS is a batch system - jobs get submitted to a queue The job is a shell script to execute your program The shell script can contain job management instructions (note that these instructions can also be given on the command line) PBS will allocate your job to some other computer, log in as you, and execute your script, i.e. your script must contain cd commands or absolute paths to access files (or globus objects) Useful PBS commands:
–qsub - submits a job
–qstat - monitors status
–qdel - deletes a job from a queue

High Performance Parallel Programming PBS directives Some PBS directives to insert at the start of your shell script:
–#PBS -q (queue name)
–#PBS -e (stderr location)
–#PBS -o (stdout location)
–#PBS -eo (combines stderr and stdout)
–#PBS -t (maximum time)
–#PBS -l resource=value (e.g. -l nodes=2)

High Performance Parallel Programming charm charm.hpc.unimelb.edu.au is a dual Pentium (PII, 266MHz, 128MB RAM) and is the front end for the PC farm. It's running Red Hat Linux. Behind charm are sixteen PCs (all 200MHz MMX, with 64MB RAM). Their DNS designations are pc-i11.hpc.unimelb.edu.au, ..., pc-i18.hpc.unimelb.edu.au and pc-j11.hpc.unimelb.edu.au, ..., pc-j18.hpc.unimelb.edu.au. OpenPBS is the batch system that is implemented on charm. There are four batch queues implemented on charm:
–pque (all nodes)
–exclusive (all nodes)
–pquei (pc-i* nodes only)
–pquej (pc-j* nodes only)

High Performance Parallel Programming Thursday: More MPI