ECE 1747H : Parallel Programming Message Passing (MPI)


Explicit Parallelism Same idea as explicit multithreading for shared memory. Explicit parallelism is more common with message passing. –The user has explicit control over processes. –Good: this control can be used for performance benefit. –Bad: the user has to deal with it.

Distributed Memory - Message Passing [Figure: processors proc1 … procN, each with its own private memory mem1 … memN, connected by a network.]

Distributed Memory - Message Passing A variable x, a pointer p, or an array a[] refers to different memory locations, depending on the processor. In this course, we discuss message passing as a programming model (it can run on any hardware).

What does the user have to do? This is what we said for shared memory: –Decide how to decompose the computation into parallel parts. –Create (and destroy) processes to support that decomposition. –Add synchronization to make sure dependences are covered. Is the same true for message passing?

Another Look at SOR Example
for some number of timesteps/iterations {
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
      temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] );
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
      grid[i][j] = temp[i][j];
}

Shared Memory [Figure: processes proc1 … procN all access a single shared grid and a single shared temp array.]

Message-Passing Data Distribution (only middle processes) [Figure: proc2 and proc3 each hold only their own block of grid and temp in private memory.]

Is this going to work? Same code as we used for shared memory:
for( i=from; i<to; i++ )
  for( j=0; j<n; j++ )
    temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] );
No, we need extra boundary elements for grid.

Data Distribution (only middle processes) [Figure: proc2 and proc3 each hold their block of grid and temp plus extra boundary rows of grid.]

Is this going to work? Same code as we used for shared memory:
for( i=from; i<to; i++ )
  for( j=0; j<n; j++ )
    temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] );
No, on the next iteration we need boundary elements from our neighbors.

Data Communication (only middle processes) [Figure: proc2 and proc3 exchange their boundary rows of grid with their neighbors.]

Is this now going to work? Same code as we used for shared memory:
for( i=from; i<to; i++ )
  for( j=0; j<n; j++ )
    temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] );
No, we need to translate the indices.

Index Translation
for( i=0; i<n/p; i++ )
  for( j=0; j<n; j++ )
    temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] );
Remember, all variables are local.
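
To make the translation concrete, a minimal sketch of the mapping under a block distribution of n rows over p processes (the helper names are illustrative, and p is assumed to divide n):

/* Local row i on process myrank corresponds to global row myrank*(n/p) + i. */
int local_to_global( int i_local, int myrank, int n, int p )
{
  return myrank * (n/p) + i_local;
}

int global_to_local( int i_global, int n, int p )
{
  return i_global % (n/p);    /* position inside the owning process's block */
}

int owner_of( int i_global, int n, int p )
{
  return i_global / (n/p);    /* rank of the process that holds this global row */
}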

Index Translation is Optional Allocate the full arrays on each processor. Leave indices alone. Higher memory use. Sometimes necessary (see later).

What does the user need to do? Divide up the program into parallel parts. Create and destroy processes to support that division. Partition and distribute the data. Communicate data at the right time. (Sometimes) perform index translation. Still need to do synchronization? –Sometimes, but it often goes hand in hand with data communication.

Message Passing Systems Provide process creation and destruction. Provide message passing facilities (send and receive, in various flavors) to distribute and communicate data. Provide additional synchronization facilities.

MPI (Message Passing Interface) MPI is the de facto message passing standard. It is available on virtually all platforms. It grew out of an earlier message passing system, PVM, which is now outdated.

MPI Process Creation/Destruction MPI_Init( int *argc, char ***argv ) Initiates a computation. MPI_Finalize() Terminates a computation.

MPI Process Identification MPI_Comm_size( comm, &size ) Determines the number of processes. MPI_Comm_rank( comm, &pid ) Pid is the process identifier of the caller.
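
Putting these calls together, a minimal sketch of a complete MPI program (the message text is illustrative):

#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
  int myrank, size;

  MPI_Init( &argc, &argv );                   /* initiate the computation */
  MPI_Comm_rank( MPI_COMM_WORLD, &myrank );   /* my process identifier */
  MPI_Comm_size( MPI_COMM_WORLD, &size );     /* total number of processes */

  printf( "Hello from process %d of %d\n", myrank, size );

  MPI_Finalize();                             /* terminate the computation */
  return 0;
}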

MPI Basic Send
MPI_Send(buf, count, datatype, dest, tag, comm)
–buf: address of send buffer
–count: number of elements
–datatype: data type of send buffer elements
–dest: process id of destination process
–tag: message tag (ignore for now)
–comm: communicator (ignore for now)

MPI Basic Receive
MPI_Recv(buf, count, datatype, source, tag, comm, &status)
–buf: address of receive buffer
–count: size of receive buffer in elements
–datatype: data type of receive buffer elements
–source: source process id or MPI_ANY_SOURCE
–tag and comm: ignore for now
–status: status object
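
A minimal sketch using these two calls, assuming the program is run with at least two processes (the variable names are illustrative):

#include <stdio.h>
#include "mpi.h"

int main( int argc, char *argv[] )
{
  int value, myrank, tag = 0;
  MPI_Status status;

  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &myrank );

  if( myrank == 0 ) {
    value = 42;
    MPI_Send( &value, 1, MPI_INT, 1, tag, MPI_COMM_WORLD );            /* one int to process 1 */
  } else if( myrank == 1 ) {
    MPI_Recv( &value, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status );   /* one int from process 0 */
    printf( "process 1 received %d\n", value );
  }

  MPI_Finalize();
  return 0;
}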

MPI Matrix Multiply (w/o Index Translation)
main(int argc, char *argv[])
{
  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
  MPI_Comm_size( MPI_COMM_WORLD, &p );
  from = (myrank * n)/p;
  to   = ((myrank+1) * n)/p;
  /* Data distribution */ ...
  /* Computation */ ...
  /* Result gathering */ ...
  MPI_Finalize();
}

MPI Matrix Multiply (w/o Index Translation)
/* Data distribution */
if( myrank != 0 ) {
  MPI_Recv( &a[from], n*n/p, MPI_INT, 0, tag, MPI_COMM_WORLD, &status );
  MPI_Recv( &b, n*n, MPI_INT, 0, tag, MPI_COMM_WORLD, &status );
} else {
  for( i=1; i<p; i++ ) {
    MPI_Send( &a[(i*n)/p], n*n/p, MPI_INT, i, tag, MPI_COMM_WORLD );
    MPI_Send( &b, n*n, MPI_INT, i, tag, MPI_COMM_WORLD );
  }
}

MPI Matrix Multiply (w/o Index Translation)
/* Computation */
for( i=from; i<to; i++ )
  for( j=0; j<n; j++ ) {
    c[i][j] = 0;
    for( k=0; k<n; k++ )
      c[i][j] += a[i][k] * b[k][j];
  }

MPI Matrix Multiply (w/o Index Translation)
/* Result gathering */
if( myrank != 0 )
  MPI_Send( &c[from], n*n/p, MPI_INT, 0, tag, MPI_COMM_WORLD );
else
  for( i=1; i<p; i++ )
    MPI_Recv( &c[(i*n)/p], n*n/p, MPI_INT, i, tag, MPI_COMM_WORLD, &status );

MPI Matrix Multiply (with Index Translation)
main(int argc, char *argv[])
{
  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
  MPI_Comm_size( MPI_COMM_WORLD, &p );
  from = (myrank * n)/p;
  to   = ((myrank+1) * n)/p;
  /* Data distribution */ ...
  /* Computation */ ...
  /* Result gathering */ ...
  MPI_Finalize();
}

MPI Matrix Multiply (with Index Translation)
/* Data distribution */
if( myrank != 0 ) {
  MPI_Recv( &a, n*n/p, MPI_INT, 0, tag, MPI_COMM_WORLD, &status );
  MPI_Recv( &b, n*n, MPI_INT, 0, tag, MPI_COMM_WORLD, &status );
} else {
  for( i=1; i<p; i++ ) {
    MPI_Send( &a[(i*n)/p], n*n/p, MPI_INT, i, tag, MPI_COMM_WORLD );
    MPI_Send( &b, n*n, MPI_INT, i, tag, MPI_COMM_WORLD );
  }
}

MPI Matrix Multiply (with Index Translation)
/* Computation */
for( i=0; i<n/p; i++ )
  for( j=0; j<n; j++ ) {
    c[i][j] = 0;
    for( k=0; k<n; k++ )
      c[i][j] += a[i][k] * b[k][j];
  }

MPI Matrix Multiply (with Index Translation)
/* Result gathering */
if( myrank != 0 )
  MPI_Send( &c, n*n/p, MPI_INT, 0, tag, MPI_COMM_WORLD );
else
  for( i=1; i<p; i++ )
    MPI_Recv( &c[(i*n)/p], n*n/p, MPI_INT, i, tag, MPI_COMM_WORLD, &status );

Running an MPI Program mpirun interacts with a daemon process on the hosts and causes a Unix process to be run on each of them. It may only run in interactive mode (batch mode may be blocked by ssh).
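
For example, with a typical MPI installation (the process count and executable name are placeholders):

mpirun -np 4 ./program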

ECE1747 Parallel Programming Message Passing (MPI) Global Operations

What does the user need to do? Divide up the program into parallel parts. Create and destroy processes to support that division. Partition and distribute the data. Communicate data at the right time. (Sometimes) perform index translation. Still need to do synchronization? –Sometimes, but it often goes hand in hand with data communication.

MPI Process Creation/Destruction MPI_Init( int *argc, char ***argv ) Initiates a computation. MPI_Finalize() Finalizes a computation.

MPI Process Identification MPI_Comm_size( comm, &size ) Determines the number of processes. MPI_Comm_rank( comm, &pid ) Pid is the process identifier of the caller.

MPI Basic Send
MPI_Send(buf, count, datatype, dest, tag, comm)
–buf: address of send buffer
–count: number of elements
–datatype: data type of send buffer elements
–dest: process id of destination process
–tag: message tag (ignore for now)
–comm: communicator (ignore for now)

MPI Basic Receive
MPI_Recv(buf, count, datatype, source, tag, comm, &status)
–buf: address of receive buffer
–count: size of receive buffer in elements
–datatype: data type of receive buffer elements
–source: source process id or MPI_ANY_SOURCE
–tag and comm: ignore for now
–status: status object

Global Operations (1 of 2) So far, we have only looked at point-to-point or one-to-one message passing facilities. Often, it is useful to have one-to-many or many-to-one message communication. This is what MPI’s global operations do.

Global Operations (2 of 2)
–MPI_Barrier
–MPI_Bcast
–MPI_Gather
–MPI_Scatter
–MPI_Reduce
–MPI_Allreduce

Barrier MPI_Barrier(comm) Global barrier synchronization, as before: all processes wait until all have arrived.
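
A minimal sketch of a common use, timing a phase consistently across processes (it assumes myrank was obtained with MPI_Comm_rank as before; MPI_Wtime returns wall-clock time in seconds):

double start, end;

MPI_Barrier( MPI_COMM_WORLD );   /* everybody starts the timed phase together */
start = MPI_Wtime();

/* ... parallel phase being timed ... */

MPI_Barrier( MPI_COMM_WORLD );   /* wait until everybody has finished the phase */
end = MPI_Wtime();

if( myrank == 0 )
  printf( "phase took %f seconds\n", end - start );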

Broadcast
MPI_Bcast(inbuf, incnt, intype, root, comm)
–inbuf: address of input buffer (on root); address of output buffer (elsewhere)
–incnt: number of elements
–intype: type of elements
–root: process id of root process

Before Broadcast [Figure: processes proc0 … proc3; only the root’s inbuf holds the data.]

After Broadcast [Figure: every process’s inbuf now holds a copy of the root’s data.]
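
A minimal sketch of a broadcast: process 0 decides a value and every process ends up with a copy (the value is illustrative; myrank as before):

int n;

if( myrank == 0 )
  n = 1024;                                       /* only the root knows the value so far */

MPI_Bcast( &n, 1, MPI_INT, 0, MPI_COMM_WORLD );   /* now every process has the same n */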

Scatter
MPI_Scatter(inbuf, incnt, intype, outbuf, outcnt, outtype, root, comm)
–inbuf: address of input buffer
–incnt: number of input elements
–intype: type of input elements
–outbuf: address of output buffer
–outcnt: number of output elements
–outtype: type of output elements
–root: process id of root process

Before Scatter [Figure: only the root’s inbuf holds the full array; all outbufs are empty.]

After Scatter [Figure: each process’s outbuf holds one block of the root’s inbuf, in rank order.]
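
A minimal sketch of a scatter: the root hands one block of a global array to each process (N and P are illustrative placeholders, and P must match the number of processes the program is run with):

#define N 1024          /* total number of elements */
#define P 4             /* number of processes */

int a[N];               /* full input array, meaningful only on the root */
int mine[N/P];          /* each process receives its own block here */

MPI_Scatter( a,    N/P, MPI_INT,   /* root sends one block of N/P elements per process */
             mine, N/P, MPI_INT,   /* every process (root included) receives its block */
             0, MPI_COMM_WORLD );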

Gather
MPI_Gather(inbuf, incnt, intype, outbuf, outcnt, outtype, root, comm)
–inbuf: address of input buffer
–incnt: number of input elements
–intype: type of input elements
–outbuf: address of output buffer
–outcnt: number of output elements
–outtype: type of output elements
–root: process id of root process

Before Gather [Figure: each process’s inbuf holds its own block; the root’s outbuf is empty.]

After Gather [Figure: the root’s outbuf holds all the blocks, concatenated in rank order.]
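
The mirror image of the scatter sketch above: every process contributes its block and the root ends up with the whole array, in rank order (same placeholders N and P):

int mine[N/P];          /* this process’s local block */
int all[N];             /* meaningful only on the root after the call */

MPI_Gather( mine, N/P, MPI_INT,    /* every process contributes its block */
            all,  N/P, MPI_INT,    /* root places the blocks in rank order */
            0, MPI_COMM_WORLD );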

Broadcast/Scatter/Gather Funny thing: these three primitives are sends and receives at the same time (a little confusing sometimes). A perhaps unintended consequence: they require global agreement on the layout of the array.

MPI Matrix Multiply (w/o Index Translation)
main(int argc, char *argv[])
{
  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
  MPI_Comm_size( MPI_COMM_WORLD, &p );
  for( i=0; i<p; i++ ) {
    from[i] = (i * n)/p;
    to[i]   = ((i+1) * n)/p;
  }
  /* Data distribution */ ...
  /* Computation */ ...
  /* Result gathering */ ...
  MPI_Finalize();
}

MPI Matrix Multiply (w/o Index Translation)
/* Data distribution */
if( myrank != 0 ) {
  MPI_Recv( &a[from[myrank]], n*n/p, MPI_INT, 0, tag, MPI_COMM_WORLD, &status );
  MPI_Recv( &b, n*n, MPI_INT, 0, tag, MPI_COMM_WORLD, &status );
} else {
  for( i=1; i<p; i++ ) {
    MPI_Send( &a[from[i]], n*n/p, MPI_INT, i, tag, MPI_COMM_WORLD );
    MPI_Send( &b, n*n, MPI_INT, i, tag, MPI_COMM_WORLD );
  }
}

MPI Matrix Multiply (w/o Index Translation)
/* Computation */
for( i=from[myrank]; i<to[myrank]; i++ )
  for( j=0; j<n; j++ ) {
    c[i][j] = 0;
    for( k=0; k<n; k++ )
      c[i][j] += a[i][k] * b[k][j];
  }

MPI Matrix Multiply (w/o Index Translation)
/* Result gathering */
if( myrank != 0 )
  MPI_Send( &c[from[myrank]], n*n/p, MPI_INT, 0, tag, MPI_COMM_WORLD );
else
  for( i=1; i<p; i++ )
    MPI_Recv( &c[from[i]], n*n/p, MPI_INT, i, tag, MPI_COMM_WORLD, &status );

MPI Matrix Multiply Revised (1 of 2)
main(int argc, char *argv[])
{
  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
  MPI_Comm_size( MPI_COMM_WORLD, &p );
  from = (myrank * n)/p;
  to   = ((myrank+1) * n)/p;
  MPI_Scatter( a, n*n/p, MPI_INT, a[from], n*n/p, MPI_INT, 0, MPI_COMM_WORLD );
  MPI_Bcast( b, n*n, MPI_INT, 0, MPI_COMM_WORLD );
  ...

MPI Matrix Multiply Revised (2 of 2)
  ...
  for( i=from; i<to; i++ )
    for( j=0; j<n; j++ ) {
      c[i][j] = 0;
      for( k=0; k<n; k++ )
        c[i][j] += a[i][k] * b[k][j];
    }
  MPI_Gather( c[from], n*n/p, MPI_INT, c[from], n*n/p, MPI_INT, 0, MPI_COMM_WORLD );
  MPI_Finalize();
}

SOR Sequential Code
for some number of timesteps/iterations {
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
      temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] );
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
      grid[i][j] = temp[i][j];
}

MPI SOR Allocate grid and temp arrays. Use MPI_Scatter to distribute initial values, if any (requires non-local allocation). Use MPI_Gather to return the results to process 0 (requires non-local allocation). In what follows, we focus only on the communication within the computational part.

Data Communication (only middle processes) [Figure: proc2 and proc3 exchange their boundary rows of grid with their neighbors.]

MPI SOR
for some number of timesteps/iterations {
  for( i=from; i<to; i++ )
    for( j=0; j<n; j++ )
      temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] );
  for( i=from; i<to; i++ )
    for( j=0; j<n; j++ )
      grid[i][j] = temp[i][j];
  /* here comes communication */
}

MPI SOR Communication
if( myrank != 0 ) {
  MPI_Send( grid[from], n, MPI_DOUBLE, myrank-1, tag, MPI_COMM_WORLD );
  MPI_Recv( grid[from-1], n, MPI_DOUBLE, myrank-1, tag, MPI_COMM_WORLD, &status );
}
if( myrank != p-1 ) {
  MPI_Send( grid[to-1], n, MPI_DOUBLE, myrank+1, tag, MPI_COMM_WORLD );
  MPI_Recv( grid[to], n, MPI_DOUBLE, myrank+1, tag, MPI_COMM_WORLD, &status );
}
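
One caveat: in the code above, every middle process issues a blocking send before its receive, which can deadlock if MPI cannot buffer the messages. A common remedy, sketched here under the same variables as above (not part of the original slides), is MPI_Sendrecv, which pairs each send with its matching receive:

if( myrank != 0 )
  /* exchange boundary rows with the lower-ranked neighbor in one paired call */
  MPI_Sendrecv( grid[from],   n, MPI_DOUBLE, myrank-1, tag,
                grid[from-1], n, MPI_DOUBLE, myrank-1, tag,
                MPI_COMM_WORLD, &status );
if( myrank != p-1 )
  /* exchange boundary rows with the higher-ranked neighbor in one paired call */
  MPI_Sendrecv( grid[to-1], n, MPI_DOUBLE, myrank+1, tag,
                grid[to],   n, MPI_DOUBLE, myrank+1, tag,
                MPI_COMM_WORLD, &status );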

No Barrier Between Loop Nests? Not necessary. Anti-dependences do not need to be covered in message passing. Memory is private, so overwrite does not matter.

SOR: Terminating Condition Real versions of SOR do not run for some fixed number of iterations. Instead, they test for convergence. Possible convergence criterion: difference between two successive iterations is less than some delta.

SOR Sequential Code with Convergence
for( ; diff > delta; ) {
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ ) { … }
  diff = 0;
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ ) {
      diff = max( diff, fabs(grid[i][j] - temp[i][j]) );
      grid[i][j] = temp[i][j];
    }
}

Reduction
MPI_Reduce(inbuf, outbuf, count, type, op, root, comm)
–inbuf: address of input buffer
–outbuf: address of output buffer
–count: number of elements in input buffer
–type: datatype of input buffer elements
–op: operation (MPI_MIN, MPI_MAX, etc.)
–root: process id of root process
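
A minimal sketch of a reduction: every process contributes a partial value and process 0 receives the combined result (the variables are illustrative; myrank as before):

double mine  = myrank;   /* stand-in for a locally computed partial result */
double total;            /* meaningful only on the root after the call */

MPI_Reduce( &mine, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );

if( myrank == 0 )
  printf( "total = %f\n", total );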

Global Reduction
MPI_Allreduce(inbuf, outbuf, count, type, op, comm)
–inbuf: address of input buffer
–outbuf: address of output buffer
–count: number of elements in input buffer
–type: datatype of input buffer elements
–op: operation (MPI_MIN, MPI_MAX, etc.)
–no root process

MPI SOR Code with Convergence
for( ; diff > delta; ) {
  for( i=from; i<to; i++ )
    for( j=0; j<n; j++ ) { … }
  mydiff = 0.0;
  for( i=from; i<to; i++ )
    for( j=0; j<n; j++ ) {
      mydiff = max( mydiff, fabs(grid[i][j] - temp[i][j]) );
      grid[i][j] = temp[i][j];
    }
  MPI_Allreduce( &mydiff, &diff, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD );
  ...
}