ECE408 / CS483 Applied Parallel Programming
Lecture 26: Joint CUDA-MPI Programming
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign

Objective

To become proficient in writing simple joint MPI-CUDA heterogeneous applications:
- Understand the key sections of the application
- Simplified code and efficient data movement using GMAC
- One-way communication

To become familiar with a more sophisticated MPI application that requires two-way data exchange.

Blue Waters Computing System

[System diagram; recoverable labels: Sonexion: 26 PB, >1 TB/sec; 100 GB/sec; 10/40/100 Gb Ethernet switches; Spectra Logic: 300 PB; 120+ Gb/sec WAN; IB switch.]

Cray XE6 Nodes

Dual-socket node:
- Two AMD Interlagos chips
  - 16 core modules, 64 threads
  - 313 GFs peak performance
  - 64 GBs memory
  - 102 GB/sec memory bandwidth
- Gemini interconnect
  - Router chip & network interface
  - Injection bandwidth (peak): 9.6 GB/sec per direction (HT3)

Blue Waters contains 22,640 Cray XE6 compute nodes.

Cray XK7 Nodes

Dual-socket node:
- One AMD Interlagos chip
  - 8 core modules, 32 threads
  - GFs peak performance (figure lost in transcription)
  - 32 GBs memory
  - 51 GB/s memory bandwidth
- One NVIDIA Kepler chip
  - 1.3 TFs peak performance
  - 6 GBs GDDR5 memory
  - 250 GB/sec memory bandwidth
- Gemini interconnect: same as XE6 nodes
- Links: PCIe Gen2, HT3

Blue Waters contains 3,072 Cray XK7 compute nodes.

Gemini Interconnect Network

Blue Waters 3D torus, size 23 x 24 x 24.

[Network diagram; recoverable labels: compute nodes (Cray XE6 compute, Cray XK7 accelerator); service nodes (operating system, boot, system database, login gateways, network, login/network, Lustre file system, LNET routers); InfiniBand; SMW; GigE; login servers; network(s); boot RAID; Fibre Channel; Lustre; X/Y/Z torus axes.]

Blue Waters and Titan Computing Systems

System Attribute                        Blue Waters (NCSA)   Titan (ORNL)
Vendors                                 Cray/AMD/NVIDIA      Cray/AMD/NVIDIA
Processors                              Interlagos/Kepler    Interlagos/Kepler
Total Peak Performance (PF)             11.1                 27.1
Total Peak Performance, CPU/GPU (PF)    7.1 / 4              2.6 / 24.5
Number of CPU Chips                     48,352               18,688
Number of GPU Chips                     3,072                18,688
Amount of CPU Memory (TB)               -                    -
Interconnect                            3D Torus             3D Torus
Amount of On-line Disk Storage (PB)     26                   -
Sustained Disk Transfer (TB/sec)        >1                   -
Amount of Archival Storage (PB)         300                  -
Sustained Tape Transfer (GB/sec)        100                  7

(Cells marked "-" did not survive transcription.)

Science areas, team counts, and codes:

- Climate and Weather (3): CESM, GCRM, CM1/WRF, HOMME
- Plasmas/Magnetosphere (2): H3D(M), VPIC, OSIRIS, Magtail/UPIC
- Stellar Atmospheres and Supernovae (5): PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS
- Cosmology (2): Enzo, pGADGET
- Combustion/Turbulence (2): PSDNS, DISTUF
- General Relativity (2): Cactus, Harm3D, LazEV
- Molecular Dynamics (4): AMBER, Gromacs, NAMD, LAMMPS
- Quantum Chemistry (2): SIAL, GAMESS, NWChem
- Material Science (3): NEMOS, OMEN, GW, QMCPACK
- Earthquakes/Seismology (2): AWP-ODC, HERCULES, PLSQR, SPECFEM3D
- Quantum Chromodynamics (1): Chroma, MILC, USQCD
- Social Networks (1): EPISIMDEMICS
- Evolution (1): Eve
- Engineering/System of Systems (1): GRIPS, Revisit
- Computer Science (1)

[The original table also marked, per area, which of the following each code set uses: structured grids, unstructured grids, dense matrix, sparse matrix, N-body, Monte Carlo, FFT, PIC, significant I/O; the per-area marks did not survive transcription.]

CUDA-based Cluster

Each node contains N GPUs.

[Diagram: two nodes, each with GPU 0 ... GPU N attached over PCIe to CPU 0 ... CPU M and host memory.]
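
On a cluster like this, each MPI process is usually bound to one of its node's GPUs. A common pattern, sketched here with the CUDA runtime API (the local_rank variable is an assumption; it would come from the MPI launcher, e.g. an environment variable, since MPI itself only provides the global rank):

    int ndev = 0;
    cudaGetDeviceCount(&ndev);         /* number of GPUs visible on this node */
    cudaSetDevice(local_rank % ndev);  /* map this process to one of them */

All subsequent CUDA calls in the process then target the selected device.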

MPI Model

- Many processes distributed in a cluster
- Each process computes part of the output
- Processes communicate with each other
- Processes can synchronize

MPI Initialization, Info and Sync

int MPI_Init(int *argc, char ***argv)
- Initialize MPI

MPI_COMM_WORLD
- MPI group with all allocated nodes

int MPI_Comm_rank(MPI_Comm comm, int *rank)
- Rank of the calling process in the group of comm

int MPI_Comm_size(MPI_Comm comm, int *size)
- Number of processes in the group of comm
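
As a minimal sketch of how these calls fit together (an illustration, not taken from the slides), the following program prints each process's rank:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);               /* initialize the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */
        printf("Process %d of %d\n", rank, size);
        MPI_Finalize();                       /* shut down MPI */
        return 0;
    }

Every process executes the same program; the rank is what differentiates their behavior.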

Vector Addition: Main Process

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Message tags (values illustrative) */
#define DATA_DISTRIBUTE 0
#define DATA_COLLECT    1

int main(int argc, char *argv[]) {
    int vector_size = 1024 * 1024 * 1024;
    int pid = -1, np = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if (np < 3) {
        if (0 == pid) printf("Need 3 or more processes.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
        return 1;
    }

    if (pid < np - 1)
        compute_node(vector_size / (np - 1));  /* processes 0 .. np-2 */
    else
        data_server(vector_size);              /* process np-1 */

    MPI_Finalize();
    return 0;
}
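
Assuming an MPI implementation such as MPICH or Open MPI, a program like this is typically built with the mpicc compiler wrapper and launched with mpirun (or mpiexec); the source file name here is hypothetical:

    mpicc vector_add.c -o vector_add
    mpirun -np 4 ./vector_add

With np = 4, processes 0-2 run compute_node() and process 3 runs data_server().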

MPI Sending Data

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
- buf: initial address of send buffer (choice)
- count: number of elements in send buffer (nonnegative integer)
- datatype: datatype of each send buffer element (handle)
- dest: rank of destination (integer)
- tag: message tag (integer)
- comm: communicator (handle)


MPI Receiving Data

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
- buf: initial address of receive buffer (choice)
- count: maximum number of elements in receive buffer (integer)
- datatype: datatype of each receive buffer element (handle)
- source: rank of source (integer)
- tag: message tag (integer)
- comm: communicator (handle)
- status: status object (Status)
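
As a sketch of how a matched pair looks in practice (illustrative, not from the slides), rank 0 sends 16 floats to rank 1 using tag 0:

    float data[16];
    MPI_Status status;
    if (rank == 0) {
        MPI_Send(data, 16, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 16, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
    }

MPI_Recv blocks until a matching message arrives; count is an upper bound, and the number of elements actually received can be obtained from status via MPI_Get_count.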


Vector Addition: Server Process (I)

void data_server(unsigned int vector_size) {
    int np;
    /* Set MPI communication size (must happen before np is used) */
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    int num_nodes = np - 1, first_node = 0, last_node = np - 2;
    size_t num_bytes = vector_size * sizeof(float);
    float *input_a = 0, *input_b = 0, *output = 0;

    /* Allocate input data */
    input_a = (float *)malloc(num_bytes);
    input_b = (float *)malloc(num_bytes);
    output  = (float *)malloc(num_bytes);
    if (input_a == NULL || input_b == NULL || output == NULL) {
        printf("Server couldn't allocate memory\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Initialize input data (random_data() is a helper assumed to be
       defined elsewhere in the course code) */
    random_data(input_a, vector_size, 1, 10);
    random_data(input_b, vector_size, 1, 10);

Vector Addition: Server Process (II)

    /* Send a chunk of each input vector to every compute node */
    float *ptr_a = input_a;
    float *ptr_b = input_b;
    for (int process = first_node; process <= last_node; process++) {
        MPI_Send(ptr_a, vector_size / num_nodes, MPI_FLOAT,
                 process, DATA_DISTRIBUTE, MPI_COMM_WORLD);
        ptr_a += vector_size / num_nodes;
        MPI_Send(ptr_b, vector_size / num_nodes, MPI_FLOAT,
                 process, DATA_DISTRIBUTE, MPI_COMM_WORLD);
        ptr_b += vector_size / num_nodes;
    }

    /* Wait for the compute nodes to finish (they call MPI_Barrier too) */
    MPI_Barrier(MPI_COMM_WORLD);
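
A side note, not in the original slides: MPI_Send may block until the message is delivered or buffered, so for large chunks the distribution loop can serialize. One alternative is the non-blocking MPI_Isend, which returns immediately and is completed later with MPI_Wait; a minimal sketch for one chunk:

    MPI_Request request;
    MPI_Isend(ptr_a, vector_size / num_nodes, MPI_FLOAT, process,
              DATA_DISTRIBUTE, MPI_COMM_WORLD, &request);
    /* ... overlap other work here ... */
    MPI_Wait(&request, MPI_STATUS_IGNORE);

The send buffer must not be modified until MPI_Wait completes.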

Vector Addition: Server Process (III)

    /* Collect output data */
    MPI_Status status;
    for (int process = 0; process < num_nodes; process++) {
        MPI_Recv(output + process * vector_size / num_nodes,
                 vector_size / num_nodes, MPI_FLOAT, process,
                 DATA_COLLECT, MPI_COMM_WORLD, &status);
    }

    /* Store output data (store_output() is a helper assumed to be
       defined elsewhere in the course code) */
    store_output(output, vector_size);

    /* Release resources */
    free(input_a);
    free(input_b);
    free(output);
}

Vector Addition: Compute Process (I)

void compute_node(unsigned int vector_size) {
    int np;
    size_t num_bytes = vector_size * sizeof(float);
    float *input_a, *input_b, *output;
    MPI_Status status;

    MPI_Comm_size(MPI_COMM_WORLD, &np);
    int server_process = np - 1;

    /* Allocate host memory */
    input_a = (float *)malloc(num_bytes);
    input_b = (float *)malloc(num_bytes);
    output  = (float *)malloc(num_bytes);

    /* Get the input data from the server process */
    MPI_Recv(input_a, vector_size, MPI_FLOAT, server_process,
             DATA_DISTRIBUTE, MPI_COMM_WORLD, &status);
    MPI_Recv(input_b, vector_size, MPI_FLOAT, server_process,
             DATA_DISTRIBUTE, MPI_COMM_WORLD, &status);

Vector Addition: Compute Process (II)

    /* Compute the partial vector addition */
    for (int i = 0; i < (int)vector_size; ++i) {
        output[i] = input_a[i] + input_b[i];
    }

    /* Matches the server's MPI_Barrier; without it the barrier
       would deadlock, since MPI_Barrier is collective */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Send the output */
    MPI_Send(output, vector_size, MPI_FLOAT, server_process,
             DATA_COLLECT, MPI_COMM_WORLD);

    /* Release memory */
    free(input_a);
    free(input_b);
    free(output);
}
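
Since the objective is joint CUDA-MPI programming, this per-node loop is exactly what would move to the GPU. A sketch of that replacement (the kernel name and launch configuration are illustrative, not from these slides):

    __global__ void vecadd_kernel(const float *a, const float *b,
                                  float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a[i] + b[i];
    }

    /* In compute_node(), instead of the CPU loop: */
    float *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, num_bytes);
    cudaMalloc(&d_b, num_bytes);
    cudaMalloc(&d_out, num_bytes);
    cudaMemcpy(d_a, input_a, num_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, input_b, num_bytes, cudaMemcpyHostToDevice);
    vecadd_kernel<<<(vector_size + 255) / 256, 256>>>(d_a, d_b, d_out, vector_size);
    cudaMemcpy(output, d_out, num_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);

The MPI structure is unchanged: data still arrives and leaves through host buffers, with explicit copies to and from device memory in between.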

QUESTIONS?