
1 Introduction to MPI, OpenMP, Threads Gyan Bhanot gyan@ias.edu, gyan@us.ibm.com IAS Course 10/12/04 and 10/13/04

2 Download the tar file from clustermgr.csb.ias.edu: ~gyan/course/all.tar.gz. It has many MPI codes plus .doc files with information on optimization and parallelization for the IAS cluster.

3 – 5 (no text in the transcript for these slides)

6 P655 Cluster. Type qcpu to get machine specs.

7 IAS Cluster Characteristics (qcpu, pmcycles)
IBM P655 cluster. Each node has its own copy of AIX, which is IBM's Unix OS.
clustermgr: 2-CPU PWR4, 64 KB L1 instruction cache, 32 KB L1 data cache, 128 B L1 data cache line size, 1536 KB L2 cache, data TLB: size = 1024, associativity = 4, instruction TLB: size = 1024, associativity = 4, freq = 1200 MHz
node1 to node6: 8 CPUs/node, PWR4 P655, 64 KB L1 instruction cache, 32 KB L1 data cache, 128 B data cache line size, 1536 KB L2 cache, data TLB: size = 1024, associativity = 4, instruction TLB: size = 1024, associativity = 4, freq = 1500 MHz
Distributed-memory architecture, shared memory within each node.
Shared file system: GPFS, lots of disk space.
Run pingpong tests to determine latency and bandwidth.

8 – 34 (no text in the transcript for these slides)

35
/*----------------------*/
/* Parallel hello world */
/*----------------------*/
#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char * argv[])
{
    int taskid, ntasks;
    double pi;

    /*------------------------------------*/
    /* establish the parallel environment */
    /*------------------------------------*/
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /*------------------------------------*/
    /* say hello from each MPI task       */
    /*------------------------------------*/
    printf("Hello from task %d.\n", taskid);

    if (taskid == 0) pi = 4.0*atan(1.0);
    else             pi = 0.0;

    /*------------------------------------*/
    /* do a broadcast from node 0 to all  */
    /*------------------------------------*/
    MPI_Bcast(&pi, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    printf("node %d: pi = %.10lf\n", taskid, pi);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return(0);
}

36 OUTPUT FROM hello.c on 4 processors

Hello from task 0.
node 0: pi = 3.1415926536
Hello from task 1.
Hello from task 2.
Hello from task 3.
node 1: pi = 3.1415926536
node 2: pi = 3.1415926536
node 3: pi = 3.1415926536

1. Why is the order messed up?
2. What would you do to fix it?

37 Answer:
1. The control flow on different processors is not ordered – they all run their own copy of the executable independently. Thus, when each writes output it does so independently of the others – which makes the output unordered.
2. To fix it: export MP_STDOUTMODE=ordered
Then the output will look like the following:
Hello from task 0.
node 0: pi = 3.1415926536
Hello from task 1.
node 1: pi = 3.1415926536
Hello from task 2.
node 2: pi = 3.1415926536
Hello from task 3.
node 3: pi = 3.1415926536

38 Pingpong Code on 4 procs of P655 cluster
/* This program times blocking send/receives, and reports the */
/* latency and bandwidth of the communication system. It is   */
/* designed to run with an even number of MPI tasks. This program
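The rest of the source is cut off on the slide. A minimal sketch of the blocking ping-pong loop it describes might look like the following (the message size, repetition count, and variable names here are my assumptions, not the original course code):

/* Ping-pong sketch (assumed sizes and names), not the original pingpong code */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char * argv[])
{
    int taskid, ntasks, partner, i;
    int msglen = 1000000;            /* bytes per message (assumed)      */
    int reps   = 100;                /* round trips to average (assumed) */
    char *buf;
    double t0, t1, elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);   /* expects an even number of tasks */

    buf = (char *) malloc(msglen);
    partner = (taskid % 2 == 0) ? taskid + 1 : taskid - 1;  /* pair 0-1, 2-3, ... */

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (taskid % 2 == 0) {
            MPI_Send(buf, msglen, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msglen, MPI_CHAR, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, msglen, MPI_CHAR, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msglen, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    /* one-way time per message in msec, as reported on the next slide */
    elapsed = 1000.0 * (t1 - t0) / (2.0 * reps);
    if (taskid == 0)
        printf("msglen = %d bytes, elapsed time = %.4lf msec\n", msglen, elapsed);

    free(buf);
    MPI_Finalize();
    return 0;
}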

39
msglen =   32000 bytes, elapsed time = 0.3494 msec
msglen =   40000 bytes, elapsed time = 0.4000 msec
msglen =   48000 bytes, elapsed time = 0.4346 msec
msglen =   56000 bytes, elapsed time = 0.4490 msec
msglen =   64000 bytes, elapsed time = 0.5072 msec
msglen =   72000 bytes, elapsed time = 0.5504 msec
msglen =   80000 bytes, elapsed time = 0.5503 msec
msglen =  100000 bytes, elapsed time = 0.6499 msec
msglen =  120000 bytes, elapsed time = 0.7484 msec
msglen =  140000 bytes, elapsed time = 0.8392 msec
msglen =  160000 bytes, elapsed time = 0.9485 msec
msglen =  240000 bytes, elapsed time = 1.2639 msec
msglen =  320000 bytes, elapsed time = 1.5975 msec
msglen =  400000 bytes, elapsed time = 1.9967 msec
msglen =  480000 bytes, elapsed time = 2.3739 msec
msglen =  560000 bytes, elapsed time = 2.7295 msec
msglen =  640000 bytes, elapsed time = 3.0754 msec
msglen =  720000 bytes, elapsed time = 3.4746 msec
msglen =  800000 bytes, elapsed time = 3.7441 msec
msglen = 1000000 bytes, elapsed time = 4.6994 msec

latency = 50.0 microseconds
bandwidth = 212.79 MBytes/sec
(approximate values for MPI_Isend/MPI_Irecv/MPI_Waitall)

3. How do you find the Bandwidth and Latency from this data?

40 Y = X/B + L, where Y = elapsed time, X = message size (bytes), B = Bandwidth (Bytes/sec), L = Latency
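A worked check (my arithmetic, not on the slide): for large messages the latency term is negligible, so Y ≈ X/B and B ≈ X/Y = 1,000,000 bytes / 4.6994 msec ≈ 212.8 MBytes/sec, which matches the bandwidth reported on slide 39. L is the intercept of the line, i.e. the time that remains as the message size goes to zero; it has to be read off from very small messages, or from a straight-line fit of elapsed time against message size, whose slope then gives 1/B.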

41 5. Monte Carlo to Compute π
Main Idea
Consider unit Square with Embedded Circle
Generate Random Points inside Square
Out of N trials, m points are inside circle
Then π ~ 4m/N
Error ~ 1/√N
Simple to Parallelize
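A quick check of that error scaling (my arithmetic, not from the slide): the code on slides 46–49 estimates the error as pi_estimate/sqrt(total_hits). With N = 6,400,000 trials, roughly (π/4)·N ≈ 5.0 × 10^6 points land inside the circle, so the estimated error is about 3.14/√(5.0 × 10^6) ≈ 0.0014, which is the err(est) value that appears in the result tables on slides 50 and 51.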

42 Modeling Method: (figure: unit square, axes from 0 to 1, with an inscribed circle)
THROW MANY DARTS. FRACTION INSIDE CIRCLE = π/4

43 MPI PROGRAM DEFINES WORKING NODES

44 EACH NODE COMPUTES ESTIMATE OF PI INDEPENDENTLY

45 NODE 0 COMPUTES AVERAGES AND WRITES OUTPUT

46
#include <stdio.h>
#include <math.h>
#include <mpi.h>
#include "MersenneTwister.h"

void mcpi(int, int, int);
int monte_carlo(int, int);

//=========================================
// Main Routine
//=========================================
int main(int argc, char * argv[])
{
    int ntasks, taskid, nworkers;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

    if (taskid == 0) {
        printf(" #cpus #trials pi(est) err(est) err(abs) time(s) Mtrials/s\n");
    }

    /*--------------------------------------------------*/
    /* do monte-carlo with a variable number of workers  */
    /*--------------------------------------------------*/
    for (nworkers = ntasks; nworkers >= 1; nworkers = nworkers/2) {
        mcpi(nworkers, taskid, ntasks);
    }

    MPI_Finalize();
    return 0;
}

47
//============================================================
// Routine to split tasks into groups and distribute the work
//============================================================
void mcpi(int nworkers, int taskid, int ntasks)
{
    MPI_Comm comm;
    int worker, my_hits, total_hits, my_trials;
    int total_trials = 6400000;
    double tbeg, tend, elapsed, rate;
    double pi_estimate, est_error, abs_error;

    /*---------------------------------------------*/
    /* make a group consisting of just the workers */
    /*---------------------------------------------*/
    if (taskid < nworkers) worker = 1;
    else                   worker = 0;

    /* tasks with the same "color" (worker) end up in the same communicator, */
    /* ranked within it by the "key" (taskid)                                */
    MPI_Comm_split(MPI_COMM_WORLD, worker, taskid, &comm);

48
    if (worker) {
        /*------------------------------------------*/
        /* divide the work among all of the workers */
        my_trials = total_trials / nworkers;

        MPI_Barrier(comm);
        tbeg = MPI_Wtime();

        /* each worker gets a unique seed, and works independently */
        my_hits = monte_carlo(taskid, my_trials);

        /* add the hits from each worker to get total_hits */
        MPI_Reduce(&my_hits, &total_hits, 1, MPI_INT, MPI_SUM, 0, comm);

        tend = MPI_Wtime();
        elapsed = tend - tbeg;
        rate = 1.0e-6*double(total_trials)/elapsed;

        /* report the results including elapsed times and rates */
        if (taskid == 0) {
            pi_estimate = 4.0*double(total_hits)/double(total_trials);
            est_error = pi_estimate/sqrt(double(total_hits));
            abs_error = fabs(M_PI - pi_estimate);
            printf("%6d %9d %9.5lf %9.5lf %9.5lf %8.3lf %9.2lf\n",
                   nworkers, total_trials, pi_estimate, est_error,
                   abs_error, elapsed, rate);
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);
}

49
//=========================================
// Monte Carlo worker routine: return hits
//=========================================
int monte_carlo(int taskid, int trials)
{
    int hits = 0;
    int xseed, yseed;
    double xr, yr;

    /* seed each task's x and y generators differently */
    xseed = 1    * (taskid + 1);
    yseed = 1357 * (taskid + 1);
    MTRand xrandom( xseed );
    MTRand yrandom( yseed );

    /* count the darts that land inside the unit circle */
    for (int t = 0; t < trials; t++) {
        xr = xrandom();
        yr = yrandom();
        if ( (xr*xr + yr*yr) < 1.0 ) hits++;
    }
    return hits;
}

50 Run code in ~gyan/course/src/mpi/pi
poe pi -procs 4 -hfile hf   (using one node, many processors)

#cpus   #trials  pi(est)  err(est)  err(abs)  time(s)  Mtrials/s  Speedup
    4   6400000  3.14130   0.00140   0.00029    0.134      47.77     3.98
    2   6400000  3.14144   0.00140   0.00016    0.267      23.96     1.997
    1   6400000  3.14187   0.00140   0.00027    0.533      12.00     1.0
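For reference (my arithmetic, not on the slide): the Speedup column is simply T(1 cpu)/T(P cpus), e.g. 0.533/0.134 ≈ 3.98 on 4 cpus and 0.533/0.267 ≈ 1.997 on 2 cpus.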

51 Run code in ~gyan/course/src/mpi/pi
poe pi -procs 4 -hfile hf   (using many nodes, one processor each)

#cpus   #trials  pi(est)  err(est)  err(abs)  time(s)  Mtrials/s  Speedup
    4   6400000  3.14130   0.00140   0.00029    0.270      23.75     1.98
    2   6400000  3.14144   0.00140   0.00016    0.536      11.94     0.99
    1   6400000  3.14187   0.00140   0.00027    0.534      12.00     1.00

52 Generic Parallelization Problem
You are given a problem with N = K x L x M variables distributed among P = xyz processors in block form (each processor has KLM/(xyz) variables). Each variable does F flops before communicating B bytes from each face to the near-neighbor processor on the processor grid. Let the processing speed be f flops/s with a compute efficiency c. The computations and communications do not overlap. Let the latency and bandwidth be d and b respectively, with a communication efficiency e.
a. Write a formula for the time between communication events.
b. Write a formula for the time between computation events.
c. Let K = L = M = N^(1/3) and x = y = z = P^(1/3). Write a formula for the efficiency E = [time to compute]/[time to compute + time to communicate]. Explore this as a function of F, B, P and N. Use c ~ e ~ 1. To make this relevant to BG/L, use d = 10 microsec, b = 1.4 Gb/s, f = 2.8 GFlops/s.

53 Solution to Generic Parallelization Problem
A_compute = amount of computation per processor = NF/P
t_compute = A_compute/(fc) = NF/(Pfc)
A_communicate = data communicated by each processor (two faces in each of the three directions) = 2B[KL/(xy) + LM/(yz) + MK/(zx)]
t_communicate = 6d + A_communicate/(be) = 6d + (2B/(be))[KL/(xy) + LM/(yz) + MK/(zx)] ≈ 6d + 6(B/(be))(N/P)^(2/3) for the symmetric case

54 E = 1/[1 + 6d(fc/F)x + 6(c/e)(fB/(Fb))x^(1/3)], where x = P/N

1. Weak Scaling: x = constant as P → ∞
Latency dominates when the second term in the denominator is biggest:
x > 5e-6 (F/c) and x > 5e-6 (B/e)^(3/2), using BG/L parameters
a. F = B = 1 (transaction processing): latency bound for N/P < 200,000
b. F = 100, B = 1: latency bound if N/P < 2000
c. F = 1000, B = 100: latency bound if N/P < 200
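The efficiency model above is easy to evaluate numerically. Below is a small sketch in C (the function name, the example value of B, and the driver loop over F and P/N are mine, not part of the course code); the curves on the next three slides sweep the same parameters. It needs only a C99 compiler and the math library.

/* Sketch (not course code): evaluate the efficiency model           */
/* E = 1/[1 + 6d(fc/F)x + 6(c/e)(fB/(Fb))x^(1/3)],  x = P/N          */
#include <stdio.h>
#include <math.h>

static double efficiency(double x, double F, double B,
                         double d, double b, double f,
                         double c, double e)
{
    double latency_term   = 6.0 * d * (f * c / F) * x;
    double bandwidth_term = 6.0 * (c / e) * (f * B / (F * b)) * cbrt(x);
    return 1.0 / (1.0 + latency_term + bandwidth_term);
}

int main(void)
{
    /* parameters as quoted on the slides (BG/L-like) */
    double d = 10.0e-6;          /* latency: 10 microsec                 */
    double b = 1.4e9;            /* bandwidth: 1.4 Gb/s, units as given  */
    double f = 2.8e9;            /* 2.8 GFlops/s                         */
    double c = 0.5, e = 0.5;     /* compute / communication efficiency   */
    double B = 1.0;              /* bytes per variable (example value)   */
    double F;
    int k;

    for (F = 1.0; F <= 1000.0; F *= 10.0) {
        printf("F = %6.0f:", F);
        for (k = -6; k <= 0; k += 2) {          /* x = P/N from 1e-6 to 1 */
            double x = pow(10.0, (double) k);
            printf("  E(%.0e) = %.3f", x, efficiency(x, F, B, d, b, f, c, e));
        }
        printf("\n");
    }
    return 0;
}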

55 c = e = 0.5; b = 1.4 Gb/s, f = 2.8 GF/s – BG/L params
E = 1/[1 + 6d(fc/F)[P/N] + 6(c/e)(fB/Fb)[P/N]^(1/3)]
(Chart: E as a function of P/N, from "Few Nodes, Many Variables" to "Nodes = Variables", with curves for F = 1, F = 10, F = 100.)

56 c = e = 0.5; b = 1.4 Gb/s, f = 2.8 GF/s – BG/L params
E = 1/[1 + 6d(fc/F)[P/N] + 6(c/e)(fB/Fb)[P/N]^(1/3)]
(Chart: E as a function of P/N, from "Few Nodes, Many Variables" to "Nodes = Variables", with curves for F = 10, F = 100, F = 1000.)

57 c = e = 0.5; b = 1.4 Gb/s, f = 2.8 GF/s – BG/L params
E = 1/[1 + 6d(fc/F)[P/N] + 6(c/e)(fB/Fb)[P/N]^(1/3)]
(Chart: E as a function of P/N, from "Few Nodes, Many Variables" to "Nodes = Variables", with curves for F = 100, F = 1000, F = 10000.)

58 E = 1/[1 + 6d(fc/F)[P/N] + 6(c/e)(fB/Fb)[P/N]^(1/3)] = 1/[E1 + EL + EB]
where E1 = 1 (the compute term), EL = the latency term, and EB = the bandwidth term.
(Chart regions: EL dominant → Latency Dominated; E1 dominant → Compute Dominated; EB significant → Bandwidth Affected.)

59 Strong Scaling: N fixed, P → ∞
For any N, eventually P will win!
Dominated by Latency and Bandwidth in Various Regimes
Large Values of N scale better
Chart next slide

60 (Chart: strong-scaling efficiency curves for N = 1000000 and N = 100000.)
Lesson: In the Best of Cases, Strong Scaling is Very Hard to Achieve

