
1 The Need for Speed: Parallelization in GEM
Michel Desgagné, Recherche en Prévision Numérique, Environment Canada - MSC/RPN
Many thanks to Michel Valin

2 Single processor limitations: the reason behind parallel programming
● Processor clock speed is limited
● The physical size of the processor limits speed because signal speed cannot exceed the speed of light
● Single processor speed is limited by integrated circuit feature size (propagation delays and thermal problems)
● Memory (size and speed, especially latency)
● The amount of logic on a processor chip is limited by real estate considerations (die size / transistor size)
● Algorithm limitations

3 Parallel computing: a solution
● Increase parallelism within the processor (multi-operand functional units such as vector units)
● Increase parallelism on the chip (multiple processors per chip)
● Multi-processor computers
● Multi-computer systems using a communication network (latency and bandwidth considerations)

4 Parallel computing paradigms
● Memory taxonomy:
  ● SMP: Shared Memory Parallelism
    ● One processor can "see" another's memory
    ● Cray X-MP, single node NEC SX-3/4/5/6
  ● DMP: Distributed Memory Parallelism
    ● Processors exchange "messages"
    ● Cray T3D, IBM SP, ES-40, ASCI machines
● Hardware taxonomy:
  ● SISD: Single Instruction Single Data
  ● SIMD: Single Instruction Multiple Data
  ● MISD: Multiple Instruction Single Data
  ● MIMD: Multiple Instruction Multiple Data
● Programmer taxonomy:
  ● SPMD: Single Program Multiple Data
  ● MPMD: Multiple Program Multiple Data

5 SMP architectures
[Diagram: two SMP node layouts, with CPUs and shared memory inside a node connected either by a bus topology or by a network / crossbar]

6 SMP: OpenMP (microtasking / autotasking)
OpenMP works at small granularity, often at the loop level: multiple CPUs (threads) execute the same code in a shared memory space.
OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization.
OpenMP uses the fork-join model of parallel execution, sketched below.
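A minimal sketch of the fork-join model (not from the original slides); the thread count of 4 is an arbitrary assumption:

      program forkjoin
      use omp_lib                        ! OpenMP runtime library routines
      implicit none
      integer :: tid
      call omp_set_num_threads(4)        ! arbitrary choice for illustration
!$omp parallel private(tid)
      tid = omp_get_thread_num()         ! fork: every thread executes this block
      print *, 'Hello from thread', tid, 'of', omp_get_num_threads()
!$omp end parallel                       ! join: execution continues on the master thread only
      end program forkjoin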

7 OpenMP: Basic features (FORTRAN "comments")

Two ways to set the number of threads:
  At the shell level:     n=4; export OMP_NUM_THREADS=$n
  At the Fortran level:   call omp_set_num_threads(n)

      PROGRAM VEC_ADD_SECTIONS
      INTEGER ni, I, n
      PARAMETER (ni=1000)
      REAL A(ni), B(ni), C(ni)
! Some initializations
      n = 4
      DO I = 1, ni
         A(I) = I * 1.0
         B(I) = A(I)
      ENDDO
      call omp_set_num_threads(n)
!$OMP PARALLEL SHARED(A,B,C), PRIVATE(I)    ! parallel region
!$omp do
      DO I = 1, ni
         C(I) = A(I) + B(I)
      ENDDO
!$omp enddo
!$OMP END PARALLEL
      END

8 OpenMP

!$omp parallel
!$omp do
      do n = 1, omp_get_max_threads()
         call itf_phy_slb ( n, F_stepno, obusval, cobusval,
     $                      pvptr, cvptrp, cvptrm, ndim, chmt_ntr,
     $                      trp, trm, tdu, tdv, tdt, kmm, ktm,
     $                      LDIST_DIM, l_nk )
      enddo
!$omp enddo
!$omp end parallel

!$omp critical
      jdo = jdo + 1
!$omp end critical

!$omp single
      call vexp (expf_8, xmass_8, nij)
!$omp end single

9 SMP: General remarks
● Shared memory parallelism at the loop level can often be implemented after the fact if a moderate level of parallelism is all that is needed
● To a lesser extent it can also be done at the thread level, but reentrancy, data scope (thread-local vs global) and race conditions can be a problem (see the sketch below)
● Does NOT scale all that well
● Limited to the real estate of a single node
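A small illustration (mine, not from the slides) of the race-condition pitfall mentioned above: several threads updating a shared counter need a reduction clause (or a critical section) to obtain a deterministic result.

      program race_demo
      use omp_lib
      implicit none
      integer :: i, total
      total = 0
! Without the reduction clause, "total = total + 1" executed by several
! threads at once is a race and the final value is indeterminate.
!$omp parallel do reduction(+:total)
      do i = 1, 1000
         total = total + 1
      enddo
!$omp end parallel do
      print *, 'total =', total          ! always 1000 with the reduction
      end program race_demo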

10 DMP architecture
[Diagram: several nodes, each with its own CPUs and memory, connected through a high speed interconnect (network / crossbar)]

11 2D domain decomposition: regular horizontal block partitioning
[Diagram: a global Gni x Gnj grid split into Lni x Lnj subdomains on a PE topology npex=2, npey=2; the four PEs, ranks 0 to 3, form the PE matrix Pe(0,0), Pe(1,0), Pe(0,1), Pe(1,1); global indexing runs 1..Gni and 1..Gnj, local indexing runs 1..Lni and 1..Lnj on each PE, with N/S/W/E orientation shown]
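A minimal sketch (not from the slides) of how a local index could be converted to a global one under this kind of regular block partitioning; the exact GEM / RPN_COMM convention is not reproduced here, and the assumption that every PE owns exactly Lni x Lnj points ignores the uneven partitioning shown on a later slide.

      ! Hypothetical helper: global indices of local point (i,j) on PE (pex,pey),
      ! assuming every PE owns exactly Lni x Lnj points.
      subroutine local_to_global(i, j, pex, pey, Lni, Lnj, ig, jg)
      implicit none
      integer, intent(in)  :: i, j, pex, pey, Lni, Lnj
      integer, intent(out) :: ig, jg
      ig = pex*Lni + i          ! offset by the columns owned by PEs to the west
      jg = pey*Lnj + j          ! offset by the rows owned by PEs to the south
      end subroutine local_to_global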

12 High level operations
● Halo exchange
  ● What is a halo?
  ● Why and when is it necessary to exchange a halo?
● Data transpose
  ● What is a data transpose?
  ● Why and when is it necessary to transpose data?
● Collective and reduction operations

13 2D array layout with halos
[Diagram: a local array dimensioned (Mini:Maxi, Minj:Maxj); the private data occupies (1:Lni, 1:Lnj); the inner halo is the band of private data along the boundary, the outer halo is the surrounding band owned by the N/S/W/E neighbors]
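A sketch of what such a dimensioning could look like in Fortran, using the names from the figure; the halo width of 1 and the local sizes are assumptions chosen for illustration.

      integer, parameter :: halo = 1                       ! assumed halo width
      integer, parameter :: Lni = 16, Lnj = 9              ! example local sizes
      integer, parameter :: Mini = 1-halo, Maxi = Lni+halo
      integer, parameter :: Minj = 1-halo, Maxj = Lnj+halo
      real :: f(Mini:Maxi, Minj:Maxj)    ! private data lives in f(1:Lni,1:Lnj);
                                         ! the surrounding band is the outer halo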

14 Halo exchange: Why and when?
● Local computations need access to neighboring data
● In general, any stencil-type discrete operator, e.g.
      dfdx(i) = (f(i+1) - f(i-1)) / (x(i+1) - x(i-1))
● The halo width depends on the operator (see the sketch below)
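A sketch of why the centered difference above needs a halo of width 1: the loop over the private points 1..Lni reads f(0) and f(Lni+1), which belong to the west and east neighbors and must be brought in by a halo exchange first (the array bounds are illustrative).

      ! f and x are dimensioned (0:Lni+1): one halo point on each side.
      ! f(0), x(0), f(Lni+1) and x(Lni+1) come from the neighboring PEs.
      do i = 1, Lni
         dfdx(i) = (f(i+1) - f(i-1)) / (x(i+1) - x(i-1))
      enddo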

15 Halo exchange
PE topology: npex=3, npey=3
How many neighbor PEs must the local PE exchange data with to fill its outer halo (the shaded area)?
Answer: up to eight of them: North, South, East and West, plus the North-West, North-East, South-West and South-East corners.
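A minimal MPI sketch of a north-south halo exchange of width 1 (GEM itself goes through RPN_COMM for this, so the calls and variable names below are illustrative assumptions). Rows with a fixed j are contiguous in Fortran, so plain counts of MPI_REAL suffice; an east-west exchange of columns would need a strided datatype such as MPI_Type_vector. The variables north and south are assumed to hold the neighbor ranks (or MPI_PROC_NULL at the edge of the domain), and MPI is assumed to be already initialized.

      real    :: f(0:Lni+1, 0:Lnj+1)
      integer :: north, south, ierr, status(MPI_STATUS_SIZE)
      ! Send my northernmost private row, receive my southern outer halo row.
      call MPI_Sendrecv(f(0,Lnj),   Lni+2, MPI_REAL, north, 1,
     $                  f(0,0),     Lni+2, MPI_REAL, south, 1,
     $                  MPI_COMM_WORLD, status, ierr)
      ! Send my southernmost private row, receive my northern outer halo row.
      call MPI_Sendrecv(f(0,1),     Lni+2, MPI_REAL, south, 2,
     $                  f(0,Lnj+1), Lni+2, MPI_REAL, north, 2,
     $                  MPI_COMM_WORLD, status, ierr)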

16 Data Transposition
[Diagram: PE topology npex=4, npey=4; two transpose steps, T1 and T2, redistribute the (X, Y, Z) data volume across the npex x npey PE grid so that at each step every PE holds complete lines along a different axis]
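A hedged sketch of the basic mechanism behind such a transpose: an MPI_Alltoall among the PEs of one row (or column) of the topology, after which every PE holds complete lines along one axis. GEM does this through the RPN_COMM toolkit's transpose operation; the row communicator row_comm and the packing of sendbuf / recvbuf are assumptions left out of the sketch.

      ! Each of the npex PEs of the row communicator sends one equal-size
      ! block to every other PE; blk is the number of reals per block.
      call MPI_Alltoall(sendbuf, blk, MPI_REAL,
     $                  recvbuf, blk, MPI_REAL,
     $                  row_comm, ierr)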

17 What is MPI?
● A Message Passing Interface
● Communications through messages can be
  ● Cooperative: send / receive (democratic)
  ● One-sided: get / put (autocratic)
● Bindings defined for FORTRAN, C, C++
● For parallel computers, clusters, heterogeneous networks
● Full featured (but can be used in simple fashion)

[Graph: message transfer time vs. message length; the slope is the per-word cost (Tw = cost / word) and the intercept is the latency]

Basic calls:
● include 'mpif.h'
● call MPI_INIT(ierr)
● call MPI_FINALIZE(ierr)
● call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
● call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
● call MPI_SEND(buffer, count, datatype, destination, tag, comm, ierr)
● call MPI_RECV(buffer, count, datatype, source, tag, comm, status, ierr)

Collectives and reductions: MPI_gather, MPI_allgather, MPI_scatter, MPI_alltoall, MPI_bcast, MPI_reduce, MPI_allreduce (with operations such as mpi_sum, mpi_min, mpi_max)
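A small sketch (not from the slides) of the cooperative send / receive pair listed above: rank 0 sends 100 reals to rank 1 with matching tags. It assumes MPI_Init has already been called, as in the basic program shown on a later slide.

      real    :: buf(100)
      integer :: rank, ierr, status(MPI_STATUS_SIZE)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      if (rank == 0) then
         buf = 1.0
         call MPI_SEND(buf, 100, MPI_REAL, 1, 99, MPI_COMM_WORLD, ierr)
      else if (rank == 1) then
         call MPI_RECV(buf, 100, MPI_REAL, 0, 99, MPI_COMM_WORLD,
     $                 status, ierr)
      endif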

18 The RPN_COMM toolkit (Michel Valin)
● NO INCLUDE FILE NEEDED (such as mpif.h)
● Higher level of abstraction
  ● Initialization / termination of communications
  ● Topology determination
● Point to point operations
  ● Halo exchange
  ● (Direct message to N/S/W/E neighbor)
● Collective operations
  ● Transpose
  ● Gather / distribute
  ● Data reduction
● Equivalent calls to the most frequently used MPI routines: MPI_[something] => RPN_COMM_[something]

19 Partitioning Global Data
Gni=62, Gnj=25; PE topology: npex=4, npey=3
[Diagram: two partitionings of the same global grid (Valin and Thomas schemes), giving slightly different local sizes on each PE, e.g. lni = 16, 15 or 14 and lnj = 9, 8 or 7]
The dimension of the largest subdomain, (Gni + npex - 1) / npex in x (and similarly in y), is NOT affected by the choice of scheme.
checktopo -gni 62 -gnj 25 -gnk 58 -npx 4 -npy 2 -pil 7 -hblen 10
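A small sketch of the formula above: the integer division (Gni + npex - 1) / npex is a ceiling divide and gives the width of the largest subdomain. The loop below simply hands the leftover to the last PE column, which is not exactly how the Valin or Thomas partitioners spread the remainder, so treat it as an illustration only.

      integer :: Gni, npex, lni_max, pex, lni
      Gni  = 62
      npex = 4
      lni_max = (Gni + npex - 1)/npex              ! ceiling(62/4) = 16
      do pex = 0, npex-1
         lni = min(lni_max, Gni - pex*lni_max)     ! 16, 16, 16, 14
         print *, 'PE column', pex, 'gets lni =', lni
      enddo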

20 DMP Scalability
● Scaling up with an optimum subdomain dimension (weak scaling): the time to solution should remain the same
  ● Size: 500 x 50 on vector processor systems
  ● Size: 100 x 50 on cache systems
● Scaling up on a fixed size problem (strong scaling): the time to solution should decrease linearly with the number of CPUs

21 MC2 Performance on NEC SX4 and Fujitsu VPP700
[Graph: flop rate per PE (MFlops/sec) vs. number of PEs for a 513 x 433 x 41 grid; SX4 runs use npx=2, VPP700 runs use npx=1]

22 IFS Performance on NEC SX4 and Fujitsu VPP700
Amdahl's law for parallel programming: the speedup factor is limited above all by the residual serial (non-parallelizable) work. As the number of processors grows, so does the damage caused by the non-parallelizable fraction.
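For reference (my addition, not spelled out on the slide), Amdahl's law: with a parallelizable fraction p, the speedup on N processors is S(N) = 1 / ((1 - p) + p/N), so even p = 0.95 caps the speedup at 20 no matter how many processors are added. A tiny sketch:

      real    :: p, s
      integer :: n
      p = 0.95                                    ! assumed parallel fraction
      do n = 64, 1024, 960                        ! N = 64, then N = 1024
         s = 1.0 / ((1.0 - p) + p/real(n))
         print *, 'N =', n, ' speedup =', s       ! about 15.4 and 19.6
      enddo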

23 Scalability: limiting factors
● Any algorithm requiring global communications (one should THINK LOCAL):
  ● SL transport on a global configuration
  ● lat-lon grid point model: numerical poles (GEM)
  ● 2-time-level fully implicit discretization leading to an elliptic problem: the direct solver requires a data transpose
● Any algorithm producing inherent load imbalance

24 DMP - General remarks
● A more difficult but more powerful programming paradigm
● Easily combined with SMP (on all MPI processes)
● Distributed memory parallelism does not just happen, it must be DESIGNED
● One does not "parallelize" a code: the code must be rebuilt (and often redesigned) taking into account the constraints imposed on the dataflow by message passing. Array dimensioning and loop indexing are likely to be VERY HEAVILY IMPACTED.
● One may get lucky and HPF or an automatic parallelizing compiler will solve the problem (if one believes in miracles, Santa Claus, the tooth fairy, or all of the above)

25 Web sites and Books
● http://pollux.cmc.ec.gc.ca/~armnmfv/MPI_workshop
● http://www.llnl.gov/ (OpenMP, threads, MPI, ...)
● http://hpcf.nersc.gov/
● http://www.idris.fr/ (in French: OpenMP, MPI, F90)
● Using MPI, Gropp et al., ISBN 0-262-57204-8
● MPI: The Complete Reference, Snir et al., ISBN 0-262-69184-1
● MPI: The Complete Reference, vol. 2, Gropp et al., ISBN 0-262-57123-4

26 Basic MPI program

      program hello
      implicit none
      include 'mpif.h'
      integer noprocs, nid, error
      call MPI_Init(error)
      call MPI_Comm_size(MPI_COMM_WORLD, noprocs, error)
      call MPI_Comm_rank(MPI_COMM_WORLD, nid, error)
      write(6,*) 'Hello from processor', nid, ' of', noprocs
      call MPI_Finalize(error)
      stop
      end

Running it:
  mpirun -np 3 basic_Linux
  Hello from processor 0 of 3
  Hello from processor 1 of 3
  FORTRAN STOP
  Hello from processor 2 of 3
  FORTRAN STOP

27 Communication costs: examples
Machine           latency (µs)   data (µs / word)
IBM SP2                40            0.11
Intel Paragon         121            0.07
CM-5                   82            0.44
Ncube-2               154            2.4
Ethernet WS          1500            5
100 Base T WS        1500            0.5
NEC SX6                10            0.004
IBM p                                0.04
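These numbers fit the usual linear model, time = latency + message length x cost per word (my formulation of the graph on the earlier MPI slide). On the Ethernet workstation line, for example, a 1000-word message costs roughly 1500 + 1000 x 5 = 6500 µs, dominated by bandwidth, while a 10-word message costs about 1550 µs, dominated by latency. A one-line sketch:

      real :: latency, per_word, nwords, t
      latency  = 1500.0                 ! Ethernet WS, from the table above
      per_word = 5.0
      nwords   = 1000.0
      t = latency + nwords*per_word     ! about 6500 microseconds
      print *, 'estimated transfer time (microseconds):', t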

28 Technical: I/O - an Important Issue
[Diagram: block partitioning of the PEs into I/O groups, with consecutive PE ranks collected into blocks]
● Preprocessor MC2NTR gone
● All computation performed within the DM main program
● Global distribute/collect removed
● Scaling up with a subdomain size of 500 x 50 on vector processor systems

