# ECMWF Slide 1 Introduction to Parallel Computing George Mozdzynski March 2004.


ECMWF Slide 1 Introduction to Parallel Computing George Mozdzynski March 2004

ECMWF Slide 2 Outline

- What is parallel computing?
- Why do we need it?
- Types of computer
- Parallel Computing today
- Parallel Programming Languages
- OpenMP and Message Passing
- Terminology

ECMWF Slide 3 What is Parallel Computing? The simultaneous use of more than one processor or computer to solve a problem

ECMWF Slide 4 Why do we need Parallel Computing?

- Serial computing is too slow
- Need for large amounts of memory not accessible by a single processor

ECMWF Slide 5 An IFS operational TL511L60 forecast model takes about one hour wall time for a 10-day forecast using 288 CPUs of our IBM Cluster 1600 1.3 GHz system (total 1920 CPUs). How long would this model take using a fast PC with sufficient memory, e.g. a 3.2 GHz Pentium 4?

ECMWF Slide 6 Ans. About 8 days. This PC would need about 25 Gbytes of memory. 8 days is too long for a 10-day forecast! 2-3 hours is too long …

ECMWF Slide 7 IFS Forecast Model (TL511L60)

| CPUs | Wall time (secs) |
| --- | --- |
| 64 | 11355 |
| 128 | 5932 |
| 192 | 4230 |
| 256 | 3375 |
| 320 | 2806 |
| 384 | 2338 |
| 448 | 2054 |
| 512 | 1842 |

Amdahl's Law (named after Gene Amdahl): Wall Time = S + P/N_CPUs, where S is the serial time and P the parallel time. For this model, Serial = 574 secs and Parallel = 690930 secs (calculated using Excel's LINEST function).

If F is the fraction of a calculation that is sequential, and (1-F) is the fraction that can be parallelised, then the maximum speedup that can be achieved by using N processors is 1/(F+(1-F)/N).
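The fit on the slide can be reproduced with an ordinary least-squares line through the points (1/N, wall time). The sketch below is a minimal pure-Python version, not the spreadsheet used for the talk; the function names `fit_amdahl` and `max_speedup` are made up for illustration. With the slide's data the intercept and slope should land close to the quoted 574 and 690930 seconds.

```python
# Least-squares fit of the model  wall_time = S + P / N  from the slide.
# CPU counts and wall times (seconds) are copied from the slide.
cpus = [64, 128, 192, 256, 320, 384, 448, 512]
wall = [11355, 5932, 4230, 3375, 2806, 2338, 2054, 1842]


def fit_amdahl(cpus, wall):
    """Return (S, P) minimising the squared error of wall = S + P * (1/N)."""
    x = [1.0 / n for n in cpus]
    count = len(x)
    mean_x = sum(x) / count
    mean_y = sum(wall) / count
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, wall))
    p = sxy / sxx            # slope: time spent in parallelisable work
    s = mean_y - p * mean_x  # intercept: time spent in serial work
    return s, p


def max_speedup(f, n):
    """Amdahl's Law: speedup bound for sequential fraction f on n processors."""
    return 1.0 / (f + (1.0 - f) / n)


serial, parallel = fit_amdahl(cpus, wall)
seq_fraction = serial / (serial + parallel)
print(f"S = {serial:.0f} s, P = {parallel:.0f} s, F = {seq_fraction:.5f}")
print(f"Speedup bound on 512 CPUs: {max_speedup(seq_fraction, 512):.0f}")
```

Fitting against 1/N rather than N keeps the model linear, which is why a straight-line routine such as LINEST is enough.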

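ECMWF Slide 8 IFS Forecast Model (TL511L60)

| CPUs | Wall time (secs) | SpeedUp | Efficiency (%) |
| --- | --- | --- | --- |
| 1 | 691504 | 1 | 100.0 |
| 64 | 11355 | 61 | 95.2 |
| 128 | 5932 | 117 | 91.1 |
| 192 | 4230 | 163 | 85.1 |
| 256 | 3375 | 205 | 80.0 |
| 320 | 2806 | 246 | 77.0 |
| 384 | 2338 | 296 | 77.0 |
| 448 | 2054 | 337 | 75.1 |
| 512 | 1842 | 375 | 73.3 |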
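The SpeedUp and Efficiency figures on this slide follow from the wall times alone: speedup is the 1-CPU time divided by the N-CPU time, and efficiency is speedup divided by N, as a percentage. A short sketch that recomputes them (the helper names are illustrative; the 1-CPU time equals the fitted Serial plus Parallel seconds from the previous slide):

```python
# Wall times in seconds, copied from the slide.
wall_time = {1: 691504, 64: 11355, 128: 5932, 192: 4230, 256: 3375,
             320: 2806, 384: 2338, 448: 2054, 512: 1842}


def speedup(n):
    """How many times faster the run on n CPUs is than the 1-CPU time."""
    return wall_time[1] / wall_time[n]


def efficiency(n):
    """Speedup as a percentage of the ideal speedup n."""
    return 100.0 * speedup(n) / n


for n in sorted(wall_time):
    print(f"{n:4d} {wall_time[n]:7d} {speedup(n):6.0f} {efficiency(n):6.1f}")
```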

ECMWF Slide 9

ECMWF Slide 10 IFS model would be inefficient on large numbers of CPUs. But OK up to 512.

ECMWF Slide 11 Types of Parallel Computer: Shared Memory and Distributed Memory [diagrams]. P=Processor, M=Memory, S=Switch

ECMWF Slide 12 IBM Cluster 1600 (at ECMWF) [diagram: nodes of processors and memory connected by a switch]. P=Processor, M=Memory, S=Switch

ECMWF Slide 13 IBM Cluster 1600s at ECMWF (hpca + hpcb)

ECMWF Slide 14 ECMWF supercomputers

- 1979: CRAY 1A (Vector)
- CRAY XMP-2, CRAY XMP-4, CRAY YMP-8, CRAY C90-16 (Vector + Shared Memory Parallel)
- Fujitsu VPP700, Fujitsu VPP5000 (Vector + MPI Parallel)
- 2002: IBM p690 (Scalar + MPI + Shared Memory Parallel)

ECMWF Slide 15 ECMWF's first Supercomputer: CRAY-1A, 1979

ECMWF Slide 16 Where have 25 years gone?

ECMWF Slide 17 Types of Processor

Loop: `DO J=1,1000`, `A(J)=B(J) + C`, `ENDDO`

- SCALAR PROCESSOR: `LOAD B(J)`, `FADD C`, `STORE A(J)`, `INCR J`, `TEST`. Single instruction processes one element.
- VECTOR PROCESSOR: `LOADV B->V1`, `FADDV B,C->V2`, `STOREV V2->A`. Single instruction processes many elements.
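To make the difference concrete, the sketch below counts instruction issues for the slide's loop under a deliberately simple model: five scalar instructions per element against three vector instructions per chunk of elements. The vector length of 64 is an assumption for illustration (the slide does not give one), and loop overhead on the vector side is ignored.

```python
import math

SCALAR_OPS_PER_ELEMENT = 5   # LOAD, FADD, STORE, INCR, TEST (from the slide)
VECTOR_OPS_PER_CHUNK = 3     # LOADV, FADDV, STOREV (from the slide)


def scalar_instructions(n):
    """Instructions issued by the scalar loop for n elements."""
    return n * SCALAR_OPS_PER_ELEMENT


def vector_instructions(n, vector_length):
    """Instructions issued when each vector instruction handles a whole chunk."""
    chunks = math.ceil(n / vector_length)
    return chunks * VECTOR_OPS_PER_CHUNK


def add_constant_vector_style(b, c, vector_length):
    """Compute A = B + C one chunk at a time, mimicking vector registers."""
    a = []
    for start in range(0, len(b), vector_length):
        v1 = b[start:start + vector_length]   # LOADV: fill a vector register
        v2 = [x + c for x in v1]              # FADDV: add C to every element
        a.extend(v2)                          # STOREV: write the chunk back
    return a


# Assumed vector length of 64 elements (hypothetical, for illustration only).
print(scalar_instructions(1000), vector_instructions(1000, 64))
```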

ECMWF Slide 18 Parallel Computing Today

- Vector Systems: NEC SX6; CRAY X-1; Fujitsu VPP5000
- Scalar Systems: IBM Cluster 1600; Fujitsu PRIMEPOWER HPC2500; HP Integrity rx2600 Itanium2
- Cluster Systems (typically installed by an Integrator)
  - Virginia Tech, Apple G5 / Infiniband
  - NCSA, Dell PowerEdge 1750, P4 Xeon / Myrinet
  - LLNL, MCR Linux Cluster Xeon / Quadrics
  - LANL, Linux Networx AMD Opteron / Myrinet

ECMWF Slide 19 The TOP500 project

- Started in 1993
- Top 500 sites reported
- Report produced twice a year: EUROPE in JUNE, USA in NOV
- Performance based on LINPACK benchmark
- http://www.top500.org/

ECMWF Slide 20 Top 500 Supercomputers

ECMWF Slide 21 Where is ECMWF in Top 500?

- R max: Gflop/sec using Linpack Benchmark
- R peak: Peak Hardware Gflop/sec (that will never be reached!)

ECMWF Slide 22 What performance do Meteorological Applications achieve?

- Vector computers
  - About 30 to 50 percent of peak performance
  - Relatively more expensive
  - Also have front-end scalar nodes
- Scalar computers
  - About 5 to 10 percent of peak performance
  - Relatively less expensive
- Both Vector and Scalar computers are being used in Met Centres around the world
- Is it harder to parallelize than vectorize?
  - Vectorization is mainly a compiler responsibility
  - Parallelization is mainly the user's responsibility

ECMWF Slide 23 http://www.top500.org/ORSC/2003/ Overview of Recent Supercomputers Aad J. van der Steen and Jack J. Dongarra

ECMWF Slide 24

ECMWF Slide 25

ECMWF Slide 26 Parallel Programming Languages?

- High Performance Fortran (HPF)
  - directive based extension to Fortran
  - works on both shared and distributed memory systems
  - not widely used (more popular in Japan?)
  - not suited to applications using irregular grids
  - http://www.crpc.rice.edu/HPFF/home.html
- OpenMP
  - directive based
  - support for Fortran 90/95 and C/C++
  - shared memory programming only
  - http://www.openmp.org

ECMWF Slide 27 Most Parallel Programmers use…

- Fortran 90/95, C/C++ with MPI for communicating between tasks (processes)
  - works for applications running on shared and distributed memory systems
- Fortran 90/95, C/C++ with OpenMP
  - For applications that need performance that is satisfied by a single node (shared memory)
- Hybrid combination of MPI/OpenMP
  - ECMWF's IFS uses this approach
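The two programming styles differ mainly in how partial results reach the place where they are combined. The sketch below is only an analogy in Python, not MPI or OpenMP: threads writing into one shared list stand in for OpenMP threads on a shared-memory node, and workers that own private copies of their data and post results to a queue stand in for MPI tasks exchanging messages. All function names are hypothetical, and a real application would use the MPI and OpenMP interfaces from Fortran or C/C++.

```python
import queue
import threading


def shared_memory_sum(data, n_threads):
    """OpenMP-like: every thread writes its partial sum into one shared list."""
    partial = [0] * n_threads                      # visible to every thread
    chunk = (len(data) + n_threads - 1) // n_threads

    def work(tid):
        partial[tid] = sum(data[tid * chunk:(tid + 1) * chunk])

    threads = [threading.Thread(target=work, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                                   # synchronization point
    return sum(partial)


def message_passing_sum(data, n_tasks):
    """MPI-like: each task owns a private slice and sends its result as a message."""
    inbox = queue.Queue()                          # stands in for the interconnect
    chunk = (len(data) + n_tasks - 1) // n_tasks

    def task(rank, local_data):
        inbox.put((rank, sum(local_data)))         # "send" to the collecting task

    tasks = [threading.Thread(target=task,
                              args=(r, list(data[r * chunk:(r + 1) * chunk])))
             for r in range(n_tasks)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return sum(inbox.get()[1] for _ in range(n_tasks))   # "receive" n_tasks messages


values = list(range(1000))
print(shared_memory_sum(values, 4), message_passing_sum(values, 4))
```

A hybrid code combines both: message passing between nodes, shared-memory threads within each node.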

ECMWF Slide 28 The myth of automatic parallelization (2 common versions)

- Compilers can do anything (but we may have to wait a while): Automatic parallelization makes it possible (or will soon make it possible) to port any application to a parallel machine and see wonderful speedups without any modifications to the source.
- Compilers can't do anything (now or never): Automatic parallelization is useless. It'll never work on real code. If you want to port an application to a parallel machine, you have to restructure it extensively. This is a fundamental limitation and will never be overcome.

ECMWF Slide 29 Terminology

- Cache, Cache line
- NUMA
- false sharing
- Data decomposition
- Halo, halo exchange
- FLOP
- Load imbalance
- Synchronization
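Several of these terms fit together in one small example. The sketch below splits a 1D field across tasks (data decomposition), copies one boundary point from each neighbour (halo exchange), applies a 3-point average locally, and reports a simple load-imbalance ratio. It is an illustration under stated assumptions, not how IFS does it: the domain is assumed periodic, the halo is one point wide, and the "tasks" are plain Python lists rather than separate processes.

```python
def decompose(field, n_tasks):
    """Split a 1D field into contiguous sub-domains, one per task."""
    base, extra = divmod(len(field), n_tasks)
    parts, start = [], 0
    for rank in range(n_tasks):
        size = base + (1 if rank < extra else 0)   # spread the remainder
        parts.append(field[start:start + size])
        start += size
    return parts


def halo_exchange(parts):
    """Add one halo point from each neighbour to every sub-domain (periodic)."""
    n = len(parts)
    return [[parts[(r - 1) % n][-1]] + parts[r] + [parts[(r + 1) % n][0]]
            for r in range(n)]


def smooth_local(with_halo):
    """3-point average on the owned (non-halo) points of one sub-domain."""
    return [(with_halo[i - 1] + with_halo[i] + with_halo[i + 1]) / 3.0
            for i in range(1, len(with_halo) - 1)]


def smooth_serial(field):
    """Reference: the same 3-point average on the undivided periodic field."""
    n = len(field)
    return [(field[i - 1] + field[i] + field[(i + 1) % n]) / 3.0 for i in range(n)]


def load_imbalance(parts):
    """Largest sub-domain size divided by the average size (1.0 = balanced)."""
    sizes = [len(p) for p in parts]
    return max(sizes) / (sum(sizes) / len(sizes))


field = [float(i * i % 7) for i in range(10)]
parts = decompose(field, 3)
parallel_result = [v for h in halo_exchange(parts) for v in smooth_local(h)]
print(parallel_result == smooth_serial(field), load_imbalance(parts))
```

Without the halo points each task could not compute the average at the edges of its sub-domain, which is why the exchange has to happen before the local computation.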

ECMWF Slide 30 THANK YOU

ECMWF Slide 31 Cache [diagrams of processors, caches and memory]. P=Processor, C=Cache, M=Memory

ECMWF Slide 32 IBM node = 8 CPUs + 3 levels of cache [diagram: four pairs of processors, each processor with its own C1 caches and each pair sharing a C2, all sharing C3 and Memory]

ECMWF Slide 33 Cache is …

- Small and fast memory
- Cache line typically 128 bytes
- Cache line has state (copy, exclusive owner)
- Coherency protocol
- Mapping, sets, ways
- Replacement strategy
- Write thru or not
- Important for performance
  - Single stride access is always the best!!!
  - Try to avoid writes to same cache line from different CPUs
- But don't lose sleep over this
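The two performance points can be checked with simple arithmetic on cache-line numbers. The sketch below assumes the slide's 128-byte line, 8-byte (64-bit) array elements and an array that starts on a line boundary; the last two are assumptions, and the function names are illustrative. It counts how many distinct lines a strided access pattern touches and flags when two different elements written by different CPUs would share a line (false sharing).

```python
LINE_BYTES = 128   # "typically 128 bytes" on the slide
ELEM_BYTES = 8     # assumption: 64-bit floating-point elements


def cache_line(index, elem_bytes=ELEM_BYTES, line_bytes=LINE_BYTES):
    """Cache line number holding array element `index` (line-aligned array)."""
    return (index * elem_bytes) // line_bytes


def lines_touched(n_accesses, stride):
    """Distinct cache lines touched by n_accesses elements `stride` apart."""
    return len({cache_line(i * stride) for i in range(n_accesses)})


def false_sharing(index_a, index_b):
    """True if two CPUs writing these different elements hit the same line."""
    return index_a != index_b and cache_line(index_a) == cache_line(index_b)


# Same number of accesses, very different numbers of cache lines fetched.
print(lines_touched(1024, 1), lines_touched(1024, 16))
print(false_sharing(0, 15), false_sharing(15, 16))
```

With stride 1 every fetched line is fully used; with a stride of 16 elements each access pulls in a new line and uses only one element of it.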

ECMWF Slide 34 IFS blocking in grid space (IBM p690 / TL159L60): optimal use of cache / subroutine call overhead
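The trade-off named on this slide can be sketched as a loop that hands the grid to a routine one block at a time. Small blocks mean many subroutine calls (overhead); large blocks mean fewer calls but a working set that may no longer fit in cache. The code below is a generic illustration, not IFS code: `process_block` and `block_len` are hypothetical names, and the block lengths are arbitrary. It only shows that the result is independent of the block length while the number of calls is not.

```python
def process_block(block):
    """Stand-in for a routine called on one block of grid points."""
    return [2.0 * x + 1.0 for x in block]


def run_blocked(grid_points, block_len):
    """Apply process_block to the grid in blocks; return (result, call count)."""
    result, calls = [], 0
    for start in range(0, len(grid_points), block_len):
        result.extend(process_block(grid_points[start:start + block_len]))
        calls += 1
    return result, calls


grid = [float(i) for i in range(1000)]
for block_len in (1, 64, 1000):
    _, calls = run_blocked(grid, block_len)
    print(f"block length {block_len:4d}: {calls:4d} calls")
```

Choosing the block length is then a tuning exercise on the target machine, which is what the slide's measurements on the IBM p690 address.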
