ECMWF Slide 1 Introduction to Parallel Computing George Mozdzynski March 2004
ECMWF Slide 2 Outline What is parallel computing? Why do we need it? Types of computer Parallel Computing today Parallel Programming Languages OpenMP and Message Passing Terminology
ECMWF Slide 3 What is Parallel Computing? The simultaneous use of more than one processor or computer to solve a problem
ECMWF Slide 4 Why do we need Parallel Computing? Serial computing is too slow Need for large amounts of memory not accessible by a single processor
ECMWF Slide 5 An IFS operational T L 511L60 forecast model takes about one hour of wall time for a 10-day forecast using 288 CPUs of our IBM Cluster 1600 system (1,920 CPUs in total). How long would this model take using a fast PC with sufficient memory, e.g. a 3.2 GHz Pentium 4?
ECMWF Slide 6 Ans. About 8 days. This PC would need about 25 Gbytes of memory. 8 days is too long for a 10-day forecast! Even 2-3 hours would be too long …
ECMWF Slide 7 CPUs vs. wall time, IFS Forecast Model (T L 511L60). Amdahl's Law (named after Gene Amdahl): Wall Time = S + P/N_CPUs, where S is the serial time and P the parallelizable time. Serial = 574 secs, Parallel = secs (calculated using Excel's LINEST function). If F is the fraction of a calculation that is sequential, and (1-F) is the fraction that can be parallelized, then the maximum speedup that can be achieved using N processors is 1/(F+(1-F)/N).
ECMWF Slide 8 IFS Forecast Model (T L 511L60). [Table: CPUs vs. wall time, speedup and efficiency]
ECMWF Slide 9
ECMWF Slide 10 The IFS model would be inefficient on large numbers of CPUs, but is OK up to 512.
ECMWF Slide 11 Types of Parallel Computer (P=Processor, M=Memory, S=Switch). Shared memory: several processors connected through a switch to a single memory. Distributed memory: each processor has its own memory, and the processor-memory pairs are connected by a switch. [Diagram]
ECMWF Slide 12 IBM Cluster 1600 (at ECMWF) (P=Processor, M=Memory, S=Switch). A node is a set of processors sharing one memory; nodes are connected by a switch. [Diagram]
ECMWF Slide 13 IBM Cluster 1600s at ECMWF (hpca + hpcb)
ECMWF Slide 14 ECMWF supercomputers:
- 1979 CRAY-1A: vector
- CRAY XMP-2, CRAY XMP-4, CRAY YMP-8, CRAY C90-16: vector + shared-memory parallel
- Fujitsu VPP700, Fujitsu VPP5000: vector + MPI parallel
- IBM p690: scalar + MPI + shared-memory parallel
ECMWF Slide 15 ECMWF's first supercomputer: CRAY-1A (1979)
ECMWF Slide 16 Where have 25 years gone?
ECMWF Slide 17 Types of Processor. Example loop: DO J=1,1000; A(J)=B(J)+C; ENDDO. Scalar processor (a single instruction processes one element): LOAD B(J); FADD C; STORE A(J); INCR J; TEST. Vector processor (a single instruction processes many elements): LOADV B->V1; FADDV V1,C->V2; STOREV V2->A.
ECMWF Slide 18 Parallel Computing Today. Vector systems: NEC SX6, CRAY X-1, Fujitsu VPP5000. Scalar systems: IBM Cluster 1600, Fujitsu PRIMEPOWER HPC2500, HP Integrity rx2600 Itanium2. Cluster systems (typically installed by an integrator): Virginia Tech, Apple G5 / InfiniBand; NCSA, Dell PowerEdge 1750, P4 Xeon / Myrinet; LLNL, MCR Linux Cluster Xeon / Quadrics; LANL, Linux Networx AMD Opteron / Myrinet.
ECMWF Slide 19 The TOP500 project started in 1993. The top 500 sites are reported. The report is produced twice a year: in Europe in June, in the USA in November. Performance is based on the LINPACK benchmark.
ECMWF Slide 20 Top 500 Supercomputers
ECMWF Slide 21 Where is ECMWF in the Top 500? Rmax: Gflop/sec achieved using the LINPACK benchmark. Rpeak: peak hardware Gflop/sec (that will never be reached!)
ECMWF Slide 22 What performance do meteorological applications achieve? Vector computers: about 30 to 50 percent of peak performance; relatively more expensive; also have front-end scalar nodes. Scalar computers: about 5 to 10 percent of peak performance; relatively less expensive. Both vector and scalar computers are being used in Met Centres around the world. Is it harder to parallelize than to vectorize? Vectorization is mainly a compiler responsibility; parallelization is mainly the user's responsibility.
ECMWF Slide 23 Overview of Recent Supercomputers Aad J. van der Steen and Jack J. Dongarra
ECMWF Slide 24
ECMWF Slide 25
ECMWF Slide 26 Parallel Programming Languages? High Performance Fortran (HPF): a directive-based extension to Fortran; works on both shared and distributed memory systems; not widely used (more popular in Japan?); not suited to applications using irregular grids. OpenMP: directive-based; supports Fortran 90/95 and C/C++; shared memory programming only.
ECMWF Slide 27 Most parallel programmers use… Fortran 90/95 or C/C++ with MPI for communicating between tasks (processes): works for applications running on shared and distributed memory systems. Fortran 90/95 or C/C++ with OpenMP: for applications whose performance needs are satisfied by a single node (shared memory). A hybrid combination of MPI and OpenMP: ECMWF's IFS uses this approach.
ECMWF Slide 28 The myth of automatic parallelization (2 common versions). Compilers can do anything (but we may have to wait a while): automatic parallelization makes it possible (or will soon make it possible) to port any application to a parallel machine and see wonderful speedups without any modifications to the source. Compilers can't do anything (now or ever): automatic parallelization is useless; it'll never work on real code; if you want to port an application to a parallel machine, you have to restructure it extensively, and this is a fundamental limitation that will never be overcome.
ECMWF Slide 29 Terminology: cache, cache line; NUMA; false sharing; data decomposition; halo, halo exchange; FLOP; load imbalance; synchronization.
ECMWF Slide 30 THANK YOU
ECMWF Slide 31 Cache (P=Processor, C=Cache, M=Memory). [Diagram: a processor with a single cache in front of memory, and a processor with two cache levels (C1, C2) in front of memory]
ECMWF Slide 32 IBM node = 8 CPUs + 3 levels of cache ($). [Diagram: eight processors, each with its own L1 cache (C1), pairs of processors sharing an L2 cache (C2), and all sharing an L3 cache (C3) in front of memory]
ECMWF Slide 33 Cache is … small and fast memory. A cache line is typically 128 bytes. A cache line has state (copy, exclusive, owner). Coherency protocol. Mapping, sets, ways. Replacement strategy. Write-through or not. Important for performance: single-stride access is always the best; try to avoid writes to the same cache line from different CPUs. But don't lose sleep over this.
ECMWF Slide 34 IFS blocking in grid space (IBM p690 / T L 159L60): optimal use of cache vs. subroutine call overhead.