
1 Parallel Application Scaling, Performance, and Efficiency David Skinner NERSC/LBL

2 Parallel Scaling of MPI Codes A practical talk on using MPI, with a focus on: distribution of work within a parallel program; placement of computation within a parallel computer; performance costs of various types of communication; and understanding scaling performance terminology

3 Topics: Introduction, Load Balance, Synchronization, Simple stuff, File I/O, Performance profiling

4 Let’s introduce these topics through a familiar example: Sharks and Fish II: N^2 force summation in parallel. E.g. 4 CPUs evaluate forces for a global collection of 125 fish. Domain decomposition: each CPU is “in charge” of ~31 fish, but keeps a fairly recent copy of all the fish positions (replicated data). It is not possible to uniformly decompose problems in general, especially in many dimensions. Luckily this problem has fine granularity and is 2D; let’s see how it scales.

5 Sharks and Fish II: Program
Data: n_fish is global, my_fish is local, fish_i = {x, y, …}
Dynamics:
MPI_Allgatherv(myfish_buf, len[rank], …)
for (i = 0; i < my_fish; ++i)
  for (j = 0; j < n_fish; ++j)   // i != j
    a_i += g * mass_j * (fish_i – fish_j) / r_ij
Move fish
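A fuller C sketch of this replicated-data pattern (function and variable names are illustrative, not from the actual fish_sim code, and the force expression follows the slide's simplified form rather than an exact 1/r^2 law):

#include <mpi.h>
#include <math.h>

typedef struct { double x, y, mass; } fish_t;

/* counts[] and displs[] are byte counts/offsets because this sketch ships the
   structs as MPI_BYTE; a real code might register an MPI datatype instead. */
void force_step(fish_t *mine, int my_fish, fish_t *all, int n_fish,
                int *counts, int *displs, double *ax, double *ay, double g)
{
    /* Replicate every rank's fish (the MPI_Allgatherv above). */
    MPI_Allgatherv(mine, my_fish * (int)sizeof(fish_t), MPI_BYTE,
                   all, counts, displs, MPI_BYTE, MPI_COMM_WORLD);

    /* O(my_fish * n_fish) force summation over the replicated data. */
    for (int i = 0; i < my_fish; ++i) {
        ax[i] = 0.0; ay[i] = 0.0;
        for (int j = 0; j < n_fish; ++j) {
            double dx = mine[i].x - all[j].x;
            double dy = mine[i].y - all[j].y;
            double r  = sqrt(dx * dx + dy * dy);
            if (r > 0.0) {                    /* skip i == j (self) */
                ax[i] += g * all[j].mass * dx / r;
                ay[i] += g * all[j].mass * dy / r;
            }
        }
    }
    /* ...then move the local fish using ax and ay... */
}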

6 Sharks and Fish II: How fast? Running on a machine ~seaborg.nersc.gov: 100 fish can move 1000 steps in 5.459 s on 1 task, 2.756 s on 32 tasks (1.98x speedup); 1000 fish can move 1000 steps in 511.14 s on 1 task, 20.815 s on 32 tasks (24.6x speedup). What’s the “best” way to run? –How many fish do we really have? –How large a computer do we have? –How much “computer time” i.e. allocation do we have? –How quickly, in real wall time, do we need the answer?

7 Scaling: Good 1st step: do runtimes make sense? Running fish_sim for 100-1000 fish on 1-32 CPUs we see… (table of walltimes from 1 task to 32 tasks)

8 Scaling: Walltimes Walltime is (all-)important, but let’s define some other scaling metrics

9 Scaling: definitions Scaling studies involve changing the degree of parallelism. Will we change the problem size as well? Strong scaling – fixed problem size. Weak scaling – problem size grows with additional resources. Speedup = T_s / T_p(n). Efficiency = T_s / (n * T_p(n)). Be aware there are multiple definitions for these terms
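For example, taking the 1-task time as T_s for the 1000-fish runs above: speedup = 511.14 s / 20.815 s ≈ 24.6 and efficiency = 24.6 / 32 ≈ 0.77, i.e. each of the 32 tasks is about 77% as productive as in the serial run.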

10 Scaling: Speedups

11 Scaling: Efficiencies Remarkably smooth! Often algorithm and architecture make the efficiency landscape quite complex

12 Scaling: Analysis Why does efficiency drop? –Serial code sections  Amdahl’s law –Surface to Volume  Communication bound –Algorithm complexity or switching –Communication protocol switching  Whoa!
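To make the serial-section case concrete, Amdahl's law: with a serial fraction s, Speedup(n) = 1 / (s + (1-s)/n) ≤ 1/s, so even 5% serial work caps the speedup at 20 no matter how many CPUs are added.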

13 Scaling: Analysis In general, changing problem size and concurrency exposes or removes compute resources, and bottlenecks shift; the first bottleneck wins. Scaling brings additional resources too. –More CPUs (of course) –More cache(s) –More memory BW in some cases

14 Scaling: Superlinear Speedup (figure: speedup vs. # CPUs (OMP))

15 Scaling: Communication Bound 64 tasks: 52% comm; 192 tasks: 66% comm; 768 tasks: 79% comm. The MPI_Allreduce buffer size is 32 bytes. Q: What resource is being depleted here? A: Small message latency. 1) Compute per task is decreasing 2) Synchronization rate is increasing 3) Surface : volume ratio is increasing
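A common way to economize on small-message latency (a general technique, not taken from this particular code) is to batch several scalar reductions into one MPI_Allreduce over a small array:

#include <mpi.h>

/* Hypothetical example: four per-step scalar reductions folded into two calls,
   paying the small-message latency twice instead of four times. */
void reduce_step_stats(double *e_kin, double *e_pot, double *mass, double *max_v)
{
    double sums[3] = { *e_kin, *e_pot, *mass };

    /* One latency cost covers all three SUM reductions... */
    MPI_Allreduce(MPI_IN_PLACE, sums, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    /* ...and one more for the MAX, which needs a different reduction op. */
    MPI_Allreduce(MPI_IN_PLACE, max_v, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    *e_kin = sums[0]; *e_pot = sums[1]; *mass = sums[2];
}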

16 Topics: Introduction, Load Balance, Synchronization, Simple stuff, File I/O

17 Load Balance: cartoon (figure: a universal app, unbalanced vs. balanced timelines; the difference is the time saved by load balance)

18 Load Balance: performance data MPI ranks sorted by total communication time. Communication time: 64 tasks show 200 s, 960 tasks show 230 s

19 Load Balance: ~code
while (1) {
  do_flops(N_i);
  MPI_Alltoall();
  MPI_Allreduce();
}
(shown for 64-task and 960-task runs)
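A toy, entirely hypothetical version of this loop shows how imbalance surfaces as MPI time: give one rank extra flops and every other rank spends the difference waiting inside the collective:

#include <mpi.h>
#include <stdio.h>

/* Toy imbalance demo: rank 0 gets twice the flops, and the extra time shows
   up as "MPI time" inside MPI_Allreduce on every other rank. */
static double do_flops(long n) {
    volatile double x = 0.0;
    for (long i = 0; i < n; ++i) x += 1e-9 * (double)i;
    return x;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long n = (rank == 0) ? 200000000L : 100000000L;   /* uneven N_i */
    double global = 0.0, t_mpi = 0.0;

    for (int step = 0; step < 10; ++step) {
        double local = do_flops(n);
        double t0 = MPI_Wtime();
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t_mpi += MPI_Wtime() - t0;   /* mostly time spent waiting on rank 0 */
    }
    printf("rank %d: %.2f s inside MPI_Allreduce\n", rank, t_mpi);
    MPI_Finalize();
    return 0;
}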

20 Load Balance: real code (figure: time vs. MPI rank, broken down into sync, flops, and exchange)

21 Load Balance: analysis The 64 slow tasks (with more compute work) cause 30 seconds more “communication” in the 960-task run. This leads to 28800 CPU*seconds (8 CPU*hours) of unproductive computing. All imbalance requires is one slow task and a synchronizing collective! Pair problem size and concurrency well. Parallel computers allow you to waste time faster!

22 Load Balance: FFT Q: When is imbalance good? A: When it leads to a faster algorithm.

23 Dynamical Load Balance: Motivation (figure: time vs. MPI rank, broken down into sync, flops, and exchange)

24 Load Balance: Summary Imbalance is most often a byproduct of data decomposition. It must be addressed before further MPI tuning can happen. Good software exists for graph partitioning / remeshing. Dynamical load balance may be required for adaptive codes. For regular grids consider padding or contracting.

25 Topics: Introduction, Load Balance, Synchronization, Simple stuff, File I/O, Performance profiling

26 Scaling of MPI_Barrier() four orders of magnitude

27 Synchronization: definition
MPI_Barrier(MPI_COMM_WORLD);
T1 = MPI_Wtime();
e.g. MPI_Allreduce();
T2 = MPI_Wtime() - T1;
How synchronizing is MPI_Allreduce? For a code running on N tasks, what is the distribution of the T2’s? The average and width of this distribution tell us how synchronizing e.g. MPI_Allreduce is. Completion semantics of MPI functions: 1) Local: leave based on local logic (MPI_Comm_rank) 2) Partially synchronizing: leave after messaging M<N tasks (MPI_Bcast, MPI_Reduce) 3) Fully synchronizing: leave after everyone else enters (MPI_Barrier, MPI_Allreduce)

28 seaborg.nersc.gov It’s very hard to discuss synchronization outside the context of a particular parallel computer, so we will examine parallel application scaling on an IBM SP; much of this is applicable to other clusters.

29 seaborg.nersc.gov basics An IBM SP of 380 x 16-way SMP NHII nodes (6080 dedicated CPUs, 96 shared login CPUs), connected by the Colony switch (CSS0, CSS1), with GPFS and HPSS storage. A hierarchy of caching; speeds are not balanced; the bottleneck is determined by the first depleted resource.
Resource        Speed    Bytes
Registers       3 ns     2560 B
L1 Cache        5 ns     32 KB
L2 Cache        45 ns    8 MB
Main Memory     300 ns   16 GB
Remote Memory   19 us    7 TB
GPFS            10 ms    50 TB
HPSS            5 s      9 PB

30 MPI on the IBM SP: 2-4096 way concurrency; MPI-1 and ~MPI-2; GPFS-aware MPI-IO; thread safety; ranks on the same node bypass the switch.

31 Seaborg: point to point messaging Intranode (within a 16-way SMP NHII node) vs. internode (across the interconnect). Switch bandwidth is often stated in optimistic terms.

32 MPI: seaborg.nersc.gov Intra and Inter Node Communication
MP_EUIDEVICE (fabric)   Bandwidth (MB/sec)   Latency (usec)
css0                    500 / 350            9 / 21
css1                    X                    X
csss                    500 / 350            9 / 21
Lower latency  can satisfy more syncs/sec. What is the benefit of two adapters? Can a single task use both?

33 Inter-Node Bandwidth (figure: csss vs. css0) Tune message size to optimize throughput. Aggregate messages when possible.
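As a sketch of such aggregation (the buffer size, tag, and function name are made up for illustration), three small sends become one:

#include <mpi.h>
#include <string.h>

/* Hypothetical aggregation sketch: one message instead of three small ones. */
void send_aggregated(const double *a, const double *b, const double *c,
                     int n, int dest)
{
    double buf[3 * 64];
    if (3 * n > (int)(sizeof buf / sizeof buf[0])) return;   /* sketch only */

    memcpy(buf,         a, n * sizeof(double));
    memcpy(buf + n,     b, n * sizeof(double));
    memcpy(buf + 2 * n, c, n * sizeof(double));

    /* One latency cost instead of three; the receiver unpacks by offset. */
    MPI_Send(buf, 3 * n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}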

34 MPI Performance is often Hierarchical: message size and task placement are key. (figure: intra- vs. inter-node messaging)

35 MPI: Latency is not always 1 or 2 numbers. The set of all possible latencies describes the interconnect from the application’s perspective.

36 Synchronization: measurement
MPI_Barrier(MPI_COMM_WORLD);
T1 = MPI_Wtime();
e.g. MPI_Allreduce();
T2 = MPI_Wtime() - T1;
How synchronizing is MPI_Allreduce? For a code running on N tasks, what is the distribution of the T2’s? Let’s measure this…
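A minimal, self-contained harness for this measurement (the min/avg/max summary and the output format are my own additions) might look like:

#include <mpi.h>
#include <stdio.h>

/* Measure how long each rank spends inside MPI_Allreduce after a barrier,
   then summarize the spread of those times across ranks. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double x = (double)rank, sum = 0.0;

    MPI_Barrier(MPI_COMM_WORLD);              /* line everyone up */
    double t1 = MPI_Wtime();
    MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t2 = MPI_Wtime() - t1;             /* this rank's T2 */

    double tmin, tmax, tsum;
    MPI_Reduce(&t2, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t2, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t2, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("T2 over %d tasks: min %.2e  avg %.2e  max %.2e s\n",
               nranks, tmin, tsum / nranks, tmax);

    MPI_Finalize();
    return 0;
}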

37 Synchronization: MPI Collectives (2048 tasks) Beyond load balance there is a distribution of MPI timings intrinsic to the MPI call itself

38 Synchronization: Architecture …and from the machine itself: kernel process scheduling (Unix: cron et al.). (figure: t is the frequency)

39 Intrinsic Synchronization : Alltoall

40 Architecture makes a big difference!

41 This leads to variability in Execution Time

42 Synchronization: Summary As a programmer you can control: –Which MPI calls you use (it’s not required to use them all) –Message sizes, problem size (maybe) –The temporal granularity of synchronization. Language writers and system architects control: –How hard it is to do the last two above –The intrinsic amount of noise in the machine

43 Topics: Introduction, Load Balance, Synchronization, Simple stuff, File I/O, Performance profiling

44 Simple Stuff Parallel programs are easier to mess up than serial ones. Here are some common pitfalls.

45 What’s wrong here?

46 Is MPI_Barrier time bad? Probably. Is it avoidable? ~Three cases: 1) The stray / unknown / debug barrier 2) The barrier which is masking compute imbalance 3) Barriers used for I/O ordering. Often very easy to fix.

47 Topics: Introduction, Load Balance, Synchronization, Simple stuff, File I/O, Performance profiling

48 Parallel File I/O: Strategies (figure: MPI tasks writing to disk) Some strategies fall down at scale.

49 Parallel File I/O: Metadata A parallel file system is great, but it is also another place to create contention. Avoid unneeded disk I/O; know your file system. Often avoid file-per-task I/O strategies when running at scale.
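One scale-friendly alternative is a single shared file written collectively with MPI-IO; a minimal sketch, assuming each rank writes a contiguous block of doubles (file name and layout are hypothetical):

#include <mpi.h>

/* Each rank writes its contiguous block of 'count' doubles into one shared
   file at a rank-dependent offset, using collective I/O. */
void write_shared(double *buf, int count)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "snapshot.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * count * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}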

50 Topics: Introduction, Load Balance, Synchronization, Simple stuff, File I/O, Performance profiling

51 Performance Profiling Most of the tables and graphs in this talk were generated using IPM (http://ipm-hpc.sf.net). On seaborg do “module load ipm”, run as you normally would, and you get a brief summary to stdout. More detailed performance profiles are generated from an XML record written by IPM. Within ~24 hours after your job completes you should be able to find a performance summary of your job online at https://www.nersc.gov/nusers/status/llsum/

52 How to use IPM : basics 1) Do “module load ipm”, then run normally 2) Upon completion you get Maybe that’s enough. If so you’re done. Have a nice day. ##IPMv0.85################################################################ # # command :../exe/pmemd -O -c inpcrd -o res (completed) # host : s05405 mpi_tasks : 64 on 4 nodes # start : 02/22/05/10:03:55 wallclock : 24.278400 sec # stop : 02/22/05/10:04:17 %comm : 32.43 # gbytes : 2.57604e+00 total gflop/sec : 2.04615e+00 total # ###########################################################################

53 Want more detail? IPM_REPORT=full ##IPMv0.85##################################################################### # # command :../exe/pmemd -O -c inpcrd -o res (completed) # host : s05405 mpi_tasks : 64 on 4 nodes # start : 02/22/05/10:03:55 wallclock : 24.278400 sec # stop : 02/22/05/10:04:17 %comm : 32.43 # gbytes : 2.57604e+00 total gflop/sec : 2.04615e+00 total # # [total] min max # wallclock 1373.67 21.4636 21.1087 24.2784 # user 936.95 14.6398 12.68 20.3 # system 227.7 3.55781 1.51 5 # mpi 503.853 7.8727 4.2293 9.13725 # %comm 32.4268 17.42 41.407 # gflop/sec 2.04614 0.0319709 0.02724 0.04041 # gbytes 2.57604 0.0402507 0.0399284 0.0408173 # gbytes_tx 0.665125 0.0103926 1.09673e-05 0.0368981 # gbyte_rx 0.659763 0.0103088 9.83477e-07 0.0417372 #

54 Want more detail? IPM_REPORT=full # PM_CYC 3.00519e+11 4.69561e+09 4.50223e+09 5.83342e+09 # PM_FPU0_CMPL 2.45263e+10 3.83223e+08 3.3396e+08 5.12702e+08 # PM_FPU1_CMPL 1.48426e+10 2.31916e+08 1.90704e+08 2.8053e+08 # PM_FPU_FMA 1.03083e+10 1.61067e+08 1.36815e+08 1.96841e+08 # PM_INST_CMPL 3.33597e+11 5.21245e+09 4.33725e+09 6.44214e+09 # PM_LD_CMPL 1.03239e+11 1.61311e+09 1.29033e+09 1.84128e+09 # PM_ST_CMPL 7.19365e+10 1.12401e+09 8.77684e+08 1.29017e+09 # PM_TLB_MISS 1.67892e+08 2.62332e+06 1.16104e+06 2.36664e+07 # # [time] [calls] # MPI_Bcast 352.365 2816 69.93 22.68 # MPI_Waitany 81.0002 185729 16.08 5.21 # MPI_Allreduce 38.6718 5184 7.68 2.49 # MPI_Allgatherv 14.7468 448 2.93 0.95 # MPI_Isend 12.9071 185729 2.56 0.83 # MPI_Gatherv 2.06443 128 0.41 0.13 # MPI_Irecv 1.349 185729 0.27 0.09 # MPI_Waitall 0.606749 8064 0.12 0.04 # MPI_Gather 0.0942596 192 0.02 0.01 ###############################################################################

55 Detailed profiling based on message size: per MPI call, and per MPI call & buffer size

56 Summary: Introduction, Load Balance, Synchronization, Simple stuff, File I/O. Happy Scaling!

57 Other sources of information: MPI Performance: http://www-unix.mcs.anl.gov/mpi/tutorial/perf/mpiperf/ Seaborg MPI Scaling: http://www.nersc.gov/news/reports/technical/seaborg_scaling/ MPI Synchronization : Fabrizio Petrini, Darren J. Kerbyson, Scott Pakin, "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q", in Proc. SuperComputing, Phoenix, November 2003. Domain decomposition: http://www.ddm.org/ google://”space filling”&”decomposition” etc. Metis : http://www-users.cs.umn.edu/~karypis/metis

