
1 Programming Heterogeneous (GPU) Systems
Jeffrey Vetter
Presented to the Extreme Scale Computing Training Program, ANL: St. Charles, IL, 2 August 2013

2 TH-2 System
– Compute nodes: 16,000 nodes, multiple Tflop/s per node; Intel Xeon CPUs plus Intel Xeon Phi coprocessors per node
– Operations nodes: 4,096 FT CPUs
– Proprietary interconnect: TH Express-2
– 1 PB memory (host memory only)
– Global shared parallel storage: 12.4 PB
– Cabinets: 162 compute/communication/storage cabinets, ~750 m²
– Built by NUDT and Inspur

3 ORNL's Titan Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors
SYSTEM SPECIFICATIONS:
– Peak performance of 27.1 PF (24.5 PF GPU + 2.6 PF CPU)
– 18,688 compute nodes, each with a 16-core AMD Opteron CPU, an NVIDIA Tesla K20X GPU, and 32 + 6 GB of memory
– 512 service and I/O nodes
– 200 cabinets
– 710 TB total system memory
– Cray Gemini 3D torus interconnect
– 8.9 MW peak power
– 4,352 ft² footprint

4 Keeneland – Full Scale System (11 compute racks)
[System diagram: ProLiant SL250 G8 node (2 CPUs, 3 GPUs) in an S6500 chassis (4 nodes) in a rack (6 chassis); Intel Xeon E5 CPUs (166 GFLOPS each), NVIDIA M2090 GPUs (665 GFLOPS each), 2327 GFLOPS and 32/18 GB per node, 9308 GFLOPS per chassis; Mellanox 384-port FDR InfiniBand switch; integrated with the NICS datacenter Lustre and XSEDE; full PCIe Gen3 x16 bandwidth to all GPUs.]
J.S. Vetter, R. Glassbrook et al., "Keeneland: Bringing heterogeneous GPU computing to the computational science community," IEEE Computing in Science and Engineering, 13(5):90-5, 2011.

5 Contemporary HPC Architectures
Date | System | Location | Comp | Comm | Peak (PF) | Power (MW)
2009 | Jaguar; Cray XT5 | ORNL | AMD 6c | Seastar | |
2010 | Tianhe-1A | NSC Tianjin | Intel + NVIDIA | Proprietary | |
2010 | Nebulae | NSCS Shenzhen | Intel + NVIDIA | IB | |
2010 | Tsubame 2 | TiTech | Intel + NVIDIA | IB | |
2011 | K Computer | RIKEN/Kobe | SPARC64 VIIIfx | Tofu | |
2012 | Titan; Cray XK7 | ORNL | AMD + NVIDIA | Gemini | |
2012 | Mira; BlueGene/Q | ANL | SoC | Proprietary | |
2012 | Sequoia; BlueGene/Q | LLNL | SoC | Proprietary | |
2012 | Blue Waters; Cray | NCSA/UIUC | AMD + (partial) NVIDIA | Gemini | |
2012 | Stampede | TACC | Intel + MIC | IB | |
2013 | Tianhe-2 | NSCC-GZ (Guangzhou) | Intel + MIC | Proprietary | 54 | ~20

6 Emerging Computing Architectures
– Heterogeneous processing: many cores; fused, configurable memory
– Memory: 3D stacking; new devices (PCRAM, ReRAM)
– Interconnects: collective offload; scalable topologies
– Storage: active storage; non-traditional storage architectures (key-value stores)
– Improving performance and programmability in the face of increasing complexity: power, resilience
HPC (indeed, all) computer design is more fluid now than in the past two decades.

7 AMD Llano's fused memory hierarchy
K. Spafford, J.S. Meredith, S. Lee, D. Li, P.C. Roth, and J.S. Vetter, "The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Architectures," in ACM Computing Frontiers (CF). Cagliari, Italy: ACM, 2012.
Note: Both SB (Sandy Bridge) and Llano are consumer parts, not server parts.

8 Future Directions in Heterogeneous Computing
Over the next decade, heterogeneous computing will continue to increase in importance.
– Manycore
– Hardware features: transactional memory, random number generators, scatter/gather, wider SIMD/AVX
– Synergies with big data, mobile, and graphics markets
Assemble a top-10 list of features to include, from the application perspective: now is the time!

9 Applications must use a mix of programming models

10 Communication: MPI Profiling

11 Communication – MPI
– MPI dominates HPC
– Communication can severely restrict performance and scalability
– The developer has explicit control of MPI in the application: communication/computation overlap, collectives (see the sketch below)
– MPI tools provide a wealth of information: statistics (number and size of messages sent in a certain time interval) and tracing (an event-based log, per task, of all communication events)
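As a concrete illustration of the communication/computation overlap mentioned above (a minimal sketch, not code from the slides; the halo-exchange function and variable names are hypothetical), nonblocking receives and sends are posted first, independent interior work proceeds while messages are in flight, and MPI_Waitall completes the exchange before any halo-dependent work:

    /* Sketch of communication/computation overlap with nonblocking MPI. */
    #include <mpi.h>

    void exchange_and_compute(double *halo_in, double *halo_out, int n,
                              int left, int right, double *interior, int m)
    {
        MPI_Request reqs[4];

        /* Post the halo exchange with the left and right neighbors. */
        MPI_Irecv(halo_in,     n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(halo_in + n, n, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(halo_out,     n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(halo_out + n, n, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[3]);

        /* Overlap: update interior points that do not depend on the halo. */
        for (int i = 0; i < m; i++)
            interior[i] *= 0.5;

        /* The exchange must complete before halo-dependent updates. */
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    }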

12 MPI Provides the MPI Profiling Layer
– The MPI specification provides the MPI profiling layer (PMPI) to allow interposition between the application and the MPI runtime (a sketch follows below)
– PERUSE is a recent attempt to provide more detailed information from the runtime for performance measurement: http://www.mpi-peruse.org/
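To make the PMPI interposition mechanism concrete, here is a minimal sketch of a profiling wrapper (an illustrative example, not any particular tool). The wrapper defines MPI_Send, records a statistic, and forwards to the library through the PMPI_Send entry point; linked ahead of the MPI library, it needs no application source changes. The prototype below follows MPI-3 (const buffer); older MPI-2 headers omit the const.

    #include <mpi.h>
    #include <stdio.h>

    static long send_count = 0;   /* illustrative per-process counter */

    /* Intercept MPI_Send: do tool bookkeeping, then call the real send. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        send_count++;
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }

    /* Report at shutdown, using PMPI_ calls so the wrapper is not re-entered. */
    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d issued %ld MPI_Send calls\n", rank, send_count);
        return PMPI_Finalize();
    }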

13 MPI Performance Tools Provide Varying Levels of Detail
– mpiP (http://mpip.sourceforge.net/): statistics on counts, sizes, min, and max for point-to-point, collective, and MPI-IO operations; lightweight (has scaled to 64K processors on BG/L, no large trace files, low perturbation); call-site-specific information
– Tau, Vampir, Intel Tracing Tool, Paraver: statistical and tracing information, with varying levels of complexity, perturbation, and trace-file size
– Paraver (http://www.bsc.es/plantillaA.php?cat_id=486): covered in detail here

14 MPI Profiling

15 Why do these systems have different performance on POP?

16 MPI Performance Profiling: mpiP
mpiP basics:
– Easy-to-use tool: a statistical MPI profiling library; requires relinking but no source-level changes; compiling with -g is recommended
– Provides average times for each MPI call site
– Has been shown to be very useful for scaling analysis (a call-site example follows below)
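To make "call site" concrete, here is a minimal, hypothetical MPI program (not taken from POP). Built with mpicc -g and relinked against the mpiP library, with no source changes, its MPI_Allreduce line would appear in the mpiP report as a single call site (file, line, parent function) with count, mean/min/max times, and App%/MPI% statistics.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, sum;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* This line is one MPI call site in the mpiP report. */
        for (int iter = 0; iter < 1000; iter++)
            MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks = %d\n", sum);
        MPI_Finalize();
        return 0;
    }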

17 mpiP Example: POP MPI Performance
Header of the mpiP report for POP (field values elided): Command, Version, MPIP Build date, Start time, Stop time, MPIP env var, Collector Rank, Collector PID, Final Output Dir, and MPI Task Assignment (tasks 0 and 1, e.g., on h0107.nersc.gov).

18 More mpiP Output for POP
MPI time (seconds), per task: columns Task, AppTime, MPITime, MPI%, with per-task rows and an aggregate '*' row (values elided).
Callsites (ID, Lev, File, Line, Parent_Funct, MPI_Call):
1 0 global_reductions.f 0 ?? Wait
2 0 stencils.f 0 ?? Waitall
3 0 communicate.f 3122 .MPI_Send Cart_shift
4 0 boundary.f 3122 .MPI_Send Isend
5 0 communicate.f 0 .MPI_Send Type_commit
6 0 boundary.f 0 .MPI_Send Isend

19 Still More mpiP Output for POP
Aggregate time (top twenty, descending, milliseconds), columns Call, Site, Time, App%, MPI% (numeric values elided); the top entries, in order:
Waitall, Waitall, Wait, Waitall, Allreduce, Bcast, Isend, Isend, Barrier, Irecv, Irecv, Cart_create, Cart_coords, Type_commit

20 Remaining mpiP Output for POP
Aggregate time (continued): Isend, Bcast, Barrier, Cart_shift, Irecv, Isend.
Callsite statistics (all, milliseconds), columns Name, Site, Rank, Count, Max, Mean, Min, App%, MPI%: per-rank Allreduce rows for site 1 plus an aggregate '*' row (values elided).

