Parallel Communications and NUMA Control on the TeraGrid’s New Sun Constellation System. Lars Koesterke with Kent Milfeld and Karl W. Schulz. AUS Presentation.


1 Parallel Communications and NUMA Control on the TeraGrid’s New Sun Constellation System. Lars Koesterke with Kent Milfeld and Karl W. Schulz. AUS Presentation, 09/06/08

2 Bigger Systems, Higher Complexity. Ranger is BIG! Ranger’s Architecture is Multi-Level Parallel and Asymmetric: 3936 nodes, 62976 cores, 2 large Switches.

3 Understand the Implications of the Multi-Level Parallel Architecture. Optimize Operational Methods and Applications. Maximize the Yield from Ranger and other Big TeraGrid Machines yet to come! Get the Most out of the New Generation of Supercomputers!

4 Outline: Introduction; General description of the Experiment; Layout of the Node; Layout of the Interconnect (NEM and Magnum switches); Experiments: Ping-Pong and Barrier cost (On-Node Experiments, On-NEM Experiment, Switch Experiment); Conclusion (Implications for System Management, Implications for Users). NEM: Network Express Module.

5 Parameter Selection for Experiments. Ranger Nodes have 4 quad-core Sockets: 16 cores per Node. Natural Setups: Pure MPI (16 tasks per Node); Hybrid (4 tasks per Node or 1 task per Node). Tests are selected accordingly: 1, 4 and 16 tasks. [Figure: task layouts for 16 MPI Tasks; 4 MPI Tasks with 4 Threads/Task; 1 MPI Task with 16 Threads/Task. Legend: MPI Task on Core, Master Thread of MPI Task, Slave Thread of MPI Task.] In large-scale calculations with 16 tasks per Node, communication could/should be bundled: measure with one Task per Node.

6 Experiment 1: Ping-Pong with MPI. MPI processes reside on: –the same Node –the same Chassis (connected by one NEM) –different Chassis (connected by Magnum switch). Messages are sent back and forth (Ping-Pong): –Communication Distance is varied (Node, NEM, Magnum) –Communication Volume is varied (Message Size: 32 Bytes to 16 MB; Number of processes sending/receiving simultaneously). Measured quantity: Effective Bandwidth per Communication Channel –Timing taken from multiple runs on a dedicated system. Node: 16 Cores; Chassis: 12 Nodes; Total: 328 Chassis, 3936 Nodes.
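The bandwidth measure on this slide can be illustrated with a small sketch. It assumes that the message sizes double from 32 Bytes to 16 MB and that the one-way transfer time is half of the measured round trip; neither detail is stated on the slide, and the function names are hypothetical.

```python
def message_sizes(smallest=32, largest=16 * 1024 * 1024):
    """Return message sizes in bytes, doubling from smallest to largest.

    Doubling is an assumption; the slide only gives the range 32 Bytes to 16 MB.
    """
    sizes = []
    size = smallest
    while size <= largest:
        sizes.append(size)
        size *= 2
    return sizes


def effective_bandwidth_mb_s(message_bytes, round_trip_seconds):
    """Bandwidth of one channel in MB/s (1 MB = 10**6 bytes here).

    Assumes the one-way time is half of the ping-pong round trip.
    """
    one_way_seconds = round_trip_seconds / 2.0
    return message_bytes / one_way_seconds / 1.0e6


if __name__ == "__main__":
    sizes = message_sizes()
    print(len(sizes), sizes[0], sizes[-1])
    # Hypothetical timing, not a measurement from the presentation.
    print(effective_bandwidth_mb_s(1_000_000, 2.0e-3))
```

In a real run the round-trip time would come from timing a send/receive pair over many repetitions; the timing value above is only a placeholder.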

7 Experiment 2: MPI Barrier Cost. MPI processes reside on: –the same Node –the same Chassis (connected by a NEM) –different Chassis (connected by Magnum switch). Synchronize on Barriers: –Communication Distance is varied (Node, NEM, Magnum) –Communication Volume is varied (Number of processes executing the Barrier). Barrier Cost is measured in Clock Periods (CP) –Timing taken from multiple runs on a dedicated system. Node: 16 Cores; Chassis: 12 Nodes; Total: 328 Chassis, 3936 Nodes.
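Barrier costs in clock periods can be compared with network latencies in microseconds once a clock frequency is fixed. The sketch below is only a unit conversion; the slides do not state the clock frequency, so the 2.3 GHz value used in the example is an assumption and the function name is hypothetical.

```python
def clock_periods_to_microseconds(clock_periods, clock_hz):
    """Convert a cost in clock periods (CP) to microseconds."""
    return clock_periods / clock_hz * 1.0e6


if __name__ == "__main__":
    # ASSUMPTION: 2.3 GHz cores; the presentation does not give the frequency.
    assumed_clock_hz = 2.3e9
    for cost_cp in (1700, 3200, 6800):
        usec = clock_periods_to_microseconds(cost_cp, assumed_clock_hz)
        print(cost_cp, "CP is about", round(usec, 2), "usec")
```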

8 Node Architecture. 4 quad-core CPUs (Sockets) per node; Memory local to Sockets; 3-way HyperTransport; “Missing” connection 0---3. [Figure: CPUs (Sockets) 0, 1, 2, 3 and the PCI Express Bridge.] Asymmetry: –Local vs. Remote Memory –0---3 requires one additional “hop” –PCI Connection. Note: Accessing local memory on both Sockets 0 and 3 is slower with extra HT hop (Cache Coherence).
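The extra hop between Sockets 0 and 3 can be reproduced with a breadth-first search over the socket links. The link table below is an assumption: the slide only states that the 0---3 connection is missing, so every other socket pair is taken to be directly connected.

```python
from collections import deque

# ASSUMED HyperTransport links: all socket pairs except 0---3 are connected.
HT_LINKS = {0: (1, 2), 1: (0, 2, 3), 2: (0, 1, 3), 3: (1, 2)}


def ht_hops(src, dst, links=HT_LINKS):
    """Return the minimum number of HT hops between two sockets (BFS)."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        socket, hops = queue.popleft()
        for neighbor in links[socket]:
            if neighbor == dst:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    raise ValueError("sockets are not connected")


if __name__ == "__main__":
    for target in (1, 2, 3):
        print("0 ->", target, ":", ht_hops(0, target), "hop(s)")
```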

9 Network Architecture. Each Chassis (12 Blades) is connected to a Network Express Module (NEM). Each NEM is connected to a Line Card in the Magnum Switch. The Switch connects the Line Cards through a Backplane. [Figure: path HCA, NEM, Line Card, NEM, HCA, with routes of 1, 3, 5 and 7 hops.] Number of Hops / Latency: 1 Hop, 1.57 μsec: Blades in the same Chassis; 3 Hops, 2.04 μsec: NEMs connected to the same Line Card; 5/7 Hops, 2.45/2.85 μsec: Connection through the Magnum switch.
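The latency figures on this slide can be put into a small lookup table to estimate how much each additional hop costs. This is a sketch built only from the four numbers above; the averaging over the whole range and the names used are illustrative choices, not something the presentation states.

```python
# Latency in microseconds by hop count, copied from the slide.
LATENCY_USEC = {1: 1.57, 3: 2.04, 5: 2.45, 7: 2.85}


def added_latency_per_hop(latency=LATENCY_USEC):
    """Average extra latency per hop between the shortest and longest route."""
    hops = sorted(latency)
    first, last = hops[0], hops[-1]
    return (latency[last] - latency[first]) / (last - first)


if __name__ == "__main__":
    print("average added latency per hop (usec):",
          round(added_latency_per_hop(), 3))
    print("7-hop vs 1-hop latency ratio:",
          round(LATENCY_USEC[7] / LATENCY_USEC[1], 2))
```

Read this way, the longest route adds roughly 0.2 μsec per hop on top of the 1.57 μsec base latency.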

10 On-Node: Ping-Pong. Socket 0 ping-pongs with Sockets 1, 2 and 3; 1, 2, 4 simultaneous communications (quad-core). Bandwidth scales with number of communications. Missing Connection: Communication between 0 and 3 is slower. Maximum Bandwidth: 1100 MB/s, 700 MB/s, 300 MB/s.

11 On-Node: Barrier Cost (2 Cores). One Barrier: 0---1, 0---2, 0---3. Cost: 1600 - 1900 CPs. Asymmetry: Communication between 0 and 3 is slower.

12 On-Node: Barrier Cost (Multiple Cores, 2 Sockets). Barriers per Socket: 1, 2, 4. Cost: 1700, 3200, 6800 CPs.

13 On-NEM: Ping-Pong. 2-12 Nodes in the same Chassis; 1 MPI Process per Node (1-6 communication pairs). Perfect Scaling for up to 6 simultaneous communications. Maximum Bandwidth: 6 x 900 MB/s.

14 On-NEM: Barrier Scaling. Barriers per Node: 1, 4, 16. Cost: starts at 5000/15000 CPs and increases up to 20000/27000/32000 CPs.

15 NEM-to-NEM: Ping-Pong. Maximum Distance: 7 hops; 1 MPI Process per Node (1-12 communication pairs). Maximum Performance: 2 x 900 up to 12 x 450 MB/s.

16 Switch: Barrier Scaling. Communication between 1-12 Nodes on 2 Chassis. Barriers per Node: 1, 4, 16. Two Runs: System was not clean during this test. Results similar to On-NEM test.

17 Communication pattern reveals Asymmetry on the Node level –No Direct HT Connection between Sockets 0 and 3. Max. Bandwidth: On-NEM: 6 x 900 MB/s; NEM-to-NEM: 2 x 900 MB/s up to 12 x 450 MB/s. Further Investigation necessary to achieve theoretical 12 x 900 MB/s. [Figure: Ranger. 16-way Nodes: NUMA; Multi-Level Interconnect: low-latency, high-bandwidth.]
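The gap between the measured and the theoretical NEM-to-NEM bandwidth follows from simple arithmetic on the figures above. The sketch below only multiplies pair counts by per-pair bandwidth; the function name is hypothetical.

```python
def aggregate_mb_s(pairs, per_pair_mb_s):
    """Aggregate bandwidth of simultaneous communication pairs in MB/s."""
    return pairs * per_pair_mb_s


if __name__ == "__main__":
    on_nem = aggregate_mb_s(6, 900)        # measured, same Chassis
    nem_to_nem = aggregate_mb_s(12, 450)   # measured, through the switch
    theoretical = aggregate_mb_s(12, 900)  # target named on the slide
    print("On-NEM:", on_nem, "MB/s")
    print("NEM-to-NEM:", nem_to_nem, "MB/s")
    print("fraction of theoretical:", nem_to_nem / theoretical)
```

With 12 pairs the measured aggregate matches the 6-pair On-NEM figure and is half of the theoretical 12 x 900 MB/s.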

18 Conclusions. Aggregate Communication and I/O on Node (SMP) level: –Reduce total number of Communications –Reduce Traffic through Magnum switches –On 16-way Node: 15 compute tasks and a single Communication task? –Use of MPI with OpenMP? Apply Load-Balancing: –Asymmetry on Node Level –Multi-Level Interconnect (Node, NEM, Magnum switches). Use full Chassis (12 Nodes, 192 Cores): –Use extremely low-latency Connections through NEM (< 1.6 μsecs). Take Advantage of the Architecture at all Levels. Applications should be cognizant of various SMP/Network levels. More topology-aware scheduling is under investigation.
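The idea of one communication task per Node and of filling whole Chassis can be sketched as a mapping from task rank to Node and Chassis. The sketch assumes block placement (consecutive ranks fill a Node, consecutive Nodes fill a Chassis) and designates the first task on each Node as the communication task; both choices and all names are hypothetical and are not taken from the presentation.

```python
def locate_rank(rank, tasks_per_node=16, nodes_per_chassis=12):
    """Map a task rank to its Node, Chassis and position within the Node.

    ASSUMPTION: block placement of ranks onto Nodes and of Nodes onto Chassis.
    """
    node = rank // tasks_per_node
    return {
        "node": node,
        "chassis": node // nodes_per_chassis,
        "local_rank": rank % tasks_per_node,
    }


def is_communication_task(rank, tasks_per_node=16):
    """Treat the first task on each Node as the single communication task."""
    return rank % tasks_per_node == 0


if __name__ == "__main__":
    for rank in (0, 15, 16, 191, 192):
        print(rank, locate_rank(rank), is_communication_task(rank))
```

Under these assumptions, ranks 0 to 191 fill one Chassis (12 Nodes, 192 Cores), so traffic among them would stay on a single NEM.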

