Parallel Communications and NUMA Control on the TeraGrid's New Sun Constellation System
Lars Koesterke, with Kent Milfeld and Karl W. Schulz (AUS Presentation)

Presentation transcript:

Parallel Communications and NUMA Control on the TeraGrid's New Sun Constellation System
Lars Koesterke, with Kent Milfeld and Karl W. Schulz
AUS Presentation, 09/06/08

Bigger Systems, Higher Complexity
Ranger is BIG!
Ranger's Architecture is Multi-Level Parallel and Asymmetric
3,936 Nodes, 62,976 Cores, 2 large Switches

Understand the Implications of the Multi-Level Parallel Architecture
Optimize Operational Methods and Applications
Maximize the Yield from Ranger and other big TeraGrid Machines yet to come!
Get the Most out of the New Generation of Supercomputers!

Outline
Introduction
General Description of the Experiments
Layout of the Node
Layout of the Interconnect (NEM and Magnum switches)
Experiments: Ping-Pong and Barrier Cost
– On-Node Experiments
– On-NEM Experiments
– Switch Experiments
Conclusion
– Implications for System Management
– Implications for Users
NEM: Network Express Module

Parameter Selection for Experiments
Ranger Nodes have 4 quad-core Sockets: 16 Cores per Node
Natural Setups:
– Pure MPI: 16 Tasks per Node
– Hybrid: 4 Tasks per Node
– Hybrid: 1 Task per Node
Tests are selected accordingly: 1, 4 and 16 Tasks
– 16 MPI Tasks
– 4 MPI Tasks, 4 Threads/Task
– 1 MPI Task, 16 Threads/Task
(Figure: placement of MPI Tasks on Cores, showing Master and Slave Threads of each Task)
In large-scale calculations with 16 Tasks per Node, communication could/should be bundled: measure with one Task per Node
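
To make the three layouts concrete, here is a minimal placement check, not part of the original slides: every OpenMP thread of every MPI task reports which core it is running on, so a 16x1, 4x4 or 1x16 configuration on a 16-core node can be verified. sched_getcpu() is a Linux-specific call, and the launcher options and OMP_NUM_THREADS settings are assumptions that depend on the local MPI stack.

```c
/* Hypothetical placement check (not from the slides): every OpenMP thread of
 * every MPI task reports the core it is running on, which makes it easy to
 * verify a 16x1, 4x4 or 1x16 layout on a 16-core Ranger node. */
#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* sched_getcpu() is a Linux extension; it returns the core the
         * calling thread is currently executing on. */
        printf("rank %d, thread %d -> core %d\n",
               rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}
```

For example, a pure-MPI run would use 16 tasks per node with OMP_NUM_THREADS=1, while the hybrid setups would use 4 tasks with 4 threads each, or 1 task with 16 threads (exact batch and launcher settings on Ranger may differ).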

Experiment 1: Ping-Pong with MPI
MPI processes reside on:
– the same Node
– the same Chassis (connected by one NEM)
– different Chassis (connected by the Magnum switch)
Messages are sent back and forth (Ping-Pong)
– Communication Distance is varied (Node, NEM, Magnum)
– Communication Volume is varied: Message Size (32 Bytes up to the MB range) and Number of processes sending/receiving simultaneously
Effective Bandwidth per Communication Channel
– Timing taken from multiple runs on a dedicated system
Node: 16 Cores; Chassis: 12 Nodes; Total: 328 Chassis, 3,936 Nodes
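
The slides do not show the benchmark source, but a ping-pong kernel along these lines is a reasonable sketch of Experiment 1: two ranks bounce a buffer back and forth, and the effective bandwidth per channel follows from the bytes moved over the measured time. Message size, repetition count and the pairing of ranks across Node/NEM/Magnum distances are placeholders.

```c
/* Ping-pong sketch (assumed structure, not the authors' benchmark).
 * Ranks 0 and 1 exchange a message 'reps' times; the effective bandwidth
 * per channel follows from the total bytes moved over the elapsed time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1 << 20;          /* example message size: 1 MB   */
    const int reps   = 100;              /* placeholder repetition count */
    char *buf = malloc(nbytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0)                        /* factor 2: forth and back */
        printf("effective bandwidth: %.1f MB/s\n",
               2.0 * nbytes * reps / dt / 1.0e6);

    MPI_Finalize();
    free(buf);
    return 0;
}
```

Whether the pair sits on one node, within one chassis, or across the Magnum switch is controlled entirely by where the two ranks are placed at launch time.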

Experiment 2: MPI Barrier Cost
MPI processes reside on:
– the same Node
– the same Chassis (connected by a NEM)
– different Chassis (connected by the Magnum switch)
Synchronize on Barriers
– Communication Distance is varied (Node, NEM, Magnum)
– Communication Volume is varied: Number of processes executing the Barrier
Barrier Cost measured in Clock Periods (CP)
– Timing taken from multiple runs on a dedicated system
Node: 16 Cores; Chassis: 12 Nodes; Total: 328 Chassis, 3,936 Nodes
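
Again as a sketch rather than the authors' code, the barrier cost can be estimated by timing a large number of MPI_Barrier calls and converting the average time into clock periods with a nominal core frequency; 2.3 GHz is assumed here for Ranger's quad-core Opterons, and the original measurement method may differ.

```c
/* Barrier-cost sketch (assumed method, not the original benchmark): time a
 * large number of MPI_Barrier calls and convert the average to clock
 * periods (CP) with a nominal core frequency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int    reps     = 10000;
    const double clock_hz = 2.3e9;    /* assumed nominal clock of a Ranger core */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);      /* synchronize before timing */
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double per_barrier = (MPI_Wtime() - t0) / reps;

    if (rank == 0)
        printf("barrier cost: %.2e s  (~%.0f CP at %.1f GHz)\n",
               per_barrier, per_barrier * clock_hz, clock_hz / 1.0e9);

    MPI_Finalize();
    return 0;
}
```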

Node Architecture
4 quad-core CPUs (Sockets) per Node
Memory local to each Socket
3-way HyperTransport, with one "missing" connection
(Figure: node diagram showing the CPUs and the PCI Express Bridge)
Asymmetry
– Local vs. Remote Memory: remote access requires one additional hop
– PCI Connection
Note: Accessing local memory on Sockets 0 and 3 is slower, due to an extra HT hop (Cache Coherence)
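
NUMA control on such a node is usually exercised at launch time with numactl, but the same placement can be requested from within a program via libnuma. The sketch below is illustrative only and not taken from the presentation; the node number and buffer size are arbitrary, and the program must be linked with -lnuma.

```c
/* libnuma sketch (illustrative only, not from the presentation): bind the
 * calling process and its memory to one NUMA node so that all accesses
 * stay local.  Link with -lnuma. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA support not available\n");
        return 1;
    }

    int node = 1;                         /* hypothetical target socket */
    numa_run_on_node(node);               /* restrict CPU placement     */
    numa_set_preferred(node);             /* prefer memory on that node */

    /* Explicitly place a large buffer on the chosen node; the first touch
     * below therefore hits local memory only. */
    size_t bytes = 64UL << 20;            /* 64 MB example buffer */
    double *buf = numa_alloc_onnode(bytes, node);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    buf[0] = 1.0;

    numa_free(buf, bytes);
    return 0;
}
```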

Network Architecture
Each Chassis (12 Blades) is connected to a Network Express Module (NEM)
Each NEM is connected to a Line Card in the Magnum Switch
The Switch connects the Line Cards through a Backplane
(Figure: path HCA -> NEM -> Line Card -> NEM -> HCA, annotated with 1, 3, 5 and 7 hops)
Number of Hops / Latency:
– 1 Hop, 1.57 μsec: Blades in the same Chassis
– 3 Hops, 2.04 μsec: NEMs connected to the same Line Card
– 5/7 Hops, 2.45/2.85 μsec: Connection through the Magnum switch

On-Node: Ping-Pong
Socket 0 ping-pongs with Sockets 1, 2 and 3
1, 2, 4 simultaneous communications (quad-core)
Bandwidth scales with the number of communications
Missing Connection: Communication between Sockets 0 and 3 is slower
Maximum Bandwidth: 1100 MB/s, 700 MB/s, 300 MB/s

On-Node: Barrier Cost (2 Cores)
One Barrier: 0---1, 0---2, 0---3; Cost measured in CPs
Asymmetry: Communication between Sockets 0 and 3 is slower

On-Node: Barrier Cost (Multiple Cores, 2 Sockets)
Barriers per Socket: 1, 2, 4
Cost: 1700, 3200, 6800 CPs

On-NEM: Ping-Pong
2-12 Nodes in the same Chassis
1 MPI process per Node (1-6 communication pairs)
Perfect scaling for up to 6 simultaneous communications
Maximum Bandwidth: 6 x 900 MB/s

On-NEM: Barrier Scaling
Barriers per Node: 1, 4, 16
Cost: starts at 5000/15000 CPs and increases up to 20000/27000/32000 CPs

NEM-to-NEM: Ping-Pong
Maximum Distance: 7 hops
1 MPI process per Node (1-12 communication pairs)
Maximum Performance: 2 x 900 MB/s up to 12 x 450 MB/s

Switch: Barrier Scaling
Communication between 1-12 Nodes on 2 Chassis
Barriers per Node: 1, 4, 16
Two runs: the system was not clean during this test
Results similar to the On-NEM test

Ranger: 16-way NUMA Nodes, Multi-Level Interconnect (low-latency, high-bandwidth)
Communication pattern reveals Asymmetry on the Node level
– No direct HT connection between Sockets 0 and 3
Max. Bandwidth:
– On-NEM: 6 x 900 MB/s
– NEM-to-NEM: 2 x 900 MB/s up to 12 x 450 MB/s
Further investigation necessary to achieve the theoretical 12 x 900 MB/s

Conclusions
Aggregate Communication and I/O on the Node (SMP) level
– Reduce the total number of Communications
– Reduce Traffic through the Magnum switches
– On a 16-way Node: 15 compute tasks and a single Communication task?
– Use of MPI with OpenMP?
Apply Load-Balancing
– Asymmetry on the Node Level
– Multi-Level Interconnect (Node, NEM, Magnum switches)
Use the full Chassis (12 Nodes, 192 Cores)
– Use the extremely low-latency Connections through the NEM (< 1.6 μsec)
Take Advantage of the Architecture at all Levels
– Applications should be cognizant of the various SMP/Network levels
– More topology-aware scheduling is under investigation
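
The "single communication task per node" idea maps naturally onto MPI's funneled threading mode. The sketch below is an illustration of that pattern, not code from the talk: one MPI task per node, the master thread performs all MPI calls while the other OpenMP threads compute, so that message traffic through the NEM and Magnum switches is aggregated at the node level.

```c
/* Funneled-communication sketch (illustration of the conclusion's idea,
 * not code from the talk): one MPI task per node, the master thread does
 * all MPI communication while the remaining OpenMP threads compute. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

static void compute_chunk(int thread_id)
{
    (void)thread_id;                     /* placeholder for node-local work */
}

int main(int argc, char **argv)
{
    int provided, rank, value;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    value = rank;

    #pragma omp parallel
    {
        #pragma omp master
        {
            /* Only the master (main) thread calls MPI, as required by
             * MPI_THREAD_FUNNELED: node-level traffic is bundled into a
             * single stream through the NEM / Magnum switches. */
            MPI_Allreduce(MPI_IN_PLACE, &value, 1, MPI_INT, MPI_SUM,
                          MPI_COMM_WORLD);
        }
        /* All threads (the master too, once the reduction returns)
         * continue with computation. */
        compute_chunk(omp_get_thread_num());
    }

    if (rank == 0)
        printf("sum of ranks: %d\n", value);

    MPI_Finalize();
    return 0;
}
```

Launched with one task per node and 16 threads, this keeps a single communication stream per node, which is exactly the aggregation the conclusions argue for.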