Cooperative Rendezvous Protocols for Improved Performance and Overlap

Cooperative Rendezvous Protocols for Improved Performance and Overlap
Sourav Chakraborty, Mohammadreza Bayatpour, Jahanzeb Hashmi, Hari Subramoni, and Dhabaleswar K. Panda
Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University

Overview: Introduction, Motivation, Vision and Contribution, Detailed Designs, Experimental Evaluations, Conclusion and Future Work

Current Trends in HPC
Supercomputing systems are scaling rapidly: multi-/many-core architectures (Xeon, OpenPOWER) and high-performance interconnects (InfiniBand, Omni-Path).
Core density per node keeps increasing, driven by improvements in manufacturing technology and the push for more performance per watt.
The Message Passing Interface (MPI) is used by the vast majority of HPC applications: blocking/non-blocking point-to-point operations, and collectives built on point-to-point communication.
[Slide images: Stampede2 @ TACC, Sunway TaihuLight, Sierra @ LLNL]

CPU Scaling Trends over Past Decades
Single-thread performance is improving only slowly and frequency growth has stalled, but the number of transistors continues to grow and core counts are rising rapidly.
More compute power is packed into a small number of nodes, so both intra-node and inter-node communication must improve!
https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/

Intra-Node Communication in MPI
Eager protocol: data is staged through a shared-memory buffer, requiring two copies; better for small messages.
Rendezvous protocol: data moves directly from the send buffer to the receive buffer, requiring a single copy; better for large messages.
[Slide figures: eager-protocol path through a shared-memory buffer vs. rendezvous-protocol direct copy between the MPI sender and receiver buffers]
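
To make the switch-over concrete, here is a minimal C sketch of the size-based protocol selection described above. The 16 KB threshold and every function name are illustrative assumptions, not MVAPICH2 internals; the stubs only print what a real library would do at each step.

/* Sketch only: size-based eager vs. rendezvous selection (assumed names/threshold). */
#include <stdio.h>
#include <stddef.h>

#define EAGER_THRESHOLD (16 * 1024)   /* assumed cut-over point, not a library default */

static void eager_copy_to_shm(const void *buf, size_t len) {
    /* Copy #1 of 2: sender -> shared-memory slot; the receiver later does copy #2. */
    (void)buf;
    printf("eager: staged %zu bytes in shared memory\n", len);
}

static void rendezvous_send_rts(const void *buf, size_t len) {
    /* Start the RTS/CTS/FIN handshake; the data then moves with one direct copy
     * via a kernel module (CMA, KNEM, LiMiC) or an XPMEM mapping. */
    (void)buf;
    printf("rendezvous: sent RTS for %zu bytes\n", len);
}

static void intranode_send(const void *buf, size_t len) {
    if (len <= EAGER_THRESHOLD)
        eager_copy_to_shm(buf, len);      /* small message: two copies, no handshake */
    else
        rendezvous_send_rts(buf, len);    /* large message: handshake, single copy */
}

int main(void) {
    static char small[1024], large[1 << 20];
    intranode_send(small, sizeof small);
    intranode_send(large, sizeof large);
    return 0;
}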

Existing Rendezvous Protocols in MPI
Control messages (RTS, CTS, FIN) are used to exchange PID, buffer address, length, etc.
Reads and writes are performed through kernel modules (CMA, KNEM, LiMiC) or kernel-assisted page mapping (XPMEM).
The protocol is statically selected: the same protocol is used for all communication.
[Slide figure: message flow of the write-based (RPUT) and read-based (RGET) protocols: RTS, CTS, data write/read, FIN]
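
As an illustration of the read-based (RGET) path, the following self-contained Linux program uses Cross Memory Attach (process_vm_readv), one of the kernel mechanisms named above, to pull data directly from another process after learning its PID, buffer address, and length. The pipe-based "RTS" and all names here are stand-ins for the library's real control channel, so treat this as a sketch of the idea rather than the actual implementation.

/* Sketch of the RGET data path over CMA (Linux only; needs ptrace permission). */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>

struct rts_msg {          /* "Ready To Send": what the sender advertises */
    pid_t  pid;
    void  *addr;
    size_t len;
};

int main(void) {
    int fd[2];
    if (pipe(fd) != 0) { perror("pipe"); return 1; }

    pid_t child = fork();
    if (child == 0) {                      /* sender process */
        static char send_buf[64] = "payload moved with a single copy";
        struct rts_msg rts = { getpid(), send_buf, sizeof send_buf };
        if (write(fd[1], &rts, sizeof rts) != sizeof rts) return 1;  /* RTS */
        sleep(1);                          /* keep the buffer alive while the receiver reads */
        return 0;
    }

    /* receiver process: gets the RTS, then reads straight from the sender's memory */
    struct rts_msg rts;
    if (read(fd[0], &rts, sizeof rts) != sizeof rts) return 1;

    char recv_buf[64] = {0};
    struct iovec local  = { recv_buf, rts.len };
    struct iovec remote = { rts.addr, rts.len };
    ssize_t n = process_vm_readv(rts.pid, &local, 1, &remote, 1, 0);
    if (n < 0) perror("process_vm_readv");
    printf("RGET: read %zd bytes: \"%s\"\n", n, recv_buf);
    /* A real protocol would now send FIN so the sender can reuse its buffer. */

    waitpid(child, NULL, 0);
    return 0;
}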

Research Questions
Are existing rendezvous protocols using all the available resources?
Can rendezvous protocols take advantage of different communication channels for better performance and overlap?
How does the communication pattern affect the performance of different rendezvous protocols?
Are existing protocols ensuring effective overlap of intra-node and inter-node communication?
Do we need to rethink the design of MPI rendezvous protocols to deliver the best performance to applications?

Overview: Introduction, Motivation, Vision and Contribution, Detailed Designs, Experimental Evaluations, Conclusion and Future Work

Limitations of Existing Rendezvous Protocols: Resource Utilization
Write-based protocols (RPUT) are driven by the sender CPU while the receiver CPU idles (and vice versa for RGET).
Can the sender and the receiver cooperate to improve the performance or overlap of point-to-point communication?
[Slide figure: CPU activity in the write-based (RPUT) and read-based (RGET) protocols]

Impact of Communication Pattern on Performance
One-to-all: RGET outperforms RPUT. All-to-one: RPUT outperforms RGET.
Different communication patterns therefore require different rendezvous protocols.
How can MPI processes "discover" the overall communication pattern, and how can MPI libraries dynamically select the appropriate rendezvous protocol for each pattern?
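
For reference, the two patterns can be written directly as MPI point-to-point calls. This is only a sketch of the patterns under discussion (real collectives use tree algorithms rather than flat loops); it illustrates why RGET spreads the copy work across the many receivers in one-to-all, while RPUT spreads it across the many senders in all-to-one.

/* Sketch of the one-to-all and all-to-one patterns with plain MPI calls. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int len = 1 << 20;                  /* large enough to take the rendezvous path */
    char *buf = malloc(len);

    /* One-to-all: rank 0 is the lone sender; under RGET each receiver's CPU copies. */
    if (rank == 0)
        for (int i = 1; i < size; i++)
            MPI_Send(buf, len, MPI_CHAR, i, 0, MPI_COMM_WORLD);
    else
        MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* All-to-one: rank 0 is the lone receiver; under RPUT each sender's CPU copies. */
    if (rank == 0)
        for (int i = 1; i < size; i++)
            MPI_Recv(buf, len, MPI_CHAR, i, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    else
        MPI_Send(buf, len, MPI_CHAR, 0, 1, MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    return 0;
}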

Progress and Overlap of Multiple Concurrent Communications
Intra-node communication is driven by the CPU itself; the blocking nature of the data copy prevents the CPU from processing control messages and issuing RDMA reads/writes.
This limits the concurrency of intra-node communication and reduces the overlap of intra-node and inter-node communication.
Can new rendezvous protocols be designed to allow better progress?

Overview: Introduction, Motivation, Vision and Contribution, Detailed Designs, Experimental Evaluations, Conclusion and Future Work

Broad Vision and Contribution
Rethink MPI rendezvous protocols with the goal of improved performance and overlap, and design efficient, dynamic "cooperative" rendezvous protocols with multiple levels of cooperation:
Cooperation between the sender and the receiver for improved resource utilization.
Cooperation among processes on the same node to intelligently adapt to the application's communication pattern.
Cooperation among processes across nodes for improved overlap of intra-node and inter-node communication.

Overview: Introduction, Motivation, Vision and Contribution, Detailed Designs, Experimental Evaluations, Conclusion and Future Work

Cooperation between the Sender and the Receiver (COOP-p2p)
Combines the RGET and RPUT protocols: control messages (RTS, CTS) are used to exchange buffer addresses and lengths, the sender writes half of the data directly into the receiver's memory, and the receiver reads the rest of the data from the sender's memory.
Reads and writes happen concurrently, utilizing both the sender's and the receiver's CPUs to extract more parallelism and reduce latency.
[Slide figure: proposed COOP-p2p protocol, with RTS/CTS exchange, concurrent read and write, and FIN in both directions]
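
The following runnable sketch mimics the COOP-p2p split inside a single process: two threads stand in for the sender's and receiver's CPUs, and memcpy stands in for the kernel-assisted copy (CMA/KNEM/XPMEM) between address spaces. All names and the even 50/50 split are illustrative assumptions, not the library's actual code.

/* Sketch of the COOP-p2p half-and-half copy (compile with -pthread). */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define LEN (1 << 20)
static char send_buf[LEN], recv_buf[LEN];

static void *sender_cpu(void *arg) {      /* RPUT-style: sender pushes the first half */
    (void)arg;
    memcpy(recv_buf, send_buf, LEN / 2);
    return NULL;
}

static void *receiver_cpu(void *arg) {    /* RGET-style: receiver pulls the second half */
    (void)arg;
    memcpy(recv_buf + LEN / 2, send_buf + LEN / 2, LEN - LEN / 2);
    return NULL;
}

int main(void) {
    memset(send_buf, 'x', LEN);

    pthread_t s, r;
    pthread_create(&s, NULL, sender_cpu, NULL);    /* both halves move ...          */
    pthread_create(&r, NULL, receiver_cpu, NULL);  /* ... at the same time          */
    pthread_join(s, NULL);
    pthread_join(r, NULL);                         /* joining stands in for the FIN exchange */

    printf("copied %d bytes, match = %d\n", LEN, memcmp(send_buf, recv_buf, LEN) == 0);
    return 0;
}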

Offloading Point-to-point Communication for Overlap (COOP-hca)
Normally the CPU both performs the copy and progresses the communication, leaving it unavailable for application computation (low overlap).
The copy can be offloaded to the DMA engines of the HCA to improve CPU availability, and multiple channels (both CPU and HCA) can be used to further reduce latency.
Intelligent striping across the channels is required to get the best performance.
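
As a hedged illustration of that striping decision, the sketch below splits one message between the CPU copy path and the HCA DMA path in proportion to assumed bandwidths, so that both portions finish at roughly the same time. The bandwidth constants and helper names are assumptions for illustration, not measured values or real library APIs.

/* Sketch of a bandwidth-proportional split for COOP-hca striping (assumed rates). */
#include <stdio.h>
#include <stddef.h>

static const double CPU_COPY_GBPS = 8.0;   /* assumed shared-memory copy rate   */
static const double HCA_DMA_GBPS  = 12.0;  /* assumed loopback RDMA (DMA) rate  */

static void split_for_striping(size_t len, size_t *cpu_part, size_t *hca_part)
{
    double f = CPU_COPY_GBPS / (CPU_COPY_GBPS + HCA_DMA_GBPS);
    *cpu_part = (size_t)(len * f);
    *hca_part = len - *cpu_part;
}

int main(void) {
    size_t cpu_part, hca_part;
    split_for_striping((size_t)4 << 20, &cpu_part, &hca_part);
    printf("CPU copies %zu bytes, HCA DMA moves %zu bytes\n", cpu_part, hca_part);
    /* The CPU portion would be memcpy'd as usual; the HCA portion would be posted
     * as a loopback RDMA and complete without the CPU, freeing it for computation. */
    return 0;
}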

Cooperation Based on Communication Primitive (COOP-coll)
Determine the need for overlap at the sender and the receiver: Isend/Irecv implies overlap is wanted, Send/Recv implies no overlap is wanted. This also works for collectives built using point-to-point primitives.
Enhance the control messages (RTS/CTS) to convey protocol information to the peer process, ensuring that both sender and receiver decide to use the same protocol.
[Slide figure: decision tree for the COOP-coll protocol]
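
A minimal sketch of such a decision follows, assuming the blocking/non-blocking flags are carried in the RTS/CTS so that both peers evaluate the same inputs and reach the same answer. The enum and function names are illustrative, and the real COOP-coll decision tree has more cases than this.

/* Sketch of a COOP-coll-style protocol decision from the call types (assumed names). */
#include <stdio.h>
#include <stdbool.h>

typedef enum { PROTO_RGET, PROTO_RPUT, PROTO_COOP } proto_t;

/* Both flags travel in the RTS/CTS, so sender and receiver compute the same result. */
static proto_t choose_protocol(bool sender_nonblocking, bool receiver_nonblocking)
{
    if (sender_nonblocking && !receiver_nonblocking)
        return PROTO_RGET;   /* Isend + Recv: receiver is blocked anyway, let it copy */
    if (!sender_nonblocking && receiver_nonblocking)
        return PROTO_RPUT;   /* Send + Irecv: sender is blocked anyway, let it copy   */
    return PROTO_COOP;       /* otherwise: cooperate and split the copy               */
}

int main(void) {
    const char *name[] = { "RGET", "RPUT", "COOP" };
    printf("Isend/Recv -> %s\n", name[choose_protocol(true,  false)]);
    printf("Send/Irecv -> %s\n", name[choose_protocol(false, true)]);
    printf("Send/Recv  -> %s\n", name[choose_protocol(false, false)]);
    return 0;
}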

Dynamic Load Balancing among Processes (COOP-load)
MPI collectives are often imbalanced (one-to-many or many-to-one communication patterns), and point-to-point patterns such as stencil exchanges have both intra-node and inter-node components; this can lead to load imbalance on the CPUs driving the copies.
Use the number of pending rendezvous copies as a heuristic for CPU load: it is incremented when an RTS/CTS is received, before the copy is performed, and decremented when a FIN is received or the copy operation finishes.
Share this "load" information through shared-memory regions and dynamically select the protocol based on load.
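
A minimal sketch of that load heuristic is shown below, assuming a per-rank counter of pending rendezvous copies. In the actual design the counters would live in a node-wide shared-memory region visible to every process; here they are process-local atomics purely for illustration, and all names are assumptions.

/* Sketch of the pending-copy load counter and the resulting protocol choice. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_LOCAL_RANKS 64

/* In the real design this array sits in shared memory mapped by all node-local ranks. */
static _Atomic int pending_copies[MAX_LOCAL_RANKS];

static void rendezvous_copy_started(int rank)  { atomic_fetch_add(&pending_copies[rank], 1); }  /* on RTS/CTS */
static void rendezvous_copy_finished(int rank) { atomic_fetch_sub(&pending_copies[rank], 1); }  /* on FIN / copy done */

/* Returns true if the sender should drive the copy (RPUT-like), false for RGET-like. */
static bool sender_should_copy(int sender_rank, int receiver_rank)
{
    return atomic_load(&pending_copies[sender_rank]) <=
           atomic_load(&pending_copies[receiver_rank]);
}

int main(void) {
    rendezvous_copy_started(3);   /* rank 3 already has a copy in flight */
    printf("sender 3 -> receiver 5: sender copies? %d\n", sender_should_copy(3, 5));
    printf("sender 5 -> receiver 3: sender copies? %d\n", sender_should_copy(5, 3));
    rendezvous_copy_finished(3);
    return 0;
}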

Cooperation among Processes Across Nodes (COOP-chunked)
Intra-node copies of large messages are time-consuming and block the CPU, which then cannot process control messages or issue RDMA operations to the HCA; the result is a lack of overlap between intra-node and inter-node messages.
Use a chunking-based design to progress the intra-node communication in pieces, improving CPU availability for progressing inter-node communication.
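
A minimal sketch of the chunking idea: the long copy is broken into bounded pieces, and the progress engine is polled between pieces so control messages and RDMA operations keep moving. The chunk size and function names are illustrative assumptions, with an empty stub standing in for the library's progress engine.

/* Sketch of a chunked intra-node copy interleaved with progress polling. */
#include <stdio.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE (512 * 1024)   /* assumed chunk size */

/* Stand-in for the MPI progress engine: process RTS/CTS/FIN, poll the HCA
 * completion queue, and issue any pending RDMA operations. */
static void mpi_progress_poke(void) { /* intentionally empty in this sketch */ }

static void chunked_intranode_copy(char *dst, const char *src, size_t len)
{
    for (size_t off = 0; off < len; off += CHUNK_SIZE) {
        size_t n = (len - off < CHUNK_SIZE) ? len - off : CHUNK_SIZE;
        memcpy(dst + off, src + off, n);   /* one bounded chunk ...                     */
        mpi_progress_poke();               /* ... then let inter-node traffic advance   */
    }
}

int main(void) {
    static char src[4 << 20], dst[4 << 20];
    chunked_intranode_copy(dst, src, sizeof src);
    printf("copied %zu bytes in %zu-byte chunks\n", sizeof src, (size_t)CHUNK_SIZE);
    return 0;
}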

Hybrid Cooperative Protocol
Combines aspects of the different rendezvous protocols discussed and applies the one best suited to each scenario (a selection sketch follows this list):
COOP-p2p for blocking Send/Recv operations.
COOP-coll for collectives and for Send/Irecv or Isend/Recv communication patterns.
COOP-load for Isend/Irecv communication.
COOP-chunked for messages larger than a system-specific threshold.
COOP-hca if the system is undersubscribed.
This hybrid protocol is used for the application-level evaluations (coming up!).
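
Folding the rules listed above into one selection function gives a sketch like the following. The precedence among the rules, the struct fields, and the threshold value are assumptions made for illustration; the actual MVAPICH2-X decision logic is more involved.

/* Sketch of the hybrid selection rules from the slide (assumed precedence and names). */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { COOP_P2P, COOP_COLL, COOP_LOAD, COOP_CHUNKED, COOP_HCA } coop_proto_t;

typedef struct {
    size_t msg_len;
    bool   sender_nonblocking;    /* Isend?                                   */
    bool   receiver_nonblocking;  /* Irecv?                                   */
    bool   inside_collective;     /* point-to-point issued by a collective    */
    bool   node_undersubscribed;  /* idle cores / HCA free for offload        */
    size_t chunk_threshold;       /* system-specific tuning value             */
} rndv_ctx_t;

static coop_proto_t select_protocol(const rndv_ctx_t *c)
{
    if (c->msg_len > c->chunk_threshold)  return COOP_CHUNKED;
    if (c->node_undersubscribed)          return COOP_HCA;
    if (c->inside_collective ||
        c->sender_nonblocking != c->receiver_nonblocking)
        return COOP_COLL;                 /* Send/Irecv or Isend/Recv */
    if (c->sender_nonblocking && c->receiver_nonblocking)
        return COOP_LOAD;                 /* Isend/Irecv              */
    return COOP_P2P;                      /* blocking Send/Recv       */
}

int main(void) {
    rndv_ctx_t c = { .msg_len = 1 << 20, .sender_nonblocking = false,
                     .receiver_nonblocking = false, .inside_collective = false,
                     .node_undersubscribed = false, .chunk_threshold = (size_t)8 << 20 };
    printf("blocking Send/Recv, 1 MB message -> protocol %d (COOP_P2P)\n", select_protocol(&c));
    return 0;
}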

Overview: Introduction, Motivation, Vision and Contribution, Detailed Designs, Experimental Evaluations, Conclusion and Future Work

Experimental Setup

Specification     | Xeon Phi             | Xeon                      | OpenPower
Processor Family  | Knights Landing      | Broadwell                 | IBM POWER-8
Processor Model   | KNL 7250             | E5-2680                   | PPC64LE
Clock Speed       | 1.4 GHz              | 2.4 GHz                   | 3.4 GHz
No. of Sockets    | 1                    | 2                         | 2
Cores per Socket  | 68                   | 14                        | 10
Threads per Core  | 4                    | -                         | 8
Memory Config     | Cache                | NUMA                      | NUMA
RAM (DDR)         | 96 GB                | 128 GB                    | 256 GB
HBM (MCDRAM)      | 16 GB                | -                         | -
Interconnect      | Omni-Path (100 Gbps) | InfiniBand EDR (100 Gbps) | InfiniBand EDR (100 Gbps)
MPI Library       | MVAPICH2X-2.3rc1, OpenMPI v3.1.0 (all systems)

Overview of the MVAPICH2 Project
High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE).
MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1): started in 2001, first version available in 2002. MVAPICH2-X (MPI + PGAS), available since 2011. Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014. Support for virtualization (MVAPICH2-Virt), available since 2015. Support for energy-awareness (MVAPICH2-EA), available since 2015. Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015.
Used by more than 2,950 organizations in 86 countries; more than 500,000 (> 0.5 million) downloads from the OSU site directly.
Empowering many TOP500 clusters (June '18 ranking): 2nd, 10,649,600 cores (Sunway TaihuLight) at the National Supercomputing Center in Wuxi, China; 12th, 556,104 cores (Oakforest-PACS) in Japan; 15th, 367,024 cores (Stampede2) at TACC; 24th, 241,108 cores (Pleiades) at NASA; 62nd, 76,032 cores (Tsubame 2.5) at Tokyo Institute of Technology.
Available with the software stacks of many vendors and Linux distros (RedHat and SuSE).
Empowering Top500 systems for over a decade: from System-X at Virginia Tech (3rd in Nov 2003; 2,200 processors, 12.25 TFlops) to Stampede at TACC (12th in Jun '16; 462,462 cores, 5.168 PFlops).
http://mvapich.cse.ohio-state.edu

Performance of Point-to-point Operations
Intra-node communication latency for large messages improved by up to 2x on Broadwell and OpenPower, and by up to 1.75x on KNL (due to L2 cache sharing and the lack of an L3 cache). Similar improvements were observed for unidirectional bandwidth.

Impact on Performance of Collectives
RGET performs better for one-to-all communication, RPUT performs better for all-to-one communication, and neither provides the best performance for mixed communication patterns.
COOP-coll performs equal to or better than the existing protocols for all communication patterns.
[Slide plots: one-to-all, all-to-one, and Reduce-Scatter+Gather benchmarks]

Impact on Performance of 3DStencil
COOP-load outperforms the existing design by up to 10%: it equalizes the number of copy operations across ranks and reduces the variance in time spent copying across processes.

Impact on Application Performance
Graph processing: up to 19% improvement for Graph500 at 896 processes.
Molecular dynamics: up to 16% improvement for CoMD at 896 processes.
Halo exchange: up to 10% improvement for MiniGhost at 448 processes.
[Slide plots: Graph500, CoMD, and MiniGhost runtimes]

Overview: Introduction, Motivation, Vision and Contribution, Detailed Designs, Experimental Evaluations, Conclusion and Future Work

Conclusion and Future Work
Existing MPI rendezvous protocols are suboptimal for emerging architectures: the lack of cooperation among participating processes costs performance, overlap, and adaptivity to diverse communication patterns.
Designed novel rendezvous protocols based on cooperation among processes: they take advantage of available resources, dynamically adapt to the application's communication pattern, and improve the overlap of intra-node and inter-node communication.
Up to 20% reduction in application runtime for Graph500, CoMD, and MiniGhost.
The proposed designs are available in MVAPICH2X-2.3rc2 and deployed on TACC Stampede2 (17th in the TOP500); larger-scale runs are planned in the future.
Download from http://mvapich.cse.ohio-state.edu/downloads/

Thank You!
{ chakraborty.52, bayatpour.1, hashmi.29, subramoni.1, panda.2 }@ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/