Efficient Collective Operations using Remote Memory Operations on VIA-Based Clusters
Rinku Gupta (Dell Computers), Dhabaleswar Panda (The Ohio State University), Pavan Balaji (The Ohio State University), Jarek Nieplocha (Pacific Northwest National Lab)

Contents
- Motivation
- Design Issues
- RDMA-based Broadcast
- RDMA-based All Reduce
- Conclusions and Future Work

Motivation
Communication characteristics of parallel applications:
- Point-to-Point Communication: Send and Receive primitives
- Collective Communication: Barrier, Broadcast, Reduce, All Reduce; built over Send-Receive communication primitives
Communication methods for modern protocols:
- Send and Receive model
- Remote Direct Memory Access (RDMA) model

Remote Direct Memory Access
Remote Direct Memory Access (RDMA) model:
- RDMA Write
- RDMA Read (optional)
Widely supported by modern protocols and architectures:
- Virtual Interface Architecture (VIA)
- InfiniBand Architecture (IBA)
Open questions:
- Can RDMA be used to optimize collective communication? [rin02]
- Do we need to rethink algorithms optimized for Send-Receive?
[rin02]: "Efficient Barrier using Remote Memory Operations on VIA-based Clusters", Rinku Gupta, V. Tipparaju, J. Nieplocha, D. K. Panda. Presented at Cluster 2002, Chicago, USA.

Send-Receive and RDMA Communication Models
[Figure: data paths of the Send/Recv and RDMA Write models between registered user buffers and the NIC; Send/Recv requires a descriptor to be posted at both sender and receiver, while RDMA Write posts a descriptor only at the sender and writes directly into the registered remote buffer]

Benefits of RDMA
- RDMA gives a shared-memory illusion
- Receive operations are typically expensive
- RDMA is receiver-transparent
- Supported by the VIA and InfiniBand architectures
- A novel, largely unexplored method for collective communication

Contents
- Motivation
- Design Issues
  - Buffer Registration
  - Data Validity at Receiver End
  - Buffer Reuse
- RDMA-based Broadcast
- RDMA-based All Reduce
- Conclusions and Future Work

Buffer Registration
Static Buffer Registration
- A contiguous region in memory for every communicator
- Address exchange is done at initialization time
Dynamic Buffer Registration (Rendezvous)
- User buffers are registered during the operation, when needed
- Address exchange is done during the operation
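
To make the contrast concrete, below is a minimal C sketch of the static scheme, assuming a VIA-style registration step; the helper names (via_register_memory, exchange_rdma_addresses) and the layout constants are illustrative stand-ins, not the implementation used in this work.

```c
/* Minimal sketch of static buffer registration for a communicator.
 * The registration and address-exchange helpers are hypothetical
 * stand-ins for the real VIA (VIPL) and MPI plumbing. */
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE (5 * 1024 + 1)       /* one constant-size block per peer */

/* Hypothetical stand-in: pin a region so the NIC can RDMA into it. */
static int via_register_memory(void *addr, size_t len)
{
    (void)addr; (void)len;
    return 0;                           /* real code would call the VIPL API */
}

/* Hypothetical stand-in: all-to-all exchange of buffer addresses/handles. */
static void exchange_rdma_addresses(void *local, void **remote, int nprocs)
{
    for (int i = 0; i < nprocs; i++)    /* placeholder: pretend every peer   */
        remote[i] = local;              /* exported the same local address   */
}

typedef struct {
    void  *base;          /* registered collective buffer (nprocs blocks)   */
    void **peer_bases;    /* remote base addresses, learned at init time    */
    int    nprocs;
} coll_buf_t;

/* Static scheme: register one contiguous region per communicator and
 * exchange its address once, during initialization. */
static int coll_buf_init(coll_buf_t *cb, int nprocs)
{
    size_t len = (size_t)nprocs * BLOCK_SIZE;

    cb->nprocs = nprocs;
    cb->base = calloc(1, len);
    cb->peer_bases = malloc((size_t)nprocs * sizeof(void *));
    if (!cb->base || !cb->peer_bases) return -1;

    if (via_register_memory(cb->base, len) != 0) return -1;
    exchange_rdma_addresses(cb->base, cb->peer_bases, nprocs);
    return 0;
}

int main(void)
{
    coll_buf_t cb;
    if (coll_buf_init(&cb, 4) == 0)
        printf("registered %d blocks of %d bytes\n", cb.nprocs, BLOCK_SIZE);
    return 0;
}
```

The dynamic (rendezvous) scheme would instead register the user buffer and exchange its address inside the collective call itself, paying that cost on every operation.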

Data Validity at Receiver End
- Interrupts: too expensive; might not be supported
- Use the Immediate field of the VIA descriptor: consumes a receive descriptor
- RDMA-write a special byte to a pre-defined location
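
A minimal sketch of the last option, polling on a flag byte that the sender sets with its RDMA write; the flag placement and reset policy are illustrative assumptions.

```c
/* Sketch of completion detection by polling a pre-defined flag byte that the
 * sender sets with its RDMA write.  No interrupt and no receive descriptor
 * are involved; the flag lives inside the registered notify area. */
#include <stdint.h>

static inline void wait_for_rdma_data(volatile uint8_t *flag)
{
    while (*flag == 0)
        ;               /* busy-wait until the sender's RDMA write lands */
    *flag = 0;          /* clear so the location can signal the next message */
}
```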

Buffer Reuse
Static Buffer Registration
- Buffers need to be reused
- Explicit notification has to be sent to the sender
Dynamic Buffer Registration
- No buffer reuse

Contents
- Motivation
- Design Issues
- RDMA-based Broadcast
  - Design Issues
  - Experimental Results
  - Analytical Models
- RDMA-based All Reduce
- Conclusions and Future Work

Buffer Registration and Initialization
Static Registration Scheme (for size <= 5K bytes)
[Figure: each process (P0-P3) holds constant-size blocks for every peer plus a notify buffer]
Dynamic Registration Scheme (for size > 5K bytes): rendezvous scheme

Data Validity at Receiver End
[Figure: first broadcast with root P0 (broadcast counter = 1); P0 RDMA-writes the data together with the data size and the broadcast counter into each receiver's constant-size block and notify buffer]
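
One way to realize this, sketched below under assumed field names, is to append the data size and the broadcast counter to each block and have the receiver poll the counter; the exact layout in the paper may differ.

```c
/* Illustrative per-block layout for counter-based validity checking. */
#include <stdint.h>
#include <string.h>

#define BLOCK_PAYLOAD (5 * 1024)

typedef struct {
    uint8_t           payload[BLOCK_PAYLOAD];
    uint32_t          data_size;   /* written by the root together with the data */
    volatile uint32_t bcast_ctr;   /* incremented by one for every broadcast     */
} bcast_block_t;

/* Returns 1 and copies the data out once the block holds the broadcast we are
 * waiting for (expected_ctr = 1 for the first broadcast from this root). */
static int bcast_try_recv(const bcast_block_t *blk, uint32_t expected_ctr,
                          void *out, size_t out_cap)
{
    if (blk->bcast_ctr != expected_ctr)
        return 0;                                   /* data not written yet */
    size_t n = blk->data_size < out_cap ? blk->data_size : out_cap;
    memcpy(out, blk->payload, n);
    return 1;
}
```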

Buffer Reuse
[Figure: receivers P1-P3 notify the root P0 through the notify buffer before the broadcast buffer is reused]

Performance Test Bed
- 16 1 GHz PIII nodes, 33 MHz PCI bus, 512 MB RAM
- Machines connected using a GigaNet cLAN 5300 switch
- MVICH version: mvich-1.0
- Integration with MVICH-1.0: MPI_Send modified to support RDMA Write
- Timings were taken for varying block sizes (tradeoff between number of blocks and size of blocks)

RDMA vs. Send-Receive Broadcast (16 nodes)
- Improvement ranging from 14.4% (large messages) to 19.7% (small messages)
- A block size of 3K performs the best

Analytical and Experimental Comparison (16 nodes): Broadcast
- Difference between analytical and experimental results is less than 7%

RDMA vs. Send-Receive for Large Clusters (Analytical Model Estimates: Broadcast)
- Estimated improvement ranging from 16% (small messages) to 21% (large messages) for large clusters of 512 and 1024 nodes

Contents
- Motivation
- Design Issues
- RDMA-based Broadcast
- RDMA-based All Reduce
  - Degree-K tree
  - Experimental Results (Binomial & Degree-K)
  - Analytical Models (Binomial & Degree-K)
- Conclusions and Future Work

Degree-K Tree-based Reduce
[Figure: reduce trees over 8 processes (P0-P7) for K = 1, K = 3 and K = 7, taking 3, 2 and 1 communication steps respectively]
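
For reference, a small sketch of the send schedule such a tree implies for a reduce toward rank 0 (K = 1 reproduces the binomial tree); the helper is illustrative and not the paper's implementation.

```c
#include <stdio.h>

/* Print, for each step, which rank forwards its partial result to which
 * parent in a degree-K reduce toward rank 0. */
static void degree_k_reduce_schedule(int nprocs, int k)
{
    int step = 1;
    for (long span = k + 1; ; span *= (k + 1), step++) {
        long stride = span / (k + 1);          /* distance between still-active ranks */
        for (long r = stride; r < nprocs; r += stride) {
            if (r % span != 0)                 /* child: send to its group leader */
                printf("step %d: rank %ld -> rank %ld\n", step, r, (r / span) * span);
        }
        if (span >= nprocs)                    /* one rank now holds the full result */
            break;
    }
}

int main(void)
{
    degree_k_reduce_schedule(8, 3);            /* two steps, matching the K = 3 tree */
    degree_k_reduce_schedule(8, 7);            /* one step, matching the K = 7 tree  */
    return 0;
}
```

Larger K means fewer steps but more children reporting to each parent in a step, which is the tradeoff the analytical models capture.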

Experimental Evaluation
- Integrated into MVICH-1.0
- Reduction operation = MPI_SUM
- Data type = 1 INT (element size = 4 bytes)
- Count = 1 (4 bytes) to 1024 (4096 bytes)
- Finding the optimal degree K
- Experimental vs. analytical (best case and worst case)
- Experimental and analytical comparison of Send-Receive with RDMA
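
The measured operation corresponds to an MPI call of the following shape; this is a minimal sketch of the setup described above (powers-of-two counts are assumed here), not the benchmark harness used in the paper.

```c
/* Minimal sketch of the All Reduce call being measured: MPI_SUM over MPI_INT,
 * with the count varied from 1 (4 bytes) to 1024 (4096 bytes). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int sendbuf[1024], recvbuf[1024];
    for (int i = 0; i < 1024; i++) sendbuf[i] = rank;

    for (int count = 1; count <= 1024; count *= 2) {   /* powers of two assumed */
        double t0 = MPI_Wtime();
        MPI_Allreduce(sendbuf, recvbuf, count, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("count=%4d (%5d bytes): %.2f us\n",
                   count, count * 4, (t1 - t0) * 1e6);
    }

    MPI_Finalize();
    return 0;
}
```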

Choosing the Optimal Degree-K for All Reduce
[Table: optimal degree K for message-size ranges 4-256 B, 256 B-1 KB and beyond 1 KB on 4, 8 and 16 nodes; the best choice varies among degree-1, degree-3 and degree-7]
For lower message sizes, higher degrees perform better than degree-1 (binomial)

Degree-K RDMA-based All Reduce: Analytical Model
- Experimental timings fall between the best-case and the worst-case analytical estimates
- For lower message sizes, higher degrees perform better than degree-1 (binomial)
[Table: analytically estimated optimal degree K for the same message-size ranges, extended to 256-node and 512-node clusters]

Binomial Send-Receive vs. Optimal and Binomial Degree-K RDMA (16 nodes): All Reduce
- Improvement ranging from 9% (large messages) to 38.13% (small messages) for the optimal degree-K RDMA-based All Reduce compared to binomial Send-Receive

Binomial Send-Receive vs. Binomial and Optimal Degree-K All Reduce for Large Clusters
- Improvement ranging from 14% (large messages) to 35-40% (small messages) for the optimal degree-K RDMA-based All Reduce compared to binomial Send-Receive

Contents
- Motivation
- Design Issues
- RDMA-based Broadcast
- RDMA-based All Reduce
- Conclusions and Future Work

Conclusions
- A novel method to implement the collective communication library
- Degree-K algorithm to exploit the benefits of RDMA
- Implemented RDMA-based Broadcast and All Reduce
  - Broadcast: 19.7% improvement for small and 14.4% for large messages (16 nodes)
  - All Reduce: 38.13% for small messages, 9.32% for large messages (16 nodes)
- Analytical models for Broadcast and All Reduce estimate performance benefits on large clusters
  - Broadcast: 16-21% for 512- and 1024-node clusters
  - All Reduce: 14-40% for 512- and 1024-node clusters

Future Work
- Exploit the RDMA Read feature, if available
  - Round-trip cost design issues
- Extend to MPI-2.0 one-sided communication
- Extend the framework to the emerging InfiniBand architecture

Thank You!
For more information, please visit the Network-Based Computing (NBC) Group home page, The Ohio State University.

Backup Slides

Receiver Side: Best Case for Large Messages (Analytical Model)
T = (Tt * k) + Tn + Ts + To + Tc, where k is the number of sending nodes
[Figure: timeline of overlapped arrivals from P1, P2 and P3 at the receiver; the overhead To is paid only once]

Receiver Side: Worst Case for Large Messages (Analytical Model)
T = (Tt * k) + Tn + Ts + (To * k) + Tc, where k is the number of sending nodes
[Figure: timeline of arrivals from P1, P2 and P3 in which the receiver pays the overhead To once per sending node]

Buffer Registration and Initialization
Static Registration Scheme (for size <= 5K)
[Figure: per-process layout of the registered region across P0-P3, with a constant block of size 5K+1 for each peer]
Each block is of size 5K+1 bytes. Every process has N blocks, where N is the number of processes in the communicator.

Data Validity at Receiver End
[Figure: All Reduce example across P0-P3 showing incoming data being combined into the computed data at each step]