The Development of Mellanox-NVIDIA GPUDirect over InfiniBand: A New Model for GPU-to-GPU Communications. Gilad Shainer.

Overview
• The GPUDirect project was announced in November 2009: "NVIDIA Tesla GPUs To Communicate Faster Over Mellanox InfiniBand Networks"
• GPUDirect was developed by Mellanox and NVIDIA
  - New interface (API) within the NVIDIA Tesla GPU driver
  - New interface within the Mellanox InfiniBand drivers
  - Linux kernel modification
• GPUDirect availability was announced in May 2010: "Mellanox Scalable HPC Solutions with NVIDIA GPUDirect Technology Enhance GPU-Based HPC Performance and Efficiency"

Why GPU Computing?
• GPUs provide a cost-effective way to build supercomputers
• Dense packaging of compute FLOPS with high memory bandwidth

The InfiniBand Architecture
• Industry standard defined by the InfiniBand Trade Association
• Defines a System Area Network architecture
  - Comprehensive specification: from the physical layer up to applications
• The architecture supports
  - Host Channel Adapters (HCA)
  - Target Channel Adapters (TCA)
  - Switches
  - Routers
• Facilitates HW designs for
  - Low latency / high bandwidth
  - Transport offload
[Slide diagram: an InfiniBand subnet connecting processor nodes (via HCAs), switches, a router, a storage subsystem and RAID (via TCAs), consoles, Ethernet and Fibre Channel gateways, all under a Subnet Manager]
(A short HCA-enumeration sketch follows this slide.)
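For readers who want to see the HCA concept in code, here is a small standalone sketch; it is not part of the presentation, and it assumes a Linux host with libibverbs and at least one InfiniBand adapter installed. It simply enumerates the local HCAs and prints basic attributes of each adapter's first port.

/* Hedged illustration, not slide content: list local HCAs and query port 1.
 * Build with: cc hca_list.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no InfiniBand devices found\n");
        return 1;
    }
    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        struct ibv_port_attr port;
        if (ctx && ibv_query_port(ctx, 1, &port) == 0)
            printf("%s: port 1 state=%d lid=0x%x width=%d speed=%d\n",
                   ibv_get_device_name(devs[i]),
                   port.state, port.lid,
                   port.active_width, port.active_speed);
        if (ctx)
            ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}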

InfiniBand Link Speed Roadmap
• Per-lane signaling rate (Gb/s): DDR 5, QDR 10, FDR 14 (14.0625), EDR 26 (25.78125)
• Per-link bandwidth per direction (Gb/s), by lane count:

    Lanes   DDR   QDR   FDR   EDR
    1x        5    10    14    25
    4x       20    40    56   100
    8x       40    80   112   200
    12x      60   120   168   300

• HDR and NDR generations (1x, 4x, 8x, 12x) follow on the roadmap, driven by market demand
(A small worked bandwidth calculation follows this slide.)
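As a worked example of how the per-link figures above come about, the short program below multiplies lane count by per-lane signaling rate, and also shows the usable data rate after line-code overhead (8b/10b for DDR/QDR, 64b/66b for FDR/EDR; these are standard InfiniBand parameters rather than slide content). For instance, 4x QDR gives 40 Gb/s signaling and 32 Gb/s of data, matching the "40 Gb/s node to node" figure later in the deck.

/* Illustrative calculation: per-link signaling rate = lanes x per-lane rate;
 * usable data rate also accounts for the line code. */
#include <stdio.h>

struct ib_gen { const char *name; double lane_gbps; double efficiency; };

int main(void)
{
    const struct ib_gen gens[] = {
        { "DDR",  5.0,       8.0 / 10.0 },
        { "QDR", 10.0,       8.0 / 10.0 },
        { "FDR", 14.0625,   64.0 / 66.0 },
        { "EDR", 25.78125,  64.0 / 66.0 },
    };
    const int widths[] = { 1, 4, 8, 12 };

    for (size_t g = 0; g < sizeof gens / sizeof gens[0]; g++)
        for (size_t w = 0; w < sizeof widths / sizeof widths[0]; w++)
            printf("%2dx %s: %6.1f Gb/s signaling, %6.1f Gb/s data\n",
                   widths[w], gens[g].name,
                   widths[w] * gens[g].lane_gbps,
                   widths[w] * gens[g].lane_gbps * gens[g].efficiency);
    return 0;
}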

Efficient HPC
[Slide diagram: Mellanox interconnect capabilities built around MPI: GPUDirect, CORE-Direct offloading, congestion control, adaptive routing, management, messaging accelerations, advanced auto-negotiation]

GPU – InfiniBand Based Supercomputers
• The GPU-InfiniBand architecture enables cost-effective supercomputers
  - Lower system cost, less space, lower power/cooling costs
• Example: National Supercomputing Centre in Shenzhen (NSCS)
  - 5K end-points (nodes), Mellanox InfiniBand plus GPUs
  - PetaScale supercomputer (#2 on the Top500)

GPUs – InfiniBand Balanced Supercomputers
• GPUs place higher demands on the cluster communication
[Slide diagram: an NVIDIA "Fermi" GPU (512 cores) and CPU feeding data onto the network; GPU, CPU, and network must be balanced]

System Architecture
• The InfiniBand Architecture (IBA) is an industry-standard fabric designed for high bandwidth, low-latency computing, scalability to ten-thousand-node clusters with multiple CPU/GPU cores per server platform, and efficient utilization of compute resources
• 40 Gb/s of bandwidth node to node
• Up to 120 Gb/s between switches
• Latency of 1 μs

GPU-InfiniBand Bottleneck (pre-GPUDirect)
• GPU communications use "pinned" buffers
  - A section of host memory dedicated to the GPU
  - Allows optimizations such as write-combining and overlapping GPU computation with data transfer for best performance
• InfiniBand also uses "pinned" buffers for efficient RDMA
  - Zero-copy data transfers, kernel bypass
[Slide diagram: CPU, chipset, GPU with GPU memory, InfiniBand adapter, and system memory]

GPU-InfiniBand Bottleneck (pre-GPUDirect)
• The CPU has to be involved in the GPU-to-GPU data path
  - For example, memory copies between the different "pinned" buffers
  - This slows down GPU communications and creates a communication bottleneck (see the sketch below)
[Slide diagram: transmit and receive sides, each with CPU, chipset, GPU with GPU memory, InfiniBand adapter, and system memory]
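The following is a simplified, hypothetical reconstruction of that pre-GPUDirect transmit path, not code from the presentation; buffer names and sizes are invented, and queue-pair setup, error handling, and the actual send are omitted. It shows the two independently pinned host buffers and the extra CPU memcpy between them that the slide describes.

/* Minimal sketch of the pre-GPUDirect transmit path (assumptions:
 * CUDA runtime + libibverbs present; QP setup and error handling omitted). */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (1 << 20)   /* illustrative 1 MB message */

int main(void)
{
    /* Host buffer pinned by the CUDA driver for fast GPU<->host copies. */
    void *cuda_pinned, *d_buf;
    cudaHostAlloc(&cuda_pinned, BUF_SIZE, cudaHostAllocDefault);
    cudaMalloc(&d_buf, BUF_SIZE);

    /* A second host buffer, pinned/registered by the InfiniBand driver. */
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx  = ibv_open_device(devs[0]);
    struct ibv_pd *pd        = ibv_alloc_pd(ctx);
    void *ib_pinned          = malloc(BUF_SIZE);
    struct ibv_mr *mr = ibv_reg_mr(pd, ib_pinned, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE);

    /* Step 1: GPU -> CUDA-pinned host buffer. */
    cudaMemcpy(cuda_pinned, d_buf, BUF_SIZE, cudaMemcpyDeviceToHost);

    /* Step 2: extra CPU memcpy into the IB-registered buffer before the
     * send can be posted -- this copy is the bottleneck the slide
     * describes and the one GPUDirect removes. */
    memcpy(ib_pinned, cuda_pinned, BUF_SIZE);

    /* ... create QP, post send of `mr`, poll completion (omitted) ... */

    ibv_dereg_mr(mr);
    free(ib_pinned);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    cudaFree(d_buf);
    cudaFreeHost(cuda_pinned);
    return 0;
}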

Mellanox – NVIDIA GPUDirect Technology
• Allows Mellanox InfiniBand and NVIDIA GPUs to communicate faster
  - Eliminates memory copies between the InfiniBand and GPU pinned buffers
• Mellanox-NVIDIA GPUDirect enables the fastest GPU-to-GPU communications
[Slide diagram: data paths with and without GPUDirect]

GPUDirect Elements
• Linux kernel modifications
  - Support for sharing pinned pages between different drivers
  - The Linux kernel Memory Manager (MM) allows the NVIDIA and Mellanox drivers to share host memory, giving the InfiniBand driver direct access to the buffers allocated by the NVIDIA CUDA library, and thus zero-copy data transfers and better performance
• NVIDIA driver
  - Buffers allocated by the CUDA library are managed by the NVIDIA Tesla driver
  - Modified to mark these pages as shared, so the kernel MM allows the Mellanox InfiniBand driver to access and use them for transport without copying or re-pinning them
• Mellanox driver
  - InfiniBand applications running over Mellanox adapters can transfer data that resides in buffers allocated by the NVIDIA Tesla driver
  - Modified to query this memory and share it with the NVIDIA Tesla driver using the new Linux kernel MM API
  - Registers callbacks so that drivers sharing the memory can notify it of run-time changes to the shared buffers' state, allowing the Mellanox driver to use the memory accordingly and avoid invalid access to shared pinned buffers
(A sketch of the resulting zero-copy path follows this slide.)
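The following is a minimal, hypothetical sketch of what these elements add up to from an application's point of view; it is not the actual driver code, and it assumes a GPUDirect-enabled kernel and drivers plus the CUDA runtime and libibverbs. The point it illustrates: the same CUDA-pinned buffer can now be registered directly with the InfiniBand driver, so the staging memcpy from the pre-GPUDirect path disappears.

/* Minimal sketch of the zero-copy path enabled by GPUDirect 1.0
 * (assumptions: GPUDirect-patched kernel/drivers; QP setup and error
 * handling omitted). */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

#define BUF_SIZE (1 << 20)   /* illustrative 1 MB message */

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx  = ibv_open_device(devs[0]);
    struct ibv_pd *pd        = ibv_alloc_pd(ctx);

    /* One host buffer, pinned by the CUDA driver ... */
    void *shared_pinned, *d_buf;
    cudaHostAlloc(&shared_pinned, BUF_SIZE, cudaHostAllocDefault);
    cudaMalloc(&d_buf, BUF_SIZE);

    /* ... and registered directly with the InfiniBand driver. With the
     * shared-pinning support described above, no copy or re-pinning is
     * needed between the two drivers. */
    struct ibv_mr *mr = ibv_reg_mr(pd, shared_pinned, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE);

    /* The GPU -> host copy now lands in the buffer the HCA sends from. */
    cudaMemcpy(shared_pinned, d_buf, BUF_SIZE, cudaMemcpyDeviceToHost);

    /* ... create QP, post send of `mr`, poll completion (omitted);
     * in practice the MPI library drives this path for the application. */

    ibv_dereg_mr(mr);
    cudaFree(d_buf);
    cudaFreeHost(shared_pinned);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}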

Accelerating GPU-Based Supercomputing
• Fast GPU-to-GPU communications
• Native RDMA for efficient data transfer
• Reduces latency by 30% for GPU communications
[Slide diagram: transmit and receive sides, with the InfiniBand adapter accessing the shared pinned buffer directly]

Applications Performance - Amber
• Amber is a molecular dynamics software package
  - One of the most widely used programs for biomolecular studies, with an extensive user base
  - Molecular dynamics simulations calculate the motion of atoms
• Benchmarks
  - FactorIX: simulates the blood coagulation factor IX, 90,906 atoms
  - Cellulose: simulates a cellulose fiber, 408,609 atoms
  - Run with the PMEMD simulation program from Amber
• Benchmarking system: 8-node cluster
  - Mellanox ConnectX-2 adapters and InfiniBand switches
  - Each node includes one NVIDIA Fermi C2050 GPU

Amber Performance with GPUDirect: Cellulose Benchmark
• 33% performance increase with GPUDirect
• Performance benefit increases with cluster size

Amber Performance with GPUDirect: Cellulose Benchmark
• 29% performance increase with GPUDirect
• Performance benefit increases with cluster size

Amber Performance with GPUDirect: FactorIX Benchmark
• 20% performance increase with GPUDirect
• Performance benefit increases with cluster size

Amber Performance with GPUDirect: FactorIX Benchmark
• 25% performance increase with GPUDirect
• Performance benefit increases with cluster size

Summary
• GPUDirect enables the first phase of direct GPU-interconnect connectivity
• An essential step towards efficient GPU exascale computing
• Performance benefits vary by application and platform
  - From 5-10% for Linpack to 33% for Amber
  - Further testing will include more applications and platforms
• The work presented was supported by the HPC Advisory Council
  - A worldwide HPC organization (160 members)

HPC Advisory Council Members
