Evaluation of High-Performance Networks as Compilation Targets for Global Address Space Languages. Mike Welcome. In conjunction with the joint UCB and NERSC/LBL UPC compiler development project.

GAS Languages
Access to remote memory is performed by dereferencing a variable, so the cost of small (single-word) messages is important.
Desirable qualities of target architectures:
- Ability to perform one-sided communication
- Low latency for remote accesses
- Ability to hide network latency by overlapping communication with computation or with other communication
- Support for collective communication and synchronization operations
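As a rough illustration of why single-word message cost matters, here is a minimal UPC-style sketch (my own hypothetical program, not code from the talk): the remote access is just an array reference, which the compiler turns into a small one-sided get.

```c
/* Hypothetical UPC sketch: a remote read is ordinary array syntax. */
#include <upc.h>
#include <stdio.h>

shared int a[THREADS];      /* one element per thread, cyclic distribution */

int main(void) {
    a[MYTHREAD] = MYTHREAD;                     /* local write (has affinity here) */
    upc_barrier;                                /* make all writes visible */
    int remote = a[(MYTHREAD + 1) % THREADS];   /* one-word remote get */
    printf("thread %d read %d\n", MYTHREAD, remote);
    return 0;
}
```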

Purpose of this Study
Measure the performance characteristics of various UPC/GAS target architectures. We use micro-benchmarks to measure network parameters, including those defined in the LogP model. Given the characteristics of the communication subsystem, should we:
- Overlap communication with computation?
- Group communication operations together?
- Aggregate (pack/unpack) small messages?

Target Architectures
- Cray T3E: 3D torus interconnect; directly read/write E-registers
- IBM SP
- Quadrics/Alpha, Quadrics/Intel
- Myrinet/Intel
- Dolphin/Intel: torus interconnect, NIC on PCI bus
- Giganet/Intel (old, but could foreshadow InfiniBand): Virtual Interface Architecture, NIC on PCI bus

IBM SP
Hardware: NERSC SP (Seaborg)
- Power 3+ SMP nodes running AIX
- Switch adapters: 2 Colony (switch2) adapters per node, connected to a 2 GB/sec 6XX memory bus (not PCI)
- No RDMA, reliable delivery, or hardware assist in protocol processing
Software:
- "user space" protocol for kernel bypass
- 2 MPI libraries: single-threaded and thread-safe
- LAPI: non-blocking one-sided remote memory copy ops, active messages, synchronization via counters and fence (barrier) ops, polling or interrupt mode

Quadrics
Hardware: Oak Ridge "Falcon" cluster
- 64 4-way Alpha 667 MHz SMP nodes running Tru64
- Low-latency network; onboard 100 MHz NIC processor with 32 MB memory
- NIC processor can duplicate up to 4 GB of page tables; uses virtual addresses and can handle page faults
- RDMA allows asynchronous, one-sided communication without interrupting the remote processor
- Runs over a 66 MHz, 64-bit PCI bus
- A single switch can handle 128 nodes; federated switches can go up to 1024 nodes
Software:
- Supports MPI, the T3E's shmem, and the 'elan' messaging APIs
- Kernel bypass provided by the elan layer

Myrinet 2000
Hardware: UCB Millennium cluster
- 4-way Intel SMP nodes, 550 MHz, with 4 GB/node
- 33 MHz, 32-bit PCI bus
- Myricom PCI64B NIC: 2 MB onboard RAM, 133 MHz LANai 9.0 onboard processor
Software: MPI and GM. GM provides:
- Low-level API to control NIC sends/recvs/polls
- User-space API with kernel bypass
- Support for zero-copy DMA directly to/from the user address space (uses physical addresses, requires memory pinning)

The Network Parameters
- EEL: end-to-end latency, the time spent sending a short message between two processes
- BW: large-message network bandwidth
Parameters of the LogP model:
- L: "latency", or time spent on the network; during this time the processor can be doing other work
- o: "overhead", or processor busy time on the sending or receiving side; during this time the processor cannot be doing other work; we distinguish between "send" and "recv" overhead
- g: "gap", the minimum interval between consecutive message injections; its inverse is the rate at which small messages can be pushed onto the network
- P: the number of processors

LogP Parameters: Overhead and Latency
- Non-overlapping overhead: EEL = o_send + L + o_recv
- Send and recv overhead can overlap: EEL = f(o_send, L, o_recv)

LogP Parameters: gap
The gap is the delay between sending consecutive messages. The gap can be larger than the send overhead because:
- the NIC may be busy finishing the processing of the last message and cannot accept a new one, or
- flow control or backpressure on the network may prevent the NIC from accepting the next message to send.
The gap represents the inverse bandwidth of the network for small-message sends.

LogP Parameters and Optimizations
- If gap > o_send: arrange code to overlap computation with communication.
- The gap value can change if we queue multiple communication operations back-to-back. If the gap decreases with increased queue depth: arrange the code to overlap communication with communication (back-to-back operations).
- If EEL is invariant of message size, at least for a range of message sizes: aggregate (pack/unpack) short messages where possible.
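As a rough illustration (my own MPI sketch, not code from the study), the two overlap patterns look like this; DEPTH is arbitrary, buf is assumed to hold DEPTH * size bytes, and matching receives on rank dest are assumed.

```c
/* Hypothetical MPI sketch of the two overlap strategies. */
#include <mpi.h>

#define DEPTH 8     /* queue depth; value chosen arbitrarily for illustration */

void overlap_examples(char *buf, int size, int dest)
{
    MPI_Request r, reqs[DEPTH];

    /* (1) Overlap computation with communication: pays off when gap > o_send. */
    MPI_Isend(buf, size, MPI_BYTE, dest, 0, MPI_COMM_WORLD, &r);
    /* ... independent computation runs here while the message is in flight ... */
    MPI_Wait(&r, MPI_STATUS_IGNORE);

    /* (2) Overlap communication with communication: pays off when the
     * effective gap shrinks as operations are queued back-to-back. */
    for (int i = 0; i < DEPTH; i++)
        MPI_Isend(buf + i * size, size, MPI_BYTE, dest, 0, MPI_COMM_WORLD, &reqs[i]);
    MPI_Waitall(DEPTH, reqs, MPI_STATUSES_IGNORE);
}
```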

Benchmarks
Designed to measure the network parameters for each target network; they also provide the gap as a function of queue depth.
- Implemented once in MPI, for portability and for comparison with the target-specific layer
- Implemented again in the target-specific communication layer: LAPI, ELAN, GM, SHMEM, VIPL

Benchmark: Ping-Pong
- Measure the round-trip time (RTT) for messages of various sizes; report the average RTT over a large number (10000) of message sends.
- EEL = RTT/2 = f(L, o_send, o_recv); approximately, f(L, o_send, o_recv) = L + o_send + o_recv.
- Also provides a large-message bandwidth measurement.
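A sketch of how the ping-pong test can be written in MPI; this is my own reconstruction using the repetition count from the slide, not the authors' benchmark code.

```c
/* Ping-pong sketch: rank 0 sends 'size' bytes to rank 1, which echoes them
 * back; half the averaged round-trip time approximates EEL. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 10000

int main(int argc, char **argv) {
    int rank, size = 8;                       /* message size in bytes (assumed default) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (argc > 1) size = atoi(argv[1]);
    char *buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / REPS;
    if (rank == 0)
        printf("size %d bytes: RTT %.2f usec, EEL %.2f usec\n",
               size, rtt * 1e6, rtt * 1e6 / 2);
    free(buf);
    MPI_Finalize();
    return 0;
}
```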

Benchmark: Flood Test
- Calculate the rate at which messages can be injected into the network.
- Issue N = 10000 non-blocking send messages and wait for a final ack from the receiver; the next send is issued as soon as the previous send is complete at the sender.
- F = 2o + L + N*max(o_send, g), so F_avg = F/N ~ max(o_send, g) for large N.
- Can be run with queue depth >= 1.
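A sketch of the flood test in MPI (my reconstruction under the stated assumptions: small messages, queue depth 1, final ack from the receiver).

```c
/* Flood test sketch: F/REPS ~ max(o_send, g) for large REPS. */
#include <mpi.h>
#include <stdio.h>

#define REPS 10000
#define SIZE 8                 /* small message size in bytes (assumed) */

int main(int argc, char **argv) {
    int rank;
    char buf[SIZE], ack;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    if (rank == 0) {
        for (int i = 0; i < REPS; i++) {
            MPI_Isend(buf, SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);    /* next send as soon as this one completes */
        }
        MPI_Recv(&ack, 1, MPI_BYTE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("F_avg = %.2f usec per message\n",
               (MPI_Wtime() - t0) / REPS * 1e6);
    } else if (rank == 1) {
        for (int i = 0; i < REPS; i++)
            MPI_Recv(buf, SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&ack, 1, MPI_BYTE, 0, 1, MPI_COMM_WORLD);   /* final ack */
    }
    MPI_Finalize();
    return 0;
}
```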

Benchmark: Overlap Test
- In the overlap test, we interleave send and receive communication calls with a CPU loop of known duration; this allows measurement of the send and receive overheads.
- Similar to the flood test, we measure the average per-message time T. We vary the "cpu" time until T begins to increase; at that point T*, o_send = T* - cpu.
- By moving the CPU loop to the receive side, we measure o_recv.
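A sketch of the overlap test in MPI (again my own reconstruction): each iteration posts a non-blocking send, spins in a CPU loop of known cost, then completes the send. The loop length is swept externally; once the per-iteration time T starts growing with the loop, o_send ~ T* - cpu.

```c
/* Overlap test sketch: vary 'work' across runs and watch T. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 10000
#define SIZE 8

/* busy loop of roughly n iterations; volatile keeps it from being optimized away */
static void cpu_loop(long n) {
    volatile long x = 0;
    for (long i = 0; i < n; i++) x += i;
}

int main(int argc, char **argv) {
    int rank;
    char buf[SIZE], ack;
    long work = (argc > 1) ? atol(argv[1]) : 0;   /* CPU work per iteration */
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    if (rank == 0) {
        for (int i = 0; i < REPS; i++) {
            MPI_Isend(buf, SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
            cpu_loop(work);                       /* computation overlapped with the send */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        MPI_Recv(&ack, 1, MPI_BYTE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("work=%ld  T=%.2f usec/iter\n", work,
               (MPI_Wtime() - t0) / REPS * 1e6);
    } else if (rank == 1) {
        for (int i = 0; i < REPS; i++)
            MPI_Recv(buf, SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&ack, 1, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```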

Putting it all together
- From the overlap test, we get o_send and o_recv.
- From the ping-pong test, we get EEL and BW. If there is no overlap of send and receive processing: L = EEL - o_send - o_recv.
- From the flood test, F_avg = max(o_send, g). If F_avg > o_send, then g = F_avg; otherwise we cannot measure the gap, but in that case it is not the limiting factor.
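The same relations, restated compactly (notation mine):

```latex
\begin{align*}
  \mathrm{EEL} &\approx o_{\mathrm{send}} + L + o_{\mathrm{recv}}
      \;\Rightarrow\; L \approx \mathrm{EEL} - o_{\mathrm{send}} - o_{\mathrm{recv}},\\
  F_{\mathrm{avg}} &\approx \max(o_{\mathrm{send}},\, g)
      \;\Rightarrow\; g \approx F_{\mathrm{avg}} \quad \text{if } F_{\mathrm{avg}} > o_{\mathrm{send}}.
\end{align*}
```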

Results: EEL and Overhead

Results: Gap and Overhead

Flood Test: Overlapping Communication

Bandwidth Chart

EEL vs. Message Size

Benchmark Results: IBM
- High latency, high software overhead.
- Gap ~ o_send: no overlap of computation with communication.
- Gap does not vary with the number of queued ops: no overlap of communication with communication.
- LAPI: the cost to send 1 byte ~ the cost to send 1 KB, so short-message packing is the best option.
Performance table: o_send, gap, o_recv, EEL, L (usec) and BW (MB/s) for the IBM published figures (BW 500 MB/s, theoretical peak), MPI, and LAPI.

Benchmark Results: Myrinet 2000
- Small o_send and large gap (g - o_send = 16.5 usec): overlap of computation with communication is a big win.
- Big reduction in gap with queue depth > 1 (5-7 usec): overlap of communication with communication is useful.
- RDMA capability allows for minimal o_recv.
- Bandwidth limited by the 33 MHz, 32-bit PCI bus; should improve with a better bus.
Performance table: o_send, gap, o_recv, EEL, L (usec) and BW (MB/s) for the Myricom published figures (o_send 0.3 usec) and GM as measured.

Benchmark Results: Quadrics
- Observed one-way message time slightly better than advertised.
- Using shmem/elan is a big savings over MPI in both latency and CPU overhead.
- No CPU overhead on the remote processor with shmem; some computation overlap is possible.
- The MPI implementation was a bit flaky.
Performance table: o_send, gap, o_recv, EEL, L (usec) and BW (MB/s) for the Quadrics published figures (EEL 2 usec), MPI as measured (* possible MPI bugs), and Quadrics put (o_send 0.5 usec, gap 1.6 usec).

General Conclusions
- Overlap of computation with communication: a win on systems with hardware support for protocol processing (Myrinet, Quadrics, Giganet). With MPI, o_send ~ gap on most systems, so there is no overlap.
- Overlap of communication with communication: a win on Myrinet, Quadrics, and Giganet; most MPI implementations exhibit this only to a minor extent.
- Aggregation of small messages (pack/unpack): a win on all systems.

Old/Extra Slides

Quadrics Advertised Bandwidth/latency, with PCI bottleneck shown

IBM SP – Hardware Used
NERSC SP (Seaborg):
- Power 3+ SMP nodes with 16-64 GB memory per node
- Switch adapters: 2 Colony (switch2) adapters per node, connected to a 2 GB/sec 6XX memory bus (not PCI)
- The csss "bonding" driver will multiplex through both adapters
- On-board 740 PowerPC processor; on-board firmware and RamBus memory for segmentation and re-assembly of user packets to and from 1 KB switch packets
- No RDMA, reliable delivery, or hardware assist in protocol processing

IBM SP – Software
- AIX "user space" protocol for kernel-bypass access to the switch adapter
- 2 MPI libraries, single-threaded and thread-safe; the thread-safe version increases RTT latency
- LAPI: the lowest-level comm API exported to the user
  - Non-blocking one-sided remote memory copy ops
  - Active messages
  - Synchronization via counters and fence (barrier) ops
  - Thread-safe (locking overhead)
  - Multi-threaded implementation: a notification thread (progress engine) and a completion handler thread for active messages
  - Polling or interrupt mode
- Software-based flow control and reliable delivery (overhead)

Quadrics
- Low-latency network with a 100 MHz processor on the NIC.
- RDMA allows asynchronous, one-sided communication without interrupting the remote processor.
- Supports MPI, the T3E's shmem, and the 'elan' messaging APIs.
- Advertised one-way latency as low as 2 us (5 us for MPI).
- A single switch can handle 128 nodes; federated switches can go up to 1024 nodes (Pittsburgh is running 750 nodes).
- The NIC processor can duplicate up to 4 GB of page tables, which is good for global address space languages.
- Runs over the PCI bus, which limits both latency and bandwidth.
- 64-node cluster at Oak Ridge National Lab ("Falcon"): 64 4-way Alpha 667 MHz SMP nodes running Tru64; 66 MHz, 64-bit PCI bus.
- Future work: look at the Intel/Linux Quadrics cluster at LLNL.

Myrinet 2000
Hardware: UCB Millennium cluster
- 4-way Intel SMP nodes, 550 MHz, with 4 GB/node
- 33 MHz, 32-bit PCI bus
- Myricom PCI64B NIC: 2 MB onboard RAM, 133 MHz LANai 9.0 onboard processor
Software: GM
- Low-level API to control NIC sends/recvs/polls
- User-space API with kernel bypass
- Support for zero-copy DMA directly to/from the user address space