Cray Inc. Hot Interconnects 1 Bob Alverson, Duncan Roweth, Larry Kaplan Cray Inc.



Overview
  Network Interface
  Router
  Reliability, Availability, and Serviceability Features
  Software Stack
  Performance

Integrated NIC and Router
External HSS Monitoring
Supports 2 Nodes per ASIC
Advanced Resiliency Features
Hardware Global Address Support
Advanced NIC designed to efficiently support
  MPI
  One-sided MPI
  Shmem
  UPC, Coarray FORTRAN

[Figure: diagram with axes labeled X, Y, and Z]

Fast Memory Access (FMA) – fine grain remote PUT/GET
Block Transfer Engine (BTE) – offload for long transfers
Completion Queue (CQ) – client notification
Atomic Memory Op (AMO) – fetch&add, etc.

Single-sided
Processor stores become remote PUT or GET
FMA descriptors hold state to help determine destination node and memory location
FMA PUT for short messages
  Uncached processor store to Gemini window translated directly to network packet
FMA GET allows reverse direction data transfer of 1 to 64 bytes
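The FMA PUT path can be pictured with a toy model. This is only an illustrative sketch, not Gemini's actual interface: the class names, field names, and the `store` method are all hypothetical. It shows a descriptor holding the destination node and a remote base address, so that a store at some offset in the window turns into one packet.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    """Hypothetical network packet produced by one store."""
    dest_node: int
    remote_addr: int
    payload: bytes


@dataclass
class FmaDescriptor:
    """Toy stand-in for an FMA descriptor (field names are assumptions).

    It holds the state needed to pick the destination node and the
    memory location on that node.
    """
    dest_node: int
    remote_base: int
    sent: list = field(default_factory=list)

    def store(self, window_offset: int, data: bytes) -> Packet:
        # Model a processor store into the window as a remote PUT:
        # the offset within the window selects the remote address.
        pkt = Packet(self.dest_node, self.remote_base + window_offset, bytes(data))
        self.sent.append(pkt)
        return pkt


desc = FmaDescriptor(dest_node=7, remote_base=0x1000)
pkt = desc.store(0x40, b"\x01\x02\x03\x04")
```

In this model the sender never names the destination per store; it is fixed by the descriptor, which is why short messages need only a single store.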

Driver managed
BTE PUT for long messages
  DMA transfer to offload data movement from processor
BTE SEND for IP traffic, etc.
  Send message to remote node
  Single receive queue for all sources
  Upper level protocol covers lost messages
BTE GET support for simplified data transfers
  In lieu of involving remote side for PUT

Hardware remote atomic memory operations in the NIC
  Add, Compare & Swap, Logical Operations
  Executed at the node with the memory
AMO cache for hot locations
  Up to 64 locations with AMOs in process
Global operations support
  Barriers
  Counters
  Collectives (reductions, global sum)
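The semantics of the two named operations, fetch-and-add and compare-and-swap, can be sketched as follows. This is a software illustration only; the `MemoryNode` class and its methods are hypothetical and say nothing about how the NIC implements them. The point it makes is that the operation runs at the node that owns the memory and returns the old value to the requester.

```python
class MemoryNode:
    """Toy model of the node that owns the target memory (hypothetical)."""

    def __init__(self):
        self.mem = {}  # address -> integer value, unset addresses read as 0

    def fetch_add(self, addr, value):
        # Return the old value and add `value` to the location.
        old = self.mem.get(addr, 0)
        self.mem[addr] = old + value
        return old

    def compare_swap(self, addr, expected, new):
        # Write `new` only if the location currently holds `expected`;
        # the old value is returned either way.
        old = self.mem.get(addr, 0)
        if old == expected:
            self.mem[addr] = new
        return old


node = MemoryNode()
tickets = [node.fetch_add(0x10, 1) for _ in range(3)]  # three requesters
swapped = node.compare_swap(0x10, 3, 100)               # succeeds
missed = node.compare_swap(0x10, 3, 200)                # fails, value is 100
```

Because each requester gets a distinct old value back from `fetch_add`, a shared counter of this kind can hand out unique tickets without any locking at the requesters.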

6x8 tile matrix
  Input queue to one of 6 subswitches
  Route to one of 8 output buffers
Hashed routing preserves order to cachelines
Adaptive routing
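One way to see why hashed routing preserves order within a cacheline: if the link is chosen from a hash of the cacheline address, every access to the same cacheline takes the same path. The sketch below assumes a 64-byte cacheline and uses a made-up xor-fold hash; neither is stated on the slide, and `pick_link` is a hypothetical name.

```python
CACHELINE_BYTES = 64  # assumption, not given on the slide


def pick_link(addr, num_links):
    """Choose a link from the cacheline address (illustrative hash only)."""
    line = addr // CACHELINE_BYTES
    # Fold higher bits into the low bits so nearby lines spread out.
    h = line ^ (line >> 7) ^ (line >> 13)
    return h % num_links


same_line = pick_link(0x1000, 4), pick_link(0x103F, 4)  # both in one line
next_line = pick_link(0x1040, 4)                        # following line
```

Two addresses in the same cacheline always map to the same link, so their packets cannot overtake each other, while different cachelines are free to spread across links.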

Route around stalled or down links
  If a link goes down, adaptive routing mask updated in hardware to exclude it
  OS traffic uses adaptive routing only, recovers from finite loss of packets
  Quiesce and re-route to repair deterministic routes
Congestion feedback to allow routing around bottlenecks
  Potential for improved performance on difficult traffic patterns such as transpose
Packets reordered in receive buffer (DRAM)
  Separate notification (completion event) when all stored
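The mask-based exclusion and congestion feedback described above might be modeled like this. It is a rough sketch under stated assumptions: the mask is taken to be one bit per candidate link, and congestion is represented by a per-link queue depth. Both function names and the selection policy (least-loaded usable link) are hypothetical.

```python
def exclude_link(mask, link):
    """Clear the bit for a link that has gone down."""
    return mask & ~(1 << link)


def choose_link(mask, queue_depth):
    """Pick the least-loaded link among those still enabled in the mask."""
    candidates = [i for i in range(len(queue_depth)) if mask & (1 << i)]
    if not candidates:
        raise RuntimeError("no usable links")
    return min(candidates, key=lambda i: queue_depth[i])


mask = 0b1111                  # four candidate links, all up
depth = [5, 1, 3, 2]           # assumed congestion feedback per link
first = choose_link(mask, depth)
mask = exclude_link(mask, 1)   # link 1 goes down
second = choose_link(mask, depth)
```

After link 1 is masked out, traffic moves to the next least-loaded link without any change to the routing logic itself.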

24 bit flit
Maximum size packet is =32 flit
  Put request of 64 bytes
Minimum is 2 flit
  Put response
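The numbers on this slide allow a quick payload-efficiency estimate. The calculation below assumes that all 32 flits of the largest packet are 24 bits wide and that the 64-byte payload is packed densely; the slide does not break down what the remaining flits carry, so the "overhead" figure is only an upper bound under those assumptions.

```python
import math

FLIT_BITS = 24          # from the slide
MAX_PACKET_FLITS = 32   # from the slide
PUT_PAYLOAD_BYTES = 64  # from the slide

# Total size of the largest packet in bytes.
packet_bytes = MAX_PACKET_FLITS * FLIT_BITS // 8

# Flits needed for the payload alone, if packed densely (assumption).
payload_flits = math.ceil(PUT_PAYLOAD_BYTES * 8 / FLIT_BITS)

# Whatever is left is header, CRC, or other overhead.
overhead_flits = MAX_PACKET_FLITS - payload_flits

# Fraction of the packet that is user payload.
efficiency = PUT_PAYLOAD_BYTES / packet_bytes
```

Under these assumptions a 64-byte Put occupies 96 bytes on the wire, so about two thirds of the link bandwidth carries user data.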

Automatic link-level retries
HT3 support including automatic retries and improved CRC
Most internal data structures are at least parity protected
  The longer the occupancy of data at a location, the stronger the protection
Errors reported as precisely as possible
  Payload errors reported directly to user
  Control errors often cannot be associated with a particular transaction
  In all cases OS or HSS can be notified of the error
Router errors included
  Reported at the point of error
  Endpoint(s) (user) see a timeout

[Figure: software stack diagram]
Components shown:
  User level Gemini Network Interface (uGNI)
  DMAPP
  MPICH MPICH2
  SHMEM
  PGAS
  Direct Access
  IOCTL or System Call
  Kernel level GNI (kGNI)
  Lustre Network Driver (LND)
  IP over Gemini Fabric (IPoGIF)
  GNI Core
  Gemini Hardware Abstraction Layer (GHAL)
  GART Resource Management (GRM)
  Linux Core
  Cray COW solution
  MRT-size page support
  Registration Cache support

Latency
Bandwidth
Atomic operations

Gemini expanded to HT3 at up to 5.2 GT/s
  Expect to sustain greater than 6 GB/s user data injection
Network bandwidth is limited by XT packaging
  Link speed from to 6.25 Gbit/sec
  In some cases, double wide X & Z links also offer increased bandwidth
Gemini relies on user level threads
  MPI processing limits to 2M messages/sec per thread
  Scales beyond 10M msg/sec per NIC

One way PUT in 750 ns
Waiting for Ack in only 1.1 us
Remote GET increases to 1.4 us

Peak bandwidth reached with small transfers
Multiple threads reach peak with still smaller transfers

Hot location reaches 100 Mupdates/sec
Random locations (GUPS) still over 45 Mupdates/sec

Gemini provides low latency and performance for fine-grain operations
Gemini has features to scale in performance and reliability to large system size
Questions?