Reconfigurable Computing: A First Look at the Cray-XD1 Mitch Sukalski, David Thompson, Rob Armstrong, Curtis Janssen, and Matt Leininger Orgs: 8961 & 8963.

Presentation transcript:

Reconfigurable Computing: A First Look at the Cray-XD1
Craig Ulmer
September 1, 2004
Mitch Sukalski, David Thompson, Rob Armstrong, Curtis Janssen, and Matt Leininger
Orgs: 8961 & 8963

Outline
Reconfigurable computing refresher
  – Progress update
Cray XD1
  – Architecture
  – General message passing
  – Reconfigurable computing and the XD1

Reconfigurable Computing Update

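Reconfigurable Computing
Use reconfigurable hardware devices to implement key computations in hardware

    double doX(double *a, int n) {
        int i;
        double x;
        x = 0;
        for (i = 0; i < n; i += 3) {
            x += a[i] * a[i+1] + a[i+2];
            …
        }
        …
        return x;
    }

[Figure: dataflow circuit for the loop body. A multiplier takes a[i] and a[i+1], an adder adds a[i+2], and a second adder with a Z^-1 feedback register accumulates the result.]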
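As a point of comparison for the circuit, the loop body can be written as a small software reference model. This is a minimal sketch, not the deck's code: the name `mac3_accumulate` is hypothetical, the elided parts of `doX` are not reproduced, and the loop bound is tightened to `i + 2 < n` (an assumption) so that a trailing partial triple is skipped rather than read past the end of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software model of the multiply-add-accumulate datapath.
 * Each iteration consumes three inputs, mirroring one pass through the
 * multiplier, the adder, and the accumulating adder. */
double mac3_accumulate(const double *a, size_t n)
{
    assert(a != NULL || n == 0);

    double acc = 0.0;                         /* models the Z^-1 feedback register */
    for (size_t i = 0; i + 2 < n; i += 3) {   /* assumption: skip a partial triple */
        double product = a[i] * a[i + 1];     /* multiplier stage */
        double sum = product + a[i + 2];      /* first adder stage */
        acc += sum;                           /* accumulating adder */
    }
    return acc;
}
```

A model like this is mainly useful for checking hardware output against known inputs; it says nothing about how the FPGA version is pipelined.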

First Year Progress
Computation (Underwood SNL/NM)
  – Double-precision floating-point cores
Communication
  – Multi-gigabit transceiver (MGT) interface
  – Gigabit Ethernet work
Early application experiments
  – Simplified isosurfacing
  – Networked pattern matching

Peak Floating-Point Performance
(SP = single precision, DP = double precision; cores are counted per V2P100-6)

| Core           | SP speed | SP cores | SP peak    | DP speed | DP cores | DP peak     |
| Addition       | 195 MHz  | 89       | 17 GFLOPS  | 143 MHz  | 40       | 5.7 GFLOPS  |
| Multiplication | 176 MHz  | 74       | 13 GFLOPS  | 142 MHz  | 27       | 3.8 GFLOPS  |
| Division       | 120 MHz  | 22       | 2.6 GFLOPS | 98 MHz   | 6        | 0.58 GFLOPS |

From Underwood's "FPGAs vs. CPUs: Trends in Peak Floating-Point Performance," in FPGA'04
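The peak figures above are consistent with a simple product of clock rate and core count. The sketch below assumes each core retires one result per cycle; that assumption and the function name `peak_gflops` are illustrative and are not stated on the slide.

```c
#include <assert.h>

/* Peak rate in GFLOPS, assuming every core retires one result per cycle. */
double peak_gflops(double clock_mhz, int cores)
{
    assert(clock_mhz >= 0.0 && cores >= 0);
    return clock_mhz * 1.0e6 * cores / 1.0e9;
}
```

For example, `peak_gflops(195.0, 89)` gives about 17.4 for single-precision addition and `peak_gflops(98.0, 6)` gives about 0.59 for double-precision division, in line with the quoted 17 and 0.58 GFLOPS.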

Connecting FPGAs to the Network Fabric
Modern FPGAs feature multi-gigabit transceivers
  – Experimented with GigE, Myrinet 2000, and IB
  – Implemented TCP Offload Engine (TOE) in hardware
  – Working on OpenTOE and OpenGigE cores

[Figure: block diagram of the SNL_OpenGigE and SNL_OpenTOE cores. Labeled blocks: GT_Ethernet_2 Rocket I/O MGT, MGT Control, MAC Framer, Align, Pad, Tx CRC, Rx CRC, IP Header, ARP, ARP Cache, ARP Reply, Ping, Ping Reply, CRC Gen, CRC Decode, SEQ Gen, ACK Monitor, Timeout Monitor, Incoming Data Queue, Outgoing Data Queue, TCP I/F, Socket I/F.]
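The diagram labels include Tx and Rx CRC blocks next to the MAC framer. Assuming these compute the standard Ethernet frame check sequence (the slide does not say so), a bit-serial software reference can be sketched as follows. The function name `crc32_ethernet` is hypothetical, and this is not the OpenGigE implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 in the form used for the Ethernet frame check sequence:
 * reflected polynomial 0xEDB88320, initial value and final XOR of all ones. */
uint32_t crc32_ethernet(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);

    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];                       /* fold in the next byte */
        for (int bit = 0; bit < 8; bit++) {   /* shift out one bit at a time */
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return ~crc;
}
```

The one-bit-per-step loop is easy to check against a hardware simulation; a real core would typically process a byte or more per clock.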

Cray XD1 Overview

NDA Notice
We do have an NDA with Cray Canada
The XD1 we have on loan is an early beta system

Cray XD1 Overview
Dense MP system
  – 12 AMD Opterons on 6 blades
  – 6 Xilinx Virtex-II Pro FPGAs
  – InfiniBand-like interconnect
  – 6 SATA hard drives
  – 4 PCI-X slots
  – 3U rack

Individual Blade

[Figure: blade block diagram. Labeled blocks: DDR Memory (two banks), Opteron, RAP NI (two), "Einstein" chip, and RapidArray Fabric (24 4x IB ports). Labeled link rates: HT: 3.2 GB/s, HT: 6.4 GB/s, "HT": 3.2 GB/s, 4x IB: 2 GB/s.]

* All data rates are aggregates (i.e., 3.2 GB/s = 1.6 GB/s + 1.6 GB/s)

Message Passing
MPICH
  – Latency: 2.25 μs
  – Bandwidth: 1.3 GB/s (82% of HT-IB link)
RapidArray message layer
  – Open source
  – MP, RDMA
  – Global address space

[Figure: MPI bandwidth (million bytes/s) versus message size (bytes), with reference levels labeled PCI-X and HT.]
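One way to read the two MPICH numbers together is a simple linear cost model, in which transfer time is latency plus message size divided by peak bandwidth. The slide does not present this model; it is an assumption used here only to show how the shape of a bandwidth curve follows from the quoted latency and peak. All names below are hypothetical.

```c
#include <assert.h>

#define XD1_MPI_LATENCY_S 2.25e-6   /* latency quoted on the slide */
#define XD1_MPI_PEAK_BPS  1.3e9     /* peak bandwidth quoted on the slide */

/* Estimated one-way transfer time in seconds for a message of `bytes` bytes. */
double transfer_time_s(double bytes)
{
    assert(bytes >= 0.0);
    return XD1_MPI_LATENCY_S + bytes / XD1_MPI_PEAK_BPS;
}

/* Effective bandwidth in bytes per second for a message of `bytes` bytes. */
double effective_bandwidth_bps(double bytes)
{
    assert(bytes > 0.0);
    return bytes / transfer_time_s(bytes);
}

/* Message size at which the model reaches half of peak bandwidth. */
double half_bandwidth_size_bytes(void)
{
    return XD1_MPI_LATENCY_S * XD1_MPI_PEAK_BPS;
}
```

Under this model the half-bandwidth point falls near 2.9 KB, so messages much smaller than that are dominated by latency. Measured curves usually deviate from the model around protocol switch points.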

System Administration
Active manager
  – Synchronize each node's OS
  – Partition blade functionality
  – Control access rights
Embedded processor
  – Monitors health (heartbeats)
  – Can restart nodes
Issues?

Reconfigurable Computing and the Cray XD1

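Connecting to the "Einstein" Accelerator

[Figure: accelerator block diagram. Labeled blocks: Host, RAP NI (with HT, Net, and IB labels), FPGA Port, Fabric Port, and an FPGA containing an HT I/F, User-defined Circuits, and four QDR2 I/Fs, each attached to a 2 MB SRAM. Two link rates in GB/s were labeled, but their values did not survive extraction.]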

Example: Random Number Generator
Monte Carlo app in need of good random numbers
  – Mersenne twister
Implemented in FPGA
  – FPGA pushes to host memory
  – 301 vs 101 million integers/s
  – ~1.2 GB/s

[Figure: RNG block in the FPGA connected through the NI to host memory and the CPU.]
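For reference, a software Mersenne twister can be sketched as below. The slide does not say which variant was built; the 32-bit MT19937 is assumed here, which fits the quoted rates, since 301 million 4-byte integers per second is roughly 1.2 GB/s. The type and function names are hypothetical, and nothing here reflects the FPGA design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MT_N 624   /* state words */
#define MT_M 397   /* middle offset used by the recurrence */

typedef struct {
    uint32_t state[MT_N];
    int index;     /* next state word to temper; MT_N forces a refill */
} mt19937_t;

/* Initialize the state from a 32-bit seed. */
void mt_seed(mt19937_t *mt, uint32_t seed)
{
    assert(mt != NULL);
    mt->state[0] = seed;
    for (int i = 1; i < MT_N; i++) {
        uint32_t prev = mt->state[i - 1];
        mt->state[i] = 1812433253u * (prev ^ (prev >> 30)) + (uint32_t)i;
    }
    mt->index = MT_N;
}

/* Regenerate all state words in place. */
static void mt_twist(mt19937_t *mt)
{
    for (int i = 0; i < MT_N; i++) {
        uint32_t y = (mt->state[i] & 0x80000000u)
                   | (mt->state[(i + 1) % MT_N] & 0x7fffffffu);
        uint32_t next = mt->state[(i + MT_M) % MT_N] ^ (y >> 1);
        if (y & 1u)
            next ^= 0x9908b0dfu;
        mt->state[i] = next;
    }
    mt->index = 0;
}

/* Return the next tempered 32-bit output. */
uint32_t mt_next(mt19937_t *mt)
{
    assert(mt != NULL);
    if (mt->index >= MT_N)
        mt_twist(mt);
    uint32_t y = mt->state[mt->index++];
    y ^= y >> 11;
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}
```

The state refill touches every word independently of the tempering step, which is part of why the generator maps well onto parallel hardware.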

General XD1 Comments
(The slide sorts these items into two columns, Good and Not-so-Good; the column assignment is not preserved in this transcript.)
Reconfigurable computing
  – FPGA in memory
  – Fast local memory
Other accelerators
  – ClearSpeed
Global address space
  – Opteron limits (40b PA)
Vendor lock-in
  – Incompatible network
  – All-in-one box?
Current NI is a bottleneck
Density vs. reliability
Value-added features
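The "40b PA" note can be made concrete with a little arithmetic: 40 physical address bits cover 2^40 bytes, or 1 TiB. The sketch below assumes every node's memory is mapped flat into one physical address space, which the slide does not state; the function names and the per-node memory size in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes addressable with the given number of physical address bits. */
uint64_t addressable_bytes(unsigned bits)
{
    assert(bits < 64);
    return UINT64_C(1) << bits;
}

/* Largest node count whose combined memory fits in one physical address space. */
uint64_t max_nodes(unsigned pa_bits, uint64_t bytes_per_node)
{
    assert(bytes_per_node > 0);
    return addressable_bytes(pa_bits) / bytes_per_node;
}
```

With a hypothetical 8 GiB per node, `max_nodes(40, UINT64_C(8) << 30)` returns 128, which illustrates how quickly a flat global address space would run into the limit.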

Friendly Users?
We have a month left on evaluation
  – Could use feedback from other users