Reconfigurable Computing: A First Look at the Cray-XD1
  Mitch Sukalski, David Thompson, Rob Armstrong, Curtis Janssen, and Matt Leininger
  Orgs: 8961 & 8963
  September 1, 2004
  Craig Ulmer
Outline
  Reconfigurable computing refresher
    –Progress update
  Cray XD1
    –Architecture
    –General message passing
    –Reconfigurable Computing and the XD1
Reconfigurable Computing Update
Reconfigurable Computing
  Use reconfigurable hardware devices to implement key computations in hardware

  double doX(double *a, int n) {
    int i;
    double x;
    x = 0;
    for (i = 0; i < n; i += 3) {
      x += a[i] * a[i+1] + a[i+2];
      …
    }
    …
    return x;
  }

  [Figure: dataflow circuit with one multiplier (*), two adders (+), a Z^-1 delay element, and inputs a[i], a[i+1], a[i+2]]
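
To make the mapping from loop body to circuit concrete, here is a minimal software model of the datapath in the figure. It is a sketch, not the actual hardware design: the names `mac_unit`, `mac_tick`, and `mac_stream` are hypothetical, and the model assumes the Z^-1 element is the accumulator register fed by the second adder. Unlike the slide's loop, it stops before reading past the end of the array when `n` is not a multiple of 3.

```c
#include <stddef.h>

/* Hypothetical software model of the multiply-add-accumulate datapath. */
typedef struct {
    double acc; /* models the Z^-1 feedback register (assumed accumulator) */
} mac_unit;

/* One "clock tick": consume three operands and update the register. */
static void mac_tick(mac_unit *u, double a0, double a1, double a2) {
    double product = a0 * a1;   /* multiplier */
    double sum = product + a2;  /* first adder */
    u->acc = u->acc + sum;      /* second adder feeding the register */
}

/* Stream the array through the modeled circuit, three values per tick. */
double mac_stream(const double *a, size_t n) {
    mac_unit u = { 0.0 };
    for (size_t i = 0; i + 2 < n; i += 3) {
        mac_tick(&u, a[i], a[i + 1], a[i + 2]);
    }
    return u.acc;
}
```

For the input {1, 2, 3, 4, 5, 6} the model accumulates (1*2 + 3) + (4*5 + 6) = 31.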
First Year Progress
  Computation (Underwood SNL/NM)
    –Double-precision Floating Point Cores
  Communication
    –Multi-gigabit Transceiver (MGT) interface
    –Gigabit Ethernet work
  Early application experiments
    –Simplified isosurfacing
    –Networked pattern matching
Peak Floating-Point Performance

                  Single Precision                            Double Precision
  Core            Speed    Cores per V2P100-6  Peak Perf.     Speed    Cores per V2P100-6  Peak Perf.
  Addition        195 MHz  89                  17 GFLOPS      143 MHz  40                  5.7 GFLOPS
  Multiplication  176 MHz  74                  13 GFLOPS      142 MHz  27                  3.8 GFLOPS
  Division        120 MHz  22                  2.6 GFLOPS     98 MHz   6                   0.58 GFLOPS

  From Underwood’s “FPGAs vs. CPUs: Trends in Peak Floating-Point Performance,” in FPGA’04
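
The peak figures in the table are consistent with multiplying clock speed by core count. The sketch below reproduces that arithmetic under the assumption that each core completes one floating-point operation per clock cycle. The slide does not state this assumption, and `peak_gflops` is a hypothetical helper name.

```c
/* Peak throughput in GFLOPS for `cores` identical cores at `clock_mhz`.
 * Assumption (not stated on the slide): one operation per core per cycle. */
double peak_gflops(double clock_mhz, int cores) {
    return clock_mhz * 1e6 * cores / 1e9;
}
```

For example, 89 single-precision adders at 195 MHz give 17.355 GFLOPS, which the table reports as 17 GFLOPS; 40 double-precision adders at 143 MHz give 5.72 GFLOPS, reported as 5.7 GFLOPS.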
Connecting FPGAs to the Network Fabric
  Modern FPGAs feature multi-gigabit transceivers
    –Experimented with GigE, Myrinet 2000, and IB
    –Implemented TCP Offload Engine (TOE) in hardware
    –Working on OpenTOE and OpenGigE cores
  [Figure: block diagram of the SNL_OpenGigE and SNL_OpenTOE cores built on the Rocket I/O MGT (GT_Ethernet_2), showing MAC framing, CRC generation and checking, ARP and ping handling, incoming and outgoing data queues, sequence and ACK monitoring, and TCP and socket interfaces]
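
The diagram above includes CRC generation and checking blocks. As a point of reference, the following is a bitwise software version of the standard CRC-32 used for the Ethernet frame check sequence. It is a reference model only and makes no claim about how the hardware cores compute the CRC; the function name `crc32_ieee` is an assumption for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, initial value
 * 0xFFFFFFFF, final inversion), as used for the Ethernet FCS. */
uint32_t crc32_ieee(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return ~crc;
}
```

The standard check value for the nine ASCII bytes "123456789" is 0xCBF43926, which is a convenient test vector when validating a hardware implementation against a software model.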
Cray XD1 Overview
NDA Notice
  We do have an NDA with Cray Canada
  The XD1 we have on loan is an early Beta system
Cray XD1 Overview
  Dense MP system
    –12 AMD Opterons on 6 blades
    –6 Xilinx Virtex-II/Pro FPGAs
    –InfiniBand-like interconnect
    –6 SATA hard drives
    –4 PCI-X slots
    –3U Rack
Individual Blade
  [Figure: blade block diagram. Labels: Opteron, DDR Memory (x2), RAP NI (x2), “Einstein” Chip, RapidArray Fabric (24 4x IB Ports). Link rates: HT: 3.2 GB/s; 4x IB: 2 GB/s; HT: 6.4 GB/s; “HT”: 3.2 GB/s]
  * All data rates are aggregates (i.e., 3.2 GB/s = 1.6 GB/s + 1.6 GB/s)
Message Passing
  MPICH
    –Latency: 2.25 μs
    –Bandwidth: 1.3 GB/s (82% of HT-IB link)
  RapidArray message layer
    –Open source
    –MP, RDMA
    –Global address space
  [Figure: MPI Bandwidth plot. x-axis: Message Size (Bytes); y-axis: Bandwidth (Million Bytes/s); reference lines labeled PCI-X and HT]
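
To relate the latency and bandwidth figures to the shape of a bandwidth-versus-message-size curve, one common approximation is a linear cost model: transfer time = latency + size / peak bandwidth. The sketch below applies that model to the slide's numbers (2.25 μs, 1.3 GB/s). The model itself is an assumption, not something measured here, and real MPI implementations deviate from it (for example, at protocol switch points). The function names are hypothetical.

```c
/* Linear cost model (an assumption): time = latency + bytes / bandwidth. */
double transfer_time_s(double bytes, double latency_s, double peak_bytes_per_s) {
    return latency_s + bytes / peak_bytes_per_s;
}

/* Effective bandwidth seen by a message of the given size under the model. */
double effective_bandwidth(double bytes, double latency_s, double peak_bytes_per_s) {
    return bytes / transfer_time_s(bytes, latency_s, peak_bytes_per_s);
}

/* Message size at which the model reaches half of peak bandwidth. */
double half_bandwidth_size(double latency_s, double peak_bytes_per_s) {
    return latency_s * peak_bytes_per_s;
}
```

Under this model, with 2.25 μs latency and 1.3 GB/s peak, a message needs to be about 2925 bytes before it sees half of the peak bandwidth (roughly 0.65 GB/s).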
System Administration
  Active manager
    –Synchronize each node’s OS
    –Partition blade functionality
    –Control access rights
  Embedded processor
    –Monitors health (heartbeats)
    –Can restart nodes
  Issues?
Reconfigurable Computing and the Cray XD1
Connecting to the “Einstein” Accelerator
  [Figure: block diagram of the FPGA attachment. Labels: RAP NI, Host (HT), Net (IB), HT, FPGA HT I/F, FPGA Port, Fabric Port, User-defined Circuits, four QDR2 I/F blocks each paired with a 2MB SRAM]
Example: Random Number Generator
  Monte Carlo app in need of good random numbers
    –Mersenne twister
  Implemented in FPGA
    –FPGA pushes to host memory
    –301 vs 101 Million Integers/s
    –~1.2 GB/s
  [Figure: block diagram with NI, CPU, Host Memory, and an RNG circuit in the FPGA]
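
For readers unfamiliar with the generator, here is a compact software sketch of the standard 32-bit Mersenne Twister (MT19937). It illustrates the algorithm only; the slide does not say which variant or word size the FPGA circuit uses, so the choice of MT19937 and the names `mt19937_t`, `mt_seed`, and `mt_next` are assumptions. If the outputs are 32-bit integers, 301 million integers/s times 4 bytes is about 1.2 GB/s, which is consistent with the rate on the slide.

```c
#include <stdint.h>

#define MT_N 624
#define MT_M 397

typedef struct {
    uint32_t state[MT_N];
    int index;
} mt19937_t;

/* Standard MT19937 seeding recurrence. */
void mt_seed(mt19937_t *mt, uint32_t seed) {
    mt->state[0] = seed;
    for (int i = 1; i < MT_N; i++) {
        uint32_t prev = mt->state[i - 1];
        mt->state[i] = 1812433253u * (prev ^ (prev >> 30)) + (uint32_t)i;
    }
    mt->index = MT_N; /* force a twist before the first output */
}

/* Regenerate all 624 state words in place. */
static void mt_twist(mt19937_t *mt) {
    for (int i = 0; i < MT_N; i++) {
        uint32_t y = (mt->state[i] & 0x80000000u)
                   | (mt->state[(i + 1) % MT_N] & 0x7fffffffu);
        uint32_t next = mt->state[(i + MT_M) % MT_N] ^ (y >> 1);
        if (y & 1u) {
            next ^= 0x9908b0dfu;
        }
        mt->state[i] = next;
    }
    mt->index = 0;
}

/* Return the next tempered 32-bit output. */
uint32_t mt_next(mt19937_t *mt) {
    if (mt->index >= MT_N) {
        mt_twist(mt);
    }
    uint32_t y = mt->state[mt->index++];
    y ^= y >> 11;
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}
```

Seeded with the conventional default 5489, the first output of MT19937 is 3499211612, which makes a handy check when comparing a hardware generator against a software reference.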
General XD1 Comments
  Reconfigurable computing
    –FPGA in memory
    –Fast local memory
  Other accelerators
    –ClearSpeed
  Global address space
    –Opteron limits (40-bit physical address)
  Vendor lock-in
    –Incompatible network
    –All-in-one box?
  Current NI is a bottleneck
  Density vs. Reliability
  Value-added features
  [Column headings on the slide: Good, Not-so-Good]
Friendly Users?
  We have a month left on evaluation
    –Could use feedback from other users