Reconfigurable Supercomputing
2 Challenges in Computing Space and Power Limits: Large systems need space… A hypothetical 50 node 1 rack unit system requires est. 20 square feet of floor space for air flow One rack (dual core) delivers 100 CPU units Speedup potential only 50 for 99% parallel application …for cooling systems Need for ambient air cooling increases with CPU power Multiple cores still means more cooling per node Bottom line: Building-size refrigerators (data centers) expensive to build, expand or retrofit To build: $67M to build data center consuming 15 MW power To operate: $12M+ per year at 40% capacity [Stahlberg06]
3 Challenges in Computing Challenges in face of increased demands: Increasing amounts of data generated By research simulations and instruments By surveillance sensors, cameras, RFID Longer secure records keeping Government regulations (SOX, privacy, HIPAA) Protecting and leveraging R/D intellectual property Data volumes are soaring Genbank volume doubles every 18 months [Walmart preparing to add TBs regularly for RFID] Increased application complexity Demand for even lower times to solution
4 A Solution in Reconfigurable Computing FPGA computing is well established MAPLD, RECON, ACM/IEEE, etc. FCCM around for 14 years Diverse tools employed by logic designers in lieu of an ASIC approach Generalizable platform Reconfigurable – an option to change without physical replacement Well developed embedded markets Vehicle control systems Defense systems (radars, missiles, aircraft) Medical systems Communication/networking hardware
5 Application-Specific vs. General Supercomputing Application-specific acceleration : Cryptography, Image processing, Bioinformatics Supercomputing (HPC): Each node: −A commercial microprocessor + −One or more large commercial FPGAs. Largest supercomputer companies: 1.Cray Research, 2.SRC, 3.SGI (Silicon Graphics Inc.)
7 Multi-Core Speedup Amdahl’s Law: Speedup by multi-cores is limited. Even when the number of cores is large, it is limited by the serial part: −If 88% of the operations are not parallelizable, (regardless of the number of cores): Max speedup = 1/(1 - 0.12) = 1.136 ! −To achieve 50x speedup, at least 98% of the application must be appropriate for substantial acceleration.
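The figures on this slide, and the "speedup potential only 50" figure for a 99% parallel application on 100 CPU units, follow from Amdahl's law. A minimal Python sketch for checking them; the function names are illustrative and not part of the original slides:

```python
def amdahl_speedup(parallel_fraction: float, n_units: float) -> float:
    """Speedup predicted by Amdahl's law when the parallel part runs on n_units."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


def max_speedup(parallel_fraction: float) -> float:
    """Upper bound on speedup as the number of units grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    print(round(max_speedup(0.12), 3))          # 88% serial: prints 1.136
    print(round(max_speedup(0.98), 1))          # 98% parallel: prints 50.0
    print(round(amdahl_speedup(0.99, 100), 1))  # 99% parallel, 100 units: prints 50.3
```

Note that 98% is the requirement in the limit of unlimited cores; with a finite number of cores the parallel fraction must be higher still to reach 50x.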
8 FPGA Technology The potential of FPGAs in supercomputing went unused until recently, due to: limited FPGA resources (e.g., for FP operations), general unavailability of tightly coupled RC/CPU systems, lack of high-level programming languages suitable for rapid FPGA code development. Rapid progress in FPGA technology in recent years: n x 10^2 MHz clock rate n x 10 Gbps serial IO On-board scalar processor cores 45nm process allowing for −low power consumption, −very high logic density. Terabyte/sec memory bandwidth, n x 10 Tops/sec. − Powerful device for supercomputing
9 Common FPGA Acceleration Techniques 1.Coprocessor model: with FPGA available via an I/O bus −typically PCI(X) or VME. Data is loaded into the FPGA in a DMA operation, Results are loaded back to main memory. Advantages: PCI and VME buses are everywhere through all general-purpose computing platforms. − Solid and well-known development environment. Inexpensive solution, Can use existing ASIC simulation software solutions.
10 Common FPGA Acceleration Techniques 1.Coprocessor model: Disadvantages: I/O bus limitations on bandwidth to main memory. HDL use is difficult for most ISV or end-user programmers. −Solution 1: Schematic-based tools allow the FPGA programmer to take existing modules (e.g. FFT module) and link together with other modules. −A bitstream is created. −Solution 2: A compiler converts the C-like constructs into HDL. −Problem: Inefficient conversion. −Tools usually allow some sections to be described in HDL.
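The I/O bus limitation in the coprocessor model can be made concrete with a rough timing model: the accelerated kernel only pays off if the DMA transfers in and out do not eat the gain. The sketch below assumes transfers and computation do not overlap; the function name and all numbers in the example are hypothetical, chosen only for illustration.

```python
def offload_speedup(cpu_time_s, kernel_speedup, bytes_in, bytes_out, bus_bytes_per_s):
    """Effective speedup of offloading a kernel to an FPGA coprocessor.

    Assumes (for illustration) that DMA in, FPGA compute, and DMA out
    happen strictly one after another.
    """
    transfer_time = (bytes_in + bytes_out) / bus_bytes_per_s
    fpga_time = cpu_time_s / kernel_speedup
    return cpu_time_s / (fpga_time + transfer_time)


if __name__ == "__main__":
    # Hypothetical case: a 1 s kernel, 100x faster on the FPGA,
    # moving 400 MB each way over a bus that sustains 1 GB/s.
    s = offload_speedup(1.0, 100.0, 400e6, 400e6, 1e9)
    print(round(s, 2))  # prints 1.23
```

In this made-up case the 100x kernel speedup shrinks to roughly 1.2x, because 0.8 s of the 0.81 s total is spent on the bus.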
11 Challenges * Challenges in using FPGAs for supercomputing: 1.Floating-Point Arithmetic: There is no native support for floating-point arithmetic on FPGAs. −Floating-point soft cores tend to require large amounts of area. −To achieve high clock frequencies, they must be deeply pipelined. −Floating-point performance of FPGAs is competitive with that of GPPs and is increasing at a faster rate than that of GPPs
12 Challenges * Challenges in using FPGAs for supercomputing: 2.Low Clock Frequency: Especially because of the programmable interconnects. −The main ways FPGAs obtain high performance are the exploitation of pipelining and parallelism. −The use of pipelining and parallelism has limits.
13 Challenges * Challenges in using FPGAs for supercomputing: 3.Development of Designs: Designs for FPGAs have traditionally been written in HDL. −Unfamiliar to most in the scientific computing community. −At this lower level, developing large designs can be even more difficult than it is with high-level programming languages. −Development time has followed a digital design timeframe: −Months to define, design, develop and deliver an executable image. −Numerous efforts are aimed at facilitating high-level spec-to-hardware translation. −These efforts are becoming especially important with the development of reconfigurable computers. −All solutions are different (Nallatech, Mitrionics, DSPLogic, Celoxica, ImpulseC, SRC, …)
14 Challenges * Challenges in using FPGAs for supercomputing: 4.Acceleration of Complete Applications: Many of the early acceleration studies focused on individual tasks (e.g., matrix operations and FFT). −Impressive speedups in individual kernels do not necessarily translate into impressive speedups when the kernels are integrated back into complete applications.
15 Floating Point Arithmetic FP arithmetic: Many HPC applications require double precision FP arithmetic. −So prevalent that benchmarking application ranking supercomputers (LINPACK) heavily uses DP FP math. FPGA vendors tailor their products toward their dominant customers: −DSP, network applications, embedded computing − None need FP performance. Embedded FP: −[Chong09] A flexible, generic embedded FPU, which over a variety of applications will improve performance and save a significant amount of FPGA real estate when compared to implementations on current FPGAs. −Can be configured to perform a wide range of operations: −Floating-point adder and multiplier: −one double-precision operation or −two single-precision operations in parallel. −Access to the large integer multiplier, adder and shifters in the FPU
17 Speedup Comparison Published FPGA supercomputing application results: Shows only the trend (not normalized to a common processor). Many of them compare a highly optimized FPGA FP implementation with non-optimized software. [Craven07] SP: single precision DP: double precision
18 FPGA for Floating Point Arithmetic [Dou05]: One of the highest performance benchmarks of 15.6 GFLOPS for FP matrix multiplication: −Places 39 floating-point processing elements on a theoretical Xilinx XC2VP125 FPGA. −A linear array of MAC elements, linked to a host processor providing memory access. −Pipelined to a depth of 12, permitting operation at a frequency up to 200 MHz. −Interpolating for Virtex-II Pro device: 12.4 GFLOPS, −4x a 3.2GHz Intel Pentium. −Taken as an absolute upper limit on FPGA’s DP FP performance.
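The 15.6 GFLOPS figure is consistent with a simple peak-rate calculation, sketched below in Python. The assumption that each multiply-accumulate (MAC) element completes two floating-point operations per cycle (one multiply, one add) is mine, not stated on the slide; the function names are illustrative.

```python
def peak_gflops(n_pe, clock_mhz, flops_per_pe_per_cycle=2):
    """Peak rate of a linear array of MAC elements, in GFLOPS.

    flops_per_pe_per_cycle=2 assumes one multiply and one add per
    element per cycle (an assumption, not taken from the slide).
    """
    return n_pe * flops_per_pe_per_cycle * clock_mhz * 1e6 / 1e9


def pes_for(gflops, clock_mhz, flops_per_pe_per_cycle=2):
    """Number of elements needed for a given peak rate under the same model."""
    return gflops * 1e9 / (flops_per_pe_per_cycle * clock_mhz * 1e6)


if __name__ == "__main__":
    print(round(peak_gflops(39, 200), 1))  # 39 elements at 200 MHz: prints 15.6
    print(round(pes_for(12.4, 200), 1))    # elements implied by 12.4 GFLOPS
    print(round(12.4 / 4, 1))              # Pentium rate implied by the "4x" claim
```

Under this model the interpolated 12.4 GFLOPS would correspond to about 31 elements at the same clock, and the "4x" comparison implies roughly 3.1 GFLOPS for the 3.2 GHz Pentium. Both are inferences from the slide's numbers, not figures quoted from [Dou05].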
19 Cost-Performance Comparison Dou architecture vs uP. Considering cost-performance: Worst processor beats best FPGA in current technology. But only chip cost is considered in the table: −Costs for circuit board and other components (motherboard, memory, network, etc.) are necessary to produce a functioning supercomputer. Most clusters incorporating FPGAs also include a host processor to handle serial tasks and communication, − so the table favors FPGAs. * Cell processor : from Sony * System X : Virginia Tech’s supercomputing cluster
20 FPGA FP vs. Processors FPGA performance is improving faster than that of processors. Analysis suggests that in 2009-2012 FPGAs will be as cost-effective as processors.
21 Xilinx Core vs. Dou Architecture Xilinx 18-bit multipliers: FP core combines 16 of 18-bit multipliers to produce the 52-bit multiplication needed for DP FP mult. Dou design uses 9 of them. Non-standard data formats: IEEE standard format prevents users from leveraging an FPGA’s configurability to effectively customize for a specific problem.
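The multiplier counts can be illustrated by building a wide product from narrow partial products. The Python sketch below splits each operand into fixed-width chunks and counts the narrow multiplications used. The chunk widths are assumptions: 17 bits corresponds to treating a signed 18-bit multiplier as 17 unsigned bits, which gives a 4 x 4 tiling (16 multipliers), while 18 usable bits would give a 3 x 3 tiling (9). How the Dou design actually reaches nine multipliers is not stated on the slide, so the second case is only one plausible reading.

```python
import math


def split(value, chunk_bits, n_chunks):
    """Split a non-negative integer into n_chunks little-endian chunks."""
    mask = (1 << chunk_bits) - 1
    return [(value >> (i * chunk_bits)) & mask for i in range(n_chunks)]


def tiled_multiply(a, b, operand_bits, chunk_bits):
    """Multiply two operand_bits-wide integers using chunk_bits-wide multipliers.

    Returns (product, number_of_narrow_multiplications).
    """
    n = math.ceil(operand_bits / chunk_bits)
    a_chunks = split(a, chunk_bits, n)
    b_chunks = split(b, chunk_bits, n)
    product = 0
    multiplies = 0
    for i, a_i in enumerate(a_chunks):
        for j, b_j in enumerate(b_chunks):
            # Each partial product is one narrow multiplier; the shift
            # places it at the right weight before accumulation.
            product += (a_i * b_j) << ((i + j) * chunk_bits)
            multiplies += 1
    return product, multiplies


if __name__ == "__main__":
    a = (1 << 52) - 1
    b = 0xABCDEF1234567
    print(tiled_multiply(a, b, 52, 17)[1])  # prints 16
    print(tiled_multiply(a, b, 52, 18)[1])  # prints 9
```

In both cases the reassembled product equals the full-width product; only the number of narrow multipliers differs.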
22 FPGA-Based Supercomputing The main differences with accelerator board: 1.The FPGA board communicates with conventional processors as well as other FPGA boards through a high-bandwidth, low-latency interconnection network. −An order of magnitude higher bandwidth, −Latency is on the order of microseconds. FPGA computation can share state with microprocessor more easily, The granularity of computation on the FPGAs can be smaller. 2.There can be many FPGA boards on the interconnection network, The aggregate system can solve very large supercomputing problems.
23 SRC Architecture * MAP boards: Contain high-end Xilinx FPGAs. Exploits the DRAM interface of Pentium processors as a communication port to the “MAP” board SNAP ASIC: manages the protocol and communication between a dual processor Pentium node and its associated MAP (FPGA) board.
24 SRC Architecture Cluster: These microprocessor/FPGA units can be combined using commercial interconnect such as Gigabit Ethernet to form a cluster. InterConnections: A proprietary switching network connects multiple “MAPStations.” MAP boards can also communicate directly through chain ports, −by-passing the interconnection network for large, multi-FPGA board designs.
25 SGI’s RASC Reconfigurable Application-Specific Computing: Combines −low power + volume cost + high performance of new FPGAs and −very high I/O capability of the SGI Altix system. −e.g., 120 adds in parallel instead of using the 12 adders in an Intel Itanium 2 (10x improvement). A software environment that allows control of powerful yet unfamiliar FPGAs with the familiar C language and common Linux tools (e.g., gdb). Applications in use: −image processing, −encryption.
26 SRC’s Explicit/Implicit Architecture * SRC’s explicitly controlled processor is called MAP®. Implicitly Controlled Device: dense logic device, higher clock rates, typically fixed logic (µP, DSP, ASIC, etc.). Explicitly Controlled Device: direct execution logic, lower clock rates, typically reconfigurable (FPGA, CPLD, OPLD, etc.). [Diagram labels: Carte™ Programming Environment; Fortran; C; Unified Executable; Implicit Device; Explicit Device; Memory; I/O; Bridge; Memory Control]
27 MAP® Implementation * Direct Execution Logic (DEL) made up of one or more User Logic devices Control circuits allow explicit control of memory prefetch and data access Multiple banks of On-Board Memory maximize local memory bandwidth GPIO ports allow direct MAP to MAP chain connections or direct data input Multiple DMA engines support
Memory level                      Capacity   Bandwidth
Distributed SRAM in User Logic    264 KB     844 GB/s
Block SRAM in User Logic          648 KB     260 GB/s
On-Board SRAM                     28 MB      9.6 GB/s
Microprocessor Memory             8 GB       1400 MB/s
[Block diagram labels: Controller (XC2V6000); User Logic 1 (XC2V6000); User Logic 2 (XC2V6000); six banks of dual-ported On-Board Memory (24 MB), 4800 MB/s (6 x 64b); dual-ported memory (4 MB); GPIO, 192b, 2400 MB/s each; 108b links; 1400 MB/s sustained payload]
28 MAPstation Configurations SRC MAPstation™: SRC-6 uses standard external network connections. Configurations pictured: Tower; 2U; Single MAP Workstation; Portable. SNAP: SRC-developed interface through which µP boards connect to (and share memory with) the MAP processors. 4x more than PCI-X133. [Diagram labels: MAPstation with microprocessors (P), Memory, SNAP™, MAP, GPIO ports, PCI-X; Wide Area Network; Storage Area Network; Disk; Local Area Network]
29 SRC Cluster Based Systems Utilizes standard clustering technology. [Diagram: four SRC-6 MAPstation™ units, each with microprocessors (P), Memory, SNAP™, a MAP® with GPIO port, and PCI-X, joined by Gig Ethernet etc.; Wide Area Network; Storage Area Network; Disk; Local Area Network]
30 SRC MAPstation™ with Hi-Bar™ MAPstation towers hold up to 3 MAP or memory nodes. [Diagram: MAPstation Tower; MAPstation with 2 MAPs and Common Memory; labels: PCI-X/EXP, microprocessors (P), Memory, SNAP™, SRC Hi-Bar™ Switch, MAP with GPIO ports, MEMORY; Wide Area Network; Storage Area Network; Disk; Local Area Network]
32 R/T Spectrum Analyzer Performance
Compute          FFT Time   Speedup
Microprocessor   487 s*     1x
MAP              1.6 s      304x
Sources of Performance: C with optimized functional units; Pipelined loops; Parallel code blocks; Streams extend pipelines
33 R/T Video – Edge Detect [Diagram labels: 4 VGA cameras; Buffers; Median Filter; Prewitt Edge Detect; Four images on monitor; 54 MPixels/s (120 FPS)]
34 180° Edge Detection Performance
Compute          Speedup
Microprocessor   1x*
MAP              120x
Sources of Performance: C with optimized functional units – Data access; Pipelined loops; Parallel code blocks; Streams to fuse loops – extend pipelines
35 R/T Video – Target Recognition [Diagram labels: VGA camera; Image on monitor; 30 FPS; Probe Code; Target Recognition; Probe Description]
36 R/T Video – Target Recognition Performance
Compute          Time/Image   Speedup
Microprocessor   2.5 s        1x
MAP              0.007 s      357x
Sources of Performance: C with optimized functional units – Data access; Pipelined loops; Parallel functional units
37 Cray XD1 Combines AMD dual processor Opteron nodes, a proprietary “RapidArray” interconnection network, and FPGA boards. As with SRC, the interconnect network is low latency and high bandwidth. Implements the network interface logic in an FPGA rather than a specialized ASIC.
Vendor   Processor   Comm BW    FPGA      Memory   Inter-FPGA interconnect
SRC      Pentium 4   4.8 GB/s   V2 6K     24 MB    Chaining, 4.8 GB/s
Cray     Opteron     3.2 GB/s   V2Pro50   32 MB    RapidIO, 3.125 Gb/s
40 Tools * HPC users are scientists, researchers or engineers who want to accelerate some scientific application. −Acquainted with programming languages (C, Fortran, MATLAB, …) −Some HLL-to-gates synthesis tools: −Celoxica, SRC, …. Problems: The state of these tools does not remove the need for hardware expertise: −Hardware debugging and interfacing are still needed. Porting an existing scientific code to an RC platform using one of these languages is not as simple as just recompiling the code with a different compiler to run on a different microprocessor. −Requires adaptation of the code to the available FPGA resources, −with which scientific application developers are not familiar. Inefficient synthesis: −Translating an inherently sequential HL description into parallel hardware eats into the performance of hardware accelerators.
41 HLL-to-FPGA Compilation * Three Approaches: 1.Compile a subset of an existing language (e.g., C or Java) to hardware. −Typically omits some operations: −dynamic memory allocation, −recursion, −complex pointer-based data structures. 2.Extend a base sequential language with constructs to manipulate bit widths, explicitly describe parallelism, and connect pieces of hardware. −Celoxica’s Handel C, −Impulse C, −MAP C compiler in SRC’s Carte programming environment 3.Create a language for algorithmic description: −University of Montreal’s SHard1, −Mitrion-C data-flow language. −Simplifies the compiler’s work, −but it can require programmers to significantly restructure algorithmic description as well as rewrite in a new syntax.
42 Impulse C From Impulse Accelerated Technologies −www.ImpulseC.com Can process blocks of C code (most often represented by one or a small number of C subroutines) into VHDL/Verilog. Enables the automated scheduling of C statements for increased parallelism and automated and semi-automated optimizations such as loop pipelining and unrolling. Interactive tools let designers iteratively analyze and experiment with alternative hardware pipelining strategies.
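To see why loop pipelining and unrolling matter, a rough cycle-count model helps. The Python sketch below is a generic textbook-style model, not a description of how Impulse C schedules code; the function names and example numbers are illustrative assumptions.

```python
import math


def sequential_cycles(iterations, depth):
    """Cycles when each iteration must finish before the next one starts."""
    return iterations * depth


def pipelined_cycles(iterations, depth, initiation_interval=1, unroll=1):
    """Cycles for a pipelined loop.

    depth: cycles for one iteration to pass through the pipeline.
    initiation_interval: cycles between successive iteration starts.
    unroll: number of loop-body copies working side by side.
    """
    groups = math.ceil(iterations / unroll)
    return depth + (groups - 1) * initiation_interval


if __name__ == "__main__":
    n, depth = 1000, 12  # hypothetical loop: 1000 iterations, 12-stage body
    print(sequential_cycles(n, depth))           # prints 12000
    print(pipelined_cycles(n, depth))            # prints 1011
    print(pipelined_cycles(n, depth, unroll=4))  # prints 261
```

The model ignores memory bandwidth and loop-carried dependencies, both of which can force a larger initiation interval in practice.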
43 Mitrion-C Mitrion-C and the Mitrion virtual processor From Mitrionics www.mitrionics.com Offers a fully parallel programming language. −In standard C, programmers describe the program’s order of execution, −which does not fit well with parallel execution. −Mitrion-C’s processing model is based on data dependencies, −a much better fit.
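The contrast between order-of-execution and data-dependency models can be sketched in a few lines of Python. The example below is not Mitrion-C and does not reflect its syntax or compiler; it only shows how, given the data dependencies between operations, one can derive which operations could run at the same time. The operation names are hypothetical.

```python
def dataflow_levels(deps):
    """Group operations into steps that could execute concurrently.

    deps maps each operation name to the list of operations whose results
    it needs. The graph is assumed to be acyclic. Operations in the same
    step have no dependencies on one another.
    """
    if not deps:
        return []
    level = {}

    def depth(op):
        # An operation can start one step after its latest input is ready.
        if op not in level:
            level[op] = 1 + max((depth(d) for d in deps[op]), default=-1)
        return level[op]

    for op in deps:
        depth(op)
    steps = [[] for _ in range(max(level.values()) + 1)]
    for op in deps:
        steps[level[op]].append(op)
    return steps


if __name__ == "__main__":
    # y = (a*b) + (c*d); z = (a*b) - e
    deps = {"mul_ab": [], "mul_cd": [], "add_y": ["mul_ab", "mul_cd"], "sub_z": ["mul_ab"]}
    print(dataflow_levels(deps))
    # prints [['mul_ab', 'mul_cd'], ['add_y', 'sub_z']]
```

A sequential C program would list these four operations in some fixed order; the dependency view exposes that only two steps are needed.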
44 SRC Carte Carte Design Environment: A traditional program development methodology: −Write code in C or Fortran, −Compile, −Debug via standard debugger, −Edit code, −Recompile, −…. When the application runs correctly in a microprocessor environment, it is recompiled and targeted for MAP (the direct execution logic processor).
45 SRC Carte Three compilation modes: Debug mode: Carte compiles microprocessor code using a MAP emulator to verify the interaction between the CPU and MAP. Simulation mode: Carte supports applications composed of C or Fortran and Verilog or VHDL. −The compilation produces an HDL simulation executable that supports the simulation of generated logic. Hardware compilation mode: the target is the direct execution logic that runs in MAP’s FPGAs. −In this mode, Carte optimizes for parallelism by pipelining loops, scheduling memory references, and supporting parallel code blocks and streams.
46 Handel-C From Celoxica www.celoxica.com Synthesizes user code to FPGAs. User replaces the algorithmic loop in the original Fortran, C, or C++ source application with a Celoxica API call to elicit the C code that is to be compiled into the FPGA. Handel-C extends C with constructs for hardware design, such as parallelism and timing.
47 Trident Trident: Synthesizes circuits from an HLL. 2006 R&D 100 award for innovative technology. Provides an open framework for exploring algorithmic C computation on FPGAs: −by mapping the C program’s FP operations to hardware FP modules. Users are free to select floating-point operators from a variety of standard libraries or to import their own. −Libraries: e.g., FPLibrary, Quixilica. The compiler’s open source code is available on SourceForge: −http://trident.sf.net
48 Trident The programmer manually partitions the program into software and hardware sections and writes C code to coordinate the data communication between the two parts. The C code to be mapped to hardware must conform to the synthesizable subset of C: Not permitted: −Print statements, −Recursion, −Dynamic memory allocation, −Function arguments or returned values, −Calls to functions with variable-length argument lists, −Arrays without a declared size.
Floating Point Arithmetic on FPGAs
50 Floating Point Arithmetic Floating-point arithmetic is essential to many scientific applications. Floating-point arithmetic (especially double-precision) requires a great deal of hardware resources. It has only recently become possible to implement many floating-point cores on a single FPGA. Disadvantages of FP cores: To achieve a high clock rate, FP cores for FPGAs must be deeply pipelined. − Difficult to reuse the same FP core for a series of computations that are dependent upon one another. They use a large area. − Important to use as few FP cores in an architecture as possible.
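The difficulty with dependent computations can be illustrated with accumulation through a single pipelined adder. In the rough Python model below, a naive running sum waits the full adder latency for every addition, while interleaving as many partial sums as there are pipeline stages keeps the adder busy. The cycle counts are a simplified model, and the latency of 12 is an illustrative value; integers are used so both orderings give identical results (with real floating point, reordering additions can change rounding).

```python
def naive_accumulate(values, latency):
    """Running sum where each add waits for the previous result."""
    total = 0
    cycles = 0
    for v in values:
        total += v
        cycles += latency  # the next add cannot issue until this one completes
    return total, cycles


def interleaved_accumulate(values, latency):
    """Round-robin over `latency` partial sums so one add issues per cycle."""
    partial = [0] * latency
    for i, v in enumerate(values):
        # The partial sum used here was last updated `latency` cycles ago,
        # so its result has just left the pipeline.
        partial[i % latency] += v
    cycles = len(values) + latency  # issue one add per cycle, then drain
    total = partial[0]
    for p in partial[1:]:
        total += p
        cycles += latency  # final reduction modeled as a dependent chain
    return total, cycles


if __name__ == "__main__":
    data = list(range(1000))
    print(naive_accumulate(data, 12))        # prints (499500, 12000)
    print(interleaved_accumulate(data, 12))  # prints (499500, 1144)
```

The interleaved version needs extra registers and a final reduction, which is part of why reusing one deeply pipelined core across dependent operations complicates the design.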
51 References [Scrafano06] Scrafano, “Accelerating scientific computing applications with reconfigurable hardware,” PhD Dissertation, 2006. −Presents the most in-depth analysis of the performance of double-precision floating-point FFTs on FPGAs [Hemmert05] K. Scott Hemmert and Keith D. Underwood. “An analysis of the double precision floating-point FFT on FPGAs,” In IEEE Symposium on Field-Programmable Custom Computing Machines, April 2005. [Govindu04] Gokul Govindu, Ling Zhuo, Seonil Choi, and Viktor K. Prasanna. “Analysis of high performance floating point arithmetic on FPGAs,” In Proceedings of the 11th Reconfigurable Architectures Workshop (RAW 2004), April 2004. [Zhuo04] Ling Zhuo and Viktor K. Prasanna. “Scalable modular algorithms for floating-point matrix multiplication on FPGAs,” In Proceedings of the 11th Reconfigurable Architectures Workshop (RAW 2004), April 2004.
52 Benchmarks A supercomputer's performance is often measured by its performance in executing the LINPACK benchmark, which is composed of linear algebra routines. Additionally, the fast Fourier transform (FFT), as well as LINPACK and several others, appears as a kernel in the HPCChallenge benchmark suite, showing its importance as a kernel in scientific computing applications. Jack J. Dongarra and Piotr Luszczek. Introduction to the HPCChallenge benchmark suite. Technical Report ICL-UT-05-01, University of Tennessee, 2005. A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary. HPL - a portable implementation of the high-performance Linpack benchmark for distributed-memory computers. http://www.netlib.org/benchmark/hpl/. Top 500 supercomputing sites. www.top500.org.
53 References [Craven07] Craven and Athanas, “Examining the viability of FPGA supercomputing,” EURASIP Journal of Embedded Systems, Article ID 93652, 2007. [SGI04] Silicon Graphics, Inc. “Extraordinary Acceleration of Workflows with Reconfigurable Application-specific Computing from SGI,” (http://www.sgi.com/pdfs/3721.pdf), 2004. [Gokhale05] Gokhale, Graham, “Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays,” Springer, 2005. Tripp, Gokhale, Peterson, “Trident: From High-Level Language to Hardware Circuitry,” IEEE Computer Magazine, March 2007. [Dou05] Dou, et al, “64-bit Floating-Point FPGA Matrix Multiplication,” FPGA 2005. [Stahlberg06] Stahlberg, Wohlever and Strenski, “Defining Reconfigurable Supercomputing,” Cray User Group, 2006.