FPGA Applications IEEE Micro Special Issue on Reconfigurable Computing Vol. 34, No.1, Jan.-Feb. 2014 Dr. Philip Brisk Department of Computer Science and.


1 FPGA Applications IEEE Micro Special Issue on Reconfigurable Computing Vol. 34, No. 1, Jan.-Feb. 2014 Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223

2 Guest Editors Walid Najjar (UCR) Paolo Ienne (EPFL, Lausanne, Switzerland)

3 High-Speed Packet Processing Using Reconfigurable Computing Gordon Brebner and Weirong Jiang Xilinx, Inc.

4 Contributions PX: a domain-specific language for packet-processing PX-to-FPGA compiler Evaluation of PX-designed high-performance reconfigurable computing architectures Dynamic reprogramming of systems during live packet processing Demonstrated implementations running at 100 Gbps and higher rates

5 PX Overview Object-oriented semantics – Packet processing described as component objects – Communication between objects Engine – Core packet processing functions: parsing, editing, lookup, encryption, pattern matching, etc. System – Collection of communicating engines and/or sub

6 Interface Objects Packet – Communication of packets between components Tuple – Communication of non-packet data between components

7 OpenFlow Packet Classification in PX Send packet to parser engine

8 OpenFlow Packet Classification in PX Parser engine extracts a tuple from the packet Send the tuple to the lookup engine for classification

9 OpenFlow Packet Classification in PX Obtain the classification response from the lookup engine Forward the response to the flowstream output interface

10 OpenFlow Packet Classification in PX Forward the packet (without modification) to the outstream output interface

11 PX Compilation Flow 100 Gbps : 512-bit datapath 10 Gbps: 64-bit datapath Faster to reconfigure the generated architecture than the FPGA itself (not always applicable)

12 OpenFlow Packet Parser (4 Stages) Allowable Packet Structures: (Ethernet, VLAN, IP, TCP/UDP) (Ethernet, IP, TCP/UDP) Stage 1:Ethernet Stage 2:VLAN or IP Stage 3:IP or TCP/UDP Stage 4:TCP/UDP or bypass

13 OpenFlow Packet Parser Max. packet size Max. number of stacked sections Structure of the tuple I/O interface Ethernet header expected first Determine the type of the next section of the packet Determine how far to go in the packet to reach the next section Set relevant members in the tuple being populated
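The parsing steps listed above (Ethernet first, determine the next section type, advance to it, fill in tuple members) can be sketched in software. This is only an illustrative Python sketch, not PX code: the function name `parse`, the dictionary used as the tuple, and the chosen field names are assumptions, and only IPv4 is handled.

```python
import struct

ETH_VLAN, ETH_IPV4 = 0x8100, 0x0800   # EtherType values
PROTO_TCP, PROTO_UDP = 6, 17          # IPv4 protocol numbers

def parse(packet: bytes) -> dict:
    """Walk the packet sections and populate a tuple (here a dict)."""
    tup = {}
    # Stage 1: the Ethernet header is expected first.
    tup["dst_mac"], tup["src_mac"] = packet[0:6], packet[6:12]
    (ethertype,) = struct.unpack_from("!H", packet, 12)
    offset = 14
    # Stage 2: optional VLAN tag (2-byte TCI, then the real EtherType).
    if ethertype == ETH_VLAN:
        tci, ethertype = struct.unpack_from("!HH", packet, offset)
        tup["vlan_id"] = tci & 0x0FFF
        offset += 4
    if ethertype != ETH_IPV4:
        return tup  # not an allowed structure; stop early
    # Stage 2 or 3: IPv4. The header length says how far to skip.
    header_len = (packet[offset] & 0x0F) * 4
    proto = packet[offset + 9]
    tup["ip_proto"] = proto
    tup["src_ip"] = packet[offset + 12:offset + 16]
    tup["dst_ip"] = packet[offset + 16:offset + 20]
    offset += header_len
    # Stage 3 or 4: TCP and UDP both start with source and destination ports.
    if proto in (PROTO_TCP, PROTO_UDP):
        tup["src_port"], tup["dst_port"] = struct.unpack_from("!HH", packet, offset)
    return tup
```

A packet without a VLAN tag simply skips the VLAN branch, which mirrors the "or bypass" behaviour of the later stages.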

14 OpenFlow Packet Parser

15 Three-stage Parser Pipeline Internal units are customized based on PX requirements Units are firmware-controlled – Specific actions can be altered (within reason) without reconfiguring the FPGA – e.g., add or remove section classes handled at that stage

16 OpenFlow Packet Parser Results Adjust throughput for wasted bits at the end of packets
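The throughput adjustment can be illustrated with simple arithmetic. The sketch below assumes a packet occupies a whole number of datapath words, so the unused bits of the last word are wasted; the function names are hypothetical and the model ignores inter-packet gaps and any other overhead.

```python
import math

def min_clock_mhz(line_rate_gbps: float, datapath_bits: int) -> float:
    """Clock (MHz) needed to carry line_rate_gbps on a datapath_bits-wide bus."""
    return line_rate_gbps * 1e3 / datapath_bits

def effective_gbps(packet_bytes: int, datapath_bits: int, clock_mhz: float) -> float:
    """Useful throughput when the last word of each packet is only partly filled."""
    words = math.ceil(packet_bytes * 8 / datapath_bits)
    return packet_bytes * 8 * clock_mhz / (words * 1e3)
```

Under these assumptions a 512-bit datapath needs about 195.3 MHz for 100 Gbps, and a 65-byte packet, which spills 8 bits into a second word, achieves only about half of that rate.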

17 Ternary Content Addressable Memory (TCAM) X = Don’t Care TCAM width and depth are configurable in PX
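A software model makes the don't-care semantics concrete. In this minimal sketch each entry is a value/mask pair, a mask bit of 0 means X, and the first matching entry wins, which stands in for priority encoding. The helper names `ternary` and `tcam_lookup` are assumptions for illustration only.

```python
def ternary(pattern: str):
    """Convert a pattern such as '10X1' into (value, mask); X = don't care."""
    value = int(pattern.replace("X", "0"), 2)
    mask = int("".join("0" if c == "X" else "1" for c in pattern), 2)
    return value, mask

def tcam_lookup(entries, key: int):
    """Return the result of the first (highest-priority) matching entry, or None.

    entries is a list of (value, mask, result) in priority order.
    """
    for value, mask, result in entries:
        # Only bit positions where mask is 1 take part in the comparison.
        if ((key ^ value) & mask) == 0:
            return result
    return None
```

A hardware TCAM compares all entries in parallel; the loop here only reproduces the result, not the timing.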

18 TCAM Implementation in PX depth key length result bitwidth The parser (previous example) extracted the tuple Set up TCAM access Collect the result TCAM architecture is generated automatically as described by one of the authors’ previous papers

19 TCAM Architecture

20 TCAM Parameterization PX Description – Depth (N) – Width – Result width Operational Properties – Number of rows (R) – Units per row (L) – Internal pipeline stages per unit (H) Performance – Each unit handles N/(LR) TCAM entries – Lookup latency is LH + 2 clock cycles: LH to process the row, 1 cycle for priority encoding, 1 cycle for registered output
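The two performance expressions are easy to evaluate for a candidate configuration. The function below is a small sketch; its name and the example numbers in the usage note are hypothetical and do not come from the paper.

```python
def tcam_params(depth_n: int, rows_r: int, units_per_row_l: int, stages_per_unit_h: int):
    """Return (share of the depth handled per unit, lookup latency in cycles)."""
    per_unit = depth_n / (units_per_row_l * rows_r)   # N / (LR)
    # LH cycles to traverse a row, +1 priority encoding, +1 registered output.
    latency = units_per_row_l * stages_per_unit_h + 2
    return per_unit, latency
```

For instance, N = 1024, R = 8, L = 4, H = 2 gives 32 per unit and a latency of 10 cycles.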

21 Results

22 Database Analytics: A Reconfigurable Computing Approach Bharat Sukhwani, Hong Min, Mathew Thoennes, Parijat Dube, Bernard Brezzo, Sameh Asaad, and Donna Eng. Dillenberger IBM T.J. Watson Research Center

23 Example: SQL Query

24 Online Transaction Processing (OLTP) Rows are compressed for storage and I/O savings Rows are decompressed when issuing queries Data pages are cached in a dedicated memory space called the buffer pool I/O operations between buffer pool and disk are transparent Data in the buffer pool is always up-to-date

25 Table Traversal Indexing – Efficient for locating a small number of records Scanning – Sift through the whole table – Used when a large number of records match the search criteria

26 FPGA-based Analytics Accelerator

27 Workflow DBMS issues a command to the FPGA – Query specification and pointers to data FPGA – Pulls pages from main memory – Parses pages to extract rows – Queries the rows – Writes qualifying rows back to main memory in database-formatted pages

28 FPGA Query Processing Join and sort operations are not streaming – Data re-use is required – FPGA block RAM storage is limited – Perform predicate evaluation and projection before join and sort Eliminate disqualified rows Eliminate unneeded columns
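The ordering argument (filter and project first, so that less data reaches the non-streaming join and sort) can be sketched as a streaming generator. The function name and the row representation as Python dicts are assumptions for illustration.

```python
def filter_and_project(rows, predicates, columns):
    """Stream rows, dropping disqualified rows and unneeded columns."""
    for row in rows:
        # Predicate evaluation: eliminate disqualified rows.
        if all(pred(row) for pred in predicates):
            # Projection: keep only the columns the query needs.
            yield {c: row[c] for c in columns}
```

Because this stage touches each row once and keeps no state, it streams; only its smaller output has to be buffered for the join or sort that follows.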

29 Where is the Parallelism? Multiple tiles process DB pages in parallel – Concurrently evaluate multiple records from a page within a tile Concurrently evaluate multiple predicates against different columns within a row

30 Predicate Evaluation Stored predicate values Logical Operations (Depends on query) #PEs and reduction network size are configurable at synthesis time
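One way to picture the predicate evaluation unit is a set of processing elements (PEs), each comparing one column against a stored value, followed by a reduction network of logical operations. The sketch below models that idea in software; the encoding of the reduction tree as nested tuples and all names are assumptions, not the paper's design.

```python
import operator

OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def evaluate_row(row, pes, tree):
    """Evaluate one row.

    pes:  list of (column, op, stored_value), one per PE.
    tree: an int (index of a PE output) or ('and' | 'or', left, right).
    """
    # Each PE produces one bit; in hardware these run concurrently.
    bits = [OPS[op](row[col], stored) for col, op, stored in pes]

    def reduce(node):
        if isinstance(node, int):
            return bits[node]
        kind, left, right = node
        lhs, rhs = reduce(left), reduce(right)
        return (lhs and rhs) if kind == "and" else (lhs or rhs)

    return reduce(tree)
```

The number of PEs and the tree shape are fixed per query here, loosely mirroring the slide's point that they are configurable at synthesis time while the stored values depend on the query.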

31 Two-Phase Hash-Join Stream the smaller join table through the FPGA Hash the join columns to populate a bit vector Store the full rows in off-chip DRAM Join columns and row addresses are stored in the address table (BRAM) Rows that hash to the same position are chained in the address table Stream the second table through the FPGA Hash rows to probe the bit vector (eliminate non-matching rows) Matches issue reads from off-chip DRAM Reduces off-chip accesses and stalls
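The two phases can be sketched in software as follows. Python lists and dicts stand in for the bit vector, the BRAM address table, and off-chip DRAM; the function name, the `table_bits` parameter, and the use of Python's built-in `hash` are assumptions for illustration.

```python
def hash_join(small, large, key, table_bits=10):
    """Join two lists of dict rows on `key` using a build phase and a probe phase."""
    size = 1 << table_bits
    bit_vector = [False] * size
    row_store = []        # stands in for off-chip DRAM holding full rows
    address_table = {}    # slot -> chained list of (join value, row address)

    # Phase 1: stream the smaller table and build the structures.
    for row in small:
        slot = hash(row[key]) % size
        bit_vector[slot] = True
        row_store.append(row)
        address_table.setdefault(slot, []).append((row[key], len(row_store) - 1))

    # Phase 2: stream the second table and probe.
    joined = []
    for row in large:
        slot = hash(row[key]) % size
        if not bit_vector[slot]:
            continue  # filtered by the bit vector: no row-store read needed
        for value, addr in address_table[slot]:
            if value == row[key]:           # resolve hash collisions in the chain
                joined.append({**row_store[addr], **row})
    return joined
```

The bit-vector check is what keeps non-matching rows from ever issuing a read to the row store.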

32 Database Sort Support long sort keys (tens of bytes) Handle large payloads (rows) Generate large sorted batches (millions of records) Coloring bins keys into sorted batches https://en.wikipedia.org/wiki/Tournament_sort
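Coloring can be illustrated with replacement selection, a standard way of producing long sorted runs from a fixed-size tournament structure. In the sketch below a binary heap stands in for the tournament tree, and the color is the batch number attached to each key. This is an assumption-laden software analogy, not the paper's hardware design.

```python
import heapq
from itertools import islice

_END = object()  # sentinel marking the end of the input stream

def sorted_batches(keys, tree_size):
    """Split a key stream into sorted batches using a fixed-size heap."""
    it = iter(keys)
    heap = [(0, k) for k in islice(it, tree_size)]  # (color, key)
    heapq.heapify(heap)
    batches = []
    while heap:
        color, key = heapq.heappop(heap)
        if color == len(batches):
            batches.append([])          # first key of a new batch
        batches[color].append(key)
        nxt = next(it, _END)
        if nxt is not _END:
            # A key smaller than the one just emitted cannot join the
            # current batch, so it gets the next color.
            heapq.heappush(heap, (color if nxt >= key else color + 1, nxt))
    return batches
```

With `tree_size=3`, the input `[5, 1, 9, 3, 7, 2, 8]` yields the batches `[1, 3, 5, 7, 8, 9]` and `[2]`: the batches can be much longer than the heap itself.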

33 CPU Savings Predicate Eval. Decompression + Predicate Eval.

34 Throughput and FPGA Speedup

35 Scaling Reverse Time Migration Performance Through Reconfigurable Dataflow Engines Haohan Fu 1, Lin Gan 1, Robert G Clapp 2, Huabin Ruan 1, Oliver Pell 3, Oskar Mencer 3, Michael Flynn 2, Xiaomeng Huang 1, and Guangwen Yang 1 1 Tsinghua University 2 Stanford University 3 Maxeler Technologies

36 Migration (Geology) https://upload.wikimedia.org/wikipedia/commons/3/38/GraphicalMigration.jpg

37 Reverse Time Migration (RTM) Imaging algorithm Used for oil and gas exploration Computationally demanding

38 RTM Pseudocode Iterate over time-steps, and 3D grids Iterations over shots (sources) are independent and easy to parallelize Iterate over time-steps, and 3D grids Propagate source wave fields from time 0 to nt - 1 Propagate receiver wave fields from time nt - 1 to 0 Cross-correlate the source and receiver wave field at the same time step to accumulate the result Add the recorded source signal to the corresponding location Add the recorded receiver signal to the corresponding location Boundary conditions
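The loop structure described by these annotations can be sketched on a 1D grid. This is a toy model under several assumptions: a second-order leapfrog update with fixed boundaries, a constant squared Courant number `c2`, and every source wave field kept in memory, which is precisely what becomes impractical at real problem sizes. All function names are hypothetical.

```python
def step(prev, cur, c2):
    """One leapfrog step of the 1D wave equation with fixed (zero) ends."""
    n = len(cur)
    nxt = [0.0] * n
    for i in range(1, n - 1):
        lap = cur[i - 1] - 2.0 * cur[i] + cur[i + 1]
        nxt[i] = 2.0 * cur[i] - prev[i] + c2 * lap
    return nxt

def rtm_shot(nx, nt, src_pos, src_signal, rec_pos, rec_signal, c2=0.25):
    """Image contribution of one shot."""
    # Propagate the source wave field from time 0 to nt - 1.
    src_fields = []
    prev, cur = [0.0] * nx, [0.0] * nx
    for t in range(nt):
        nxt = step(prev, cur, c2)
        nxt[src_pos] += src_signal[t]       # add the recorded source signal
        src_fields.append(nxt)
        prev, cur = cur, nxt
    # Propagate the receiver wave field from time nt - 1 to 0.
    image = [0.0] * nx
    prev, cur = [0.0] * nx, [0.0] * nx
    for t in range(nt - 1, -1, -1):
        nxt = step(prev, cur, c2)
        nxt[rec_pos] += rec_signal[t]       # add the recorded receiver signal
        for i in range(nx):                 # cross-correlate at the same time step
            image[i] += src_fields[t][i] * nxt[i]
        prev, cur = cur, nxt
    return image

def rtm(shots, nx, nt):
    """Shots are independent, so their images are simply summed."""
    image = [0.0] * nx
    for shot in shots:
        for i, v in enumerate(rtm_shot(nx, nt, *shot)):
            image[i] += v
    return image
```

Each element of `shots` is a tuple `(src_pos, src_signal, rec_pos, rec_signal)`; since shots do not interact, the outer loop is the natural place to parallelize.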

39 RTM Computational Challenges Cross-correlate source and receiver signals – Source/receiver wave signals are computed in different directions in time – The size of a source wave field for one time-step can be 0.5 to 4 GB – Checkpointing: store source wave field at certain time steps and recompute the remaining steps when needed Memory access pattern – Neighboring points may be distant in the memory space – High cache miss rate (when the domain is large)

40 Hardware

41 General Architecture

42 Java-like HDL / MaxCompiler Stencil Example Automated construction of a window buffer that covers different points needed by the stencil Data type: no reason that all floating-point data must be 32- or 64-bit IEEE compliant (float/double)

43 Performance Tuning Optimization strategies – Algorithmic requirements – Hardware resource limits Balance resource utilization so that none becomes a bottleneck – LUTs – DSP Blocks – block RAMs – I/O bandwidth

44 Algorithm Optimization Goal: – Avoid data transfer required to checkpoint source wave fields Strategies: – Add randomness to the boundary region – Make computation of source wave fields reversible
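Reversibility can be demonstrated on a toy problem. A leapfrog update of the wave equation is symmetric in time, so running the same update with the roles of the two most recent fields swapped walks the wave field backwards, and no checkpoints are needed. The sketch below assumes a 1D grid with fixed zero-valued ends and no source injection; it does not model the randomized boundary region mentioned above, and the names are hypothetical.

```python
def step(prev, cur, c2):
    """One leapfrog step of the 1D wave equation with fixed (zero) ends."""
    n = len(cur)
    nxt = [0.0] * n
    for i in range(1, n - 1):
        lap = cur[i - 1] - 2.0 * cur[i] + cur[i + 1]
        nxt[i] = 2.0 * cur[i] - prev[i] + c2 * lap
    return nxt

def run(prev, cur, steps, c2=0.25):
    """Advance the pair (prev, cur) by `steps` time steps."""
    for _ in range(steps):
        prev, cur = cur, step(prev, cur, c2)
    return prev, cur

def run_backwards(prev, cur, steps, c2=0.25):
    """Recover the pair from `steps` time steps earlier."""
    # Swapping the arguments makes the same update march back in time.
    later, earlier = run(cur, prev, steps, c2)
    return earlier, later
```

Only the last two wave fields have to be kept; earlier ones are regenerated on demand, up to floating-point rounding.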

45 Custom BRAM Buffers 37 pt. Star Stencil on a MAX3 DFE 24 concurrent pipelines at 125 MHz Concurrent access to 37 points per cycle Internal memory bandwidth of 426 Gbytes/sec

46 More Parallelism Process multiple points concurrently – Demands more I/O Cascade multiple time steps in a deep pipeline – Demands more buffers

47 Number Representation 32-bit floating-point was default Convert many variables to 24-bit fixed-point – Smaller pipelines => MORE pipelines Floating-point: 16,943 LUTs, 23,735 flip-flops, 24 DSP48Es Fixed-point: 3,385 LUTs, 3,718 flip-flops, 12 DSP48Es

48 Hardware Decompression I/O is a bottleneck – Compress data off-chip – Decompress on the fly – Higher I/O bandwidth Wave field data – Must be read and written many times – Lossy compression acceptable 16-bit storage of 32-bit data Velocity data and read-only Earth model parameters – Store values in a ROM – Create a table of indices into the ROM Decompression requires ~1300 LUTs and ~1200 flip-flops
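Both schemes can be sketched in a few lines. The slide does not say which 16-bit format is used, so the example assumes IEEE half precision purely for illustration; the ROM scheme is shown as a table of distinct values plus a list of indices. All function names are hypothetical.

```python
import struct

def compress16(values):
    """Store each value in 2 bytes (IEEE half precision, an assumed format)."""
    return struct.pack(f"<{len(values)}e", *values)

def decompress16(blob):
    """Expand 2-byte values back to Python floats; precision has been lost."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))

def build_rom(values):
    """Return (rom, indices): distinct values and an index per grid point."""
    rom = sorted(set(values))
    position = {v: i for i, v in enumerate(rom)}
    return rom, [position[v] for v in values]

def read_rom(rom, indices):
    """Decompress by looking each index up in the ROM."""
    return [rom[i] for i in indices]
```

The 16-bit path halves the bytes moved per wave field value at the cost of precision, while the ROM path is lossless and pays off when a read-only parameter takes few distinct values.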

49 Results

50 Performance Model Memory bandwidth constraint # points processed in parallel # bytes per point compression ratio frequency memory bandwidth Resource constraint (details omitted)
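The formula itself did not survive in the transcript, so the sketch below is only a plausible reading of the listed terms, not the paper's model: it assumes that points processed in parallel × bytes per point × frequency, divided by the compression ratio, must not exceed the memory bandwidth. The function name and the example numbers are hypothetical.

```python
import math

def max_parallel_points(bandwidth_bytes_per_s, bytes_per_point,
                        compression_ratio, frequency_hz):
    """Largest number of points per cycle the memory system can feed.

    Assumed constraint (not taken from the paper):
        points * bytes_per_point * frequency / compression_ratio <= bandwidth
    """
    limit = bandwidth_bytes_per_s * compression_ratio / (bytes_per_point * frequency_hz)
    return math.floor(limit)
```

With made-up values of 38.4 GB/s, 16 bytes per point, a compression ratio of 2, and 125 MHz, this bound is 38 points per cycle; without compression it halves to 19.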

51 Performance Model Cost (cycles consumed on redundant streaming of overlapping halos) Model # points processed in parallel # time steps cascaded in one pass frequency

52 Model Evaluation

53 Fast, Flexible High-Level Synthesis from OpenCL Using Reconfiguration Contexts James Coole and Greg Stitt University of Florida

54 Compiler Flow Intermediate Fabric – Coarse-grained network of arithmetic processing units synthesized on the FPGA – 1000x faster place-and-route than an FPGA directly – 72-cycle maximum reconfiguration time

55 Intermediate Fabric Multiple datapaths per kernel Reconfigure the FPGA to swap datapaths The intermediate fabric can implement each kernel by reconfiguring the fabric routing network The core arithmetic operators are often shared across kernels

56 Reconfiguration Contexts One intermediate fabric may not be enough Generate an application-specific set of intermediate fabrics Reconfigure the FPGA to switch between intermediate fabrics

57 Context Design Heuristic Maximize number of resources reused across kernels in a context Minimize area of individual contexts Use area savings to scale-up contexts to support kernels that were not known at context design-time Kernels grouped using K-means clustering
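The grouping step can be sketched with a small K-means implementation. The feature vectors here are hypothetical per-kernel operator counts (for example adders and multipliers), and sizing a context as the element-wise maximum over its kernels is an assumption used to illustrate resource sharing; the paper's actual features and distance measure may differ.

```python
def kmeans(points, k, iters=20):
    """Plain K-means; the first k points seed the centers (deterministic)."""
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def context_resources(points, assign, k):
    """Resources per context: element-wise max over the kernels it hosts."""
    contexts = []
    for c in range(k):
        members = [p for p, a in zip(points, assign) if a == c]
        contexts.append(tuple(max(col) for col in zip(*members)) if members else ())
    return contexts
```

Kernels with similar operator mixes land in the same context, so each context stays small while its operators are reused by every kernel assigned to it.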

58 Compiler Results

59 Configuration Bitstream Sizes and Recompilation Times

60 ReconOS: An Operating Systems Approach for Reconfigurable Computing Andreas Agne 1, Markus Happe 2, Ariane Keller 2, Enno Lübbers 3, Bernhard Plattner 2, Marco Platzner 1, and Christian Plessl 1 1 University of Paderborn, Germany 2 ETH Zürich, Switzerland 3 Intel Labs, Europe

61 Does Reconfigurable Computing Need an Operating System? Application partitioning – Sequential: Software/CPU – Parallel/deeply pipelined: Hardware/FPGA Partitioning requires – Communication and synchronization The OS provides – Standardization and portability The alternative is – System- and application-specific services – Error-prone – Limited portability – Reduced designer productivity

62 ReconOS Benefits Application development is structured and starts with software Hardware acceleration is achieved by design space exploration OS-defined synchronization and communication mechanisms provide portability Hardware and software threads are the same from the application development perspective Implicit support for partial dynamic reconfiguration of the FPGA

63 Programming Model Application partitioned into threads (HW/SW) Threads communicate and synchronize using one of the programming model’s objects – Communication: Message queues, mailboxes, etc. – Synchronization: Mutexes
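The programming model can be pictured with ordinary software threads. The sketch below uses Python's standard `threading` and `queue` modules as stand-ins for OS objects; it is not the ReconOS API, and the function and variable names are hypothetical. The point is that the worker only sees a mailbox and a mutex, so whether it runs in software or hardware is invisible to the rest of the application.

```python
import queue
import threading

def run_pipeline(samples):
    """Producer and worker threads that communicate only through OS-style objects."""
    mailbox = queue.Queue(maxsize=4)   # communication object
    lock = threading.Lock()            # synchronization object (mutex)
    results = []

    def producer():
        for s in samples:
            mailbox.put(s)             # blocks while the mailbox is full
        mailbox.put(None)              # end-of-stream marker

    def worker():                      # a candidate for hardware acceleration
        while True:
            s = mailbox.get()          # blocks while the mailbox is empty
            if s is None:
                break
            with lock:                 # protect the shared result list
                results.append(s * s)

    threads = [threading.Thread(target=producer), threading.Thread(target=worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_pipeline([1, 2, 3])` returns `[1, 4, 9]`; replacing the worker with a different implementation requires no change to the producer.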

64 Stream Processing Software

65 Stream Processing Hardware

66 OSFSM in VHDL VHDL library wraps all OS calls with VHDL procedures – Transitions are guarded by an OS-controlled signal done, Line 47 – Blocking OS calls can pause execution of a HW thread (e.g., mutex_lock())

67 ReconOS System Architecture Delegate Thread Interface between HW thread and the OS via OSIF The OS is oblivious to HW acceleration Imposes non-negligible overhead on OS calls

68 OSFSM / OSIF / CPU Interface Handshaking provides synchronization OS requests are a sequence of words communicated via FIFOs

69 HW Thread Interfaces

70 ReconOS Toolflow

71 Example: Object Tracking

