2005 MAPLD/203 Linderman
FPGA Acceleration of Information Management Services
Richard Linderman, Mark Linderman (Air Force Research Laboratory, Information Directorate)
Chun-Shin Lin (Electrical and Computer Engr., Univ. of Missouri-Columbia)

Introduction
Information Management capabilities are built upon core services such as publish and subscribe (Pub-Sub).
Enhanced Pub-Sub services allow subscribers to specify predicates that filter out undesired publications.
The goal of this research is to accelerate XML predicate evaluation using FPGAs.
Recent advances include 2-bit logic encodings to enhance partial evaluation, and incremental design techniques to handle volatile predicates.

Pub-Sub Brokering Problem
Information about a publication is described in an XML metadata document.
What the subscribers want is defined using XPath predicates.
The pub-sub brokering system evaluates the predicates against the XML document to find matches.

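The brokering step can be illustrated in software. The following is a minimal sketch, not the authors' implementation: the element names, the subscriber names, and the predicates are hypothetical, and it relies on the limited XPath subset supported by Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata document; the element names are illustrative only.
METADATA = (
    "<metadata>"
    "<type>mil.af.rl.mti.report</type>"
    "<format>text/plain</format>"
    "<priority>1</priority>"
    "</metadata>"
)

# Hypothetical subscriptions: subscriber name -> XPath-style predicate.
SUBSCRIPTIONS = {
    "alice": "metadata[format='text/plain']",
    "bob": "metadata[type='mil.af.rl.imagery']",
    "carol": "metadata[priority='1'][type='mil.af.rl.mti.report']",
}


def matching_subscribers(xml_text, subscriptions):
    """Return the subscribers whose predicate matches the metadata document."""
    # ElementTree predicates must follow a tag name, so the parsed document
    # is placed under a wrapper element and matched as a child.
    wrapper = ET.Element("publication")
    wrapper.append(ET.fromstring(xml_text))
    return sorted(
        name
        for name, predicate in subscriptions.items()
        if wrapper.find(predicate) is not None
    )


print(matching_subscribers(METADATA, SUBSCRIPTIONS))
```

With this sample data, alice and carol match and bob does not. Software evaluates the predicates one after another, whereas the FPGA approach described in these slides evaluates them in parallel.
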
Metadata in XML: an example
mil.af.rl.mti.report 1 0 text/plain T14:20:00 VMAQ T163105Z

FPGAs for Acceleration
Use an FPGA to implement a finite state machine that parses the metadata document.
The XML document is written into the block RAM of the FPGA from a microprocessor through DMA.
Predicates are evaluated in parallel, in combinational logic, using the data generated by the parser.

The System
[Block diagram. Labels: Micro-processor (a node on HHPC); Input FIFO & Output FIFO; 64-bit bus; FPGA board; XML Parser; Predicates Evaluator]

Software on PC
[Tool-flow diagram. Labels: Schema; List of Leaves Generator; Predicates; Predicate VHDL Generator; List of Leaves; Hash Table Generator; Perfect Hash Parameters & Table; pred.vhd; Parser VHDL Modifier; constant.vhd; parser.vhd; Xilinx ISE; Top-level VHDL; pe.hex or pe.x86; Parsed results to HHPC]
A one-step tool has been developed to automatically generate the x86 file used to configure an FPGA.

Handling of Data Types
Character strings: converted to 16-bit data by hashing.
Numbers: up to 3 decimal digits are supported (sufficient for standard lat/lon representations).
Date/time: converted into a 32-bit representation.

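The slides do not give the hash function used for character strings. As an illustration only, the sketch below uses a simple polynomial hash folded to 16 bits and searches by brute force for a multiplier that gives distinct codes for a known set of strings. The function names, the multiplier search, and the sample strings are assumptions, not the authors' perfect-hash generator.

```python
def hash16(text, multiplier=31):
    """Fold a string into a 16-bit code with a polynomial hash (assumed scheme)."""
    value = 0
    for byte in text.encode("utf-8"):
        value = (value * multiplier + byte) & 0xFFFF
    return value


def find_collision_free_multiplier(strings, candidates=range(3, 1000, 2)):
    """Return the first multiplier that maps every known string to a distinct code.

    Returns None if no candidate works. This brute-force search is a stand-in
    for a real perfect-hash parameter generator.
    """
    unique = set(strings)
    for multiplier in candidates:
        codes = {hash16(s, multiplier) for s in unique}
        if len(codes) == len(unique):
            return multiplier
    return None


# Hypothetical leaf values that predicates might compare against.
known_strings = ["mil.af.rl.mti.report", "text/plain", "VMAQ"]
multiplier = find_collision_free_multiplier(known_strings)
for s in known_strings:
    print(s, hex(hash16(s, multiplier)))
```

Once every string is reduced to a 16-bit code, a string equality test becomes a fixed-width comparison, which is the kind of operation that maps naturally onto combinational logic.
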
Date/Time Comparison
Date/Time (in format T ) is converted into the following 32-bit representation:
  year    month   day     hour    min     sec
  6 bits  4 bits  5 bits  5 bits  6 bits  6 bits
  | + 32 years
A sub-state-machine performs time-zone adjustment, which can require up to 5 clock cycles.

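The field widths above add up to 32 bits (6 + 4 + 5 + 5 + 6 + 6). The sketch below packs a timestamp in that field order, most significant field first, so that comparing two packed integers compares the two times chronologically. The base year is an assumption chosen for illustration, since the slide does not state how the 6-bit year field is anchored, and time-zone adjustment is left out.

```python
from datetime import datetime

BASE_YEAR = 2000  # assumption: the slide does not give the year offset

# (field name, width in bits), most significant field first.
FIELDS = (
    ("year", 6),
    ("month", 4),
    ("day", 5),
    ("hour", 5),
    ("minute", 6),
    ("second", 6),
)


def pack_datetime(dt, base_year=BASE_YEAR):
    """Pack a datetime into a 32-bit integer using the slide's field widths."""
    year_offset = dt.year - base_year
    if not 0 <= year_offset < 64:
        raise ValueError("year does not fit in the 6-bit year field")
    values = (year_offset, dt.month, dt.day, dt.hour, dt.minute, dt.second)
    packed = 0
    for value, (_, width) in zip(values, FIELDS):
        packed = (packed << width) | value
    return packed


earlier = pack_datetime(datetime(2005, 9, 7, 14, 20, 0))
later = pack_datetime(datetime(2005, 9, 7, 16, 31, 5))
print(earlier < later)  # integer order follows chronological order
```

Because the year occupies the top bits and the seconds the bottom bits, a single 32-bit magnitude comparison is enough for "before" and "after" predicates.
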
Experimental Evaluation
To evaluate how much time the FPGA saves, we compared predicate evaluation time on two configurations:
1. Xeon® processor (software only): the Xeon® computes all predicates.
2. Xeon® with the FPGA coprocessor (H/W + S/W): the Xeon® evaluates only the residual (maybe true/maybe false) predicates.
All HHPC software is written in C.

Timing Results
Character string (leaves) dominant case (75 MHz clock), for 10 publications and 30 predicates:
  Average DMA transaction: usec
  Average check time: usec
  Average check all time: usec
Numerical data (leaves) dominant case (50 MHz clock), for 20 publications and 100 predicates:
  Average DMA transaction: usec
  Average check time: usec
  Average check all time: usec

The Need for Incremental Design
When the set of predicates becomes large, synthesis time can become quite long (hours for thousands of predicates).
If only a few predicates change, re-synthesizing the whole set is time consuming.
Partitioning the set into subsets allows re-synthesis of only the altered subset(s), which saves time.

Stable and Volatile Predicate Sets
Stable Set
  Includes stable predicates
  Size is BIG
  Synthesis time could be LONG
  Re-synthesis is not required
Volatile Set
  Includes volatile predicates
  Size is small
  Re-synthesis takes a short time

Experimental Results Using Stable and Volatile Sets
1. For 400 stable predicates plus 6 in the Volatile Set
                     Synthesis   Place and route (with area grouping)
   Nonincremental    1197 sec    480 sec
   Incremental          7 sec    183 sec
2. For 1000 stable predicates plus 6 in the Volatile Set
                     Synthesis   Place and route (with area grouping)
   Nonincremental    7240 sec    752 sec
   Incremental          7 sec    278 sec
* The major saving is in synthesis.
* Synthesis time grows much faster than linearly with set size (about 6x the time for 2.5x the stable predicates).

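The end-to-end saving follows from adding synthesis and place-and-route times for each flow. The short script below does that arithmetic using only the figures reported on this slide; the dictionary layout and function name are illustrative.

```python
# Timings in seconds copied from the slide: (synthesis, place and route).
RESULTS = {
    400: {"nonincremental": (1197, 480), "incremental": (7, 183)},
    1000: {"nonincremental": (7240, 752), "incremental": (7, 278)},
}


def summarize(results):
    """Return {stable_count: (full_total, incremental_total, speedup)}."""
    summary = {}
    for stable_count, flows in results.items():
        full = sum(flows["nonincremental"])
        incremental = sum(flows["incremental"])
        summary[stable_count] = (full, incremental, full / incremental)
    return summary


for stable_count, (full, incremental, speedup) in summarize(RESULTS).items():
    print(f"{stable_count} stable predicates: {full} s vs {incremental} s "
          f"({speedup:.1f}x faster)")
```

The totals are 1677 s versus 190 s for 400 stable predicates and 7992 s versus 285 s for 1000, so in the incremental flow place and route, not synthesis, accounts for most of the remaining time.
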
Equal Size Small Predicate Sets
All sets are of equal size.
Any set may be re-synthesized, but for better efficiency it is suggested to put the volatile predicates into one or a few subsets.
Advantage: any set can be re-synthesized inexpensively.
Disadvantage: more hardware is used.

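The suggestion to group volatile predicates can be made concrete with a small sketch. It partitions a predicate list into equal-size subsets, placing the volatile ones first, and then reports which subsets must be re-synthesized after a change. The function names and the predicate labels are hypothetical; this is not the authors' tool.

```python
def partition(predicates, volatile, subset_size):
    """Split predicates into equal-size subsets, with volatile predicates grouped first."""
    ordered = [p for p in predicates if p in volatile]
    ordered += [p for p in predicates if p not in volatile]
    return [ordered[i:i + subset_size] for i in range(0, len(ordered), subset_size)]


def subsets_to_resynthesize(subsets, changed):
    """Return the indices of subsets containing at least one changed predicate."""
    return [i for i, subset in enumerate(subsets) if any(p in changed for p in subset)]


predicates = [f"p{i}" for i in range(10)]
volatile = {"p3", "p7"}

grouped = partition(predicates, volatile, subset_size=5)
ungrouped = partition(predicates, set(), subset_size=5)

print(subsets_to_resynthesize(grouped, changed=volatile))
print(subsets_to_resynthesize(ungrouped, changed=volatile))
```

In this example, grouping the two volatile predicates means only subset 0 needs re-synthesis when both change, whereas the ungrouped partition would require re-synthesizing both subsets.
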
Experimental Results Using Small Sets (for 1000 predicates)
                                One set    5 subsets     10 subsets
                                (1000)     (200 each)    (100 each)
  Total synthesis time          7240 s     1644 s        976 s
  Place and route               286 s      299 s         369 s
  Reconfigure when one subset
    has been changed            7526 s     < 609 s       < 455 s
  Hardware (FFs)
  Hardware (LUTs)

Experimental Results (cont.)
Observations/Explanation
1. Synthesis time decreases significantly when smaller subsets are used. Reason: minimization is local.
2. Place and route time increases when smaller subsets are used. Reason: the number of LUTs increases, so there is more hardware to handle.
3. The number of flip-flops (FFs) stays the same. Reason: the predicate hardware is completely combinational.
4. The number of LUTs increases when the subsets are smaller. Reason: logic function minimization is local and thus less efficient/complete.

Conclusion
Hardware + software evaluation times are substantially better than those of software-only implementations.
Incremental synthesis helps significantly; the benefit of incremental place and route appears limited.
Partitioning predicates into smaller subsets makes incremental synthesis considerably more efficient. The drawback is that hardware usage increases.
For about 1000 predicates, reconfiguration time can be reduced from 7526 sec (over 2 hours) to several minutes (e.g., 455 seconds), depending on how the set is partitioned.