Warp Processing -- Dynamic Transparent Conversion of Binaries to Circuits. Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside.


1 Warp Processing -- Dynamic Transparent Conversion of Binaries to Circuits. Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine. Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Motorola/Freescale. Contributing students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), Kris Miller (MS 2007), David Sheldon (3rd-yr PhD), Ryan Mannion (2nd-yr PhD), Scott Sirowy (1st-yr PhD).

2 Frank Vahid, UC Riverside2/57 Outline FPGAs Overview Hard to program --> Binary-level partitioning Warp processing Techniques underlying warp processing Overall warp processing results Directions and Summary

3 Frank Vahid, UC Riverside 3/57 FPGAs. FPGA -- Field-Programmable Gate Array. An off-the-shelf chip, evolved in the early 1990s, that implements a custom circuit just by downloading a stream of bits ("software"). Basic idea: an N-address memory can implement any N-input combinational logic function (note: there is no actual "gate array" inside); the memory is called a lookup table, or LUT. The FPGA "fabric" consists of thousands of small (~3-input) LUTs (larger LUTs are inefficient), thousands of switch matrices (SMs) for programming interconnections, and possibly additional hard-core components like multipliers and RAM. CAD tools automatically map a desired circuit onto the FPGA fabric. [Figure: a 4x2 memory acting as a LUT implements two functions F and G of inputs a and b; a fabric of LUTs and SMs implements a particular circuit just by downloading particular bits.]
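The slide's "N-address memory implements N-input logic" idea can be sketched in C (a hypothetical illustration: the function names and the choice of F = a AND b, G = a XOR b are ours, not the slide's exact truth table):

```c
#include <assert.h>

/* A 4x2 "memory" acting as a lookup table: the address is the
   input pair (a<<1)|b, and each word holds two output bits {F,G}.
   Programming the FPGA means filling this memory; here we program
   F = a AND b (bit 1) and G = a XOR b (bit 0) -- an assumed
   example, not the slide's exact table. */
static const unsigned char lut[4] = {
    /* ab=00 */ 0x0,  /* F=0, G=0 */
    /* ab=01 */ 0x1,  /* F=0, G=1 */
    /* ab=10 */ 0x1,  /* F=0, G=1 */
    /* ab=11 */ 0x2,  /* F=1, G=0 */
};

int lut_F(int a, int b) { return (lut[(a << 1) | b] >> 1) & 1; }
int lut_G(int a, int b) { return lut[(a << 1) | b] & 1; }
```

Changing the logic function requires no rewiring, only different memory contents -- which is exactly why downloading bits suffices to "program" the chip.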

4 Frank Vahid, UC Riverside 4/57 FPGAs: "Programmable" like Microprocessors -- Download Bits. Microprocessor binaries: bits loaded into program memory. FPGA binaries: bits loaded into LUTs, CLBs, and SMs.

5 Frank Vahid, UC Riverside 5/57 FPGAs as Coprocessors. Coprocessor -- accelerates an application kernel by implementing it as a circuit. ASIC coprocessors are known to speed up many application kernels, with energy advantages too (e.g., Henkel'98, Rabaey'98, Stitt/Vahid'04). FPGA coprocessors also give speedup/energy benefits (Stitt/Vahid IEEE D&T'02, IEEE TECS'04). Con: more silicon (~20x) and ~4x performance overhead (Rose FPGA'06). Pro: the platform is fully programmable -- shorter time-to-market, smaller non-recurring engineering (NRE) cost, low-cost devices available, late changes possible (even in-product).

6 Frank Vahid, UC Riverside6/57 FPGAs as Coprocessors Surprisingly Competitive to ASIC FPGA 34% energy savings versus ASIC’s 48% (Stitt/Vahid IEEE D&T’02, IEEE TECS’04) A jet isn’t as fast as a rocket, but it sure beats driving

7 Frank Vahid, UC Riverside 7/57 FPGA -- Why (Sometimes) Better than Microprocessor. C code for bit reversal:
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Compiled to a processor binary, this becomes a long sequence of sll, srl, or, and and instructions and requires between 32 and 128 cycles. Hardware for bit reversal on an FPGA is simply wiring from the original x value to the bit-reversed x value, and requires only 1 cycle (speedup of 32x to 128x). In general, the FPGA advantage comes from concurrency, from the bit level up to the task level.
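The slide's bit-reversal code runs as-is once wrapped in a function; a minimal sketch (the function name bitrev32 is ours):

```c
#include <assert.h>
#include <stdint.h>

/* The slide's 32-bit bit reversal, verbatim: swap halves, then
   bytes, nibbles, bit pairs, and single bits. A processor executes
   this as a dozen-plus shift/mask/or instructions; an FPGA
   implements it as pure wiring in one cycle. */
uint32_t bitrev32(uint32_t x) {
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}
```

Reversing bit 0 into bit 31 (and so on) is an involution, so applying the function twice returns the original value -- a handy sanity check.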

8 Frank Vahid, UC Riverside 8/57 FPGAs: Why (Sometimes) Better than Microprocessor. C code for an FIR filter:
for (i=0; i < 128; i++)
    y[i] += c[i] * x[i];
On a processor, this compiles to thousands of instructions and takes several thousand cycles. Hardware for the FIR filter on an FPGA performs all 128 multiplications in parallel and sums the products through a tree of adders, taking ~7 cycles. Speedup > 100x.
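The FIR inner loop from the slide, wrapped into a runnable function (the name fir_step and the small test sizes are ours; the slide uses n = 128):

```c
#include <assert.h>

/* The slide's FIR multiply-accumulate loop in software: n
   sequential multiply-adds. The FPGA version does all n multiplies
   concurrently and sums them through an adder tree of depth
   ~log2(n) -- hence ~7 cycles for n = 128. */
void fir_step(long y[], const long c[], const long x[], int n) {
    for (int i = 0; i < n; i++)
        y[i] += c[i] * x[i];
}
```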

9 Frank Vahid, UC Riverside 9/57 FPGAs are Hard to Program. Synthesis from hardware description languages (HDLs) such as VHDL and Verilog is great for parallelism, but these are non-standard languages requiring manual partitioning (SystemC is a good step). C/C++ partitioning compilers use a language subset and are growing in importance, but requiring a special compiler (one whose flow includes profiling, partitioning, synthesis, technology mapping, and place & route, producing an FPGA binary alongside the microprocessor binary) limits adoption. There are roughly 100 software writers for every CAD user: only about 15,000 CAD seats worldwide, versus millions of compiler seats.

10 Frank Vahid, UC Riverside 10/57 Binary-Level Partitioning Helps. Binary-level partitioning (Stitt/Vahid, ICCAD'02; recent commercial product: CriticalBlue [www.criticalblue.com]) partitions and synthesizes starting from the software binary. Advantages: works with any compiler and any language, supports multiple sources, assembly/object code, and legacy code; and it incorporates better into the toolflow as a less disruptive back-end tool (traditional partitioning is done earlier, before compilation). Disadvantage: possible quality loss due to the lack of high-level language constructs (more later). The back end includes synthesis, technology mapping, and place & route.

11 Frank Vahid, UC Riverside11/57 Outline FPGAs Overview Hard to program --> Binary-level partitioning Warp processing Techniques underlying warp processing Overall warp processing results Directions and Summary

12 Frank Vahid, UC Riverside 12/57 Warp Processing. Observation: dynamic binary recompilation to a different microprocessor architecture is a mature commercial technology; e.g., modern Pentiums translate x86 to VLIW. Question: if we can statically recompile binaries to FPGA circuits, can we dynamically recompile binaries to FPGA circuits?

13 Frank Vahid, UC Riverside 13/57 Warp Processing Idea. [Architecture: µP with I-Mem and D$, profiler, FPGA, on-chip CAD.] Step 1: Initially, the software binary is loaded into instruction memory. Software binary: Mov reg3, 0; Mov reg4, 0; loop: Shl reg1, reg3, 1; Add reg5, reg2, reg1; Ld reg6, 0(reg5); Add reg4, reg4, reg6; Add reg3, reg3, 1; Beq reg3, 10, -5; Ret reg4.

14 Frank Vahid, UC Riverside 14/57 Warp Processing Idea. Step 2: The microprocessor executes the instructions in the software binary (the same binary as on the previous slide).

15 Frank Vahid, UC Riverside 15/57 Warp Processing Idea. Step 3: The profiler monitors instructions (e.g., the add and beq of the loop) and detects critical regions in the binary: critical loop detected.

16 Frank Vahid, UC Riverside 16/57 Warp Processing Idea. Step 4: The on-chip CAD reads in the critical region.

17 Frank Vahid, UC Riverside 17/57 Warp Processing Idea. Step 5: The on-chip CAD (dynamic partitioning module, DPM) decompiles the critical region into a control/data flow graph (CDFG): reg3 := 0; reg4 := 0; loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]; reg3 := reg3 + 1; if (reg3 < 10) goto loop; ret reg4.

18 Frank Vahid, UC Riverside 18/57 Warp Processing Idea. Step 6: The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit, e.g., a tree of adders.

19 Frank Vahid, UC Riverside 19/57 Warp Processing Idea. Step 7: The on-chip CAD maps the circuit onto the FPGA's CLBs and switch matrices.

20 Frank Vahid, UC Riverside 20/57 Warp Processing Idea. Step 8: The on-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more. Warped binary: Mov reg3, 0; Mov reg4, 0; loop: // instructions that interact with FPGA; Ret reg4. (Software-only vs. "warped" execution.)

21 Frank Vahid, UC Riverside 21/57 Warp Processing Idea. Likely multiple microprocessors per chip, all serviced by one on-chip CAD block.

22 Frank Vahid, UC Riverside 22/57 Warp Processing: Trend Towards Processor/FPGA Programmable Platforms. FPGAs with hard-core processors (Xilinx Virtex-II Pro, source: Xilinx; Altera Excalibur, source: Altera); FPGAs with soft-core processors (Xilinx Spartan, source: Xilinx); computer boards with FPGAs (Cray XD1, source: FPGA Journal, Apr'05).

23 Frank Vahid, UC Riverside 23/57 Warp Processing: Trend Towards Processor/FPGA Programmable Platforms. Programming is a key challenge. Solution 1: compile a high-level language to custom binaries using both the microprocessor and the FPGA (traditional partitioning, done at compile time). Solution 2: use standard microprocessor binaries and dynamically re-compile (warp) with on-chip profiling and CAD tools. Cons: less high-level information when compiling, less optimization. Pros: available to all software developers, not just specialists; enables data-dependent optimization; and, most importantly, standard binaries enable an "ecosystem" among tools, architectures, and applications. The standard-binary (and ecosystem) concept is presently absent in FPGAs and other new programmable platforms.

24 Frank Vahid, UC Riverside24/57 Outline FPGAs Overview Hard to program --> Binary-level partitioning Warp processing Techniques underlying warp processing Overall warp processing results Directions and Summary

25 Frank Vahid, UC Riverside 25/57 Warp Processing Steps (On-Chip CAD). Flow: binary in --> profiling & partitioning --> decompilation --> synthesis --> JIT FPGA compilation (technology mapping, placement, and routing) --> binary updater --> updated microprocessor binary plus FPGA (hardware) binary out.

26 Frank Vahid, UC Riverside 26/57 Warp Processing -- Profiling and Partitioning. Applications spend much time in a small amount of code: the 90-10 rule; we observed a 75-4 rule for MediaBench and NetBench. We developed an efficient hardware profiler (Gordon-Ross/Vahid, CASES'04, IEEE Trans. on Computers '06). Partitioning is straightforward: try the most critical code first.

27 Frank Vahid, UC Riverside 27/57 Warp Processing -- Decompilation. Synthesis from a binary has a key challenge: high-level information (e.g., loops, arrays) is lost during compilation, and direct translation of assembly to a circuit incurs huge overheads (measured as the overhead of the microprocessor/FPGA solution WITHOUT decompilation vs. the microprocessor alone). We need to recover the high-level information.

28 Frank Vahid, UC Riverside 28/57 Warp Processing -- Decompilation. Solution: recover high-level information from the binary -- decompilation. There is extensive previous work (for different purposes), which we adapted; we also developed new decompilation methods. Example, starting from the original C code:
long f( short a[10] ) { long accum = 0; for (int i=0; i < 10; i++) { accum += a[i]; } return accum; }
Corresponding assembly: Mov reg3, 0; Mov reg4, 0; loop: Shl reg1, reg3, 1; Add reg5, reg2, reg1; Ld reg6, 0(reg5); Add reg4, reg4, reg6; Add reg3, reg3, 1; Beq reg3, 10, -5; Ret reg4.
Control/data flow graph creation: reg3 := 0; reg4 := 0; loop: reg1 := reg3 << 1; reg5 := reg2 + reg1; reg6 := mem[reg5 + 0]; reg4 := reg4 + reg6; reg3 := reg3 + 1; if (reg3 < 10) goto loop; ret reg4.
Data flow analysis: reg3 := 0; reg4 := 0; loop: reg4 := reg4 + mem[reg2 + (reg3 << 1)]; reg3 := reg3 + 1; if (reg3 < 10) goto loop; ret reg4.
Function recovery: long f( long reg2 ) { int reg3 = 0; int reg4 = 0; loop: reg4 = reg4 + mem[reg2 + (reg3 << 1)]; reg3 = reg3 + 1; if (reg3 < 10) goto loop; return reg4; }
Control structure recovery: long f( long reg2 ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += mem[reg2 + (reg3 << 1)]; } return reg4; }
Array recovery: long f( short array[10] ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += array[reg3]; } return reg4; }
The recovered code and the original are almost identical representations.

29 Frank Vahid, UC Riverside29/57 New Decompilation Method: Loop Rerolling Problem: Compiler unrolling of loops (to expose parallelism) causes synthesis problems: Huge input (slow), can’t unroll to desired amount, can’t use advanced loop methods (loop pipelining, fusion, splitting,...) Solution: New decompilation method: Loop Rerolling Identify unrolled iterations, compact into one iteration for (int i=0; i < 3; i++) accum += a[i]; Ld reg2, 100(0) Add reg1, reg1, reg2 Ld reg2, 100(1) Add reg1, reg1, reg2 Ld reg2, 100(2) Add reg1, reg1, reg2 Loop Unrolling for (int i=0; i<3;i++) reg1 += array[i]; Loop Rerolling

30 Frank Vahid, UC Riverside 30/57 Loop Rerolling: Identify Unrolled Iterations. Original C code: x = x + 1; for (i=0; i < 2; i++) a[i] = b[i] + 1; y = x;. After compiler unrolling: x = x + 1; a[0] = b[0]+1; a[1] = b[1]+1; y = x;. Binary: Add r3, r3, 1; Ld r0, b(0); Add r1, r0, 1; St a(0), r1; Ld r0, b(1); Add r1, r0, 1; St a(1), r1; Mov r4, r3. Map each instruction to a symbol (Add r3 => B; Ld r0 => A; Add r1 => B; St a(0) => C; and so on, with Mov r4 => D), giving the string representation BABCABCD. Then find consecutively repeating instruction sequences -- adjacent occurrences of the same substring -- using a suffix tree (derived from bioinformatics techniques). Here the two unrolled iterations are each ABC (Ld, Add, St).
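The detection step can be sketched in C under a strong simplification: instructions are already abstracted to letters, and we search for one pair of adjacent identical substrings by brute force rather than with a suffix tree (which the slide's method uses for efficiency). The function name is ours:

```c
#include <assert.h>
#include <string.h>

/* Returns the start index of the longest substring s[i..i+len)
   that is immediately followed by an exact copy of itself -- two
   adjacent "unrolled iterations" -- and stores its length in *len.
   Returns -1 if no adjacent repeat exists. A real rerolling pass
   would use a suffix tree; this brute-force scan is illustrative. */
int find_adjacent_repeat(const char *s, int *len) {
    int n = (int)strlen(s);
    *len = 0;
    for (int l = n / 2; l >= 1; l--)          /* try longest first */
        for (int i = 0; i + 2 * l <= n; i++)
            if (memcmp(s + i, s + i + l, l) == 0) {
                *len = l;
                return i;
            }
    return -1;
}
```

On the slide's string "BABCABCD" this finds "ABC" starting at index 1, i.e., the two unrolled iterations, which the rerolling pass would then compact into one loop body.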

31 Frank Vahid, UC Riverside 31/57 Warp Processing -- Decompilation Study. Synthesis after decompilation often yields circuits quite similar to those synthesized from the original source: almost identical performance, small area overhead. (FPGA 2005)

32 Frank Vahid, UC Riverside 32/57 Deriving High-Level Constructs from Binaries. Recent study of decompilation robustness in the presence of compiler optimizations and across instruction sets: energy savings of 77%/76%/87% for MIPS/ARM/MicroBlaze. (ICCAD'05, DATE'04)

33 Frank Vahid, UC Riverside 33/57 Decompilation is Effective Even with High Compiler-Optimization Levels. Average speedup over 10 examples: speedups are similar on MIPS for -O1 and -O3 optimizations, similar on ARM for -O1 and -O3, and similar between ARM and MIPS (the complex instructions of ARM didn't hurt synthesis). MicroBlaze speedups are much larger: MicroBlaze is a slower microprocessor, and -O3 optimizations were very beneficial to hardware. Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.

34 Frank Vahid, UC Riverside 34/57 Decompilation Effectiveness In-Depth Study. We performed an in-depth, several-month study with a Freescale H.264 video decoder -- highly-optimized proprietary code, not reference code, which makes a huge difference. Research question: is synthesis from binaries competitive on highly-optimized code? (H.264 vs. MPEG-2: better quality, or smaller files, using more computation.)

35 Frank Vahid, UC Riverside 35/57 Optimized H.264. Larger than most benchmarks: H.264 is 16,000 lines, while previous work used 100 to several thousand lines. The highly-optimized H.264 reflects many man-hours of manual optimization and is 10x faster than the reference code used in previous works. It also profiles differently: previous examples spent ~90% of their time in several loops, whereas H.264 spends ~90% of its time in ~45 functions -- harder to speed up.

36 Frank Vahid, UC Riverside36/57 C vs. Binary Synthesis on Opt. H.264 Binary partitioning competitive with source partitioning Speedups compared to ARM9 software Binary: 2.48, C: 2.53 Decompilation recovered nearly all high-level information needed for partitioning and synthesis

37 Frank Vahid, UC Riverside 37/57 Warp Processing -- Synthesis. ROCM -- Riverside On-Chip Minimizer. Standard register-transfer synthesis, plus logic synthesis made lean: a combination of approaches from Espresso-II [Brayton et al., 1984; Hassoun & Sasao, 2002] and Presto [Svoboda & White, 1979], guided by cost/benefit analysis of operations. Result: a single expand phase instead of multiple expand/reduce/irredundant iterations over the on-set, off-set, and dc-set, and no need to compute the off-set at all (reducing memory usage). On average the result is only 2% larger than the optimal solution.

38 Frank Vahid, UC Riverside 38/57 Warp Processing -- JIT FPGA Compilation. Hard: routing is extremely compute/memory intensive and a highly iterative process. Solution: jointly design the CAD tools and the FPGA architecture, using cost/benefit analysis.

39 Frank Vahid, UC Riverside 39/57 Warp-Targeted FPGA Architecture. A CAD-specialized configurable logic fabric. Simplified switch matrices: directly connected to the adjacent CLB, with all nets routed using only a single pair of channels, allowing efficient routing (routing is by far the most time-consuming on-chip CAD task). Simplified CLBs: two 3-input, 2-output LUTs, with each CLB connected to the adjacent CLB to simplify routing of carry chains. Currently being prototyped by Intel (scheduled for the 2006 Q3 shuttle). (DATE'04)

40 Frank Vahid, UC Riverside 40/57 Warp Processing -- Technology Mapping. ROCTM -- technology mapping/packing. Decompose the hardware circuit into a DAG whose nodes correspond to basic 2-input logic gates (AND, OR, XOR, etc.). Then apply a hierarchical bottom-up graph clustering algorithm: a breadth-first traversal combines nodes to form single-output LUTs; LUTs with common inputs are combined to form the final 2-output LUTs; and LUTs in which the output of one LUT is an input to a second LUT are packed together. (Dynamic Hardware/Software Partitioning: A First Approach, DAC'03; A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE'04)

41 Frank Vahid, UC Riverside 41/57 Warp Processing -- Placement. ROCPLACE -- placement. A dependency-based positional placement algorithm: identify the critical path and place critical nodes in the center of the configurable logic fabric; use dependencies between the remaining CLBs to determine their placement; and attempt to use adjacent-CLB routing whenever possible. (DAC'03, DATE'04)

42 Frank Vahid, UC Riverside 42/57 Warp Processing -- Routing. ROCR -- Riverside On-Chip Router. Requires much less memory than VPR, as its resource graph is smaller; 10x faster execution time than VPR (timing-driven); produces circuits with critical paths 10% shorter than VPR (routability-driven). (Dynamic FPGA Routing for Just-in-Time FPGA Compilation, DAC'04)

43 Frank Vahid, UC Riverside43/57 Outline FPGAs Overview Hard to program --> Binary-level partitioning Warp processing Techniques underlying warp processing Overall warp processing results Directions and Summary

44 Frank Vahid, UC Riverside 44/57 Experiments with Warp Processing. Warp processor: ARM/MIPS plus our fabric, with the Riverside on-chip CAD tools mapping the critical region to the configurable fabric; synthesis and JIT FPGA compilation require less than 2 seconds on a lean embedded processor. Traditional HW/SW partitioning baseline: ARM/MIPS plus a Xilinx Virtex-E FPGA, with software manually partitioned using VHDL and synthesized using Xilinx ISE 4.1.

45 Frank Vahid, UC Riverside 45/57 Warp Processors Performance: Speedup (Most Frequent Kernel Only). Average kernel speedup of 41x, vs. 21x for the Virtex-E, relative to software-only execution; the WCLA's simplicity results in faster hardware circuits.

46 Frank Vahid, UC Riverside 46/57 Warp Processors Performance: Speedup (Overall, Multiple Kernels). Average speedup of 7.4x and energy reduction of 38% - 94% relative to software-only execution, assuming a 100 MHz ARM and the fabric clocked at the rate determined by synthesis.

47 Frank Vahid, UC Riverside 47/57 Warp Processors -- Results: Execution Time and Memory Requirements.
Xilinx ISE: 60 MB memory, 9.1 s.
DPM (on-chip CAD, 75 MHz ARM7): 3.6 MB memory, 1.4 s.
DPM (on-chip CAD): 3.6 MB memory, 0.2 s.

48 Frank Vahid, UC Riverside48/57 Outline FPGAs Overview Hard to program --> Binary-level partitioning Warp processing Techniques underlying warp processing Overall warp processing results Directions and Summary

49 Frank Vahid, UC Riverside 49/57 Direction: Coding Guidelines for Partitioning? The in-depth H.264 study led to a question: why aren't speedups (from binary or C) closer to the "ideal" (zero time per function)? We thus examined dozens of benchmarks in more detail. Are there simple coding guidelines that result in better speedups when kernels are synthesized to circuits?

50 Frank Vahid, UC Riverside 50/57 Synthesis-Oriented Coding Guidelines. Pass by value-return: declare a local array and copy in all data needed by a function (makes the lack of aliases explicit). Function specialization: create a function version having frequent parameter values as constants, so the bounds are explicit and the loops become unrollable. Original:
void f(int width, int height) { ... for (i=0; i < width; i++) for (j=0; j < height; j++) ... }
Rewritten (specialized):
void f_4_4() { ... for (i=0; i < 4; i++) for (j=0; j < 4; j++) ... }
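A runnable version of the specialization guideline, with an assumed body that just counts iterations (the slide elides the loop bodies):

```c
#include <assert.h>

/* General version: variable loop bounds, so a synthesis tool
   cannot fully unroll the loops at compile time. The body here
   (counting iterations) is an assumed placeholder. */
int f(int width, int height) {
    int work = 0;
    for (int i = 0; i < width; i++)
        for (int j = 0; j < height; j++)
            work++;
    return work;
}

/* Specialized version for the frequent case width = height = 4:
   the bounds are constants, so both loops are fully unrollable. */
int f_4_4(void) {
    int work = 0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            work++;
    return work;
}
```

The caller keeps the general f for rare parameter values and dispatches the frequent case to f_4_4, trading a little code size for a circuit the synthesizer can flatten.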

51 Frank Vahid, UC Riverside 51/57 Synthesis-Oriented Coding Guidelines. Algorithmic specialization: use parallelizable hardware algorithms when possible. Hoisting and sinking of error checking: keep error checking out of loops to enable unrolling. Lookup table avoidance: use expressions rather than lookup tables, so the comparisons can be parallelized across elements. Original:
int clip[512] = { ... };
void f() { ... for (i=0; i < 10; i++) val[i] = clip[val[i]]; ... }
Rewritten:
void f() { ... for (i=0; i < 10; i++) if (val[i] > 255) val[i] = 255; else if (val[i] < 0) val[i] = 0; ... }
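The lookup-table-avoidance rewrite as a runnable sketch (the name clip_expr is ours; the slide applies the same expression elementwise inside a loop, where hardware can evaluate every element's comparators in parallel):

```c
#include <assert.h>

/* Table-free saturation to [0, 255]: two comparators and two
   muxes in hardware, instead of a 512-entry memory lookup that
   serializes accesses and blocks parallelization. */
int clip_expr(int v) {
    if (v > 255) return 255;
    if (v < 0)   return 0;
    return v;
}
```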

52 Frank Vahid, UC Riverside 52/57 Synthesis-Oriented Coding Guidelines. Use explicit control flow: replace function pointers with if statements and static function calls. Original:
void (*funcArray[])(char *data) = { func1, func2, ... };
void f(char *data) { ... funcPointer = funcArray[i]; (*funcPointer)(data); ... }
Rewritten:
void f(char *data) { ... if (i == 0) func1(data); else if (i == 1) func2(data); ... }
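A runnable sketch of the explicit-control-flow rewrite (the func1/func2 bodies are assumed placeholders, since the slide elides them):

```c
#include <assert.h>

/* Assumed tiny bodies, just to make the dispatch observable. */
static void func1(char *data) { data[0] = '1'; }
static void func2(char *data) { data[0] = '2'; }

/* Rewritten dispatch: the indirect call through funcArray[i] is
   replaced by if/else with static call targets, which a synthesis
   tool can resolve and inline. */
void dispatch(int i, char *data) {
    if (i == 0)      func1(data);
    else if (i == 1) func2(data);
}
```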

53 Frank Vahid, UC Riverside53/57 Coding Guideline Results on H.264 Simple coding guidelines made large improvement Rewritten software only ~3% slower than original And, binary partitioning still competitive with C partitioning Speedups: Binary: 6.55, C: 6.56 Small difference caused by switch statements that used indirect jumps

54 Frank Vahid, UC Riverside54/57 Coding Guideline Results on Other Benchmarks Studied guidelines further on standard benchmarks Further synthesis speedups (again, independent of C vs. binary issue) More guidelines to be developed As compute platforms incorporate FPGAs, might these guidelines become mainstream?

55 Frank Vahid, UC Riverside55/57 Direction: New Applications – Image Processing 32x average speedup compared to uP with 10x faster clock Exploits parallelism in image processing Window operations contain much fine-grained parallelism And, each pixel can be determined in parallel Performance is memory-bandwidth limited Warp processing can output a pixel per cycle for each pixel that can be fetched from memory per cycle Faster memory will further improve performance

56 Frank Vahid, UC Riverside56/57 Direction: Applications with Process- Level Parallelism Parallel code provides further speedup Average 79x speedup compared to desktop uP Use FPGA to implement 10s or 100s of processors Can also exploit instruction-level parallelism Warp tools will have to detect coarse-grained parallelism

57 Frank Vahid, UC Riverside57/57 Summary Showed feasibility of warp technology Application kernels can be dynamically mapped to FPGA by reasonable amount of on-chip compute resources Tremendous potential applicability Presently investigating Embedded (w/ Freescale) Desktop (w/ Intel) Server (w/ IBM) Radically-new FPGA apps may be possible Neural networks that rewire themselves? Network routers whose queuing structure changes based on traffic patterns? If the technology exists to synthesize circuits dynamically, what can we do with that technology?

