
1 Warp Processing – Towards FPGA Ubiquity
  Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside
  Associate Director, Center for Embedded Computer Systems, UC Irvine
  Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Freescale
  Contributing students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), Kris Miller (MS 2007), David Sheldon (3rd-yr PhD), Ryan Mannion (2nd-yr PhD), Scott Sirowy (1st-yr PhD)

2 Outline (Frank Vahid, UC Riverside)
  FPGAs
    Why they’re great
    Why they’re not ubiquitous yet
  Hiding FPGAs from programmers
    Warp processing
    Binary decompilation
    Just-in-time FPGA compilation
  Directions

3 FPGAs
  FPGA: Field-Programmable Gate Array
  Implement a circuit by downloading bits
    N-address memory (“LUT”) implements N-input combinational logic
    Register-controlled switch matrix (SM) connects LUTs
  FPGA fabric
    Thousands of LUTs and SMs, increasingly with additional hard-core components such as multipliers and RAM
  CAD tools automatically map the desired circuit onto the FPGA fabric
  [Figure: a 4x2 memory used as a LUT with inputs a, b and outputs F, G; a 2x2 switch matrix; an FPGA fabric of LUTs and SMs, each configured by downloading particular bits]
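The LUT idea on this slide can be sketched in C. This is an illustrative model only: the type `Lut2x2`, the function `lut_eval`, and the `HALF_ADDER` contents are hypothetical names and values chosen here, not the memory contents shown in the slide's figure (which did not survive extraction). It shows a 4-word, 2-bit memory addressed by the inputs, so that changing the stored bits changes the logic function.

```c
#include <assert.h>
#include <stdint.h>

/* A 2-input, 2-output LUT modeled as a 4-word x 2-bit memory.
 * Each entry holds two output bits: bit 1 = F, bit 0 = G.
 * The stored words play the role of the "downloaded bits". */
typedef struct {
    uint8_t mem[4];
} Lut2x2;

/* Evaluate the LUT for inputs a and b (each 0 or 1).
 * The inputs form the memory address: a is the high bit, b the low bit. */
void lut_eval(const Lut2x2 *lut, int a, int b, int *f, int *g)
{
    assert((a == 0 || a == 1) && (b == 0 || b == 1));
    unsigned addr = ((unsigned)a << 1) | (unsigned)b;
    *f = (lut->mem[addr] >> 1) & 1;
    *g = lut->mem[addr] & 1;
}

/* Hypothetical configuration (not from the slide):
 * F = a AND b, G = a XOR b, i.e. a half adder. */
const Lut2x2 HALF_ADDER = { { 0x0, 0x1, 0x1, 0x2 } };
```

Loading a different four-entry table into `mem` would implement a different pair of two-input functions with the same hardware, which is the sense in which the circuit is "programmed" by bits.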

4 FPGAs are "Programmable" like Microprocessors – Just Download Bits
  Microprocessor binaries: bits (001010010 …) loaded into program memory
  FPGA "binaries": bits (01110100...) loaded into LUTs and SMs
    More commonly known as a "bitstream"

5 FPGA – Why (Sometimes) Better than Microprocessor
  C code for bit reversal:
    x = (x >>16) | (x <<16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
  Binary after compilation (excerpt):
    sll $v1[3],$v0[2],0x10
    srl $v0[2],$v0[2],0x10
    or $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x8
    and $v1[3],$v1[3],$t5[13]
    sll $v0[2],$v0[2],0x8
    and $v0[2],$v0[2],$t4[12]
    or $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x4
    and $v1[3],$v1[3],$t3[11]
    sll $v0[2],$v0[2],0x4
    and $v0[2],$v0[2],$t2[10]
    ...
  Processor: requires between 32 and 128 cycles
  FPGA: requires only 1 cycle (speedup of 32x to 128x)
  [Figure: circuit for bit reversal, connecting the original X value to the bit-reversed X value]
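For reference, the slide's five statements can be wrapped into a runnable function. The sketch below assumes `x` is an unsigned 32-bit value (the slide does not give its type). The function names and the bit-at-a-time reference version are additions made here for comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Swap-based reversal from the slide: five shift/mask/or stages that
 * exchange halves, then bytes, nibbles, bit pairs, and single bits. */
uint32_t reverse_bits_swap(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Reference version (not from the slide): moves one bit per iteration. */
uint32_t reverse_bits_naive(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Usage example: both versions must agree on any input. */
void check_reversal_agrees(uint32_t x)
{
    assert(reverse_bits_swap(x) == reverse_bits_naive(x));
}
```

Either way the processor executes a sequence of shift, mask, and or instructions, whereas in a circuit the reversal is a fixed rearrangement of wires.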

6 FPGA: Why (Sometimes) Better than Microprocessor
  C code for FIR filter:
    for (i=0; i < 128; i++)
      y[i] += c[i] * x[i]
    ..
  Processor: 1000’s of instructions, several thousand cycles
  FPGA: ~7 cycles, speedup > 100x
  [Figure: circuit for FIR filter, a row of multipliers feeding a tree of adders]
  In general, FPGA is better due to the circuit's concurrency, from bit level to task level
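A rough software model of the circuit structure in the figure is given below. It rests on assumptions: the slide's loop is elided ("..") and its figure shows multipliers feeding adders, so this sketch treats the kernel as a 128-term multiply-accumulate into a single sum. The function names and the `levels` output are introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 128

/* Sequential form: one multiply-accumulate per loop iteration. */
long fir_sequential(const int c[TAPS], const int x[TAPS])
{
    long sum = 0;
    for (size_t i = 0; i < TAPS; i++)
        sum += (long)c[i] * x[i];
    return sum;
}

/* Circuit-like form: compute all products first (in hardware these
 * multipliers would operate in parallel), then reduce them with a
 * balanced adder tree. *levels receives the depth of that tree. */
long fir_adder_tree(const int c[TAPS], const int x[TAPS], int *levels)
{
    long partial[TAPS];
    assert(levels != NULL);

    for (size_t i = 0; i < TAPS; i++)
        partial[i] = (long)c[i] * x[i];

    *levels = 0;
    for (size_t width = TAPS; width > 1; width /= 2) {
        /* One tree level: width/2 independent additions. */
        for (size_t i = 0; i < width / 2; i++)
            partial[i] = partial[2 * i] + partial[2 * i + 1];
        (*levels)++;
    }
    return partial[0];
}
```

For 128 products the tree has log2(128) = 7 levels, which may be where the slide's "~7 cycles" figure comes from. That reading is an inference, not something the slide states.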

7 Extensive Studies over Past Decade
  Large speedups on many important applications
  See ACM/SIGDA Int. Symp. on FPGAs
  So why aren't FPGAs ubiquitous?

8 Why FPGAs aren’t Mainstream
  Cost: but improving yearly
  Power: but improving yearly, and energy benefits too
  Extra chip: but integration continues
  Programming methodology
  [Figure: 1 million system gate FPGA cost. Source: Xilinx]

9 Why FPGAs aren’t Mainstream
  Cost
  Power
  Extra chip
  Programming methodology
    Though tremendous progress in past decade
  [Figure: tool flow from application to implementation]
    Application (C/C++/Java/SystemC/Handel-C/Streams-C/…)
    Automated hardware/software partitioning
    Microprocessors: C/C++/Java; compilation (1960s, 1970s); assembly code; assembling, linking (1950s, 1960s); microprocessor binary
    FPGA circuits: C/C++/Java/VHDL/Verilog/SystemC/Handel-C/Streams-C...; behavioral synthesis (1990s); register transfers; RT synthesis (1980s, 1990s); logic equations / FSMs; logic synthesis, physical design (1970s, 1980s); FPGA binary
    Downloading; implementation

10 So What’s the Holdup?
  FPGAs require special compilers
  Limits adoption: desktop world dominates
    100 software writers for every CAD user
    Millions of compiler seats worldwide, vs. 15,000 CAD seats
  [Figure: a standard compiler turns an application into a binary for the processor; a special compiler, which includes synthesis, tech. map, and place & route, turns it into a microprocessor binary plus an FPGA binary]

11 Outline
  FPGAs
    Why they’re great
    Why they’re not ubiquitous yet
  Hiding FPGAs from programmers
    Warp processing
    Binary decompilation
    Just-in-time FPGA compilation
  Directions

12 Can we Hide FPGAs from Programmers and Standard Tools?
  Example: radically different x86 architectures are hidden from programmers and tools
    All execute standard x86 binaries
    On-chip tools dynamically translate the binary to the particular architecture
  Idea: hide the FPGA from programmers and tools
    Download a standard binary
    Have on-chip tools dynamically translate (portions of) the binary to the FPGA
    We call this Warp Processing
  [Figure: SW, profiling, and a standard compiler produce a binary (traditional partitioning is done here); a translator then maps the binary onto a RISC architecture, a VLIW architecture, or a processor plus FPGA]

13 Warp Processing Idea
  Step 1: Initially, the software binary is loaded into instruction memory
  [Figure: chip with µP, I Mem, D$, profiler, FPGA, and on-chip CAD]
  Software binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

14 Warp Processing Idea
  Step 2: The microprocessor executes the instructions in the software binary

15 Warp Processing Idea
  Step 3: The profiler monitors instructions and detects critical regions in the binary
  [Figure: the profiler observes the add and beq instructions; critical loop detected]

16 Warp Processing Idea
  Step 4: The on-chip CAD reads in the critical region

17 Warp Processing Idea
  Step 5: The on-chip CAD decompiles the critical region into a control data flow graph (CDFG)
  [Figure: the on-chip CAD block is labeled Dynamic Part. Module (DPM)]
  Decompiled region:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

18 Warp Processing Idea
  Step 6: The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit
  [Figure: the decompiled loop becomes a tree of adders]

19 Warp Processing Idea
  Step 7: The on-chip CAD maps the circuit onto the FPGA
  [Figure: the adder circuit placed onto the FPGA's CLBs and SMs]

20 Warp Processing Idea
  Step 8: The on-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to “warp” by an order of magnitude or more
  Updated binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4
  [Figure: software-only vs. “warped” execution]
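Step 3 of the walkthrough above (the profiler detecting a critical loop) can be illustrated with a small software model. The slides do not describe how the profiler works internally, so this sketch assumes one simple scheme: count taken backward branches per target address and treat the most frequent target as the critical loop. `LoopProfiler`, `profiler_observe`, `profiler_hottest`, and `MAX_LOOPS` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LOOPS 16

typedef struct {
    uint32_t target[MAX_LOOPS]; /* loop-start address (branch target) */
    uint32_t count[MAX_LOOPS];  /* taken backward branches to that target */
    int used;                   /* number of table entries in use */
} LoopProfiler;

/* Record one taken branch at address pc jumping to target.
 * Only backward branches (target <= pc) are counted as loop back-edges. */
void profiler_observe(LoopProfiler *p, uint32_t pc, uint32_t target)
{
    assert(p != NULL);
    if (target > pc)
        return; /* forward branch: ignore */
    for (int i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < MAX_LOOPS) { /* table full: new loops are dropped */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Store the most frequently targeted loop start in *target.
 * Returns 1 on success, 0 if no backward branch has been observed. */
int profiler_hottest(const LoopProfiler *p, uint32_t *target)
{
    int best = -1;
    assert(p != NULL && target != NULL);
    for (int i = 0; i < p->used; i++) {
        if (best < 0 || p->count[i] > p->count[best])
            best = i;
    }
    if (best < 0)
        return 0;
    *target = p->target[best];
    return 1;
}
```

In the example binary, the `Beq` at the bottom of the loop would be the only backward branch taken repeatedly, so the address of `loop:` would surface as the hottest target.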

21 Warp Processing Idea
  Likely multiple microprocessors per chip, serviced by one on-chip CAD block

22 Warp Processing Challenges
  Two key challenges
    Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
    Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
  [Figure: on-chip CAD flow from the binary, with stages labeled profiling & partitioning, decompilation, synthesis, JIT FPGA compilation (producing the FPGA binary, std. HW binary), and binary updater (producing the updated microprocessor binary)]

23 Decompilation
  Synthesis from binary has a potential hurdle
    High-level information (e.g., loops, arrays) lost during compilation
    Direct translation of assembly to circuit: huge overheads
    Need to recover high-level information
  [Figure: overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone]

24 Decompilation
  Solution: recover high-level information from the binary (decompilation)
    Adapted extensive previous work (for different purposes)
    Also developed new decompilation methods
    Ph.D. work of Greg Stitt (Ph.D. UCR 2006)
    Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
  Original C code:
    long f( short a[10] ) {
      long accum = 0;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
  Corresponding assembly:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
  Control/data flow graph creation:
    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
  Data flow analysis:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
  Function recovery:
    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
      loop:
      reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }
  Control structure recovery:
    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }
  Array recovery:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }
  The original C code and the array-recovery result are almost identical representations
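To make the recovery stages concrete, the sketch below puts the source-level function next to the form produced after control structure recovery, where the array is still raw byte-addressed memory. It assumes 16-bit array elements (`int16_t` stands in for the slide's `short`), which is what the shift by 1 in the assembly implies. The function names are chosen here, and the slide's `mem[...]` is modeled as a 2-byte load via `memcpy`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Source-level form, as in the slide's original C code. */
long sum_source(const int16_t a[10])
{
    long accum = 0;
    for (int i = 0; i < 10; i++)
        accum += a[i];
    return accum;
}

/* Form after control structure recovery: reg2 is the base address and
 * (reg3 << 1) scales the index by the 2-byte element size. */
long sum_recovered(const unsigned char *reg2)
{
    long reg4 = 0;
    assert(reg2 != NULL);
    for (long reg3 = 0; reg3 < 10; reg3++) {
        int16_t elem;
        /* Models mem[reg2 + (reg3 << 1)] as a 16-bit load. */
        memcpy(&elem, reg2 + (reg3 << 1), sizeof elem);
        reg4 += elem;
    }
    return reg4;
}
```

Array recovery is the step that recognizes the `base + (index << 1)` pattern as indexing into an array of 2-byte elements, which turns `sum_recovered` back into something shaped like `sum_source`.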

25 Decompilation Results vs. C
  Compared with synthesis from C
  Synthesis after decompilation often quite similar
    Almost identical performance, small area overhead
  FPGA 2005

26 Decompilation Results on Optimized H.264
  In-depth study with Freescale
  Used highly-optimized benchmark
  Results: binary approach competitive
    Speedups compared to ARM9 software: binary 2.48, C 2.53
  Decompilation recovered nearly all high-level information needed for partitioning and synthesis

27 Simple Coding Guidelines Bring Speedups Closer to Ideal
  Interesting discovery during H.264 study: C style limited speedup
    Orthogonal to binary vs. C issue: coding style hurt both
  Developed simple coding guidelines
    Rewritten software: 20 minutes, and only ~3% slower than original
  New speedups: binary 6.55, C 6.56
    Binary still competitive with C
  Following guidelines not required, but helps any approach targeting FPGAs

28 Warp Processing Challenges
  Two key challenges
    Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
    Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
  [Figure: on-chip CAD flow from the binary, with stages labeled profiling & partitioning, decompilation, synthesis, JIT FPGA compilation (producing the FPGA binary, std. HW binary), and binary updater (producing the updated microprocessor binary)]

29 JIT FPGA Compilation
  Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed CAD-oriented FPGA
    e.g., our router (ROCR) is 10x faster and uses 20x less memory than the popular VPR tool, at a cost of 30% longer critical path
    Similar results for synthesis and placement
  Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)
  Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
  DAC’04

30 JIT FPGA Compilation
  Tool                                          Memory    Time
  Xilinx ISE                                    60 MB     9.1 s
  Riverside JIT FPGA tools                      3.6 MB    0.2 s
  Riverside JIT FPGA tools on a 75 MHz ARM7     3.6 MB    1.4 s

31 Overall Warp Processing Results
  Performance speedup (most frequent kernel only)
  Average kernel speedup of 41, vs. 21 for Virtex-E
  Simpler FPGA fabric yields faster HW circuits
  [Chart baseline: SW-only execution]

32 Overall Warp Processing Results
  Performance speedup (overall, multiple kernels)
  Average speedup of 7.4
  Energy reduction of 38% - 94%
  Assuming 100 MHz ARM, fabric in same technology and clocked at rate determined by synthesis
  [Chart baseline: SW-only execution]
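One general reason an overall speedup is smaller than a kernel speedup is Amdahl's law: only the fraction of run time spent in the accelerated kernels gets faster. The sketch below is the standard formula, not a calculation from the slides. The slides do not report kernel run-time fractions, so any fraction passed in is a hypothetical input, and the function name is chosen here.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction kernel_fraction of the
 * original run time is accelerated by a factor of kernel_speedup and
 * the remainder runs at its original speed. */
double overall_speedup(double kernel_fraction, double kernel_speedup)
{
    assert(kernel_fraction >= 0.0 && kernel_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

For instance, with a hypothetical 90% of run time in a kernel sped up 41x, `overall_speedup(0.9, 41.0)` evaluates to 41/5 = 8.2, far below the kernel's own 41x.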

33 FPGA Ubiquity via Obscurity
  FPGA is hidden from languages and tools
    Thus, ANY microprocessor platform is extendible with an FPGA
    So any program can potentially be sped up by FPGAs
    No new languages, no new tools
  Maintains "ecosystem" among application, tool, and architecture developers
  [Figure: applications, tools, and architectures connected through standard binaries]

34 Outline
  FPGAs
    Why they’re great
    Why they’re not ubiquitous yet
  Hiding FPGAs from programmers
    Warp processing
    Binary decompilation
    Just-in-time FPGA compilation
  Directions

35 Directions – What’s Next?
  Immediate future: develop warp processing using benchmarks from other domains
    Desktop, server, scientific
    With partners: IBM, Freescale
    May require new decompilation techniques

36 Directions – What’s Next?
  Application-specific FPGA
    Tune FPGA fabric to application (or domain)
    Parameters: LUTs/CLB, LUT size
    Many more possible, e.g., switch matrix size, # long vs. short channels
  [Figure: delay for each configuration (LUTs/CLB, and LUT sizes 2-7) for one application]
  [Figure: delay & area when tuning parameters for best delay for each app, rather than for all apps]

37 Directions – What’s Next?
  Parallel benchmarks: NAS, SPEComp, Splash, …
  Map each thread to custom FPGA circuit
  Huge potential speedups
  [Figure: threads Thrd1, Thrd2, Thrd3, ..., ThrdN on multiple µPs, each mapped to the FPGA; sample speedups from other works]

38 Directions – What’s Next?
  With JIT FPGA compiler, what else is possible?
    Implications for existing applications? Image processing, neural networks, ...
    Add FPGA hardware to improve performance, like expandable memory?
    Standard binaries for FPGAs? Rather than extracting a circuit from sequential code, distribute the circuit binary itself, and use the JIT FPGA compiler to best map it to FPGA resources
  [Figure: one circuit binary passed through translators onto several different FPGAs]

39 Summary
  FPGA future looks bright
  Hiding FPGA via warp processing is feasible
    Decompilation can recover high-level constructs to yield speedups competitive with source-level
    JIT FPGA compilation can be made sufficiently lean
  Many possible directions exist that may use FPGAs to gain ultra-high performance without ultra-high engineering or hardware costs
  Publications can be found at: http://www.cs.ucr.edu/~vahid/pubs

