The New Software: Invisible Ubiquitous FPGAs that Enable Next-Generation Embedded Systems Frank Vahid Professor Department of Computer Science and Engineering.


1 The New Software: Invisible Ubiquitous FPGAs that Enable Next-Generation Embedded Systems
Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine
Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Freescale
Contributing students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), David Sheldon (3rd-year PhD), Ryan Mannion (2nd-year PhD), Scott Sirowy (1st-year PhD)

2 Outline (Frank Vahid, UC Riverside, slide 2 of 34)
FPGAs – The New Software
  Why they're great
  Why they're not ubiquitous yet
Hiding FPGAs from programmers
  Warp processing
  Binary decompilation
  Just-in-time FPGA compilation
Towards Standard Binaries for FPGAs

3 FPGAs
FPGA: Field-Programmable Gate Array.
Implement a circuit by downloading particular bits.
A memory with N address inputs (a lookup table, or "LUT") implements N-input combinational logic.
A register-controlled switch matrix (SM) connects LUTs.
FPGA fabric: thousands of LUTs and SMs, increasingly with additional hard-core components such as multipliers and RAM.
CAD tools automatically map the desired circuit onto the FPGA fabric.
[Figure: a 4x2 memory with address inputs a1, a0 and data outputs d1, d0, used as a LUT that implements two functions F and G of inputs a and b; a 2x2 switch matrix with inputs x and y; an FPGA fabric of alternating SMs and LUTs configured by the downloaded bits.]
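The LUT idea on this slide can be sketched in C as a small memory read. This is only an illustration: the type `Lut2x2`, the helper names, and the particular contents (F = a AND b, G = a XOR b) are assumptions chosen for the example, not the bit pattern shown in the slide's figure.

```c
#include <assert.h>
#include <stdint.h>

/* A 2-input, 2-output LUT modeled as a 4x2 memory: each of the 4 words
   holds 2 output bits (bit 1 = F, bit 0 = G). The stored words play the
   role of the "downloaded bits" that decide which functions F and G are. */
typedef struct {
    uint8_t word[4];
} Lut2x2;

/* Reading the LUT is a memory read: inputs a and b form the address. */
static uint8_t lut_read(const Lut2x2 *lut, unsigned a, unsigned b)
{
    assert(a <= 1 && b <= 1);
    return lut->word[(a << 1) | b] & 0x3u;
}

/* Hypothetical configuration: F = a AND b, G = a XOR b. */
static Lut2x2 lut_and_xor(void)
{
    /* address:        00   01   10   11 */
    Lut2x2 lut = { { 0x0, 0x1, 0x1, 0x2 } };
    return lut;
}
```

Changing the four stored words changes the logic function without changing the hardware, which is the sense in which the slide says a circuit is implemented by downloading bits.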

4 FPGAs are "Programmable" like Microprocessors – Just Download Bits
Microprocessor binaries ("software"): bits (001010010 …) loaded into program memory.
FPGA "binaries" ("hardware"): bits (01110100...) loaded into LUTs and SMs; more commonly known as a "bitstream".

5 FPGA – Why (Sometimes) Better than Microprocessor
C code for bit reversal:
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Binary after compilation (excerpt):
    sll $v1[3],$v0[2],0x10
    srl $v0[2],$v0[2],0x10
    or  $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x8
    and $v1[3],$v1[3],$t5[13]
    sll $v0[2],$v0[2],0x8
    and $v0[2],$v0[2],$t4[12]
    or  $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x4
    and $v1[3],$v1[3],$t3[11]
    sll $v0[2],$v0[2],0x4
    and $v0[2],$v0[2],$t2[10]
    ...
Processor: requires between 32 and 128 cycles.
FPGA: the circuit for bit reversal simply wires each bit of the original X value to the mirrored position of the bit-reversed X value, so it requires only 1 cycle (speedup of 32x to 128x).
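For readers who want to run the bit-reversal code, here is a minimal self-contained sketch. The function names and the bit-at-a-time reference version are additions for checking the result; they are not part of the original slide.

```c
#include <assert.h>
#include <stdint.h>

/* Swap-based reversal from the slide: five shift/mask/or stages that swap
   halves, then bytes, nibbles, bit pairs, and finally single bits. */
static uint32_t reverse_bits32(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Reference version: moves one bit per iteration, 32 iterations in total. */
static uint32_t reverse_bits32_naive(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Cross-check the two versions on one input. */
static void check_reverse(uint32_t x)
{
    assert(reverse_bits32(x) == reverse_bits32_naive(x));
}
```

Both versions spend many instructions on the processor, whereas in hardware the same permutation is fixed wiring between bit positions.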

6 FPGA: Why (Sometimes) Better than Microprocessor
C code for FIR filter:
    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i]
    ..
Processor: 1000's of instructions, several thousand cycles.
FPGA: the circuit for the FIR filter is a row of multipliers feeding a tree of adders; ~7 cycles, speedup > 100x.
In general, an FPGA is better due to the circuit's concurrency, from bit level to task level.
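The multiplier row and adder tree in the figure can be mimicked in software to show where the concurrency comes from. The sketch below assumes the sum-of-products form that the adder tree suggests (the slide's code fragment is abbreviated); the function names and the `levels` output are hypothetical. Note that a balanced tree over 128 products has log2(128) = 7 levels, which is one plausible reading of the "~7 cycles" figure, though the slide does not state this.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS 128

/* Sequential form: one multiply-accumulate per loop iteration,
   as a processor would execute it. */
static int64_t dot_sequential(const int32_t c[TAPS], const int32_t x[TAPS])
{
    int64_t sum = 0;
    for (int i = 0; i < TAPS; i++)
        sum += (int64_t)c[i] * x[i];
    return sum;
}

/* Circuit-shaped form: all products are independent, then pairs are
   summed level by level. In hardware each inner loop would be a bank of
   parallel units; here the loops only model that structure. */
static int64_t dot_tree(const int32_t c[TAPS], const int32_t x[TAPS],
                        int *levels)
{
    int64_t p[TAPS];
    for (int i = 0; i < TAPS; i++)          /* 128 independent multiplies */
        p[i] = (int64_t)c[i] * x[i];

    int depth = 0;
    for (int n = TAPS; n > 1; n /= 2) {     /* one adder level per pass */
        for (int i = 0; i < n / 2; i++)
            p[i] = p[2 * i] + p[2 * i + 1];
        depth++;
    }
    assert(depth == 7);                     /* log2(128) adder levels */
    if (levels)
        *levels = depth;
    return p[0];
}
```

The sequential version needs 128 dependent accumulate steps, while the tree version has a critical path of one multiply plus seven additions.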

7 Extensive Studies over Past Decade
Large speedups on many important applications. See the ACM/SIGDA Int. Symp. on FPGAs.
So why aren't FPGAs ubiquitous?

8 Why FPGAs aren't Ubiquitous
Cost – but improving yearly
Power – but improving yearly, and energy benefits too
Extra chip – but integration continues
Programming methodology
[Figure: cost of a 1-million-system-gate FPGA. Source: Xilinx]

9 Why FPGAs aren't Mainstream
Cost, power, extra chip, programming methodology, though there has been tremendous progress in the past decade.
[Figure: design flow from application to implementation.
Application (C/C++/Java/SystemC/Handel-C/Streams-C/…), followed by automated hardware/software partitioning.
Microprocessor side (C/C++/Java): compilation (1960s, 1970s) to assembly code; assembling, linking (1950s, 1960s) to a microprocessor binary; downloading to the implementation.
FPGA circuit side (C/C++/Java/VHDL/Verilog/SystemC/Handel-C/Streams-C...): behavioral synthesis (1990s) to register transfers; RT synthesis (1980s, 1990s) to logic equations / FSMs; logic synthesis, physical design (1970s, 1980s) to an FPGA binary; downloading to the implementation.]

10 So What's the Holdup?
FPGAs require special compilers (including synthesis, tech. map, place & route).
This limits adoption, because the desktop world dominates:
  100 software writers for every CAD user
  Millions of compiler seats worldwide, vs. 15,000 CAD seats
Can't ignore the "ecosystem" that comes from the separation of applications, tools, and architectures, connected by standard binaries.
  Just consider the history of popular processors.
[Figure: an application goes through a standard compiler to a microprocessor binary for the processor, but needs a special compiler to produce an FPGA binary for the FPGA.]

11 Outline
FPGAs – The New Software
  Why they're great
  Why they're not ubiquitous yet
Hiding FPGAs from programmers
  Warp processing
  Binary decompilation
  Just-in-time FPGA compilation
Towards Standard Binaries for FPGAs

12 Can we Hide FPGAs from Programmers and Standard Tools?
Example: radically different x86 architectures are hidden from programmers and tools.
  All execute standard x86 binaries.
  On-chip tools dynamically translate the binary to the particular architecture.
Idea: hide the FPGA from programmers and tools.
  Download a standard binary.
  Have on-chip tools dynamically translate (portions of) the binary to the FPGA.
We call this Warp Processing.
[Figure: software goes through profiling and a standard compiler to a binary (annotated "traditional partitioning done here"); translators then map the binary to a RISC architecture, a VLIW architecture, or a processor plus FPGA.]

13 Warp Processing Idea
[Figure, repeated on slides 13 to 20: a microprocessor (µP) with instruction memory (I Mem) and data cache (D$), a profiler, an FPGA, and on-chip CAD, labeled Dynamic Partitioning Module (DPM) from slide 17 on.]
Step 1: Initially, the software binary is loaded into instruction memory.
Software binary:
    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

14 Warp Processing Idea
Step 2: The microprocessor executes the instructions in the software binary.

15 Warp Processing Idea
Step 3: The profiler monitors the instructions (here the add and beq) and detects critical regions in the binary. Critical loop detected.

16 Warp Processing Idea
Step 4: On-chip CAD reads in the critical region.

17 Warp Processing Idea
Step 5: On-chip CAD decompiles the critical region into a control/data flow graph (CDFG):
    reg3 := 0
    reg4 := 0
  loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

18 Warp Processing Idea
Step 6: On-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit, shown as a tree of adders.

19 Warp Processing Idea
Step 7: On-chip CAD maps the circuit onto the FPGA's CLBs and SMs.

20 Warp Processing Idea
Step 8: On-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more.
Updated binary:
    Mov reg3, 0
    Mov reg4, 0
  loop:
    // instructions that interact with FPGA
    Ret reg4
Execution changes from software-only to "warped".
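Step 3 of the sequence above (the profiler detecting a critical loop) can be illustrated with a small sketch. The slides do not describe the profiler's internals, so everything here is an assumption: the sketch counts taken backward branches per target address in a small fixed-size table and reports the most frequent target as the hot loop. The type `LoopProfiler`, the table size, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS 16  /* hypothetical table size */

/* One counter per distinct backward-branch target seen so far. */
typedef struct {
    uint32_t target[SLOTS];
    uint32_t count[SLOTS];
    int used;
} LoopProfiler;

/* Called for every taken branch. A branch whose target lies before the
   branch itself is treated as a loop back-edge and counted. */
static void profiler_on_branch(LoopProfiler *p, uint32_t pc, uint32_t target)
{
    assert(p != 0);
    if (target >= pc)
        return;                          /* forward branch: not a loop */
    for (int i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < SLOTS) {               /* new loop; ignored if table full */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Writes the start address of the most frequently iterated loop to *target.
   Returns 1 on success, 0 if no backward branch has been seen. */
static int profiler_hottest(const LoopProfiler *p, uint32_t *target)
{
    int best = -1;
    for (int i = 0; i < p->used; i++)
        if (best < 0 || p->count[i] > p->count[best])
            best = i;
    if (best < 0)
        return 0;
    *target = p->target[best];
    return 1;
}
```

In the example binary, the `Beq reg3, 10, -5` instruction is the backward branch that such a profiler would see repeatedly, and the loop starting at `loop:` would be the region handed to the on-chip CAD.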

21 Warp Processing Challenges
Two key challenges:
  Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
  Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
[Figure: µP with I$ and D$, FPGA, profiler, and on-chip CAD. On-chip CAD flow stages: profiling & partitioning, decompilation, synthesis, JIT FPGA compilation, binary updater. Artifacts shown: binary, std. HW binary, FPGA binary, micropr. binary.]

22 Decompilation
If we don't decompile:
  High-level information (e.g., loops, arrays) is lost during compilation.
  Direct translation of assembly to a circuit incurs big overhead.
  We need to recover the high-level information.
[Figure: overhead of the microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone]

23 Decompilation
Solution: recover high-level information from the binary (decompilation).
  Adapted extensive previous work (done for different purposes).
  Also developed new decompilation methods.
  Ph.D. work of Greg Stitt (Ph.D. UCR 2006).
  Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
Original C code:
    long f( short a[10] ) {
      long accum = 0;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
Corresponding assembly:
    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
Control/data flow graph creation:
    reg3 := 0
    reg4 := 0
  loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
Data flow analysis:
    reg3 := 0
    reg4 := 0
  loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
Function recovery:
    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
    loop:
      reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }
Control structure recovery:
    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }
Array recovery:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }
The original C code and the array-recovered code are almost identical representations.
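To see why array recovery is valid in this example, the sketch below runs the original function next to the form produced by control structure recovery, where memory is still a flat byte array. It is only an illustration under stated assumptions: `short` is taken to be 2 bytes (which is what the shift by 1 implies), the names `f_source` and `f_recovered` are hypothetical, and passing the byte view of the array with `reg2 = 0` stands in for the base address held in `reg2`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Original source-level function (accumulator initialized to 0). */
static long f_source(const short a[10])
{
    long accum = 0;
    for (int i = 0; i < 10; i++)
        accum += a[i];
    return accum;
}

/* Form after control structure recovery: a loop over byte-addressed
   memory, loading a 16-bit value at offset reg2 + (reg3 << 1). */
static long f_recovered(const unsigned char *mem, long reg2)
{
    assert(sizeof(short) == 2);          /* assumption behind the << 1 */
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        short v;
        memcpy(&v, mem + reg2 + (reg3 << 1), sizeof v);  /* 16-bit load */
        reg4 += v;
    }
    return reg4;
}
```

Because the address `reg2 + (reg3 << 1)` steps through consecutive 2-byte elements, the load is equivalent to `array[reg3]` on an array of `short`, which is the rewrite that the array recovery step performs.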

24 Decompilation Results vs. C
Compared with synthesis from C, synthesis after decompilation is often quite similar: almost identical performance, small area overhead.
FPGA 2005

25 Decompilation Results on Optimized H.264
In-depth study with Freescale, using a highly optimized benchmark.
Results: the binary approach is competitive.
  Speedups compared to ARM9 software: binary 2.48, C 2.53.
  Decompilation recovered nearly all high-level information needed for partitioning and synthesis.

26 Tangent: Simple Coding Guidelines Bring Speedups Closer to Ideal
Interesting discovery during the H.264 study: C coding style limited the speedup.
  Orthogonal to the binary vs. C issue; coding style hurt both.
Developed simple coding guidelines.
  Rewritten software: 20 minutes of effort, and only ~3% slower than the original.
  New speedups: binary 6.55, C 6.56.
  Binary still competitive with C.
Following the guidelines is not required, but helps any approach targeting FPGAs.

27 Warp Processing Challenges
Two key challenges:
  Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
  Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?

28 JIT FPGA Compilation
Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed a CAD-oriented FPGA.
  E.g., our router (ROCR) is 10x faster and uses 20x less memory than the popular VPR tool, at the cost of a 30% longer critical path. Similar results for synthesis and placement.
Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona).
Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
DAC'04
Tool comparison (memory, run time):
  Xilinx ISE                                   60 MB    9.1 s
  Riverside JIT FPGA tools                     3.6 MB   0.2 s
  Riverside JIT FPGA tools on a 75 MHz ARM7    3.6 MB   1.4 s

29 Overall Warp Processing Results
Performance speedup over software-only execution (most frequent kernel only): average kernel speedup of 41, vs. 21 for Virtex-E.
The simpler FPGA fabric yields faster HW circuits.
Currently prototyping our simpler FPGA fabric with Intel, scheduled for a Q3 shuttle.
Overall application speedup average is 7.4.

30 Outline
FPGAs – The New Software
  Why they're great
  Why they're not ubiquitous yet
Hiding FPGAs from programmers
  Warp processing
  Binary decompilation
  Just-in-time FPGA compilation
Towards Standard Binaries for FPGAs

31 FPGA Ubiquity via Obscurity
Warp processing hides the FPGA from languages and tools.
  ANY microprocessor platform is extendible with an FPGA.
  Maintains the "ecosystem" of application, tool, and architecture developers, connected by standard binaries.
New processor platforms with FPGAs are appearing and evolving.
[Figure: software goes through profiling and a standard compiler to a standard binary; a translator maps it onto the processor plus FPGA.]

32 FPGA Standard Binaries?
A microprocessor binary represents one form of a "standard binary for FPGAs".
  Missing is explicit concurrency: parallelism, pipelining, queues, etc.
As FPGAs appear in more platforms, might a more general FPGA binary evolve?
[Figure: next to the standard compiler and standard binary, a possible standard FPGA compiler taking a language such as SystemC (?) to a standard FPGA binary (?), which the translator maps onto the processor plus FPGA.]
An ecosystem for FPGAs is presently sorely missing.

33 FPGA Standard Binaries?
The translator makes best use of the existing FPGA resources.
Can even add FPGA, like adding memory, to improve performance.
  Add more FPGA to your PDA to implement a compute-intensive application?
[Figure: the same FPGA binary passes through the translator on a low-end PDA (100 sec) and on a high-end PDA with more FPGA (1 sec).]

34 Summary
FPGAs may be the new software.
Hiding the FPGA via warp processing is feasible:
  Decompilation can recover high-level constructs to yield speedups competitive with source-level approaches.
  JIT FPGA compilation can be made sufficiently lean.
Future: standard binaries for FPGAs? Extensive work to be done.
Publications can be found at: http://www.cs.ucr.edu/~vahid/pubs

