Warp Processing – Towards FPGA Ubiquity Frank Vahid Professor Department of Computer Science and Engineering University of California, Riverside Associate.

Presentation transcript:

Warp Processing – Towards FPGA Ubiquity
Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside; Associate Director, Center for Embedded Computer Systems, UC Irvine
Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Freescale.
Contributing students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), Kris Miller (MS 2007), David Sheldon (3rd-year PhD), Ryan Mannion (2nd-year PhD), Scott Sirowy (1st-year PhD)

Frank Vahid, UC Riverside. Slide 2/39: Outline
- FPGAs: why they’re great; why they’re not ubiquitous yet
- Hiding FPGAs from programmers
- Warp processing: binary decompilation; just-in-time FPGA compilation
- Directions

Slide 3/39: FPGAs
- FPGA: Field-Programmable Gate Array. A circuit is implemented by downloading particular bits.
- A memory with N address inputs (a lookup table, "LUT") implements N-input combinational logic.
- A register-controlled switch matrix (SM) connects LUTs.
- FPGA fabric: thousands of LUTs and SMs, increasingly with additional hard-core components such as multipliers and RAM.
- CAD tools automatically map the desired circuit onto the FPGA fabric.
[Figure: a 4x2 memory with address inputs a1, a0 (driven by a, b) and data outputs d1, d0 (functions F, G) used as a LUT; a 2x2 switch matrix with signals x, y, a, b; an FPGA fabric of interleaved LUTs and SMs.]
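To make the "memory implements logic" idea concrete, here is a minimal C sketch of the 4x2 memory used as a LUT. The choice of functions (F = a AND b, G = a XOR b, i.e. a half adder) and all names (Lut4x2, lut_program_half_adder, lut_read) are illustrative assumptions, not taken from the slides.

```c
#include <stdint.h>

/* A 4x2 memory: 4 words (selected by address bits a1 a0), 2 data bits each
 * (d1 d0). Names and the half-adder contents are hypothetical examples. */
typedef struct {
    uint8_t word[4];
} Lut4x2;

/* "Program" the LUT by writing the truth table of two functions into it:
 * F = a AND b goes to d1, G = a XOR b goes to d0. */
Lut4x2 lut_program_half_adder(void) {
    Lut4x2 lut;
    for (unsigned addr = 0; addr < 4; addr++) {
        unsigned a = (addr >> 1) & 1u;  /* address bit a1 */
        unsigned b = addr & 1u;         /* address bit a0 */
        unsigned f = a & b;
        unsigned g = a ^ b;
        lut.word[addr] = (uint8_t)((f << 1) | g);
    }
    return lut;
}

/* Evaluating the logic is just a memory read: the inputs form the address,
 * and the stored word holds the outputs (bit 1 = F, bit 0 = G). */
unsigned lut_read(const Lut4x2 *lut, unsigned a, unsigned b) {
    unsigned addr = ((a & 1u) << 1) | (b & 1u);
    return lut->word[addr];
}
```

Loading different words into the same memory yields a different pair of functions, which is the sense in which the circuit is implemented by downloading bits.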

Slide 4/39: FPGAs are "Programmable" like Microprocessors – Just Download Bits
- Microprocessor binaries: bits loaded into program memory.
- FPGA "binaries" (more commonly known as "bitstreams"): bits loaded into LUTs and SMs.

Slide 5/39: FPGA – Why (Sometimes) Better than Microprocessor
C code for bit reversal:
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Binary after compilation (excerpt):
    sll $v1[3],$v0[2],0x10
    srl $v0[2],$v0[2],0x10
    or  $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x8
    and $v1[3],$v1[3],$t5[13]
    sll $v0[2],$v0[2],0x8
    and $v0[2],$v0[2],$t4[12]
    or  $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x4
    and $v1[3],$v1[3],$t3[11]
    sll $v0[2],$v0[2],0x4
    and $v0[2],$v0[2],$t2[10]
    ...
Processor: requires between 32 and 128 cycles.
FPGA: the circuit for bit reversal simply wires each bit of the original X value to its mirrored position in the bit-reversed X value, and requires only 1 cycle (speedup of 32x to 128x).
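For readers who want to try the bit-reversal code from this slide, here is a small self-contained C version. The function name reverse_bits32 and the use of uint32_t are my additions. The two masks that the transcript dropped are taken to be 0x33333333 and 0x55555555, the complements of the 0xcccccccc and 0xaaaaaaaa masks on the same lines.

```c
#include <stdint.h>

/* Reverse the 32 bits of x by swapping progressively smaller groups:
 * 16-bit halves, then bytes, nibbles, bit pairs, and single bits. */
uint32_t reverse_bits32(uint32_t x) {
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}
```

For example, reverse_bits32(0x00000001) gives 0x80000000, and applying the function twice returns the original value. In software each step costs several shift, mask, and OR instructions; in a circuit the same mapping needs only wires.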

Slide 6/39: FPGA: Why (Sometimes) Better than Microprocessor
C code for FIR filter:
    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i]
    ...
Processor: 1000’s of instructions, several thousand cycles.
FPGA: the circuit for the FIR filter takes ~7 cycles. Speedup > 100x.
In general, the FPGA is better due to the circuit's concurrency, from bit level to task level.
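One way to see where a figure like "~7 cycles" could come from is sketched below. This is my reading rather than something the slide states: I assume the circuit computes all 128 products in parallel and then sums them with a pairwise adder tree, which has log2(128) = 7 levels. The names TAPS, dot_sequential, and dot_adder_tree are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define TAPS 128

/* Software shape: one multiply-accumulate per loop iteration. */
int64_t dot_sequential(const int32_t c[TAPS], const int32_t x[TAPS]) {
    int64_t sum = 0;
    for (size_t i = 0; i < TAPS; i++) {
        sum += (int64_t)c[i] * x[i];
    }
    return sum;
}

/* Circuit shape (assumed): every product is independent, so hardware can
 * compute them at once; the sums are then combined pairwise, level by level.
 * *levels receives the number of adder-tree levels (7 for 128 inputs). */
int64_t dot_adder_tree(const int32_t c[TAPS], const int32_t x[TAPS], int *levels) {
    int64_t stage[TAPS];
    for (size_t i = 0; i < TAPS; i++) {
        stage[i] = (int64_t)c[i] * x[i];  /* 128 multipliers in parallel */
    }
    int depth = 0;
    for (size_t width = TAPS; width > 1; width /= 2) {
        for (size_t i = 0; i < width / 2; i++) {
            stage[i] = stage[2 * i] + stage[2 * i + 1];  /* one adder each */
        }
        depth++;
    }
    if (levels != NULL) {
        *levels = depth;
    }
    return stage[0];
}
```

Both functions return the same sum; the second only reorders the additions to expose the parallelism that a circuit can exploit and a sequential processor cannot.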

Slide 7/39: Extensive Studies over Past Decade
- Large speedups on many important applications. See ACM/SIGDA Int. Symp. on FPGAs.
- So why aren't FPGAs ubiquitous?

Slide 8/39: Why FPGAs aren’t Mainstream
- Cost: but improving yearly
- Power: but improving yearly, and energy benefits too
- Extra chip: but integration continues
- Programming methodology
[Figure: cost of a 1-million-system-gate FPGA. Source: Xilinx]

Slide 9/39: Why FPGAs aren’t Mainstream
- Cost
- Power
- Extra chip
- Programming methodology, though tremendous progress in past decade
[Figure: design flow from application (C/C++/Java/SystemC/Handel-C/Streams-C/…) through automated hardware/software partitioning down to implementation. Microprocessor side (C/C++/Java): compilation (1960s, 1970s), assembly code, assembling and linking (1950s, 1960s), microprocessor binary, downloading. FPGA side (C/C++/Java/VHDL/Verilog/SystemC/Handel-C/Streams-C...): behavioral synthesis (1990s), register transfers, RT synthesis (1980s, 1990s), logic equations / FSMs, logic synthesis and physical design (1970s, 1980s), FPGA binary, downloading.]

Slide 10/39: So What’s the Holdup?
- FPGAs require special compilers.
- This limits adoption; the desktop world dominates.
- 100 software writers for every CAD user
- Millions of compiler seats worldwide, vs. 15,000 CAD seats
[Figure: an application passes through a standard compiler to a microprocessor binary for the processor; with an FPGA, a special compiler (includes synthesis, tech. map, place & route) produces both a microprocessor binary and an FPGA binary.]

Slide 11/39: Outline
- FPGAs: why they’re great; why they’re not ubiquitous yet
- Hiding FPGAs from programmers
- Warp processing: binary decompilation; just-in-time FPGA compilation
- Directions

Slide 12/39: Can we Hide FPGAs from Programmers and Standard Tools?
- Example: radically different x86 architectures are hidden from programmers and tools. All execute standard x86 binaries; on-chip tools dynamically translate the binary to the particular architecture (e.g., RISC or VLIW).
- Idea: hide the FPGA from programmers and tools. Download a standard binary, and have on-chip tools dynamically translate (portions of) the binary to the FPGA.
- We call this warp processing.
[Figure: SW, profiling, standard compiler, binary (traditional partitioning is done here); the binary feeds a translator in front of a RISC architecture, a VLIW architecture, or a processor plus FPGA.]

Slide 13/39: Warp Processing Idea (step 1)
Initially, the software binary is loaded into instruction memory.
Software binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
[Figure: microprocessor (µP) with instruction memory (I Mem) and data cache (D$), profiler, on-chip CAD, and FPGA.]

Slide 14/39: Warp Processing Idea (step 2)
The microprocessor executes the instructions in the software binary.

Slide 15/39: Warp Processing Idea (step 3)
The profiler monitors instructions and detects critical regions in the binary.
[Figure: the profiler observes the add and beq instructions; "Critical Loop Detected".]
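The slide does not say how the profiler detects a critical loop. A common heuristic, assumed here purely for illustration, is to count taken backward branches per target address and treat the most frequent target as the head of the critical loop. All names (Profiler, profiler_on_taken_branch, profiler_hottest_loop) and the table size are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16  /* assumed small, fixed-size counter table */

typedef struct {
    uint32_t target;  /* branch target address (candidate loop head) */
    uint32_t count;   /* how many times a backward branch jumped there */
    int used;
} LoopCounter;

typedef struct {
    LoopCounter slot[PROFILE_SLOTS];
} Profiler;

void profiler_init(Profiler *p) {
    for (size_t i = 0; i < PROFILE_SLOTS; i++) {
        p->slot[i].target = 0;
        p->slot[i].count = 0;
        p->slot[i].used = 0;
    }
}

/* Call once per taken branch. Only backward branches (target <= pc) are
 * counted, since they usually close a loop. If the table is full, new
 * targets are simply ignored in this sketch. */
void profiler_on_taken_branch(Profiler *p, uint32_t pc, uint32_t target) {
    if (target > pc) {
        return;  /* forward branch: not a loop back-edge */
    }
    size_t free_slot = PROFILE_SLOTS;
    for (size_t i = 0; i < PROFILE_SLOTS; i++) {
        if (p->slot[i].used) {
            if (p->slot[i].target == target) {
                p->slot[i].count++;
                return;
            }
        } else if (free_slot == PROFILE_SLOTS) {
            free_slot = i;
        }
    }
    if (free_slot < PROFILE_SLOTS) {
        p->slot[free_slot].target = target;
        p->slot[free_slot].count = 1;
        p->slot[free_slot].used = 1;
    }
}

/* Returns 1 and reports the most frequent loop head, or 0 if none was seen. */
int profiler_hottest_loop(const Profiler *p, uint32_t *target, uint32_t *count) {
    int found = 0;
    uint32_t best = 0;
    for (size_t i = 0; i < PROFILE_SLOTS; i++) {
        if (p->slot[i].used && p->slot[i].count > best) {
            best = p->slot[i].count;
            *target = p->slot[i].target;
            found = 1;
        }
    }
    if (found) {
        *count = best;
    }
    return found;
}
```

If the instructions of the example binary are numbered from 0, the Beq sits at index 7 and its offset of -5 points back to index 2 (the Shl at the top of the loop), so feeding the profiler that branch each time it is taken would make target 2 the hottest entry.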

Slide 16/39: Warp Processing Idea (step 4)
The on-chip CAD reads in the critical region.

Slide 17/39: Warp Processing Idea (step 5)
The on-chip CAD, labeled Dynamic Part. Module (DPM) in the figure, decompiles the critical region into a control data flow graph (CDFG):
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Slide 18/39: Warp Processing Idea (step 6)
The on-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit.

Slide 19/39: Warp Processing Idea (step 7)
The on-chip CAD maps the circuit onto the FPGA.
[Figure: the adder circuit placed onto the FPGA's CLBs and SMs.]

Slide 20/39: Warp Processing Idea (step 8)
The on-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to “warp” by an order of magnitude or more.
Updated binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4
[Figure: software-only vs. “warped” execution.]

Slide 21/39: Warp Processing Idea
Likely multiple microprocessors per chip, serviced by one on-chip CAD block.
[Figure: several µPs sharing one profiler, on-chip CAD block, and FPGA.]

Slide 22/39: Warp Processing Challenges
Two key challenges:
- Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
- Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
[Figure: µP with I$ and D$, profiler, on-chip CAD, and FPGA. On-chip CAD flow with stages: decompilation, profiling & partitioning, synthesis, JIT FPGA compilation, and binary updater; data: binary, std. HW binary, FPGA binary, updated microprocessor binary.]

Slide 23/39: Decompilation
- Synthesis from a binary has a potential hurdle: high-level information (e.g., loops, arrays) is lost during compilation.
- Direct translation of assembly to a circuit incurs huge overheads.
- Need to recover high-level information.
[Figure: overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone.]

Slide 24/39: Decompilation
- Solution: recover high-level information from the binary (decompilation).
- Adapted extensive previous work (for different purposes); also developed new decompilation methods.
- Ph.D. work of Greg Stitt (Ph.D. UCR 2006). Numerous publications.
Original C code:
    long f( short a[10] ) {
        long accum = 0;
        for (int i=0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }
Corresponding assembly:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
Control/data flow graph creation:
    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
Data flow analysis:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
Function recovery:
    long f( long reg2 ) {
        int reg3 = 0;
        int reg4 = 0;
        loop:
        reg4 = reg4 + mem[reg2 + (reg3 << 1)];
        reg3 = reg3 + 1;
        if (reg3 < 10) goto loop;
        return reg4;
    }
Control structure recovery:
    long f( long reg2 ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += mem[reg2 + (reg3 << 1)];
        }
        return reg4;
    }
Array recovery:
    long f( short array[10] ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += array[reg3];
        }
        return reg4;
    }
The original C code and the recovered code are almost identical representations.
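The recovery stages on this slide can be checked against each other with a small C sketch. It models mem[...] as a byte array holding 16-bit elements (hence the shift by 1 in the address), which is an assumption about the target, and the names load16, f_dataflow, and f_recovered are mine.

```c
#include <stdint.h>
#include <string.h>

/* Model of a 16-bit load from byte-addressed memory; stands in for mem[...]. */
static int16_t load16(const uint8_t *mem, uint32_t addr) {
    int16_t value;
    memcpy(&value, mem + addr, sizeof value);
    return value;
}

/* Shape after data flow analysis: registers, raw addresses, and a goto.
 * reg2 is the base address of the array; reg3 << 1 scales the index by 2. */
long f_dataflow(const uint8_t *mem, uint32_t reg2) {
    long reg3 = 0;
    long reg4 = 0;
loop:
    reg4 = reg4 + load16(mem, reg2 + ((uint32_t)reg3 << 1));
    reg3 = reg3 + 1;
    if (reg3 < 10) goto loop;
    return reg4;
}

/* Shape after control structure and array recovery: a counted for loop
 * over a 10-element array of 16-bit values. */
long f_recovered(const int16_t array[10]) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
    }
    return reg4;
}
```

Placing the same ten values at some base address in mem and passing that address as reg2 makes both functions return the same sum. The recovered form is what a synthesis tool can exploit: the loop bound and the array access pattern are explicit again.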

Slide 25/39: Decompilation Results vs. C
- Compared with synthesis from C, synthesis after decompilation is often quite similar.
- Almost identical performance, small area overhead.
- Reference: FPGA 2005

Slide 26/39: Decompilation Results on Optimized H.264
- In-depth study with Freescale; used a highly-optimized benchmark.
- Results: binary approach competitive. Speedups compared to ARM9 software: binary 2.48, C 2.53.
- Decompilation recovered nearly all high-level information needed for partitioning and synthesis.

Slide 27/39: Simple Coding Guidelines Bring Speedups Closer to Ideal
- Interesting discovery during the H.264 study: C style limited speedup.
- Orthogonal to the binary vs. C issue; coding style hurt both.
- Developed simple coding guidelines.
- Rewritten software: 20 minutes, and only ~3% slower than original.
- New speedups: binary 6.55, C 6.56. Binary still competitive with C.
- Following the guidelines is not required, but helps any approach targeting FPGAs.

Slide 28/39: Warp Processing Challenges
Two key challenges:
- Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
- Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?

Slide 29/39: JIT FPGA Compilation
- Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed a CAD-oriented FPGA.
- E.g., our router (ROCR) is 10x faster and uses 20x less memory than the popular VPR tool, at the cost of a 30% longer critical path. Similar results for synthesis and placement.
- Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona). Numerous publications, e.g., DAC’04.

Slide 30/39: JIT FPGA Compilation
Memory use and run time of the CAD tools:
    Xilinx ISE: 60 MB, 9.1 s
    Riverside JIT FPGA tools: 3.6 MB, 0.2 s
    Riverside JIT FPGA tools on a 75 MHz ARM7: 3.6 MB, 1.4 s

Slide 31/39: Overall Warp Processing Results
Performance speedup (most frequent kernel only), relative to SW-only execution:
- Average kernel speedup of 41, vs. 21 for Virtex-E.
- Simpler FPGA fabric yields faster HW circuits.

Slide 32/39: Overall Warp Processing Results
Performance speedup (overall, multiple kernels), relative to SW-only execution:
- Average speedup of 7.4
- Energy reduction of 38% - 94%
- Assuming a 100 MHz ARM, with the fabric in the same technology and clocked at the rate determined by synthesis.
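The overall speedup is much smaller than the kernel speedup on the previous slide. The slides give no formula for this, but the usual explanation is Amdahl's law: only the fraction of run time spent in the accelerated kernels benefits. The sketch below assumes that model; the function name overall_speedup and the 90% fraction in the example are illustrative, not measured values from the talk.

```c
/* Amdahl's law (assumed model, not stated on the slides):
 *   overall = 1 / ((1 - f) + f / s)
 * where f is the fraction of software-only run time spent in the kernels
 * moved to the FPGA, and s is the speedup of those kernels. */
double overall_speedup(double kernel_fraction, double kernel_speedup) {
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

For instance, if kernels accounting for 90% of run time were sped up 41x, overall_speedup(0.9, 41.0) would be 41/5 = 8.2, since the remaining 10% of the time still runs in software.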

Slide 33/39: FPGA Ubiquity via Obscurity
- The FPGA is hidden from languages and tools.
- Thus, ANY microprocessor platform is extendible with an FPGA.
- So any program can potentially be sped up by FPGAs.
- No new languages, no new tools.
- Maintains the "ecosystem" among application, tool, and architecture developers.
[Figure: applications, tools, and architectures connected through standard binaries: SW, profiling, standard compiler, binary, then a translator in front of processor plus FPGA.]

Slide 34/39: Outline
- FPGAs: why they’re great; why they’re not ubiquitous yet
- Hiding FPGAs from programmers
- Warp processing: binary decompilation; just-in-time FPGA compilation
- Directions

Slide 35/39: Directions – What’s Next?
- Immediate future: develop warp processing using benchmarks from other domains (desktop, server, scientific).
- With partners: IBM, Freescale.
- May require new decompilation techniques.

Slide 36/39: Directions – What’s Next?
- Application-specific FPGA: tune the FPGA fabric to the application (or domain).
- Parameters: LUTs/CLB, LUT size. Many more are possible, e.g., switch matrix size, number of long vs. short channels.
[Figure: delay for each configuration (LUTs/CLB, and LUT sizes 2-7) for one application.]
[Figure: delay and area when tuning parameters for best delay for each app, rather than for all apps.]

Slide 37/39: Directions – What’s Next?
- Parallel benchmarks: NAS, SPEComp, Splash, …
- Map each thread to a custom FPGA circuit.
- Huge potential speedups.
[Figure: threads Thrd1 through ThrdN running on multiple µPs and mapped to the FPGA by the on-chip CAD; sample speedups from other works.]

Slide 38/39: Directions – What’s Next?
With a JIT FPGA compiler, what else is possible?
- Implications for existing applications? Image processing, neural networks, ...
- Add FPGA hardware to improve performance, like expandable memory?
- Standard binaries for FPGAs? Rather than extracting the circuit from sequential code, distribute the circuit binary itself, and use the JIT FPGA compiler to best map it to the FPGA resources.

Slide 39/39: Summary
- The FPGA future looks bright.
- Hiding the FPGA via warp processing is feasible.
- Decompilation can recover high-level constructs to yield speedups competitive with source-level approaches.
- JIT FPGA compilation can be made sufficiently lean.
- Many possible directions exist that may use FPGAs to gain ultra-high performance without ultra-high engineering or hardware costs.
Publications can be found at: