Warp Processors: Towards Separating Function and Architecture. Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside.


Warp Processors: Towards Separating Function and Architecture. Frank Vahid, Professor, Department of Computer Science and Engineering, University of California, Riverside; faculty member, Center for Embedded Computer Systems, UC Irvine. Warp Processor Ph.D. students: Roman Lysecky (Ph.D. 2004), Greg Stitt (Ph.D. 2005). This research is supported in part by the National Science Foundation, the Semiconductor Research Corporation, and Motorola.

2 Main Idea — Warp Processors: Dynamic HW/SW Partitioning.
1. Initially execute the application in software only (µP with I$/D$).
2. Profile the application to determine its critical regions (on-chip Profiler).
3. Partition the critical regions to hardware (Dynamic Partitioning Module, DPM).
4. Program the warp configurable logic architecture (FPGA) and update the software binary.
5. The partitioned application executes faster with lower energy consumption (its speed has been "warped").
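A minimal sketch of this five-step runtime flow in C; every function name below is a hypothetical placeholder, since the slides do not define the DPM's interfaces:

    /* Hedged sketch of the five-step warp-processing flow; all function
     * names here are hypothetical placeholders, not an actual DPM API.   */
    extern void execute_software(void);
    extern void profile_critical_regions(void);
    extern void partition_to_hw(void);
    extern void program_fabric_and_patch_binary(void);

    void warp_process(void)
    {
        execute_software();                 /* 1: run the app in SW only        */
        profile_critical_regions();         /* 2: on-chip profiler finds loops  */
        partition_to_hw();                  /* 3: DPM maps critical loops to HW */
        program_fabric_and_patch_binary();  /* 4: configure FPGA, update binary */
        /* 5: execution continues, now faster and at lower energy ("warped") */
    }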

3 FPGAs Are Programmable. A processor is programmed by loading bits (a software binary) into its program memory; an FPGA is programmed by loading bits (an FPGA binary) into its lookup tables (LUTs), configurable logic blocks (CLBs), and switch matrices (SMs). [Figure: an array of CLBs and SMs; a small LUT with address inputs a, b, c and outputs x, y, e.g., x = a or b.]
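To illustrate the point that memory can implement combinational logic, here is a small, self-contained C sketch (not from the slides) of a 3-input, 2-output LUT: the truth table is stored in an 8-entry array indexed by the input bits, so evaluating the logic is just a memory read:

    #include <stdio.h>

    /* 3-input, 2-output lookup table: bit 0 of each entry is output x,
     * bit 1 is output y.  Here x = a OR b and y = a AND b AND c, chosen
     * only as an illustration; an FPGA tool fills the table from the
     * synthesized logic function. */
    static const unsigned char lut[8] = {
        /* cba = 000 */ 0x0, /* 001 */ 0x1, /* 010 */ 0x1, /* 011 */ 0x1,
        /* 100 */ 0x0, /* 101 */ 0x1, /* 110 */ 0x1, /* 111 */ 0x3
    };

    static unsigned lut_eval(unsigned a, unsigned b, unsigned c)
    {
        unsigned addr = (c << 2) | (b << 1) | a;   /* inputs form the address */
        return lut[addr];                          /* memory read = logic eval */
    }

    int main(void)
    {
        for (unsigned c = 0; c < 2; c++)
            for (unsigned b = 0; b < 2; b++)
                for (unsigned a = 0; a < 2; a++)
                    printf("a=%u b=%u c=%u -> x=%u y=%u\n", a, b, c,
                           lut_eval(a, b, c) & 1, (lut_eval(a, b, c) >> 1) & 1);
        return 0;
    }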

4 FPGAs Do Bit Manipulation Fast. C code for bit reversal:
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Compiled for a processor, this becomes roughly 64 instructions of shift/mask sequences (sll $v1[3],$v0[2],0x10; srl $v0[2],$v0[2],0x10; or $v0[2],$v1[3],$v0[2]; srl $v1[3],$v0[2],0x8; and $v1[3],$v1[3],$t5[13]; ...), taking 32 to 128 cycles. Synthesized to hardware, bit reversal is just wires connecting the original x value to the bit-reversed x value: 1 cycle, for a 32x-128x speedup.

5 FPGAs Support Much Parallelism. C code for an FIR filter: for (i=0; i < 128; i++) y[i] += c[i] * x[i]; On a processor this is thousands of instructions and several thousand cycles. In hardware, all 128 multiply-accumulates can execute in parallel, completing in roughly 7 cycles — a speedup of more than 100x.
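A self-contained version of the slide's loop (the data values and the print statement are illustrative additions), handy as the software-only baseline a warp processor would start from:

    #include <stdio.h>

    #define N 128

    int main(void)
    {
        static int x[N], c[N], y[N];

        /* Illustrative data; a real application would fill these buffers
         * with samples and filter coefficients. */
        for (int i = 0; i < N; i++) { x[i] = i; c[i] = 2; y[i] = 0; }

        /* The critical loop from the slide: on a processor this executes
         * thousands of instructions sequentially; in the FPGA fabric all
         * 128 multiply-accumulates can run in parallel. */
        for (int i = 0; i < N; i++)
            y[i] += c[i] * x[i];

        printf("y[0]=%d y[127]=%d\n", y[0], y[N - 1]);
        return 0;
    }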

6 Why Aren't FPGAs Part of Mainstream Computing? The benefits have been known for over a decade. Hardware/software partitioning research since the early 90s: PRISM [Athanas, Silverman]; TOSCA [Balboni, Fornaciari, Sciuto]; COSYMA [Henkel, Ernst]; Vulcan [Gupta, De Micheli]; SpecSyn [Gajski, Vahid, Narayan, Gong]; etc. Microprocessor/FPGA architecture research since the early 90s: DISC [Wirthlin, Hutchings]; GARP [Hauser, Wawrzynek]; Chimaera [Hauck, Fry, Hosler, Kao]; MorphoSys [Lee, Singh, Lu, et al.]. Commercial ventures for several years: Chameleon, Morphotec, Stretch (using Tensilica, April 2004). New Atmel, Triscend, Altera, and Xilinx devices have appeared in the past few years.

7 Single-Chip Microprocessor/FPGA Platforms Are Appearing Commercially. [Figure: commercial devices, courtesy of Triscend, Atmel, Xilinx (with embedded PowerPCs), and Altera.]

8 Why Aren't FPGAs Mainstream? They don't fit well with the established software world of well-known languages, tools, and flows: the concept of a standard binary is missing in the FPGA world, so FPGAs have been limited to the CAD domain. But for every CAD user there are more than 100 software writers — only about 15,000 CAD seats worldwide versus millions of compiler seats. [Flow: SW source → Profiling → Special Compiler (includes synthesis, technology mapping, place & route) → Binary plus FPGA-specific Netlist → Modified Binary running on Proc. + FPGA.]

9 A Standard Binary Is Important. It separates function from architecture, so tools and architectures can be developed independently. Binaries can even be dynamically translated or optimized: UQBT [Cifuentes]; Dynamo [Bala, Duesterwald, Banerjia]; Transmeta Crusoe and Efficeon, as well as modern Pentiums; Java bytecode. [Flow: SW source → Profiling → Standard Compiler → Binary, which then runs on Processor1, Processor2, or Processor3.]

10 Partial Solution to Bring FPGAs into Mainstream SW: Binary-Level Partitioning. Binary-level partitioning [Stitt/Vahid, ICCAD'02]; a recent commercial product is CriticalBlue. Partition and synthesize starting from the SW binary. Advantages: works with any compiler, any language, multiple source files, assembly/object code, and legacy code. Disadvantage: high-level information is lost — is there a quality loss? [Flow: SW source → Profiling → Standard Compiler → Binary → Partitioner (a less disruptive, back-end tool, whereas traditional partitioning is done before compilation; includes synthesis, technology mapping, place & route) → Netlist + Modified Binary → Proc. + FPGA.]

11 Key to Good-Quality Binary-Level Partitioning: Decompilation. Goal: recover the high-level information lost during compilation; otherwise synthesis results are poor. We utilized sophisticated decompilation methods developed over past decades for binary translation, and developed additional methods specific to our purpose; some limits remain (e.g., indirect jumps). The flow: Software Binary → Binary Parsing → CDFG Creation → Control Structure Recovery (discover loops, if-else, etc.) → Removing Instruction-Set Overhead (reduce operation sizes, etc.) → Undoing Back-End Compiler Optimizations (reroll loops, etc.) → Alias Analysis (allows parallel memory access) → Annotated CDFG. How does binary-level partitioning with decompilation compare with source-level partitioning?
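To make the "undoing back-end compiler optimizations" step concrete, here is an illustrative before/after pair in C (my own example, not from the slides): the first function mimics the unrolled code a decompiled binary presents, and the second is the rerolled loop the decompiler tries to recover, which is far friendlier to synthesis:

    /* Illustrative only: the kind of transformation the decompiler's
     * "undoing back-end optimizations" step performs (loop rerolling).  */

    /* What a CDFG built directly from an unrolled binary resembles: */
    void sum_unrolled(const int *a, int *out)
    {
        int s = 0;
        s += a[0];
        s += a[1];
        s += a[2];
        s += a[3];
        *out = s;
    }

    /* What control-structure recovery plus rerolling recovers: */
    void sum_rerolled(const int *a, int *out)
    {
        int s = 0;
        for (int i = 0; i < 4; i++)   /* loop structure recovered */
            s += a[i];
        *out = s;
    }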

12 Decompilation Recovery Rate. In most situations, we can recover all high-level information. [Chart: recovery success for dozens of benchmarks, using several different compilers and optimization levels.]

13 Binary-Level Partitioning vs. Source-Level Partitioning [Stitt/Vahid '04, submitted]. [Chart comparing binary-level and source-level partitioning results.]

14 Idea: Binary Partitioning Enables Dynamic Partitioning. Embed the CAD tools on-chip — feasible in the era of billion-transistor chips. Advantages: no special desktop tools, completely transparent, and it avoids the complexity of supporting different FPGA types. It complements other approaches: desktop CAD is best from a purely technical perspective, but dynamic partitioning opens additional market segments (i.e., all software developers) that otherwise might not use desktop CAD. It returns to the "standard binary," opening processor architects to a world of speedup using FPGAs. [Flow: standard binary → on-chip Profiling, Partitioner, and CAD (includes synthesis, technology mapping, place & route) → Netlist + Modified Binary → Proc. + FPGA.]

15 Warp Processors: Tools and Requirements. Warp processor architecture: a microprocessor (uP with I$/D$), an on-chip profiling architecture, a configurable logic architecture, and a dynamic partitioning module (DPM). The DPM's tool flow takes the binary through Decompilation, Partitioning, RT Synthesis, Logic Synthesis, Technology Mapping, and Placement & Routing, and a Binary Updater produces the updated binary. Is a DPM with its own uP overkill? Consider that the FPGA is much bigger than the uP; also, there may be dozens of uPs, but all can share one DPM.

16 Warp Processors — All that CAD on-chip? CAD people may at first think dynamic HW/SW partitioning is "absurd": those CAD tools are complex, require long execution times on powerful desktop workstations, require very large memory resources (usually gigabytes of hard drive space), and a complete CAD tool package can cost over $1 million. All that on-chip? Typical desktop stage times: decompilation ~30 s, partitioning ~1 min, RT synthesis ~1-2 min, logic synthesis ~1 min, technology mapping ~1 min, placement ~1-2 min, routing ~2-30 min, with memory footprints of roughly 10-60 MB per stage.

17 Warp Processors: Tools and Requirements. In fact, on-chip CAD may be practical because it is specialized CAD. Traditional CAD must handle huge, arbitrary inputs; warp processor CAD handles only critical software kernels. Traditional FPGAs must support huge, arbitrary netlists, ASIC prototyping, and varied I/O; the warp processor FPGA only needs to speed up kernels. The key is careful simultaneous design of the FPGA and the CAD: FPGA features are evaluated for their impact on CAD, the CAD influences the FPGA features, and architecture features are added specifically for kernels.

18 Warp Processors: Configurable Logic Architecture [Lysecky/Vahid, DATE'04]. Loop support hardware — data address generators (DADG) and loop control hardware (LCH), as found in digital signal processors — provides fast loop execution and supports memory accesses with regular access patterns, so synthesis of an FSM is not required for many critical loops. A 32-bit fast multiply-accumulate (MAC) unit sits alongside the configurable logic fabric and registers (Reg0-Reg2).

19 Warp Processors: Configurable Logic Fabric [Lysecky/Vahid, DATE'04]. A simple fabric: an array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs), attached to the DADG, LCH, and 32-bit MAC. Each CLB is simple: two 3-input, 2-output LUTs with carry-chain support. The switch matrices are simple too: 4 short and 4 long channels. The fabric is designed for simple, fast CAD. [Figure: a CLB with inputs a-f and outputs o1-o4 connecting to adjacent CLBs; SM channels labeled 0, 0L, 1, 1L, 2, 2L, 3, 3L.]
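A minimal data-model sketch in C (illustrative only; every field name and dimension here is my own assumption mirroring the slide's description, not the actual fabric definition) of what the on-chip CAD tools target:

    /* Illustrative data model of the warp configurable logic fabric;
     * all names and sizes are hypothetical, mirroring the slide.      */
    #include <stdint.h>

    #define LUT_INPUTS   3
    #define SM_SHORT_CH  4
    #define SM_LONG_CH   4

    typedef struct {
        uint8_t truth_table[1 << LUT_INPUTS];  /* 8 entries, 2 output bits each */
    } lut_t;

    typedef struct {
        lut_t   lut[2];            /* two 3-input, 2-output LUTs per CLB */
        uint8_t carry_chain_en;    /* carry-chain support */
    } clb_t;

    typedef struct {
        uint8_t short_conn[SM_SHORT_CH];  /* 4 short channels */
        uint8_t long_conn[SM_LONG_CH];    /* 4 long channels  */
    } sm_t;

    typedef struct {
        clb_t clbs[8][8];          /* fabric dimensions are placeholders      */
        sm_t  sms[9][9];           /* switch matrices surround the CLB array  */
        /* plus DADG, loop control hardware, and a 32-bit MAC (not modeled) */
    } warp_fabric_t;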

20 Warp Processors: Profiler [Gordon-Ross/Vahid, CASES'03; to appear in the "Best of MICRO/CASES" issue of IEEE Trans. on Computers]. A non-intrusive on-chip loop profiler provides the relative frequency of the top 16 loops. It is a small cache (16 entries) of only about 2,300 gates, with less than 1% power overhead when active. [Figure: the frequent loop cache and its controller sit beside the MIPS/ARM uP's I$, observing addresses and maintaining saturating counts.]
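A software model (hypothetical; the real profiler is ~2,300 gates of hardware and its replacement policy differs) of the idea described above: a 16-entry table of saturating counters keyed by the target address of a frequently executed loop:

    /* Hypothetical model of the non-intrusive loop profiler: a 16-entry
     * table of saturating counters indexed by loop (branch-target) address. */
    #include <stdint.h>

    #define PROFILER_ENTRIES 16
    #define COUNTER_MAX      0xFFFF        /* saturating 16-bit counters */

    typedef struct {
        uint32_t loop_addr;
        uint16_t count;
        uint8_t  valid;
    } profiler_entry_t;

    static profiler_entry_t table[PROFILER_ENTRIES];

    /* Called (conceptually) whenever a short backward branch is taken. */
    void profiler_observe(uint32_t branch_target)
    {
        int free_slot = -1, min_slot = 0;
        for (int i = 0; i < PROFILER_ENTRIES; i++) {
            if (table[i].valid && table[i].loop_addr == branch_target) {
                if (table[i].count < COUNTER_MAX)      /* saturate, don't wrap */
                    table[i].count++;
                return;
            }
            if (!table[i].valid && free_slot < 0)
                free_slot = i;
            if (table[i].count < table[min_slot].count)
                min_slot = i;
        }
        /* Not found: allocate a free entry, or evict the least-frequent one
         * (an assumed policy; the actual hardware policy differs). */
        int slot = (free_slot >= 0) ? free_slot : min_slot;
        table[slot].loop_addr = branch_target;
        table[slot].count = 1;
        table[slot].valid = 1;
    }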

21 Warp Processors: Dynamic Partitioning Module (DPM). The DPM executes the on-chip partitioning tools. It consists of a small, low-power processor (an ARM7; current SoCs can have dozens of these), on-chip instruction and data caches, and a few megabytes of memory. Its tool flow: Decompilation → Partitioning → RT Synthesis → Logic Synthesis → Technology Mapping → Placement & Routing → Binary Updater → Updated Binary. First step: Decompilation.

22 Warp Processors: Execution Time and Memory Requirements. Desktop CAD stages take from about 30 seconds to 30 minutes each and use roughly 10-60 MB of memory; the on-chip decompilation step, by contrast, takes under 1 s and about 1 MB.

23 Warp Processors: Dynamic Partitioning Module (DPM) — next step in the on-chip tool flow: Partitioning.

24 Warp Processors: Binary HW/SW Partitioning [Stitt/Vahid, ICCAD'02]. A simple partitioning algorithm: move the most frequent loops to hardware; usually only 2-3 critical loops comprise most of the execution. Steps: take the profiling results and the decompiled binary; sort the loops by frequency; remove regions not suitable for hardware; move the remaining regions to hardware until the WCLA (warp configurable logic architecture) is full; if the WCLA is full, the remaining regions stay in software. A sketch of this greedy policy appears below.
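A sketch of that greedy policy in C, assuming hypothetical types and a per-loop area estimate (the slides do not give the DPM's actual data structures):

    /* Illustrative greedy partitioner; loop_t, the area field, and the
     * capacity check are assumptions, but the policy follows the slide:
     * most frequent suitable loops first, until the WCLA is full.        */
    #include <stdlib.h>

    typedef struct {
        unsigned freq;        /* execution frequency from the profiler */
        unsigned area;        /* estimated area in the WCLA            */
        int      hw_suitable; /* e.g., no indirect jumps               */
    } loop_t;

    static int by_freq_desc(const void *a, const void *b)
    {
        const loop_t *x = a, *y = b;
        return (int)y->freq - (int)x->freq;
    }

    void partition(loop_t *loops, size_t n, unsigned wcla_capacity)
    {
        qsort(loops, n, sizeof *loops, by_freq_desc);   /* sort loops by freq. */
        unsigned used = 0;
        for (size_t i = 0; i < n; i++) {
            if (!loops[i].hw_suitable)
                continue;                               /* stays in software   */
            if (used + loops[i].area > wcla_capacity)
                continue;                               /* WCLA full: stays SW */
            used += loops[i].area;
            /* move_to_hw(&loops[i]);  -- hand region to RT synthesis (stub) */
        }
    }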

25 Warp Processors: Execution Time and Memory Requirements. On-chip so far: decompilation under 1 s and ~1 MB, partitioning under 1 s and ~0.5 MB — versus the minutes and tens of megabytes of the corresponding desktop tools.

26 Warp Processors: Dynamic Partitioning Module (DPM) — next step in the on-chip tool flow: RT Synthesis.

27 Warp Processors: RT Synthesis [Stitt/Lysecky/Vahid, DAC'03]. Converts the decompiled CDFG into Boolean expressions. Maps memory accesses to our data address generator architecture: it detects reads vs. writes, the memory access pattern, and memory read/write ordering. It also optimizes the dataflow graph, removing address calculations and loop counter/exit conditions, since loop control is handled by the loop control hardware. [Example: a graph that incremented an address register (r1 + 1) and performed a memory read into r3 collapses to a DADG read feeding the add with r2.]

28 Warp Processors: RT Synthesis [Stitt/Lysecky/Vahid, DAC'03]. Maps dataflow operations to hardware components; we currently support adders, comparators, shifters, Boolean logic, and multipliers. It then creates a Boolean expression for each output bit of the dataflow graph. For example, the 32-bit adder for r4 = r1 + r2 yields r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]; r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = ...; and so on, and a comparison node (e.g., r5 < r38) is expanded into a 32-bit comparator the same way.
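An illustrative generator (not the actual tool) that prints the per-output-bit expressions RT synthesis would create for the 32-bit ripple-carry adder r4 = r1 + r2:

    /* Illustrative only: emits the per-output-bit Boolean expressions that
     * RT synthesis would create for a 32-bit ripple-carry adder r4 = r1 + r2. */
    #include <stdio.h>

    int main(void)
    {
        printf("r4[0] = r1[0] xor r2[0]\n");
        printf("carry[0] = r1[0] and r2[0]\n");
        for (int i = 1; i < 32; i++) {
            printf("r4[%d] = (r1[%d] xor r2[%d]) xor carry[%d]\n",
                   i, i, i, i - 1);
            printf("carry[%d] = (r1[%d] and r2[%d]) or (carry[%d] and (r1[%d] xor r2[%d]))\n",
                   i, i, i, i - 1, i, i);
        }
        return 0;
    }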

29 Warp Processors: Execution Time and Memory Requirements. On-chip so far: decompilation (<1 s, ~1 MB), partitioning (<1 s, ~0.5 MB), and RT synthesis (<1 s, ~0.5 MB), versus the desktop tools' minutes and tens of megabytes.

30 Warp Processors: Dynamic Partitioning Module (DPM) — next step in the on-chip tool flow: Logic Synthesis.

31 Warp Processors: Logic Synthesis [Stitt/Lysecky/Vahid, DAC'03]. Optimizes the hardware circuit created during RT synthesis. There is a large opportunity for logic minimization due to the immediate (constant) values in the binary code, so we utilize a simple two-level logic minimization approach. For example, for r2 = r1 + 4, the generic per-bit expressions r2[0] = r1[0] xor 0 xor 0, r2[1] = r1[1] xor 0 xor carry[0], r2[2] = r1[2] xor 1 xor carry[1], r2[3] = r1[3] xor 0 xor carry[2], ... simplify by folding the constant bits away, e.g., r2[0] = r1[0], r2[1] = r1[1] xor carry[0], and so on.

32 Warp Processors — ROCM, the Riverside On-Chip Minimizer [Lysecky/Vahid, DAC'03; Lysecky/Vahid, CODES+ISSS'03]. A two-level minimization tool that utilizes a combination of approaches from Espresso-II [Brayton et al., 1984] and Presto [Svoboda & White, 1979]. It eliminates the need to compute the off-set (reducing memory usage) and utilizes a single expand phase instead of the usual multiple expand/reduce/irredundant iterations over the on-set, off-set, and dc-set. On average the result is only 2% larger than the optimal solution for our benchmarks.

33 Warp Processors — ROCM Results [Lysecky/Vahid, DAC'03; Lysecky/Vahid, CODES+ISSS'03]. ROCM executing on a 40 MHz ARM7 (Triscend A7) requires less than 1 second, with a small code size of only 22 kilobytes and average data memory usage of only 1 megabyte. [Chart: execution times compared against a 500 MHz Sun Ultra60.]

34 Warp Processors: Execution Time and Memory Requirements. On-chip so far: decompilation (<1 s, ~1 MB), partitioning (<1 s, ~0.5 MB), RT synthesis (<1 s, ~0.5 MB), and logic synthesis (~1 s, ~1 MB).

35 Warp Processors: Dynamic Partitioning Module (DPM) — next steps in the on-chip tool flow: Technology Mapping, then Placement and Routing.

36 Warp Processors: Technology Mapping/Packing [Lysecky/Vahid, DATE'04; Stitt/Lysecky/Vahid, DAC'03]. ROCPAR performs technology mapping and packing: decompose the hardware circuit into basic logic gates (AND, OR, XOR, etc.); traverse the logic network, combining nodes to form single-output LUTs; combine LUTs with common inputs to form the final 2-output LUTs; pack LUTs in which the output of one LUT is an input to a second LUT; and pack the remaining LUTs into CLBs. A sketch of the node-combining idea appears below.
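A highly simplified, single-pass sketch (assumed names and data layout, not the ROCPAR code) of the node-combining idea: a gate absorbs one of its fan-in gates when the merged node still has at most 3 distinct inputs and therefore still fits a single 3-input LUT:

    /* Illustrative greedy packing pass: merge a gate with one fan-in gate
     * when the merged node has at most 3 distinct inputs (fits a 3-LUT).  */
    #define MAX_NODES  64
    #define MAX_INPUTS 8

    typedef struct {
        int inputs[MAX_INPUTS];   /* ids of driving nodes or primary inputs */
        int num_inputs;
        int packed_into;          /* -1 if still a LUT root */
    } node_t;

    static int merged_support(const node_t *a, const node_t *b, int b_id)
    {
        /* count distinct inputs of a and b, with b itself absorbed into a */
        int support[2 * MAX_INPUTS], n = 0;
        for (int i = 0; i < a->num_inputs; i++)
            if (a->inputs[i] != b_id) support[n++] = a->inputs[i];
        for (int i = 0; i < b->num_inputs; i++) {
            int seen = 0;
            for (int j = 0; j < n; j++) if (support[j] == b->inputs[i]) seen = 1;
            if (!seen) support[n++] = b->inputs[i];
        }
        return n;
    }

    void pack_into_luts(node_t *nodes, int num_nodes)
    {
        for (int i = 0; i < num_nodes; i++) {
            nodes[i].packed_into = -1;
            for (int j = 0; j < nodes[i].num_inputs; j++) {
                int in = nodes[i].inputs[j];
                if (in < 0 || in >= num_nodes) continue;   /* primary input */
                if (merged_support(&nodes[i], &nodes[in], in) <= 3) {
                    nodes[in].packed_into = i;             /* absorb fan-in gate */
                    break;                                 /* one merge per node */
                }
            }
        }
    }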

37 Warp Processors: Placement [Lysecky/Vahid, DATE'04; Stitt/Lysecky/Vahid, DAC'03]. ROCPAR placement: identify the critical path and place its nodes in the center of the configurable logic fabric; use the dependencies between the remaining CLBs to determine their placement; and attempt to use adjacent-cell routing whenever possible.

38 Warp Processors: Execution Time and Memory Requirements. On-chip so far: decompilation (<1 s, ~1 MB), partitioning (<1 s, ~0.5 MB), RT synthesis (<1 s, ~0.5 MB), logic synthesis (~1 s, ~1 MB), technology mapping (<1 s, ~0.5 MB), and placement (<1 s, ~0.5 MB).

39 Warp Processors: Routing. Routing finds a path within the FPGA connecting the source and sinks of each net. VPR, the Versatile Place and Route tool [Betz et al., 1997], uses a modified Pathfinder algorithm: it allows overuse of routing resources each iteration, updates costs, and rips up and reroutes — but it may require a big routing resource graph. The Riverside On-Chip Router (ROCR) instead represents routing between CLBs as routing between switch matrices: its resource graph has SMs as nodes and the short/long channels between them as edges, so it is much smaller and of fixed size (based on the number of SMs). A greedy, depth-first algorithm routes nets between SMs: route, check for congestion or illegal routes, rip up and retry if needed, otherwise done.
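A hedged sketch of that rip-up-and-reroute control loop over a switch-matrix resource graph; all helper functions and types below are hypothetical placeholders, and the actual ROCR cost functions are described in Lysecky/Vahid/Tan, DAC'04:

    /* Hedged sketch of a rip-up-and-reroute loop over a switch-matrix
     * resource graph; every name below is a hypothetical placeholder.  */
    typedef struct net net_t;
    typedef struct sm_graph sm_graph_t;

    extern int  route_net_greedy_dfs(sm_graph_t *g, net_t *net); /* 0 = routed  */
    extern int  has_congestion(const sm_graph_t *g);             /* overused?   */
    extern void increase_congestion_costs(sm_graph_t *g);
    extern void rip_up_congested(sm_graph_t *g, net_t *nets, int n);

    int route_all(sm_graph_t *g, net_t *nets, int num_nets, int max_iters)
    {
        for (int iter = 0; iter < max_iters; iter++) {
            int failed = 0;
            for (int i = 0; i < num_nets; i++)
                failed |= route_net_greedy_dfs(g, &nets[i]); /* greedy DFS route */
            if (!failed && !has_congestion(g))
                return 0;                        /* legal routing found: done    */
            increase_congestion_costs(g);        /* discourage overused channels */
            rip_up_congested(g, nets, num_nets); /* rip up and try again         */
        }
        return -1;                               /* could not route in max_iters */
    }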

40 Warp Processors — Routing Results [Lysecky/Vahid/Tan, DAC'04]. ROCR is on average 10x faster than VPR (timing-driven), and up to 21x faster (for ex5p). Its memory usage is only 3.6 MB, 13x less than VPR.

41 Warp Processors — Routing: Critical Path Results [Lysecky/Vahid/Tan, DAC'04]. ROCR produces a critical path 10% shorter than VPR in routability-driven mode, and 32% longer than VPR in timing-driven mode.

42 Warp Processors: Execution Time and Memory Requirements. On-chip: decompilation (<1 s, ~1 MB), partitioning (<1 s, ~0.5 MB), RT synthesis (<1 s, ~0.5 MB), logic synthesis (~1 s, ~1 MB), technology mapping and placement (<1 s, ~0.5-1 MB each), and routing (~10 s, 3.6 MB) — the entire on-chip flow fits in seconds and a few megabytes, versus minutes and tens of megabytes for desktop CAD.

43 Warp Processors: Dynamic Partitioning Module (DPM) — final step in the on-chip tool flow: the Binary Updater.

44 Warp Processors: Binary Updater. The binary updater must modify the binary to use the hardware within the WCLA: a hardware initialization function is added at the end of the binary, each hardware-mapped loop is replaced with a jump to that initialization function, and the initialization function jumps back to the end of the loop. For example, the loop for (i=0; i < 256; i++) output += input1[i]*2; is replaced by initHw(); enableLoop = 1; with the initHW() routine appended to the binary.
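A C-level illustration of the update (the real updater patches the binary, not source code; initHw and enableLoop follow the slide, everything else is assumed):

    /* BEFORE binary update: the critical loop runs in software. */
    volatile int enableLoop;           /* assumed handshake flag with the WCLA */
    int output;
    int input1[256];

    void kernel_sw(void)
    {
        for (int i = 0; i < 256; i++)
            output += input1[i] * 2;
    }

    /* AFTER binary update: the loop is replaced with a jump to the
     * hardware-initialization routine appended at the end of the binary;
     * control returns to the instruction after the original loop.        */
    void initHw(void);                 /* appended HW initialization function */

    void kernel_hw(void)
    {
        initHw();                      /* configure/start the WCLA circuit  */
        enableLoop = 1;                /* let the hardware execute the loop */
        /* execution resumes here, at the end of the original loop */
    }

    void initHw(void)
    {
        /* ...program the fabric / DADG for this loop (details assumed)... */
    }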

45 Initial Overall Results: Experimental Setup. We considered 12 embedded benchmarks from NetBench, MediaBench, EEMBC, and Powerstone. On average, 53% of total software execution time was spent executing a single critical loop (more speedup is possible if more loops are considered), and on average the critical loops comprised only 1% of total program size.

46 Warp Processors: Experimental Setup. Warp processor: an embedded microprocessor plus a configurable logic fabric running at 80% of the microprocessor's frequency, based on a commercial platform (Triscend A7); the dynamic partitioning module maps the critical region to hardware, executing on a 75 MHz ARM7 processor, with the DPM active for about 10 seconds; key tools were automated, with some tasks assisted by hand. Versus traditional HW/SW partitioning: an embedded microprocessor plus a Xilinx Virtex-E FPGA (at its maximum possible speed); the software was manually partitioned using VHDL, and the VHDL was synthesized using Xilinx ISE 4.1 on a desktop.

47 Warp Processors — Initial Results: Speedup (critical region/loop). Average loop speedup of 29x.

48 Warp Processors — Initial Results: Speedup (overall application, with only 1 loop sped up). Average speedup of 2.1x, vs. 2.2x for the Virtex-E approach (individual benchmarks reach roughly 4.1x).

49 Warp Processors — Initial Results: Energy Reduction (overall application, 1 loop only). Average energy reduction of 33%, vs. 36% for the Xilinx Virtex-E (up to 74% on some benchmarks).

50 Warp Processors: Execution Time and Memory Requirements (on a PC). Xilinx ISE: 9.1 s and 60 MB (manually performed); ROCPAR: 0.2 s and 3.6 MB — a 46x improvement. On a 75 MHz ARM7, ROCPAR takes only 1.4 s.

51 Multi-Processor Platforms. Multiple processors can share a single DPM, time-multiplexed; the DPM is just another processor whose task is to help the other processors. The processors can even be soft cores in the FPGA. The DPM can also revisit the same application in case its usage or data has changed.

52 The Idea of Warp Processing Can Be Viewed as JIT FPGA Compilation. The idea: a standard binary for the FPGA, with benefits similar to a standard binary for a microprocessor — portability, transparency, standard tools. It may involve a microprocessor for compactness of non-critical behavior. [Flow: VHDL/Verilog → Profiling → Standard CAD Tools → Standard HW Binary → JIT FPGA Compiler on each target FPGA (with memory and datapath units).]

53 Future Directions. It is already widely known that mapping software to FPGAs has great potential; our work has shown that mapping software to FPGAs dynamically may be feasible. Extensive future work is needed on tools and fabrics to achieve overall application speedups and energy improvements of 100x-1000x.

54 Ultimately... We are working towards the separation of function from architecture: write the application, create a "standard binary," and map that binary to any microprocessor (one or more), any FPGA, or a combination thereof. This enables improvements in function and architecture without the heavy interdependence of today. [Flow: SW source → Profiling → Standard Compiler → Binary → Processor1, FPGA1, or Processor + FPGA.]

55 Warp Processing Summary.
1. Initially execute the application in software only (µP with I$/D$).
2. Profile the application to determine its critical regions.
3. Partition the critical regions to hardware (Dynamic Partitioning Module).
4. Program the configurable logic and update the software binary.
5. The partitioned application executes faster with lower energy consumption (its speed has been "warped").
Kernels were sped up 29x on average, over 100x in some cases, with corresponding energy savings. The standard binary means no tool impact, making FPGAs usable with any existing software environment. We are currently investigating applications: embedded (with Motorola and Xilinx), and desktop and mainframe (with Philips and IBM).

56 Publications & Acknowledgements. All of these publications are available online. Dynamic FPGA Routing for Just-in-Time FPGA Compilation, R. Lysecky, F. Vahid, S. Tan, Design Automation Conference (DAC), 2004. A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning, R. Lysecky and F. Vahid, Design Automation and Test in Europe Conference (DATE), February 2004. Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware, A. Gordon-Ross and F. Vahid, ACM/IEEE Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2003; to appear in the special issue "Best of CASES/MICRO" of IEEE Trans. on Computers. A Codesigned On-Chip Logic Minimizer, R. Lysecky and F. Vahid, ACM/IEEE CODES+ISSS Conference, 2003. Dynamic Hardware/Software Partitioning: A First Approach, G. Stitt, R. Lysecky and F. Vahid, Design Automation Conference (DAC), 2003. On-Chip Logic Minimization, R. Lysecky and F. Vahid, Design Automation Conference (DAC), 2003. The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic, G. Stitt and F. Vahid, IEEE Design and Test of Computers, November/December 2002. Hardware/Software Partitioning of Software Binaries, G. Stitt and F. Vahid, IEEE/ACM International Conference on Computer Aided Design (ICCAD), November 2002. We gratefully acknowledge financial support from the National Science Foundation and the Semiconductor Research Corporation for this work. We also appreciate the collaborations and support from Motorola, Triscend, Philips, and Xilinx.