A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation

Roman Lysecky (a), Frank Vahid (a)*, Sheldon X.-D. Tan (b)
(a) Department of Computer Science and Engineering
(b) Department of Electrical Engineering
University of California, Riverside
* Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx.

2/22 Introduction: Standard Binary - Separating Function and Architecture
- Software binaries of the past
  - Binary reflected the specific language of the underlying architecture, which limited portability
- Current “standard binary”
  - Concept: separate function from detailed architecture
  - Develop new architectures for existing applications
  - Trend towards dynamic translation and optimization
[Diagram: SW source, profiling, standard compiler, binary, x86 binary]

3/22 Introduction: But Today’s Binaries Are More than Just Software
[Diagram: a software-only flow (SW source, profiling, standard compiler, SW binary, processor) next to a flow with SW and HW sources, profiling, compiler/synthesis, and binaries for Processor1, Processor2, and Processor3, each paired with an FPGA]

4/22 Introduction: Just-in-Time FPGA Compilation?
- JIT FPGA compilation
  - Idea: standard binary for FPGA
  - Similar benefits as standard binary for microprocessor
    - Portability, transparency, standard tools
  - Embedded JIT compilation tools optimized for each FPGA
[Diagram: VHDL/Verilog, profiling, standard CAD tools, standard HW binary, JIT FPGA compilers, FPGAs, MEM]

5/22 Introduction: One Use of JIT FPGA Compilation
[Diagram: cable TV company feature upgrade; SW source compiled to a SW binary for ARM7, ARM9, ARM10, and ARM11 processors]

6/22 Introduction: One Use of JIT FPGA Compilation
[Diagram: cable TV company feature upgrade with SW and HW parts; ARM7, ARM9, ARM10, and ARM11 processors paired with FPGA 1 through FPGA 4; HW sources HW1 through HW4 produce HW netlists Netlist1 through Netlist4, each shipped alongside the SW binary]

7/22 Introduction: One Use of JIT FPGA Compilation
[Diagram: same cable TV feature upgrade; ARM7, ARM9, ARM10, and ARM11 processors paired with FPGA 1 through FPGA 4; a SW binary and a HW binary, with the HW binary passing through a JIT FPGA compiler]

8/22 Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic & update software binary
5. Partitioned application executes faster with lower energy consumption
[Diagram: µP with I$ and D$, FPGA, profiler, and Dynamic Part. Module (DPM)]
Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02

9/22 Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
[Diagram: µP with I$ and D$, FPGA, profiler, and DPM (CAD). DPM stages shown: decompilation, partitioning, RT synthesis, JIT FPGA compilation (logic synthesis, tech. mapping/packing, placement, routing), and binary updater. Data shown: binary, std. HW binary, HW bitstream, updated binary]
Lysecky/Vahid, DATE’04; Stitt/Lysecky/Vahid, DAC’03; Stitt/Vahid, ICCAD’02

10/22 Introduction: Existing FPGAs Not Suitable for JIT FPGA Compilation
- Existing FPGAs require extremely complex CAD tools
  - Designed to handle large arbitrary circuits, ASIC prototyping, etc.
  - Require long execution times and very large memory usage
  - Not suitable for dynamic on-chip execution
[Figure: execution time per CAD stage: Log. Syn. 1 min, Tech. Map 1 min, Place 1-2 mins, Route 2-30 mins; memory usage labels: 50 MB, 60 MB, 10 MB, 10 MB]

11/22 JIT FPGA Compilation: CAD-Oriented FPGA
- Solution: develop a custom CAD-oriented FPGA
  - Careful simultaneous design of FPGA and CAD
  - FPGA features evaluated for impact on CAD
  - Enables development of fast, lean JIT FPGA compilation tools
[Figure: execution time and memory per stage for logic synthesis, tech. mapping/packing, placement, and routing; labels as extracted: 1s, < 1s, .5 MB, 1 MB, < 1s, 1 MB, 10s, 3.6 MB]
Lysecky/Vahid, DATE’04

12/22 Simple Configurable Logic Fabric: CAD-Oriented FPGA
- Simple Configurable Logic Fabric (CLF)
  - Hundreds of existing commercial and research FPGA fabrics
    - Most designed to balance circuit density and speed
  - Analyzed FPGA features to determine their impact on CAD
  - Designed our CLF in conjunction with JIT FPGA compilation tools
  - Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
  - Each CLB is directly connected to an SM
    - Along with SM design, allows for design of lean JIT routing
[Diagram: grid of SMs and CLBs]
Lysecky/Vahid, DATE’04

13/22 Simple Configurable Logic Fabric: Combinational Logic Block
- Combinational Logic Block
  - Incorporates two 3-input 2-output LUTs
    - Equivalent to four 3-input LUTs with fixed internal routing
    - Allows for good-quality circuits while reducing JIT technology mapping complexity
  - Provides routing resources between adjacent CLBs to support carry chains
    - Reduces the number of nets we need to route
- FPGAs vs. SCLF
  - FPGAs: Flexibility/Density: large CLBs, various internal routing resources
  - SCLF: Simplicity: limited internal routing, reduced on-chip CAD complexity
[Diagram: CLB with LUT inputs a, b, c, d, e, f, outputs o1, o2, o3, o4, and connections to adjacent CLBs]
Lysecky/Vahid, DATE’04
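As a rough illustration of what a 3-input 2-output LUT can hold, the sketch below models one such LUT as two 8-entry truth tables and configures it as a full adder, so that a chain of them forms a ripple-carry adder. The class name `Lut3x2`, the bit ordering of the truth tables, and the one-LUT-per-bit arrangement are assumptions made for this example; the slides do not give the CLB's actual configuration format or carry-chain wiring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lut3x2:
    """Hypothetical model of a 3-input, 2-output LUT (two 8-bit truth tables)."""
    table0: int  # bit i is output 0 for input index i
    table1: int  # bit i is output 1 for input index i

    def evaluate(self, a: int, b: int, c: int) -> tuple:
        # Input index uses a as bit 0, b as bit 1, c as bit 2 (assumed ordering).
        index = a | (b << 1) | (c << 2)
        return (self.table0 >> index) & 1, (self.table1 >> index) & 1


# Full adder: output 0 is the sum (a XOR b XOR c), output 1 is the carry (majority).
FULL_ADDER = Lut3x2(table0=0x96, table1=0xE8)


def ripple_add(x: int, y: int, width: int) -> int:
    """Add two unsigned numbers by chaining one full-adder LUT per bit."""
    carry, total = 0, 0
    for bit in range(width):
        s, carry = FULL_ADDER.evaluate((x >> bit) & 1, (y >> bit) & 1, carry)
        total |= s << bit
    # The final carry becomes the top bit of the result.
    return total | (carry << width)
```

Because both the sum and the carry come out of a single LUT, only the carry has to travel to the next bit position, which is the kind of neighbor-to-neighbor connection the slide says is provided between adjacent CLBs.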

14/22 Simple Configurable Logic Fabric: Switch Matrix
- Switch Matrix
  - All nets are routed using only a single pair of channels throughout the configurable logic fabric
  - Each short channel is associated with a single long channel
  - Designed for fast, lean JIT FPGA routing
- FPGAs vs. SCLF
  - FPGAs: Flexibility/Speed: large routing resources, various routing options
  - SCLF: Simplicity: allows for design of a fast, lean routing algorithm
[Diagram: switch matrix with short channels 0, 1, 2, 3 and long channels 0L, 1L, 2L, 3L on each side]
Lysecky/Vahid, DATE’04

15/22 JIT FPGA Compilation: Routing
- FPGA Routing
  - Find a path within the FPGA to connect the source and sinks of each net within our hardware circuit
- Pathfinder [Ebeling, et al., 1995]
  - Introduced negotiated congestion
  - During each routing iteration, route nets using shortest path
    - Allows overuse (congestion) of resources
  - If congestion exists (illegal routing)
    - Update cost of congested resources
    - Rip up all routes and reroute all nets
- VPR [Betz, et al., 1997]
  - Provides various improvements over Pathfinder
  - Routability-driven: use fewest tracks possible
  - Timing-driven: optimize circuit speed
- Many techniques are used in commercial FPGA CAD tools
[Diagram: routing resources labeled with cost 1; congested resources labeled with cost 2]
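The negotiated-congestion loop listed above can be sketched in a few lines. This is a simplified illustration, not the Pathfinder or VPR implementation: it handles only two-terminal nets, charges cost per node, and uses a made-up cost formula (`(1 + history) * (1 + pres_fac * overuse)`) with a made-up history increment of 0.5 per unit of overuse. All function and variable names are hypothetical.

```python
import heapq
from collections import defaultdict


def shortest_path(graph, cost, src, dst):
    """Dijkstra where entering node v costs cost(v); returns the node list."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in graph[u]:
            nd = d + cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        raise ValueError(f"no path from {src} to {dst}")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def route_all_nets(graph, capacity, nets, max_iters=50):
    """Route, check for overuse, raise costs, rip up everything, and repeat."""
    history = defaultdict(float)  # congestion cost accumulated over iterations
    for iteration in range(max_iters):
        usage = defaultdict(int)  # rip-up: every iteration starts from scratch
        routes = {}
        pres_fac = iteration      # assumed schedule: ignore overuse at first
        for name, (src, dst) in nets.items():
            def cost(n):
                overuse = max(0, usage[n] + 1 - capacity[n])
                return (1.0 + history[n]) * (1.0 + pres_fac * overuse)
            routes[name] = shortest_path(graph, cost, src, dst)
            for n in routes[name]:
                usage[n] += 1
        congested = [n for n in usage if usage[n] > capacity[n]]
        if not congested:
            return routes         # legal routing found
        for n in congested:
            history[n] += 0.5 * (usage[n] - capacity[n])
    return None                   # gave up without a legal routing


# Two nets whose shortest paths both pass through the shared node "m".
graph = {
    "s1": ["m", "a"], "a": ["a2"], "a2": ["t1"],
    "s2": ["m", "b"], "b": ["b2"], "b2": ["t2"],
    "m": ["t1", "t2"], "t1": [], "t2": [],
}
capacity = {node: 1 for node in graph}
nets = {"n1": ("s1", "t1"), "n2": ("s2", "t2")}
routes = route_all_nets(graph, capacity, nets)
```

In this toy example both nets take the shared node in the first iteration, the node's cost is raised, and in the second iteration one net keeps it while the other takes its longer detour.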

16/22 JIT FPGA Compilation: ROCR - Riverside On-Chip Router
- ROCR - Riverside On-Chip Router
  - Resource graph
    - Nodes correspond to SMs
    - Edges correspond to channels between SMs
    - Capacity of an edge is equal to the number of wires within the channel
  - Requires much less memory, as the resource graph is smaller
[Diagram: row of SMs and CLBs and the corresponding routing resource graph of SM nodes, with an edge labeled 0/4]
[Flowchart: Route, then "illegal?"; if yes, Rip-up and route again; if no, Done!]
Lysecky/Vahid/Tan, DAC’04; Lysecky/Vahid, DATE’04
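A minimal sketch of such a resource graph follows, assuming one SM per grid position and channels only between horizontally or vertically adjacent SMs. The class `SmResourceGraph` and its methods are hypothetical names for illustration, not ROCR's data structures. Each edge stores a single usage counter that is compared against the channel width, rather than tracking every wire separately.

```python
from collections import Counter


class SmResourceGraph:
    """Grid of switch matrices; each edge is a channel with a wire capacity."""

    def __init__(self, rows, cols, channel_width):
        self.rows, self.cols = rows, cols
        self.channel_width = channel_width  # wires per channel = edge capacity
        self.usage = Counter()              # edge -> number of nets using it

    def num_nodes(self):
        return self.rows * self.cols

    def num_edges(self):
        # Horizontal channels plus vertical channels between adjacent SMs.
        return self.rows * (self.cols - 1) + self.cols * (self.rows - 1)

    @staticmethod
    def edge(a, b):
        # Channels are undirected, so store each edge in a canonical order.
        return (a, b) if a <= b else (b, a)

    def add_route(self, path):
        """Record a net routed through a sequence of adjacent (row, col) SMs."""
        for a, b in zip(path, path[1:]):
            if abs(a[0] - b[0]) + abs(a[1] - b[1]) != 1:
                raise ValueError(f"{a} and {b} are not adjacent SMs")
            self.usage[self.edge(a, b)] += 1

    def rip_up(self):
        self.usage.clear()

    def overused_edges(self):
        return [e for e, n in self.usage.items() if n > self.channel_width]

    def is_legal(self):
        return not self.overused_edges()


# Size of the graph for a 100x100 fabric with a channel width of 34.
fabric = SmResourceGraph(rows=100, cols=100, channel_width=34)
print(fabric.num_nodes(), fabric.num_edges())  # 10000 19800
```

Under these assumptions the graph has 10,000 nodes and 19,800 edges regardless of channel width, whereas a graph with one node per wire would need 19,800 × 34 = 673,200 wire nodes for the same fabric.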

17/22 Scalability of On-chip Routing: Experimental Setup
- Experimental Setup
  - 100x100 configurable logic fabric array
  - Routing channel width of 34
    - Large enough to support all HW circuits
  - 123 MCNC benchmark circuits
    - Circuit complexity ranges from a few LUTs to tens of thousands of LUTs
  - Performed technology mapping, packing, and placement using FlowMap, T-VPack, and VPR’s bounding box placement
  - Routed each HW benchmark circuit using:
    - VPR’s timing-driven router
    - VPR’s fast timing-driven router (-fast option)
    - Riverside On-Chip Router (ROCR)
[Diagram: grid of SMs and CLBs]

18/22 Scalability of On-chip Routing: Memory Usage
- VPR requires over 100 MB of memory on average
- ROCR requires at most 8.3 MB
- VPR requires 18X more memory than ROCR on average

19/22 Scalability of On-chip Routing: Algorithm Performance
- ROCR is over 40X faster than VPR for small HW circuits
- ROCR is 2X-3X faster than VPR for large HW circuits

20/22 Scalability of On-chip Routing: Critical Path
- 19% longer critical path than VPR
- 2.6% shorter than VPR (Fast)
- 30%/27% longer critical path than VPR/VPR (Fast)

21/22 Scalability of On-chip Routing: Wire Segments
- ROCR requires 2%/8% fewer wire segments than VPR/VPR (Fast) for larger HW circuits

22/22 Conclusions and Future Work
- Conclusions
  - Demonstrated that ROCR scales well as circuit size increases
    - On average 2X faster than VPR’s fast timing-driven router
    - Requires 18X less memory than VPR
  - Produces good circuit quality
    - Critical path 27% longer than VPR (Fast) on average
    - 2.6% shorter critical path for the largest HW circuit
    - Requires on average 5% fewer wire segments
- Future Work
  - Current project: a major microprocessor vendor is fabricating our custom FPGA
  - Improvements to Riverside On-Chip Router (ROCR)
    - Improve ROCR’s performance for large HW circuits
    - Incorporating timing information to achieve
    - Analyze the scalability of ROCR as circuit size approaches FPGA capacity
  - JIT FPGA compilation
    - Development of standard HW binary
    - Support more complex FPGA architectures
