Application-Specific Customization of Parameterized FPGA Soft-Core Processors

David Sheldon (a), Rakesh Kumar (b), Roman Lysecky (c), Frank Vahid (a*), Dean Tullsen (b)

(a) Department of Computer Science and Engineering, University of California, Riverside
(*) Also with the Center for Embedded Computer Systems at UC Irvine
(b) Department of Computer Science and Engineering, University of California, San Diego
(c) Department of Electrical and Computer Engineering, University of Arizona

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx.

David Sheldon, UC Riverside 2 of 22

FPGA Soft Core Processors
- Soft-core processor
  - HDL description
  - Flexible implementation: FPGA or ASIC
  - Technology independent
[Figure: one HDL description targeting either an FPGA (Spartan 3, Virtex 2, Virtex 4) or an ASIC]

Slide 3 of 22

FPGA Soft Core Processors
- Soft-core processors can have configurable options
  - Datapath units
  - Cache
  - Bus architecture
- Current commercial FPGA soft-core processors
  - Xilinx MicroBlaze
  - Altera Nios
[Figure: an FPGA containing a μP with optional cache, FPU, and MAC units]

Slide 4 of 22

Goal
- Tune an FPGA soft-core microprocessor for a given application
[Figure: flow from the application (App) and the μP parameter values to a configured μP, evaluated by FPGA synthesis for size and time]

Slide 5 of 22

MicroBlaze: Xilinx FPGA Soft-Core
- Instantiatable units: multiplier, barrel shifter, divider, FPU, cache
- Significant tradeoffs
- Including all units is not necessarily the fastest configuration, because added units can lengthen the critical path
[Figure: base MicroBlaze surrounded by the optional units listed above]

Slide 6 of 22

Problem
- Need fast exploration
- Synthesis runs can take an hour (~20-60 mins each)
[Figure: exploration loop linking parameter values, the configured μP, and synthesis]

This talk
- Two approaches
  - Approach 1: Using traditional CAD techniques
  - Approach 2: Synthesis-in-the-loop
- Results

Slide 7 of 22

Constraints on Configurations
- Size constraints may prevent use of all possible units
[Figure: candidate units (multiplier, FPU, cache, barrel shifter, divider) next to a stack of MicroBlaze, cache, multiplier, and FPU measured against a Max Area line]

Slide 8 of 22

Approach 1: Traditional CAD Techniques
- Create a model of the problem
- Solve the model with extensive search heuristics
- We will model this problem as a 0-1 knapsack problem
[Figure: two stages. Create model: slow, includes synthesis. Model exploration: fast, considers 1000s of configurations. A stack of MicroBlaze, cache, multiplier, and FPU is measured against Max Area]

Slide 9 of 22

Approach 1: Traditional CAD Techniques
Creating the model
[Figure: synthesis of the base MicroBlaze and of MicroBlaze with a single unit added (FPU shown), each run with the application (App), giving size and perf for the multiplier, cache, divider, barrel shifter, and FPU]

                  BS     FPU    MUL    DIV    CACHE
Perf increment    1.1    0.9    1.2    1.0    1.3
Size increment    1.4    2.7    1.8    1.1    1.6
Perf/Size         0.96   0.34   0.63   0.93   0.80

Slide 10 of 22

Approach 1: Traditional CAD Techniques
0-1 knapsack model
- Object’s benefit = unit’s performance increment / size increment
- Object’s weight = unit’s size
- Knapsack’s size constraint = FPGA size constraint

                  BS     FPU    MUL    DIV    CACHE
Perf increment    1.1    0.9    1.2    1.0    1.3
Size increment    1.4    2.7    1.8    1.1    1.6
Perf/Size         0.96   0.34   0.63   0.93   0.80

Slide 11 of 22

Approach 1: Traditional CAD Techniques
- Solved the 0-1 knapsack problem using established methods
  - Toth, P., Dynamic Programming Algorithms for the Zero-One Knapsack Problem. Computing, 1980
- Running time
  - 6 MicroBlaze configuration synthesis runs to create the model
  - O(n*p) to solve the model
    - n is the number of factors
    - p is the available area
  - Solving the model is negligible (seconds) compared to synthesis runtimes (~hour)
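The knapsack formulation above can be illustrated with a small dynamic program. This is a minimal sketch of a textbook 0-1 knapsack table, not the specific algorithm from the cited Toth paper. The benefits are the Perf/Size values from the slides. The integer weights and the capacity of 12 are invented area units, used only so the example runs.

```python
def solve_knapsack(units, capacity):
    """Textbook 0-1 knapsack by dynamic programming, O(n * p).

    units: list of (name, benefit, weight) with integer weights.
    capacity: integer area left for optional units (p in the slides).
    Returns (total_benefit, chosen_names).
    """
    n = len(units)
    # best[i][a] = best total benefit using the first i units within area a
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, benefit, weight) in enumerate(units, start=1):
        for a in range(capacity + 1):
            best[i][a] = best[i - 1][a]          # option 1: leave unit i out
            if weight <= a:                      # option 2: include unit i
                candidate = best[i - 1][a - weight] + benefit
                if candidate > best[i][a]:
                    best[i][a] = candidate

    # Walk the table backwards to recover which units were included.
    chosen = []
    a = capacity
    for i in range(n, 0, -1):
        if best[i][a] != best[i - 1][a]:
            name, _, weight = units[i - 1]
            chosen.append(name)
            a -= weight
    chosen.reverse()
    return best[n][capacity], chosen


# Benefits: Perf/Size row from the slides.
# Weights: HYPOTHETICAL integer area units, not measured data.
UNITS = [
    ("BS", 0.96, 4),
    ("FPU", 0.34, 17),
    ("MUL", 0.63, 8),
    ("DIV", 0.93, 1),
    ("CACHE", 0.80, 6),
]

if __name__ == "__main__":
    print(solve_knapsack(UNITS, 12))
```

With these made-up weights the table has 6 rows and 13 columns, which is why solving takes a negligible amount of time next to a single synthesis run.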

Slide 12 of 22

Approach 1: Traditional CAD Techniques
Problems
- 100’s of target FPGAs
  - Different hard core resources (multiplier, block RAM)
- Model approach estimates size and performance for two or more units
  - MUL speedup 1.3, DIV speedup 1.6 → estimate MUL+DIV speedup 1.9
  - May really be 1.7
- Model inaccuracies may be large
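The MUL+DIV figure above is consistent with adding the individual gains over the base processor (0.3 + 0.6 on top of 1.0). The sketch below assumes that additive rule, which is a reading of the slide's arithmetic rather than a documented formula, and compares the estimate with the slide's "may really be 1.7" value.

```python
def additive_speedup_estimate(single_unit_speedups):
    """Estimate a multi-unit speedup by summing each unit's gain over 1.0.

    ASSUMPTION: the additive rule is inferred from the slide's example
    (1.3 and 1.6 giving 1.9); the model may combine units differently.
    """
    return 1.0 + sum(s - 1.0 for s in single_unit_speedups)


estimate = additive_speedup_estimate([1.3, 1.6])  # MUL and DIV from the slide
measured = 1.7                                    # the slide's "may really be" value
relative_error = (estimate - measured) / measured

if __name__ == "__main__":
    print(f"estimate={estimate:.2f} measured={measured:.2f} "
          f"error={relative_error:.1%}")
```

An overestimate of this size for only two units suggests why errors can compound when the model combines several units at once.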

Slide 13 of 22

Approach 2: Synthesis-in-the-Loop
- Problem with traditional CAD approach
  - 100’s of target FPGAs
  - Model approach estimates size and performance for two or more units
  - Model inaccuracies may be large
- Solution: synthesis in the loop
  - No abstract model
  - Guided by actual size and performance data
  - But slow: can only explore a few configurations
[Figure: the create-model and model-exploration stages contrasted with a single synthesis-in-the-loop cycle of exploration, synthesis, and execution returning perf and size; each pass takes 10’s of minutes]

Slide 14 of 22

Approach 2: Synthesis-in-the-Loop
- First pre-analyze units to guide the heuristic
- Same calculations as when creating the model for knapsack
[Figure: size and perf measured for the multiplier, cache, divider, barrel shifter, and floating point unit]

                  BS     FPU    MUL    DIV    CACHE
Perf increment    1.1    0.9    1.2    1.0    1.3
Size increment    1.4    2.7    1.8    1.1    1.6
Perf/Size         0.96   0.34   0.63   0.93   0.80

Slide 15 of 22

Approach 2: Synthesis-in-the-Loop
- Build “impact-ordered tree” structure
- Tree is specific to the given application
- Sort units by Perf/Size to get the application-specific impact ordering

             BS     FPU    MUL    DIV    CACHE
Perf/Size    0.96   0.34   0.63   0.93   0.80

[Figure: application-specific impact-ordered tree with one level per unit, labeled by impact: BS 0.96, DIV 0.93, CACHE 0.80, MUL 0.63, FPU 0.34]
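The sort step can be sketched in a few lines, using the Perf/Size values from the slide. The names `PERF_PER_SIZE` and `impact_order` are illustrative only.

```python
# Perf/Size values copied from the slide's table.
PERF_PER_SIZE = {"BS": 0.96, "FPU": 0.34, "MUL": 0.63, "DIV": 0.93, "CACHE": 0.80}


def impact_order(perf_per_size):
    """Return unit names sorted from highest to lowest Perf/Size."""
    return sorted(perf_per_size, key=perf_per_size.get, reverse=True)


if __name__ == "__main__":
    print(impact_order(PERF_PER_SIZE))
```

Because the Perf/Size values come from running the target application, a different application would generally produce a different ordering.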

Slide 16 of 22

Approach 2: Synthesis-in-the-Loop
- Run tree-based search heuristic
[Figure: impact-ordered tree over BS (0.96), DIV (0.93), CACHE (0.80), MUL (0.63), and FPU (0.34) with Include and Not Include branches and a "Useful?" Yes/No decision, driven by the synthesis-in-the-loop cycle of exploration, synthesis, and execution returning perf and size]
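One plausible reading of the tree walk is a greedy pass: visit units in impact order, synthesize and run the configuration with the next unit added, and keep the unit only if the result fits and is faster. The sketch below follows that reading and is not necessarily the authors' exact heuristic. The `evaluate` callback stands in for one synthesis run plus one execution, and every size and runtime in `toy_evaluate` is invented for illustration.

```python
def tree_search(order, evaluate, max_size, base_time):
    """Greedy walk down an impact-ordered list of units.

    evaluate(config) -> (size, runtime) is a HYPOTHETICAL stand-in for one
    synthesis run followed by executing the application.
    base_time is the runtime of the base processor from pre-analysis.
    Returns (chosen_units, best_runtime, synthesis_runs).
    """
    chosen = frozenset()
    best_time = base_time
    runs = 0
    for unit in order:
        trial = chosen | {unit}
        size, runtime = evaluate(trial)   # one synthesis-in-the-loop pass
        runs += 1
        # "Useful" here means: fits the area budget and improves runtime.
        if size <= max_size and runtime < best_time:
            chosen, best_time = trial, runtime
    return chosen, best_time, runs


# Invented numbers: area added and runtime saved by each unit.
TOY_SIZE = {"BS": 4, "DIV": 1, "CACHE": 6, "MUL": 8, "FPU": 17}
TOY_SAVING = {"BS": 10, "DIV": 0, "CACHE": 20, "MUL": 15, "FPU": -5}


def toy_evaluate(config):
    """Toy replacement for synthesis plus execution (base: size 100, time 100)."""
    size = 100 + sum(TOY_SIZE[u] for u in config)
    runtime = 100 - sum(TOY_SAVING[u] for u in config)
    return size, runtime


if __name__ == "__main__":
    order = ["BS", "DIV", "CACHE", "MUL", "FPU"]  # impact order from the slides
    print(tree_search(order, toy_evaluate, max_size=112, base_time=100))
```

Each unit is evaluated once, so a walk over five units costs five synthesis runs, matching the "at most 5 configurations" figure quoted for the exploration phase.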

Slide 17 of 22

Comparison of Approaches
- Approach 1: Traditional CAD
  - 6 synthesis runs to build the model
  - O(np) knapsack solution
  - Examines thousands of configurations during exploration
- Approach 2: Synthesis in the loop
  - 11 synthesis runs (6 pre-analysis, 5 exploration)
  - Examines (at most) 5 configurations during exploration
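The run counts above can be put next to an exhaustive search with a short calculation. It assumes each of the five units is a simple include/exclude choice, so an exhaustive search needs 2^5 synthesis runs. It also assumes the slides' counts generalize as n + 1 runs for the model and 2n + 1 runs for synthesis in the loop. Both generalizations are assumptions that happen to match the stated figures for n = 5.

```python
def synthesis_run_counts(num_units):
    """Synthesis runs per approach for num_units optional units.

    ASSUMPTIONS: every unit is a binary include/exclude option, and the
    counts on the slide (6 and 11 for five units) follow n + 1 and 2n + 1.
    """
    pre_analysis = num_units + 1          # base processor plus one run per unit
    return {
        "exhaustive": 2 ** num_units,
        "knapsack": pre_analysis,
        "synthesis_in_loop": pre_analysis + num_units,
    }


def runtime_bounds(runs, min_minutes=20, max_minutes=60):
    """Rough bounds using the 20-60 minute synthesis range quoted earlier."""
    return runs * min_minutes, runs * max_minutes


if __name__ == "__main__":
    for name, runs in synthesis_run_counts(5).items():
        low, high = runtime_bounds(runs)
        print(f"{name}: {runs} runs, roughly {low}-{high} min")
```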

Slide 18 of 22

Results
- 10 EEMBC and Powerstone benchmarks
  - aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk
- Average results shown, on Virtex 2 Pro, for a particular size constraint
- Application-specific impact-ordered tree approach yields near-optimal results in acceptable tool runtime
- Knapsack sub-optimality is due to multi-unit estimation inaccuracy
[Figure: tool run time (min, axis 0 to 800) and speedup (axis 1 to 2.5) for Exhaustive, App-Spec, and Knapsack]

Slide 19 of 22

Results
- Obtained results for six different size constraints
  - Results shown for a second size constraint
  - Similar findings for all six constraints
[Figure: tool run time (min, axis 0 to 800) and speedup (axis 1 to 2.5) for Exhaustive, App-Spec, and Knapsack]

Slide 20 of 22

Results
- Also ran for a different FPGA
  - Xilinx Spartan2
  - Similar findings
[Figure: tool run time (min, axis 0 to 300) and speedup (axis 1 to 1.6) for Exhaustive, App-Spec, and Knapsack]

Slide 21 of 22

Conclusions
- Synthesis-in-the-loop approach outperformed traditional CAD approach
  - Better results
  - Slightly longer runtime
- Application-specific impact-ordered tree heuristic served well for synthesis-in-the-loop approach

Future
- Extend for highly-configurable soft-core processors, and for multiple processors competing for and/or sharing resources

Slide 22 of 22

Questions?
