Application-Specific Customization of Parameterized FPGA Soft-Core Processors David Sheldon a, Rakesh Kumar b, Roman Lysecky c, Frank Vahid a*, Dean Tullsen.

Presentation transcript:

Application-Specific Customization of Parameterized FPGA Soft-Core Processors
David Sheldon a, Rakesh Kumar b, Roman Lysecky c, Frank Vahid a*, Dean Tullsen b
a Department of Computer Science and Engineering, University of California, Riverside
* Also with the Center for Embedded Computer Systems at UC Irvine
b Department of Computer Science and Engineering, University of California, San Diego
c Department of Electrical and Computer Engineering, University of Arizona
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx.

David Sheldon, UC Riverside
Slide 2 of 22: FPGA Soft Core Processors
- A soft-core processor is an HDL description
- Flexible implementation: FPGA or ASIC
- Technology independent
[Figure: HDL description targeting an FPGA (Spartan 3, Virtex 2, Virtex 4) or an ASIC]

Slide 3 of 22: FPGA Soft Core Processors
- Soft-core processors can have configurable options: datapath units, cache, bus architecture
- Current commercial FPGA soft-core processors: Xilinx MicroBlaze, Altera Nios
[Figure: FPGA containing a μP with optional cache, FPU, and MAC]

Slide 4 of 22: Goal
- Goal: tune an FPGA soft-core microprocessor for a given application
[Figure labels: App, μP, Parameter Values, FPGA Synthesis, Configured μP, size, time]

Slide 5 of 22: MicroBlaze – Xilinx FPGA Soft-Core
- Instantiatable units on top of the base MicroBlaze: multiplier, barrel shifter, divider, FPU, cache
- Significant tradeoffs
- Instantiating all units does not necessarily give the fastest processor, because added units can lengthen the critical path

Slide 6 of 22: Problem
- Need fast exploration
- Synthesis runs can take an hour (~20-60 mins each)
This talk:
- Two approaches
  - Approach 1: Using traditional CAD techniques
  - Approach 2: Synthesis-in-the-loop
- Results
[Figure labels: μP, Parameter Values, Exploration, Synthesis, Configured μP]

Slide 7 of 22: Constraints on Configurations
- Size constraints may prevent use of all possible units
[Figure: MicroBlaze plus candidate units (multiplier, FPU, cache, barrel shifter, divider) against a Max Area bound]

Slide 8 of 22: Approach 1: Traditional CAD Techniques
- Create a model of the problem (slow, includes synthesis)
- Solve the model with extensive search heuristics (fast, considers 1000s of configurations)
- We will model this problem as a 0-1 knapsack problem
[Figure: MicroBlaze with cache, multiplier, and FPU packed under the Max Area bound]

Slide 9 of 22: Approach 1: Traditional CAD Techniques
Creating the model:
- Synthesize the base MicroBlaze and the MicroBlaze with each unit added (multiplier, cache, divider, barrel shifter, FPU), and run the application on each to obtain size and performance
- For each unit (BS, FPU, MUL, DIV, CACHE), tabulate the performance increment, the size increment, and Perf/Size
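
The model-creation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Measurement` record, the `build_model` helper, and all numbers are hypothetical, and performance is assumed to be measured as speedup over the base processor, which the slides do not spell out.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """One synthesis-plus-execution result (hypothetical record type)."""
    size: float     # area used by the configuration, in any consistent unit
    runtime: float  # application execution time, in any consistent unit


def build_model(base, single_unit_runs):
    """Map each unit to (perf_increment, size_increment, perf_per_size).

    perf_increment is assumed to be the speedup over the base minus 1.
    """
    model = {}
    for unit, run in single_unit_runs.items():
        perf_inc = base.runtime / run.runtime - 1.0
        size_inc = run.size - base.size
        # A unit mapped entirely to hard resources might add no area;
        # treat its ratio as unbounded rather than dividing by zero.
        ratio = perf_inc / size_inc if size_inc > 0 else float("inf")
        model[unit] = (perf_inc, size_inc, ratio)
    return model


# Made-up numbers for illustration only; not measurements from the talk.
base = Measurement(size=1000, runtime=100.0)
runs = {
    "MUL": Measurement(size=1100, runtime=80.0),
    "BS": Measurement(size=1200, runtime=50.0),
}
model = build_model(base, runs)
```

With five units this takes six synthesis runs in total, one for the base and one per unit, which matches the count given later in the talk.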

Slide 10 of 22: Approach 1: Traditional CAD Techniques
0-1 knapsack model:
- Object's benefit = unit's performance increment / size increment
- Object's weight = unit's size
- Knapsack's size constraint = FPGA size constraint
[Table: Perf increment, Size increment, and Perf/Size for BS, FPU, MUL, DIV, CACHE on the MicroBlaze]

Slide 11 of 22: Approach 1: Traditional CAD Techniques
- Solved the 0-1 knapsack problem using established methods
  - Toth, P., Dynamic Programming Algorithms for the Zero-One Knapsack Problem. Computing, 1980
- Running time
  - 6 MicroBlaze configuration synthesis runs to create the model
  - O(n*p) to solve the model, where n is the number of factors and p is the available area
  - Solving is negligible (seconds) compared to synthesis runtimes (~hour)
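
For readers unfamiliar with the O(n*p) bound, the following is a minimal textbook dynamic-programming sketch of a 0-1 knapsack solver. It is not necessarily the variant from the cited Toth paper, and the unit benefits, weights, and capacity are hypothetical values chosen only to exercise the code.

```python
def knapsack_01(items, capacity):
    """Solve a 0-1 knapsack instance by dynamic programming.

    items: list of (name, benefit, weight) with non-negative integer weights.
    capacity: integer size limit (the available area p).
    Returns (best_benefit, chosen_names). Time is O(n * capacity).
    """
    n = len(items)
    # best[i][c] = maximum benefit using only the first i items within area c
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, benefit, weight) in enumerate(items, start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]          # option 1: leave item i out
            if weight <= c:                      # option 2: include it if it fits
                candidate = best[i - 1][c - weight] + benefit
                if candidate > best[i][c]:
                    best[i][c] = candidate
    # Walk the table backwards to recover which items were included.
    chosen = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, _, weight = items[i - 1]
            chosen.append(name)
            c -= weight
    chosen.reverse()
    return best[n][capacity], chosen


# Hypothetical benefits and area weights, for illustration only.
units = [
    ("BS", 5.0, 2),
    ("MUL", 3.0, 1),
    ("DIV", 1.0, 3),
    ("FPU", 4.0, 6),
    ("CACHE", 6.0, 4),
]
total, selected = knapsack_01(units, capacity=7)
```

Here n is the number of candidate units and the inner loop runs over every area value up to the limit, which is where the n*p term comes from.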

Slide 12 of 22: Approach 1: Traditional CAD Techniques
Problems:
- 100s of target FPGAs, with different hard-core resources (multiplier, block RAM)
- The model approach estimates size and performance for two or more units
  - MUL speedup 1.3, DIV speedup 1.6 -> estimated MUL+DIV speedup 1.9
  - May really be 1.7
- Model inaccuracies may be large
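
The slide's numbers are consistent with an estimate that sums the individual speedup increments. That additive form is an assumption inferred from the example, not something the slide states; written out:

```latex
\hat{S}_{\mathrm{MUL+DIV}}
  = 1 + (S_{\mathrm{MUL}} - 1) + (S_{\mathrm{DIV}} - 1)
  = 1 + 0.3 + 0.6
  = 1.9,
\qquad
\frac{\hat{S}_{\mathrm{MUL+DIV}} - S_{\mathrm{actual}}}{S_{\mathrm{actual}}}
  = \frac{1.9 - 1.7}{1.7} \approx 0.12 .
```

Under that assumption, the estimate in this example overstates the real combined speedup by roughly 12%.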

Slide 13 of 22: Approach 2: Synthesis-in-the-Loop
Problem with the traditional CAD approach:
- 100s of target FPGAs
- Model approach estimates size and performance for two or more units
- Model inaccuracies may be large
Solution: synthesis in the loop
- No abstract model
- Guided by actual size and performance data
- But slow (10s of minutes per synthesis), so it can only explore a few configurations
[Figure: exploration loop of Synthesis and Execute returning size and perf, contrasted with Create model followed by Model Exploration]

Slide 14 of 22: Approach 2: Synthesis-in-the-Loop
- First pre-analyze units to guide the heuristic
- Same calculations as when creating the model for knapsack
[Table: Perf increment, Size increment, and Perf/Size for BS, FPU, MUL, DIV, CACHE]

Slide 15 of 22: Approach 2: Synthesis-in-the-Loop
- Build an "impact-ordered tree" structure
- Sort the units by Perf/Size to obtain the application-specific impact ordering
- The tree is specific to the given application
[Figure: units BS, FPU, MUL, DIV, CACHE sorted by Perf/Size into the example order BS, CACHE, MUL, FPU, DIV]
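
The sorting step can be sketched as below. The ratio values are hypothetical and were picked only so that the result matches the example ordering shown on the slide; `impact_order` is an illustrative helper name, not something from the talk.

```python
def impact_order(perf_per_size):
    """Return unit names sorted from highest to lowest Perf/Size ratio."""
    return sorted(perf_per_size, key=perf_per_size.get, reverse=True)


# Hypothetical Perf/Size ratios for one application (illustration only).
ratios = {"BS": 0.8, "FPU": 0.1, "MUL": 0.5, "DIV": 0.05, "CACHE": 0.6}
order = impact_order(ratios)
```

Because the ratios come from running a specific application, a different application would generally produce a different ordering, which is why the tree is application specific.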

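Slide 16 of 22: Approach 2: Synthesis-in-the-Loop
- Run the tree-based search heuristic
- For each unit in impact order, synthesize and execute to obtain size and performance, then decide whether the unit is useful: yes (include) or no (do not include)
[Figure: impact-ordered tree over BS, MUL, FPU, DIV, CACHE with Include / Not Include branches, driven by the synthesis-in-the-loop exploration]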
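
One plausible reading of the tree-based heuristic is a single greedy pass over the impact-ordered units. The sketch below assumes that a unit is "useful" when the trial configuration fits the size limit and improves on the best performance so far; the slide does not give the exact rule, and the `evaluate` callback, unit sizes, and gains are hypothetical stand-ins for real synthesis and execution.

```python
def tree_search(order, evaluate, base_perf, max_size):
    """Walk the impact-ordered units, keeping each one only if it is useful.

    evaluate(config) -> (size, perf) stands in for synthesis plus execution.
    Returns (chosen_units, best_perf, number_of_evaluations).
    """
    config = frozenset()
    best_perf = base_perf
    evaluations = 0
    for unit in order:
        trial = config | {unit}
        size, perf = evaluate(trial)   # one synthesis run per unit
        evaluations += 1
        if size <= max_size and perf > best_perf:
            config, best_perf = trial, perf   # "Include" branch
        # otherwise take the "Not Include" branch and move on
    return config, best_perf, evaluations


# Hypothetical stand-in for synthesis: fixed unit sizes and performance gains.
UNIT_SIZE = {"BS": 200, "CACHE": 400, "MUL": 100, "FPU": 600, "DIV": 300}
UNIT_GAIN = {"BS": 50, "CACHE": 40, "MUL": 30, "FPU": 0, "DIV": 10}


def fake_evaluate(config):
    size = 1000 + sum(UNIT_SIZE[u] for u in config)
    perf = 100 + sum(UNIT_GAIN[u] for u in config)   # percent of base speed
    return size, perf


chosen, perf, runs = tree_search(
    ["BS", "CACHE", "MUL", "FPU", "DIV"], fake_evaluate,
    base_perf=100, max_size=1800,
)
```

With five units the pass performs five evaluations, in line with the "at most 5 configurations" figure quoted in the comparison of approaches.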

Slide 17 of 22: Comparison of Approaches
Approach 1: Traditional CAD
- 6 synthesis runs to build the model
- O(np) knapsack solution
- Examines thousands of configurations during exploration
Approach 2: Synthesis in the loop
- 11 synthesis runs (6 pre-analysis, 5 exploration)
- Examines (at most) 5 configurations during exploration
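
To put the run counts in terms of wall-clock time, the rough arithmetic below multiplies them by the "~20-60 mins" per synthesis run quoted earlier in the talk. The exhaustive figure assumes five independent on/off units, giving 2^5 configurations; that count is an inference, not a number from the slides, and the results are estimates rather than the measured tool runtimes.

```python
SYNTH_MIN, SYNTH_MAX = 20, 60   # minutes per synthesis run, from "~20-60 mins"


def synthesis_time(runs):
    """Return the (low, high) total synthesis time in minutes for `runs` runs."""
    return runs * SYNTH_MIN, runs * SYNTH_MAX


knapsack = synthesis_time(6)        # 6 runs to build the model
in_loop = synthesis_time(11)        # 6 pre-analysis + 5 exploration runs
exhaustive = synthesis_time(2 ** 5)  # assumed: every on/off combination of 5 units

for name, (low, high) in [("knapsack", knapsack),
                          ("synthesis-in-the-loop", in_loop),
                          ("exhaustive", exhaustive)]:
    print(f"{name}: {low}-{high} min")
```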

Slide 18 of 22: Results
- 10 EEMBC and Powerstone benchmarks: aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk
- Average results shown, on Virtex 2 Pro, for a particular size constraint
- The application-specific impact-ordered tree approach yields near-optimal results in acceptable tool runtime
- Knapsack sub-optimality is due to multi-unit estimation inaccuracy
[Chart: tool run time (min) and speedup for Exhaustive, App-Spec, and Knapsack]

Slide 19 of 22: Results
- Obtained results for six different size constraints
- Results shown for a second size constraint
- Similar findings for all six constraints
[Chart: tool run time (min) and speedup for Exhaustive, App-Spec, and Knapsack]

Slide 20 of 22: Results
- Also ran for a different FPGA: Xilinx Spartan2
- Similar findings
[Chart: tool run time (min) and speedup for Exhaustive, App-Spec, and Knapsack]

Slide 21 of 22: Conclusions
- The synthesis-in-the-loop approach outperformed the traditional CAD approach
  - Better results
  - Slightly longer runtime
- The application-specific impact-ordered tree heuristic served well for the synthesis-in-the-loop approach
- Future: extend to highly configurable soft-core processors, and to multiple processors competing for and/or sharing resources

Slide 22 of 22: Questions?