Application-Specific Customization of Soft Processor Microarchitecture

Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
University of Toronto, Edward S. Rogers Sr. Department of Electrical and Computer Engineering

Processors and FPGA Systems
Processors lie at the “heart” of FPGA systems.
[Diagram: FPGA system containing a soft processor, UART, custom logic, memory interface, and Ethernet.]
The processor performs coordination and even computation.
Better processors => less hardware to design.
We seek improvement through customization.

Motivating Application-Specific Customizations of Soft Processors
1. FPGA configurability: we can consider unlimited processor variants.
2. A soft processor might be used to run:
   a single application;
   a single class of applications;
   many applications, with the processor reconfigured between them.
3. Applications differ in architectural requirements, so the architecture can be specialized for each application.
In short: the platform allows it (1), FPGA designs allow it (2), and it can be beneficial (3).
We want to evaluate the effectiveness of specialization.

Research Goals
To investigate the potential for:
   “Application-tuning”: tune the processor microarchitecture to favour an application while preserving general-purpose functionality. We still build a complete processor, but pick the components and pipeline that are best for the application.
   “Instruction-set subsetting”: sacrifice general-purpose functionality and eliminate hardware not required by the application.
   A combination of both methods.
To measure the efficiency gained through real implementations.

SPREE System (Soft Processor Rapid Exploration Environment)
Input: a processor description, consisting of an ISA and a datapath made of hand-coded components.
SPREE steps: verify the ISA against the datapath; datapath instantiation; control generation.
Handled along the way: multi-cycle/variable-cycle FUs, multiplexer select signals, interlocking, branch handling.
Output: synthesizable Verilog RTL.

Back-End Infrastructure
Inputs: the generated RTL and the benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc).
Modelsim RTL simulator: 1. cycle count.
Quartus II 4.2 CAD software, targeting a Stratix 1S40C5: 2. resource usage, 3. clock frequency, 4. power.
We can measure area, performance, and energy accurately.

Comparison to Altera’s Nios II
Nios II has three variations:
   Nios II/e: unpipelined, no HW multiplier
   Nios II/s: 5-stage, with HW multiplier
   Nios II/f: 6-stage, dynamic branch prediction

Architectural Parameters Used in SPREE
Multiplication support: hardware FU or software routine.
Shifter implementation: flipflops, multiplier, or LUTs.
Pipelining: depth (2-7 stages), organization, forwarding.
Ideally everything …
What can we gain by customizing the actual core? We focus on core microarchitecture.
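
The parameters listed on this slide span a design space that can be enumerated mechanically. The following is a minimal Python sketch of that enumeration, not SPREE's own input format: `ProcessorConfig` and `enumerate_design_space` are hypothetical names, organization and forwarding are left out because the slide does not list their options, and the sketch assumes every combination is a legal processor.

```python
from dataclasses import dataclass
from itertools import product

# Parameter values taken from the slide. Organization and forwarding are
# omitted because the slide does not enumerate their options.
MULTIPLY_OPTIONS = ("hardware_fu", "software_routine")
SHIFTER_OPTIONS = ("flipflops", "multiplier", "luts")
PIPELINE_DEPTHS = range(2, 8)  # 2 to 7 stages inclusive


@dataclass(frozen=True)
class ProcessorConfig:
    """One hypothetical point in the design space (not a SPREE structure)."""
    multiply: str
    shifter: str
    pipeline_depth: int


def enumerate_design_space():
    """Return every combination, assuming all combinations are legal."""
    return [
        ProcessorConfig(m, s, d)
        for m, s, d in product(MULTIPLY_OPTIONS, SHIFTER_OPTIONS, PIPELINE_DEPTHS)
    ]


if __name__ == "__main__":
    print(len(enumerate_design_space()), "candidate processors")
```

A real flow would likely prune this list with legality constraints before generating RTL for each remaining candidate.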

SPREE vs Nios II
[Figure: SPREE processors compared against the Nios II variations; the axes are annotated “faster” and “smaller”. Highlighted SPREE design: 3-stage pipe, HW multiply, multiply-based shifter.]

Exploration of Soft Processor Architectural Customizations
1. Architectural tuning
2. Instruction-set subsetting
3. Combination (arch-tuning + subsetting)

1. Architectural Tuning
Experiment: using SPREE, vary the same parameters (multiplication support, shifter implementation, pipelining).
Determine:
   the best overall (general-purpose) processor;
   the best processor per application (application-tuned).
Metric: performance per area (MIPS/LE, where LE is a logic element), essentially the inverse of the area-delay product.
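
The metric and the two selections on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' tooling: the function names are hypothetical, the inputs are the cycle count, clock frequency, and LE count that the back-end reports, and "best overall" is assumed here to mean the highest arithmetic mean of MIPS/LE across benchmarks (the slides do not say how the overall winner is averaged). Because MIPS/LE equals 1 / (LEs × execution time per million instructions), maximizing it minimizes the area-delay product.

```python
from statistics import mean


def mips_per_le(instructions, cycles, clock_mhz, logic_elements):
    """Performance per area: MIPS divided by logic elements (LEs).

    Execution time is cycles / (clock_mhz * 1e6) seconds, so
    MIPS = instructions / time / 1e6 = instructions * clock_mhz / cycles.
    """
    return instructions * clock_mhz / cycles / logic_elements


def best_per_application(results):
    """Map each benchmark to the processor with the highest MIPS/LE on it.

    `results` maps processor name -> {benchmark name -> MIPS/LE}.
    """
    benchmarks = next(iter(results.values())).keys()
    return {
        bench: max(results, key=lambda proc: results[proc][bench])
        for bench in benchmarks
    }


def best_overall(results):
    """Pick one processor for all benchmarks.

    Assumption: ranked by arithmetic mean of MIPS/LE across benchmarks.
    """
    return max(results, key=lambda proc: mean(results[proc].values()))
```

Application tuning pays off exactly when `best_per_application` names a different processor than `best_overall` for some benchmark.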

Performance per Area of All Processors
[Figure: performance per area of all processors; annotated values: 32% and 14.1%.]

2. Instruction-set Subsetting
Reduce the processor by reducing the ISA: eliminate the unused parts of the ISA.
SPREE automatically removes unused connections and unused components.
This can create an application-specific processor.
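
The idea of dropping hardware that no used instruction needs can be illustrated with a short Python sketch. The slides do not describe SPREE's actual removal algorithm, so this is only an assumed model: `subset_datapath` and `area_reduction` are hypothetical names, and the component-to-instruction map and per-component areas are inputs the caller would have to supply.

```python
def subset_datapath(component_users, used_instructions):
    """Split components into those kept and those removed.

    `component_users` maps a component name to the instructions that need it.
    A component is kept only if at least one used instruction needs it.
    """
    used = set(used_instructions)
    kept = {c for c, users in component_users.items() if set(users) & used}
    removed = set(component_users) - kept
    return kept, removed


def area_reduction(component_area, removed):
    """Fraction of the total area saved by dropping the removed components."""
    total = sum(component_area.values())
    return sum(component_area[c] for c in removed) / total


if __name__ == "__main__":
    # Placeholder values for illustration only; not data from the talk.
    users = {
        "alu": {"add", "sub", "and"},
        "multiplier": {"mul"},
        "shifter": {"sll", "srl"},
    }
    kept, removed = subset_datapath(users, {"add", "sll"})
    print("kept:", sorted(kept), "removed:", sorted(removed))
```

In this model an application that never multiplies loses the multiplier entirely, which is the kind of saving the slide attributes to removing unused components.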

Instruction-set Usage of Benchmarks
Applications do not use the complete ISA.
[Figure; recoverable label: “Dynamic instruction”.]
Strong potential for hardware reduction.

Area Reduction from Subsetting
[Figure: axis “Fraction of Area”; annotated value: 23%.]
Area is reduced by 60% in some cases and by 23% on average.
Energy sees similar reductions, with only a small impact on performance.

3. Combining Application Tuning and Instruction-set Subsetting
Subsetting is effective on its own.
Subsetting can also be applied on top of tuning.
Compare the different customization methods:
   tuning;
   subsetting;
   tuning + subsetting.

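Combining Application Tuning and Instruction-set Subsetting
[Figure: average efficiency gain per method: tuning 14%, subsetting 16%, tuning + subsetting 25%.]
Tuning reduces the waste that subsetting eliminates, so the combined gain (25%) is less than the sum of the two individual gains.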

Summary of Presented Architectural Conclusions
Application tuning:
   14% average efficiency gain;
   will increase with more architectural axes.
Instruction-set subsetting:
   up to 60% area and energy savings;
   16% average efficiency gain.
Combined tuning and subsetting:
   25% average efficiency gain.

Future Work
Consider other promising architectural axes:
   branch prediction, aggressive forwarding;
   ISA changes;
   datapaths (e.g. VLIW);
   caches and memory hierarchy.
Compiler assistance:
   can improve tuning and subsetting.