Floating-Point FPGA (FPFPGA) Architecture and Modeling (A paper review) Jason Luu ECE University of Toronto Oct 27, 2009.



Motivation
Goal: Build faster, cheaper, lower-power FPGAs
How? Fixed-functionality (hard) blocks!
▫ FPGA reconfigurability comes at the price of area, delay, and power
▫ Some reconfigurability is unnecessary; remove it for savings

What to Make Hard?
What hard blocks to use?
▫ If not used, a block is wasted
▫ Industry suggests including memories and multipliers
▫ Paper suggests adding floating-point units (FPUs)
Given a hard block, how fractured should it be?
▫ E.g., Stratix III FPGA multipliers can be configured as a set of four 18x18 multipliers or as one 36x36 multiplier
▫ How fractured should the FPU be?

Introducing FPFPGA
Contains soft and hard blocks
▫ Soft blocks are composed of standard LUTs and FFs
▫ Hard blocks are FPUs called coarse-grained units (CGUs)
CGU characteristics:
▫ Floating-point (FP) adds and multiplies only
▫ Bus-based LUT operations using “wordblock”
▫ Dedicated output registers
▫ Accessible to soft blocks and vice versa

Architecture of FPFPGA

CGU

CGU parameters
▫ Number of each type of FP block
▫ Bus width
▫ Number of input buses
▫ Number of output buses
▫ Number of feedback paths
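The parameter list above can be written down as a small record. This is a minimal sketch only: the class name `CGUParams` and its field names are hypothetical, not taken from the paper, and the 32-bit bus width in the example is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CGUParams:
    """Hypothetical record of the CGU parameters named on the slide."""
    fp_adders: int        # number of FP adder blocks
    fp_multipliers: int   # number of FP multiplier blocks
    bus_width: int        # width of each bus, in bits
    input_buses: int
    output_buses: int
    feedback_paths: int

    def __post_init__(self) -> None:
        # Block counts and feedback paths may be zero; buses must exist.
        if min(self.fp_adders, self.fp_multipliers, self.feedback_paths) < 0:
            raise ValueError("counts must be non-negative")
        if min(self.bus_width, self.input_buses, self.output_buses) <= 0:
            raise ValueError("bus width and bus counts must be positive")


# Block, bus, and feedback counts follow the Experimental Settings slide;
# the 32-bit bus width is an assumption made for this example.
example = CGUParams(fp_adders=2, fp_multipliers=2, bus_width=32,
                    input_buses=4, output_buses=3, feedback_paths=3)
print(example)
```

Keeping the parameters in one immutable record makes it easy to enumerate candidate CGU configurations when exploring the architecture space.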

Modeling Methodology
Need to measure how “good” FPFPGA is
Use empirical measurement method
Flow: FPFPGA + benchmark circuit → commercial CAD flow → measure quality of results
Very Nice!
Commercial tools are unaware of FPFPGA; the authors introduce the “VEB” as a solution

Virtual Embedded Block (VEB) Flow
Manually map benchmark circuit into
▫ CGU
▫ Soft logic
Put VEB representing CGU into commercial CAD tool
Compile
Gather area and timing measurements

VEB
Create a standard-cell ASIC CGU and get area/timing numbers
Implement the area and timing of the ASIC CGU using soft logic of a commercial FPGA (different functionality; similar silicon timing, area, and pin demand)
Assumes all internal paths == critical path to simplify timing of the soft-logic implementation

VEB

VEB Details
Model delay with carry chains
Model area with shift registers
Use LUT inputs and outputs for pin demand
Note: Area and delay models use independent resources
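The three models above can be sized independently. The sketch below is illustrative only and is not the authors' tool: the function `size_veb`, its parameters, and every number in the example are hypothetical. It assumes each carry cell adds a fixed delay, each shift register occupies a fixed number of logic cells, and each LUT contributes a fixed number of pins.

```python
from typing import NamedTuple


class VEBSize(NamedTuple):
    carry_elements: int   # chained carry cells standing in for the CGU delay
    shift_registers: int  # shift-register cells standing in for the CGU area
    pin_luts: int         # LUTs whose inputs/outputs stand in for the CGU pins


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive b."""
    return -(-a // b)


def size_veb(delay_ps: int, area_cells: int, pins: int,
             carry_delay_ps: int, cells_per_shift_reg: int,
             pins_per_lut: int) -> VEBSize:
    # Each model is sized on its own, since the slide notes that the
    # area and delay models use independent resources.
    return VEBSize(
        carry_elements=ceil_div(delay_ps, carry_delay_ps),
        shift_registers=ceil_div(area_cells, cells_per_shift_reg),
        pin_luts=ceil_div(pins, pins_per_lut),
    )


# All numbers below are made up for illustration.
veb = size_veb(delay_ps=4000, area_cells=120, pins=224,
               carry_delay_ps=50, cells_per_shift_reg=1, pins_per_lut=4)
print(veb)
```

Integer picoseconds and ceiling division avoid floating-point rounding when the target is an exact multiple of the per-element value.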

VEB Placement Challenge
Hard block locations are fixed on an FPGA
Commercial tools can’t enforce that for a VEB, since it’s just a group of clustered soft logic constrained to be placed at a particular relative distance from each other
Solution:
▫ Let commercial tools place the VEB anywhere
▫ Then manually move the VEB to fixed locations

VEB Quality
11% delay error when modeling an embedded multiplier (non-FP, to compare with an existing multiplier)
Area is accurate (no number given)
Important repeatability hint: timing must be determined post-bitstream because of significant false paths (most CGUs do not use the longest path, and this is detected post-bitstream)

Benchmarks
32-bit single-precision floating-point
8 benchmarks
▫ 5 core computation blocks
▫ 1 application
▫ 2 synthetic

Experimental Settings
Xilinx Virtex 2: XC2V FF
CGUs each implemented as a VEB
▫ Each CGU takes up 122 Logic Cells
2 FP multipliers, 2 FP adders, 5 wordblocks
▫ In the order: W M A W W M A W W
4 input buses
3 output buses
3 feedback registers
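As a quick consistency check, the block order quoted above can be tallied against the stated block counts. This is a sketch; reading W, M, and A as wordblock, FP multiplier, and FP adder is an assumption inferred from the slide, not stated on it.

```python
from collections import Counter

# Block order exactly as given on the slide.
ORDER = "W M A W W M A W W"

# Assumed legend (inferred, not stated on the slide).
LEGEND = {"W": "wordblock", "M": "FP multiplier", "A": "FP adder"}

counts = Counter(ORDER.split())
for symbol, name in LEGEND.items():
    print(f"{name}: {counts[symbol]}")
```

Under that reading, the tally gives 5 wordblocks, 2 FP multipliers, and 2 FP adders, which agrees with the counts listed on the slide.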

Results
Average area reduced by 25x
Average delay reduced by
▫ 3.6x for single precision
▫ 4.3x for double precision
Results are comparable to Kuon FPGA vs. ASIC measurements
Critical path of all circuits is in FPU

Reason for Good Results
Removed reconfiguration bits (area reduction)
Efficient directional routing
Embedded FP operators

Contributions
Exploration of FPGA architectures with embedded floating-point cores
VEB methodology to leverage commercial tools to explore new embedded hard blocks even when commercial tools are unaware of those new hard blocks

Weaknesses
Significant amounts of speculation
▫ Tries to claim scope for material that should be in future work
Especially weak was the paper’s analysis of an FPFPGA compiler, which is outside its scope and should be listed as such

My 2 Cents
Primary advantage of FPFPGA vs. GPU in the floating-point, high-computation domain is low latency
Several applications demand very low latency and very high computational power
▫ Plant monitoring of high-speed reactions
▫ Financial automatic buy-sell algorithms
Secondary advantage is the energy consumed to perform the same computations

My 2 Cents
Comparison unfair
▫ Most FPGA designers would convert floating-point to fixed-point and not leave it as floating-point
  ▪ Double-precision FP add requires 701 slices
  ▪ Fixed-point add: 64 LUTs == 16 slices
That the critical path is in the FPU suggests the benchmark circuits are unusually geared to use FPU cores, and this is admitted by the authors
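To put the slice counts quoted above in perspective, the ratio can be worked out directly. This is plain arithmetic on the slide's own figures; the variable names are only for illustration.

```python
# Figures quoted on the slide.
fp_add_slices = 701      # double-precision FP add
fixed_add_luts = 64      # fixed-point add, in LUTs
fixed_add_slices = 16    # the same fixed-point add, in slices

# Area ratio of the floating-point adder to the fixed-point adder.
ratio = fp_add_slices / fixed_add_slices

# LUTs per slice implied by the slide's "64 LUTs == 16 slices".
luts_per_slice = fixed_add_luts // fixed_add_slices

print(f"FP add is about {ratio:.1f}x the slices of a fixed-point add")
print(f"Implied LUTs per slice: {luts_per_slice}")
```

By these figures, a soft-logic floating-point add costs roughly 44 times the slices of the fixed-point add, which is the basis for calling the comparison unfair.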