A Flexible DSP Block to Enhance FPGA Arithmetic Performance

Presentation transcript:

A Flexible DSP Block to Enhance FPGA Arithmetic Performance. Hadi Parandeh-Afshar, Alessandro Cevrero, Panagiotis Athanasopoulos, Philip Brisk, Yusuf Leblebici, Paolo Ienne. Affiliations as listed on the slide: LAP EPFL; LSM, LAP EPFL; UCR; LSM EPFL. Ecole Polytechnique Fédérale de Lausanne (EPFL); University of California, Riverside (UCR). {first_name.last_name@epfl.ch} first_name@cs.ucr.edu

Motivation and contribution: New DSP block for high-performance FPGAs. Increased flexibility: bypassable PPG and programmable compressor tree. Enhance FPGA arithmetic performance.

Motivation and contribution: Data flow transformations automatically expose compressor trees [Verma et al., TCAD 08]. Fused multiply-add operations cannot use current DSP blocks in a single cycle. DSP blocks cannot accelerate multi-operand addition. [Figure: panels (a) and (b), arithmetic and data flow transformation of a circuit with signals E1, E2, M1, M2, S1, S2, sign, and out]

Outline: Related work. Limitations. DSP block architecture. Experimental methodology. Results. Conclusions.

FPGA commentary: Logic cells with dedicated addition circuitry and fast carry chains. Compressor tree synthesis on 6-LUT FPGAs [Parandeh-Afshar et al., ASPDAC 08, DATE 08, FPL 09]. IP cores [Xilinx, Altera]. FP cores [Beauchamp et al., TVLSI 08]. DSP blocks [Altera Stratix III-IV].

Field Programmable Compressor Tree (FPCT): User-configurable multi-operand adder. Compressor tree + bypassable CPA. 128 = 8×16 input bits; 48 = 8×6 output bits. Previous work has established the ability of the FPCT to accelerate multi-input addition operations; a 1.6x speedup was observed [Cevrero et al., FPGA 08, TRETS 09]. [Figure: a CSlice with 16 inputs, 6 outputs, and carry-in and carry-out ports labeled 15]
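
The "compressor tree + bypassable CPA" structure can be illustrated with a minimal Python sketch. It uses plain 3:2 compressors (full adders in carry-save form) purely as an assumption to show the principle; it does not model the CSlice internals, and the function names are hypothetical.

```python
def compress_3_to_2(x, y, z):
    """Carry-save step: reduce three operands to two with the same total."""
    partial_sum = x ^ y ^ z                       # bitwise sum, no carry propagation
    carries = ((x & y) | (x & z) | (y & z)) << 1  # carries move up one bit position
    return partial_sum, carries


def multi_operand_add(operands):
    """Add non-negative integers with a compressor tree followed by one CPA."""
    ops = list(operands)
    while len(ops) > 2:
        full = len(ops) - len(ops) % 3            # operands that form complete triples
        reduced = []
        for k in range(0, full, 3):
            reduced.extend(compress_3_to_2(*ops[k:k + 3]))
        reduced.extend(ops[full:])                # leftovers pass to the next level
        ops = reduced
    # Final carry-propagate addition (the CPA stage).
    return sum(ops)
```

For example, `multi_operand_add([1, 2, 3, 4, 5, 6, 7, 8])` yields the same value as `sum(range(1, 9))`. Carry propagation happens only once, in the final addition, which is the reason a compressor tree suits multi-operand addition.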

FPCT limitations: The PPG is implemented in soft logic. Low input utilization for multipliers. Example: a 9x9-bit signed multiplier [Baugh-Wooley] needs a soft-logic 9x9-bit PPG (81 LUTs) whose 82 wires feed 1 FPCT with an 18-bit output. 82 of the 128 FPCT input bits gives 64% input utilization. [Figure: soft-logic PPG feeding an FPCT, with counters labeled C0 to C6]
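
To make the Baugh-Wooley partial-product generation concrete, here is a hedged Python sketch of one common textbook formulation. The function names are hypothetical, and the way constants are handled is an assumption: this variant yields 81 partial-product bits plus two constant ones, while the transcript does not say how the slide arrives at 82 wires.

```python
def baugh_wooley_terms(a, b, n=9):
    """Return (weight, bit) terms whose weighted sum is a*b modulo 2**(2*n).

    a and b are signed n-bit integers in [-2**(n-1), 2**(n-1) - 1].
    """
    mask = (1 << n) - 1
    a_bits = [((a & mask) >> i) & 1 for i in range(n)]
    b_bits = [((b & mask) >> j) & 1 for j in range(n)]

    terms = []
    for i in range(n):
        for j in range(n):
            bit = a_bits[i] & b_bits[j]
            # Exactly one sign bit involved: the term is negative, so
            # this formulation stores its complement instead.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            terms.append((i + j, bit))

    # Correction constants of this textbook variant (an assumption here).
    terms.append((n, 1))
    terms.append((2 * n - 1, 1))
    return terms


def sum_terms(terms, n=9):
    """Add all terms and reinterpret the 2n-bit result as a signed value."""
    total = sum(bit << weight for weight, bit in terms) % (1 << (2 * n))
    if total >= 1 << (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

All terms are non-negative bits, so the whole multiplication reduces to a multi-operand addition, which is what lets a compressor tree consume the PPG output directly.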

DSP block architecture: FPCT (8 CSlices) with 128 input bits and 48 output bits.

DSP block architecture: Two 9x9 signed PPGs, one of them (PPG*) modified to support larger multipliers. Hard compression circuits 'A' and 'B'. Two ½-FPCTs (4 CSlices each). Efficient synthesis of large multipliers. Only 8% larger than the traditional FPCT in 90nm CMOS (ARTISAN cell library with TSMC process). [Figure: block diagram with PPG, PPG*, fixed logic (A) and logic (B) with counters C1 to C4, and the two ½-FPCTs; wire widths labeled 5, 61, 21, 15, 3, 90, 18, 128, 61, 6]

Experimental methodology: To assess the performance of the DSP block we used Virtual Embedded Blocks (VEB) [Ho et al., FCCM 06]. Define a preplaced soft IP core, F*, with the same area and I/O as our DSP block. Replace our DSP block with F*. Map the benchmark on Stratix II. Extract the F* delay. Estimate the proposed DSP block delay with an ASIC design flow (90nm CMOS). For each proposed DSP block in the circuit, subtract the delay of F* and add the proposed DSP block delay. [Figure: circuit between input pins and output pins in which each IP block is replaced by F* and then by the new DSP block]
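
The final delay substitution step can be sketched in Python as follows. This is a minimal illustration for a single timing path; the function name, parameters, and the numbers in the example are hypothetical and do not come from the presentation.

```python
def adjusted_path_delay(measured_ns, blocks_on_path, f_star_ns, dsp_ns):
    """Delay of one path after swapping every F* on it for the proposed DSP block.

    measured_ns    -- path delay reported for the design mapped with F* cores
    blocks_on_path -- number of F* instances the path passes through
    f_star_ns      -- delay extracted for one F* instance
    dsp_ns         -- delay estimated for the proposed DSP block (ASIC flow)
    """
    # Subtract the placeholder delay, then add the estimated block delay.
    return measured_ns - blocks_on_path * f_star_ns + blocks_on_path * dsp_ns
```

With made-up inputs, `adjusted_path_delay(10.0, 2, 3.0, 1.5)` removes 6.0 ns of F* delay and adds back 3.0 ns of DSP block delay. Note that the sketch adjusts one path only; whether a different path becomes critical after the substitution would need to be checked separately.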

Results: Critical path delay (ns). [Figure: chart comparing Ternary GPC [Parandeh-Afshar et al., ASPDAC 08], Stratix II DSP block, FPCT with soft PPG, and the proposed DSP block]

Results: Area normalized to the Stratix II DSP block area. [Figure: chart comparing Stratix II DSP block, FPCT with soft PPG, and the proposed DSP block]

Conclusion: New DSP block proposed. Accelerates multiplication and multi-operand addition. More flexibility. Competitive with the Stratix II DSP block. Intended to replace the compressor tree in existing DSP blocks. Only 8% area overhead with respect to the original FPCT.