Exploiting Fast Carry Chains of FPGAs for Designing Compressor Trees

Presentation transcript:

Exploiting Fast Carry Chains of FPGAs for Designing Compressor Trees. Hadi P. Afshar, Philip Brisk, Paolo Ienne

Multi-input Additions are Fundamental. They appear in DSP and multimedia applications (FIR filters, motion estimation, …), in parallel multipliers, and after flow graph transformation. [Figure: FIR filter with delay elements (D) and a summation (Σ).]

Flow Graph Transformation. [Figure: data-flow graph of the ADPCM vpdiff computation BEFORE and AFTER the transformation. The nodes apply the operators >>, &, = and + to step and delta. In the AFTER graph the additions are collected into a single compressor tree (Σ).]

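Compressor vs. Adder Tree. [Figure: a compressor tree built from carry-save adders (CSA) with a final carry-propagate adder (CPA), next to an adder tree built from CPAs.] Compressor trees on FPGAs suffer from slow intra-LUT routing, poor LUT utilization, and low logic density. Compressor trees are better than adder trees in VLSI, but adder trees are better than compressor trees on FPGAs!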
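To make the CSA/CPA distinction concrete, here is a minimal sketch of a word-level compressor tree, assuming non-negative integers as operands. The function names are illustrative and do not come from the slides. Each carry-save step reduces three operands to two without propagating carries, and only the final step performs a carry-propagate addition.

```python
def carry_save_add(x, y, z):
    """3:2 compression of three words into a sum word and a carry word."""
    s = x ^ y ^ z                            # bitwise sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1   # carries move one position left
    return s, c


def compressor_tree_sum(operands):
    """Reduce the operands with CSAs until two remain, then do one CPA."""
    ops = list(operands)
    while len(ops) > 2:
        groups = len(ops) // 3
        nxt = []
        for g in range(groups):
            nxt.extend(carry_save_add(*ops[3 * g:3 * g + 3]))
        nxt.extend(ops[3 * groups:])         # operands left over in this level
        ops = nxt
    return sum(ops)                          # the single carry-propagate addition


total = compressor_tree_sum([3, 5, 7, 9, 11])
```

An adder tree would instead perform a full carry-propagate addition at every level, which is the structure the slide contrasts with.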

But Compressor Trees can be faster and smaller if Properly Designed

Better Compressors on FPGA. The Generalized Parallel Counter (GPC) is the basic block: more logic density, fewer logic levels, less pressure on the routing. [Figure labels: CPA, GPC.]

Overview: Arithmetic Concepts; Hybrid Design Approach (Bottom-up, Top-down); Experiments; Conclusion

Parallel Counters. A parallel counter (m:n counter) counts the number of its m input bits that are set to 1 and outputs the count as an n-bit binary value, with n = ⌈log2(m+1)⌉. A 3:2 counter is a full adder; a 2:2 counter is a half adder. In a Generalized Parallel Counter (GPC) the input bits can have different bit positions, e.g. the (3, 3; 4) GPC. [Figure: m inputs feeding a Σ block with n outputs.]
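A small sketch of these definitions follows, assuming the usual GPC notation in which the input counts are listed from the most significant column to the least significant, followed by the output width. The helper names are hypothetical.

```python
def counter_output_bits(m):
    """Output width of an m:n counter; equals ceil(log2(m + 1)) for m >= 0."""
    return m.bit_length()


def gpc_output_bits(input_counts_msb_first):
    """Output width of a GPC given its per-column input counts, MSB column first."""
    max_value = sum(k << i for i, k in enumerate(reversed(input_counts_msb_first)))
    return max_value.bit_length()


def gpc_value(columns_lsb_first):
    """Value produced by a GPC; columns_lsb_first[i] holds the bits of weight 2**i."""
    return sum(sum(bits) << i for i, bits in enumerate(columns_lsb_first))


# (3, 3; 4): three bits of weight 2 and three of weight 1, maximum value 9.
width = gpc_output_bits((3, 3))
value = gpc_value([[1, 1, 0], [1, 0, 1]])
```

The same width rule reproduces the other GPCs named in the talk: (0, 6) needs 3 output bits and (3, 5) needs 4.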

Compressor Trees on FPGAs. We propose GPCs as the basic blocks for compressor trees. Why? GPCs map well onto FPGA logic cells, and GPCs are flexible.

GPC Mapping Example. [Figure: a dot diagram covered by 5 counters versus 3 GPCs: (0,5;3), (3,4;4) and (3,5;4).]

Hybrid Design Approach. Top-down: compressor tree specification → GPC-mapped HDL netlist → place and route → result. Bottom-up: FPGA architectural characteristics → atom-level GPC HDL library, which feeds the top-down mapping.

FPGA Logic Cell: Altera Stratix-II/III/V. [Figure: a Logic Array Block (LAB) containing Adaptive Logic Modules (ALMs) numbered 1 to 8; each ALM has combinational logic, adders (+), and registers (Reg).]

FPGA Logic Cell: ALM Configuration Modes. Normal, Extended, Arithmetic, Shared Arithmetic. [Figure: 4-LUTs feeding the dedicated adders (+).]

Bottom-up Design. [Figure: a 6:3 GPC with outputs F2, F1, F0 placed across LAB0 and LAB1.] What if we have bigger GPCs, like a 7:3 GPC? Can we exploit the carry chain and the dedicated adders for building GPCs?

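GPC Design Example: (0, 6; 3) GPC. [Figure: the inputs a0 to a5 are compressed by a full adder (FA) and a half adder (HA). The LUTs of ALM0 and ALM1 compute S(a1,a2,a3), C(a1,a2,a3), S(a4,a5) and C(a4,a5); these intermediate signals (s0, s1, c0, c1), together with a0, are summed by the dedicated adders (+) on the carry chain to produce the outputs z0, z1, z2.]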
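The following is a behavioural sketch of this decomposition, not the authors' atom-level netlist. It assumes that the full adder produces (s0, c0) from a1, a2, a3, that the half adder produces (s1, c1) from a4, a5, and that the carry chain then adds a0 + s0 + s1 at weight 1 and c0 + c1 plus the incoming carry at weight 2. An exhaustive check confirms that the three outputs encode the number of set inputs.

```python
from itertools import product


def full_adder(x, y, z):
    """Return (sum, carry) of three bits."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def half_adder(x, y):
    """Return (sum, carry) of two bits."""
    return x ^ y, x & y


def gpc_0_6_3(a0, a1, a2, a3, a4, a5):
    """Behavioural model of a (0, 6; 3) GPC; returns (z0, z1, z2)."""
    s0, c0 = full_adder(a1, a2, a3)   # LUT stage (assumed signal naming)
    s1, c1 = half_adder(a4, a5)
    low = a0 + s0 + s1                # carry-chain adder at weight 1
    z0 = low & 1
    high = c0 + c1 + (low >> 1)       # carry-chain adder at weight 2
    return z0, high & 1, high >> 1


# Exhaustive check over all 64 input combinations.
all_correct = all(
    sum(bits) == z0 + 2 * z1 + 4 * z2
    for bits in product((0, 1), repeat=6)
    for z0, z1, z2 in [gpc_0_6_3(*bits)]
)
```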

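GPC Placement. GPCs placed next to each other share the same carry chain, so the chain needs logic separation between carry and sum at each GPC boundary, with a zero value on the carry entering the next GPC. At the boundary adder both operand inputs are driven by the same signal a, so {cout, s} = cin + a + a = cin + 2a, which gives cout = a and s = cin. [Figure: adders (+) on a carry chain with the GPC boundaries marked, and one boundary adder with inputs a and cin and outputs cout and s.]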
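The identity on this slide can be checked by brute force. The sketch below models one adder cell whose two operand inputs are tied to the same bit a; the function name is illustrative.

```python
def boundary_adder(a, cin):
    """Adder cell with both operands tied to a; returns (cout, s)."""
    total = cin + a + a          # equals cin + 2a
    return total >> 1, total & 1


# cout always equals a and s always equals cin, for every input combination.
identity_holds = all(
    boundary_adder(a, cin) == (a, cin)
    for a in (0, 1)
    for cin in (0, 1)
)
```

Because s depends only on cin and cout depends only on a, driving a with 0 passes the previous GPC's carry out through s while handing a zero carry to the next GPC.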

[Figure: LUT and adder (+) cells on the carry chain at the boundary between GPCi and GPCi+1.]

Top-down Heuristic

Mapping_algorithm(Integer : M, Integer : W, Array of Integers : columns)
{
    Build_GPC_library();
    repeat {
        while (col_indx < max_col_indx) {
            if (columns[col_indx] > H) Map_by_GPC();
            else col_indx++;
        }
        lsb_to_msb_covering();
        Connect_GPCs_IOs();
        Propagate_comb_delay();
        Generate_next_stage_dots();
    } until three rows of dots remain;
}

Slide annotations: Step 1, Step 2, Step 3; (0, H; log2H) GPC.

Major Step of Heuristic. Process columns from LSB to MSB. [Figure labels: "Mapped to (0, H; log2H) GPCs" and "Height < H".]
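As an illustration of stage-by-stage column covering, here is a deliberately simplified sketch. It is not the authors' heuristic: it uses only single-column counters of at most H inputs, and it ignores the GPC library, delay propagation, and I/O connection steps. All names are hypothetical. Each column is a list of bits, columns are processed from LSB to MSB, and stages repeat until at most three rows of dots remain.

```python
def dot_value(columns):
    """Integer value of a dot diagram; columns[i] holds the bits of weight 2**i."""
    return sum(sum(bits) << i for i, bits in enumerate(columns))


def map_stage(columns, H=6):
    """Cover every column, LSB to MSB, with single-column counters of <= H inputs."""
    nxt = [[] for _ in range(len(columns) + H.bit_length())]
    for col, bits in enumerate(columns):
        bits = list(bits)
        while len(bits) >= 3:                      # fewer than 3 bits cannot shrink
            taken, bits = bits[:H], bits[H:]
            count = sum(taken)
            for k in range(len(taken).bit_length()):
                nxt[col + k].append((count >> k) & 1)  # counter output bits
        nxt[col].extend(bits)                      # uncovered dots pass through
    while nxt and not nxt[-1]:
        nxt.pop()                                  # drop empty high columns
    return nxt


def compress(columns, H=6, rows=3):
    """Apply stages until no column is taller than `rows`; returns (columns, stages)."""
    stages = 0
    while max(map(len, columns), default=0) > rows:
        columns = map_stage(columns, H)
        stages += 1
    return columns, stages


dots = [[1] * 7, [1, 0, 1, 1, 0, 1, 1, 1], [1] * 5, [0, 1]]
reduced, stages = compress(dots)
```

The loop terminates because every counter consumes at least three dots and emits fewer than it consumes, so the total number of dots strictly decreases while any column is taller than three.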

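Delay Balancing. Given signal delays z1d > z4d > z6d and GPC input delays a0d > a2d > a5d, connecting the slowest signal to the slowest input gives the critical path CP1 = z1d + a0d. Connecting the slowest signal to the fastest input instead gives CP2 = max(z1d + a5d, z4d + a2d, z6d + a0d). Every term of CP2 is at most z1d + a0d, so CP2 ≤ CP1. [Figure: signals z0 to z8 connected to the GPC inputs a0, a2 and a5.]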
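A short numerical sketch of the two pairings follows. The delay values are hypothetical, chosen only to satisfy the orderings on the slide, and the function name is illustrative.

```python
def critical_path(signal_delays, input_delays, balance):
    """Worst signal-plus-input delay for a pairing of signals to GPC inputs.

    With balance=True the slowest signal is paired with the fastest input;
    with balance=False the slowest signal is paired with the slowest input.
    """
    signals = sorted(signal_delays, reverse=True)        # slowest signal first
    inputs = sorted(input_delays, reverse=not balance)   # ascending when balancing
    return max(s + i for s, i in zip(signals, inputs))


# Hypothetical delays in picoseconds: z1d > z4d > z6d and a0d > a2d > a5d.
z_delays = [300, 200, 100]
a_delays = [90, 60, 30]

cp1 = critical_path(z_delays, a_delays, balance=False)   # z1d + a0d
cp2 = critical_path(z_delays, a_delays, balance=True)    # max over balanced pairs
```

With these numbers the unbalanced pairing gives 390 ps and the balanced pairing gives 330 ps.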

Experiments. Bottom-up design: atom-level design in Verilog Quartus Mapping (VQM) format, compiled with the Altera Quartus-II tool. Top-down: heuristic implemented in C++, output as structural VHDL, compiled with the Altera Quartus-II tool. Benchmarks: DCT, FIR, ME, G721, Multiplier, Horner Polynomial, Video Mixer.

Experiments: mapping methods. Ternary; LUT Only; Arith1: arithmetic mode, without delay balancing; Arith2: arithmetic mode, with delay balancing.

[Chart: Delay (ns). Annotations: -27%, +2%.]

[Chart: Area (ALM). Annotations: +47%, +18%.]

[Chart: Area (LAB). Annotation: -4.5%.]

Conclusion. Conventional wisdom has held that adder trees outperform compressor trees on FPGAs; ternary adder trees were a major selling point. Conventional wisdom is wrong! GPCs map nicely onto FPGA logic cells and carry chains. Compressor trees on FPGAs are faster than adder trees when built from GPCs.