A Flexible DSP Block to Enhance FPGA Arithmetic Performance


1 A Flexible DSP Block to Enhance FPGA Arithmetic Performance
Hadi Parandeh-Afshar, Alessandro Cevrero, Panagiotis Athanasopoulos, Philip Brisk, Yusuf Leblebici, Paolo Ienne. Affiliations: LAP and LSM, École Polytechnique Fédérale de Lausanne (EPFL); University of California, Riverside (UCR).

2 Motivation and contribution
New DSP block for high-performance FPGAs. Increased flexibility: a bypassable partial product generator (PPG) and a programmable compressor tree. Goal: enhance FPGA arithmetic performance.

3 Motivation and contribution
Data-flow transformations automatically expose compressor trees [Verma et al., TCAD 08]. Fused multiply-add operations cannot use current DSP blocks in a single cycle. DSP blocks cannot accelerate multi-operand addition.
[Figure: arithmetic transformation of a data-flow graph, (a) before and (b) after; node labels include E1, E2, M1, M2, S1, S2, sign, xor, neg, not, and, out.]
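
As a rough illustration of why a fused multiply-add maps naturally onto a compressor tree, the sketch below treats `a*b + c` as one multi-operand sum: the partial-product rows of `a*b` plus the addend `c`. It is not the transformation of Verma et al.; the function names, the unsigned operands, and the 8-bit default width are assumptions made only for this example.

```python
def partial_product_rows(a, b, width):
    """One shifted copy of a per set bit of b (unsigned shift-and-add rows)."""
    return [(a << j) if (b >> j) & 1 else 0 for j in range(width)]


def fused_multiply_add(a, b, c, width=8):
    """Compute a*b + c as a single multi-operand addition.

    Assumes 0 <= b < 2**width. The addend c is just one more row, so a
    single reduction (modelled here by sum) covers multiply and add.
    """
    rows = partial_product_rows(a, b, width) + [c]
    return sum(rows)  # stands in for a compressor tree plus one final adder
```

For instance, `fused_multiply_add(13, 11, 5)` adds eight partial-product rows and the addend in one pass instead of a multiply followed by a separate add.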

4 Outline
Related work. Limitations. DSP block architecture. Experimental methodology. Results. Conclusions.

5 FPGA commentary
Logic cells with dedicated addition circuitry and fast carry chains. Compressor tree synthesis on 6-LUT FPGAs [Parandeh-Afshar et al., ASPDAC 08, DATE 08, FPL 09]. IP cores [Xilinx, Altera]. FP cores [Beauchamp et al., TVLSI 08]. DSP blocks [Altera Stratix III-IV].

7 Field Programmable Compressor Tree (FPCT)
User-configurable multi-operand adder: a compressor tree plus a bypassable carry-propagate adder (CPA). Built from 8 CSlices, each with 16 input bits and 6 output bits: 128 = 8×16 input bits and 48 = 8×6 output bits, with carry-in and carry-out between CSlices. Previous work has established the ability of the FPCT to accelerate multi-input addition operations; a 1.6x speedup was observed [Cevrero et al., FPGA 08, TRETS 09].
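
The idea of a compressor tree followed by a CPA can be sketched in a few lines. This is only a behavioural model under stated assumptions: it uses plain 3:2 compressors (carry-save full adders) rather than whatever counters a CSlice actually contains, which the slides do not detail, and the function names are hypothetical.

```python
def carry_save_3to2(x, y, z):
    """Compress three rows into a sum row and a carry row, with no carry propagation."""
    sum_row = x ^ y ^ z
    carry_row = ((x & y) | (x & z) | (y & z)) << 1
    return sum_row, carry_row


def compressor_tree_sum(operands):
    """Add non-negative integers: reduce to at most two rows, then one final add."""
    rows = list(operands)
    while len(rows) > 2:
        reduced = []
        # Compress complete groups of three rows into two rows each.
        for k in range(0, len(rows) - 2, 3):
            reduced.extend(carry_save_3to2(rows[k], rows[k + 1], rows[k + 2]))
        # Rows that did not fill a group of three pass through unchanged.
        leftover = len(rows) % 3
        if leftover:
            reduced.extend(rows[-leftover:])
        rows = reduced
    return sum(rows)  # models the carry-propagate adder (CPA)
```

Only the last step propagates carries across the full word; every earlier level works bit-column by bit-column, which is what makes the structure fast for many operands.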

8 FPCT limitations
The PPG must be built in soft logic. Example: a 9x9-bit signed multiplier [Baugh-Wooley] needs a soft-logic 9x9-bit PPG (81 LUTs), which drives 82 wires into 1 FPCT to produce an 18-bit output.

9 FPCT limitations
PPG in soft logic; low input utilization for multipliers. A 9x9-bit signed multiplier [Baugh-Wooley] uses a soft-logic 9x9-bit PPG (81 LUTs) and drives 82 wires into 1 FPCT for an 18-bit output, which is only 64% input utilization (82 of 128 inputs).
[Figure: partial-product bits mapped to FPCT columns C0 to C6.]
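
To make the soft-logic PPG concrete, here is a minimal sketch of textbook Baugh-Wooley partial-product generation for an n x n signed multiplier. It is an assumption-laden model, not the authors' netlist: the function names are hypothetical, and the two constant bits of the textbook form are kept as separate terms, so the bit count here need not match the 82 wires quoted on the slide.

```python
def baugh_wooley_columns(a, b, n=9):
    """Return, per bit column, the partial-product bits of an n x n signed multiply."""
    mask = (1 << n) - 1
    a_bits = [((a & mask) >> i) & 1 for i in range(n)]
    b_bits = [((b & mask) >> j) & 1 for j in range(n)]
    columns = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            bit = a_bits[i] & b_bits[j]
            # Terms involving exactly one sign bit are complemented.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            columns[i + j].append(bit)
    # Constant correction bits of the textbook formulation.
    columns[n].append(1)
    columns[2 * n - 1].append(1)
    return columns


def multiply_signed(a, b, n=9):
    """Sum the columns modulo 2**(2n) and reinterpret as a signed 2n-bit result."""
    columns = baugh_wooley_columns(a, b, n)
    total = sum(sum(col) << weight for weight, col in enumerate(columns))
    total &= (1 << (2 * n)) - 1
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

Each of the 81 AND-type terms corresponds to one small logic function, consistent with the 81 LUTs on the slide; all of them must then be routed into the FPCT inputs.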

10 DSP block architecture
[Figure: FPCT (8 CSlices) with 128 input bits and 48 output bits.]

11 DSP block architecture
Two 9x9 signed PPGs, one of them (PPG*) modified to support larger multipliers. Hard compression circuits 'A' and 'B'. Efficient synthesis of large multipliers.
[Figure: block diagram of two ½-FPCTs (4 CSlices each), PPG, PPG*, and hard compression circuits A and B.]

12 DSP block architecture
[Figure: the same block diagram with a detail view of fixed logic (A) and logic (B); column labels C1 to C4.]

13 DSP block architecture
The proposed block is only 8% larger than the traditional FPCT in 90 nm CMOS (Artisan cell library with a TSMC process).
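
The slides do not say how PPG* differs from the plain PPG, so the following is only a sketch of the arithmetic identity behind building a larger multiplier from 9-bit pieces. It assumes an 18x18 signed multiply split into four 9x9 sub-products; the names `split_signed_18` and `multiply_18x18` are hypothetical.

```python
HALF = 9  # width of each half-operand in bits


def split_signed_18(x):
    """Split x into a signed upper half and an unsigned lower half."""
    hi = x >> HALF                 # arithmetic shift keeps the sign
    lo = x & ((1 << HALF) - 1)     # lower 9 bits, always non-negative
    return hi, lo


def multiply_18x18(a, b):
    """Rebuild a*b from four 9x9 sub-products combined by multi-operand addition."""
    a_hi, a_lo = split_signed_18(a)
    b_hi, b_lo = split_signed_18(b)
    terms = [
        (a_hi * b_hi) << (2 * HALF),  # signed x signed
        (a_hi * b_lo) << HALF,        # signed x unsigned
        (a_lo * b_hi) << HALF,        # unsigned x signed
        a_lo * b_lo,                  # unsigned x unsigned
    ]
    return sum(terms)
```

Note that the sub-products mix signed and unsigned halves, and that their sum is itself a multi-operand addition of the kind a compressor tree handles.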

14 Experimental methodology
Virtual Embedded Blocks (VEB) [Ho et al., FCCM 06] are used to assess the performance of the proposed DSP block. Define a preplaced soft IP core, F*, with the same area and I/O as our DSP block.
[Figure: IP core placed between input pins and output pins.]

15 Experimental methodology
VEB flow, continued: replace each instance of our DSP block with F*, map the benchmark onto Stratix II, and extract the delay of F*. The delay of the proposed DSP block itself is estimated with an ASIC design flow (90 nm CMOS).
[Figure: F* instances placed between input pins and output pins.]

16 Experimental methodology
Final step of the VEB flow: for each proposed DSP block in the circuit, subtract the delay of F* and add the estimated delay of the proposed DSP block.
[Figure: the F* instances replaced by the new DSP block, between input pins and output pins.]
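
The delay correction described in the methodology amounts to simple arithmetic, sketched below. The function name, the per-path view, and all numbers in the usage note are made up for illustration; the slides give no concrete delay values.

```python
def adjusted_path_delay(measured_ns, fstar_delays_ns, dsp_delay_ns):
    """Swap each F* delay on a path for the estimated delay of the proposed DSP block.

    measured_ns:     path delay reported after mapping with F* placeholders
    fstar_delays_ns: delay contributed by each F* instance on that path
    dsp_delay_ns:    DSP block delay estimated from the ASIC flow
    """
    delay = measured_ns
    for fstar_ns in fstar_delays_ns:
        delay -= fstar_ns      # remove the soft placeholder's delay
        delay += dsp_delay_ns  # add the hard block's estimated delay
    return delay
```

With hypothetical values, `adjusted_path_delay(20.0, [6.0, 5.5], 3.0)` replaces two placeholder delays with two hard-block delays on a 20 ns path.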

17 Results
Critical path delay (ns).
[Figure: chart comparing Ternary, GPC [Parandeh-Afshar et al., ASPDAC 08], Stratix II DSP block, FPCT with soft PPG, and the proposed DSP block.]

18 Results
Normalized area (relative to the Stratix II DSP block area).
[Figure: chart comparing the Stratix II DSP block, FPCT with soft PPG, and the proposed DSP block.]

19 Conclusion
New DSP block proposed. It accelerates multiplication and multi-operand addition. More flexibility. Competitive with the Stratix II DSP block. Intended to replace the compressor tree in existing DSP blocks. Only 8% area overhead with respect to the original FPCT.

