Published by Kayla Hoskinson. Modified over 2 years ago.

1
**Architecture-Specific Packing for Virtex-5 FPGAs**

Taneem Ahmed, Paul Kundarewich, Jason Anderson, Brad Taylor, Rajat Aggarwal

February 25th, 2008

2
**Overview**

- Virtex-5 6-LUT Packing
- Virtex-5 DSP and Block RAM Packing
- Results
- Summary

3
**Simplified FPGA Logic Element**

[Figure: a 4-LUT (inputs A1 to A4, output O4) and a flip-flop (FF)]

4
**Simplified FPGA Logic Block**

[Figure: a logic block of 4-LUTs and FFs connected to the general interconnect]

5
**Virtex-5 Logic Block**

[Figure: a CLB with two SLICEs, each containing 6-LUTs and FFs, connected to the general interconnect]

6
**Dual-Output 6-LUT**

[Figure: a 6-LUT with inputs A1 to A6 and two outputs, O6 and O5]

7
**Dual-Output 6-LUT Usage**

8
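**Dual-Output Packing**

[Figure: two functions, X Logic (inputs x, y) and Y Logic (inputs a, b). Implemented separately, each occupies its own 6-LUT. Number of 6-LUTs used: 2. Packed together as 5-LUTs in one dual-output 6-LUT, with A6 tied to VCC and the results on O6 and O5. Number of 6-LUTs used: 1!]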
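The packing shown on this slide depends on both functions fitting within the five inputs (A1 to A5) that the two 5-LUT halves share. A minimal sketch of that feasibility check follows; the function name `can_share_lut` and the signal names are hypothetical, and a real packer would weigh more than the input count.

```python
from typing import Iterable

# A dual-output 6-LUT acts as two 5-LUTs that share inputs A1..A5.
MAX_SHARED_INPUTS = 5


def can_share_lut(inputs_x: Iterable[str], inputs_y: Iterable[str]) -> bool:
    """Return True if two functions fit into one dual-output 6-LUT.

    Assumption (illustrative only): the sole constraint is that the union
    of distinct input signals is at most five.
    """
    distinct_inputs = set(inputs_x) | set(inputs_y)
    return len(distinct_inputs) <= MAX_SHARED_INPUTS


# The slide's example: X uses (x, y) and Y uses (a, b): four distinct inputs.
print(can_share_lut(["x", "y"], ["a", "b"]))
# Six distinct inputs cannot be shared by the two 5-LUT halves.
print(can_share_lut(["a", "b", "c", "d"], ["e", "f"]))
```

Signals common to both functions count only once, so two functions with overlapping inputs are easier to merge than two with disjoint inputs.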

9
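**Virtex-5 LUT/FF Pair**

[Figure: a LUT/FF pair showing the 6-LUT (outputs O6, O5), carry logic (CY, CIN, XOR), F7 mux, AX input, flip-flop (FF, output AQ), and the A and AMUX outputs]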

10
**Dual-Output Packing Tradeoff**

[Figure: two 6-LUTs with FFs, showing the O5 and O6 outputs, the AX input, and the F7 mux]

11
**Dual-Output Packing in Placer**

- Goal: reduce area without a performance hit
- Can be done pre-placement
  - Will be sub-optimal without delay estimates
- Use delay estimates available during placement to make good decisions on when to merge two LUTs
- Approach:
  - Allow the second 5-LUT to be used when the performance impact is small
  - Incorporate LUT packing in the placer’s cost function

12
**Placer Cost Function**

Previous cost function:

Cost = a * W + b * T

- W: wirelength cost
- T: timing performance cost

Extend the cost function with two new terms:

- One based on 6-LUT utilization (L)
- One based on SLICE utilization (S)

Cost = a * W + b * T + c * L + d * S
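The extended cost function is a weighted sum, which can be sketched as follows. The weight values and the inputs in the usage lines are made up for illustration; the slides do not give the weights, and how W, T, L and S are computed is outside this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostWeights:
    a: float        # weight on wirelength cost W
    b: float        # weight on timing performance cost T
    c: float = 0.0  # weight on 6-LUT utilization term L
    d: float = 0.0  # weight on SLICE utilization term S


def placement_cost(w: float, t: float, l: float, s: float,
                   weights: CostWeights) -> float:
    """Cost = a * W + b * T + c * L + d * S."""
    return (weights.a * w + weights.b * t
            + weights.c * l + weights.d * s)


# Hypothetical numbers, for illustration only.
previous = placement_cost(100.0, 20.0, 8.0, 4.0, CostWeights(a=1.0, b=2.0))
extended = placement_cost(100.0, 20.0, 8.0, 4.0,
                          CostWeights(a=1.0, b=2.0, c=0.5, d=0.25))
print(previous, extended)
```

Setting c and d to zero recovers the previous cost function, so the two utilization terms act as additive penalties on top of the existing wirelength and timing trade-off.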

13
**6-LUT Utilization Term**

L is computed based on all the used 6-LUT slots.

[Equation: definition of L and its terms]

14
**SLICE Utilization Term**

S is computed based on all the available SLICEs.

Let $N_i$ = number of used 5-LUTs in SLICE $i$ (at most 8). Then

$$S = \sum_{i=0}^{m} S_i$$

where $S_i$ is the term for SLICE $i$.

15
**Performance Recovery**

- Helpful to prohibit packing in certain cases for performance reasons
- Other used elements in a SLICE may block the “good” path from the O5 output to external interconnect

16
**Performance Recovery: XOR**

[Figure: LUT/FF pairs showing LUT6 (O5, O6), carry logic (CY, CIN), F7, AX, FF (AQ), and the A and AMUX outputs]

17
**Performance Recovery: F7**

[Figure: LUT/FF pairs showing LUT6 (O5, O6), carry logic (CY, CIN, XOR), F7, AX, FF (AQ), and the A and AMUX outputs]

18
**6-LUT Reduction**

5.5% 6-LUT reduction

19
**SLICE Reduction**

10.23% SLICE reduction

20
**Performance Results**

3.3% performance degradation

21
**Overview**

- Virtex-5 6-LUT Packing
- Virtex-5 DSP and Block RAM Packing
- Summary

22
**New Type of Packing Problem**

- Traditionally, packing is considered to be a problem of just LUTs and flops
- However, Virtex-5 contains large IP blocks that present their own packing problem

23
**Virtex-5 Block RAMs**

- A 36 Kbit block RAM tile can store:
  - a) a single 36 Kb RAM
  - b) two independent 18 Kb RAMs
- Block RAM has a configurable “aspect ratio”
  - An 18 Kb RAM can be configured as 16K x 1, 8K x 2, 2K x 9, or 1K x 18
- Tools decide which independent 18 Kb block RAMs to locate in which tile

[Figure: a block RAM tile used as one 36 Kb RAM or as two 18 Kb RAMs]
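The aspect ratios and the tile capacity are simple arithmetic, worked through in the sketch below. The helper names `bits_used` and `min_tiles` are hypothetical, and `min_tiles` is only a lower bound that ignores any placement or connectivity constraints.

```python
import math

KBIT = 1024
CAPACITY_18K = 18 * KBIT  # 18,432 bits

# Aspect ratios listed on the slide for an 18 Kb RAM, as (depth, width).
ASPECT_RATIOS_18K = [(16 * KBIT, 1), (8 * KBIT, 2), (2 * KBIT, 9), (1 * KBIT, 18)]


def bits_used(depth: int, width: int) -> int:
    """Total storage addressed by a depth x width configuration."""
    return depth * width


def min_tiles(num_18k: int, num_36k: int = 0) -> int:
    """Lower bound on 36 Kb tiles: each tile holds one 36 Kb RAM or two 18 Kb RAMs."""
    return num_36k + math.ceil(num_18k / 2)


for depth, width in ASPECT_RATIOS_18K:
    bits = bits_used(depth, width)
    assert bits <= CAPACITY_18K  # every listed shape fits in 18 Kb
    print(f"{depth} x {width} -> {bits} bits")

print(min_tiles(num_18k=3, num_36k=1))
```

The x1 and x2 shapes address 16,384 bits, while the x9 and x18 shapes address the full 18,432 bits.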

24
**Virtex-5 DSP48E Block**

- A multiply-accumulate operation, pervasive in DSP circuits, can be realized in a single DSP48E
- Multiple DSP48Es can be chained together to form more complex functions through the PCIN and PCOUT ports

[Figure: DSP48E block diagram with a 25x18 multiplier fed by A (25-bit) and B (18-bit), an ALU, the C (48-bit) input, the P output, pattern detect, optional pipeline registers and routing logic, and the 48-bit PCIN and PCOUT cascade ports]
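The multiply-accumulate and the PCOUT-to-PCIN chaining can be illustrated with a small behavioral model. This is a sketch only: it assumes two's-complement operands, models a single P = A * B + PCIN operation, and ignores pipeline registers, the C input, pattern detect, and the block's other operating modes. The names `to_signed` and `dsp_mac` are hypothetical.

```python
def to_signed(value: int, bits: int) -> int:
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def dsp_mac(a: int, b: int, pcin: int = 0) -> int:
    """Behavioral sketch of one multiply-accumulate: P = A * B + PCIN.

    A is treated as 25-bit, B as 18-bit, and the result wraps to 48 bits.
    """
    a = to_signed(a, 25)
    b = to_signed(b, 18)
    return to_signed(a * b + pcin, 48)


# Two chained blocks: the first block's result feeds the second block's PCIN.
p0 = dsp_mac(3, 4)            # 3 * 4
p1 = dsp_mac(5, 6, pcin=p0)   # 5 * 6 + 12
print(p0, p1)
```

Chaining in this way accumulates a sum of products across blocks, which is the structure the FIR filter taps on the later slides rely on.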

25
**Block RAM and DSP Floorplan**

Block RAM and DSP48E tiles are organized in columns.

[Figure: floorplan with a column of block RAM tiles next to a column of Virtex-5 DSP tiles containing DSP48E blocks]

26
**Block RAM/DSP Packing**

- Problem: placer algorithms are heuristic and sometimes do not find an optimal block RAM packing
- Goal: leverage preferred block RAM packing patterns to achieve high performance
- Target area: DSP designs
  - DSP designs make heavy use of block RAMs and DSP blocks

27
**DSP Block RAM Designs**

- The most common DSP application is the Finite Impulse Response (FIR) filter
- FIR filters have multiple instances of a “tap”, each of which involves DSP blocks and block RAMs

28
**FIR Filter**

- A Finite Impulse Response (FIR) filter is a digital filter that takes a weighted average of the signals in a delay line
- An N-tap filter can be expressed as:

$$y[n] = c_0\,x[n] + c_1\,x[n-1] + \dots + c_{N-1}\,x[n-N+1]$$

  where:
  - $y[n]$ is the output of the filter at time $n$
  - $x[n]$ is the data input “signal” at time $n$
  - $c_i$ is the $i$-th coefficient
- Each coefficient/data product in the sum is referred to as a “tap”
- DSP units are used for the multiply and accumulate
- Block RAMs are used to store the data and coefficients
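The N-tap sum can be written directly as a short software reference model. This sketch assumes that samples before time 0 are zero, which the slide does not specify; the function name `fir_filter` is hypothetical, and the code says nothing about how the hardware schedules the taps.

```python
from typing import List, Sequence


def fir_filter(x: Sequence[float], coeffs: Sequence[float]) -> List[float]:
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * x[n - k].

    Assumption: x[n] = 0 for n < 0 (the delay line starts empty).
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, c in enumerate(coeffs):  # one multiply-accumulate per tap
            if n - k >= 0:
                acc += c * x[n - k]
        y.append(acc)
    return y


# A 2-tap averaging filter applied to a ramp.
print(fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))
```

Each iteration of the inner loop corresponds to one tap: a coefficient and a delayed data sample are read and their product is added to the running sum.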

29
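**FIR Designs – Use Case 1**

2-tap FIR filter involving small block RAMs

[Figure: data RAMs (RAMD0, RAMD1) and coefficient RAMs (RAMC0, RAMC1), each an 18 Kb block RAM, feeding the A and B inputs of DSP Tap 0 and DSP Tap 1. The taps are chained through PCOUT and PCIN between the data input and the data output. The RAMs are grouped into 36 Kb block RAM tiles.]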

30
**Packing for Use Case 1**

- Packing both 18k block RAMs into a block RAM tile permits a natural alignment between the DSP and block RAMs
- The tile operates as two independent 18 Kb block RAMs
- High performance!

[Figure: a column of block RAM tiles aligned with a column of Virtex-5 DSP tiles, each containing two DSP48E blocks]

31
**FIR Designs – Use Case 2**

2-tap FIR filter involving larger block RAMs

[Figure: data RAMs (RAMD0, RAMD1) and coefficient RAMs (RAMC0, RAMC1) feeding the A and B inputs of DSP0 (Tap 0) and DSP1 (Tap 1), chained through PCOUT and PCIN. Labels: 18 Kb block RAM, 36 Kb block RAM.]

32
**Packing for Use Case 2**

- Two block RAM columns feed one DSP column
- Again provides a natural alignment between the DSP and block RAMs

[Figure: two columns of block RAM tiles on either side of a column of Virtex-5 DSP tiles containing DSP48E blocks]

33
**Block RAM Chains**

- Use case: an 18k block RAM’s data input and output pins are connected together (e.g. FIFO)
- Algorithm: look for such chains and pack them together into a single block RAM tile
- Special case: 18k block RAMs separated by registers

[Figure: RAM0 (dia, doa, addra) and RAM1 (dib, dob, addrb), two 18 Kb block RAMs chained to the output out]
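One way to picture the chain search is sketched below. The netlist representation (a dictionary mapping each 18 Kb RAM to the RAM its data output drives) and the function name `pack_ram_chains` are hypothetical, not the tool's actual data structures. The sketch assumes each RAM has at most one driver, and it does not handle the register-separated special case or closed loops.

```python
from typing import Dict, List, Optional, Tuple


def pack_ram_chains(drives: Dict[str, Optional[str]]) -> List[Tuple[str, ...]]:
    """Group chained 18 Kb RAMs into tiles of at most two RAMs each.

    drives[r] is the RAM whose data input is fed by r's data output,
    or None if r does not drive another RAM.
    """
    driven = {dst for dst in drives.values() if dst is not None}
    heads = sorted(r for r in drives if r not in driven)  # chain starting points

    tiles: List[Tuple[str, ...]] = []
    for head in heads:
        # Walk the chain from its head to its end.
        chain = []
        node: Optional[str] = head
        while node is not None:
            chain.append(node)
            node = drives.get(node)
        # Pack consecutive chain members pairwise into one tile.
        for i in range(0, len(chain), 2):
            tiles.append(tuple(chain[i:i + 2]))
    return tiles


print(pack_ram_chains({"RAM0": "RAM1", "RAM1": None, "RAM2": None}))
```

RAMs that are directly connected end up in the same tile, while an unconnected RAM is left in a tile of its own for the placer to pair later.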

34
**Block RAM/DSP Packing Results**

| Circuit | Perf. RAM Packing (MHz) | Perf. Baseline (MHz) | Percent Improvement |
|---|---|---|---|
| Circuit 1 | 500 | 400 | 25% |
| Circuit 2 | 450 | 365 | 23% |
| Circuit 3 | | 470 | 6% |
| Circuit 4 | 425 | 435 | -2% |
| Circuit 5 | 215 | 200 | 8% |
| Geomean | | 359 | 11% |
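The percent improvement column can be checked with a one-line formula. The sketch below uses only the rows for which both frequencies appear in the results (Circuits 1, 2, 4 and 5); the function name `percent_improvement` is hypothetical.

```python
def percent_improvement(packed_mhz: float, baseline_mhz: float) -> float:
    """Relative gain of the RAM-packing result over the baseline, in percent."""
    return (packed_mhz - baseline_mhz) / baseline_mhz * 100.0


# (RAM packing MHz, baseline MHz) for the rows with both values present.
rows = {
    "Circuit 1": (500, 400),
    "Circuit 2": (450, 365),
    "Circuit 4": (425, 435),
    "Circuit 5": (215, 200),
}

for name, (packed, baseline) in rows.items():
    print(f"{name}: {percent_improvement(packed, baseline):+.1f}%")
```

Rounded to whole percentages, the computed values agree with the improvements reported for these four circuits, including the slight slowdown for Circuit 4.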

35
**Summary**

- Described two architecture-specific packing approaches for a 65nm commercial FPGA: Xilinx Virtex-5
- Dual-output LUT packing in placement:
  - Achieves 10.2% SLICE reduction and 5.5% LUT reduction
- Packing for DSPs and block RAMs:
  - Achieves 11% performance improvement

36
Questions
