# 7 Series DSP Resources Part 1.


Objectives

After completing this module, you will be able to:

- Describe the primary usage models of DSP slices
- Describe the DSP slice in the 7 series FPGAs

Lessons

- DSP Overview
- 7 Series FPGA DSP Slice

Growing DSP Performance Gap

- Performance requirements are outpacing traditional DSP solutions
- A solution is needed to fill the gap

[Figure: algorithm complexity (algorithmic and processor forecast) plotted against the performance of traditional DSP/GPP processor architectures, with applications such as 3G, SDR, 4G LTE, imaging radar, and SD/HD video marked on the curve. Source: Jan Rabaey, BWRC.]

Typical DSP Operation

Diagram of a typical FIR filter, viewed as an equation:

$$Y(n) = \sum_{i=0}^{N-1} k_i \, X(n-i)$$

- The computation is parallel by nature
- For N taps, N multiplications should happen in parallel
- Viewed as a diagram: the input X(n) passes through a chain of delay registers (Z^-1); each delayed sample is multiplied by its coefficient (k0, k1, k2, k3, ... kN-1), and the products are summed to form Y(n), that is, a multiply-accumulate performed N times
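The FIR equation above can be checked with a short behavioral model. This is a minimal Python sketch, not part of the original slides; the function name `fir_direct` and the convention that samples before the start of the stream are zero are assumptions made for illustration.

```python
def fir_direct(samples, coeffs):
    """Return Y(n) = sum over i of coeffs[i] * X(n - i).

    Samples before the start of the input (n - i < 0) are treated as zero.
    """
    out = []
    for n in range(len(samples)):
        acc = 0
        for i, k in enumerate(coeffs):
            if n - i >= 0:
                acc += k * samples[n - i]  # one multiply-accumulate per tap
        out.append(acc)
    return out


if __name__ == "__main__":
    print(fir_direct([1, 2, 3, 4], [1, 10, 100]))  # [1, 12, 123, 234]
```

The inner loop is the sequential view of the filter; the N products inside it are independent of one another, which is what allows them to run in parallel in hardware.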

Sequential vs. Parallel DSP Processing

To implement the operation described in the previous slide, use one of the following:

- A single multiplier looping through N iterations
- N multipliers and adders in parallel
- A combination of both

A standard DSP processor has one (or a few) multipliers, so it must loop through the iterations. An FPGA has many multipliers, so it can perform complex operations in parallel. The largest Virtex-7 XT FPGA has 3960 DSP slices that can run at 600 MHz, so it can implement a 3960-tap filter at 600 million samples per second (MSPS).

| | Standard DSP processor (generic DSP) | FPGA (7 series FPGA) |
|---|---|---|
| Implementation | Sequential, single MAC unit | Fully parallel |
| Clock rate | 1.2 GHz | 600 MHz |
| Clock cycles per output sample | 3960 | 1 (3960 operations in one clock cycle) |
| Sample rate | 1.2 GHz / 3960 clock cycles = 303 KSPS | 600 MHz / 1 clock cycle = 600 MSPS |
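The sample-rate figures on this slide follow from dividing the clock rate by the number of clock cycles spent on each output sample. A minimal Python sketch of that arithmetic is shown here; the function name and the ceiling-division model for the mixed case are assumptions, not something taken from the slides.

```python
def sample_rate_hz(clock_hz, taps, multipliers):
    """Output sample rate when `taps` multiplications share `multipliers` units."""
    cycles_per_sample = -(-taps // multipliers)  # ceiling division
    return clock_hz / cycles_per_sample


if __name__ == "__main__":
    sequential = sample_rate_hz(1.2e9, taps=3960, multipliers=1)
    parallel = sample_rate_hz(600e6, taps=3960, multipliers=3960)
    print(f"{sequential / 1e3:.0f} KSPS, {parallel / 1e6:.0f} MSPS")  # 303 KSPS, 600 MSPS
```

The same function covers the "combination of both" case: with 1980 multipliers, for example, each output sample would take two clock cycles.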

DSP Slice Features

In order to implement most DSP functions, certain features are required of a DSP slice:

- Input conditioning: including pre-addition and pipeline control
- Pipelining: for maximum performance and the implementation of sample delays (Z^-1)
- Multiplication: the basis of most DSP functions
- Operation control: to support different operations
- Addition: for multiply-accumulation
- Product register: for performance, sample delays, and accumulation
- Cascade paths (not shown): for chaining together DSP slices

[Figure: DSP slice block diagram showing input conditioning with 2-deep A and B registers, the multiplier and M register, the X, Y, and Z multiplexers controlled by OPMODE, the adder controlled by ALUMODE and CARRYIN, the C register, the 17-bit shifts on the Z multiplexer inputs, the P register, and the pattern detector.]

FIR Filter Mapped to DSP Slices

- The input time-delay series is created inside the DSP slices for maximum performance, irrespective of the number of coefficients.
- This structure is commonly called a systolic FIR filter. Strictly speaking, it is a Direct Form Type I filter with one extra stage of pipelining, but the term systolic FIR filter keeps things simpler.
- The structure uses the cascade paths to exploit the DSP slice architecture. The input data is fed into a cascade of registers that acts as the data buffer. Each register delivers a sample to a multiplier, where it is multiplied by the respective coefficient. The adder chain holds the partial results, which are gradually combined to form the final result.
- No external logic is required to support the filter, and the structure can be extended to support any number of coefficients.
- Coefficients are applied from left to right (K0, K1, K2, K3), so the latency grows with the number of coefficients.
- Dedicated cascade connections (PCOUT and PCIN) are exploited to achieve maximum performance.

[Figure: four cascaded DSP48E1 slices with coefficients K0 to K3, input x(n), and output y(n).]
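The register behavior described above can be sketched as a cycle-by-cycle Python model. This is an illustrative sketch under stated assumptions: the input cascade is modeled as two register delays per slice, each slice has one P register, and the names `systolic_fir`, `delay`, and `p` are hypothetical. It is not a description of the exact DSP48E1 register settings.

```python
def systolic_fir(samples, coeffs):
    """Cycle-by-cycle model of a systolic FIR: returns the last P register each clock."""
    n = len(coeffs)
    delay = [0] * (2 * n)  # input register cascade, two registers per slice (assumed)
    p = [0] * n            # one P register per slice
    out = []
    for x in samples:
        # All registers update on the same clock edge, so compute the next
        # P values from the current register contents before shifting.
        new_p = [(p[j - 1] if j > 0 else 0) + coeffs[j] * delay[2 * j]
                 for j in range(n)]
        delay = [x] + delay[:-1]
        p = new_p
        out.append(p[-1])
    return out


if __name__ == "__main__":
    # Impulse response of a 2-tap filter appears after a 2-clock latency.
    print(systolic_fir([1, 0, 0, 0, 0], [1, 2]))  # [0, 0, 1, 2, 0]
```

In this model the output equals the direct-form result delayed by one clock per coefficient, which matches the point that latency grows with the number of coefficients.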

The DSP slice is not only useful for DSP operations. Many other operations can be implemented using this structure:

- Multiplication
- Wide adders/accumulators
- Pipelined adders
- Wide multiplexers

This example illustrates how to rearrange a multi-input pipelined adder tree into an adder chain, which is better suited for DSP48 slices:

- START: This is the typical adder tree found in many signal processing designs.
- Step 1: Remove all pipelining from the tree. This makes the changes easier to understand and visualize.
- Step 2: Rearrange the tree. Notice that the functionality has not changed; the diagram has just been redrawn.
- Step 3: Pipelining is required for performance. Adding a register in the chain requires adding one in the data path delay as well. Determining the mapping to DSP48E slices is now easy.


Summary

- All 7 series FPGAs contain the same DSP48E1 cell
- The DSP48E1 is identical to the one used in the Virtex-6 FPGA
- The DSP48E1 cell has the following features:
  - 25x18 signed multiplier
  - 48-bit add/subtract/accumulate
  - Pipeline registers for high speed
  - Pattern detector
  - SIMD operators
  - Cascade paths
  - 25-bit pre-adder
  - Dynamic pipeline control

- Topics covered: slice description, design considerations, how to design to optimize for power and performance, how to use advanced design techniques, and design recommendations for XST
- XST User Guide: this guide has example inferences of many architectural resources. Refer to the Coding Techniques chapter.
- Xilinx Education Services courses: Xilinx tools and architecture courses, and other free videos

7 Series DSP Resources Part 2

Objectives

After completing this module, you will be able to:

- Describe the DSP slice in the 7 series FPGAs


7 Series DSP48E1 Slice

All 7 series FPGAs contain the same DSP48E1 slice. Only the number of slices and the maximum frequency vary from family to family.

- 25x18 signed multiplier
- 48-bit add/subtract/accumulate
- 48-bit logic operations
- Pipeline registers for high speed
- Pattern detector
- SIMD operations (12/24 bit)
- Cascade paths for wide functions
- Pre-adder

C' denotes the output of the optional C pipeline register. This bus is one input of the pattern detect multiplexer; the other input is the 48-bit PATTERN attribute.

[Figure: DSP48E1 slice block diagram with the dual B register, the dual A and D registers with pre-adder, the 25x18 multiplier and M register, the X, Y, and Z multiplexers (including the 17-bit right-shifted inputs), the adder, the P register, and the pattern detector. Control inputs: INMODE, OPMODE, ALUMODE, CARRYINSEL, CARRYIN. Cascade inputs: ACIN, BCIN, PCIN, CARRYCASCIN, MULTSIGNIN. Cascade outputs: ACOUT, BCOUT, PCOUT, CARRYCASCOUT, MULTSIGNOUT.]

X, Y, and Z Multiplexers

- The adder/subtractor operates on the X, Y, Z, and CIN operands
- The table shows the basic operations
- The X, Y, and Z multiplexers allow for dynamic OPMODEs
- The multiplier output requires both the X and Y multiplexers
- The P and PCIN inputs of the Z multiplexer can be used as is or 17-bit right shifted with MSB fill for multi-precision arithmetic

| ALUMODE | Operation |
|---|---|
| 0000 | Z + X + Y + CIN |
| 0001 | -Z + (X + Y + CIN) - 1 |
| 0010 | -Z - X - Y - CIN - 1 |
| 0011 | Z - (X + Y + CIN) |
| Others | Logic operations |
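The four arithmetic rows of the ALUMODE table can be exercised with a small Python model. This is a sketch for illustration only: the function names are hypothetical, and wrapping the result to 48 bits is an assumption based on the 48-bit adder width stated in the slides.

```python
MASK48 = (1 << 48) - 1


def alu_arith(alumode, x, y, z, cin=0):
    """Model the arithmetic ALUMODE settings from the table, wrapped to 48 bits."""
    s = x + y + cin
    if alumode == 0b0000:
        result = z + s
    elif alumode == 0b0001:
        result = -z + s - 1
    elif alumode == 0b0010:
        result = -z - s - 1
    elif alumode == 0b0011:
        result = z - s
    else:
        raise ValueError("other ALUMODE values select logic operations")
    return result & MASK48


def to_signed48(value):
    """Interpret a 48-bit pattern as a two's-complement signed integer."""
    return value - (1 << 48) if value >> 47 else value


if __name__ == "__main__":
    print(alu_arith(0b0000, x=5, y=3, z=10))               # 18
    print(to_signed48(alu_arith(0b0011, x=5, y=3, z=1)))   # -7
```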

OPMODE

- Controls the behavior of the X, Y, and Z multiplexers
- The OPMODE of each DSP48E1 slice is individually controllable
- OPMODE can change dynamically on each cycle

Apply Your Knowledge

1) Given the OPMODE table, what is the OPMODE for the following functions?

- C + A:B: OPMODE = ______ or ______
- A*B + C: OPMODE = ______
- P + C + PCIN: OPMODE = ______
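Since the OPMODE table itself is not reproduced in this transcript, the following Python sketch may help when working through the exercise. The multiplexer select codes are an assumption recalled from the DSP48E1 user guide; only the codes for 0, M, C, P (on Z), and A:B are corroborated by the OpMode(Z,Y,X) comments in the inference examples of this document. The helper name `opmode` and the dictionaries are hypothetical.

```python
# Assumed select encodings; verify against the DSP48E1 user guide before use.
X_SEL = {"0": 0b00, "M": 0b01, "P": 0b10, "A:B": 0b11}
Y_SEL = {"0": 0b00, "M": 0b01, "C": 0b11}
Z_SEL = {"0": 0b000, "PCIN": 0b001, "P": 0b010, "C": 0b011}


def opmode(z, y, x):
    """Pack Z (bits 6:4), Y (bits 3:2), and X (bits 1:0) into a 7-bit OPMODE string."""
    if (x == "M") != (y == "M"):
        raise ValueError("the multiplier output requires both the X and Y multiplexers")
    return format((Z_SEL[z] << 4) | (Y_SEL[y] << 2) | X_SEL[x], "07b")


if __name__ == "__main__":
    print(opmode("P", "M", "M"))  # multiply-accumulate, P + A*B: 0100101
```

Writing the selection as three named fields rather than a raw 7-bit constant makes it easier to see that some functions have more than one valid encoding, because the same operand can be routed through different multiplexers.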

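Dual B Register

- The B input to the multiplier is controlled by INMODE[4], which dynamically selects the B1 or B2 pipeline level
- The B input to the X multiplexer and the BCOUT cascade output are statically controlled by bitstream options

[Figure: dual B register. B or BCIN feeds registers B1 and B2; bitstream-controlled multiplexers drive the X multiplexer input and BCOUT (18 bits each); the dynamically controlled multiplexer selected by INMODE[4] drives the 18-bit multiplier input.]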

Dual A, D Registers and Pre-Adder

- The A input to the multiplier is controlled by INMODE[3:0], which:
  - Dynamically selects the A1 or A2 pipeline level
  - Dynamically selects add or subtract
  - Dynamically selects zero for A or D
- ACOUT and the X multiplexer input are statically controlled by bitstream options

The pre-adder doubles the efficiency of symmetrical filters and convolutions over previous technologies. Fine-grain access to the A and B pipelines optimizes the implementation of certain algorithms, such as short FFTs and sequential complex multiplications.

[Figure: dual A and D registers with pre-adder. A or ACIN (30 bits) feeds registers A1 and A2; bitstream-controlled multiplexers drive ACOUT and the X multiplexer input (30 bits each); D (25 bits) feeds the D register; the pre-adder output passes through the AD register to the 25-bit multiplier input; INMODE[0], INMODE[1], INMODE[2], and INMODE[3] provide the dynamic control.]

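Two-Input Logic Functions

- ALUMODE selects 48-bit logic operations: XOR, XNOR, AND, NAND, OR, NOR, NOT
- The operands are taken from the X multiplexer (P or A:B) and the Z multiplexer (P, PCIN, or C); OPMODE[3:2] sets the Y multiplexer to all zeros (00) or all ones (10)

| Logic unit mode | OPMODE[3:2] | ALUMODE[3:0] |
|---|---|---|
| X XOR Z | 00 | 0100 |
| X XNOR Z | 00 | 0101 |
| X XNOR Z | 00 | 0110 |
| X XOR Z | 00 | 0111 |
| X AND Z | 00 | 1100 |
| X AND (NOT Z) | 00 | 1101 |
| X NAND Z | 00 | 1110 |
| (NOT X) OR Z | 00 | 1111 |
| X OR Z | 10 | 1100 |
| X OR (NOT Z) | 10 | 1101 |
| X NOR Z | 10 | 1110 |
| (NOT X) AND Z | 10 | 1111 |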
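The logic unit modes can be modeled as a lookup from (OPMODE[3:2], ALUMODE[3:0]) to a 48-bit bitwise function. This Python sketch covers only the rows that are legible in the slide; the names `LOGIC_OPS` and `logic_unit` are hypothetical.

```python
MASK48 = (1 << 48) - 1

# Keys are (OPMODE[3:2], ALUMODE[3:0]); values operate on the X and Z operands.
LOGIC_OPS = {
    (0b00, 0b0100): lambda x, z: x ^ z,      # X XOR Z
    (0b00, 0b0101): lambda x, z: ~(x ^ z),   # X XNOR Z
    (0b00, 0b1100): lambda x, z: x & z,      # X AND Z
    (0b00, 0b1101): lambda x, z: x & ~z,     # X AND (NOT Z)
    (0b00, 0b1110): lambda x, z: ~(x & z),   # X NAND Z
    (0b00, 0b1111): lambda x, z: ~x | z,     # (NOT X) OR Z
    (0b10, 0b1100): lambda x, z: x | z,      # X OR Z
    (0b10, 0b1101): lambda x, z: x | ~z,     # X OR (NOT Z)
    (0b10, 0b1110): lambda x, z: ~(x | z),   # X NOR Z
    (0b10, 0b1111): lambda x, z: ~x & z,     # (NOT X) AND Z
}


def logic_unit(opmode_3_2, alumode, x, z):
    """Apply the selected two-input logic function and keep the low 48 bits."""
    return LOGIC_OPS[(opmode_3_2, alumode)](x, z) & MASK48


if __name__ == "__main__":
    print(hex(logic_unit(0b00, 0b1100, 0xF0F0, 0xFF00)))  # 0xf000
    print(hex(logic_unit(0b10, 0b1100, 0xF0F0, 0xFF00)))  # 0xfff0
```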

Pattern Detect and SIMD

Pattern detection

- Pattern and mask operation on the output of the adder
- The pattern can be constant (set by attribute) or taken from the C input
- Enables:
  - Symmetric rounding for multi-precision operations
  - Convergent rounding
  - Saturation
  - Accumulator terminal count

SIMD operations

- The 48-bit adder is broken into 2 x 24 bits or 4 x 12 bits
- Allows two or four independent additions to be done
- Carry bits are brought out independently and disabled between sections
- Carry bits can be cascaded between DSP48E1 slices
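The SIMD behavior, independent lanes with the carry blocked between sections and brought out separately, can be sketched in Python as follows. The function name `simd_add` and the lane ordering (lane 0 is the least significant) are assumptions made for illustration.

```python
def simd_add(x, z, lanes=4):
    """Add two 48-bit words as `lanes` independent fields (4 x 12 or 2 x 24 bits).

    Returns the packed 48-bit result and the per-lane carry-out bits,
    with lane 0 being the least significant field.
    """
    if lanes not in (2, 4):
        raise ValueError("the 48-bit adder splits into 2 x 24 or 4 x 12 bits")
    width = 48 // lanes
    mask = (1 << width) - 1
    result = 0
    carries = []
    for lane in range(lanes):
        shift = lane * width
        total = ((x >> shift) & mask) + ((z >> shift) & mask)
        result |= (total & mask) << shift   # carry does not ripple into the next lane
        carries.append(total >> width)      # carry is brought out per lane instead
    return result, carries


if __name__ == "__main__":
    value, carry = simd_add(0xFFF_001_800_00F, 0x001_002_800_001)
    print(hex(value), carry)  # 0x3000010 [0, 1, 0, 1]
```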

Cascade Paths

- Cascade paths exist from each DSP48E1 slice to the slice above it: A input, B input, P output, and carry out
- The P cascade path can be shifted by 17 bits by the slice above
- Enables common functions with little or no additional resources: wider accumulators, wider multipliers, complex multipliers, and FIR filters

Example: 35-bit x 25-bit multiplier with two DSP48E1s

| Slice | A input (25 bits) | B input (18 bits) | Added to product | ALUMODE | Output |
|---|---|---|---|---|---|
| DSP48_0 | A[24:0] | 0,B[16:0] | nothing | 0000 | P[16:0] = OUT[16:0] |
| DSP48_1 | ACIN (A cascaded from DSP48_0) | B[34:17] | P of DSP48_0, shifted by 17 | 0000 | P[42:0] = OUT[59:17] |
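The two-slice multiplier example can be verified numerically. The Python sketch below mirrors the data flow in the example, assuming signed two's-complement operands and an arithmetic 17-bit right shift on the cascaded P value; the function name `mult_35x25` is hypothetical.

```python
def mult_35x25(a, b):
    """Multiply a signed 25-bit `a` by a signed 35-bit `b` as two DSP-sized products."""
    assert -(1 << 24) <= a < (1 << 24), "a must fit in 25 signed bits"
    assert -(1 << 34) <= b < (1 << 34), "b must fit in 35 signed bits"
    b_lo = b & 0x1FFFF           # {0, B[16:0]}: lower 17 bits, zero-padded (unsigned)
    b_hi = b >> 17               # B[34:17]: upper 18 bits, signed
    p0 = a * b_lo                # DSP48_0 product
    p1 = a * b_hi + (p0 >> 17)   # DSP48_1: product plus cascaded P shifted by 17
    return (p1 << 17) + (p0 & 0x1FFFF)  # OUT[59:17] from DSP48_1, OUT[16:0] from DSP48_0


if __name__ == "__main__":
    a, b = -12345678, 9876543210
    print(mult_35x25(a, b) == a * b)  # True
```

The zero padding on the lower word matters: B[16:0] carries no sign information, so it must be presented to the signed multiplier as a non-negative 18-bit value.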


7 Series DSP Resources Part 3

Objectives

After completing this module, you will be able to:

- Describe the basic usage models of DSP slices
- Describe the DSP slice in the 7 series FPGAs


Pre-Adder

The pre-adder can add or subtract the two 25-bit operands on the A and D inputs before the result drives the multiplier.

Benefits:

- Perfect for operations using symmetrical coefficients
- Doubles the efficiency of symmetric FIR, symmetric IIR, and transpose convolution filters
- Half the power consumption compared to architectures without a pre-adder
- Smaller total logic footprint

A small change with a big benefit.

Symmetrical Filters

When the coefficients are symmetrical:

- The pre-adders either reduce the number of multiplications by 50% or double the sample rate
- Factorizing the taps replaces one multiplication with a pre-addition (or pre-subtraction)

Symmetrical filter example, using coefficients k13 and k17:

- Non-symmetrical filter (k13 ≠ k17): (tap13 × k13) + (tap17 × k17), which takes two multiplications and one post-add
- Symmetrical filter (k13 = k17): (tap13 + tap17) × k13, which takes one pre-add and one multiplication
- Direct benefit: saves 50% of the DSP slices

Six-Tap Transpose FIR Filter Without Pre-Adder

[Figure: the input x(n) passes through a cascade of two-register delays (z^-2) to produce x(n-2) through x(n-7); six multipliers apply the coefficients k0, k1, k2, k2, k1, k0; each product is registered (z^-1) and added into a registered adder chain that produces y(n-4).]

Uses six legacy DSP slices (without pre-adder).

Six-Tap Transpose FIR Filter Using the Pre-Adder

[Figure: the pre-adders form x(n-5)+x(n-4), x(n-6)+x(n-3), and x(n-7)+x(n-2); three multipliers apply the coefficients k2, k1, and k0; each product is registered (z^-1) and added into a registered adder chain that produces y(n-4).]

Optimized implementation supported by XST, using only three slices instead of six.
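The saving from the pre-adder can be illustrated with a behavioral Python sketch of an even-length symmetric FIR filter. It models only the arithmetic (pre-add, then one multiply per coefficient pair), not the pipeline registers in the figures; the function name `fir_symmetric` and the zero-initial-condition convention are assumptions.

```python
def fir_symmetric(samples, half_coeffs):
    """Even-length symmetric FIR using one multiplication per coefficient pair.

    The full coefficient set is `half_coeffs` followed by its mirror image,
    for example [k0, k1, k2] stands for [k0, k1, k2, k2, k1, k0].
    """
    n_taps = 2 * len(half_coeffs)

    def x(n):
        return samples[n] if n >= 0 else 0  # samples before the start are zero

    out = []
    for n in range(len(samples)):
        acc = 0
        for i, k in enumerate(half_coeffs):
            pre_add = x(n - i) + x(n - (n_taps - 1 - i))  # the two taps sharing k
            acc += pre_add * k                            # a single multiplication
        out.append(acc)
    return out


if __name__ == "__main__":
    # Impulse response reproduces the mirrored coefficients.
    print(fir_symmetric([1, 0, 0, 0, 0, 0, 0], [1, 2, 3]))  # [1, 2, 3, 3, 2, 1, 0]
```

For six taps the inner loop performs three multiplications per output sample instead of six, which corresponds to three DSP slices instead of six.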

Dynamic Pipeline Control

- The 7 series FPGA DSP slice has dynamic pipeline control on the A and B registers
- The user can select which of the two pipeline registers to use for calculations on a clock-by-clock basis

Benefits:

- Allows an operation to reuse the same operand in subsequent cycles

Application: Sequential Complex Multiply

(A + ai) * (B + bi) = (AB - ab) + (Ab + aB)i

- Use the dual A and B registers to locally store the real and imaginary parts of the operands
- Read each component of the complex operands out of memory only once
- Fewer memory reads are needed because A, a, B, and b are then stored locally
- Dynamic routing is controlled by an FSM updating the INMODE register on the fly
- Needs only four clocks for 18-bit data using a single slice

The real and imaginary parts of each operand are needed twice: once to calculate the real part of the result and once to calculate the imaginary part. With dynamic control, the operands can be read only once but used over several cycles to generate the result.

[Figure: A and a held in the A1 and A2 registers (clock enables CEA1 and CEA2), B and b held in the B1 and B2 registers (CEB1 and CEB2), INMODE selecting the multiplier operands, followed by the multiplier, the M register, and the adder; USE_DPORT = FALSE.]
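The four-clock schedule can be sketched as four successive multiply or multiply-accumulate steps on the locally stored operands. The ordering of the steps below is an assumption chosen for illustration; the slide only states that four clocks are needed. The function name is hypothetical.

```python
def complex_multiply_4cycle(a_re, a_im, b_re, b_im):
    """Compute (a_re + a_im*i) * (b_re + b_im*i) with one multiplier over four steps."""
    # Clock 1: P = A * B
    p = a_re * b_re
    # Clock 2: P = P - a * b  -> real part (AB - ab)
    p = p - a_im * b_im
    real = p
    # Clock 3: P = A * b (start a new accumulation)
    p = a_re * b_im
    # Clock 4: P = P + a * B  -> imaginary part (Ab + aB)
    p = p + a_im * b_re
    imag = p
    return real, imag


if __name__ == "__main__":
    print(complex_multiply_4cycle(3, 4, 5, -2))  # (23, 14)
```

Each operand appears in two of the four steps, which is why holding A, a, B, and b in the slice registers avoids a second memory read.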

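Application: Sequential Large Multiply

Four-step large multiplication, 42 bits * 34 bits:

$$(A{:}a) \times (B{:}b) = A \cdot B + \mathrm{sh17}\big(A \cdot 0b + B \cdot a + \mathrm{sh17}(0b \cdot a)\big)$$

Here A:a and B:b are the operands split into an upper word and a lower 17-bit word, 0b is the lower word padded with a leading zero, and sh17 is a 17-bit right shift of the accumulated value.

Needs only four clocks for 18-bit data using a single slice.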
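The four-step decomposition can be checked with a Python model that accumulates the partial products from least to most significant and shifts the accumulator right by 17 bits between weights. The operand split (signed upper word, unsigned lower 17 bits) and the function name are assumptions made for illustration.

```python
MASK17 = (1 << 17) - 1


def large_multiply_4step(x, y):
    """Multiply a 42-bit signed x by a 34-bit signed y using four partial products."""
    A, a = x >> 17, x & MASK17   # A: signed upper 25 bits, a: unsigned lower 17 bits
    B, b = y >> 17, y & MASK17   # B: signed upper 17 bits, b: unsigned lower 17 bits

    acc = a * b                  # step 1: lowest-weight product
    low0 = acc & MASK17          # result bits [16:0]
    acc = (acc >> 17) + A * b    # step 2: shift by 17, add first middle product
    acc = acc + B * a            # step 3: add second middle product
    low1 = acc & MASK17          # result bits [33:17]
    acc = (acc >> 17) + A * B    # step 4: shift by 17, add highest-weight product
    return (acc << 34) + (low1 << 17) + low0


if __name__ == "__main__":
    x, y = -(1 << 41) + 12345, (1 << 33) - 7
    print(large_multiply_4step(x, y) == x * y)  # True
```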


IP Support and Inference

- Some basic functions can be inferred
  - Examples: multiplier, multiply-accumulate, ...
- Other functions are supported through the CORE Generator™ interface
  - Examples: FFT, FIR Compiler, and DDS Compiler
- New IP cores become available with each service pack
  - Visit the IP Center for information on the newest IP cores

Inferring a 16 x 16 Multiplier

VHDL:

    -- Example: 16x16 multiplier, inputs and output registered once
    -- Matches 1 DSP48 slice
    -- OpMode(Z,Y,X):Subtract
    --   (xxx,01,01):0
    p1 <= a1 * b1;

    process (clk) is
    begin
      if clk'event and clk = '1' then
        if rst = '1' then
          a1 <= (others => '0');
          b1 <= (others => '0');
          p  <= (others => '0');
        elsif ce = '1' then
          a1 <= a;
          b1 <= b;
          p  <= p1;
        end if;
      end if;
    end process;

Verilog:

    // Example: 16x16 multiplier, inputs and output registered once
    // Matches 1 DSP48 slice
    // OpMode(Z,Y,X):Subtract
    //   (xxx,01,01):0
    assign p1 = a1 * b1;

    always @(posedge clk)
      if (rst == 1'b1) begin
        a1 <= 0;
        b1 <= 0;
        p  <= 0;
      end else if (ce == 1'b1) begin
        a1 <= a;
        b1 <= b;
        p  <= p1;
      end

Inferring a Multiply Accumulate (MACC)

VHDL:

    -- Example: Multiply-add function, single level of registers
    -- Matches 1 DSP48 slice
    -- OpMode(Z,Y,X):Subtract
    --   (011,01,01):0
    p1 <= a * b + c;

    process (clk) is
    begin
      if clk'event and clk = '1' then
        if rst = '1' then
          p <= (others => '0');
        elsif ce = '1' then
          p <= p1;
        end if;
      end if;
    end process;

Verilog:

    // Example: Multiply-add function, single level of registers
    // Matches 1 DSP48 slice
    // OpMode(Z,Y,X):Subtract
    //   (011,01,01):0
    assign p1 = a * b + c;

    always @(posedge clk)
      if (rst == 1'b1)
        p <= 0;
      else if (ce == 1'b1)
        p <= p1;

VHDL:

    -- Example: 16-bit adder, 2 inputs, inputs and output registered once
    -- Mapping to DSP48 should be driven by timing, as DSP48 slices are
    -- limited resources. The -use_dsp48 XST switch must be set to YES.
    -- Matches 1 DSP48 slice
    -- OpMode(Z,Y,X):Subtract
    --   (000,11,11):0 or
    --   (011,00,11):0
    p1 <= a1 + b1;

    process (clk) is
    begin
      if clk'event and clk = '1' then
        if rst = '1' then
          p  <= (others => '0');
          a1 <= (others => '0');
          b1 <= (others => '0');
        elsif ce = '1' then
          a1 <= a;
          b1 <= b;
          p  <= p1;
        end if;
      end if;
    end process;

Verilog:

    // Example: 16-bit adder, 2 inputs, inputs and output registered once
    // Mapping to DSP48 should be driven by timing, as DSP48 slices are
    // limited resources. The -use_dsp48 XST switch must be set to YES.
    // Matches 1 DSP48 slice
    // OpMode(Z,Y,X):Subtract
    //   (000,11,11):0 or
    //   (011,00,11):0
    assign p1 = a1 + b1;

    always @(posedge clk)
      if (rst == 1'b1) begin
        p  <= 0;
        a1 <= 0;
        b1 <= 0;
      end else if (ce == 1'b1) begin
        a1 <= a;
        b1 <= b;
        p  <= p1;
      end

VHDL:

    -- Example: Loadable multiply-accumulate with one level of registers
    -- Maps into 1 DSP48 slice
    -- Function: OpMode(Z,Y,X):Subtract
    --   load     (011,00,00):0
    --   mult_acc (010,01,01):0
    -- Restriction: Since the C input of the DSP48 slice is used, the
    -- adjacent DSP cannot use a different C input (C inputs are
    -- shared between 2 adjacent DSP48 slices)
    -- Expected mapping:
    --   AREG: no, BREG: no, CREG: no, MREG: no, PREG: yes
    with load select
      p_tmp <= signed(c)       when '1',
               p_reg + a1 * b1 when others;

    process (clk)
    begin
      if clk'event and clk = '1' then
        if p_rst = '1' then
          p_reg <= (others => '0');
          a1    <= (others => '0');
          b1    <= (others => '0');
        elsif p_ce = '1' then
          p_reg <= p_tmp;
          a1    <= signed(a);
          b1    <= signed(b);
        end if;
      end if;
    end process;

    p <= std_logic_vector(p_reg);

Verilog:

    // Example: Loadable multiply-accumulate with one level of registers
    // Maps into 1 DSP48 slice
    // Function: OpMode(Z,Y,X):Subtract
    //   load     (011,00,00):0
    //   mult_acc (010,01,01):0
    // Restriction: Since the C input of the DSP48 slice is used, the
    // adjacent DSP cannot use a different C input (C inputs are
    // shared between 2 adjacent DSP48 slices)
    // Expected mapping:
    //   AREG: no, BREG: no, CREG: no, MREG: no, PREG: yes
    assign p_tmp = load ? c : p + a1 * b1;

    always @(posedge clk)
      if (p_rst == 1'b1) begin
        p  <= 0;
        a1 <= 0;
        b1 <= 0;
      end else if (p_ce == 1'b1) begin
        p  <= p_tmp;
        a1 <= a;
        b1 <= b;
      end


Summary

- All 7 series FPGAs contain the same DSP48E1 cell
- The DSP48E1 is identical to the one used in the Virtex-6 FPGA
- The DSP48E1 cell has the following features:
  - 25x18 signed multiplier
  - 48-bit add/subtract/accumulate
  - Pipeline registers for high speed
  - Pattern detector
  - SIMD operators
  - Cascade paths
  - 25-bit pre-adder
  - Dynamic pipeline control
- DSP48E1 slices can be inferred, instantiated, or accessed using IP cores