 # Distributed Arithmetic

Dr Sumam David S., Dept. of E&C, NITK Surathkal. Slides courtesy of Xilinx Professor's Workshop Resources.

Objective: Distributed arithmetic. What? Where? How?

What is DA? Multiplication using LUTs
Used to implement multipliers in LUT-rich FPGAs.

Two's Complement Multiplication
One bit at a time: the multiplicand (in this example, -127) is sign-extended and added to the partial sum for every multiplier bit that is set, except for the last multiplier bit. For the last multiplier bit (the sign bit), the multiplicand is subtracted instead, which handles negative multipliers. For each multiplier bit, one bit of the product is determined and output.
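The add-then-subtract rule above can be sketched in Python. This is only an illustrative model, not the hardware from the slide: the function name `bit_serial_multiply` and the 8-bit default width are assumptions, and the sketch shifts the multiplicand left instead of shifting the partial sum right and emitting one product bit per step.

```python
def bit_serial_multiply(multiplicand, multiplier, bits=8):
    """Multiply two signed integers by scanning the multiplier LSB first.

    `multiplier` must fit in `bits` bits of two's complement.
    """
    pattern = multiplier & ((1 << bits) - 1)  # two's complement bit pattern
    acc = 0
    for i in range(bits):
        if (pattern >> i) & 1:
            # Python integers are unbounded, so the shift sign-extends for us.
            term = multiplicand << i
            if i == bits - 1:
                acc -= term  # the sign bit carries weight -2^(bits-1)
            else:
                acc += term
    return acc


print(bit_serial_multiply(-127, 5))   # -635
print(bit_serial_multiply(-127, -3))  # 381
```

Subtracting on the final bit is what lets the same loop handle positive and negative multipliers without a separate sign correction.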

SDA 1-Tap FIR Filter
Block diagram: the N-bit-wide sample data X0 passes through a parallel-to-serial converter, whose 1-bit output drives address input A0 of the partial product ROM; the ROM output feeds a scaling accumulator (a +/- adder with a Z-1 register). The LUT contains two locations: 0 and C0.

Distributed arithmetic is based on saving partial products in memories. Because the coefficients are known ahead of time, it is possible to pre-calculate the result of a multiplication. In this example, we are looking at a 1-tap FIR filter. The result of the multiplication by one sample bit is either 0 x coef or 1 x coef. Hence the LUT, used in ROM mode, is initialized with 0 at location 0 and C0 at location 1. The next slides take this further, starting with 2 taps.

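Distributed Arithmetic for a 2-Tap Filter
Partial products of equal weight are added together before being summed with the partial products of the next higher weight, and a look-up table of the summed partial products is created (serial-data / tap-parallel multiply).

Example: C0 = (-7), C1 = (6), X0 = (7) = 0111, X1 = (5) = 0101. The conventional result is C0 x X0 + C1 x X1 = (-49) + (30) = (-19). Taken one bit position at a time: bit 0 is set in both samples, giving (C0 + C1) x 1 = (-1); bit 1 is set only in X0, giving C0 x 2 = (-14); bit 2 is set in both, giving (C0 + C1) x 4 = (-4); bit 3 is clear in both, giving (0). With sign extension of each term, (-1) + (-14) + (-4) + (0) = (-19), the same result.

This basically involves changing the order of the computations: calculate the partial products formed by multiplying bit 0 of each sample by the first coefficient and the second coefficient, add them together, and then repeat for each higher bit.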
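The reordering in the 2-tap example can be checked with a few lines of Python. The helper name `column_sums` and the 4-bit sample width are assumptions chosen to match the four terms shown on the slide; the sketch adds the equal-weight partial products first and then applies the bit weight.

```python
def column_sums(coeffs, samples, bits):
    """Return the weighted sum of equal-weight partial products per bit."""
    mask = (1 << bits) - 1
    terms = []
    for b in range(bits):
        # Add every coefficient whose sample has bit b set.
        column = sum(c for c, x in zip(coeffs, samples) if ((x & mask) >> b) & 1)
        # The top bit is the sign bit, so its weight is negative.
        weight = -(1 << b) if b == bits - 1 else (1 << b)
        terms.append(column * weight)
    return terms


terms = column_sums([-7, 6], [7, 5], bits=4)
print(terms, sum(terms))  # [-1, -14, -4, 0] -19
```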

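SDA 2-Tap FIR Filter
Block diagram: the serialized bits of the N-bit-wide samples X0 and X1 drive address inputs A0 and A1 of the partial product ROM, whose output feeds the scaling accumulator (a +/- adder with a Z-1 register). The LUT contains all possible sums of the partial products: with the address written as A1 A0, location 00 holds 0, 01 holds C0, 10 holds C1, and 11 holds C0 + C1. This is the 2-tap version, showing the partial products at the ROM output.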

SDA 4-Tap FIR Filter
Block diagram: the serialized bits of the N-bit-wide samples X0 to X3 drive address inputs A0 to A3; each partial product ROM outputs either 0000...0 or its coefficient (C0 to C3), adders sum the four ROM outputs, and the result feeds the scaling accumulator (a +/- adder with a Z-1 register).

Here is the 4-tap version drawn with 4 ROMs, each using only two of its 16 locations to store the coefficient and the 0 value. But the LUT has four inputs, so the four ROMs and adders are pre-programmed within a single 16x1 ROM, with the four address bits provided by the outputs of the parallel-to-serial converters.
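A minimal software model of the single-LUT, 4-tap arrangement might look like the following. The names `build_da_lut` and `sda_fir_output`, the coefficient values, and the 8-bit sample width are all assumptions for illustration; the loop stands in for the parallel-to-serial converters and the scaling accumulator.

```python
def build_da_lut(coeffs):
    """Precompute every coefficient sum; address bit k selects coeffs[k]."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def sda_fir_output(lut, samples, bits):
    """Serial DA inner product: one LUT read and one accumulate per bit."""
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):  # LSB first, as delivered by the serializers
        addr = 0
        for k, x in enumerate(samples):
            addr |= (((x & mask) >> b) & 1) << k
        partial = lut[addr] << b  # scale the looked-up sum by the bit weight
        # Subtract on the last (sign) bit, add on all others.
        acc += -partial if b == bits - 1 else partial
    return acc


coeffs = [3, -5, 2, 7]       # hypothetical C0..C3
lut = build_da_lut(coeffs)   # 16 entries for 4 taps
samples = [10, -4, 0, -128]  # hypothetical X0..X3, 8-bit signed
print(len(lut), sda_fir_output(lut, samples, bits=8))
```

Note that the table size doubles with every extra tap, which is why larger filters split the taps across several small LUTs.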

SDA 8-Tap FIR Filter
Block diagram: the serialized bits of samples X0 to X3 drive inputs A0 to A3 of one partial product ROM, and those of X4 to X7 drive inputs A0 to A3 of a second partial product ROM; a pre-adder sums the two ROM outputs ahead of the scaling accumulator (a +/- adder with a Z-1 register). Each 4-input LUT contains all possible sums of the partial products for its four taps.

Because the FPGA look-up tables have 4 inputs, taps are grouped by four in order to efficiently address the LUTs preloaded with partial products. Based on the block diagram, there may be an advantage in using a multiple of 4 taps to make full use of the distributed memory. More design tricks are covered in the DSP Implementation Techniques course.
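The grouping by four can be sketched as follows. This is an assumed software model, not the Xilinx core: `build_lut4` and `sda_fir_grouped` are hypothetical names, and the inner sum over the LUT outputs plays the role of the pre-adder.

```python
def build_lut4(coeffs4):
    """16-entry table of coefficient sums for one group of four taps."""
    return [
        sum(c for k, c in enumerate(coeffs4) if (addr >> k) & 1)
        for addr in range(16)
    ]


def sda_fir_grouped(coeffs, samples, bits):
    """Serial DA with one 4-input LUT per group of four taps."""
    assert len(coeffs) == len(samples) and len(coeffs) % 4 == 0
    luts = [build_lut4(coeffs[g:g + 4]) for g in range(0, len(coeffs), 4)]
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):
        pre_sum = 0  # pre-adder: combines the LUT outputs for this bit
        for g, lut in enumerate(luts):
            group = samples[4 * g:4 * g + 4]
            addr = sum((((x & mask) >> b) & 1) << k for k, x in enumerate(group))
            pre_sum += lut[addr]
        partial = pre_sum << b
        acc += -partial if b == bits - 1 else partial  # subtract on sign bit
    return acc


coeffs = [1, -2, 3, -4, 5, -6, 7, -8]            # hypothetical C0..C7
samples = [10, -20, 30, -40, 50, -60, 70, -128]  # hypothetical 8-bit samples
print(sda_fir_grouped(coeffs, samples, bits=8))
```

Two 16-entry tables hold 32 sums in total, whereas a single table addressed by all eight taps would need 256 entries; the cost of the split is the extra adder.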

Xilinx DA FIR Performance
Figure: two plots against filter length (taps). Left: sample rate (MSPS). Right: performance (MMACs/s). Curves: single MAC, dual MAC, and serial FPGA DA FIR with B = 8, B = 12, and B = 16. fclk = 200 MHz for both processor and FPGA; B = data sample precision for the FPGA.

As the number of taps increases, a MAC-based filter's sample rate decreases in inverse proportion to the filter length, whereas a serial DA-based FIR filter has a constant sample rate independent of the number of taps. In a DA FIR filter the sample rate depends on the sample size, so as B increases, the sample rate decreases. Note that the hardware resources are a function of both sample size and number of taps. In the right-hand figure, performance is given in mega MACs per second (MMACs/s).
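The trend described above follows from a simple cycle count. The sketch below is an idealized model under stated assumptions, not a reproduction of the plotted data: it assumes a MAC unit handles one tap per clock and a serial DA filter consumes exactly one sample bit per clock, with no overhead cycles. The function names are hypothetical.

```python
def mac_sample_rate(fclk_hz, taps, macs=1):
    """Idealized MAC-based FIR: each MAC unit processes one tap per clock."""
    return fclk_hz * macs / taps


def sda_sample_rate(fclk_hz, sample_bits):
    """Idealized serial DA FIR: one sample bit per clock, any tap count."""
    return fclk_hz / sample_bits


def equivalent_mmacs(taps, sample_rate_hz):
    """Multiply-accumulates per second implied by a filter, in millions."""
    return taps * sample_rate_hz / 1e6


fclk = 200e6  # clock frequency quoted on the slide
for taps in (50, 100, 250):
    print(taps, mac_sample_rate(fclk, taps) / 1e6, sda_sample_rate(fclk, 8) / 1e6)
```

Under these assumptions the MAC-based rate falls as 1/taps, while the DA rate stays at fclk / B, so the equivalent MAC throughput of the DA filter grows linearly with the filter length.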

Trade Clock Cycles for Logic Area
Figure: processing multiple bits per clock cycle moves an 8-bit sample (b7 to b0) from serial DA at 20 Ms/s to parallel DA at 160 Ms/s. Hardware over-sampling = 8: the sample is serialized and processed 1 bit per clock cycle, so 8 clock cycles are required to process the whole sample. Hardware over-sampling = 4: the sample is serialized and processed 2 bits per clock cycle, so 4 clock cycles are required. Hardware over-sampling = 2: the sample is serialized and processed 4 bits per clock cycle, so 2 clock cycles are required. Hardware over-sampling = 1: the sample is processed in parallel, 8 bits per clock cycle.

Processing the data serially, one bit at a time, can result in slow computation rates. When the input variables are B bits in length, B clock cycles are required to complete an inner-product calculation. Additional speed may be obtained in several ways. One approach is to partition the input words into L subwords and process these subwords in parallel. This method requires L times as many memory look-up tables and so comes at the cost of a linear increase in storage requirements. Maximum speed is achieved by factoring the input variables into single-bit subwords. With this factoring, a new output sample is computed on each clock cycle. This factoring results in a fully parallel DA FIR (PDAFIR) architecture.
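The cycle-versus-area trade can be modelled by letting each simulated clock cycle consume several bit positions. This is a hedged sketch: `da_fir_output` and its `bits_per_clock` parameter are hypothetical names, and the single Python table stands in for the replicated look-up tables that parallel hardware would need.

```python
def build_da_lut(coeffs):
    """Precompute every coefficient sum; address bit k selects coeffs[k]."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]


def da_fir_output(coeffs, samples, bits, bits_per_clock):
    """Return (inner product, clock cycles) for a given degree of parallelism."""
    assert bits % bits_per_clock == 0
    lut = build_da_lut(coeffs)  # hardware would hold bits_per_clock copies
    mask = (1 << bits) - 1
    acc = 0
    cycles = 0
    for base in range(0, bits, bits_per_clock):
        # Every bit position in this slice is handled in the same clock cycle.
        for b in range(base, base + bits_per_clock):
            addr = sum((((x & mask) >> b) & 1) << k for k, x in enumerate(samples))
            partial = lut[addr] << b
            acc += -partial if b == bits - 1 else partial  # subtract on sign bit
        cycles += 1
    return acc, cycles


coeffs = [3, -5, 2, 7]       # hypothetical coefficients
samples = [10, -4, 0, -128]  # hypothetical 8-bit signed samples
for bpc in (1, 2, 4, 8):
    print(bpc, da_fir_output(coeffs, samples, bits=8, bits_per_clock=bpc))
```

The result is identical for every setting; only the cycle count changes, from 8 cycles at 1 bit per clock down to 1 cycle when all 8 bits are processed at once.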

Conclusion: Efficiency of computation. Slow, as it is bit-serial.
Memory requirements.

References The role of Distributed Arithmetic in FPGA based signal processing,