4. What is DA?
Distributed arithmetic (DA) performs multiplication using look-up tables (LUTs).
It is used to implement multipliers in LUT-rich FPGAs.
5. Two's Complement Multiplication
One bit at a time:
The multiplicand (in this example, -127) is sign-extended and added to the partial sum for every multiplier bit except the last.
For the last multiplier bit (the sign bit), the multiplicand is subtracted instead, which handles negative multipliers.
For each multiplier bit, one product bit is determined and output.
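The add-then-subtract rule above can be sketched in a few lines of Python. This is a minimal illustration, not the slide's hardware: the function name `bit_serial_multiply` and the `bits` parameter are assumptions, and the sketch shifts the multiplicand left where a hardware accumulator would typically shift the partial sum right (the arithmetic result is the same).

```python
def bit_serial_multiply(multiplicand: int, multiplier: int, bits: int = 8) -> int:
    """Multiply two signed integers by scanning the multiplier one bit at a time.

    `multiplier` must fit in `bits`-bit two's complement.
    """
    assert -(1 << (bits - 1)) <= multiplier < (1 << (bits - 1))
    pattern = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    partial_sum = 0
    for i in range(bits):
        if (pattern >> i) & 1:
            # Python integers are unbounded, so the shift keeps the sign
            # (this plays the role of sign extension in hardware).
            term = multiplicand << i
            if i == bits - 1:
                partial_sum -= term  # last (sign) bit has weight -2^(bits-1)
            else:
                partial_sum += term
    return partial_sum


print(bit_serial_multiply(-127, 5))   # multiplicand from the slide, positive multiplier
print(bit_serial_multiply(-127, -3))  # negative multiplier exercises the subtraction
```

The subtraction on the last bit is what makes negative multipliers work: in two's complement the most significant bit carries a negative weight.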
6. SDA 1-Tap FIR Filter
[Figure: N-bit-wide sample data X0 passes through a parallel-to-serial converter to address input A0 of the partial product ROM; the ROM output feeds the scaling accumulator (+/- adder with Z-1 feedback).]
LUT contains two locations: address 0 holds 0, address 1 holds C0.
Distributed arithmetic is based on saving partial products in memories. Because the coefficients are known ahead of time, it is possible to pre-calculate the result of a multiplication.
In this example, we are looking at a 1-tap FIR filter. For each serialized data bit, the result of the multiplication is either 0 x coef or 1 x coef. Hence the LUT, used in ROM mode, is initialized with 0 at location 0 and C0 at location 1.
The next step takes this further, to 2 taps.
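7. Distributed Arithmetic for a 2-Tap Filter (Serial-Data / Tap-Parallel Multiply)
Partial products of equal weight are added together before being summed to the next higher partial product weight.
Create a look-up table of summed partial products.
Example: C0 = (-7), C1 = (6), X0 = (7), X1 = (5)
  Conventional order: C0 x X0 + C1 x X1 = (-49) + (30) = (-19)
  DA order, by bit weight of the data:
    bit 0 (weight 1): C0 + C1       = (-1)
    bit 1 (weight 2): C0 x 2        = (-14)
    bit 2 (weight 4): (C0 + C1) x 4 = (-4)
    bit 3 (weight 8): 0             = (0)
    sum                             = (-19)
[Figure legend: sign extension bits are marked.]
This basically involves changing the order of the computations. Calculate the partial products formed by multiplying bit 0 of each sample by the first coefficient and the second coefficient, then add them together; do the same for each higher bit, weighting each sum by its bit position.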
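The reordering can be checked numerically with the slide's own values. The following sketch assumes a 4-bit word length (the variable `BITS` is an assumption) and relies on both samples being non-negative, so no sign-bit subtraction is needed here.

```python
C = [-7, 6]  # coefficients C0, C1 from the slide
X = [7, 5]   # data samples X0, X1 from the slide
BITS = 4     # assumed word length; both samples are non-negative

# Conventional order: one full multiply per tap, then sum.
conventional = sum(c * x for c, x in zip(C, X))

# DA order: for each bit weight, add the coefficients whose data bit is 1,
# then scale that sum by the bit weight.
weighted = []
for b in range(BITS):
    column_sum = sum(c for c, x in zip(C, X) if (x >> b) & 1)
    weighted.append(column_sum << b)

da_result = sum(weighted)
print(weighted, da_result, conventional)  # [-1, -14, -4, 0] -19 -19
```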
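8. SDA 2-Tap FIR Filter
[Figure: N-bit-wide sample data X0 and, after a Z-1 delay, X1 are serialized onto address inputs A0 and A1 of the partial product ROM; the ROM output feeds the scaling accumulator (+/- adder with Z-1 feedback).]
LUT contains all possible sums of the partial products:
  A1 A0 = 00: 0
  A1 A0 = 01: C0
  A1 A0 = 10: C1
  A1 A0 = 11: C0 + C1
This is the 2-tap version. It shows the summed partial products at the ROM output.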
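A software model of this datapath might look like the sketch below. The names `build_lut` and `sda_inner_product` are hypothetical, and the model shifts each LUT output left by its bit position rather than shifting the accumulator right as a scaling accumulator would; the subtraction on the final bit corresponds to the +/- control in the diagram.

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of coeffs[k] for every set bit k of `addr`."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def sda_inner_product(lut, samples, bits):
    """Serial DA: one LUT access per clock, using one bit of every sample."""
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):
        # Bit b of sample k drives address line A_k.
        addr = 0
        for k, x in enumerate(samples):
            addr |= (((x & mask) >> b) & 1) << k
        term = lut[addr] << b
        # The last bit is the sign bit of the data, so its term is subtracted.
        acc = acc - term if b == bits - 1 else acc + term
    return acc


lut = build_lut([-7, 6])                   # [0, C0, C1, C0 + C1]
print(lut)                                 # [0, -7, 6, -1]
print(sda_inner_product(lut, [7, 5], 4))   # -19, matching the 2-tap example
print(sda_inner_product(lut, [-3, 5], 4))  # 51, a negative sample
```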
9. SDA 4-Tap FIR Filter
[Figure: N-bit-wide sample data X0..X3 are serialized onto address inputs A0..A3 of four partial product ROMs, each holding 0000...0 and one coefficient (C0..C3); adders sum the ROM outputs and feed the scaling accumulator (+/- adder with Z-1 feedback).]
Here is the 4-tap version, drawn with 4 ROMs, each using two of its 16 locations to store a coefficient and the value 0. But the LUT has four inputs, so the four ROMs and adders can be pre-programmed within a single 16x1 ROM, with the four address bits provided by the outputs of the parallel-to-serial converters.
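The equivalence between "four small ROMs plus adders" and "one 16-location table" can be checked exhaustively. In this sketch the coefficient values are hypothetical, chosen only for illustration.

```python
from itertools import product

coeffs = [3, -5, 2, 7]  # hypothetical C0..C3, for illustration only

# Four separate ROMs, each with two used locations: 0 and Ck.
small_roms = [[0, c] for c in coeffs]

# One 16-location table: address bit k selects whether Ck is included.
merged_rom = [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
              for addr in range(16)]

# For every combination of serialized data bits, both structures agree.
for bits in product((0, 1), repeat=4):  # bits[k] is the current bit of Xk
    adder_tree = sum(rom[b] for rom, b in zip(small_roms, bits))
    addr = sum(b << k for k, b in enumerate(bits))
    assert merged_rom[addr] == adder_tree

print(merged_rom)
```

Folding the adders into the table contents is what removes them from the datapath: the additions are done once, when the table is generated.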
10. SDA 8-Tap FIR Filter
[Figure: N-bit-wide sample data X0..X3 are serialized onto address inputs A0..A3 of one partial product ROM, and X4..X7 onto A0..A3 of a second; a pre-adder sums the two ROM outputs ahead of the scaling accumulator (+/- adder with Z-1 feedback).]
Each 4-input LUT contains all possible sums of the partial products.
Because the FPGA look-up tables have 4 inputs, taps are grouped by four in order to efficiently address the LUTs preloaded with partial products.
Based on the block diagram, there may be an advantage in using a multiple of 4 taps, to make full use of the distributed memory.
More design tricks are covered in the DSP Implementation Techniques course.
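The grouping by four can be modeled as below. The function names and the test coefficients are assumptions made for this sketch; the `pre_add` variable plays the role of the pre-adder in the diagram.

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of coeffs[k] for every set bit k of `addr`."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def grouped_sda(coeffs, samples, bits, group=4):
    """Serial DA with taps split into groups of `group`, one LUT per group."""
    luts = [build_lut(coeffs[i:i + group]) for i in range(0, len(coeffs), group)]
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):
        pre_add = 0  # pre-adder: sums the LUT outputs for this bit position
        for g, lut in enumerate(luts):
            chunk = samples[g * group:(g + 1) * group]
            addr = sum((((x & mask) >> b) & 1) << k for k, x in enumerate(chunk))
            pre_add += lut[addr]
        term = pre_add << b
        acc = acc - term if b == bits - 1 else acc + term  # sign bit subtracts
    return acc


coeffs = [1, -2, 3, -4, 5, -6, 7, -8]         # hypothetical 8-tap coefficients
samples = [10, -20, 30, -40, 50, -60, 70, -80]
print(grouped_sda(coeffs, samples, bits=8))
print(sum(c * x for c, x in zip(coeffs, samples)))  # reference dot product
```

With two groups the tables hold 2 x 16 = 32 entries in total, whereas a single table addressed by all 8 taps would need 2^8 = 256 entries, which is one way to see why grouping matches the 4-input LUT so well.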
11. Xilinx DA FIR Performance
[Figure, left: Sample Rate (MSPS) vs. Filter Length (Taps); curves: Single MAC; Serial FPGA FIR with DA FIR B=8, B=12, B=16.]
[Figure, right: Performance (MMACs/s) vs. Filter Length (Taps); curves: Dual MAC; Serial FPGA FIR with DA FIR B=8, B=12, B=16.]
fclk = 200 MHz for both processor and FPGA.
B = data sample precision for FPGA.
As the number of taps increases, the sample rate of a MAC-based filter decreases, roughly in inverse proportion to the filter length, since each output needs one multiply-accumulate per tap. A serial DA-based FIR filter has a constant sample rate, independent of the number of taps. For the DA FIR filter the sample rate depends on the sample size instead: as B increases, the sample rate decreases. Note that hardware resource usage is a function of both sample size and number of taps.
In the right-hand figure, performance is given in mega-MACs per second (MMACs/s).
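The trend in the left-hand chart can be approximated with a very simple model. Both functions below are assumptions for illustration, not values read from the chart: the MAC model assumes one multiply-accumulate per clock with no overhead, and the DA model assumes exactly one data bit per clock.

```python
def single_mac_sample_rate(fclk_hz: float, taps: int) -> float:
    """Assumed model: one MAC per clock, so `taps` clocks per output sample."""
    return fclk_hz / taps


def serial_da_sample_rate(fclk_hz: float, sample_bits: int) -> float:
    """Assumed model: one data bit per clock, so B clocks per output sample,
    regardless of the number of taps."""
    return fclk_hz / sample_bits


FCLK = 200e6  # from the slide: fclk = 200 MHz

for taps in (10, 50, 100, 250):
    mac_msps = single_mac_sample_rate(FCLK, taps) / 1e6
    da_msps = {b: serial_da_sample_rate(FCLK, b) / 1e6 for b in (8, 12, 16)}
    print(f"taps={taps:3d}  single MAC: {mac_msps:6.2f} MSPS  serial DA: {da_msps}")
```

Under these assumptions the DA rate is flat across filter lengths while the MAC rate falls as 1/taps, which is the qualitative behavior the slide describes.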
12. Trade Clock Cycles for Logic Area
[Figure: an 8-bit sample (b7..b0) processed with multiple bits per clock cycle, ranging from Serial-DA (20 Ms/s) to Parallel-DA (160 Ms/s).]
Hardware over-sampling = 8: the sample is serialized and processed 1 bit per clock cycle. 8 clock cycles are thus required to process the whole sample.
Hardware over-sampling = 4: the sample is serialized and processed 2 bits per clock cycle. 4 clock cycles are thus required to process the whole sample.
Hardware over-sampling = 2: the sample is serialized and processed 4 bits per clock cycle.
Hardware over-sampling = 1: the sample is processed in parallel, 8 bits per clock cycle.
Processing the data serially, one bit at a time, can result in slow computation rates. When the input variables are B bits in length, B clock cycles are required to complete an inner-product calculation. Additional speed may be obtained in several ways.
One approach is to partition the input words into L subwords and process these subwords in parallel. This method requires L times as many memory look-up tables, and so comes at the cost of a linear increase in storage requirements. Maximum speed is achieved by factoring the input variables into single-bit subwords. With this factoring, a new output sample is computed on each clock cycle. This factoring results in a fully parallel DA FIR (PDAFIR) architecture.
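The area-versus-cycles trade can be sketched by letting one model process a configurable number of bits per clock. The function `da_inner_product`, its return tuple, and the 160 MHz clock used in the printout are assumptions for illustration (160 MHz is simply the clock at which the figure's 20 Ms/s and 160 Ms/s endpoints would line up for 8-bit samples).

```python
def build_lut(coeffs):
    """Entry `addr` holds the sum of coeffs[k] for every set bit k of `addr`."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_inner_product(coeffs, samples, bits, bits_per_clock):
    """Return (result, clock_cycles, lut_copies) for a given parallelism."""
    assert bits % bits_per_clock == 0
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    acc = 0
    cycles = 0
    for base in range(0, bits, bits_per_clock):
        # In hardware these look-ups happen in the same clock cycle,
        # each in its own copy of the LUT.
        for b in range(base, base + bits_per_clock):
            addr = sum((((x & mask) >> b) & 1) << k for k, x in enumerate(samples))
            term = lut[addr] << b
            acc = acc - term if b == bits - 1 else acc + term
        cycles += 1
    return acc, cycles, bits_per_clock


coeffs, samples = [3, -5, 2, 7], [100, -50, 25, -12]  # hypothetical values
FCLK_MHZ = 160  # assumed clock
for p in (1, 2, 4, 8):
    result, cycles, copies = da_inner_product(coeffs, samples, 8, p)
    print(f"{p} bit(s)/clock: result={result}, cycles={cycles}, "
          f"LUT copies={copies}, rate={FCLK_MHZ / cycles} Ms/s")
```

The result is identical in every configuration; only the number of clock cycles and the number of LUT copies change, in opposite directions.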
13. Conclusion
Efficiency of computation
Slow, as it is bit-serial
Memory requirements
14. References
The Role of Distributed Arithmetic in FPGA-Based Signal Processing,