A Bit-Serial Method of Improving Computational Efficiency of Dot-Products


1
A Bit-Serial Method of Improving Computational Efficiency of Dot-Products

2
• Distributed Arithmetic (DA) is a bit-serial technique that greatly reduces the resource requirements of the dot-product calculation
• So called because the arithmetic resources are not easily recognizable: "Where's the MAC module?"
• Takes advantage of small tables of pre-computed coefficient sums and a clever rearrangement of the math

3
• In signal processing the most common operation is the dot product
• DA lends itself well to FPGA implementation due to its use of lookup tables
• DA can reduce gate count by 50%-80% in signal processing arithmetic!

4
• It turns out that the dot product is used extensively in DSP (FIR, FFT, etc.)
• Recall that the dot product is a sum of products:
• Written as a summation:
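The two formulas on this slide did not survive in the transcript. A plausible reconstruction, assuming K terms with constant coefficients A_k and inputs x_k (the symbols used on the later slides), would be:

```latex
y = A_1 x_1 + A_2 x_2 + \cdots + A_K x_K
\qquad\Longleftrightarrow\qquad
y = \sum_{k=1}^{K} A_k x_k
```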

5
• Simple example: smoothing data via DSP (low-pass filter)
• Accomplished with an FIR filter. General form:
• So we could implement a "3-tap (K=3) moving average filter": (In this special case, A_1 = A_2 = A_3 = 0.33)
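As a rough software illustration of the 3-tap moving average, the sketch below assumes the usual FIR form y[n] = sum over k of A_k * x[n-k], with samples before the start of the data treated as zero. The function name `fir` and the sample data are hypothetical, and exact 1/3 coefficients are used in place of the rounded 0.33.

```python
def fir(coeffs, samples):
    """Direct-form FIR: y[n] = sum_k coeffs[k] * samples[n - k].

    Samples before the start of the data are treated as zero
    (an assumption; the slide does not say how edges are handled).
    """
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, a in enumerate(coeffs):
            if n - k >= 0:
                acc += a * samples[n - k]  # one multiply-accumulate per tap
        out.append(acc)
    return out


# 3-tap moving average: every output is one K=3 dot product.
smoothed = fir([1 / 3, 1 / 3, 1 / 3], [3.0, 6.0, 9.0, 6.0, 3.0])
```

Each output sample costs K multiply-accumulates here, which is the work DA replaces with table lookups.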

6
• Recall the goal:
• x is the filter input (digital!), so let's consider its two's complement representation (scaled so that |x| < 1 for cleanliness)
• Putting them together (N is the total number of bits):
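The equations for this slide are missing from the transcript. Assuming the notation of the White tutorial listed in the references, where b_{k0} is the sign bit of x_k and b_{kn} are its remaining N-1 bits, the representation and the substituted dot product would read:

```latex
x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn}\, 2^{-n}

y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn}\, 2^{-n} \right]
```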

7
• Expand the summation:
• We can precompute all terms that depend on the input data bits (b_k0 .. b_kK) and store them in a ROM of size 2^(K+1)
• The x inputs can then be used to address the ROM directly: a LUT!
Annotations on the equation: "Since b_kn is 0 or 1, this has only 2^K possible values" and "Two possible values"
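The expanded equation itself is missing from the transcript. A reconstruction, assuming the two's complement form in which b_{k0} is the sign bit of x_k and b_{kn} are its N-1 fractional bits, swaps the order of summation so that all K inputs contribute one bit at a time:

```latex
y = \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k\, b_{kn} \right] 2^{-n}
    + \sum_{k=1}^{K} A_k \left( -b_{k0} \right)
```

Under this reading, the bracketed inner sum is what takes only 2^K distinct values, and the final sign-bit term supplies the negated copies that double the table.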

8
• Non-DA hardware implementation, based on the original equation
[Figure: block diagram with labeled 8-bit multiplier and 8-bit adder]

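9
• We said this is a 'bit-serial' technique, so how can we perform multiplication?
• Here, x is a 4-bit input and A is an 8-bit constant
Example multiplication (x bits taken serially, LSB first):
x = 1011, A = 01011001

       1011001      (x bit 0 = 1)
      1011001       (x bit 1 = 1)
     0000000        (x bit 2 = 0)
  + 1011001         (x bit 3 = 1)
    ----------
    1111010011      (11 x 89 = 979)

[Figure: scaling accumulator. Labels: x, A, AND with 1 parallel and 1 serial input, result register, shift right by 1]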
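A minimal software model of the scaling accumulator, assuming unsigned operands and LSB-first serial input. The function name `bit_serial_multiply` is hypothetical. Each cycle gates A with the current bit of x, adds it at the top of the result register, and shifts the register right by 1.

```python
def bit_serial_multiply(x, a, x_bits):
    """Multiply unsigned x (x_bits wide) by constant a, one bit of x per cycle."""
    acc = 0
    for i in range(x_bits):
        bit = (x >> i) & 1            # serial input, LSB first
        partial = a if bit else 0     # AND of the serial bit with parallel A
        # Add at the top of the register, then shift right by 1.
        acc = (acc + (partial << x_bits)) >> 1
    return acc


product = bit_serial_multiply(0b1011, 0b1011001, 4)
```

After `x_bits` cycles the partial product for bit i has been shifted right exactly `x_bits - i` times, so it lands at weight 2^i and no low-order bits are lost.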

10
• So, now we substitute the scaling accumulator into our original design. Getting closer...

11
• Let's rearrange the hardware to match our expanded equation:
We first sum the products of each input bit and its constant
Then we add and scale each of those terms

12
• Now recall that we had the clever idea to use pre-computed sums in a LUT for the bitwise addition

Address  Data
0000     0
0001     C_0
0010     C_1
0011     C_0 + C_1
...      ...
1110     C_1 + C_2 + C_3
1111     C_0 + C_1 + C_2 + C_3
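The table can be generated mechanically. The sketch below assumes address bit i selects coefficient C_i (so address 0001 maps to C_0); `build_lut` and the example coefficient values are hypothetical.

```python
def build_lut(coeffs):
    """Return a 2^K entry table; entry addr holds the sum of coeffs[i]
    for every bit i that is set in addr."""
    k = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]


# K = 4 with made-up coefficients C_0..C_3 = 3, 5, 7, 11
lut = build_lut([3, 5, 7, 11])
```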

13
• We need to accommodate the negative term, so we add one more address line to the LUT, called T_s. ROM size is now 2^(K+1)
• T_s is a timing signal: T_s = 1 during sign-bit time, 0 otherwise
• We also need this bit to know when the final result is ready

Address  Data
10000    0
10001    -C_0
...      ...
11111    -(C_0 + C_1 + C_2 + C_3)

For all T_s = 1 addresses the ROM contains the negative of the appropriate sum

14
This is an example of K=4 DA dot-product hardware
ROM size = 2^(K+1) = 2^5 = 32
[Figure: DA hardware with ROM and scaling accumulator. Annotations: "Here is our scaling accumulator"; "Switch SWA in pos 2 after T_s = 1, at which point y contains the final result"]
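A behavioral sketch of this datapath, under stated assumptions: inputs are signed integers that fit in `n_bits` two's complement bits, and each lookup is weighted by 2^n instead of shifting the accumulator right, which gives the same result up to a fixed scale factor. The names `build_rom` and `da_dot` are hypothetical.

```python
def build_rom(coeffs):
    """2^(K+1) entries: lower half holds the sums, upper half (T_s = 1)
    holds the negated sums."""
    k = len(coeffs)
    sums = [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]
    return sums + [-s for s in sums]


def da_dot(coeffs, xs, n_bits):
    """Dot product of coeffs and xs using one ROM lookup per bit position."""
    k = len(coeffs)
    rom = build_rom(coeffs)
    acc = 0
    for n in range(n_bits):                  # one cycle per bit, LSB first
        ts = 1 if n == n_bits - 1 else 0     # T_s = 1 only at sign-bit time
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> n) & 1) << i      # bit n of every input forms the address
        acc += rom[(ts << k) | addr] << n    # weight the lookup by 2^n
    return acc


y = da_dot([3, -5, 7, 11], [5, -3, -8, 7], n_bits=4)
```

No multiplier appears anywhere in the loop: the work is N lookups and N additions regardless of K.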

15
• Computes an N-bit dot product in N cycles
• Reduced area and high speed due to the ROM
• However, requires a ROM of size 2^(K+1) (grows exponentially with the number of input lines)
• Input counts of 16 are common (K = 16) -> need a 2^17 = 128K-word ROM!

16
• Bit-serial means an N-bit dot product requires N cycles... Slower than parallel?
• K parallel HW multipliers (one per term) are not generally practical due to large area/power!
• Time-multiplexing a single parallel HW multiplier means you lose the speed gain: K cycles vs N cycles for DA
• Example: K=8, N=8 takes the same time on time-multiplexed parallel HW as on DA bit-serial

17
• We can reduce the ROM size to 2^K with some tricks
• There are other math tricks to reduce the size further to 2^(K-1)
Replace the adder with an adder/subtractor
T_s becomes the control line for the adder/subtractor
ROM size is reduced by half
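The adder/subtractor variant can be sketched as follows, assuming signed two's complement integer inputs of `n_bits` bits and weighting each lookup by 2^n rather than shifting the accumulator right. The name `da_dot_addsub` is hypothetical. Only the 2^K table of positive sums is stored, and T_s selects subtraction at sign-bit time. The further reduction to 2^(K-1) is not shown.

```python
def da_dot_addsub(coeffs, xs, n_bits):
    """DA dot product with a 2^K ROM and an adder/subtractor."""
    k = len(coeffs)
    lut = [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]
    acc = 0
    for n in range(n_bits):
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> n) & 1) << i
        term = lut[addr] << n
        if n == n_bits - 1:      # T_s = 1: subtract at sign-bit time
            acc -= term
        else:                    # T_s = 0: add
            acc += term
    return acc


y_addsub = da_dot_addsub([3, -5, 7, 11], [5, -3, -8, 7], n_bits=4)
```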

18
• Speed is determined by the serial nature of the input: 1 BAAT (one bit at a time)
• We can expand the HW to process multiple bits at a time
Introduce the input as bit pairs: x_10 x_11, x_12 x_13, etc.
Shift the LSB-of-pair result by 1
Shift the accumulator feedback by 2
Requires 2 ROMs instead of 1
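A hedged sketch of two-bits-at-a-time operation, assuming an even `n_bits`, signed two's complement integer inputs, and left-weighting in place of the right shifts on the slide (equivalent up to a fixed scale). The two lookups per cycle model the two ROMs, which hold identical contents here. The name `da_dot_2baat` is hypothetical.

```python
def da_dot_2baat(coeffs, xs, n_bits):
    """DA dot product consuming two bits of every input per cycle."""
    if n_bits % 2:
        raise ValueError("this sketch assumes an even number of bits")
    k = len(coeffs)
    lut = [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]

    def address(n):
        # Bit n of every input, packed into a K-bit ROM address.
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> n) & 1) << i
        return addr

    acc = 0
    for n in range(0, n_bits, 2):        # n_bits / 2 cycles
        low = lut[address(n)]            # ROM 1: lower bit of the pair
        high = lut[address(n + 1)]       # ROM 2: upper bit of the pair
        if n + 1 == n_bits - 1:
            high = -high                 # the upper bit of the last pair is the sign bit
        acc += (low + (high << 1)) << n  # pair results differ by one bit of weight
    return acc


y_pairs = da_dot_2baat([3, -5, 7, 11], [5, -3, -8, 7], n_bits=4)
```

The cycle count halves at the cost of a second ROM, which is the speed versus area tradeoff listed on the summary slide.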

19
• DA lends itself easily to DSP because of its easy application to the dot product
• DA is easily implementable on an FPGA because of the similar architecture -> LUTs (of course better on custom hardware)
• DA is not limited to the dot product; it will work for any algorithm where pre-computed values can be leveraged

20
• DA is a very efficient means of mechanizing the dot product
• The use of DA can save 50-80% area over the parallel approach
• Like everything, DA has tradeoffs:
ROM size vs. input lines
Speed vs. area (multi ROM)

21
• White, Stanley. "Application of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review." IEEE ASSP Magazine, July 1989. (I pulled most of the basic talk info from here)
• Hwang, H. and Su, C. "Parallel and Pipelined Architecture Designs for Distributed Arithmetic-Based Recursive Digital Filters." VLSI Signal Processing IX, IEEE, 1996, pp. 35-44. (This has some slight remarks about bit parallel vs bit serial, also an auto-regressive moving average filter example)
• Waelchli, G. et al. "Distributed Arithmetic for Efficient Base-Band Processing in Real-Time GNSS Software Receivers." Journal of Electrical and Computer Engineering, volume 2010. (Application to GPS)
• Al-Haj, Ali. "An FPGA-Based Parallel Distributed Arithmetic Implementation of the 1-D Discrete Wavelet Transform." Informatica 29 (2005), pp. 241-247. (DSP example using a Virtex FPGA)

