A COMPARATIVE STUDY OF MULTIPLY ACCUMULATE IMPLEMENTATIONS ON FPGAS: Using Distributed Arithmetic and Residue Number System

Project Scope: To compare the implementation efficiencies (area times delay) of Distributed Arithmetic (DA), RNS and DA-RNS based parallel multiply accumulate architectures on FPGAs.

Background and Context: FPGAs are increasingly used for DSP computations. FPGAs have potential for parallelism. The LUT based FPGA architecture can be exploited. Novel MAC architectures are especially suitable for FPGAs.

Some More Background: In DSP, MACs use constant coefficients (fixed multiplicand). A full multiplier implementation is not required. Not all multiplier architectures are efficient on FPGAs.

Motivation: Distributed Arithmetic and Residue Arithmetic are LUT based techniques. Explore the “synergy” between the FPGA architecture and these two techniques.

Distributed Arithmetic Overview

Basic Serial Architecture
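
The slide's figure is not available here, so the following is only a minimal behavioural sketch of a bit-serial DA inner product, written in Python rather than HDL. It assumes two's-complement input samples and a single LUT holding every partial sum of the constant coefficients; the function names `build_da_lut` and `da_mac` are hypothetical. Hardware usually shifts the accumulator right each cycle, whereas this sketch shifts the LUT output left, which gives the same result.

```python
def build_da_lut(coeffs):
    """Precompute all 2**K partial sums of the K constant coefficients."""
    k = len(coeffs)
    return [sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(1 << k)]


def da_mac(samples, coeffs, bits=8):
    """Bit-serial DA inner product of two's-complement `bits`-wide samples."""
    lut = build_da_lut(coeffs)
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):
        # Bit b of every sample forms the LUT address for this cycle.
        addr = 0
        for j, x in enumerate(samples):
            addr |= (((x & mask) >> b) & 1) << j
        term = lut[addr] << b
        # The sign bit has negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc
```

One LUT lookup and one add (or subtract) per input bit replaces K multiplications, which is why the cycle count depends on the input word length and not on the number of taps.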

Residue Arithmetic Overview: $(z_1, z_2, \ldots, z_n) = (x_1, x_2, \ldots, x_n) \circ (y_1, y_2, \ldots, y_n)$ with $z_i = (x_i \circ y_i) \bmod m_i$, where $\circ$ denotes any of the modulo operations of addition, subtraction or multiplication.
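
As an illustration of the channel-wise rule above, here is a small Python sketch. The moduli (7, 8, 9) are chosen only for the example and do not come from the slides; `to_rns` and `rns_op` are hypothetical helper names.

```python
import operator

MODULI = (7, 8, 9)  # assumed pairwise coprime moduli, dynamic range M = 504


def to_rns(x, moduli=MODULI):
    """Represent integer x by its residues modulo each m_i."""
    return tuple(x % m for m in moduli)


def rns_op(xs, ys, op, moduli=MODULI):
    """Apply op (add, sub or mul) independently in every residue channel."""
    return tuple(op(x, y) % m for x, y, m in zip(xs, ys, moduli))


a, b = 25, 13
product = rns_op(to_rns(a), to_rns(b), operator.mul)
difference = rns_op(to_rns(a), to_rns(b), operator.sub)
```

No carries pass between channels, so each channel can be built as a narrow, independent datapath.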

Modulo Adder

Modulo Constant Multiplier: Due to the small sizes of residues and a constant multiplicand, a direct LUT based implementation is very efficient. [Figure: 4-bit constant modulo multiplier with inputs A0 to A3 driven by X[3:0]; 5-bit constant modulo multiplier with inputs A0 to A4 driven by X[4:0].]
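
The direct LUT approach can be sketched in Python as a precomputed table indexed by the residue. The constants and moduli below are arbitrary values picked for illustration, and `build_mod_const_lut` is a hypothetical name. The mapping onto physical LUTs depends on the device; the comment about 4-input LUTs is an assumption, not something stated on the slide.

```python
def build_mod_const_lut(constant, modulus, width):
    """Table of (constant * x) mod modulus for every `width`-bit input x."""
    return [(constant * x) % modulus for x in range(1 << width)]


# 4-bit case: 16 entries. On a fabric with 4-input LUTs (assumed here),
# each output bit of the table would fit in a single LUT.
lut4 = build_mod_const_lut(constant=5, modulus=13, width=4)

# 5-bit case: 32 entries, addressed by one extra input bit.
lut5 = build_mod_const_lut(constant=11, modulus=29, width=5)


def mod_const_mul(lut, x):
    """Multiply by the constant modulo m with one table lookup."""
    return lut[x]
```

The multiplication and the modulo reduction collapse into one lookup, so no separate reduction stage is needed after the multiplier.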

RNS MAC Architecture

Conversion Issues in RNS: Binary to RNS and RNS to binary conversions are significant overheads. Binary to RNS conversion is relatively simple. RNS to binary conversion using a direct CRT implementation requires modulo M adders.
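
To make the overhead concrete, the sketch below shows both conversions in Python under the textbook Chinese Remainder Theorem formulation, which may differ in detail from the converters used in this study. The function names are hypothetical, and `pow(a, -1, m)` for the modular inverse requires Python 3.8 or later.

```python
from math import prod


def forward_convert(x, moduli):
    """Binary to RNS: one small modulo reduction per channel."""
    return tuple(x % m for m in moduli)


def crt_weights(moduli):
    """Return M and the constant weights M_i * (M_i^-1 mod m_i)."""
    big_m = prod(moduli)
    weights = []
    for m in moduli:
        m_i = big_m // m
        weights.append(m_i * pow(m_i, -1, m))
    return big_m, weights


def reverse_convert(residues, moduli):
    """RNS to binary by direct CRT; every addition is reduced modulo M."""
    big_m, weights = crt_weights(moduli)
    total = 0
    for r, w in zip(residues, weights):
        total = (total + r * w) % big_m  # wide modulo-M adder
    return total
```

The forward path only needs narrow per-channel reductions, while the reverse path accumulates full-width terms modulo M, which is where the wide modulo adders come from.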

Forward Conversion

Reverse Conversion

DA-RNS Coupling

Scaling Accumulator Design

DA Implementation: 8 Bits, 8 Taps, 12-Bit Coefficients

Critical Path Results: Source: PSC8_0_PSC_0/I_Q7 (FF); Destination: SACC24_REG2/I_Q3 (FF); Data Path: PSC8_0_PSC_0/I_Q7 to SACC24_REG2/I_Q3

