A COMPARATIVE STUDY OF MULTIPLY ACCUMULATE IMPLEMENTATIONS ON FPGAS: Using Distributed Arithmetic and Residue Number System
Project Scope: To compare the implementation efficiencies (area times delay) of Distributed Arithmetic (DA), Residue Number System (RNS), and DA-RNS based parallel multiply-accumulate (MAC) architectures on FPGAs.
Background and Context: FPGAs are increasingly used for DSP computations; FPGAs have potential for parallelism; the LUT-based FPGA architecture can be exploited; novel MAC architectures are especially suitable for FPGAs.
Some More Background: In DSP, MACs use constant coefficients (fixed multiplicand); a full multiplier implementation is not required; not all multiplier architectures are efficient on FPGAs.
Motivation: Distributed Arithmetic and Residue Arithmetic are LUT-based techniques; explore the “synergy” between the FPGA architecture and these techniques.
Distributed Arithmetic Overview
Basic Serial Architecture
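The figure for this slide did not survive, so the following is only a minimal sketch of the textbook bit-serial DA scheme, not necessarily the architecture shown in the original. It assumes two's-complement input samples and constant integer coefficients; the names `build_da_lut` and `da_mac` are hypothetical. One bit from every sample forms a LUT address, the LUT returns the matching sum of coefficients, and the weighted LUT outputs are accumulated.

```python
def build_da_lut(coeffs):
    """Precompute all 2**K sums of subsets of the K constant coefficients."""
    k = len(coeffs)
    return [sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
            for addr in range(1 << k)]


def da_mac(coeffs, samples, width):
    """Bit-serial DA inner product of `coeffs` and `samples`.

    Each sample must fit in `width` bits of two's complement.
    """
    lut = build_da_lut(coeffs)
    acc = 0
    for bit in range(width):
        # LUT address: bit number `bit` of every sample, one sample per line.
        addr = 0
        for j, x in enumerate(samples):
            addr |= ((x >> bit) & 1) << j
        term = lut[addr] << bit
        # The sign bit of two's complement carries negative weight.
        acc += -term if bit == width - 1 else term
    return acc


coeffs = [3, -5, 7, 2]        # illustrative values only
samples = [10, -3, 4, -128]   # 8-bit two's-complement samples
result = da_mac(coeffs, samples, width=8)
```

In hardware the accumulator is usually shifted instead of the LUT output; the left shift used here gives the same arithmetic result and keeps the sketch short.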
Residue Arithmetic Overview: $(z_1, z_2, \ldots, z_n) = (x_1, x_2, \ldots, x_n) \circ (y_1, y_2, \ldots, y_n)$, with $z_i = (x_i \circ y_i) \bmod m_i$, where $\circ$ denotes any of the modulo operations of addition, subtraction or multiplication.
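The component-wise rule can be illustrated with a short sketch. The moduli set (7, 8, 9) and the helper names `to_rns` and `rns_op` are assumptions chosen for illustration; the slides do not state which moduli were used. Each residue channel is computed independently, which is the source of the parallelism.

```python
import operator

MODULI = (7, 8, 9)  # hypothetical pairwise-coprime moduli, dynamic range 504


def to_rns(x, moduli=MODULI):
    """Represent an integer by its residues with respect to each modulus."""
    return tuple(x % m for m in moduli)


def rns_op(op, xs, ys, moduli=MODULI):
    """Apply add, subtract or multiply independently in every residue channel."""
    return tuple(op(x, y) % m for x, y, m in zip(xs, ys, moduli))


a, b = 25, 13
rns_sum = rns_op(operator.add, to_rns(a), to_rns(b))
rns_diff = rns_op(operator.sub, to_rns(a), to_rns(b))
rns_prod = rns_op(operator.mul, to_rns(a), to_rns(b))
```

As long as the true result stays inside the dynamic range (here below 504), each tuple equals the residue representation of the ordinary integer result.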
Modulo Adder
Modulo Constant Multiplier: Due to the small sizes of the residues and the constant multiplicand, a direct LUT-based implementation is very efficient. [Figure: 4-bit constant modulo multiplier, inputs A0 to A3 driven by X[3:0]; 5-bit constant modulo multiplier, inputs A0 to A4 driven by X[4:0].]
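A minimal sketch of the direct LUT approach: because the multiplicand is fixed and the residue is only a few bits wide, every possible product can be precomputed and stored. The constant 5, modulus 13, and the function name `build_const_mod_mult_lut` are illustrative assumptions, not values from the slides.

```python
def build_const_mod_mult_lut(constant, modulus, bits):
    """Table of (constant * x) mod modulus for every `bits`-wide input x."""
    return [(constant * x) % modulus for x in range(1 << bits)]


# 4-bit input: 16 entries, each result small enough to fit in 4 bits.
lut4 = build_const_mod_mult_lut(constant=5, modulus=13, bits=4)

# 5-bit input: 32 entries for a hypothetical modulus of 29.
lut5 = build_const_mod_mult_lut(constant=5, modulus=29, bits=5)

# A multiplication is then a single table lookup.
product = lut4[9]
```

The table depth grows as 2 to the power of the input width, which is why the approach only pays off for small residues.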
RNS MAC Architecture
Conversion Issues in RNS: Binary-to-RNS and RNS-to-binary conversions are significant overheads; binary-to-RNS conversion is relatively simple; RNS-to-binary conversion using a direct CRT implementation requires modulo-M adders.
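Both conversions can be sketched as follows, assuming the same hypothetical moduli set (7, 8, 9) and the standard Chinese Remainder Theorem formula; the function names are illustrative, and the converters actually built in the study may differ. The sketch needs Python 3.8 or later for `math.prod` and the modular-inverse form of `pow`.

```python
from math import prod


def forward(x, moduli):
    """Binary to RNS: one small modulo reduction per channel."""
    return tuple(x % m for m in moduli)


def reverse_crt(residues, moduli):
    """RNS to binary by direct CRT; every addition is taken modulo M."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_hat = big_m // m          # product of all the other moduli
        inv = pow(m_hat, -1, m)     # multiplicative inverse of m_hat modulo m
        total = (total + r * inv * m_hat) % big_m  # the modulo-M adder
    return total


moduli = (7, 8, 9)
recovered = reverse_crt(forward(300, moduli), moduli)
```

The reverse path is the costly one because the running sum must be reduced modulo M, the full dynamic range, while the forward path only reduces modulo the small individual moduli.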
Forward Conversion
Reverse Conversion
DA-RNS Coupling
Scaling Accumulator Design
DA Implementation: 8 Bits, 8 Taps, 12-Bit Coefficients
Critical Path Results: Source: PSC8_0_PSC_0/I_Q7 (FF); Destination: SACC24_REG2/I_Q3 (FF); Data Path: PSC8_0_PSC_0/I_Q7 to SACC24_REG2/I_Q3