A Survey on Low Power Multiplication / Accumulation
Speaker: Byoung-Woon Kim
Contents
Introduction
[1] Interlaced Accumulation Programming
[2] Operand Swapping
[3] Selective Coefficient Negation
[4] Coefficient Optimization
[5] Coefficient Reordering
Conclusion & Future Works
Power Distribution of a DSP
Hirotsugu [ISLPED '96]: for each test program
[Figure: normalized power consumption (%) per component: Control, Address Generation, Data Op., Memory, Pin, Peripheral, Clocking, Bus, Misc. Annotation: variation due to data dependency.]
Multiplication and Accumulation: MAC
Major operation in DSP
MUL > (5 * ALU)
[Modified Booth Encoding] One of 0, X, -X, 2X, -2X is selected based on each 2 bits of Y
[Figure: MAC datapath. Labels: X, Y, MULT, PR, ALU, ACC, CSA, CPA.]
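The recoding named on this slide can be sketched in a few lines. This is a minimal illustration of standard radix-4 (modified) Booth recoding, not the circuit of any particular DSP. The function names `booth_radix4_digits` and `booth_multiply` and the 16-bit default are assumptions made for the example. Note that the standard scheme looks at each 2-bit group of Y together with the bit just below it.

```python
def booth_radix4_digits(y, bits=16):
    """Recode a signed `bits`-wide integer into radix-4 Booth digits.

    Returns digits least significant first, each in {-2, -1, 0, 1, 2},
    such that y == sum(d * 4**i for i, d in enumerate(digits)).
    A value such as 0xAAAA is interpreted as its signed 16-bit pattern.
    """
    if bits % 2:
        raise ValueError("bits must be even")
    u = y & ((1 << bits) - 1)          # two's-complement bit pattern of y
    digits = []
    prev = 0                           # implicit bit below bit 0 is 0
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1              # lower bit of the 2-bit group
        b1 = (u >> (i + 1)) & 1        # upper bit of the 2-bit group
        digits.append(-2 * b1 + b0 + prev)
        prev = b1                      # upper bit overlaps into the next group
    return digits


def booth_multiply(x, y, bits=16):
    """Sum the partial products 0, +-X, +-2X selected by each Booth digit."""
    return sum(d * x * 4**i for i, d in enumerate(booth_radix4_digits(y, bits)))
```

For example, `booth_radix4_digits(0x7FFF)` has only two nonzero digits, so only two partial products have to be added.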
Power Consumption by a Multiplier
Power consumption by data dependency
[Figure: scatter plots for a 36-bit ALU and a 16x16 MPY. X: energy per cycle (nJ), Y: # of input transitions.]
16x16 MPY: Average = 7nJ
Little correlation
Power Consumption by a Multiplier
What is an important input in terms of power?
[Figure: energy per cycle (nJ) for two cases]
0x8000 x (random): Average = 5nJ
(random) x 0x: Average = 1nJ
Power Consumption by a Multiplier
Booth encoding is a significant overhead
[Figure: energy per cycle (nJ) for two cases]
0x5555 x (random): Average = 6nJ
(random) x 0x: Average = 4nJ
Interlaced Accumulation Programming (1/2)
Hirotsugu [ISLPED '96]
FIR filter with n = 3 taps:
Y(k) = C0 * X(k) + C1 * X(k-1) + C2 * X(k-2)
Y(k+1) = C0 * X(k+1) + C1 * X(k) + C2 * X(k-1)
Y(k+2) = C0 * X(k+2) + C1 * X(k+1) + C2 * X(k)
Interlaced Accumulation Programming (2/2)
More than 40% of the power is saved by
–Keeping one operand of the multiplier constant
  X is kept: 7nJ -> 5 ~ 6nJ
  Y is kept: 7nJ -> 1 ~ 3nJ
–Halving the number of memory accesses
  Traditional: two memory operands
  Interlaced: one memory operand (data re-use through a temporary register)
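The reordering can be illustrated with a small model. This is a sketch of one plausible reading of the technique, assuming the coefficient is the operand held constant and that there is one accumulator per output in flight; a real DSP has only a few accumulators. The function names and the `coeff_stream` bookkeeping are hypothetical and exist only to count how often the coefficient operand changes.

```python
def _check_range(c, x, k, m):
    # Guard against Python's negative-index wraparound and overruns.
    if k < len(c) - 1 or k + m - 1 >= len(x):
        raise ValueError("window [k-(n-1), k+m-1] must lie inside x")


def fir_traditional(c, x, k, m):
    """Compute Y(k)..Y(k+m-1) one output at a time.

    Returns the outputs and the coefficient values in the order they are
    presented to the multiplier.
    """
    _check_range(c, x, k, m)
    outputs, coeff_stream = [], []
    for j in range(m):
        acc = 0
        for i, ci in enumerate(c):
            acc += ci * x[k + j - i]
            coeff_stream.append(ci)
        outputs.append(acc)
    return outputs, coeff_stream


def fir_interlaced(c, x, k, m):
    """Compute the same outputs, but finish all uses of one coefficient first."""
    _check_range(c, x, k, m)
    accs = [0] * m                      # assumption: one accumulator per output
    coeff_stream = []
    for i, ci in enumerate(c):          # coefficient fetched once, then held
        for j in range(m):
            accs[j] += ci * x[k + j - i]
            coeff_stream.append(ci)
    return accs, coeff_stream


def operand_changes(stream):
    """Count how often consecutive multiplier operands differ."""
    return sum(a != b for a, b in zip(stream, stream[1:]))
```

With three taps and three outputs, the traditional order changes the coefficient operand on every MAC, while the interlaced order changes it only when moving to the next coefficient.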
Operand Swapping (1/2)
Weight = how many additions are needed by Booth encoding?
Example: Y recodes to X000X0, so Weight = 2
[Table: Current (mW) for operands A and B multiplied as A*B and as B*A, with the resulting saving. Operand values: 7FFF, AAAA, 0001. Savings: 54%, 68%, 58%. Labels: Low Weight, High Switching.]
Operand Swapping (2/2)
For filter operations, one operand is usually constant. => Operand swapping at compile time.
[Figure: Current (mA) for X and Y operand combinations LowW -> LowW, HighW -> HighW and LowW -> HighW, each under LowS and HighS, with the candidate for operand swapping marked.]
LowS: low switching, HighS: high switching
LowW: low weight, HighW: high weight
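A compile-time swap decision based on weight could look like the sketch below. It assumes that Y is the Booth-recoded input, as on the MAC slide, and that weight is the number of nonzero radix-4 Booth digits. It ignores the switching dimension shown in the figure, and the names `booth_weight` and `order_operands` are hypothetical.

```python
def booth_weight(y, bits=16):
    """Number of nonzero radix-4 Booth digits of y, i.e. additions needed."""
    u = y & ((1 << bits) - 1)          # two's-complement bit pattern
    weight, prev = 0, 0
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        if -2 * b1 + b0 + prev != 0:   # nonzero digit costs one addition
            weight += 1
        prev = b1
    return weight


def order_operands(a, b, bits=16):
    """Return (x, y) so that the lower-weight value sits on the recoded input y."""
    if booth_weight(a, bits) < booth_weight(b, bits):
        return b, a                    # swap: a is cheaper to recode
    return a, b
```

Under these assumptions the operand values from the slides get weights 2 (7FFF), 8 (AAAA) and 1 (0001), so 7FFF x AAAA is issued with AAAA as X and 7FFF as Y.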
Selective Coefficient Negation
To reduce the toggle
–store Coeff[i] or -Coeff[i] in memory
According to the negation,
–use 'multiply and add' (MAC+ instruction): ACC = ACC + (X * Y)
–use 'multiply and sub' (MAC- instruction): ACC = ACC - (X * Y)
GSM Vocoder: 11% power reduction
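The selection step might be sketched as follows. The slide does not say how the toggle count is measured, so this example assumes 16-bit coefficients, measures toggles as the Hamming distance between successively stored words, and picks the sign greedily. The names `select_negation` and `mac` are hypothetical.

```python
MASK16 = 0xFFFF


def hamming16(a, b):
    """Bit toggles between two values viewed as 16-bit two's-complement words."""
    return bin((a ^ b) & MASK16).count("1")


def select_negation(coeffs):
    """Greedily store c or -c, whichever toggles fewer bits from the previous word.

    Returns the stored values and, per tap, "+" (MAC+) or "-" (MAC-).
    """
    stored, ops = [], []
    prev = 0                           # assumption: the bus starts at all zeros
    for c in coeffs:
        keep_cost = hamming16(prev, c)
        # -32768 has no 16-bit negation, so it is always stored as-is.
        if c != -32768 and hamming16(prev, -c) < keep_cost:
            stored.append(-c)
            ops.append("-")
        else:
            stored.append(c)
            ops.append("+")
        prev = stored[-1]
    return stored, ops


def mac(stored, ops, xs):
    """Accumulate with MAC+ or MAC- so the result matches the original filter."""
    acc = 0
    for s, op, x in zip(stored, ops, xs):
        acc = acc + s * x if op == "+" else acc - s * x
    return acc
```

For the coefficients 3, -2, 5, -1 the sketch stores 3, 2, 5, 1 and issues MAC+, MAC-, MAC+, MAC-, which leaves the accumulated result unchanged.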
Coefficient Optimization
Mahesh [TVLSI '98]
The design of the finite wordlength FIR filter
–Given N coefficients and constraints,
–Find a new set of coefficients such that the total Hamming distance between successive coefficients is minimized.
=> using coefficient perturbation and an algorithm similar to simulated annealing
But Hamming distance is not a good cost function!
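The cost function and the idea of perturbing coefficients can be sketched as below. This is not the annealing-like algorithm from the cited work: it is a plain greedy descent, and the bound `delta` on each coefficient's deviation is a hypothetical stand-in for the real filter constraints, which the slide does not spell out.

```python
def total_hamming(coeffs, bits=16):
    """Total Hamming distance between successive coefficients."""
    mask = (1 << bits) - 1
    return sum(bin((a ^ b) & mask).count("1")
               for a, b in zip(coeffs, coeffs[1:]))


def perturb_coefficients(coeffs, delta=1, bits=16):
    """Greedy descent: move each coefficient by at most `delta` if cost drops."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    best = list(coeffs)
    improved = True
    while improved:                    # ends: cost is a decreasing non-negative int
        improved = False
        for i, original in enumerate(coeffs):
            for cand in range(original - delta, original + delta + 1):
                if not lo <= cand <= hi:
                    continue           # stay inside the word length
                trial = best[:i] + [cand] + best[i + 1:]
                if total_hamming(trial, bits) < total_hamming(best, bits):
                    best, improved = trial, True
    return best
```

A real flow would also check each candidate set against the filter specification (for example passband ripple and stopband attenuation) before accepting it, and could swap `total_hamming` for a more accurate power model.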
Coefficient Ordering
MAC operation: commutative, associative
Finding a good ordering
–N! cases for an N-tap filter
Y(k) = C0 * X(k) + C1 * X(k-1) + C2 * X(k-2)
Y(k) = C1 * X(k-1) + C0 * X(k) + C2 * X(k-2)
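For small N the N! orderings can simply be enumerated. The sketch below assumes a pluggable cost function and uses Hamming distance between successive coefficients only as a placeholder default. The names `best_ordering` and `fir_output` are hypothetical, and the exhaustive search is only practical for a handful of taps.

```python
from itertools import permutations


def hamming_cost(seq, bits=16):
    """Placeholder cost: bit toggles between successive coefficients."""
    mask = (1 << bits) - 1
    return sum(bin((a ^ b) & mask).count("1") for a, b in zip(seq, seq[1:]))


def best_ordering(coeffs, cost=hamming_cost):
    """Try all N! tap orders and return the index order with the lowest cost."""
    return min(permutations(range(len(coeffs))),
               key=lambda order: cost([coeffs[i] for i in order]))


def fir_output(coeffs, x, k, order):
    """Y(k) accumulated in the given tap order; each C_i still pairs with X(k-i)."""
    return sum(coeffs[i] * x[k - i] for i in order)
```

Because the accumulation is commutative and associative, `fir_output` returns the same value for every ordering; only the sequence of operands seen by the multiplier changes.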
Conclusion & Future Works
Power characteristics of a multiplier
Some techniques for low power MACs
–Interlaced accumulation programming
–Operand swapping
–Selective coefficient negation
–Coefficient optimization & ordering
Find an accurate power model for a multiplier
–Cost function for coefficient optimization & instruction-level power optimization
An implementation of a multiplier supporting
–Selective 'operand swapping' & 'negation'