# Recent Developments in Theory and Implementation of Parallel Prefix Adders Neil Burgess Division of Electronics Cardiff School of Engineering Cardiff University.

## Presentation on theme: "Recent Developments in Theory and Implementation of Parallel Prefix Adders Neil Burgess Division of Electronics Cardiff School of Engineering Cardiff University."— Presentation transcript:

Recent Developments in Theory and Implementation of Parallel Prefix Adders Neil Burgess Division of Electronics Cardiff School of Engineering Cardiff University

Motivation Parallel Prefix Adders (e.g. Kogge- Stone) mostly ignored for deep submicron VLSI –large fan-out points –wide wiring channels Recent insights: can remove both and do... –absolute difference –late increment –media processing

Prefix adder structure A(0:w-1) Bit propagate and generate cells g(0:w-1)p(0:w-1) B(0:w-1) c(1:w) Prefix carry tree s(0:w) Sum cells (XOR gates)

Prefix Equations - 1 g(i) = a(i)  b(i)“carry generate” p(i) = a(i)  b(i)“carry propagate” k(i) =  {a(i)  b(i)}“carry kill” g(i), p(i), & k(i) are mutually exclusive –Use any two:  g(i) & k(i) = NAND & NOR –p(i) needed as well: s(i) = p(i)  c(i)

Prefix Equations - 2 Generate and Not Kill signals are com- bined to form “Group Signals” G x z  K x z interpretation 0 0c(x+1) = 0 0 1c(x+1) = c(z) 1 0Don’t care 1 1c(x+1) = 1

Prefix Equations - Interpretation Group signals yield carry signals: Tree outputs: c(i+1) = G i 0 Tree inputs: G i i = g(i) ;  K i i =  k(i)

Prefix Equations - characteristics Associative –sub-terms may be pre-computed in parallel

Prefix equations - characteristics Idempotent –sub-terms may be “overlapped” g(0), k(0)g g(1), k(1)g g(2), k(2)g GK 1 0 22 11 22 00 c(3)c c(2)c c(1)c

4-bit Ladner-Fisher prefix tree 1 sub-term pre-computed Logarithmic depth Fan-out = 2 in 2 nd row (laterally)

8-bit Ladner-Fisher prefix tree Log depth; lateral fan-out = 4 in 3 rd row No exploitation of idempotency

16-bit Ladner-Fisher prefix tree Log depth with large fan-out in final row

4-bit Kogge-Stone prefix graph Fan-out = 1 (laterally) 1 extra cell parallel wires in 2 nd row

8-bit Kogge-Stone prefix graph More cells & wiring than Ladner-Fisher

16-bit Kogge-Stone prefix graph Low fan-out but wider wiring channels No exploitation of idempotency

Black cells and grey cells Carries, c(i) = G i-1 0 ; K i-1 0 terms not needed G-only cells called and coloured “grey”

The story so far… Parallel prefix adders available in VLSI Log-depth adders possible: –high fan-outs {1,2,4,8…} & low cell count –low fan-outs {1,1,1,1…} & high cell count Problematic in VLSI (buffering, area) Idempotency of ‘  ’ operator not exploited

Log-depth prefix trees In VLSI: –L-F trees require too much buffering  delay –K-S trees require too much area (wire flux) Fan-outs characterised as: –{1,2,4,8…} Ladner-Fisher –{1,1,1,1…} Kogge-Stone

Knowles’ insight Use other fan-out schemes 5 possible 8-bit log-depth prefix trees: –{1,1,1}17 cellsKogge-Stone –{1,1,2}17 cellsuses idempotency –{1,1,4}14 cellsno idempotency –{1,2,2}14 cellsno idempotency –{1,2,4} 12 cellsLadner-Fisher

Knowles’ 8-bit prefix trees All trees are log-depth

Tree construction rules Levels are labelled 0,1,2... Fan-out at j th level, 2 k, satisfies 2 k  2 j Fan-out at j th level  fan-out at j+1 th level Lateral wire length at j th level is 2 j

Knowles’ 16-bit trees - I {1,1,1,1} 49 cells{1,1,1,8}42 cells {1,1,1,2} 49cells {1,2,2,2} 42 cells {1,1,1,4} 49cells {1,1,4,4} 40 cells {1,1,2,2} 49cells {1,1,4,8} 36 cells {1,1,2,4} 49cells {1,2,2,8} 36 cells {1,1,2,8} 42cells {1,2,4,4} 36 cells {1,2,2,4} 42cells {1,2,4,8} 32 cells

Knowles’ 16-bit trees - II {1,1,1,1} {1,1,1,8} {1,1,1,2} Idempotent{1,2,2,2} {1,1,1,4} Idempotent {1,1,4,4} {1,1,2,2} Idempotent {1,1,4,8} {1,1,2,4} Idempotent {1,2,2,8} {1,1,2,8} Idempotent {1,2,4,4} {1,2,2,4} Idempotent{1,2,4,8}

Knowles’ 16-bit trees - III {1,1,1,1} {1,1,1,8}R {1,1,1,2} I{1,2,2,2} R {1,1,1,4} I{1,1,4,4} R {1,1,2,2} I{1,1,4,8} R {1,1,2,4} I{1,2,2,8} R {1,1,2,8} R, I{1,2,4,4} R {1,2,2,4} R, I{1,2,4,8} R

Quick way of spotting R, I Define span(l) as distance from start of wire to first cell in l th level span(l) = 2 l  fanout(l)  1 tree characteristics –R if span(j)  span(k) for j < k –I if span(i) + span(j) = span(k) for i < j < k

Examples of R & I spotting fanout(l)span(l) characteristic [1,1,1,1]  [1,2,4,8] neither R nor I [1,1,2,2]  [1,2,3,7] I only [1,2,2,2]  [1,1,3,7] R only [1,2,2,4]  [1,1,3,5] R & I Are R & I adders “best”?

VLSI design of prefix adders Adders laid out as rectangular array of prefix cells (and gaps) Assume cells measure 10  m  4  m –2 cells per significance  20  m / bit Key design parameters: –buffering (area & delay) –wiring channels (area)

16-bit adder example Assumptions Maximum fan-out without buffering: –3 cells + 80  m wire (4 cell widths) Maximum fan-out with buffering: –9 cells + 240  m wire (12 cell widths) Employ {1,2,2,4} architecture

Area vs Time for 32-bit adders Delay 1212.51313.514 24 26 28 30 32 34 36 38 40 Area K-S {1,1,1,1,1} {1,1,2,2,2} L-F {1,2,4,8,16} {1,2,2,4,4}  [1,1,3,5,13]

III. New applications of prefix adders

Other addition operations Late increment –Mod 2 w -1 addition for Reed-Solomon coding –floating-point rounding Late complement –absolute difference for video motion estimation –sign-magnitude addition Typically use 2 adders and a MUX

Increments in prefix trees Row of prefix cells = ‘late +1’ operation Ladner-Fisher comprises many late +1’s –1 8-bit, 2 4-bit, 4 2-bit, & 8 1-bit

Late increment tree Adder returns A+B if inc = 0 Adder returns A+B+1 if inc = 1 inc

Late increment logic “Late Carry” lc(i) set high if: –c(i) = 1 or –inc = 1 and a(n),b(n)  0,0  n: 0  n < i p(i)p(i) s(i)s(i) inc  K i 0 c(i) = G i 0 lc(i)

Late complement theory In 2’s-complement,  N = -(N+1) A +  B = A  B  1 * late increment then yields A  B  (A +  B) = -(A  B  1+1) = B  A Absolute difference readily available

Absolute difference logic If c(w) = 0, result negative –if c(w) = 0, invert all the bits –else always perform late increment with  K i-1 0 p(i) s(i)s(i) c(w)c(w)  K i 0 c(i)

Summary of “late” ops Available on all prefix adders Extra delay: 1 gate’s delay + buffering Extra hardware:  w black cells This technique used in floating-point units –late increment for rounding –late complement for true subtraction

Media (“packed”) arithmetic Fundamental strategy: Use full wordlength hardware for multiple sub-wordlength computations Examples: –32-bit adder  4 8-bit adders –32-bit multiplier  2 16-bit multipliers

Partitioning an adder Criteria: –support carries propagating within sub-adders –prevent carries propagating between sub- adders Solutions: –put AND gates on carry chains  slower adder –put dummy 0’s on operand bits  larger adder Use prefix adder!!

Packed prefix adder - 1 Force  k(n) = 0 at partition points –prevents carries propagating across bit n –exploits don’t care condition (g,  k) = (1,0) Implementation –change  k(n) gate to (2,1) OR-AND gate –delay-neutral modification

Packed prefix adder - 2 Force c(n) = G n-1 0 = 0 at partition points –prevents c(n)  s(n) errors Implementation –insert AND gates (off critical path) or –change G n-1 0 gate to ({2,1},1) complex gate –BUT need G n-1 0 signal for sub-adder overflows

Packed prefix adder - 3 Sub-adder carries complete early Extraneous cells automatically do nothing

Last Slide Recent developments in prefix adders: –new “family” of log-depth trees –late operations –packed arithmetic for media processing Future possibilities: –systematic exploitation of idempotency –trees with reduced buffering –combine packed arithmetic/late ops