Arithmetic Circuits for Number Crunching: Adders, Subtractors,

Arithmetic Circuits for Number Crunching: Adders, Subtractors,
Multipliers and Dividers Dr. Tassadaq Hussain Instructor: Dr. Rehan Ahmed

Roadmap Arithmetic Circuits Addition Subtraction Multiplication
Division Approximate Computing Fractional Numbers Fixed Point Floating Point Allegory of Arithmetic, from "Margarita Philosophica," 1504 by Gregor Reisch

ADDITION and SUBTRACTION

Design Tradeoffs There are many ways to build adders (or any function). Which is the right implementation? Depends on your system’s requirements Optimization Metrics: power Speed Power Area area time

Single Cycle vs. Multi Cycle Arithmetic
Arithmetic units can be written as: Combinational Blocks Outputs depend only on inputs Results available in one clock-cycle Multi-Cycle Result is computed over multiple cycles Can be much smaller (require fewer logic resources) Can be pipelined for higher clock frequency

Common Adder Architectures
Carry Ripple/Propagate Adder Carry Select Adder Carry Look-ahead Adder Carry Save Adder

Carry Ripple/Propagate Adder

Review:1-Bit (Half) Adder
C S 1 A B S C What if we want to add more than one bit?

Review: Full Adder A B Cin Cout S 1 S A B Cout Cin

Aside: How do we perform Subtraction?

A  B  A  B 1 Aside: Subtraction A  B  A  (B) B  B 1
Can perform subtraction with addition A  B  A  (B) Recall, Twos Complement Numbers: B  B 1 (Invert all bits of B and add 1) A  B  A  B 1

Back to Carry Propagate/Ripple Adder…

Carry Propagate Adder (CPA)
Connect Full Adders to make Carry Propagate Adder (Ripple Adder) Cin Cout B3 A3 B A2 B A1 B A0 C4 S3 S S S0 Right-most stage is least-significant bit (LSB) Carry-out of previous stage feeds into Carry-In of next stage Can extend to any number of bits – well, we’ll see that …

How fast is this Ripple Carry Adder?

Delay of a Full Adder What is the critical path delay of a Full Adder?
B Cout Cin Assume all gates have the same gate propagation delay tPD If inputs A, B, Cin arrive at time 0, S is ready after tPD A, B, Cin  XOR Gate  S Cout is ready after 3 tPD A, B  OR  AND  OR  Cout Critical Path Delay is 3 tPD

Delay of Carry Propagate Adder
Consider 2-bit Carry Propagate Adder S1 A1 B1 Cout Cin S0 A0 B0 Cout Cin

Consider 2-bit Carry Propagate Adder C C out in SS00 AA00 BB00 Cin 00 SS11 AA11 BB11 Cout Inputs A0, A1, B0, B1 arrive at time 0, Cout of first bit is ready after 3 tPD Delay of next Cout is ready after another 2 tPD Critical Path Delay is 5 tPD

Delay for an N-bit Carry Propagate Adder is 2 N 1tPD  3tPD … Cin A 0 B0 Cout B 1 A 1 B N-1 A N-1 SN S N-1 2*t gate S1 S0 3*t gate Delay is proportional to N  SLOW for large N When 32- or 64-bit numbers are used, this delay may become unacceptably high.

Aside: Combined Adder/Subtractor
A  B  A  B` 1 A3 B3 A2 B2 A1 B1 A0 B0 MODE 1 0 1 0 1 0 1 0 Cout ADDER Ci n C 4 S 3 S2 Subtraction: Mode = 1 S1 S0 Addition: Mode = 0 B inputs are inverted Carry-in is 1 B inputs are NOT inverted Carry-in is 0 Exercise: How does this affect delay?

Can we make faster Adders i.e.
Do something about the Carry Generation?

Carry Select Adder (CSA)

Carry-Select Adder Carry Propagate Adders are slow because the high-order bits need to wait for the carry-in from lower-order bits. Calculate high-order bits for BOTH cases of carry-in. Then select the correct case when carry-in is ready Trade-off area (use more gates) for faster performance

Carry Select Adder: 8-Bit Example
1 Cout Cin 4 4 1 0 4 4-bit Carry Propagate Adder S A 7..4 B A B 3..0 A 7..4 B 7..4 S S 3..0 Bits 7..4 computed in parallel with bits 3..0

Carry Look-Ahead Adder (CLA)

Approach Gi  f (Ai, Bi ) Pi  f (Ai, Bi )
Pre-compute parts of carry logic For each bit of the addition, independently calculate two terms: Generate term Gi  f (Ai, Bi ) Propagate term Pi  f (Ai, Bi ) Gi and Pi are independent of Carry terms Ci Then, by using the various Generate and Propagate terms, we can compute the carry terms at each bit without the ripple effect.

Generate Term Does adding the i-th bits of A, B generate a carry?
Example: A = 011, B = 010 Look at each bit-position INDEPENDANTLY Does adding bit 0 generate a Carry? A + B = NO, = 01 (carry NOT generated) Does adding bit 2 generate a Carry? A + B = NO, = 00 G2= 0 G = 0 Does adding bit 1 generate a Carry? A + B = YES, = 10 (carry generated) G1= 1

Gi  Ai and Bi Generate Term Ai Bi Gi 1
Ai, Bi would generate a carry if both are 1 Ai Bi Gi 1 Gi  Ai and Bi

Propagate Term If there was a carry-in at the i-th bit, would it propagate to the next stage? Example: A = 011, B = 010 Look at each bit-position INDEPENDANTLY Would bit 0 propagate a Carry? A + B = YES, = 10 (carry propagated) Would bit 2 propagate a Carry? A + B = NO, = 01 P2= 0 P =1 hypothetical carry-in hypothetical carry-in Would bit 1 propagate a Carry? A + B = YES, = 11 (carry propagated) P1= 1 hypothetical carry-in

Pi  Ai or Bi Propagate Term Ai Bi P i 1
Ai, Bi would propagate a carry if either is 1 Ai Bi P i 1 Pi  Ai or Bi

Will the i+1 Position Produce a Carry Out?
Couti+1  Gi or Pi and Cini  A Carry was GENERATED A Carry was PROPAGATED

4-Bit Example C1  G0  P0  C0 C2  G1  P1  C1
Let Ci denote the carry-in of stage i this means that it is also the carry-out of the previous stage A3 B3 A2 B2 A1 B1 A0 B0 C0 Bit 0 S0 Bit 1 S1 Bit 2 S2 Bit 3 S3 C1 C2 C3 C4 C1  G0  P0  C0 C2  G1  P1  C1 Note: We are using logical operators here: + means OR . means AND C3  G2  P2  C2 C4  G3  P3  C3

Carry Look-Ahead Logic
C1  G0  P0  C0 C2  G1  P1  C1 C3  G2  P2  C2 C4  G3  P3  C3 Perform Forward Substitution C1  G0  P0  C0 C2  G1  P1  G0  P1  P0  C0 C3  G2  P2  G1  P2  P1  G0  P2  P1  P0  C0 C4  G3  P3  G2  P3  P2  G1  P3  P2  P1  G0  P3  P2  P1  P0  C0

Carry Look-Ahead Delay
G1 P1 G0 P1 P0 C0 G0 P0 C0 C1  G0  P0  C0 C2  G1  P1  G0  P1  P0  C0 C3  G2  P2  G1  P2  P1  G0  P2  P1  P0  C0 C4  G3  P3  G2  P3  P2  G1  P3  P2  P1  G0  P3  P2  P1  P0  C0 AND-OR networks C C1 If C0 and all Gi and Pi terms are available at the same time, ALL Ci terms will be ready after 2 gate delays No Ripple Effect!

Overall Critical Path Delay
A3 B3 A2 B2 A1 B1 A0 B0 C0 Inputs arrive at time 0 Generate Propagate G3 P Generate Propagate G2 P Generate Propagate G1 P Generate Propagate G0 P 3 2 1 C4 Carry Look-Ahead Unit C3 C2 C1 A2 B2 A1 B1 A3 B3 A0 B0 Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) S4 S3 S2 S1 S0

A3 B3 A2 B2 A1 B1 A0 B0 C0 Inputs arrive at time 0 Generate Propagate G0 P Generate Propagate Generate Propagate Generate Propagate G1 P1 tPD G3 P3 G2 P2 C4 Carry Look-Ahead C3 C2 A3 B3 A2 B2 A1 Unit C1 G P B1 A0 B0 Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) S4 S3 S2 S1 S0 All G, P terms available after single gate delay

A3 B3 A2 B2 A1 B1 A0 B0 C0 Inputs arrive at time 0 Generate Propagate G3 P Generate Propagate G2 P Generate Propagate G1 P Generate Propagate G0 P tPD + 2 tPD 3 2 1 C4 Carry Look-Ahead Unit C3 C2 C1 A2 B2 A1 B1 A3 B3 A0 B0 Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) S4 S3 S2 S1 S0 As discussed in Slide 34

A3 B3 A2 B2 A1 B1 A0 B0 C0 Inputs arrive at time 0 Generate Propagate G1 P Generate Propagate G0 P Generate Propagate Generate Propagate G2 P2 tPD + 2 tPD G3 P3 C4 C C3 A3 B3 A2 1 arry Look-Ahead Unit C2 C1 B2 A1 B1 A0 B0 + Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) tPD S4 S3 S2 S1 S0

CLA - Overall Critical Path Delay
A3 B3 A2 B2 A1 B1 A0 B0 C0 Inputs arrive at time 0 Generate Propagate G3 P Generate Propagate G2 P Generate Propagate G1 P Generate Propagate G0 P tPD + 2 tPD 3 2 1 C4 Carry Look-Ahead Unit C3 C2 C1 A2 B2 A1 B1 A3 B3 A0 B0 + Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) tPD S4 ALL outputs ready after 4 tPD

CLA - Scalability Typically do not extend beyond 4-bits
In theory, we could build Carry Look-Ahead Adders of any size N However, equations get more complex very quickly, and we need wider and wider gates (slow) in the carry logic. C1  G0  P0  C0 C2  G1  P1  G0  P1  P0  C0 C3  G2  P2  G1  P2  P1  G0  P2  P1  P0  C0 C4  G3  P3  G2  P3  P2  G1  P3  P2  P1  G0  P3  P2  P1  P0  C0 Typically do not extend beyond 4-bits

Hierarchical Carry Look-Ahead Adder

Building Larger Carry Look-Ahead Adder
Hierarchically Build Bigger CLAs Create a 4-bit CLA like we discussed The 4-bit CLA now also computes a group generate term GG and a group propagate term PG PG  P3  P2  P1  P0 GG  G3  P3  G2  P3  P2  G1  P3  P2  P1  G0 Combine 4 x 4-bit CLAs to build a 16-bit CLA 16-bit CLA can also compute it’s own GG and PG terms Can then take 4 x 16-bit CLAs to build 64-bit CLA and so on…

4-Bit CLA Alternative Schematic
A3 B3 A2 B2 A1 B1 A0 B0 C0 Generate Propagat e Sum Logic S3 Generate Propagat e Sum Logic S2 Generate Propagat e Sum Logic S1 Generate Propagat e Sum Logic S0 G3 P3 C3 G2 P2 C2 G1 P1 C1 Carry Look-Ahead Unit G0 P0 C4 Same circuit as slide 39 but drawn differently. Use this as building block for 16-bit CLA

16-bit Carry Look-Ahead Adder
A B A8..11 B8..11 A7..4 B7..4 A3..0 B 3..0 C0 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4 S 3..0 4 4 4 S15..12 S8..11 S 7..4 G8 P8 C8 G4 P4 C4 Carry Look-Ahead Unit G0 P0 G12 P12 C12 C16

Carry Look-Ahead logic is the same C4  G0  P0  C0 C8  G4  P4  C4 C12  G8  P8  C8 C16  G12  P12  C12 Perform Forward Substitution C1  G0  P0  C0 C8  G4  P4  G0  P4  P0  C0 C12  G8  P8  G4  P8  P4  G0  P8  P4  P0  C0 C16  G12  P12  G8  P12  P8  G4  P12  P8  P4  G0  P12  P8  P4  P0  C0 Page 45

16-bit CLA Critical Path Delay
A B A8..11 B8..11 A7..4 B7..4 A3..0 B 3..0 C0 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4 S 3..0 4 4 4 S15..12 S8..11 S 7..4 G8 P8 C8 G4 P4 C4 Carry Look-Ahead Unit G0 P0 G12 P12 C12 C16

A B A8..11 B8..11 A7..4 B7..4 A3..0 B 3..0 C0 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4 3 tP S 3..0 4 4 4 S15..12 S8..11 S 7..4 D G8 P8 C8 G4 P4 C4 Carry Look-Ahead Unit G0 P0 G12 P12 C12 C16 Inputs arrive at time 0 GG of 4-bit CLAs ready after 3 tPD

A B A8..11 B8..11 A7..4 B7..4 A3..0 B 3..0 C0 5 t PD 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4 3 tP S 3..0 4 4 4 S15..12 S8..11 S 7..4 D G8 P8 C8 G4 P4 C4 Carry Look-Ahead Unit G0 P0 G12 P12 C12 C16 Carries from CLA Unit ready after another 2 tPD

A B A8..11 B8..11 A7..4 4-bit CLA 4-bit CLA 4-bit Adder Adder Ad GG PG GG PG G 4 4 4 S S8..11 S 7..4 G12 P12 C12 G8 P8 C8 C16 Carry Look-Ahead C0 B7..4 A3..0 B 3..0 A B A B A B A B C0 3 3 2 2 1 1 0 0 5 tPD Generate Propagate G3 P Generate Propagate G2 P Generate Propagate G1 P Generate Propagate G0 P 4 4 4 CLA er G PG 4-bit CLA Adder GG PG 4 3 tP S 3..0 3 2 1 d C4 Carry Look-Ahead Unit C3 C2 C1 A2 B2 A1 B1 D A3 B3 A0 B0 Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) G4 P4 C4 Unit G0 P0 S3 S2 S1 S0

A B A8..11 B8..11 A7..4 4-bit CLA 4-bit CLA 4-bit Adder Adder Ad GG PG GG PG G 4 4 4 S S8..11 S 7..4 G12 P12 C12 G8 P8 C8 C0 B7..4 A3..0 B 3..0 A B A B A B A B C0 3 3 2 2 1 1 0 0 5 tPD Generate Propagate G3 P Generate Propagate G2 P Generate Propagate G1 P Generate Propagate G0 P 4 4 4 CLA er G PG 4-bit CLA Adder GG PG 4 3 tP S 3..0 3 2 1 d C4 Carry Look-Ahead Unit C3 C2 C1 D 7 t A3 B3 A2 B2 A1 B1 PDA0 B0 Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) G4 P4 C4 nit G0 P0 S3 S2 S1 S0 Carries in 4-bit CLAU ready after another 2 tPD

B7..4 A3..0 B 3..0 A B A B A B A B C0 3 3 2 2 1 1 0 0 5 tPD Generate Propagate G3 P Generate Propagate G2 P Generate Propagate G1 P Generate Propagate G0 P 4 4 4 CLA er G PG 4-bit CLA Adder GG PG 4 3 tP S 3..0 3 2 1 d C4 Carry Look-Ahead Unit C3 C2 C1 D 7 t A3 B3 A2 B2 A1 B1 PDA0 B0 Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) Full Adder (Sum Only) G4 P4 C4 nit G0 P0 GG PG S3 S2 S1 8 tPD S0 Sum bits ready after another tPD

A B A8..11 B8..11 A7..4 B7..4 A3..0 B 3..0 C0 5 t PD 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4-bit CLA Adder GG PG 4 3 tP S 3..0 4 4 4 8 tPD S S 15..12 S 7..4 D 8..11 G8 P8 C8 G4 P4 C4 Carry Look-Ahead Unit G0 P0 G12 P12 C12 C16 Overall Critical Path Delay is 8 tPD

A B A B A B A15..0 B15..0 C0 16-bit CLA Adder GG PG 16-bit CLA Adder GG PG 16-bit CLA Adder GG PG 16-bit CLA Adder GG PG 16 S15..0 16 16 16 S63..48 S47..32 S G0 P0 GG PG G48 P48 C48 G32 P32 C32 G16 P16 C16 Carry Look-Ahead Unit C64 Exercise: What is the Critical Path Delay of this 64-bit CLA?

CLA Critical Path Delay
Delay for an N-bit Carry Look-Ahead Adder is 4 log4 N tPD + + + + G/P G/P G/P G/P Group Gi Pi terms propagate down Group carry terms propagate up Carry into Full Adder 4-bit CLAU 4-bit CLAU 4-bit CLAU 4-bit CLAU 4-bit CLAU 4-bit CLAU 4-bit CLAU 4-bit CLAU 16-bit Carry Look-Ahead Unit 16-bit Carry Look-Ahead Unit 64-bit Carry Look-Ahead Unit Delay increases logarithmically w.r.t. N Proportional to number of hierarchical “levels”

CPA vs CLA Critical Path Delay
140 Critical Path Delay (gate propagation delays) Carry Propagate Adder O N  120 100 80 60 40 Carry Look-Ahead Adder O log N  20 10 20 30 Adder Size 40 50 60

Carry Save Adder (CSA): Adding Multiple Operands

Motivation How do you add more than two numbers?
There are many applications where you need to add more than two numbers at a time Multiplication is a good example which we will see soon … Say we want to add M N-bit numbers…

M 1 tCLA  O M  log N  One Approach … CLA
Add first two numbers. Add the next number to the previous sum. Repeat. Need M-1 adders Assuming CLA adders, delay is M 1 tCLA  O M  log N 

Another Approach… log2 M  tCLA  O log M  log N 
Use Full Tree Topology CLA CLA Add pairs of numbers in parallel. Add pairs of partial sums in parallel. Repeat for each level. CLA CLA CLA CLA CLA CLA CLA Still need M-1 adders CLA CLA CLA Assuming CLA adders, delay is log2 M  tCLA  O log M  log N 

Better Approach: Carry Save Adder
Defer the addition of carry bits until later Basic Idea: Add 3 operands at a time (A, B, D) Produce 2 output numbers: partial sum bits (P) and carry bits (C) Add partial sum and carry with a fast adder A  B  D  P  2  C Example: A = 9, B = 12, D = 13  P = 8, C = 13  P+2*C = 34 } A 1001 9 B 1100 12 D 1101 13 P 1000 8 C 1101_ S 100010 34 Do this with Full Adders Do this with Carry Look-Ahead Adder

Add 3 Numbers – Produce 2 Outputs
Remember, full adders already add 3 operands and produce 2 outputs Use Full Adders but don’t connect the carry chain A N-1 BN-1 DN-1 A1 B1 D1 A0 B0 D0 … CN-1 P N-1 C1 P1 C0 P0

Critical Path Delay … Critical Path Delay is just the Full Adder Delay
A N-1 BN-1 DN-1 A1 B1 D1 A0 B0 D0 … CN-1 P N-1 C1 P1 C0 P0 All carry bits ready after 3 tPD independent of N

Adding More than 3 Operands
Use 3-input CSA as building block A B D N N N N-Bit CSA N C P Delay of a CSA adder INDEPENDENT of N O 1

Adding 4 Operands Need a standard adder such as CLA in the end to
A B D N-Bit CSA E N-Bit CSA CLA S Need a standard adder such as CLA in the end to add the two remaining numbers into one.

Adding 6 Operands A B D E F G S N-Bit CSA N-Bit CSA N-Bit CSA
CLA S

Implement through wiring.
What about the 2C part? A  B  D  P  2  C Multiplying by 2 can be done with bit-shifting. Implement through wiring. A3 B3 D3 A2 B2 D2 A1 B1 D1 A0 B0 D0 A B D N-Bit CSA C P FA FA FA FA C P C P C P C P E E E E E 3 2 1 N-Bit CSA C P FA FA FA FA FA C P C P C P C P C P CLA CLA 5 S S

What about the 2C part? CSA adder width increases by 1-bit per level to prevent overflow No problem! CSA delay does not scale with width! A3 B3 D3 A2 B2 D2 A1 B1 D1 A0 B0 D0 A B D N-Bit CSA C P 4-bit CSA FA FA FA FA C P C P C P C P E E E E E 3 2 1 N-Bit CSA C P FA FA FA FA FA 5-bit CSA C P C P C P C P C P CLA CLA 5-bit CLA 5 S S

Multiplication

Multiplication 0 x 0 = 0 0 x 1 = 0 A 1 x 0 = 0 B 1 x 1 = 1 A x B
Multiplying 1-bit numbers is AND operation 0 x 0 = 0 0 x 1 = 0 1 x 0 = 0 1 x 1 = 1 A B A x B

Multiplication Multiplying 1-bit x N-bit is AND operation
A3 A2 A1 A0 B A x B = P x 0 = 0000 1011 x 1 = 1011 P3 P2 P1 P0

N-Bit x N-Bit (Unsigned) Multiplication
Consider 4-bit x 4-bit Multiplication A3 B3 A2 B2 A1 B1 A0 B0 X

Consider 4-bit x 4-bit Multiplication A3 B3 A2 B2 A1 B1 A0 B0 X A . B A . B A . B A . B 3 0 2 0 1 0 0 0

Consider 4-bit x 4-bit Multiplication A3 B3 A2 B2 A1 B1 A0 B0 X A . B A . B A . B A . B 3 0 2 0 1 0 0 0 A . B A . B A . B A . B 3 1 2 1 1 1 0 1

Consider 4-bit x 4-bit Multiplication A3 B3 A2 B2 A1 B1 A0 B0 X A . B A . B A . B A . B 3 0 2 0 1 0 0 0 A . B A . B A . B A . B 3 1 2 1 1 1 0 1 A . B A . B A . B A . B 3 2 2 2 1 2 0 2

Consider 4-bit x 4-bit Multiplication A3 B3 A2 B2 A1 B1 A0 B0 X A . B A . B A . B A . B 3 0 2 0 1 0 0 0 A . B A . B A . B A . B 3 1 2 1 1 1 0 1 A . B A . B A . B A . B 3 2 2 2 1 2 0 2 A . B A . B A . B A . B 3 3 2 3 1 3 0 3

Consider 4-bit x 4-bit Multiplication A3 B3 A2 B2 A1 B1 A0 B0 X A . B A . B A . B A . B 3 0 2 0 1 0 0 0 A . B A . B A . B A . B 3 1 2 1 1 1 0 1 A . B A . B A . B A . B 3 2 2 2 1 2 0 2 + A . B A . B A . B A . B 3 3 2 3 1 3 0 3 P7 P6 P5 P4 P3 P2 P1 P0

Consider 4-bit x 4-bit Multiplication A3 B3 A2 B2 A1 B1 A0 B0 X A . B A . B A . B A . B 3 0 2 0 1 0 0 0 A . B A . B A . B A . B 3 1 2 1 1 1 0 1 A . B A . B A . B A . B 3 2 2 2 1 2 0 2 + A . B A . B A . B A . B 3 3 2 3 1 3 0 3 P7 P6 P5 P4 P3 P2 P1 P0 Sounds familiar, Right! : Multiplication by Hand How to express this multiplication in Hardware?

Hardware Implementation: N-Bit x N-Bit (Unsigned) Multiplication

Hardware Implementation: N-Bit x N-Bit (Unsigned) Multiplication
Do you see any problem in this circuit?

Using Fast Adders…. Fast Adder Fast Adder HA Fast Adder A3 B1 0
A2 B1 A3 B0 A1 B1 A2 B0 A0 B1 A1 B0 A0 B0 Fast Adder A3 B2 A2 B2 A1 B2 A0 B2 HA Fast Adder A3 B3 A2 B3 A1 B3 A0 B3 Fast Adder P7 P6 P5 P4 P3 P2 P1 P0

Large Multipliers [aka Decomposed Multipliers]
Construct large multipliers out of smaller multipliers Construct 2N-bit x 2N-bit multiplier using N-bit x N-bit multipliers Let A and B be 2N-bit numbers such that: A  A2 N 1A2 N 2 …ANAN 1AN 2 …A0 AH AL B  B2 N 1B2 N 2 …BNBN 1BN 2 …B0 BH BL { {

2N-Bit x 2N-Bit Multiplier
AH is the upper N bits of A and AL is the lower N bits of A BH is the upper N bits of B and BL is the lower N bits of B Therefore, A  A  2 N N A B  B  2  B H L H L And so, A  B  A     2  A N  B  2 N B H L H L    A B 2 2 N A B  A B 2  A B N H H H L L H L L

  A  B  A B 2 A A  A B 2  A B N H H H L L H L L { N-Bit x N-Bit Multipliers

Bit Shifting { { 2 N   A  B  A B 2 A A  A B 2  A B N H H H L L H L L

  A  B  A B 2 A A  A B 2  A B N H H H L L H L L { { 2N-Bit Adders

AH AL BH BL AL BL AH BL AL BH AH BH 4N-Bit Result

Fmax For A Combinational Multiplier
In Verilog, P = A * B; (make sure P has twice the number of bits of A and B) Synthesis tool will create a combinational multiplier if DSP Block inference is turned off. Measured on our Cyclone FPGA: 8 bits x 8 bits: 16 bits x 16 bits: 32 bits x 32 bits: 64 bits x 64 bits: Fmax = 464 MHz Fmax = 396 MHz Fmax = 104 MHz Fmax = 66 MHz

Serial (Multi-cycle) Multiplier

Multi Cycle Multiplier
How do you multiply by hand? 1101 A 1011 B 1101 11010 000000 P

One Possible Algorithm
1101 11010 000000 P P = 0 while B != 0: if B(0) == 1: P = P + A A = << 1 B >> Note: This is NOT Verilog

Example Pold P 1311  143 11011011  10001111 A B P = P + A A = A
A Pold P 1 B P = 0 while B != 0: if B(0) == 1: P = P + A A = A << 1 B = B >> 1

A Pold P 1 B 1 P = 0 while B != 0: if B(0) == 1: P = P + A A = A << 1 B = B >> 1

A Pold P 1 B 1 1 1 1 P = 0 while B != 0: if B(0) == 1: P = P + A A = A << 1 B = B >> 1

A Pold P 1 B 1 1 1 1 1 1 1 P = 0 while B != 0: if B(0) == 1: P = P + A A = A << 1 B = B >> 1

A Pold P 1 B 1 1 P = 0 while B != 0: if B(0) == 1: P = P + A A = A << 1 B = B >> 1

Continue until B is ZERO

Final Result Pold P 1311  143 11011011  10001111 A B P = P + A A =
A Pold P B 1 1 1 1 1 1 1 P = 0 while B != 0: if B(0) == 1: P = P + A A = A << 1 B = B >> 1

shiftA loadB shiftB loadP selP z
Top – Level Schematic Controller (State Machine) Datapath b0 s done P n 2n loadA shiftA loadB shiftB loadP selP z B A When s goes high, n-bit values are available on A and B. The machine then multiplies, and when it is finished, asserts done and puts the result on P. It then waits for s to go low.

Datapath selP P = 0 while B != 0 if B(0) == 1 P = P + A
DataB (input) n n n 2n loadA shiftA clk load shiftleft loadB shiftB clk load shiftright 2n-bit Register A n-bit Register B 2n n adder 2n 2n z b0 selP Psel 1 P = 0 while B != 0 if B(0) == 1 P = P + A A = A << 1 B = B >> 1 loadP load 2n-bit Register P clk 2n P (output) Page 40

State Machine reset s=1 z=0 s=0 s=1 z=1 loadP=1 selP=0 done=1 s=0
shiftA=1 shiftB=1 selP=1 if (b0) loadP=1 else loadP=0 loadP=1 selP=0 done=1 s=0

Fmax For A Multi-Cycle Multiplier
Measured on our Cyclone FPGA: Combinational Multiplier Fmax from earlier slide: 8 bits x 8 bits: 16 bits x 16 bits: 32 bits x 32 bits: 64 bits x 64 bits: Fmax = 464 MHz Fmax = 396 MHz Fmax = 104 MHz Fmax = 66 MHz Multi-Cycle Multiplier Fmax: 64 bits x 64 bits: Fmax = 400 MHz Faster Fmax, but there is now LATENCY Takes up to N clock cycles to compute product

Division

Fmax For A Combinational Divider
In Verilog you can infer a combinational divider: Q = A / B; Measured on our Cyclone FPGA: 8 bits / 8 bits: 16 bits / 16 bits: 32 bits / 32 bits: 64 bits / 64 bits: Fmax = 79 MHz Fmax = 25 MHz Fmax = 9 MHz Fmax = 3 MHz (uses 13% of the logic resources!) Too slow and too large. Almost always use multi-cycle divider in practice

Review: Pencil and Paper Division
100 10 01100 1001 001 01 10000 1110 101 Q A B R 9 140 50 45 5 15

Restoring Division Algorithm
Dividend A (N bit) Divisor B (N-1 bit) Quotient Q (N bit) Remainder R (N-1 bit) S = A << 1 N times: repeat = - B S2N-1..N S2N-1..N if S < 0: = + B A  Q  B  R S2N-1..N S2N-1..N S << 1 S0 = 0 else S0 = 1 S is an internal variable that is 2N bits wide At the end, SN is not used Q = R = SN S2N-1..N+1 Note: This is NOT Verilog

(note that bit 8 is ignored)
Flow Chart Start 1. Load in the divisor into the 8-bit Divisor register and the dividend into the 16-bit Remainder register. Shift the remainder left by 1 bit. 2. Subtract the Divisor register from the left half of the Remainder register, and place the result in the left half of the Remainder register The 16-bit Remainder register referred to in this flowchart is S from the pseudo-code. Remainder >= 0 Remainder < 0 Test Remainder 3b. Restore the original value by adding the Divisor Register to the left half of the remainder register and place the sum in the left half of the Remainder Register. Also, shift the Remainder register one bit to the left, setting the new rightmost bit to 0 3a. Shift the Remainder register one bit to the left, setting the new rightmost bit to 1 No 8th Repetition? Yes Done. Quotient is in bits (7:0) of Remainder Register and Remainder is in bits (15:9) of Remainder Register (note that bit 8 is ignored)

Example = S15..8 - S15..8 S15..8 = S << 1 S15..8 + S7..0 S15..9
9 140 remainder  5 1111 S = A << 1 N times: = S < 0: remainder  101 repeat S15..8 if S B S15..8 = S << 1 R0 = 0 else S B 1 B S << 1 S0 = 1 S7..0 S15..9 1 S Q = R =

Example - = S15..8 - S15..8 S15..8 = S << 1 S15..8 + S7..0
9 140 remainder  5 1111 S = A << 1 N times: = S < 0: remainder  101 repeat S15..8 if S B S15..8 = S << 1 S0 = 0 else S B 1 S old - 1 B S << 1 S0 = 1 S7..0 S15..9 1 S Q = R = ITERATION 1

Example + = S15..8 - S15..8 S15..8 = S << 1 S15..8 + S7..0
9 140 remainder  5 1111 S = A << 1 N times: = S < 0: remainder  101 repeat S15..8 if S B S15..8 = S << 1 S0 = 0 else S B 1 S old + 1 B S << 1 S0 = 1 S7..0 S15..9 1 S Q = R = ITERATION 1

Example = S15..8 - S15..8 S15..8 = S << 1 S15..8 + S7..0 S15..9
9 140 remainder  5 1111 S = A << 1 N times: = S < 0: remainder  101 repeat S15..8 if S B S15..8 = S << 1 S0 = 0 else S B 1 S old S << 1 S0 = 1 S7..0 S15..9 1 S Q = R = ITERATION 1

Example - = S15..8 - S15..8 S15..8 = S << 1 S15..8 + S7..0
9 140 remainder  5 1111 S = A << 1 N times: = S < 0: remainder  101 repeat S15..8 if S A S15..8 = S << 1 S0 = 0 else S A 1 S old - 1 B S << 1 S0 = 1 S7..0 S15..9 1 S Q = R = ITERATION 2

Example + = S15..8 - S15..8 S15..8 = S << 1 S15..8 + S7..0
9 140 remainder  5 1111 S = A << 1 N times: = S < 0: remainder  101 repeat S15..8 if S B S15..8 = S << 1 S0 = 0 else S B 1 S old + 1 B S << 1 S0 = 1 S7..0 S15..9 1 S Q = R = ITERATION 2