VLSI Arithmetic Adders & Multipliers

Prof. Vojin G. Oklobdzija University of California Prof. V.G. Oklobdzija VLSI Arithmetic

Introduction
Digital computer arithmetic belongs to computer architecture, but it is also an aspect of logic design. Its objective is to develop algorithms that use the available hardware as efficiently as possible. Because the hardware can perform only a relatively simple and primitive set of Boolean operations, arithmetic operations are built as a hierarchy on top of those simple ones. Speed, power and chip area are the most often used measures of the efficiency of an algorithm, so there is a strong link between the algorithms and the technology used to implement them.

Basic Operations
- Addition
- Multiplication
- Multiply-Add
- Division
- Evaluation of Functions

Addition of Binary Numbers
Full Adder. The full adder is the fundamental building block of most arithmetic circuits.
[Figure: full-adder block with inputs ai, bi, Cin and outputs si, Cout]
The sum and carry outputs are described as:
$$s_i = a_i \oplus b_i \oplus c_i$$
$$c_{i+1} = a_i b_i + a_i c_i + b_i c_i$$

Addition of Binary Numbers
Full-adder truth table:

Inputs      Outputs
ci ai bi    si ci+1
0  0  0     0  0
0  0  1     1  0      Propagate
0  1  0     1  0      Propagate
0  1  1     0  1      Generate
1  0  0     1  0
1  0  1     0  1      Propagate
1  1  0     0  1      Propagate
1  1  1     1  1      Generate

Full-Adder Implementation
Full-adder operation is defined by the equations:
$$s_i = a_i \oplus b_i \oplus c_i = p_i \oplus c_i$$
$$c_{i+1} = a_i b_i + c_i (a_i \oplus b_i) = g_i + p_i c_i$$
with Carry-Propagate $p_i = a_i \oplus b_i$ and Carry-Generate $g_i = a_i b_i$.
First we examine the realization of a one-bit adder, which is the basic building block for all the more elaborate addition schemes. Here ai, bi, and ci are the inputs to the i-th full-adder stage, and si and ci+1 are its sum and carry outputs, respectively. The equations show that the sum function requires two XOR gates. The carry function can be rewritten using the Carry-Propagate pi and Carry-Generate gi terms. If Carry-Propagate is 1, the carry out of the stage equals the carry into the stage: ci+1 = ci. If Carry-Generate is 1, the carry out of the stage is 1 regardless of the value of the incoming carry. The logic implementation of the full-adder stage is shown in figure (a) of this slide; it results from a direct application of the logic equations. Implementation (b) is more clever because it uses a multiplexer in the carry path. Given that a multiplexer block is often faster than a single gate, placing a multiplexer in the critical path helps to achieve better performance.
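The three views of the full adder described in the notes (direct equations, propagate/generate form, and multiplexer in the carry path) can be compared in a short behavioral sketch. The function names are illustrative only and do not come from the slides; the model says nothing about circuit delay.

```python
from itertools import product

def full_adder(a, b, c):
    """Direct form: sum is a 3-input XOR, carry is the majority function."""
    s = a ^ b ^ c
    c_out = (a & b) | (a & c) | (b & c)
    return s, c_out

def full_adder_pg(a, b, c):
    """Propagate/generate form: p = a XOR b, g = a AND b."""
    p = a ^ b
    g = a & b
    return p ^ c, g | (p & c)

def full_adder_mux(a, b, c):
    """Multiplexer in the carry path: p selects between the carry-in and a."""
    p = a ^ b
    # When p = 0 the inputs are equal, so the carry out is simply a (or b).
    c_out = c if p else a
    return p ^ c, c_out

# Exhaustive check over all eight input combinations.
for a, b, c in product((0, 1), repeat=3):
    s, c_out = full_adder(a, b, c)
    assert (s, c_out) == full_adder_pg(a, b, c) == full_adder_mux(a, b, c)
    assert a + b + c == s + 2 * c_out
```

All three forms agree on every input, which is why the multiplexer version can replace the gate-level carry logic without changing the function.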

High-Speed Addition
A one-bit adder can be implemented more efficiently with a multiplexer in the carry path, because the MUX is faster.

The Ripple-Carry Adder
From Rabaey

Inversion Property
From Rabaey

Minimize Critical Path by Reducing Inverting Stages
From Rabaey

Manchester Carry-Chain Realization of the Carry Path
A simple and very popular scheme for implementing the carry signal path.
The Manchester Carry Chain is a simple addition scheme that was very popular when LSI nMOS technology was emerging. It is an alternative, switch-based technique implemented in pass-transistor logic. Its speed is impressive, owing to its simplicity and to the properties of pass-transistor logic. The Manchester Carry Chain does not require a large area and consumes substantially less power than Carry-Lookahead or other more elaborate schemes. A realization of the Manchester Carry Chain is shown in the slide. Because of the RC delay of the chain, the signal must be regenerated by inserting inverters at appropriately chosen locations in the carry chain.

Manchester Carry Chain
- Implement P with pass-transistors
- Implement G with pull-up, kill (delete) with pull-down
- Use dynamic logic to reduce the complexity and speed up
Kilburn, et al, IEE Proc, 1959.
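As a purely behavioral sketch of the chain (no transistor-level or RC effects), each stage either forces its output node to 1 (generate), forces it to 0 (kill), or passes the previous node through (propagate). The function name and the little-endian bit-list convention are assumptions made for illustration.

```python
def manchester_chain(a_bits, b_bits, c_in=0):
    """Return the carry nodes [c0, c1, ..., cN] of a behavioral Manchester chain.

    a_bits and b_bits are little-endian lists of 0/1 values.
    """
    carries = [c_in]
    for a, b in zip(a_bits, b_bits):
        if a & b:                 # generate: node pulled to 1
            nxt = 1
        elif not (a | b):         # kill: node pulled to 0
            nxt = 0
        else:                     # propagate: pass switch connects the nodes
            nxt = carries[-1]
        carries.append(nxt)
    return carries
```

Exactly one of the three conditions holds at each bit, so every carry node is always driven, which is the property the switch-based implementation relies on.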

Ripple Carry Adder Critical Path
Carry chain of an RCA implemented using a multiplexer from the standard cell library, with the critical path marked. Oklobdzija, ISCAS’88
A ripple carry adder for N-bit numbers is implemented by concatenating N full adders, as shown in this slide. At the i-th bit position, the i-th bits of operands A and B and the carry signal from the preceding adder stage are used to generate the i-th bit of the sum, si, and a carry, ci+1, to the next adder stage. This scheme is called a Ripple Carry Adder because the carry signal “ripples” from the least significant bit position to the most significant one. With N concatenated full adders, the delay from Cin to Cout is 2N gate delays. The path from input to output that is likely to take the longest time is designated the "critical path". In a Ripple Carry Adder, this is the path from the least significant input a0 or b0 to the last sum bit sn. Assuming a multiplexer-based XOR gate implementation, this critical path consists of N+1 pass-transistor delays. However, such a long chain of transistors significantly degrades the signal, so some amplification points are necessary. In practice, the critical path can be built from a multiplexer cell of a standard cell library, as shown in this slide.
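The ripple structure can be sketched behaviorally as a loop in which each stage consumes the carry produced by the previous one. This is a minimal illustration with hypothetical helper names (`to_bits`, `from_bits`, `add`); it models the logic only, not the 2N gate-delay timing.

```python
def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two little-endian bit lists; returns (sum_bits, carry_out)."""
    carry = c_in
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        p = a ^ b                        # propagate
        sum_bits.append(p ^ carry)       # s_i = p_i XOR c_i
        carry = (a & b) | (p & carry)    # c_{i+1} = g_i + p_i c_i
    return sum_bits, carry

def to_bits(x, n):
    """Little-endian list of the n low-order bits of x."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    """Integer value of a little-endian bit list."""
    return sum(b << i for i, b in enumerate(bits))

def add(x, y, n=8):
    """n-bit ripple-carry addition of two non-negative integers."""
    s, c = ripple_carry_add(to_bits(x, n), to_bits(y, n))
    return from_bits(s) + (c << n)
```

The loop-carried dependence on `carry` is the software analogue of the critical path: stage i cannot finish before stage i-1 has produced its carry.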

Pass-Transistor Realization in DPL
The ability of pass-transistor logic to provide an efficient multiplexer implementation has been exploited in the CPL and DPL logic families. Even an XOR gate is implemented more efficiently using a multiplexer topology. A full-adder cell that is entirely multiplexer based was published by Hitachi and is shown in this slide. This full-adder realization contains only two transistors in the Input-to-Sum path and only one transistor in the Cin-to-Cout path (not counting the buffer). The short critical path contributes to the remarkable speed of this implementation.

Carry-Skip Adder
MacSorley, Proc IRE 1/61
Lehman, Burla, IRE Trans on Comp, 12/61

Carry-Skip Adder Bypass
From Rabaey

Carry-Skip Adder: N-bits, k-bits/group, r=N/k groups
Since the Cin-to-Cout path is the longest path in the ripple-carry adder, an obvious step is to accelerate carry propagation through the adder. This is accomplished by using the Carry-Propagate pi signals within a group of bits. If all the pi signals within the group are 1, the condition exists for the carry to bypass the entire group:
$$P = p_i \cdot p_{i+1} \cdot p_{i+2} \cdots p_{i+k-1}$$
The Carry Skip Adder divides the words to be added into groups of equal size, k bits each. The basic structure of an N-bit Carry Skip Adder is shown here. Within a group, the carry propagates in ripple-carry fashion. In addition, an AND gate forms the group propagate signal. If the group propagate signal is true, the carry bypasses the group, as shown in this slide. The maximal delay of a Carry Skip Adder occurs when a carry is generated in the least significant bit position, ripples through k-1 bit positions, skips over N/k-2 groups in the middle, ripples through the k-1 bits of the most significant group, and is assimilated in the Nth bit position to produce the sum SN:
$$\Delta_{CSA} = (k-1)\,\Delta_{rca} + \left(\frac{N}{k} - 2\right)\Delta_{SKIP} + (k-1)\,\Delta_{rca}$$
where $\Delta_{rca}$ is the ripple delay per bit and $\Delta_{SKIP}$ is the delay of skipping one group. Thus the Carry Skip Adder is faster than the Ripple Carry Adder at the expense of a few relatively simple modifications. Its delay still depends linearly on the adder size N, but this linear dependence is reduced by a factor of 1/k.
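The group-bypass idea and the worst-case path described above can be sketched as follows. This is a behavioral model under simplifying assumptions: equal group sizes, unit delays by default, and hypothetical names (`carry_skip_add`, `worst_case_delay`, `d_rca`, `d_skip`) that are not taken from the slides.

```python
def carry_skip_add(x, y, n=16, k=4, c_in=0):
    """Functional model of an n-bit carry-skip adder with k-bit groups."""
    assert n % k == 0
    carry, result = c_in, 0
    for base in range(0, n, k):
        group_in = carry
        group_p = 1
        for i in range(base, base + k):
            a, b = (x >> i) & 1, (y >> i) & 1
            p, g = a ^ b, a & b
            result |= (p ^ carry) << i
            carry = g | (p & carry)      # ripple inside the group
            group_p &= p                 # AND gate forming the group propagate
        # Skip multiplexer: if every bit propagates, the group carry-in
        # bypasses the ripple chain. Both inputs carry the same logic value;
        # only the arrival time differs in hardware.
        carry = group_in if group_p else carry
    return result | (carry << n)

def worst_case_delay(n, k, d_rca=1.0, d_skip=1.0):
    """Ripple k-1 bits, skip n/k - 2 groups, ripple k-1 bits again."""
    return (k - 1) * d_rca + (n // k - 2) * d_skip + (k - 1) * d_rca
```

For example, `worst_case_delay(32, 4)` evaluates to 12 unit delays under this simple model, while larger groups trade longer ripple segments for fewer skips.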

Carry-Skip Adder

Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
The idea behind the Variable Block Adder is to minimize the longest critical path in the carry chain of the Carry Skip Adder by allowing the groups to take different sizes. In general, this optimization does not increase complexity compared to the Carry Skip Adder. A carry chain of a 32-bit Variable Block Adder is shown. The first and last blocks are smaller, and the intermediate blocks are larger. This compensates for the critical paths originating at the ends: the carry ripples through shorter end groups and skips over larger groups in the middle. The optimization has two important consequences. First, the total delay is reduced compared to the Carry Skip Adder. Second, the delay is no longer a linear function of the adder size N, as in the Carry Skip Adder; it follows a square-root function of N instead. The approach can also be extended to multiple levels of carry skip, which is a linear programming problem that does not yield a closed-form solution. The speed of such a multiple-level adder surpasses that of a fixed-group Carry-Lookahead Adder, and it achieves this at lower area and power consumption. The Variable Block Adder has the lowest energy-delay product of the adders in its class.

Any-point-to-any-point delay = 9 D as compared to 12 D for CSKA
[Figure: carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985) with block sizes and path delays annotated; D=9]

Carry-chain block size determination for a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

Delay Calculation for Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Delay model:

Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Variable Group Length
Oklobdzija, Barnes, Arith’85

Carry-chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
- Variable Block Lengths
- No closed form solution for delay
- It is a dynamic programming problem

Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Curves: VBA, CLA, VBA Multi-Level

Fan-Out Dependency

Fan-In Dependency

Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

Carry-Lookahead Adder (Weinberger and Smith)
A. Weinberger and J. L. Smith, “A Logic for High-Speed Addition”, National Bureau of Standards, Circ. 591, p.3-12, 1958.

Carry-Lookahead Adder
- One gate delay D to calculate p, g
- One D to calculate P and two for G
- Three gate delays to calculate C4(j+1)
- Compare that to 8 D in RCA!

- Additional two gate delays
- C16 will take a total of 5D vs. 32D for RCA!

32-bit Carry Lookahead Adder
A significant speed improvement in the implementation of a parallel adder was introduced by the Carry-Lookahead Adder developed by Weinberger and Smith in 1958. It is theoretically one of the fastest schemes, since the delay to add two numbers depends on the logarithm of the operand size. The Carry-Lookahead Adder uses modified full adders for each bit position and Lookahead modules that generate carry signals independently for a group of k bits. In the most common case the group size is 4 bits. In addition to the carry signal for the group, Lookahead modules produce group carry generate G and group carry propagate P outputs, which indicate that a carry is generated within the group or that an incoming carry would propagate across the group. The carry out of a 4-bit group, ci+4, can be computed in four gate delays: one gate delay to compute pi and gi for bit positions i through i+3, a second gate delay to evaluate Pj, the second and third to evaluate Gj, and the third and fourth to calculate the carry signals ci+1, ci+2, ci+3 and ci+4. Actually, if not limited by fan-in constraints, ci+4 could be calculated concurrently with Gj and would be available after three gate delays. In a recursive fashion, we can create a "group of groups", or "super-group". The inputs to the super-group are the G and P signals from the previous level. The super-group produces P* and G* signals indicating that the carry will be propagated across, or generated in, the groups within the super-group domain. A super-group produces a carry signal out of the super-group as well as an input carry signal for each of the groups in the level above.
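The group and super-group recursion described in these notes can be sketched with one 4-input lookahead function reused at both levels. This is an illustrative behavioral model, assuming a fixed group size of 4 and p defined as XOR; the names `lookahead` and `cla16` are hypothetical.

```python
def lookahead(p, g, c0):
    """Four-input lookahead module.

    p, g: lists of four propagate/generate bits (least significant first).
    Returns ([c1, c2, c3, c4], P, G), each carry written as a flat expression.
    """
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    P = p[3] & p[2] & p[1] & p[0]
    c4 = G | (P & c0)
    return [c1, c2, c3, c4], P, G

def cla16(x, y, c0=0):
    """16-bit adder built from four 4-bit groups and one super-group."""
    a = [(x >> i) & 1 for i in range(16)]
    b = [(y >> i) & 1 for i in range(16)]
    p = [ai ^ bi for ai, bi in zip(a, b)]
    g = [ai & bi for ai, bi in zip(a, b)]
    # Level 1: group P and G do not depend on the carry-in.
    groups = [lookahead(p[j:j + 4], g[j:j + 4], 0) for j in range(0, 16, 4)]
    group_p = [grp[1] for grp in groups]
    group_g = [grp[2] for grp in groups]
    # Level 2: the super-group produces c4, c8, c12 and c16.
    group_carries, _, _ = lookahead(group_p, group_g, c0)
    carry_in = [c0] + group_carries[:3]
    total = 0
    for j in range(4):
        # Back at level 1: each group computes its internal carries.
        local, _, _ = lookahead(p[4 * j:4 * j + 4], g[4 * j:4 * j + 4], carry_in[j])
        c = [carry_in[j]] + local[:3]
        for i in range(4):
            total |= (p[4 * j + i] ^ c[i]) << (4 * j + i)
    return total | (group_carries[3] << 16)
```

Because the same module serves both levels, adding another level of super-groups extends the adder to 64 bits with only one more pass up and down the hierarchy.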

Carry-Lookahead Adder (Weinberger and Smith: original derivation)

Carry-Lookahead Adder (Weinberger and Smith)
Please notice the similarity with Parallel-Prefix Adders!

Delay Optimized CLA
B. Lee, V. G. Oklobdzija, Journal of VLSI Signal Processing, Vol.3, No.4, October 1991

Delay Optimized CLA: Lee-Oklobdzija ‘91
(a.) Fixed groups and levels
(b.) variable-sized groups, fixed levels
(c.) variable-sized groups and fixed levels
(d.) variable-sized groups and levels

Two-Levels of Logic Implementation of the Carry-Lookahead Block

Three-Levels of Logic Implementation of the Carry-Lookahead Block (restricted fan-in)

Delay: Two-level BCLA
Delay: Three-level BCLA

(a.) 2-level BCLA D=8.5nS
(b.) 3-level BCLA D=8.9nS

Motorola: CLA Implementation Example
A. Naini, D. Bearden and W. Anderson, “A 4.5nS 96b CMOS Adder Design”, Proceedings of the IEEE Custom Integrated Circuits Conference, May 3-6, 1992.

Critical path in Motorola's 64-bit CLA
As opposed to Ripple or Carry-Skip Adders, the critical path in the Carry-Lookahead Adder travels in the vertical direction rather than the horizontal one, as shown in the previous slide. The delay of the Carry-Lookahead Adder is therefore not directly proportional to the adder size N, but to the number of levels used. Since the groups and super-groups form a tree structure, the delay is proportional to the logarithm of N. This logarithmic dependency makes the Carry-Lookahead Adder one of the theoretically fastest structures for addition. However, it can be argued that its speed efficiency has passed the point of diminishing returns, given the fan-in and fan-out dependencies of the logic gates and the inadequacy of a delay model based on counting the gates in the critical path. In reality, the Carry-Lookahead Adder achieves lower speed than expected, especially when compared to some techniques that consume less hardware. An example of a Carry-Lookahead Adder and its critical path, as implemented in a Motorola processor, is shown in this slide.

Motorola's 64-bit CLA conventional PG Block

Motorola's 64-bit CLA Modified PG Block
Intermediate propagate signals Pi:0 are generated to speed-up C3

Ling’s Adder
Huey Ling, “High-Speed Binary Adder”, IBM Journal of Research and Development, Vol.25, No.3, May 1981.

Ling Adder
Variation of CLA. Ling’s equations:
Ling, IBM J. Res. Dev, 5/81

Ling Adder
Ling’s equation propagates information on two bits
Doran, Trans on Comp 9/88

Ling Adder
Conventional:
Ling:
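The equations from these slides did not survive extraction, so the following is only a sketch of one common formulation of Ling's idea, stated here as an assumption rather than as the slide's own notation: with $g_i = a_i b_i$ and $t_i = a_i + b_i$, the pseudo-carry $H_i = c_i + c_{i-1}$ satisfies $H_i = g_i + t_{i-1} H_{i-1}$, and the true carry is recovered as $c_i = t_i H_i$, where $c_i$ is the carry out of bit i. The function name `ling_add` is hypothetical.

```python
def ling_add(x, y, n=8):
    """n-bit addition through Ling pseudo-carries (carry-in assumed 0)."""
    total = 0
    h_prev = 0   # H_{-1}
    t_prev = 0   # t_{-1}
    for i in range(n):
        a, b = (x >> i) & 1, (y >> i) & 1
        g, t = a & b, a | b
        c_in = t_prev & h_prev        # c_{i-1} = t_{i-1} H_{i-1}
        total |= (a ^ b ^ c_in) << i
        h = g | c_in                  # H_i = g_i + t_{i-1} H_{i-1}
        c_out = g | (t & c_in)        # conventional carry, for comparison
        assert c_out == (t & h)       # c_i = t_i H_i
        assert h == (c_out | c_in)    # H_i = c_i + c_{i-1}
        h_prev, t_prev = h, t
    return total | ((t_prev & h_prev) << n)
```

The pseudo-carry recurrence drops one AND term from the first level compared with the conventional carry, which is where the speed advantage of the scheme comes from; the missing term is reintroduced when the sum is formed.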

S. Naffziger, ISSCC’96

Results: S. Naffziger, “A Subnanosecond 64-b Adder”, ISSCC ‘96
- 0.5u Technology
- Speed: nS
- Nominal process, 80C, V=3.3V

Conditional-Sum Adder
J. Sklansky, “Conditional-Sum Addition Logic”, IRE Transactions on Electronic Computers, EC-9, p , 1960.

Carry-Select Adder
O. J. Bedrij, “Carry-Select Adder”, IRE Transactions on Electronic Computers, June 1962, p

Carry-Select Adder
Addition under assumption of Cin=0 and Cin=1.
The theoretically fastest scheme for addition of two numbers is "Conditional-Sum Addition", proposed by Sklansky in 1960. The essence of the scheme is the realization that two numbers can be added without waiting for the carry signal to arrive. The numbers are simply added twice: once assuming Cin = 0 and once assuming Cin = 1. The conditionally produced results, Sum0, Sum1 and Carry0, Carry1, are selected by a multiplexer that uses the incoming carry signal Cin as its control. As in the Carry-Lookahead Adder, the input bits are divided into groups, which in this case are added "conditionally". In a Conditional-Sum Adder the hardware complexity grows rapidly starting from the least significant bit position, so a full-blown implementation is not found in practice. However, the idea of adding the most significant portion of the operands conditionally, and selecting the result once the carry-in is computed in the least significant portion, is attractive. Such a scheme, which is a subset of the Conditional-Sum Adder, is known as the "Carry-Select Adder". The Carry-Select Adder divides the words to be added into blocks and forms two sums for each block in parallel: one with a carry-in of ZERO and the other with a carry-in of ONE. This slide shows an example of a 16-bit carry-select adder. The carry-out from the least significant 4-bit block controls a multiplexer that selects the sum from the most significant portion. The carry-out is computed using the equation for the carry-out of the group, since the group propagate signal Pi is the carry-out of an adder with a carry input of ONE and the group generate signal Gi is the carry-out of an adder with a carry input of ZERO. This speeds up the computation of the carry signal needed for selection in the next block.
The upper 8 bits are computed conditionally using two Carry-Select Adders similar to the one used in the least significant 8-bit portion. The delay of this adder is determined by the speed of the least significant k-bit block (a 4-bit RCA in this example) and the delay of the multiplexers in the most significant path. In general, the delay of such an adder is proportional to the logarithm of the adder size.
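A minimal behavioral sketch of the carry-select idea follows. It is a simplification of the structure in the notes: every block, including the least significant one, is computed for both carry-in values, blocks are of equal size, and the block adder is modeled with integer arithmetic in place of a k-bit RCA. The names `block_add` and `carry_select_add` are illustrative.

```python
def block_add(x, y, k, c_in):
    """k-bit block adder (stands in for a k-bit RCA); returns (sum, carry_out)."""
    total = x + y + c_in
    return total & ((1 << k) - 1), total >> k

def carry_select_add(x, y, n=16, k=4, c_in=0):
    """n-bit carry-select addition with k-bit blocks."""
    mask = (1 << k) - 1
    result, carry = 0, c_in
    for base in range(0, n, k):
        xb, yb = (x >> base) & mask, (y >> base) & mask
        # Both conditional results are formed without waiting for the carry.
        s0, c0 = block_add(xb, yb, k, 0)   # c0 plays the role of group generate
        s1, c1 = block_add(xb, yb, k, 1)   # c1 plays the role of group propagate
        # Multiplexer: the incoming carry selects the precomputed sum.
        result |= (s1 if carry else s0) << base
        # Carry into the next block: G + P * Cin.
        carry = c0 | (c1 & carry)
    return result | (carry << n)
```

In hardware the two `block_add` calls run in parallel, so once the first carry is known the remaining delay is only the chain of select multiplexers.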

Carry Select Adder: combining two 32-b VBAs in select mode
Delay = DVBA32 + DMUX

Addition Under Non-equal Signal Arrival Profile Assumption
P. Stelling, V. G. Oklobdzija, "Design Strategies for Optimal Hybrid Final Adders in a Parallel Multiplier", special issue on VLSI Arithmetic, Journal of VLSI Signal Processing, Kluwer Academic Publishers, Vol.14, No.3, December 1996

Signal Arrival Profile from the Parallel Multiplier Partial-Product Reduction Tree
Oklobdzija, Villeger, IEEE Transactions on VLSI Systems, June, 1995

Performing Multiply-Add Operation in the Multiply Time
P. Stelling, V. G. Oklobdzija, "Achieving Multiply-Accumulate Operation in the Multiply Time", Thirteenth International Symposium on Computer Arithmetic, Pacific Grove, California, July 5 - 9, 1997.

Final Adder: Implementation

Recurrence Solver Based Adders
Kogge and Stone, IEEE Trans on Computers, August 1973
Bilgory and Gajski, 18th DAC, 1981
Brent and Kung, IEEE Trans on Computers, March 1982

- 1973, Kogge and Stone published a general recurrence scheme for parallel computation
- 1979, Brent and Kung published Tech. Report on regular layout for parallel adders
- 1980, Guibas and Vuillemin developed a layout scheme based on recurrence equation for addition
- 1980, Ladner and Fischer published “parallel prefix computation”, Journal of the ACM
- 1981, Bilgory and Gajski published a paper on recurrence structures for automatic cell generation

They are based on recurrence equation for P, G (what is new there since Weinberger?!): Or: and

The class of adders based on solving recurrence equations was first introduced by Bilgory and Gajski and by Brent and Kung, drawing on the previous work of Kogge and Stone. They realized that if Cin = 0 can be assumed, the carry-lookahead equations can be written in the simple form of a recurrence. They define the recurrence relation as an operator termed the "black" operator. By applying this recurrence equation, various adder topologies can be obtained with several desirable properties, such as:
- good layout
- fan-out that can be controlled and limited to no more than 2
- trade-offs between fan-in and fan-out tied into hierarchical layout topologies.
For those reasons "recurrence-solver" schemes are relatively popular and result in fast and efficient implementations. In essence, "recurrence solver" based adders are nothing but a variation of the many possible CLA topologies. An example of a "recurrence solver" adder is shown in this slide.
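The recurrence can be sketched with an associative operator on (g, p) pairs, written here in its commonly used form $(g, p) \bullet (g', p') = (g + p\,g',\; p\,p')$, which is an assumption since the slide's own equation is not preserved. The arrangement below doubles the span at each level in the style usually attributed to Kogge and Stone; it is one possible topology, not necessarily the one drawn on the slide, and the names `black` and `prefix_add` are illustrative.

```python
def black(left, right):
    """Combine two (g, p) pairs; `left` covers the more significant bits."""
    g1, p1 = left
    g2, p2 = right
    return (g1 | (p1 & g2), p1 & p2)

def prefix_add(x, y, n=16):
    """n-bit addition with Cin = 0 using a parallel-prefix carry network."""
    a = [(x >> i) & 1 for i in range(n)]
    b = [(y >> i) & 1 for i in range(n)]
    p = [ai ^ bi for ai, bi in zip(a, b)]
    g = [ai & bi for ai, bi in zip(a, b)]
    gp = list(zip(g, p))
    dist = 1
    while dist < n:
        # One level of the network: every position combines with the
        # position `dist` bits below it, doubling the span it covers.
        gp = [black(gp[i], gp[i - dist]) if i >= dist else gp[i]
              for i in range(n)]
        dist *= 2
    # gp[i][0] is now the carry out of bit i.
    carries = [0] + [gi for gi, _ in gp]
    total = 0
    for i in range(n):
        total |= (p[i] ^ carries[i]) << i
    return total | (carries[n] << n)
```

Each pass of the `while` loop corresponds to one row of operators, so a 16-bit adder needs four rows, and every operator output feeds at most two operators in the next row.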

Carry-Lookahead Adder (Weinberger and Smith)
Just to remind you! Please notice the similarity with Parallel-Prefix Adders!

Multiplexer Based Adder
Farooqui and Oklobdzija, 1999 Int’l Sym. on VLSI Technology, Taipei, Taiwan, June 8-10, 1999

Multiplexer Based Adder
- Based on the realization that a MUX circuit is faster than a logic gate due to its transmission gate implementation
- Based on Carry-Lookahead method (W-S), or recurrence solver.

Multiplexer Based Adder
A. A. Farooqui, V. G. Oklobdzija, F. Chechrazi, 1999 Int’l Sym. on VLSI Technology, Taipei, Taiwan, June 8-10, 1999.
- Results in a very fast structure
- 7 MUX delays for a 64-b adder
Delay using standard cell 0.25u, 2.5V, 25°C:

Adder Size (bits)    Delay (pS)
8                    625
16                   665
32                   710
64                   903

DEC "Alpha" 21064 Adder
Combination:
- 8-bit tapered pre-discharged Manchester Carry Chains, with Cin = 0 and Cin = 1
- 32-bit LSB Carry Lookahead Adder
- 32-bit MSB Conditional-Sum Adder
- Carry-Select on most significant 32-bits
- Latches in the middle: pipelined addition

DEC "Alpha" 21064 Adder
The block diagram of the 64-bit adder used in Digital's first 200MHz RISC microprocessor is shown in this slide. The adder combines four different techniques on its various levels.
First, the Manchester Carry Chain technique was used on the 8-bit level. The Manchester Carry Chain seems to be the most effective for short adders, especially when the word length is below 16 bits. The carry chain was further optimized by tapering down each chain stage in order to reduce the load caused by the remainder of the chain. The chain was pre-discharged at the beginning of the operation, and three signals were used: Propagate P, Generate G and Carry-Kill (or carry-assimilate) K. The local carry signals were amplified using ratioed inverters. Two Manchester Carry Chains were employed: one that assumes Cin = 0 and another that assumes Cin = 1.
Second, Carry-Lookahead Addition was used on the least significant 32 bits of the adder. The Carry-Lookahead section was implemented as a distributed differential circuit producing the carry signal that controls the most significant 32-bit portion of the adder.
Third, Conditional-Sum Addition was used for the most significant 32 bits of the adder. Six 8-bit select switches were used to implement conditional summation on the 8-bit level.
Finally, the Carry-Select method was used to produce the most significant 32 bits of the 64-bit word. The selection of the final result was done using nMOS byte carry-select switches.

DEC "Alpha" 21064 Adder: Results
- The first 200MHz processor
- Built using 0.75u technology
- V=3.3V, 30W
- Pipelined (two latches), allowing 5nS throughput and 10nS latency

Conclusion: VLSI Implementation of Addition
- Currently, implementation parameters are not reflected in the algorithms used for development
- Layout and wire delay effects are largely neglected, and this is becoming intolerable in the next generation of technology
- Transistor sizing has a large effect, which can outweigh the algorithm
- There is a great disconnect between algorithm and implementation
- New rules and measures of goodness are needed

Multiplication: Parallel Multiplier Implementation

Multiplication Algorithm
$$p^{(0)} = 0 \quad \text{initially}$$
$$p^{(j+1)} = \frac{1}{r}\left(p^{(j)} + r^{n} X y_j\right) \quad \text{for } j = 0, \ldots, n-1$$
$$p^{(n)} = XY \quad \text{after } n \text{ steps}$$
Traditionally, multiplication is performed in a variety of forms, in hardware and software, depending on the cost and transistor budget allocated for this operation. Today a full hardware implementation of multiplication is more likely, because of the growing demand for speed and the decreasing cost of hardware. We show a basic multiplication algorithm that operates on positive n-bit integers X and Y and produces the product P, which is 2n bits long:
$$P = XY = X \sum_{i=0}^{n-1} y_i r^i = \sum_{i=0}^{n-1} X y_i r^i$$
This expression indicates that multiplication is performed by summing n partial-product terms Pi. Each Pi is obtained by a simple arithmetic left shift of X by i positions and multiplication by the single digit yi. For the binary radix (r=2), yi can only be 0 or 1, and multiplication by the digit yi is a simple AND operation. The addition of the n terms can be performed at once, by passing the partial products through a network of adders, or sequentially, by using one adder n times. The algorithm for the multiplication of X and Y is the recurrence shown in the slide. It can be proved without difficulty that after n steps this recurrence yields the product p(n)=XY.
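The sequential form of the algorithm can be sketched for the binary radix (r = 2). The recurrence used below, p(0) = 0 and p(j+1) = (p(j) + 2^n X y_j) / 2, is stated here as an assumption about the slide's equation, chosen because it yields p(n) = XY after n steps as the notes require. The function name is illustrative.

```python
def shift_add_multiply(x, y, n):
    """Multiply two non-negative n-bit integers by n add-and-shift steps."""
    p = 0                                  # p(0) = 0
    for j in range(n):
        y_j = (y >> j) & 1                 # current multiplier bit
        partial = (x << n) if y_j else 0   # 2^n * X * y_j (an AND for r = 2)
        # The sum is always even for j < n, so the right shift is exact.
        p = (p + partial) >> 1             # p(j+1) = (p(j) + 2^n X y_j) / 2
    return p                               # p(n) = X * Y
```

For instance, with n = 4 the call `shift_add_multiply(13, 11, 4)` performs four add-and-shift steps, one per multiplier bit.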

Parallel Multipliers
An alternative approach to sequential multiplication involves simultaneous generation of all bit products and their summation with an array of full adders. This approach uses an n by n array of AND gates to form the bit products and an array of n x n adders (and half adders) to sum the bit products in a carry-save fashion. Finally, a 2n-bit Carry-Propagate Adder is used in the final step to finish the summation and produce the result. Wallace introduced a way of summing the partial product bits in parallel using a tree of Carry Save Adders, which became generally known as the "Wallace Tree". A suggestion for improved efficiency of addition of the partial products was published by Dadda in 1965. Dadda introduced the notion of a counter, which takes a number of bits p in the same bit position and outputs a number q which represents the count of ones at the input.
This process is shown in the slide illustrating an 8 by 8 multiplication. The input, an 8 by 8 matrix of dots in which each dot represents a bit product, is shown as Matrix 0. Columns having more than six dots are reduced by the use of half adders and full adders. Each half adder takes in two dots and outputs one in the same column and one in the next more significant column. Each full adder takes in three dots and outputs one in the same column and one in the next more significant column. As a result, no column in Matrix 1 has more than six dots. Half adders are shown by a "crossed" line in the succeeding matrix and full adders by a plain line. In each case the rightmost dot of the pair connected by a line is in the column from which the adder inputs were taken. In the succeeding steps, reduction is performed to Matrix 2 with no more than four dots per column, Matrix 3 with no more than three dots per column, and finally Matrix 4 with no more than two dots per column.
The height of the matrices is determined by working back from the final matrix and limiting the height of each matrix to the largest integer that is no more than 1.5 times the height of its successor. Each matrix is produced from its predecessor in one adder delay. Since the number of matrices is logarithmically related to the number of bits in the words to be multiplied, the delay of the matrix reduction process is proportional to log(n). The adder that reduces the final two-row matrix to the final product can be implemented as a fast adder, which also has logarithmic delay. The total delay of this multiplier is therefore proportional to the logarithm of the size of its operands. The effort to improve the speed of the parallel multiplier continued for almost 30 years.
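The height rule described above can be checked with a few lines of code. This is a minimal sketch; the function name `dadda_heights` is illustrative and does not come from the slides.

```python
def dadda_heights(n):
    """Return the column-height limits of Matrix 1, Matrix 2, ... for an
    n-by-n multiplier, ending with the final two-row matrix.

    Heights are built backwards from 2: each one is the largest integer
    that is no more than 1.5 times the height of its successor.
    """
    heights = [2]
    while True:
        nxt = (3 * heights[-1]) // 2      # floor(1.5 * successor height)
        if nxt >= n:                      # Matrix 0 already has height n
            break
        heights.append(nxt)
    # Keep only limits below the starting height, listed in reduction order.
    return [h for h in reversed(heights) if h < n]


if __name__ == "__main__":
    print(dadda_heights(8))
```

For n = 8 this yields the limits 6, 4, 3 and 2 quoted for Matrix 1 through Matrix 4, so four adder delays are needed before the final carry-propagate adder.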

4:2 Compressor
In 1981 Weinberger disclosed a structure which he called a "4-2 carry-save module". This structure contains a combination of Full Adder cells in an intricate interconnection structure which yields faster partial product compression than the use of 3:2 counters. The structure actually compresses five partial product bits into three; however, it is connected in such a way that four of the inputs come from the same bit position of weight j, while one bit, known as the carry-in, is fed from the neighboring position j-1. The output of such a 4:2 module consists of one bit in position j and two bits in position j+1. This structure is not a counter, though it became erroneously known as a "4:2 counter", but a "compressor" that compresses four partial product bits into two. The structure of the 4:2 compressor is shown in this slide. Such a structure reduces partial product bits more efficiently: it reduces the number of partial product bits by one half at each stage. In the redesigned version of the 4:2 compressor, the speed is determined by 3 XOR gates in series, making such a scheme more efficient than one using 3:2 counters in a regular "Wallace Tree". The other, equally important feature of the 4:2 compressor is that the interconnections between 4:2 cells follow a more regular pattern than is the case for the "Wallace Tree".
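The arithmetic behaviour described above can be verified exhaustively. The sketch below assumes the usual construction from two cascaded full adders. The function names are illustrative, and the gate-level structure on the slide may differ.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) of three input bits."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def compressor_4_2(i1, i2, i3, i4, cin):
    """4:2 compressor assumed to be built from two full adders.

    Returns (s, carry, cout): s has weight j, carry and cout have weight j+1.
    cout depends only on i1, i2, i3, so no carry ripples along the row.
    """
    s1, cout = full_adder(i1, i2, i3)
    s, carry = full_adder(s1, i4, cin)
    return s, carry, cout


def verify_compressor():
    """Check that five input bits map to one bit at j and two bits at j+1."""
    for bits in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2(*bits)
        if sum(bits) != s + 2 * (carry + cout):
            return False
    return True


if __name__ == "__main__":
    print(verify_compressor())
```

Because cout is produced by the first full adder alone, the carry-in of a cell never influences its own cout, which is what keeps the horizontal carry path one cell long.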

Re-designed 4:2 Compressor with 3 XOR Delay
Figure: re-designed 4:2 compressor (inputs I1 to I4 and carry-in; outputs S and carry-out).
This slide shows a re-designed 4:2 compressor as introduced by Toshiba. The advantage of this compressor is a maximal delay of 3 XOR gates, as opposed to the 4 XOR delays of a regular implementation using Full-Adder cells. The next slide shows that this re-design was not necessary: what was missing was an understanding of how to balance the delays of the individual full adders.

Three-Dimensional optimization Method: TDM (Oklobdzija, Villeger, Liu, 1996)
A further improvement in the speed of the parallel multiplier was achieved by the introduction of the TDM method in 1996. TDM optimizes the entire Partial Product Reduction Tree in one pass, thus the name Three-Dimensional optimization Method. The important aspect of this method is the sorting of fast inputs and fast outputs. It was realized that the most important step in achieving fast partial product reduction is to interconnect the elements properly. Thus the counters, 3:2 adders in this particular case, were characterized in a way which identifies the delay from each input to each output. The Partial Product Reduction Tree was then interconnected so that signals with large delays are connected to "fast inputs" and signals with small delays to "slow inputs". This slide illustrates how a 4:2 compressor with 3 XOR delay can be obtained by a simple application of the TDM method, without the need for redesign.
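The effect of wiring late signals to fast inputs can be illustrated with a small timing model. The pin-to-output delays below are assumptions chosen for illustration (in equivalent XOR delays); they are not taken from the slides, and the names are hypothetical.

```python
# Assumed pin-to-output delays of one full adder, in equivalent XOR delays.
# Inputs a and b are the "slow" inputs, input c is the "fast" input.
SUM_DELAY = {"a": 2, "b": 2, "c": 1}
CARRY_DELAY = {"a": 1, "b": 1, "c": 1}


def full_adder_arrival(a, b, c):
    """Return (sum arrival, carry arrival) for the given input arrival times."""
    arrivals = {"a": a, "b": b, "c": c}
    s = max(arrivals[p] + SUM_DELAY[p] for p in arrivals)
    carry = max(arrivals[p] + CARRY_DELAY[p] for p in arrivals)
    return s, carry


def compressor_sum_delay(balanced):
    """Delay of the sum output of a 4:2 compressor made of two full adders."""
    # First full adder: three partial product bits, all available at t = 0.
    s1, cout = full_adder_arrival(0, 0, 0)
    cin = cout      # carry-in comes from an identical neighbour cell at j-1
    i4 = 0          # fourth partial product bit, available at t = 0
    if balanced:
        # Latest signal (s1) goes to the fast input c.
        s, _ = full_adder_arrival(a=i4, b=cin, c=s1)
    else:
        # Latest signal (s1) goes to a slow input.
        s, _ = full_adder_arrival(a=s1, b=i4, c=cin)
    return s


if __name__ == "__main__":
    print(compressor_sum_delay(balanced=False), compressor_sum_delay(balanced=True))
```

Under these assumed delays the unbalanced wiring gives 4 XOR delays and the balanced wiring gives 3, using the same two full adders.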

Generation of the Partial Product Reduction Tree in TDM multiplier
In order to optimize the speed of the Partial Product Reduction Tree, the sorting starts from the least significant bit cross-section and continues toward the most significant part. The algorithm produces 2N-1 lists, each consisting of the signal names and the delay associated with each signal. The lists are sorted so that the nodes with the longest delay are on top. The partial product reduction process starts by taking the top k nodes and connecting them to the k-bit counter in such a way that the "fastest" input is connected to the signal with the longest delay. The resulting signals are entered into the current and the next list in order of significance. The lists are sorted again and the process continues until there are only two nodes left in the list. The process progresses from the least significant bit position toward the most significant. When the end is reached we are left with 2N-1 lists containing only two signals each. These represent the input to a 2N-1 bit carry-propagate adder. This adder is further optimized to adapt to the uneven signal arrival profile originating from the Partial Product Reduction Tree. It is also possible to feed an additional operand into the Partial Product Reduction Tree without incurring additional delay. This results in a Multiply-Add operation. Thus TDM makes possible Multiply-Accumulate, also known as the MAC operation, in multiply time.
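The column-by-column procedure can be sketched as a small timing simulation. This is a simplified illustration, not the published TDM algorithm: it uses only full adders, assumes the pin delays listed in the code, and assumes that each full adder consumes the three earliest-arriving signals of a column, with the latest of those three wired to the fast input. All names are hypothetical.

```python
import heapq

# Assumed full adder delays in equivalent XOR delays (illustrative only).
SUM_SLOW, SUM_FAST, CARRY = 2, 1, 1


def reduce_columns(n):
    """Reduce the bit-product matrix of an n-by-n multiplier to two rows.

    Returns a list of 2n columns (least significant first); each column is
    a sorted list of the arrival times of the signals left in it.
    """
    width = 2 * n
    # Column i starts with min(i, 2n-2-i) + 1 bit products, all at t = 0.
    cols = [[0] * max(0, min(i, 2 * n - 2 - i) + 1) for i in range(width)]
    for i in range(width):                 # least significant column first
        heap = cols[i]
        heapq.heapify(heap)
        while len(heap) > 2:
            t0 = heapq.heappop(heap)       # earliest signal -> slow input
            t1 = heapq.heappop(heap)       # next signal     -> slow input
            t2 = heapq.heappop(heap)       # latest of three -> fast input
            s = max(t0 + SUM_SLOW, t1 + SUM_SLOW, t2 + SUM_FAST)
            c = max(t0, t1, t2) + CARRY
            heapq.heappush(heap, s)        # sum stays in the current list
            cols[i + 1].append(c)          # carry enters the next list
        cols[i] = sorted(heap)
    return cols


def worst_arrival(n):
    """Latest arrival time among the signals handed to the final adder."""
    return max(t for col in reduce_columns(n) for t in col)


if __name__ == "__main__":
    print([len(col) for col in reduce_columns(8)], worst_arrival(8))
```

The per-column arrival times returned by `reduce_columns` correspond to the uneven arrival profile that the final carry-propagate adder has to accommodate.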

Speed of Partial Product Reduction for Various Schemes
This slide shows the speed of Partial Product Reduction for various known schemes. The delay, in equivalent XOR delays, is shown as a function of the multiplier size N in input bits. The TDM method produces the fastest Partial Product Reduction.

Booth Recoding Algorithm
xi+2 xi+1 xi    Add to partial product
000             +0Y
001             +1Y
010             +1Y
011             +2Y
100             -2Y
101             -1Y
110             -1Y
111             -0Y
One of the best known variations of the multiplication algorithm is the "Booth Recoding Algorithm" described by Booth. A modification of the Booth algorithm was proposed by MacSorley, in which a triplet of bits is scanned instead of two bits. This technique has the advantage of reducing the number of partial products by roughly one half at the start. The further reduction of the partial product bits continues in the way described in the previous slides. The Booth method is actually an application of a signed-digit representation in radix 4. The Booth-MacSorley Algorithm, usually called the Modified Booth Algorithm or simply the Booth Algorithm, can be generalized to any radix. However, a 3-bit recoding (the case of radix 8) would require the following set of digits to be multiplied by the multiplicand: 0, ±1, ±2, ±3, ±4. The difficulty lies in the fact that ±3Y is computed by adding Y to (or subtracting it from) ±2Y, which means that a carry propagation occurs. The delay caused by this carry propagation makes the scheme slower than a conventional one. Consequently, only the 2-bit (radix 4) Booth recoding is used. Booth recoding is performed in two steps: encoding and selection. The purpose of the encoding is to scan the triplet of bits of the multiplier and define the operation to be performed on the multiplicand, as shown in this slide.
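The recoding table can be exercised with a short sketch. It assumes an n-bit two's-complement multiplier with n even and an implicit bit of 0 below the least significant bit, so that each triplet maps to the digit -2·x(i+2) + x(i+1) + x(i). The function names are illustrative.

```python
def booth_radix4_digits(x, n):
    """Recode an n-bit two's-complement multiplier x (n even) into n/2
    signed radix-4 digits in {-2, -1, 0, +1, +2}, least significant first.
    """
    if n % 2 != 0:
        raise ValueError("n must be even")
    if not -(1 << (n - 1)) <= x < (1 << (n - 1)):
        raise ValueError("x does not fit in n bits")
    bits = x & ((1 << n) - 1)          # two's-complement bit pattern
    digits = []
    low = 0                            # assumed bit below the LSB
    for i in range(0, n, 2):
        mid = (bits >> i) & 1
        high = (bits >> (i + 1)) & 1
        digits.append(-2 * high + mid + low)   # table entry for the triplet
        low = high                     # triplets overlap by one bit
    return digits


def booth_multiply(x, y, n):
    """Sum the n/2 selected multiples of y, each shifted by two bits per digit."""
    return sum(d * y * (4 ** k) for k, d in enumerate(booth_radix4_digits(x, n)))


if __name__ == "__main__":
    print(booth_radix4_digits(127, 8), booth_multiply(127, 5, 8))
```

An 8-bit multiplier therefore produces four partial products instead of eight, each of them 0, ±Y or ±2Y, all of which are obtained by shifting and complementing without carry propagation.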

Organization of Hitachi's DPL multiplier
An example of a fast multiplier is Hitachi's DPL multiplier, which was the first to achieve a speed under 5 ns for a 54-bit floating-point mantissa. This multiplier has a regular structure that includes (a) a Booth Recoder, (b) a Partial Product Reduction Tree, and (c) a final Carry Propagate Adder (CPA), as shown in this slide.

Hitachi's 4:2 compressor structure
The key to the performance of Hitachi's multiplier lies in the use of DPL circuits and the efficiency with which DPL can realize the 4:2 compressor. The structure of Hitachi's 4:2 compressor is shown in this slide. The realization of the 4:2 function consists entirely of DPL multiplexers, which introduce only one pass-transistor delay in the critical path, as shown in the next slide. Indeed, this structure is one of the fastest transistor realizations of the Partial Product Reduction Tree. The speed of this multiplier can be further optimized by applying the TDM algorithm to realize optimal interconnections. Such a structure yields a 4.1 ns delay in 0.25 µm technology. For larger multipliers this structure may start showing degraded performance because of the long pass-transistor chain in the 4:2 compressors used. However, this is of little concern, since 54 bits corresponds to the double precision floating-point format and larger multiplier sizes are rarely used in practice.

DPL multiplexer circuit
An efficient pass-transistor realization of multiplexers using Hitachi's DPL logic is shown in this slide. This multiplexer introduces only one pass-transistor delay in the critical path, thus allowing three pass-transistor delays for one 4:2 compressor. The entire partial product reduction tree results in 12 pass-transistor delays in the critical path. The use of TDM reduces this number to 10.
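The figure of 12 pass-transistor delays is consistent with a simple count. The sketch below rests on assumptions not stated in the slides: that radix-4 Booth recoding of a 54-bit multiplier leaves 27 partial products, and that each level of 4:2 compressors halves the number of rows, rounding up. The names are hypothetical.

```python
PASS_DELAYS_PER_COMPRESSOR = 3    # from the text: three per 4:2 compressor


def compressor_levels(rows):
    """Number of 4:2 compressor levels needed to reduce `rows` rows to two."""
    levels = 0
    while rows > 2:
        rows = (rows + 1) // 2    # assumed: each level halves the rows, rounded up
        levels += 1
    return levels


def tree_pass_delays(multiplier_bits):
    """Pass-transistor delays through the reduction tree after Booth recoding."""
    rows = (multiplier_bits + 1) // 2     # assumed: one row per radix-4 digit
    return compressor_levels(rows) * PASS_DELAYS_PER_COMPRESSOR


if __name__ == "__main__":
    print(compressor_levels(27), tree_pass_delays(54))
```

Under these assumptions the rows shrink as 27, 14, 7, 4, 2, giving four levels of three pass-transistor delays each.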

Conclusion: References
Given the time limitation of this presentation, we can only briefly touch on some of the issues that are the subject of ongoing research in computer arithmetic. To go into more depth, we strongly encourage you to consult the many published papers. You can start with the following references and go further in depth from there.
E. Swartzlander, "Computer Arithmetic", Vol. 1 & 2, IEEE Computer Society Press, 1990.
K. Hwang, "Computer Arithmetic: Principles, Architecture and Design", John Wiley & Sons, 1979.
M. Ercegovac, "Digital Systems and Hardware/Firmware Algorithms", Chapter 12: Arithmetic Algorithms and Processors, John Wiley & Sons, 1985.
A. Chandrakasan, W. Bowhill, F. Fox, Editors, "Design of High-Performance Microprocessor Circuits", IEEE Press, July 2000.
V. G. Oklobdzija, "High-Performance System Design: Circuits and Logic", IEEE Press, July 1999.
