VLSI Arithmetic Adders & Multipliers Prof. Vojin G. Oklobdzija University of California
Prof. V.G. Oklobdzija VLSI Arithmetic
Introduction
Digital computer arithmetic belongs to computer architecture; however, it is also an aspect of logic design. The objective of computer arithmetic is to develop appropriate algorithms that use the available hardware in the most efficient way. Ultimately, speed, power, and chip area are the most often used measures, which makes a strong link between the algorithms and the technology of implementation.
Basic Operations
– Addition
– Multiplication
– Multiply-Add
– Division
– Evaluation of Functions
– Multi-Media
Addition of Binary Numbers
Addition of Binary Numbers: Full Adder
The full adder is the fundamental building block of most arithmetic circuits. The sum and carry outputs are described as: $s_i = a_i \oplus b_i \oplus c_i$ and $c_{i+1} = a_i b_i + c_i (a_i + b_i)$. [Figure: full adder with inputs $a_i$, $b_i$, $C_{in}$ and outputs $s_i$, $C_{out}$]
Addition of Binary Numbers [Table: full-adder truth table with inputs $c_i$, $a_i$, $b_i$ and outputs $s_i$, $c_{i+1}$; Propagate and Generate rows marked]
Full-Adder Implementation
The full-adder operation is defined by the equations $s_i = p_i \oplus c_i$ and $c_{i+1} = g_i + p_i c_i$, with carry-propagate $p_i = a_i \oplus b_i$ and carry-generate $g_i = a_i b_i$. A one-bit adder can be implemented as shown.
High-Speed Addition
A one-bit adder can be implemented more efficiently with a multiplexer, because a MUX is faster.
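The multiplexer form of the one-bit adder can be checked against the gate form with a small behavioural model. This is a minimal sketch, not the circuit from the slide: the function names are hypothetical, and it assumes the propagate signal is the XOR of the two operand bits, so that when propagate is 0 both operands are equal and either one can stand in for the generated carry.

```python
def full_adder_gates(a: int, b: int, c_in: int) -> tuple[int, int]:
    """Reference full adder written directly from the sum/carry equations."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a | b))
    return s, c_out


def full_adder_mux(a: int, b: int, c_in: int) -> tuple[int, int]:
    """Same function with the carry produced by a 2:1 multiplexer.

    Assumption: propagate p = a XOR b selects the multiplexer input.
    """
    p = a ^ b
    s = p ^ c_in
    # p == 1: the incoming carry passes through; p == 0: a == b, so the
    # carry out equals a (1 for generate, 0 for kill).
    c_out = c_in if p else a
    return s, c_out
```

Both versions return the pair (sum, carry out); the multiplexer version replaces the AND-OR carry logic by a single select.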
The Ripple-Carry Adder
The Ripple-Carry Adder (from Rabaey)
Inversion Property (from Rabaey)
Minimize Critical Path by Reducing Inverting Stages (from Rabaey)
Ripple-Carry Adder: carry chain of an RCA implemented using a multiplexer from the standard cell library, with the critical path marked (Oklobdzija, ISCAS’88)
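A behavioural sketch of a ripple-carry adder whose carry chain is a row of 2:1 multiplexers, one per bit. The function name and the bit-serial loop are illustrative assumptions; the model captures only the logic, not the timing of the standard-cell implementation.

```python
def ripple_carry_add(a: int, b: int, n: int, c_in: int = 0) -> tuple[int, int]:
    """Add two n-bit integers bit by bit; returns (n-bit sum, carry out)."""
    carry = c_in
    total = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        p = a_i ^ b_i                    # propagate selects the multiplexer
        total |= (p ^ carry) << i        # sum bit
        carry = carry if p else a_i      # multiplexer stage of the carry chain
    return total, carry
```

The carry passes through one multiplexer per bit, so in the worst case (all bits propagating) the critical path is n multiplexer stages long.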
Manchester Carry-Chain Realization of the Carry Path
A simple and very popular scheme for implementing the carry signal path.
Original Design: T. Kilburn, D. B. G. Edwards, D. Aspinall, "Parallel Addition in Digital Computers: A New Fast "Carry" Circuit", Proceedings of IEE, Vol. 106, pt. B, p. 464, September 1959.
Manchester Carry Chain (CMOS), Kilburn, et al, IEE Proc
– Implement P with pass-transistors
– Implement G with pull-up, kill (delete) with pull-down
– Use dynamic logic to reduce the complexity and speed up the carry path
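The three actions on each carry node (generate, propagate, kill) can be modelled at the logic level as follows. This is only a sketch under stated assumptions: it ignores the electrical side (precharge, pass-transistor delay, signal polarity), and the function names are hypothetical.

```python
def manchester_carries(a: int, b: int, n: int, c_in: int = 0) -> list[int]:
    """Return the carry at every node of an n-bit chain (index 0 = carry in)."""
    carries = [c_in]
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        if a_i & b_i:            # generate: node forced to 1
            carry = 1
        elif not (a_i | b_i):    # kill: node forced to 0
            carry = 0
        else:                    # propagate: pass device connects the nodes
            carry = carries[-1]
        carries.append(carry)
    return carries


def manchester_add(a: int, b: int, n: int) -> int:
    """Full (n+1)-bit sum using the carries from the chain model."""
    carries = manchester_carries(a, b, n)
    total = carries[n] << n
    for i in range(n):
        total |= ((((a ^ b) >> i) & 1) ^ carries[i]) << i
    return total
```

Exactly one of the three conditions holds for each bit position, which is why a single node per bit with a pull-up, a pull-down, and a pass device is sufficient.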
Pass-Transistor Realization in DPL
Carry-Skip Adder: MacSorley, Proc IRE 1/61; Lehman, Burla, IRE Trans on Comp, 12/61
Carry-Skip Adder: Bypass (from Rabaey)
Carry-Skip Adder: N bits, k bits/group, r = N/k groups
Carry-Skip Adder
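A behavioural sketch of an N-bit carry-skip adder with k bits per group. The function name and parameters are illustrative assumptions. The skip multiplexer does not change the result; it only gives the carry a short path around a group whose bits all propagate.

```python
def carry_skip_add(a: int, b: int, n: int = 16, k: int = 4) -> tuple[int, int]:
    """Add two n-bit integers using r = n // k groups of k bits each."""
    assert n % k == 0, "this sketch assumes equal-sized groups"
    carry = 0
    total = 0
    for low in range(0, n, k):
        group_in = carry
        group_propagate = 1
        c = group_in
        for i in range(low, low + k):
            a_i = (a >> i) & 1
            b_i = (b >> i) & 1
            p, g = a_i ^ b_i, a_i & b_i
            total |= (p ^ c) << i
            c = g | (p & c)          # ripple inside the group
            group_propagate &= p
        # Skip multiplexer: bypass the group when every bit propagates.
        carry = group_in if group_propagate else c
    return total, carry
```

When the group propagate is 1 the rippled carry equals the group carry-in anyway, so both multiplexer inputs agree logically and only the delay differs.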
Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Carry chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Carry chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985): any-point-to-any-point delay = 9, as compared to 12 for the CSKA
Carry-chain block size determination for a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Delay Calculation for Variable Block Adder (Oklobdzija, Barnes: IBM 1985). Delay model:
Variable Block Adder (Oklobdzija, Barnes: IBM 1985): Variable Group Length (Oklobdzija, Barnes, Arith’85)
Carry chain of a 32-bit Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
– Variable block lengths
– No closed-form solution for delay
– It is a dynamic programming problem
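The effect of variable block lengths can be explored with a simple delay evaluator. This is a sketch under loudly stated assumptions, not the delay model of the original paper: a carry costs one unit per bit it ripples through and one unit per block it skips, and the worst case is taken over every pair of blocks in which a carry can start and end. The tapered block sizes below are hypothetical and are not taken from the slides.

```python
def worst_case_carry_delay(sizes: list[int], t_ripple: int = 1, t_skip: int = 1) -> int:
    """Longest carry path for a carry-skip chain with the given block sizes.

    Assumed model: a carry generated at the first bit of block i ripples
    through the rest of that block, skips the blocks in between, and ripples
    to the last bit of block j.
    """
    worst = max((s - 1) * t_ripple for s in sizes)   # path inside one block
    for i in range(len(sizes)):
        for j in range(i + 1, len(sizes)):
            ripple = (sizes[i] - 1 + sizes[j] - 1) * t_ripple
            skip = (j - i - 1) * t_skip
            worst = max(worst, ripple + skip)
    return worst


FIXED_BLOCKS = [4] * 8                                 # 32 bits, k = 4
TAPERED_BLOCKS = [1, 2, 3, 4, 5, 6, 5, 3, 2, 1]        # 32 bits, hypothetical
```

Under this assumed unit-delay model the fixed grouping evaluates to 12 and the tapered grouping to 9. Finding the best sizes for a given technology means searching over block-size sequences with the real ripple and skip delays, which is where the dynamic programming formulation comes in.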
Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Delay Comparison: Variable Block Adder [plot legend: VBA-Multi-Level, CLA, VBA]
Fan-Out Dependency
Fan-In Dependency
Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)
Carry-Lookahead Adder (Weinberger and Smith): A. Weinberger and J. L. Smith, “A Logic for High-Speed Addition”, National Bureau of Standards, Circ. 591, p. 3-12, 1958.
Carry-Lookahead Adder (Weinberger and Smith)
Carry-Lookahead Adder
– One gate delay to calculate p, g
– One to calculate P and two for G
– Three gate delays to calculate $C_{4(j+1)}$
– Compare that to 8 in an RCA!
Carry-Lookahead Adder (Weinberger and Smith)
– Additional two gate delays
– $C_{16}$ will take a total of 5 vs. 32 for an RCA!
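The two-level lookahead structure can be sketched behaviourally for 16 bits: four 4-bit groups produce group P and G, and a second lookahead block of the same form produces the carries into each group. The function names are hypothetical, propagate is assumed to be the XOR of the operand bits, and the model says nothing about gate counts or delays.

```python
def lookahead4(p: list[int], g: list[int], c_in: int) -> tuple[list[int], int, int]:
    """One 4-bit lookahead block (index 0 = LSB).

    Returns the carries into each of the four positions, plus group P and G.
    """
    c1 = g[0] | (p[0] & c_in)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c_in)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c_in)
    group_p = p[0] & p[1] & p[2] & p[3]
    group_g = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return [c_in, c1, c2, c3], group_p, group_g


def cla16(a: int, b: int, c0: int = 0) -> tuple[int, int]:
    """16-bit two-level carry-lookahead addition; returns (sum, C16)."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(16)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(16)]
    # First level: group P and G do not depend on the carry-in.
    groups = [lookahead4(p[4 * j:4 * j + 4], g[4 * j:4 * j + 4], 0) for j in range(4)]
    big_p = [grp[1] for grp in groups]
    big_g = [grp[2] for grp in groups]
    # Second level: the same block applied to group signals gives C0, C4, C8, C12.
    group_carry, _, _ = lookahead4(big_p, big_g, c0)
    c16 = big_g[3] | (big_p[3] & group_carry[3])
    total = 0
    for j in range(4):
        carries, _, _ = lookahead4(p[4 * j:4 * j + 4], g[4 * j:4 * j + 4], group_carry[j])
        for k in range(4):
            total |= (p[4 * j + k] ^ carries[k]) << (4 * j + k)
    return total, c16
```

Reusing the same block at both levels mirrors the hierarchical organisation: bit-level p, g feed group-level P, G, which in turn feed the carries back down.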
32-bit Carry-Lookahead Adder
Carry-Lookahead Adder (Weinberger and Smith: original derivation)
Carry-Lookahead Adder (Weinberger and Smith: original derivation)
Carry-Lookahead Adder (Weinberger and Smith): please notice the similarity with Parallel-Prefix Adders!
Carry-Lookahead Adder (Weinberger and Smith): please notice the similarity with Parallel-Prefix Adders!
Delay Optimized CLA: B. Lee, V. G. Oklobdzija, Journal of VLSI Signal Processing, Vol. 3, No. 4, October 1991
Delay Optimized CLA: Lee-Oklobdzija ’91: (a) fixed groups and levels; (b) variable-sized groups, fixed levels; (c) variable-sized groups and fixed levels; (d) variable-sized groups and levels
Two-Levels of Logic Implementation of the Carry Block
Two-Levels of Logic Implementation of the Carry-Lookahead Block
Three-Levels of Logic Implementation of the Carry Block (restricted fan-in)
Three-Levels of Logic Implementation of the Carry Lookahead (restricted fan-in)
Delay Optimized CLA: Lee-Oklobdzija ’91. Delay: two-level BCLA; delay: three-level BCLA
Delay Optimized CLA: Lee-Oklobdzija ’91: (a) 2-level BCLA, delay = 8.5 ns; (b) 3-level BCLA, delay = 8.9 ns
Motorola: CLA Implementation Example A. Naini, D. Bearden and W. Anderson, “A 4.5nS 96b CMOS Adder Design”, Proceedings of the IEEE Custom Integrated Circuits Conference, May 3-6, 1992.
Critical path in Motorola's 64-bit CLA
Motorola's 64-bit CLA: conventional PG block
Motorola's 64-bit CLA: modified PG block. Intermediate propagate signals $P_{i:0}$ are generated to speed up $C_3$.
Ling’s Adder: Huey Ling, “High-Speed Binary Adder”, IBM Journal of Research and Development, Vol. 25, No. 3, 1981.
Ling Adder: a variation of the CLA (Ling, IBM J. Res. Dev, 5/81). Ling’s equations:
Ling Adder: Ling’s equation (Doran, Trans on Comp 9/88) propagates information on two bits
Ling Adder: Conventional vs. Ling
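The relation between Ling's pseudo-carry and the conventional carry can be checked exhaustively for short words. The slide's own equations are not preserved here, so the notation below is an assumption based on the usual textbook form: $g_i = a_i b_i$, $t_i = a_i + b_i$, conventional carry $c_{i+1} = g_i + t_i c_i$, and pseudo-carry $H_{i+1} = g_i + t_{i-1} H_i$, which carries information about two adjacent bit positions. All function names are hypothetical.

```python
from itertools import product


def conventional_carries(a_bits, b_bits, c0=0):
    """c[i+1] = g_i | (t_i & c[i]) with g = AND and t = OR of the operand bits."""
    c = [c0]
    for a, b in zip(a_bits, b_bits):
        c.append((a & b) | ((a | b) & c[-1]))
    return c


def ling_pseudo_carries(a_bits, b_bits, c0=0):
    """H[i+1] = g_i | (t_{i-1} & H[i]), taking t_{-1} = 1 and H[0] = c0."""
    h = [c0]
    t_prev = 1
    for a, b in zip(a_bits, b_bits):
        h.append((a & b) | (t_prev & h[-1]))
        t_prev = a | b
    return h


def ling_matches_conventional(n=4):
    """Check c[i+1] == t_i & H[i+1] and H[i+1] == c[i+1] | c[i] for all inputs."""
    for bits in product((0, 1), repeat=2 * n + 1):
        a_bits, b_bits, c0 = bits[:n], bits[n:2 * n], bits[2 * n]
        c = conventional_carries(a_bits, b_bits, c0)
        h = ling_pseudo_carries(a_bits, b_bits, c0)
        for i in range(n):
            t_i = a_bits[i] | b_bits[i]
            if c[i + 1] != (t_i & h[i + 1]) or h[i + 1] != (c[i + 1] | c[i]):
                return False
    return True
```

The pseudo-carry drops one term from each lookahead expression, and the real carry is recovered with a single AND with $t_i$ where it is needed.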
S. Naffziger, ISSCC’96
Results: S. Naffziger, “A Subnanosecond 64-b Adder”, ISSCC ’96. Technology (µ) and speed (ns) reported for nominal process, 80 °C, V = 3.3 V
Conditional-Sum Adder: J. Sklansky, “Conditional-Sum Addition Logic”, IRE Transactions on Electronic Computers, EC-9, 1960.
Conditional-Sum Adder
Conditional-Sum Adder
Carry-Select Adder: O. J. Bedrij, “Carry-Select Adder”, IRE Transactions on Electronic Computers, June 1962.
Carry-Select Adder (O. J. Bedrij, IBM Poughkeepsie, 1962)
Carry-Select Adder: addition under the assumptions $C_{in} = 0$ and $C_{in} = 1$.
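A behavioural sketch of carry-select addition: every block is added twice, once for each assumed carry-in, and the true carry-in picks one of the two precomputed results. The block size, word width, and function names are illustrative assumptions.

```python
def ripple_block(a: int, b: int, c_in: int, width: int) -> tuple[int, int]:
    """Ripple-carry addition of one block; returns (sum, carry out)."""
    carry = c_in
    total = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        total |= (a_i ^ b_i ^ carry) << i
        carry = (a_i & b_i) | (carry & (a_i | b_i))
    return total, carry


def carry_select_add(a: int, b: int, width: int = 16, block: int = 4) -> tuple[int, int]:
    """Add two integers of the given width using carry-select blocks."""
    mask = (1 << block) - 1
    carry = 0
    total = 0
    for low in range(0, width, block):
        a_blk = (a >> low) & mask
        b_blk = (b >> low) & mask
        # In hardware both results are computed in parallel, before the
        # carry-in of the block is known.
        sum0, carry0 = ripple_block(a_blk, b_blk, 0, block)
        sum1, carry1 = ripple_block(a_blk, b_blk, 1, block)
        # The multiplexer: the real carry-in only selects a result.
        blk_sum, carry = (sum1, carry1) if carry else (sum0, carry0)
        total |= blk_sum << low
    return total, carry
```

Once the block results are ready, the carry only has to pass through one multiplexer per block, at the cost of duplicating the block adders.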
Carry-Select Adder: combining two 32-b VBAs in select mode. Delay = VBA32 + MUX
Addition Under Non-equal Signal Arrival Profile Assumption P. Stelling, V. G. Oklobdzija, "Design Strategies for Optimal Hybrid Final Adders in a Parallel Multiplier", special issue on VLSI Arithmetic, Journal of VLSI Signal Processing, Kluwer Academic Publishers, Vol.14, No.3, December 1996
Signal Arrival Profile from the Parallel Multiplier Partial-Product Reduction Tree
Oklobdzija, Villeger, IEEE Transactions on VLSI Systems, June 1995
Oklobdzija and Villeger, IEEE Transactions on VLSI Systems, June 1995
Performing Multiply-Add Operation in the Multiply Time P. Stelling, V. G. Oklobdzija, " Achieving Multiply-Accumulate Operation in the Multiply Time", Thirteenth International Symposium on Computer Arithmetic, Pacific Grove, California, July 5 - 9, 1997.
Final Adder: Implementation
Recurrence Solver Based Adders: Kogge and Stone, IEEE Trans on Computers, August 1973; Bilgory and Gajski, 18th DAC, 1981; Brent and Kung, IEEE Trans on Computers, March 1982
Recurrence Solver Based Adders
– 1973: Kogge and Stone published a general recurrence scheme for parallel computation
– 1979: Brent and Kung published a Tech. Report on regular layout for parallel adders
– 1980: Guibas and Vuillemin developed a layout scheme based on the recurrence equation for addition
– 1980: Ladner and Fischer published “parallel prefix computation”, Journal of the ACM
– 1981: Bilgory and Gajski published a paper on recurrence structures for automatic cell generation
Recurrence Solver Based Adders: they are based on the recurrence equation for P, G (what is new there since Weinberger?!):
Recurrence Solver Based Adders
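A recurrence-solver (parallel-prefix) adder can be sketched with the usual associative operator on (G, P) pairs, $(G, P) \circ (G', P') = (G + P G', P P')$. The form of the operator is the common textbook one and is assumed here, since the slide's equations are not preserved; the prefix network chosen is the Kogge-Stone arrangement, and the function name is hypothetical.

```python
def kogge_stone_add(a: int, b: int, n: int = 16) -> tuple[int, int]:
    """n-bit addition with a Kogge-Stone prefix network (carry-in = 0).

    Returns (n-bit sum, carry out).
    """
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]
    big_g, big_p = g[:], p[:]
    distance = 1
    while distance < n:
        next_g, next_p = big_g[:], big_p[:]
        for i in range(distance, n):
            # (G, P)[i] combined with (G, P)[i - distance]
            next_g[i] = big_g[i] | (big_p[i] & big_g[i - distance])
            next_p[i] = big_p[i] & big_p[i - distance]
        big_g, big_p = next_g, next_p
        distance *= 2
    # After log2(n) levels, big_g[i] is the carry out of bit position i.
    total = 0
    for i in range(n):
        carry_in = big_g[i - 1] if i else 0
        total |= (p[i] ^ carry_in) << i
    return total, big_g[n - 1]
```

Each level doubles the span covered by every (G, P) pair, so 16 bits need four levels. The group P, G equations of the carry-lookahead adder are the same operator applied over fixed 4-bit spans.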
Carry-Lookahead Adder (Weinberger and Smith): just to remind you! Please notice the similarity with Parallel-Prefix Adders!
Multiplexer Based Adder Farooqui and Oklobdzija 1999 Int’l Sym. on VLSI Technology, Taipei, Taiwan, June 8-10, 1999
Multiplexer Based Adder
– Based on the realization that a MUX circuit is faster than a logic gate due to its transmission-gate implementation.
– Based on the Carry-Lookahead method (W-S), or recurrence solver.
Multiplexer Based Adder: A. A. Farooqui, V. G. Oklobdzija, F. Chechrazi, 1999 Int’l Sym. on VLSI Technology, Taipei, Taiwan, June 8-10, 1999.
Multiplexer Based Adder (A. A. Farooqui, V. G. Oklobdzija, F. Chechrazi, 1999 Int’l Sym. on VLSI Technology, Taipei, Taiwan, June 8-10, 1999)
– Results in a very fast structure: 7 MUX delays for a 64-b adder
– Delay using standard cells, 0.25 µm, 2.5 V, 25 °C: [Table: Adder Size (bits) vs. Delay (ps)]
DEC "Alpha" Adder
Combination:
– 8-bit tapered pre-discharged Manchester Carry Chains, with $C_{in} = 0$ and $C_{in} = 1$
– 32-bit LSB Carry-Lookahead Adder
– 32-bit MSB Conditional-Sum Adder
– Carry-Select on the most significant 32 bits
– Latches in the middle: pipelined addition
DEC "Alpha" Adder
DEC "Alpha" Adder: Results
– The first 200 MHz processor
– Built using 0.75 µm technology
– V = 3.3 V, 30 W
– Pipelined (two latches), allowing 5 ns throughput and 10 ns latency
Conclusion VLSI Implementation of Addition
Conclusion: VLSI Implementation of Addition
– Currently, implementation parameters are not reflected in the algorithms used for development
– Layout and wire-delay effects are largely neglected, and this is becoming intolerable in the next generation of technology
– Transistor sizing has a large effect, which can outweigh the algorithm
– There is a great disconnect between algorithm and implementation
– New rules and measures of goodness are needed
Multiplication Parallel Multiplier Implementation
Multiplication Algorithm: $p^{(j+1)} = \frac{1}{r}\left(p^{(j)} + r^{n} X y_j\right)$ for $j = 0, \ldots, n-1$, with $p^{(0)} = 0$ initially; $p^{(n)} = XY$ after $n$ steps.
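A minimal radix-2 sketch of a shift-and-add recurrence of this kind. It assumes the standard form $p^{(0)} = 0$, $p^{(j+1)} = (p^{(j)} + 2^{n} X y_j)/2$, where $y_j$ is bit $j$ of the multiplier; the function name is hypothetical.

```python
def shift_add_multiply(x: int, y: int, n: int) -> int:
    """Multiply two n-bit unsigned integers in n add-and-shift steps."""
    p = 0                                    # p(0) = 0
    for j in range(n):
        y_j = (y >> j) & 1                   # multiplier bit, LSB first
        acc = p + ((x << n) if y_j else 0)   # add X aligned to the top half
        p = acc >> 1                         # divide by the radix (exact here)
    return p                                 # p(n) = X * Y
```

For example, `shift_add_multiply(13, 11, 4)` performs four steps and ends with the full 8-bit product. The right shift never discards a set bit, because after j steps the partial result is a multiple of $2^{n-j}$.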
Parallel Multipliers
4:2 Compressor
Re-designed 4:2 Compressor with 3 XOR Delay [Figure: inputs I1, I2, I3, I4, $C_{in}$; outputs S, C, $C_{out}$; multiplexers with 0/1 select]
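The arithmetic a 4:2 compressor has to satisfy can be stated and checked with a simple model. This sketch uses the plain construction from two cascaded full adders, which is an assumption made for illustration; it is not the re-designed 3-XOR-delay circuit from the slide, although it computes a valid set of outputs for the same interface.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum, carry) of three input bits."""
    return a ^ b ^ c, (a & b) | (c & (a | b))


def compressor_4_2(i1: int, i2: int, i3: int, i4: int, c_in: int) -> tuple[int, int, int]:
    """Return (S, C, C_out) such that I1+I2+I3+I4+C_in = S + 2*(C + C_out)."""
    s1, c_out = full_adder(i1, i2, i3)   # C_out does not depend on C_in
    s, c = full_adder(s1, i4, c_in)
    return s, c, c_out
```

Because C_out is independent of C_in, a row of compressors can pass C_out to the neighbouring C_in without creating a ripple path along the row.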
A Method for Generation of Fast Parallel Multipliers by Vojin G. Oklobdzija David Villeger Simon S. Liu Electrical and Computer Engineering University of California Davis
Idea !!!!!
Three-Dimensional Optimization Method: TDM (Oklobdzija, Villeger, Liu, 1996)
Method
Computer Tools
Algorithm for Automatic Generation of Partial Product Array
Initialize:
  Form 2N-1 lists $L_i$ (i = 0, ..., 2N-2), each consisting of $p_i$ elements, where $p_i = i+1$ for $i \le N-1$ and $p_i = 2N-1-i$ for $i \ge N$.
  An element of a list $L_i$ (j = 0, ..., $p_i - 1$) is a pair $\langle n_j, \delta_j \rangle$, where:
    $n_j$ is a unique node-identifying name;
    $\delta_j$ is the delay associated with that node, i.e. the delay of a signal arriving at node $n_j$ with respect to some reference point.
  For i = 0, 1 and 2N-2: connect the nodes from the corresponding lists $L_i$ directly to the CPA.
For i = 2 to i = 2N-3 {Partial Product Array Generation}
Begin For
  If length of $L_i$ is even Then
  Begin If
    sort the elements of $L_i$ in ascending order by the values of delay $\delta_j$
    connect an HA to the first 2 elements of $L_i$, starting with the slowest input
    $D_s = \max\{\delta_A + \delta_{A-s},\ \delta_B + \delta_{B-s}\}$
    $D_c = \max\{\delta_A + \delta_{A-c},\ \delta_B + \delta_{B-c}\}$
    remove 2 elements from $L_i$
    insert the sum pair $\langle n_s, D_s \rangle$ into $L_i$
    insert the carry pair $\langle n_c, D_c \rangle$ into $L_{i+1}$
    decrement the length of $L_i$
    increment the length of $L_{i+1}$
  End If;
  While length of $L_i$ > 3
  Begin While
    sort the elements of $L_i$ in ascending order by the values of delay $\delta_j$
    connect an FA to the first 3 elements of $L_i$, starting with the slowest input of the FA
    $D_s = \max\{\delta_A + \delta_{A-s},\ \delta_B + \delta_{B-s},\ \delta_{Ci} + \delta_{Ci-s}\}$
    $D_c = \max\{\delta_A + \delta_{A-c},\ \delta_B + \delta_{B-c},\ \delta_{Ci} + \delta_{Ci-c}\}$
    remove 3 elements from $L_i$
    insert the sum pair $\langle n_s, D_s \rangle$ into $L_i$
    insert the carry pair $\langle n_c, D_c \rangle$ into $L_{i+1}$
    subtract 2 from the length of $L_i$
    increment the length of $L_{i+1}$
  End While;
  sort the elements of $L_i$
  connect an FA to the last 3 nodes of $L_i$
  connect the S and C to bits i and i+1 of the CPA
End For;
End Method;
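The core idea of the procedure above, always feeding the earliest-arriving signals of a column into the next adder, can be sketched in a few lines. This is a simplified illustration and not the authors' tool: it uses full adders only (no half-adder step), applies the same rule to every column, assumes a single hypothetical delay per full-adder output instead of per-input delays, and replaces the final carry-propagate adder by an integer sum. Bit values are carried along with the delays so the result can be checked.

```python
import heapq
from itertools import count

# Hypothetical unit delays (assumptions, not figures from the slides).
FA_SUM_DELAY = 2     # any input to sum
FA_CARRY_DELAY = 1   # any input to carry


def tdm_multiply(x: int, y: int, n: int) -> tuple[int, list[int]]:
    """Reduce the n x n partial-product array column by column.

    Returns (product, arrival profile), where the profile lists the latest
    signal delay in each column at the input of the final adder.
    """
    tick = count()                      # tie-breaker so heap entries compare
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            bit = ((x >> i) & 1) & ((y >> j) & 1)
            heapq.heappush(cols[i + j], (0, next(tick), bit))
    i = 0
    while i < len(cols):
        col = cols[i]
        while len(col) > 2:
            if i + 1 == len(cols):
                cols.append([])
            # Take the three earliest signals of this column.
            a, b, c = (heapq.heappop(col) for _ in range(3))
            latest = max(a[0], b[0], c[0])
            ones = a[2] + b[2] + c[2]
            heapq.heappush(col, (latest + FA_SUM_DELAY, next(tick), ones & 1))
            heapq.heappush(cols[i + 1], (latest + FA_CARRY_DELAY, next(tick), ones >> 1))
        i += 1
    # At most two signals per column remain; sum them in place of the CPA.
    product = sum(sig[2] << k for k, col in enumerate(cols) for sig in col)
    profile = [max((sig[0] for sig in col), default=0) for col in cols]
    return product, profile
```

The returned profile is the non-uniform signal arrival pattern that the final adder sees, which is the quantity the final-adder optimisation works from.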
Competing Approaches
Organization of Hitachi's DPL multiplier
Hitachi's 4:2 compressor structure
DPL multiplexer circuit
RECOMMENDATIONS
Conclusion
1. The key to improving multiplier speed was in optimizing interconnections, not the compressor circuit (as was believed for so long).
2. With the increase in wire delay, it is important to make a connection between layout topology and the algorithm for optimal interconnection of the PPRT.
3. Using one of the “fast adders” (CLA) as a final adder was actually counterproductive. A simple final adder, optimized for the signal arrival profile, yields better results with less hardware.
4. It is possible to further optimize the PPRT and FA so that the (fused) Multiply-Add operation can be performed in multiply time.
5. For larger multipliers and adders (as used in cryptography), the optimization procedures described yield even better results.
See:
Read This!
1. E. Swartzlander, "Computer Arithmetic", Vol. 1 & 2, IEEE Computer Society Press.
2. K. Hwang, "Computer Arithmetic: Principles, Architecture and Design", John Wiley and Sons.
3. M. Ercegovac, “Digital Systems and Hardware/Firmware Algorithms”, Chapter 12: Arithmetic Algorithms and Processors, John Wiley & Sons.
4. A. Chandrakasan, W. Bowhill, F. Fox, Editors, "Design of High Performance Microprocessors Circuits", IEEE Press, July.
5. V. G. Oklobdzija, “High-Performance System Design: Circuits and Logic”, IEEE Press, July.
Also:
THE END
Hollywood