Presentation is loading. Please wait.

Presentation is loading. Please wait.

VLSI Arithmetic Lecture 10: Multipliers

Similar presentations


Presentation on theme: "VLSI Arithmetic Lecture 10: Multipliers"— Presentation transcript:

1 VLSI Arithmetic Lecture 10: Multipliers
Prof. Vojin G. Oklobdzija University of California

2 Multiplication Algorithm*
*from Parhami 27 November 2018

3 Multiplication Algorithm*
*from Parhami 27 November 2018

4 Multiplication Algorithm*
*from Parhami 27 November 2018

5 *from Parhami 27 November 2018

6 Multiplication* *from Parhami 27 November 2018

7 Multiplication* *from Parhami 27 November 2018

8 *from Parhami 27 November 2018

9 *from Parhami 27 November 2018

10 Multiplier Recoding* *from Parhami 27 November 2018

11 *from Parhami 27 November 2018

12 Multiplication by Constants
*from Parhami 27 November 2018

13 Multiplication by Constants
*from Parhami 27 November 2018

14 Fast Multipliers *from Parhami 27 November 2018

15 Using Higher Radix Multiplier
*from Parhami 27 November 2018

16 Using Higher Radix Multiplier
*from Parhami 27 November 2018

17 Higher Radix Multiplier
*from Parhami 27 November 2018

18 *from Parhami 27 November 2018

19 Booth’s Recoding *from Parhami 27 November 2018

20 Booth’s Recoding *from Parhami 27 November 2018

21 Booth’s Recoding *from Parhami 27 November 2018

22 *from Parhami 27 November 2018

23 Higher Radix Multipliers
*from Parhami 27 November 2018

24 Tree and Array Multipliers
*from Parhami 27 November 2018

25 Tree and Array Multipliers
*from Parhami 27 November 2018

26 Generating Partial Products
27 November 2018 *from G. Bewick

27 Generating Partial Products
*from G. Bewick 27 November 2018

28 Generating Partial Products using Booth’s Recoding
*from G. Bewick 27 November 2018

29 Generating Partial Products using Booth’s Recoding
*from G. Bewick 27 November 2018

30 Booth Partial Product Selector Logic
*from G. Bewick 27 November 2018

31 Tree Multipliers *from Parhami 27 November 2018

32 27 November 2018

33 27 November 2018

34 27 November 2018

35 27 November 2018

36 27 November 2018

37 Tree Multipliers *from Parhami 27 November 2018

38 Tree Multipliers *from Parhami 27 November 2018

39 Tree Multipliers *from Parhami 27 November 2018

40 27 November 2018

41 27 November 2018

42 27 November 2018

43 Reduction using 4:2 Compressors
*from G. Bewick 27 November 2018

44 A Method for Generation of Fast Parallel Multipliers
by Vojin G. Oklobdzija David Villeger Simon S. Liu Electrical and Computer Engineering University of California Davis

45 Fast Parallel Multipliers
Objective Improved Speed of Parallel Multiplier via: Improvements in Partial-Product Bit Reduction Techniques Optimization of the Final Adder for the Uneven Signal Arrival Profile from the Multiplier Tree 27 November 2018 Multiplier Design

46 Multiplication Algorithm: initially for j=0,....,n-1
Traditionally multiplication operation is performed in a variety of forms, in hardware and software, depending on the cost and transistor budget allocated for this particular operation. Today it is more likely to find a full hardware implementation of the multiplication because of growing demand for speed and decreasing cost of hardware. We show a basic multiplication algorithm which operates on positive n-bit long integers X and Y resulting in the product P which is 2n - bits long. This expression indicates that the multiplication process is performed by summing n terms of a partial product Pi. This product Pi is obtained by simple arithmetic left shift of X for the i positions and multiplication by the single digit yi. For the binary radix (r=2), yi can only be 0 or 1 and multiplication by the digit yi is a simple AND operation. The addition of n terms can be performed at once, by passing the partial products through a network of adders or sequentially, by adding partial products using an adder n times. The algorithm to perform the multiplication of X and Y is shown in the slide. It can be proved without difficulties that after n steps this recurrence results in a product p(n)=XY. p(n)=XY after n steps 27 November 2018 Multiplier Design

47 27 November 2018 Multiplier Design

48 27 November 2018 Multiplier Design

49 Parallel Multipliers Multiplier Design 27 November 2018
An alternative approach to sequential multiplication involves simultaneous generation of all bit products and their summation with an array of full adders. This approach uses an n by n array of AND gates to form the bit products, an array of n x n adders (and half adders) to sum the bit products in a carry-save fashion. Finally a 2n Carry-Propagate Adder is used in the final step to finish the summation and produce the result. Wallace introduced a way of summing the partial product bits in parallel using a tree of Carry Save Adders which became generally known as the “Wallace Tree” . A suggestion for improved efficiency of addition of the partial products was published by Dadda in 1965. Dadda introduces a notion of a counter which will take a number of bits p in the same bit position and output a number q which represent the count of ones at the input. This process is shown in the slide illustrating 8 by 8 multiplication process. An input of 8 by 8 matrix of dots, each dot represents a bit product, is shown as a Matrix 0. Columns having more than six dots are reduced by the use of half adders. Each half adder takes in two dots and outputs one in the same column and one in the next more significant column. Each full adder takes in three dots and outputs one in the same column and one in the next more significant column. No column in Matrix 1 will have more than six dots. Half adders are shown by a “crossed” line in the succeeding matrix and full adders are shown by a line in the succeeding matrix. In each case the right most dot of the pair that are connected by a line is in the column from which the inputs were taken for the adder. In the succeeding steps reduction is performed to Matrix 2 with no more than four dots per column, Matrix 3 with no more than three dots per column, and finally Matrix 4 with no more than two dots per column is. The height of the matrices is determined by working back from the final matrix and limiting the height of each matrix to the largest integer that is no more than 1.5 times the height of its successor. Each matrix is produced from its predecessor in one adder delay. Since the number of matrices is logarithmically related to the number of bits in the words to be multiplied, the delay of the matrix reduction process is proportional to log(n). The adder that reduces the final two row matrix to the final product can be implemented as a fast adder, which also has logarithmic delay. The total delay for this multiplier is proportional to the logarithm of the size of its operands. The effort of improving the speed of the parallel multiplier continued for almost 30 years. 27 November 2018 Multiplier Design

50 27 November 2018 Multiplier Design

51 27 November 2018 Multiplier Design

52 27 November 2018 Multiplier Design

53 27 November 2018 Multiplier Design

54 27 November 2018 Multiplier Design

55 Minimum Number of Stages (Dada’s Rule)
27 November 2018 Multiplier Design

56 Their Schemes

57 Use of 4:2 Compressors A. Weinberger 1981 M. Santoro 1988
27 November 2018 Multiplier Design

58 4:2 Compressor Multiplier Design 27 November 2018
In 1981 Weinberger disclosed a structure which he called "4-2 carry-save module". This structure contained a combination of Full Adder cells in an intricate interconnection structure which yields a faster partial product compression than the use of 3:2 counters. The structure actually compresses five partial product bits into three, however it is connected in such a way that four of the inputs are coming from the same bit position of the weight j while one bit is fed from the neighboring position j-1 also known as carry-in. The output of such a 4:2 module consists of one bit in the position j and two bits in the position j+1. This structure does not represent a counter, though it became erroneously known as "4:2 counter“, but a "compressor" that would compress four partial product bits into two. The structure of 4:2 compressor is shown in this slide. The efficiency of such a structure to reduce partial product bits is higher. It reduces the number of partial product bits by one half at each stage. The speed of such a 4:2 compressor has been determined by the speed of 3 XOR gates in series, in the redesigned version of 4:2 compressor, making such a scheme more efficient that the one using 3:2 counters in a regular "Wallace Tree". The other equally important feature of the use of 4:2 compressor is that the interconnections between 4:2 cells follow more regular pattern than it is the case of the "Wallace Tree". 27 November 2018 Multiplier Design

59 Critical Signal Path in a 4:2 Compressor Tree
27 November 2018 Multiplier Design

60 Re-designed 4:2 Compressor with 3 XOR Delay (Nagamatsu, Toshiba)
in I1 I2 S I3 I4 1 C This slide shows a re-design 4:2 compressor as introduced by Toshiba. The advantage of this compressor is that it results in 3 XOR gate maximal delay as opposed to 4 XOR delay in a regular implementation using Full-Adder cells. In the next slide we will show how this re-design was not necessary. The missing point was in the lack of understanding of how to balance delays of individual full-adders. C out 27 November 2018 Multiplier Design

61 Critical Path in a 4:2 Compressor
27 November 2018 Multiplier Design

62 Signal Arrival Profile for RWT (3:2) and MWT (4:2)
27 November 2018 Multiplier Design

63 Using 9:2 Compressors (P. Song, G. De Michelli 1991)
27 November 2018 Multiplier Design

64 Compressor Tree Implemented with 9:2 Compressors
27 November 2018 Multiplier Design

65 9:2 Compressor Structure
27 November 2018 Multiplier Design

66 Title Multiplier Design Critical Path: (Equivalent XOR Gate Delays)
24 Title 4:2 Compressor (Redesigned) 22 Delay (XOR Gates) 9to2 Compressor (Redesigned ) 20 3,2 Counter 18 16 14 12 10 8 6 4 2 10 20 30 40 50 60 70 80 90 100 Multiplier Width (bits) 27 November 2018 Multiplier Design

67 Delay Expressed as No. of XOR Gate Delays
27 November 2018 Multiplier Design

68 Use of Higher-Order Compressors
D. Villeger, V.G. Oklobdzija 1993 27 November 2018 Multiplier Design

69 Design of a 13:2 Compressor from a 9:2 Compressor
27 November 2018 Multiplier Design

70 Delay Profile of a 24:2 Compressor Tree
27 November 2018 Multiplier Design

71 Compressor Family Characteristics
27 November 2018 Multiplier Design

72 Using Carry-Propagate Adders
(G. Bewick 1993) (D. Villeger & V. G. Oklobdzija 1993) 27 November 2018 Multiplier Design

73 Column Compression Tree Consisting of 4-bit Adders
27 November 2018 Multiplier Design

74 Bit Reduction Using 4-bit Adders (24X24)
27 November 2018 Multiplier Design

75 Idea !!!!!

76 (Oklobdzija, Villeger, Liu, 1994)
A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach – TDM (Oklobdzija, Villeger, Liu, 1994) 27 November 2018 Multiplier Design

77 27 November 2018 Multiplier Design

78 27 November 2018 Multiplier Design

79 27 November 2018 Multiplier Design

80 Signal Delays in a Full Adder
(3,2) Counter A B Cin Sum * Fast Input Fast Output Carry 27 November 2018 Multiplier Design

81 Three-Dimensional optimization Method: TDM (Oklobdzija, Villeger, Liu, 1996)
The further improvement in speed of a parallel multiplier was achieved by introduction of TDM method in 1996. TDM optimizes the entire Partial Product Reduction Tree in one pass, thus the name Three Dimensional optimizaiton Method. The important aspect of this method is in sorting of fast inputs and fast outputs. It was realized that the most important step in achieving fast partial product reduction is to properly interconnect the elements. Thus, appropriate counters, 3:2 adders in this particular case, were characterized in a way which identifies delay of each input to each output. Interconnecting of the Partial Product Reduction Tree was done in a way in which signals with large delays are connected to "fast inputs" and signals with small delay to "slow inputs" . This slide illustrates how an 4:1 compressor with 3 XOR delay can be obtained by a simple application of TDM method without the need for redesign. 27 November 2018 Multiplier Design

82 Method

83 27 November 2018 Multiplier Design

84 27 November 2018 Multiplier Design

85 27 November 2018 Multiplier Design

86 27 November 2018 Multiplier Design

87 Computer Tools

88 27 November 2018 Multiplier Design

89 Design of Parallel Multipliers
Algorithm for Automatic Generation of Partial Product Array. Initialize: Form 2N-1 lists Li ( i = 0, 2N-2 ) each consisting of pi elements where: p i = i+1 for i £ N-1 and p i = 2N-1-i for i  N An element of a list Li ( j = 0,...,pi-1 ) is a pair: <nj, Dj>i where: nj : is a unique node identifying name Dj : is a delay associated with that node representing a delay of a signal arriving to the node nj with respect to some reference point. For i = 0,1 and 2N-2: connect nodes from the corresponding lists Li directly to the CPA. 27 November 2018 Multiplier Design

90 27 November 2018 Multiplier Design

91 For i=2 to i=2N-3 {Partial Product Array Generation} Begin For if length of Li is even Then Begin If sort the elements of Li in ascending order by the values of delay dj connect an HA to the first 2 elements of Li starting with the slowest input Ds =max {dA+dA-S, dB+dB-S} Dc =max {dA+dA-C, dB+dB-C} remove 2 elements from Li insert the pair <Ds,NetName> into Li insert the pair <Dc,NetName> into Li decrement the length of Li increment the length of Li End If; 27 November 2018 Multiplier Design

92 while length of Li > 3 Begin While sort the elements of Li in ascending order by the values of delay dj connect an FA to the first 3 elements of Li starting with the slowest input of the FA: Ds =max {dcA+dcA-S, dcB+dcB-S, dcCi+dcCi-S} Dc = max {dcA+dcA-C, dcB+dcB-C, dcCi+dcCi-C} remove 3 elements from Li insert the pair <Ds,NetName> into Li insert the pair <Dc,NetName> into Li+1 subtract 2 from the length of Li increment the length of Li+1 End While; sort the elements of Li connect an FA to the last 3 nodes of Li connect the S and C to the bit i and i+1 of the CPA End For; End Method;

93 Competing Approaches

94

95 27 November 2018 Multiplier Design

96 27 November 2018 Multiplier Design

97 Algorithm for Implementation of Fast Parallel Multipliers
[1] V. G. Oklobdzija, D. Villeger, and S. S. Liu, “A Method For Speed Optimized Partial Product Reduction And Generation Of Fast Parallel Multipliers Using An Algorithmic Approach,” IEEE Transactions on Computers, Vol 45, No.3, March, 1996. [2] V. G. Oklobdzija and D. Villeger, “Improving Multiplier Design By Using Improved Column Compression Tree And Optimized Final Adder In CMOS Technology,” IEEE Transactions on VLSI Systems, Vol.3, No.2, June, 1995, 25 pages. [3] V. G. Oklobdzija and D. Villeger, “Multiplier Design Utilizing Improved Column Compression Tree And Optimized Final Adder In CMOS Technology,” Proceedings of the 1993 International Symposium on VLSI Technology, Systems and Applications, pp , 1993. [4] P. Stelling, C. Martel, V. G. Oklobdzija, R. Ravi, “Optimal Circuits for Parallel Multipliers,” IEEE Transaction on Computers, Vol. 47, No.3, pp , March, 1998. 27 November 2018 Multiplier Design

98 Organization of Hitachi's DPL multiplier
And example of a fast multiplier is Hitachi's DPL multiplier which was the first one to achieve under 5nS speed for a 54-bit floating-point mantissa. This multiplier is of a regular structure including: (a.) A Booth Recoder (b.) A Partial Product Reduction Tree and (c.) A final Carry Propagate Adder (CPA) as shown in this slide. 27 November 2018 Multiplier Design

99 Hitachi's 4:2 compressor structure
 The key to performance of Hitachi's multiplier lays in the use of DPL circuits and the efficiency with which DPL can realize 4:2 compressor. The structure of Hitachi's 4:2 compressor is shown in this slide. The realization of the 4:2 function consists entirely of DPL multiplexers which introduce only one pass-transistor delay in the critical path as shown in the next slide. Indeed this structure is one of the fastest transistor realizations for the Partial Product Reduction Tree. The speed of this multiplier can be further optimized by applying TDM algorithm realizing optimal interconnections. Such a structure yields 4.1nS delay in 0.25u technology. For larger size multipliers this structure may start showing degraded performance because of the long pass-transistor chain in 4:2 compressors used. However, this is of not much concern since 54-bit represents double precession floating point format and we rarely use larger multiplier sizes in practice. 27 November 2018 Multiplier Design

100 DPL multiplexer circuit
An efficient pass transistor realization of multiplexers using Hitachi’s DPL logic is shown in this slide. This multiplexer introduces only one pass-transistor delay in the critical path, thus allowing three pass transistor delays for one 4:2 compressor. The entire partial product reduction tree results in 12 pass-transistor delays in the critical path. The use of TDM reduces this number to 10. 27 November 2018 Multiplier Design

101 Addition Under Non-equal Signal Arrival Profile Assumption
P. Stelling , V. G. Oklobdzija, "Design Strategies for Optimal Hybrid Final Adders in a Parallel Multiplier", special issue on VLSI Arithmetic, Journal of VLSI Signal Processing, Kluwer Academic Publishers, Vol.14, No.3, December 1996 27 November 2018 Multiplier Design

102 Signal Arrival Profile form the Parallel Multiplier Partial-Product Recuction Tree
27 November 2018 Multiplier Design

103 Oklobdzija, Villeger, IEEE Transactions on VLSI Systems, June, 1995
27 November 2018 Multiplier Design

104 Oklobdzija and Villeger, IEEE Transactions on VLSI Systems, June, 1995
27 November 2018 Multiplier Design

105 27 November 2018 Multiplier Design

106 27 November 2018 Multiplier Design

107 27 November 2018 Multiplier Design

108 27 November 2018 Multiplier Design

109 27 November 2018 Multiplier Design

110 27 November 2018 Multiplier Design

111 27 November 2018 Multiplier Design

112 27 November 2018 Multiplier Design

113 Performing Multiply-Add Operation in the Multiply Time
P. Stelling, V. G. Oklobdzija, " Achieving Multiply-Accumulate Operation in the Multiply Time", Thirteenth International Symposium on Computer Arithmetic, Pacific Grove, California, July 5 - 9, 1997. 27 November 2018 Multiplier Design

114 27 November 2018 Multiplier Design

115 Final Adder: Implementation
27 November 2018 Multiplier Design

116 Final Adder: Implementation
27 November 2018 Multiplier Design

117 Final Adder: Implementation
27 November 2018 Multiplier Design

118 Final Adder: Implementation
27 November 2018 Multiplier Design

119 RECOMENDATIONS

120 Fast Parallel Multipliers
Different Counter and Compressor Families were compared. The best way is to build a compressor of the maximal size (i.e. the entire size of the multiplier) The Essence of the optimal tree is optimal wiring and NOT the use of counter/compressor family The use of Carry-Propagate Adders is advantageous for larger size multipliers in the first stage and for particular technology Tuning of the Final Adder into the signal arrival profile is more important than the speed of the Final Adder. 27 November 2018 Multiplier Design

121 THE END 27 November 2018 Multiplier Design

122 Hollywood


Download ppt "VLSI Arithmetic Lecture 10: Multipliers"

Similar presentations


Ads by Google