
ELEC 516 VLSI System Design and Design Automation, Spring 2010. Lecture 4 - Shifter and Multiplier Design. Reading Assignment: Weste, Chapter 8; Rabaey, Chapter 11. Note: some of the figures in this slide set are adapted from the slide set of "Digital Integrated Circuits" by Rabaey et al., 2002.

Shifter Design
Shifting operations are important and are used extensively in arithmetic shifting, logical shifting, rotation, floating-point operations, scaling, and multiplication by constants, as well as for:
- Data alignment
- Field extraction/combination
- Address generation
Shifting a data word left or right by a constant amount is a trivial hardware operation. A programmable shifter, however, is more complex, e.g. shifting left or right by a variable number of bits. Design considerations include:
- Two-dimensional arrays
- Variable size
- Rotate
- Padding with zeros/ones

A simple shifter
The above design rapidly becomes complex and slow for larger shift values. A more structured approach is advisable. Two commonly used shift structures are the barrel shifter and the logarithmic shifter.

Barrel Shifter
It consists of an array of transmission gates, where the number of rows equals the word length of the data and the number of columns equals the maximum shift length. A major advantage of this shifter is that the signal passes through at most one transmission gate, so the delay is theoretically constant and independent of the shift value or shifter size. This is not true in reality, since the capacitance at the input of the buffers rises linearly with the maximum shift width.

Barrel Shifter (2)
Area is dominated by wiring (figure: data wires and control wires crossing the transmission-gate array).

Logarithmic Shifter
While the barrel shifter implements the whole shifter as a single array of pass transistors, the logarithmic shifter uses a staged approach: stages of multiplexers decompose the shift into power-of-two shifts. A shifter with a maximum shift width of M consists of log2(M) stages, where the i-th stage either shifts by 2^i or passes the data unchanged. The logarithmic shifter is usually smaller than the barrel shifter; for larger values of M, it is definitely the structure of choice. Its speed depends on the shift width in a logarithmic way, since an n-bit shifter requires log2(n) stages. Other shift options are frequently required as well, for instance shuffles, bit reversals, and interchanges.
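To make the staged decomposition concrete, here is a small behavioral Python sketch of a logarithmic right shifter (a hypothetical model written for this transcript, not code from the course): bit i of the shift amount decides whether stage i shifts the word by 2^i or passes it through unchanged.

```python
def log_shifter_right(data: int, shift: int, width: int = 8) -> int:
    """Behavioral model of a logarithmic shifter: log2(width) multiplexer stages,
    where stage i either shifts the word right by 2**i or passes it unchanged,
    controlled by bit i of the shift amount."""
    mask = (1 << width) - 1
    data &= mask
    stages = (width - 1).bit_length()        # log2(width) stages for power-of-two widths
    for i in range(stages):
        if (shift >> i) & 1:                 # stage i selects the shifted path
            data = (data >> (1 << i)) & mask # shift by 2**i, padding with zeros
        # otherwise stage i passes the data through unchanged
    return data

assert log_shifter_right(0b1011_0110, 3) == 0b0001_0110   # same result as a plain >> 3
```

A barrel shifter would instead be modeled as one wide multiplexer per output bit; the staged version trades that single level of logic for log2(n) smaller stages.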

Logarithmic Shifter (2)
In general, a barrel shifter is appropriate for small shift widths. For large shift values, the logarithmic shifter becomes more effective in terms of both area and speed. The logarithmic shifter is also more regular and hence easier to generate automatically.

Multiplexer-based shifter

Shifter design - Summary
The design of a shifter is a trade-off between area and delay.
- Barrel shifter: fastest but requires more transistors; speed O(1), area n^2 transistors. It is a wire-dominated circuit.
- Logarithmic shifter: slower but fewer transistors; speed O(log n), area n log n transistors.

The Multiplier
Multiplication is a very important operation; the speed of multiplication often limits the performance of a digital processor. Multiplications are used in many digital signal processing applications: correlation, convolution, filtering, and frequency analysis, as well as vector products, matrix multiplication, and the weighted sums required in DSP tasks such as neural networks and filtering. Multipliers are in fact complex adder arrays, and their analysis gives further insight into how to optimize the performance (or the area) of complex circuit topologies.

Example
Example: 10 x 5
  Multiplicand:      1 0 1 0   (10)
  Multiplier:        0 1 0 1   ( 5)
                     -------
                     1 0 1 0
                   0 0 0 0
                 1 0 1 0
               0 0 0 0
               -------------
               0 1 1 0 0 1 0   (50)
Four partial products. The multiplication process may be viewed as two steps: evaluation of the partial products, and accumulation of the shifted partial products. The partial products can be generated using an array of AND gates.

The Multiplier (II)
At the bit level, binary multiplication is equivalent to the AND operation: evaluation of the partial products consists of logically ANDing the multiplicand with the relevant multiplier bit. Different implementation techniques exist; the choice is based on factors such as speed, throughput, numerical accuracy, and area. An n x n multiplier produces a 2n-bit output.
- Integer multiplier: takes the n LSB bits.
- Floating-point multiplier (or fixed point with the binary point at the MSB), e.g. 1.XXX * 1.XXX: takes the n MSB bits.

Simple multiplier
Generates and adds one partial product each cycle; takes n cycles. (Figure: the multiplier is shifted right every cycle, its LSB gates the multiplicand through the partial-product generation logic, and an adder accumulates the result into a shifting product register.)
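As a rough behavioral model of this n-cycle scheme (illustrative Python with assumed names, not from the slides): each cycle the multiplier LSB decides whether the multiplicand is added into the upper half of the accumulator, and both the accumulator and the multiplier are then shifted right.

```python
def sequential_multiply(multiplicand: int, multiplier: int, n: int = 8) -> int:
    """Shift-and-add multiplier model: n cycles, one partial product per cycle.
    The running product sits in a 2n-bit accumulator that is shifted right each cycle."""
    acc = 0
    for _ in range(n):
        if multiplier & 1:                 # partial product = multiplicand AND multiplier LSB
            acc += multiplicand << n       # add into the upper half of the accumulator
        acc >>= 1                          # shift the accumulator right
        multiplier >>= 1                   # expose the next multiplier bit
    return acc

assert sequential_multiply(10, 5, n=4) == 50   # the 10 x 5 example from the earlier slide
```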

Issues in designing a fast multiplier
- Reduce the number of partial products.
- Use fast adder cells.
- Reduce the number of additions required to sum the partial products, e.g. use tree adders.

The Array Multiplier
Consider two unsigned binary numbers X and Y that are M and N bits wide, respectively. The partial product terms Pk are called summands. There are M*N summands, which are generated in parallel by a set of M*N AND gates.

The Array Multiplier (II)
An n x n array multiplier requires n(n-2) full adders, n half adders, and n^2 AND gates. The worst-case delay is (2n+1)*tg, where tg is the worst-case adder delay.

The Array Multiplier (III)
(Figure: the basic cell used in the array multiplier - an AND gate forms the partial-product bit from X and Y, feeding a full adder with sum input B and carry input C, producing the partial-product output PO and carry output CO.)

A 4x4 array multiplier
(Figure: the cell array driven by multiplicand bits x3, x2, x1, x0 and multiplier bits y0 onward, one partial-product row per multiplier bit.)

The MxN Array Multiplier - Critical Path

Carry-Save Adder (old style)
We do not need to optimize the carry chain of each row; instead, postpone the carry to a later stage. Delay = N*t_carry + t_and + t_merge. (Figure: an M x N array of half adders and full adders in carry-save form, followed by a vector-merging stage.) [Rab96] p.411

Booth Encoding
The multipliers we studied before use radix-2 multiplication, i.e. they observe one bit of the multiplier at a time. Higher-radix multipliers may be designed to reduce the number of adders and hence the delay required to compute the partial sums. Booth encoding performs two's complement multiplication and carries out several steps of the multiplication at once. It takes advantage of the fact that an adder-subtractor is nearly as fast and small as a simple adder. The most common form of Booth's algorithm looks at three bits of the multiplier at a time to perform two stages of the multiplication.

Booth Multiplier: Example
Since 2^a = 2^(a+1) - 2^a, we can recode a run of 1's in the multiplier as +1 0 ... 0 (-1), which may reduce the number of nonzero digits. For example:
  0 0 1 1 1 1 1 1 0 0    (original multiplier)
  0 0 +1 0 0 0 0 0 -1 0 0  (recoded: fewer 1's in this sequence)
This is based on the idea that 1111 can be rewritten as 10000 - 1. Does this recoding help us speed up the sequential multiplier? No! The datapath still goes through the adder. It is useful for multiplication by constants or for asynchronous multipliers. The real benefit comes when we group the digits into pairs, as we will see in modified Booth encoding. When does this encoding reduce the number of 1's? [© K. Bazaragan]
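A short Python sketch of this radix-2 recoding (a hypothetical helper, not from the slides): digit i is y_{i-1} - y_i with y_{-1} = 0, which turns each run of 1's into a +1 ... -1 pair while preserving the value.

```python
def booth_recode_radix2(y: int, n: int) -> list[int]:
    """Radix-2 Booth recoding of an n-bit unsigned value y.
    Returns n+1 digits (LSB first) in {-1, 0, +1} with sum(d_i * 2**i) == y;
    written MSB first, a run such as 0 1 1 1 1 0 becomes 0 +1 0 0 0 -1 0."""
    bits = [(y >> i) & 1 for i in range(n)]
    digits, prev = [], 0
    for i in range(n):
        digits.append(prev - bits[i])      # -1 at the start of a run, +1 just above its end
        prev = bits[i]
    digits.append(prev)                    # extra top digit closes a run ending at the MSB
    assert sum(d * (1 << i) for i, d in enumerate(digits)) == y
    return digits

print(booth_recode_radix2(0b0011111100, 10)[::-1])   # the slide's example, MSB first
```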

Booth Recoding: Multiplication Example
We first recode the multiplier 14 (0 1 1 1 0) as +1 0 0 -1 0, then multiply the recoded digits by the multiplicand (here 6), which leaves only two rows of partial sums: -6 shifted by one position and +6 shifted by four positions. We must sign-extend the negative partial product so that the addition of the partial products is correct: whenever adding a k-bit negative number to an L-bit number (L > k), we first convert the k-bit representation to an L-bit representation (extending the sign bit into the higher bit positions) and then perform the addition. The final sum is 84. [© K. Bazaragan]

Booth Recoding: Advantages and Disadvantages
Major advantage: it can reduce the number of 1's in the multiplier. So far, however, we have not improved the speed of the multiplier, since we still have to wait for the critical path, e.g. the shift-add delay in the sequential multiplier. Booth recoding also increases area, since we need recoding circuitry and subtraction.

Modified Booth Multiplier
We can reduce the number of partial sums by grouping more bits: group the recoded digits in pairs, leaving digits in {-2, -1, 0, 1, 2}. Grouping halves the number of partial products and gets rid of 3's (and of sequences of 1's in general). Example:
  0  1  1  0  1  1  1  0  0  0  1  0    (multiplier bits)
  +1 0 -1 +1  0  0 -1  0  0 +1 -1  0    (radix-2 Booth recoding)
  +2    -1     0    -2     +1    -2     (digits grouped in pairs, radix-4)
[© Hauck]

Modified Booth Encoding (II)
Consider the two's complement representation of the multiplier y:
  y = -y_{N-1} * 2^{N-1} + sum_{i=0}^{N-2} y_i * 2^i
We can rewrite 2^a = 2^(a+1) - 2^a, and hence, looking at the terms two at a time,
  y = sum_{j=0}^{N/2-1} (y_{2j-1} + y_{2j} - 2*y_{2j+1}) * 4^j,  with y_{-1} = 0,
so each pair of multiplier bits contributes one digit in {-2, -1, 0, 1, 2}.

Modified Booth Multiplier
We can encode the digits by looking at three bits at a time (reducing the partial sums). Booth recoding table:
  y_{i+1}  y_i  y_{i-1}   add
     0      0      0      0*M
     0      0      1      1*M
     0      1      0      1*M
     0      1      1      2*M
     1      0      0     -2*M
     1      0      1     -1*M
     1      1      0     -1*M
     1      1      1      0*M
We must be able to add the multiplicand times -2, -1, 0, 1 and 2. Since Booth recoding got rid of 3's, generating the partial products is not that hard (shifting and negating). [© Hauck]
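A behavioral Python sketch of this radix-4 recoding (the helper names are my own): the table above maps each overlapping 3-bit group (y_{i+1}, y_i, y_{i-1}) to a digit in {-2, -1, 0, 1, 2}, and the recoded digits multiply the multiplicand at weights 4^j.

```python
# Radix-4 (modified) Booth recoding table, indexed by (y_{i+1}, y_i, y_{i-1}).
BOOTH4 = {
    (0, 0, 0):  0, (0, 0, 1):  1, (0, 1, 0):  1, (0, 1, 1):  2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1):  0,
}

def booth4_digits(y_bits: int, n: int) -> list[int]:
    """Recode an n-bit two's complement pattern (n even) into n/2 radix-4 digits,
    scanning overlapping 3-bit groups, two new multiplier bits per group."""
    bits = [(y_bits >> i) & 1 for i in range(n)]
    digits, prev = [], 0                       # y_{-1} = 0
    for i in range(0, n, 2):
        digits.append(BOOTH4[(bits[i + 1], bits[i], prev)])
        prev = bits[i + 1]
    return digits

def booth4_multiply(x: int, y: int, n: int = 8) -> int:
    """Multiply x by the signed n-bit value y using the recoded digits."""
    digits = booth4_digits(y & ((1 << n) - 1), n)
    return sum(d * x * 4 ** j for j, d in enumerate(digits))

assert booth4_digits(0b111010, 6) == [-2, -1, 0]   # -6 recodes to -2, -1, 0 (LSB first)
assert booth4_multiply(13, -6, n=6) == -78
```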

Booth Multiplier: Example
Retire two bits per shift operation; additions are signed, and the partial products are sign-extended by 2 bits when adding two partial products at a time. Example: multiplicand 13 (0 0 1 1 0 1), multiplier -6 (1 1 1 0 1 0), which recodes to the radix-4 digits -1, -2 (and 0 for the top group). The two nonzero partial products are -2*13 = -26 and -1*13*4 = -52; after sign extension they add up to -78 = 13 * (-6). The recoding table is the same as on the previous slide. [© K. Bazaragan]

Booth Multiplier
The following shows the structure of a Booth multiplier. (Figure: cascaded stages j, j+1, ...; in each stage a mux selects x or 2x according to the Booth-recoded multiplier bits (y_{i+1}, y_i, ...), an adder/subtractor combines the selected multiple with the partial product Pj, and the result is left-shifted by 2 before the next stage.)

Modified Booth Multiplier - Summary
Uses a higher radix to reduce the number of intermediate addition operands. One can go higher still: radix-8 or radix-16; radix-8 must implement the multiples *3, *-3, *4, *-4, so recoding and partial-product generation become more complex. Modified Booth encoding automatically takes care of signed multiplication.

Wallace-Tree Based Multiplier
Principle: sum the N shifted partial products by doing the N-input addition efficiently - reduce the N-input addition in steps using counters, e.g. the carry-save adder (CSA), which gives a 3:2 reduction. A CSA is simple: it is just a full adder. At the end of the array you need to add the two remaining operands together. This takes a fast adder, but you only need one at the end, not one for each partial product.

Reduction by carry-save adders
Example: X(2,1,0) * Y(2,1,0). Let A0 = X(0)*Y(0), A1 = X(1)*Y(0), A2 = X(2)*Y(0), etc., giving three partial-product rows:
  A2 A1 A0
  B2 B1 B0
  C2 C1 C0
(Figure: CSAs compress the three rows into two operands, which a final CPA adds.)
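A word-level Python sketch of the 3:2 reduction (a hypothetical helper, not from the slides): a carry-save adder produces a sum word and a carry word instead of propagating carries, so three operands become two.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save compression: every bit position is an independent full adder,
    so no carry ripples. The invariant a + b + c == s + carry always holds."""
    s = a ^ b ^ c                                   # per-bit sum
    carry = ((a & b) | (b & c) | (a & c)) << 1      # per-bit carry, moved to its weight
    return s, carry

# Three 3-bit partial-product rows, as in the slide's example, reduced to two
# operands; a single carry-propagate adder (CPA) then finishes the sum.
A, B, C = 0b101, 0b011, 0b110
s, carry = carry_save_add(A, B, C)
assert s + carry == A + B + C
```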

Carry-Save Multiplier

Wallace Tree Multiplier
The Wallace tree multiplier uses logic tricks to speed up the required addition. It is an adder tree built from carry-save adders using 3-to-2 reduction: a 1-bit full adder provides a 3:2 compression in the number of bits. The addition of the partial products in a column of an array multiplier may be thought of as totaling up the number of 1's in that column, with a carry being passed to the next column to the left.
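The same idea at the word level, as an illustrative Python sketch (my own code, not the course's): groups of three partial-product rows are compressed 3:2 by carry-save adders until only two rows remain, and one carry-propagate addition completes the product.

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compressor: a + b + c == s + carry, with no carry propagation."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def wallace_sum(rows: list[int]) -> int:
    """Wallace-style reduction of N partial-product rows down to two operands,
    followed by a single carry-propagate addition."""
    while len(rows) > 2:
        nxt, full = [], len(rows) - len(rows) % 3
        for i in range(0, full, 3):                # compress each group of three rows
            s, c = csa(rows[i], rows[i + 1], rows[i + 2])
            nxt += [s, c]
        rows = nxt + rows[full:]                   # 0-2 leftover rows pass through
    return sum(rows)                               # the final carry-propagate adder

# Partial products of a 4x4 multiply, generated by AND gates and shifted into place.
x, y, n = 13, 11, 4
rows = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]
assert wallace_sum(rows) == x * y
```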

Wallace Tree Multiplier
(Block diagram: the multiplicand and multiplier feed a partial product generator; the partial products go through a summation network that reduces them to two 2n-bit operands; a carry-propagate adder produces the final 2n-bit product.)

Wallace-Tree Multiplier

Wallace Tree Example: Delay = 4 CSA + 1 CLA. [Par00] p.130 [© Oxford U Press]

Wallace-Tree Multiplier

Wallace-Tree Based Multiplier

The issue of sign extension
When a partial product is negative, we need to sign-extend it. If we do this just by copying the sign bit, there is an impact on delay, since the fanout can be large. We can use a trick: pre-add the triangle of 1's produced by sign extension, then clear those 1's out by adding 1 to the row where needed. Depending on the sign bit S, the extension field contributes either all 0's (S = 0) or all 1's (S = 1).

The issue of sign extension (2)
With this trick, only a few bits (involving the sign bit S and some constant 1's) need to be added to each row; adding these few bits is equivalent to complete sign extension.

Other Multiplier Structures
- Serial multiplier: very compact but very slow; an (M+N)-bit product requires Td = M*N clock cycles.
- Serial/parallel multiplier: very modular, a good trade-off; Td = M+N cycles.

Multipliers - Summary

Floating-point units
Floating-point operations are more complex and take more time than integer operations, but are accessed less often; the unit is often designed outside the normal ALU, as a co-processor. Floating-point representation:
  Data = (-1)^sign * 0.1Fraction * 2^exp
Normalization: 1/2 <= Data < 1 (for Exp = 0, Sign = 0); the first fraction bit is one, so there is no need to represent it. IEEE double-precision format: sign - 1 bit, exponent - 11 bits, fraction - 52 bits, 64 bits in total.
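For reference, a small Python snippet (illustrative, standard library only) that pulls apart the 1 + 11 + 52 bit fields of the 64-bit IEEE-754 format quoted on the slide; note that IEEE-754 itself normalizes to 1.xxx with a hidden leading one, slightly different from the slide's 0.1xxx convention.

```python
import struct

def decompose_double(x: float) -> tuple[int, int, int]:
    """Return the (sign, biased exponent, fraction) fields of an IEEE-754 double:
    1 sign bit, 11 exponent bits, 52 fraction bits."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]   # raw 64-bit pattern
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction

print(decompose_double(-0.15625))   # -1.25 * 2**-3 -> sign 1, exponent 1020, fraction 2**50
```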

Floating Point Addition
- Align operands: check exponents, shift data.
- Add the fractional bits: an integer addition.
- Normalization: increment or decrement the exponent.
- Round the data.
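The steps above, as a toy Python model on (sign, exponent, fraction) triples using the slide's 0.1xxx mantissa convention (an illustrative sketch with assumed conventions, not an IEEE-754 implementation; rounding is just truncation):

```python
def fp_add(a, b, frac_bits=8):
    """Toy floating-point addition: align, integer add, normalize.
    Operands are (sign, exp, frac) with mantissa frac/2**frac_bits in [1/2, 1)."""
    (sa, ea, fa), (sb, eb, fb) = a, b
    # 1. Align operands: compare exponents, shift the smaller operand's fraction right.
    if ea < eb:
        (sa, ea, fa), (sb, eb, fb) = (sb, eb, fb), (sa, ea, fa)
    fb >>= (ea - eb)
    # 2. Add the fractional bits (an integer addition, signs folded in).
    total = (fa if sa == 0 else -fa) + (fb if sb == 0 else -fb)
    sign, mag, exp = (0 if total >= 0 else 1), abs(total), ea
    if mag == 0:
        return (0, 0, 0)
    # 3. Normalize: shift until 1/2 <= mantissa < 1 again, adjusting the exponent.
    while mag >= (1 << frac_bits):
        mag >>= 1                      # 4. a real design would round here
        exp += 1
    while mag < (1 << (frac_bits - 1)):
        mag <<= 1
        exp -= 1
    return (sign, exp, mag)

# 0.75 * 2**1 + 0.75 * 2**1 = 3.0 = 0.75 * 2**2
assert fp_add((0, 1, 192), (0, 1, 192)) == (0, 2, 192)
```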

Floating point adder
(Block diagram: for operands A and B, an exponent-difference unit controls a shift/align block; a sign unit and the mantissa adder produce the sum; normalize, round, and exponent-update blocks produce the sign, mantissa, and exponent of the result C.)

Floating Point Multiplication
- Add the exponents: an 11-bit addition.
- Multiply the mantissas: an integer multiplication.
- Normalization: shift the data (by at most one position) and decrement the exponent.
- Round the data.
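The corresponding toy multiply, again with the slide's 0.1xxx mantissa convention (illustrative only, truncation instead of rounding): the product of two mantissas in [1/2, 1) lies in [1/4, 1), so normalization needs at most one left shift and one exponent decrement, exactly as the slide states.

```python
def fp_mul(a, b, frac_bits=8):
    """Toy floating-point multiply: add exponents, integer-multiply mantissas,
    normalize by at most one left shift, truncate instead of rounding."""
    (sa, ea, fa), (sb, eb, fb) = a, b
    sign = sa ^ sb                             # result sign is the XOR of operand signs
    exp = ea + eb                              # 1. add the exponents
    frac = (fa * fb) >> frac_bits              # 2. mantissa multiply, keep frac_bits bits
    if frac < (1 << (frac_bits - 1)):          # 3. normalize if the product fell below 1/2
        frac <<= 1
        exp -= 1
    return (sign, exp, frac)                   # 4. rounding step omitted

# (0.75 * 2**1) * (0.5 * 2**1) = 1.5 = 0.75 * 2**1
assert fp_mul((0, 1, 192), (0, 1, 128)) == (0, 1, 192)
```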

Floating Point Multiplier
(Block diagram: an exponent adder, an XOR for the sign, and a mantissa multiplier feed normalize, round, and exponent-update blocks, producing the sign, mantissa, and exponent of the result C.)

Comparator A = B, A > B, A < B

High speed comparator
A single-cycle comparator based on the priority-encoding algorithm and dynamic circuit design techniques [Huang 2002]. Four steps:
1. An XOR gate determines whether each corresponding bit pair of the two numbers is equal or not.
2. A priority encoder sets the most significant unequal bit of the result from step 1 to '1' and resets all other bits to '0'.
3. The result of step 2 is ANDed with each of the two input numbers.
4. All the bits of each result from step 3 are ORed together to determine which number is greater.
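A behavioral Python model of these four steps (illustrative, with assumed names): the XOR marks unequal bit positions, the priority encoder keeps only the most significant one, and the AND/OR-reduce steps report which input holds a 1 at that position.

```python
def compare_pe(a: int, b: int, width: int = 8) -> str:
    """Priority-encoding comparator, modeled at the word level."""
    mask_all = (1 << width) - 1
    diff = (a ^ b) & mask_all                    # step 1: per-bit inequality
    if diff == 0:
        return "A == B"
    one_hot = 1 << (diff.bit_length() - 1)       # step 2: keep only the MS unequal bit
    if (a & one_hot) != 0:                       # steps 3-4: A has the 1 at that position
        return "A > B"
    return "A < B"                               # otherwise B has it

assert compare_pe(0b1010, 0b0110, 4) == "A > B"
assert compare_pe(0b0101, 0b0101, 4) == "A == B"
```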

Dynamic Priority Encoder Critical path: 7 transistors because of the NAND gate implementation

Wide bit-width comparator - 64 bits
Hierarchical, multi-stage structure; phase pipelining is used to achieve single-clock operation.

New comparator not using a priority encoder
The new algorithm uses a parallel MSB-checking method instead of priority encoding to determine the location of the most significant bit at which the two inputs differ. This method allows NOR-type logic gates to be used and results in a faster dynamic-logic implementation.

New algorithm
Four steps:
1. Both AB' and A'B are computed. Unlike the original PE algorithm, which uses an XOR gate to find the bits where A and B differ, this also preserves the information of which number is larger at each bit location. E.g. AB' = 4'b0010 indicates that at bit 1, A is larger than B.
2. A data conversion (calculating A* and B*) determines the most significant bit that is '1' in the results of step 1. Unlike the priority encoder, instead of setting the most significant 1-bit to '1' and resetting all other bits to '0', we set all the bits preceding the most significant 1 (not including the most significant 1 itself) to '1' and reset all other bits to '0'. Doing so allows the implementation to use NOR-type dynamic logic.
3. We calculate (A*)'B* and A*(B*)'. If A* has a longer run of zeros, A*(B*)' will be all zeros and (A*)'B* will have some bits equal to 1, and vice versa.
4. We check whether each result of step 3 is an all-zero vector by ORing all its bits together. A zero vector means that the other input is the greater one.
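The following Python sketch is my reading of these four steps (the handling of the case where AB' or A'B is all zeros, and the polarity of the final decision, are assumptions; treat it as illustrative rather than the authors' exact formulation):

```python
def _star(d: int, width: int) -> int:
    """Step 2 data conversion: set every bit ABOVE the most significant 1 of d to '1'
    (not including that bit itself) and all other bits to '0'.
    Returning all ones when d == 0 is an assumption, chosen so equal inputs work out."""
    if d == 0:
        return (1 << width) - 1
    msb = d.bit_length() - 1
    return ((1 << width) - 1) & ~((1 << (msb + 1)) - 1)

def compare_parallel_msb(a: int, b: int, width: int = 8) -> str:
    mask = (1 << width) - 1
    d_a, d_b = a & ~b & mask, ~a & b & mask                  # step 1: AB' and A'B
    a_star, b_star = _star(d_a, width), _star(d_b, width)    # step 2: A*, B*
    p = ~a_star & b_star & mask                              # step 3: (A*)'B*
    q = a_star & ~b_star & mask                              #          A*(B*)'
    if p == 0 and q == 0:                                    # step 4: OR-reduce both vectors
        return "A == B"
    return "A > B" if p != 0 else "A < B"

assert compare_parallel_msb(0b1010, 0b0110, 4) == "A > B"
assert compare_parallel_msb(0b0110, 0b1010, 4) == "A < B"
assert compare_parallel_msb(0b0101, 0b0101, 4) == "A == B"
```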

Implementation