CSE 575 Computer Arithmetic Spring 2003 Mary Jane Irwin (www.cse.psu.edu/~mji)

* and / Considerations
It is possible to build really fast multipliers (Wallace tree: 2 log n with a fast CPA) and “sort of” fast dividers (base 4 SRT: n/2 add (CSA) cycles), at the cost of silicon area and energy.
What if area (and energy) are more important metrics than performance?

Array Multipliers & Dividers
Slow, but:
- Very regular structure
- Use only short wires to nearest-neighbor cells
- Thus, very simple and efficient layout in VLSI
- Can be easily and efficiently pipelined

Multiply Review
Right shift and add (serial) integer multiplication
- Partial products accumulated from the top
- Only requires an n-bit adder
[Figure: multiplicand D (n bits), multiplier Q (n bits), partial product array, double precision product P (2n bits).]
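The right shift and add scheme above can be sketched in a few lines of Python. This is only an illustrative model, not the slide's hardware: the function name `shift_add_multiply` and the register names `upper` and `lower` are assumptions, and the operands are taken to be unsigned n-bit integers.

```python
def shift_add_multiply(d, q, n):
    """Unsigned n-bit multiply by right shift and add (illustrative sketch).

    `upper` models the accumulator fed by the n-bit adder (its carry out is
    kept as bit n); `lower` starts as the multiplier Q and fills up with
    product bits as multiplier bits are shifted out.
    """
    upper = 0
    lower = q
    for _ in range(n):
        if lower & 1:              # current multiplier bit selects D or 0
            upper += d             # one n-bit addition per step
        # Shift the (upper, lower) register pair right by one position.
        lower = (lower >> 1) | ((upper & 1) << (n - 1))
        upper >>= 1
    return (upper << n) | lower    # 2n-bit double precision product P


if __name__ == "__main__":
    print(shift_add_multiply(13, 11, 4))
```

Only the upper half ever passes through the adder, which is why an n-bit adder is enough for a 2n-bit product.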

Example Array Multiplier
[Figure: 4 x 4 array of cells M00 ... M33 with multiplicand bits d3 ... d0 (lsb) across the top, multiplier bits q0 ... q3 down the side, and product bits p0 ... p7.]
- Shifts come from the correct positioning of the cells
- Need O(n**2) cells (signed will need more!)
- Delay increases linearly with operand length, as shown on the Array Multiplier Delay slide
- The first row really doesn’t do any work; it just adds in zeros, so it can be reduced to AND gates
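As a rough bit-level illustration of the array above, the sketch below models each cell Mij as an AND gate feeding a full adder, with the carry rippling along each row. The names `full_adder` and `array_multiply` are hypothetical, and the wiring is one plausible reading of the figure rather than a transcription of it.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def array_multiply(d, q, n):
    """Unsigned n x n multiply using n*n AND + full-adder cells (sketch)."""
    acc = [0] * n                  # sum inputs to the current row, lsb first
    product = []                   # product bits p0, p1, ... as they retire
    for i in range(n):             # one row of cells per multiplier bit qi
        qi = (q >> i) & 1
        carry = 0
        row = []
        for j in range(n):         # cell Mij: AND gate plus full adder
            dj = (d >> j) & 1
            s, carry = full_adder(acc[j], dj & qi, carry)
            row.append(s)
        product.append(row[0])     # lowest sum bit of the row is p_i
        acc = row[1:] + [carry]    # remaining bits feed the next row, shifted
    product.extend(acc)            # last row supplies the upper product bits
    return sum(bit << k for k, bit in enumerate(product))


if __name__ == "__main__":
    print(array_multiply(13, 11, 4))
```

In the first row `acc` is all zeros, which is the "just adds in zeros" case noted above.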

Square Array Multiplier
[Figure: the array redrawn as a square, with multiplicand bits d0 ... d3, multiplier bits q0 ... q3, product bits p0 ... p7, the lsb marked, and the carry and sum signal directions labeled.]

Array Multiplier Delay
[Figure: array multiplier with inputs q0 ... q3 and outputs p0 ... p7.]
Longest delay path: 2n + n - 2 = 3n - 2
Also notice the computational wavefronts (back diagonals, in green) of signal glitching, so probably not energy efficient.

Multiplier Cell Structure
[Figure: cell built around a full adder (FA) with inputs dj, qi, sum input, and carry in; outputs sum output and carry out; delay elements labeled 1D and 2D.]
- Want to design the cells so that tsum ~ tcarry for delay balancing
- Add extra delays to the qi and dj lines to complete the delay balancing (Leap Frog multiplier), so that all inputs to a multiplier cell arrive simultaneously
- Have to treat the top row, left column, and right column as special cases

Delay Balanced FA: Identical Delays for Carry and Sum
[Figure: full adder transistor schematic, labeled “22 transistors”, with signal set-up, carry generation, and sum generation sections; inputs x, y, cin; internal signals p, !p; outputs s and !cout.]
Want balanced delays from the inputs to both the sum and carry outputs to minimize glitching. But notice that !cout is produced: does the inverter needed to form cout spoil the balance?

Pipelined Array Multiplier
[Figure: the 4 x 4 array of cells M00 ... M33 pipelined under clk; inputs d0 ... d3 and q0 ... q3, outputs p0 ... p7.]
Time between clks is the ripple add time across one row.
Is there a faster way?

Array Multiplier with Recoding
[Figure: array with CTRL blocks; inputs q0 ... q3, outputs p0 ... p7; cells M10, M21, M20, M32, M31, M30 labeled.]
- CTRL does differentiating recoding of the multiplier
- Note that we are now shifting right rather than left
- Recode to take care of a negative multiplier, and sign extend to accommodate a negative multiplicand
- 2n - 1 by n cells, with a worst-case delay path of 3n - 2? (or is it 2n - 1?)
- Can pipeline just like the previous unsigned scheme (with an increase in delay per row, since we have to wait for the worst-case timing as exhibited by the last row’s carry ripple of 2n - 1 cells)

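Multiplier Cell Structure
[Figure: cell built around a full adder (FA) with inputs dj, sum input, and carry in, plus controls Z and A/S derived from the recoded multiplier digit q’i; outputs sum output and carry out.]

Operation  Z  A/S  Result
zero       0  0    previous partial product + 0
add        1  0    previous partial product + D
subt       1  1    previous partial product - D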
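The control table above can be exercised with a short Python sketch. It assumes, as a labeled guess, that the "differentiating recoding" is the radix-2 Booth form q'i = q(i-1) - qi with q(-1) = 0; the slides do not spell this out, and the function names `recode_controls` and `recoded_multiply` are hypothetical.

```python
def recode_controls(q, n):
    """Return a list of (Z, A/S) control pairs, one per multiplier bit.

    Assumption: digit q'i = q(i-1) - qi, so +1 means add, -1 means
    subtract, and 0 means pass the previous partial product through.
    `q` is an n-bit two's complement value given as a Python int.
    """
    controls = []
    prev = 0                           # implicit bit q(-1) = 0
    for i in range(n):
        bit = (q >> i) & 1             # works for negative Python ints too
        digit = prev - bit
        z = 1 if digit != 0 else 0     # Z = 0 selects the "zero" row
        a_s = 1 if digit == -1 else 0  # A/S = 1 selects subtract
        controls.append((z, a_s))
        prev = bit
    return controls


def recoded_multiply(d, q, n):
    """Signed multiply driven by the (Z, A/S) table (illustrative sketch)."""
    acc = 0
    for i, (z, a_s) in enumerate(recode_controls(q, n)):
        if z:
            term = d << i              # Python ints sign extend implicitly
            acc = acc - term if a_s else acc + term
    return acc


if __name__ == "__main__":
    print(recode_controls(6, 4))
    print(recoded_multiply(-5, 6, 4))
```

Python integers are unbounded, so the sign extension that the hardware array must perform explicitly for a negative multiplicand comes for free here.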

CSA Array Multiplier
[Figure: rows of carry-save adder (CSA) cells M00 ... M23; inputs d0 ... d3 and q0 ... q3, outputs p0 ... p7. Each CSA cell takes dj, qi, a sum input, and a carry in, and produces a sum output and a carry out.]
- Delay is still linear, but less!
- Once again, the first row doesn’t do any real work; it just forms the first row of partial product terms (AND gates)
- The last row of cells has to propagate the carry, so those cells have a slightly different micro-architecture; in particular, they have four things to add together, so this can be done with a CSA feeding a CPA

CSA Array Multiplier
[Figure: the CSA array (cells M00, M01, M02, ...) with inputs q0 ... q3 and outputs p0 ... p7.]
Longest delay path: n + n - 1 = 2n - 1
Delay is still linear, but less! We only have to pay for the carry to ripple across the last row.
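The idea of deferring carry propagation to the last row can be modeled at word level. In this sketch (the names `csa` and `csa_multiply` are assumptions, operands are unsigned), each row is a 3:2 carry-save step with no carry ripple, and a single ordinary addition at the end stands in for the final CPA row.

```python
def csa(a, b, c):
    """Word-level carry-save adder: three operands in, (sum, carry) out.

    No carry ripples between bit positions; the carry word is simply
    shifted left by one so that sum + carry == a + b + c.
    """
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def csa_multiply(d, q, n):
    """Unsigned multiply: one CSA step per row, one CPA at the end."""
    s, c = 0, 0
    for i in range(n):
        pp = (d << i) if (q >> i) & 1 else 0   # partial product for row i
        s, c = csa(s, c, pp)                   # carries saved, not rippled
    return s + c                               # only here does a carry ripple


if __name__ == "__main__":
    print(csa_multiply(13, 11, 4))
```

Each loop iteration costs one full-adder delay regardless of operand width, which is where the shorter critical path comes from.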

Pipelined CSA Array Multiplier
[Figure: pipelined (clk) array with CSA rows M00 ... M23 and a modified last row M’30 ... M’33; inputs d3 ... d0 and q0 ... q3, outputs p0 ... p7.]
But what about the delay in the last row? It will set the rate for the clk, so this is no faster than the previous design!

Augmented Pipelined CSA Array Multiplier
[Figure: pipelined (clk) CSA array M00 ... M33 augmented with additional cells M41, M42, M43, M52, M53, M63; inputs d0 ... d3 and q0 ... q3, outputs p0 ... p7.]
Now the delay in each row is defined by one CSA time, but latency is increased for the most significant bits of the product.

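Constructing Big Mult’s from Small
Can synthesize a 2b x 2b multiplier from four b x b multipliers and a three-operand addition operation.
[Figure: operands split into halves AH, AL and BH, BL (b bits each); partial products AL BL, AL BH, AH BL, and AH BH aligned to form the 4b-bit product; labels “b bits”, “3b bits”, and “4b product”.]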
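A small Python sketch of this decomposition follows. The function name `multiply_2b`, its parameters, and the pluggable `mul` argument (standing in for a b x b hardware multiplier) are all illustrative assumptions.

```python
def multiply_2b(a, b_val, half_bits, mul=lambda x, y: x * y):
    """Multiply two 2b-bit unsigned values using four b x b products.

    `half_bits` is b; `mul` stands in for the small b x b multiplier.
    """
    mask = (1 << half_bits) - 1
    ah, al = a >> half_bits, a & mask
    bh, bl = b_val >> half_bits, b_val & mask

    hh = mul(ah, bh)
    hl = mul(ah, bl)
    lh = mul(al, bh)
    ll = mul(al, bl)

    # AH*BH and AL*BL occupy disjoint bit ranges, so they can be
    # concatenated into one operand without an adder.
    outer = (hh << (2 * half_bits)) | ll
    # Three-operand addition: outer + two cross terms shifted by b.
    return outer + (hl << half_bits) + (lh << half_bits)


if __name__ == "__main__":
    print(multiply_2b(0xAB, 0xCD, 4) == 0xAB * 0xCD)
```

Because the low b bits of AL BL pass straight through to the product, the actual addition only needs to span the upper 3b bits.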

Division Operation
Left shift and subtract (serial) fractional division
[Figure: dividend .P (2n bits), divisor .D (n bits), quotient .Q (n bits), remainder .R (n bits), and the partial remainder array (pra).]
Conditions: P < D and 1/2 ≤ D < 1

Restoring Array Divider
[Figure: array of cells in rows R11 ... R14, R22 ... R25, R33 ... R36, R44 ... R47; dividend bits p1 ... p7 (lsb), a 1 input on each row, quotient bits q1 ... q4, remainder bits r5 ... r8.]
- Layout resembles the dots in a dot diagram
- In each row the difference between the ppr and the divisor is formed (trial subtraction)
  - if the result is positive, cout = 0, so qi+1 = 1
  - if the result is negative, cout = 1, so qi+1 = 0 (restoring division)
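The row-by-row trial subtraction can be sketched in Python as follows. To keep it simple this is an integer model rather than the fractional one on the slides: the name `restoring_divide` is hypothetical, and the condition P < D is modeled as "the upper n bits of the dividend are less than the divisor" so that the quotient fits in n bits.

```python
def restoring_divide(p, d, n):
    """Restoring division of a 2n-bit dividend p by an n-bit divisor d.

    Integer sketch; requires (p >> n) < d. Returns (quotient, remainder).
    """
    if not (p >> n) < d:
        raise ValueError("quotient would not fit in n bits")
    r = p >> n                             # initial partial remainder
    q = 0
    for i in range(n - 1, -1, -1):         # one array row per quotient bit
        r = (r << 1) | ((p >> i) & 1)      # left shift, bring in next bit
        trial = r - d                      # trial subtraction
        if trial >= 0:
            r = trial                      # keep the difference, bit = 1
            q = (q << 1) | 1
        else:
            q = q << 1                     # restore (keep old r), bit = 0
    return q, r


if __name__ == "__main__":
    print(restoring_divide(200, 13, 4))
```

The `else` branch plays the role of the mux in each cell: the subtraction result is discarded and the previous partial remainder is passed on.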

Restoring Divider Cell Structure
[Figure: subtractor cell built around a full adder (FA) with inputs partial remainder input, di, and carry in; outputs carry out and partial remainder output; a mux on the output.]
The mux is used to select the previous ppr if qi+1 = 0; otherwise the output of the FA is selected.

Restoring Array Divider Delay
[Figure: the restoring array with dividend bits p1 ... p7, quotient bits q1 ... q4, and remainder bits r5 ... r8.]
Need O(n**2) cells and O(n**2) delay, since we have to wait for the ripple in each row, across all n rows.
Longest delay path: n * n = n**2

Pipelined Restoring Array Divider
[Figure: the restoring array (rows R11 ... R14 through R44 ... R47) pipelined under clk; quotient bits q1 ... q4, remainder bits r5 ... r8.]
Pipelining brings the delay down to O(n), as defined by the ripple time per row.

Nonrestoring Array Divider
[Figure: array of cells in rows R’11 ... R’14, R’22 ... R’25, R’33 ... R’36, R’44 ... R’47; dividend bits p1 ... p7, a control input of 1 at the top left, quotient bits q1’ ... q4’, remainder bits r5 ... r8.]
- Same size and about the same speed as the restoring array (still O(n**2))
- The difference in each row between the ppr and the divisor is formed if the control input (top left) is 1; note that this input wraps around and sets the carry in (a 1 means do a subtract, because qi+1 = 1)
- The carry out of the previous row becomes the control input of the next row (if carry out = 1 then subtract in that row and add in the next row, and vice versa)
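For comparison with the restoring scheme, here is a hedged Python sketch of nonrestoring division. It is again an integer model under the assumption that the upper n bits of the dividend are less than the divisor; the name `nonrestoring_divide` is hypothetical, and the final correction step is the textbook one rather than something shown on the slide.

```python
def nonrestoring_divide(p, d, n):
    """Nonrestoring division of a 2n-bit dividend by an n-bit divisor.

    Integer sketch; requires (p >> n) < d. Returns (quotient, remainder).
    """
    if not (p >> n) < d:
        raise ValueError("quotient would not fit in n bits")
    r = p >> n
    q = 0
    for i in range(n - 1, -1, -1):
        bit = (p >> i) & 1
        if r >= 0:
            r = 2 * r + bit - d        # previous row was non-negative: subtract
        else:
            r = 2 * r + bit + d        # previous row went negative: add back
        q = (q << 1) | (1 if r >= 0 else 0)
    if r < 0:
        r += d                         # single correction at the very end
    return q, r


if __name__ == "__main__":
    print(nonrestoring_divide(200, 13, 4))
```

The sign of each row's result decides whether the next row adds or subtracts, mirroring how one row's carry out becomes the next row's control input.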

R’ Divider Cell Structure
[Figure: cell built around a full adder (FA) with inputs partial remainder input, di, and carry in; outputs carry out and partial remainder output.]

Pipelined Nonrestoring Array Divider
[Figure: the nonrestoring array (rows R’11 ... R’14 through R’44 ... R’47) pipelined under clk; control input 1, quotient bits q1 ... q4, remainder bits r5 ... r8.]
