EE 382 Processor Design, Winter 98/99, Michael Flynn: AT Arithmetic

Presentation transcript:

1 AT Arithmetic
Most concern has gone into creating fast implementations of (especially) FP arithmetic. Under the AT (area-time) rule, area is (almost) as important, so it is important to know the latency, bandwidth and area that any particular algorithm requires.

2 Integer addition
Adders are the fundamental building block of the processor, defining Δt.
Adder types include:
- carry chain
- carry select (conditional sum)
- carry lookahead (Brent-Kung)
- canonic (prefix)
- carry skip
- Ling
Most high-speed 32b adders take about the same area (f normalized): 1 A to 1.5 A.

3 Integer addition
Both area and time scale with n, the adder precision.
The delay, t, scales slowly (log n).
Area scales about linearly with n, so a 64b adder takes 2-3 A but still fits into Δt, maybe by definition of a “cycle”.
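
The log n delay scaling can be illustrated with a parallel-prefix carry network. The sketch below is a bit-level software model of a Kogge-Stone style network, chosen here only because it is compact; the slides name Brent-Kung, which has the same logarithmic depth class but a different wiring. The function name `prefix_add` and its return convention are assumptions made for this example.

```python
def prefix_add(a, b, n):
    """Add two n-bit unsigned integers with a Kogge-Stone style prefix network.

    Returns (sum mod 2**n, carry_out, levels), where levels is the number of
    prefix stages, which grows as ceil(log2(n)).
    """
    # Per-bit generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    G, P = g[:], p[:]
    levels = 0
    dist = 1
    while dist < n:
        new_g, new_p = G[:], P[:]
        for i in range(dist, n):
            # Combine the group ending at bit i with the group ending at i - dist.
            new_g[i] = G[i] | (P[i] & G[i - dist])
            new_p[i] = P[i] & P[i - dist]
        G, P = new_g, new_p
        dist *= 2
        levels += 1

    # After the last stage, G[i] is the carry out of bits 0..i (carry-in is 0).
    total = 0
    for i in range(n):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, G[n - 1], levels


if __name__ == "__main__":
    print(prefix_add(0xFFFFFFFF, 1, 32))   # 32-bit add needs 5 prefix levels
    print(prefix_add(0, 0, 64)[2])         # 64-bit add needs 6 prefix levels
```

Doubling the width from 32b to 64b adds only one prefix stage in this model, while the number of per-bit cells roughly doubles, which matches the slow delay growth and roughly linear area growth described above.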

4 Carry skip adder

5 Manchester carry chain

6 Carry skip logic

7 Carry select addition
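
As a rough illustration of carry select addition, the sketch below models each block computing both candidate sums (carry-in 0 and carry-in 1) and then selecting one once the real carry arrives. The block size of 8 bits and the name `carry_select_add` are assumptions for the example, not values taken from the slides, and n is assumed to be a multiple of the block size.

```python
def carry_select_add(a, b, n=32, block=8):
    """Model of an n-bit carry select adder built from fixed-size blocks.

    Returns (sum mod 2**n, carry_out). Assumes n is a multiple of block.
    """
    mask = (1 << block) - 1
    carry = 0
    result = 0
    for lo in range(0, n, block):
        x = (a >> lo) & mask
        y = (b >> lo) & mask
        sum_if_0 = x + y        # speculative result for carry-in = 0
        sum_if_1 = x + y + 1    # speculative result for carry-in = 1
        chosen = sum_if_1 if carry else sum_if_0   # the select MUX
        result |= (chosen & mask) << lo
        carry = chosen >> block                    # carry into the next block
    return result, carry


if __name__ == "__main__":
    print(carry_select_add(0xFFFFFFFF, 1))
    print(carry_select_add(1000, 2000))
```

In hardware the two candidate sums of every block are computed in parallel, so only the chain of select MUXes sits on the carry path; the loop here is sequential only because it is a software model.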

8 FP addition
A basic FP adder has 5 steps:
- exponent difference
- pre-align
- significand add
- post-align
- round
Assuming that a full shifter has about the same complexity (delay and area) as an add, 64b FP addition takes […] A and has about 5 Δt execution time.
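
The five steps can be sketched in software under simplifying assumptions: both operands are positive and normalized, the significand is 53 bits wide (so the result can be compared with ordinary double arithmetic), three extra low-order bits are carried through the add with the last one acting as a sticky bit, and rounding is to nearest even. The names `to_toy`, `from_toy` and `fp_add` are hypothetical helpers for this example, and subtraction, zeros, infinities and subnormals are deliberately left out.

```python
import math

P = 53  # significand width in bits (assumed, so results match IEEE double)


def to_toy(x):
    """Split a positive float into (exponent, P-bit integer significand)."""
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= m < 1
    return e - P, int(m * (1 << P))      # value = significand * 2**exponent


def from_toy(t):
    """Convert (exponent, significand) back to a float."""
    exp, sig = t
    return math.ldexp(sig, exp)


def fp_add(a, b):
    """Add two positive toy FP numbers using the five steps of a basic FP adder."""
    (ea, sa), (eb, sb) = a, b

    # 1. Exponent difference (swap so that operand a has the larger exponent).
    if ea < eb:
        (ea, sa), (eb, sb) = (eb, sb), (ea, sa)
    d = ea - eb

    # 2. Pre-align: append 3 extra bits, shift the smaller operand right by d,
    #    and OR every bit shifted out into the lowest (sticky) bit.
    big = sa << 3
    small = sb << 3
    sticky = 1 if small & ((1 << d) - 1) else 0
    small = (small >> d) | sticky

    # 3. Significand add.
    s = big + small
    exp = ea

    # 4. Post-align: a carry out requires a one-bit right shift (sticky kept).
    if s >> (P + 3):
        s = (s >> 1) | (s & 1)
        exp += 1

    # 5. Round to nearest even using the three extra bits.
    sig, rem = s >> 3, s & 7
    if rem > 4 or (rem == 4 and sig & 1):
        sig += 1
        if sig >> P:          # rounding overflowed; renormalize
            sig >>= 1
            exp += 1
    return exp, sig


if __name__ == "__main__":
    for x, y in [(1.0, 1.0), (0.1, 0.2), (1.0, 2.0 ** -53), (1e16, 1.0)]:
        print(x, y, from_toy(fp_add(to_toy(x), to_toy(y))), x + y)
```

For same-sign addition the post-align step is at most a one-bit shift; the full-width renormalizing shift the slides budget for arises when subtracting nearly equal operands, which this sketch does not model.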

9 FP addition
Advanced FP adders are faster and use more area:
1) Two-path FADD creates separate paths for operands:
- a path for operands whose exponents are close in value (subtract); this is the only case where a full shift is needed to renormalize the result
- a path for the other cases, where the exponent difference is > 2 (this is the only case that uses a full shift to prealign the significands)
2) FADD with integrated rounding. Here the rounding step is eliminated by computing both the sum/difference and the result plus 1; this is done by using 2 adders (or a compound adder) and then MUXing out the final result.

10 FP adders
The two-path FP adder uses an additional significand adder and exponent adder, about 3-4 A. It reduces FADD delay by one Δt.
Integrated rounding adds another rounding adder plus MUX, another 3-4 A, while reducing delay by another Δt.

11 FP adders
Net area-time tradeoff:
- Basic: area 10 A, delay 4-5 Δt
- Two path: area 13.5 A, delay 3-4 Δt
- Integrated round (with two paths): area 17 A, delay 2-3 Δt
For pipelining, add 1 A per pipe stage and use the upper range on Δt.
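
One way to compare the three design points is to multiply area by delay. The sketch below does that with the figures from the summary; using the plain product as the figure of merit is an assumption made for illustration, and the names `FADD_DESIGNS`, `at_product` and `pipelined` are hypothetical.

```python
# (area in units of A, (min delay, max delay) in units of delta-t),
# copied from the FP adder summary.
FADD_DESIGNS = {
    "basic": (10.0, (4, 5)),
    "two path": (13.5, (3, 4)),
    "integrated round": (17.0, (2, 3)),
}


def at_product(area, delay_range):
    """Return the (best case, worst case) area x delay product."""
    lo, hi = delay_range
    return area * lo, area * hi


def pipelined(area, delay_range, stages):
    """Apply the rule of thumb: +1 A per pipe stage, upper end of the delay range."""
    return area + stages, delay_range[1]


if __name__ == "__main__":
    for name, (area, delay) in FADD_DESIGNS.items():
        print(name, at_product(area, delay))
```

With these numbers the product ranges overlap heavily (40-50, 40.5-54 and 34-51), so the faster designs buy their latency at a roughly proportional area cost.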

12 Multipliers
After add, the most important arithmetic op.
Approaches:
- encode the multiplier bits (Booth 2, Booth 3, ...)
- assimilate the partial products in one, two or n passes (iterated arrays or trees)
  - arrays (simple, double, higher level)
  - trees (Wallace, binary [4:2], ZD, ...)
- CPA to produce the product

13 Multipliers
Integer and FP multipliers usually have about the same execution time (for the same precision, n).
Booth reduces the number of pp's but adds MUXs to generate the pp's.
Most of the area, and probably the delay too, is in the pp reduction tree.
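
The pp reduction from Booth encoding can be seen in a small software model. The sketch assumes that "Booth 2" means radix-4 recoding, which retires two multiplier bits per partial product and selects each pp from {0, ±a, ±2a}; the multiplier is treated as an n-bit two's-complement value with n even. The function names are hypothetical.

```python
def booth2_digits(multiplier, n):
    """Radix-4 Booth digits (each in -2..2), least significant digit first.

    The multiplier is interpreted as an n-bit two's-complement value, n even.
    """
    digits = []
    prev = 0                                  # implicit bit to the right of bit 0
    for i in range(0, n, 2):
        b0 = (multiplier >> i) & 1
        b1 = (multiplier >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)     # overlapping 3-bit window
        prev = b1
    return digits


def booth2_multiply(a, multiplier, n):
    """Multiply using Booth 2 partial products; returns (product, pp count)."""
    # The pp selector: a MUX choosing among 0, +/-a and +/-2a.
    select = {0: 0, 1: a, 2: a << 1, -1: -a, -2: -(a << 1)}
    pps = [select[d] << (2 * i) for i, d in enumerate(booth2_digits(multiplier, n))]
    return sum(pps), len(pps)


if __name__ == "__main__":
    print(booth2_digits(7, 8))
    print(booth2_multiply(23, -27, 8))
```

An 8-bit multiplier yields 4 partial products instead of 8; the price is the selector in front of every pp, which is the added MUX cost noted above. The final `sum` stands in for the reduction tree and CPA.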

14 […]-bit Booth 2 multiply

15 […]-bit Booth 2 example

16 […]-bit Booth 2 pp selector logic

17 […]-bit Booth 3 multiply

18 5 x 5 unsigned multiplication

19 1-bit adder

20 Wallace tree

21 Wallace tree reduction

22 Multipliers
A full tree implementation of a 54b (FP type) multiplier with Booth 2 has tree height 28 and uses about 2500 CSAs (or about 50 A in the tree). Maybe a total of 10 A in MUXs plus 50 A in the tree and 3 A in the CPA: 62 A total. The fastest multiplier is, maybe, 2 Δt.
Using a 2-pass tree reduces the hardware considerably: height is 14, using about 700 CSAs or 14 A; total area = 22 A; 3-4 Δt.
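
Partial product reduction with carry-save adders can be sketched at word level. In the model below each `csa` call is a full row of 1-bit adders (a 3:2 counter per bit position), so the CSA count it reports is in rows, not in the bit-level cells that the figures above count. The grouping of rows in threes is one simple layering chosen for the example, not necessarily the wiring of the trees in the slides; inputs are assumed to be non-negative.

```python
def csa(x, y, z):
    """Word-level 3:2 carry-save adder: x + y + z == s + c."""
    s = x ^ y ^ z                                # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1       # per-bit carry, shifted up
    return s, c


def reduce_tree(pps):
    """Reduce partial products to two rows with CSA layers, then do the CPA.

    Returns (product, levels, csa_rows).
    """
    rows = list(pps)
    levels = 0
    csa_rows = 0
    while len(rows) > 2:
        nxt = []
        for i in range(0, len(rows) - 2, 3):
            s, c = csa(rows[i], rows[i + 1], rows[i + 2])
            nxt += [s, c]
            csa_rows += 1
        nxt += rows[len(rows) - len(rows) % 3:]   # leftover rows pass through
        rows = nxt
        levels += 1
    while len(rows) < 2:
        rows.append(0)
    return rows[0] + rows[1], levels, csa_rows    # final add is the CPA


if __name__ == "__main__":
    a, b = 0b10111, 0b11011                       # a 5 x 5 unsigned multiply
    pps = [(a << i) if (b >> i) & 1 else 0 for i in range(5)]
    print(reduce_tree(pps))
```

Each CSA row removes exactly one operand, so k partial products always need k - 2 rows regardless of how the tree is arranged; the arrangement only changes the number of levels, and hence the delay.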

23 Multipliers
To pipeline the multiplier we need a full tree implementation; probably 3-4 Δt. Perhaps Booth 3, followed by a full tree (h = 17) and a CPA stage. Probably area = […] A.

24 Divide
Infrequent op, but long latency can affect the IPC achieved.
Algorithms:
- SRT, 2 or 3 bit ([…] Δt), maybe 6-10 A
- NR or binomial expansion ([…] Δt); needs at least 6 A for table and control, plus use of the MPY
- Bipartite tables for small n (less than 24b)

25 Divide
SRT creates the quotient 2 or 3 bits per iteration:
- uses a divisor - partial remainder lookup table for the trial quotient, then subtracts
- the result (partial remainder) is in redundant form, so no restoration is needed; also the result is left as a sum and carry pair (no CPA needed)
- fast iteration is possible, sometimes 2 x per Δt

26 Divide
Multiply-based methods use either Newton-Raphson or a binomial series:
- if f(x) = b - 1/x, the root is at x = 1/b, and the NR iteration is x_{i+1} = x_i (2 - b x_i)
- convergence is quadratic: each iteration doubles the precision of the result
- so start with a table lookup of 1/b to 8b; then 3 iterations give a 64b result; then a x (1/b) is the quotient
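
The quadratic convergence can be checked numerically. The sketch below uses exact rational arithmetic so that only the iteration itself is being observed; a hardware unit would truncate intermediate products, which this model ignores. The divisor is assumed to be normalized to 1 <= b < 2, and rounding 1/b to 8 fractional bits stands in for the lookup table. The function names are hypothetical.

```python
from fractions import Fraction


def reciprocal_nr(b, seed_bits=8, iterations=3):
    """Approximate 1/b for 1 <= b < 2 with Newton-Raphson.

    Returns the list of iterates, starting with the table seed.
    """
    b = Fraction(b)
    scale = 1 << seed_bits
    x = Fraction(round(scale / b), scale)    # stands in for the 8b table lookup
    iterates = [x]
    for _ in range(iterations):
        x = x * (2 - b * x)                  # x_{i+1} = x_i (2 - b x_i)
        iterates.append(x)
    return iterates


def divide(a, b):
    """Quotient a / b computed as a x (1/b) using the NR reciprocal."""
    return Fraction(a) * reciprocal_nr(b)[-1]


if __name__ == "__main__":
    b = Fraction(3, 2)
    for i, x in enumerate(reciprocal_nr(b)):
        print(i, float(abs(1 - b * x)))      # relative error squares each step
```

Writing r_i = 1 - b x_i, one step gives r_{i+1} = 1 - b x_i (2 - b x_i) = r_i^2, so an 8-bit seed (|r_0| < 2^-8) reaches 2^-16, 2^-32 and 2^-64 after one, two and three iterations.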

27 Divide
Divide is not usually pipelined, except for small-n implementations. It is frequently combined with square root in the same implementation.

28 Sub word concurrency
Provides 8, 16, 32b concurrent ops within “existing” integer or FP hardware.
A 64b integer unit can do 8 x 8, or 4 x 16, or 2 x 32 ops concurrently.
Since FP units are designed to be faster, one may use them instead: 8 x 4, or 2 x 16, or 2 x 24.

29 Sub word concurrency
Usually only for add and multiply.
Implementation is straightforward for add; more complicated for multiply:
- requires reorganizing partitions of the pp tree
- affects multiply area and delay marginally (maybe 10% delay and 20% area)
The ISA must define “saturating” arithmetic.
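
A partitioned add can be modelled in software by blocking the carry at each lane boundary. The sketch below shows the 8 x 8b case inside a 64b word, once with wraparound and once with unsigned saturation, where a lane that overflows clamps to 0xFF. The masks and function names are assumptions for the example; hardware would simply cut the carry chain at the lane boundaries rather than mask the operands.

```python
LANE_HIGH = 0x8080808080808080   # top bit of each 8-bit lane
LANE_LOW = 0x7F7F7F7F7F7F7F7F    # low 7 bits of each 8-bit lane


def packed_add8(a, b):
    """Eight independent 8-bit wraparound adds inside 64-bit words."""
    # Adding only the low 7 bits of each lane cannot carry across a lane boundary.
    low = (a & LANE_LOW) + (b & LANE_LOW)
    # Fold in the top bit of each lane without letting its carry escape.
    return low ^ ((a ^ b) & LANE_HIGH)


def packed_add8_sat(a, b):
    """Eight independent 8-bit unsigned saturating adds inside 64-bit words."""
    out = 0
    for lane in range(8):
        x = (a >> (8 * lane)) & 0xFF
        y = (b >> (8 * lane)) & 0xFF
        out |= min(x + y, 0xFF) << (8 * lane)   # clamp instead of wrapping
    return out


if __name__ == "__main__":
    print(hex(packed_add8(0x01FF, 0x0101)))
    print(hex(packed_add8_sat(0x01FF, 0x0101)))
```

The two functions differ only in lanes that overflow, which is why the ISA has to state which behaviour each sub word instruction provides.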